updated README and introduced float precision argument

FontaineRiant · Jan 12, 2024 · eff5dc4
1 parent 2fbe1d8
Showing 3 changed files with 41 additions and 29 deletions.
diff --git a/README.md b/README.md
@@ -5,14 +5,15 @@ You can write a paragraph, let the AI write the next one, you add another, etc.
 Or you can enable "choice mode" and let the AI make suggestions you can pick
 from for each paragraph.
 
-The AI writer is powered by EleitherAI's GPT-NEO model, a replication of GPT-3.
+The AI writer is powered by any LLM that can be found on [Hugging Face](https://huggingface.co/), easily swappable through a command line argument.
 The suggested model has 2.7 billion parameters
-and was fine-tuned to write light novels.
+and was fine-tuned to write fictional stories.
 
 ## Features
-* State-of-the-art Artificial Intelligence fine-tuned for the specific purpose of writing stories,
+* State-of-the-art LLMs from huggingface fine-tuned for the specific purpose of writing stories,
 * A high quality narrator AI that reads the story out loud (TTS)**,
 * Customizable voice: the narrator will sound like any voice sample of your choice,
+* Multiple speakers: the voice will change between the narrator and different characters,
 * Two modes to build a story: alternating AI-human writing or choosing from AI generated options,
 * Save, load, continue and revert functions,
 * Randomly generated or custom prompts to start new stories.
@@ -23,30 +24,33 @@ and was fine-tuned to write light novels.
 ## Local Installation
 0. (Optional) Set up CUDA 11.1 to enable hardware acceleration if your GPU can take it.
 1. Install python 3.7
-2. Set the PHONEMIZER_ESPEAK_PATH environment variable to `C:\Program Files\eSpeak NG\espeak-ng.exe` or wherever you installed it. (windows)
+2. Set the PHONEMIZER_ESPEAK_PATH environment variable to `C:\Program Files\eSpeak NG\espeak-ng.exe` or wherever you installed it. (windows only)
 3. Download or clone this repository.
 4. Run `install.ps1` (windows powershell) or `install.sh` (shell script).
-5. Download a GPT-NEO model and put its content in `./models/[model name]/`. Here's a link to [finetuneanon's light novel model](https://drive.google.com/file/d/1M1JY459RBIgLghtWDRDXlD4Z5DAjjMwg/view?usp=sharing). 
-6. Play by running `play.ps1` (windows powershell) or `play.sh` (shell script). You can also launch `main.py` directly with your own launch options (model selection, gpu/cpu).
+5. Play by running `play.ps1` (windows powershell) or `play.sh` (shell script). You can also launch `main.py` directly with your own launch options (model selection, gpu/cpu).
 
 
 ## FAQ
 
 _How can I customize the narrator's voice?_
 
-Simply drop a WAV file into `audio/voices/`, then edit `play.sh` and/or `play.ps1` to add the `--voice "<name>"` option,
-where `<name>` is the file's name (without the .wav exension). The file should be a clean and short sample of a single
-person talking. A male a and a female voice samples are already included: "librispeech-f" and "librispeech-m"
+Simply drop WAV files into the directories in `audio/voices/`. Dialogues will alternate between the character1 and character2 voices,
+and everything outside quotes will be read by the narrator voice.
+Each file should be a clean sample of a single person talking.
+Male and female voice samples are already included: "librispeech-f" and "librispeech-m"
 
 _What kind of hardware do I need?_
 
-You'll need an NVIDIA GPU with at least 8 GB of VRAM, or a lot of patience and 28 GB of RAM (with the --cpugpt flag).
-If you run into video memory issues, you can lower `max_history`
-in `./generator/generator.py` (maximum number of "words" that the AI can read before writing text).
+You'll need an NVIDIA GPU with at least 8 GB of VRAM, or a lot of patience and 28 GB of RAM (with the --cputext flag).
+With 10 GB of VRAM (RTX 3080), you can also run TTS faster by removing the --cputts flag. Feel free to try smaller and
+bigger variants of the OPT model from huggingface.
+
+The `--precision` option also lets you reduce VRAM usage by lowering float precision to 8 or 4 bits (see `--help`).
 
 _How should I write things in a way that the AI understands?_
 
-You aren't in a dialog with an AI, you're just writing parts of a story, except there's autocompletion for the next ~60 words. Trying to talk to it will just throw it off. Write as if you were the narrator. Avoid typos.
+You aren't in a dialog with an AI like ChatGPT; you're just writing parts of a story, except there's autocompletion for the next ~60 words.
+Trying to talk to it will just throw it off. Write as if you were the narrator. Avoid typos.
 
 _The AI is repeating itself, help!_
 
@@ -56,25 +60,26 @@ With this implementation, this shouldn't happen anymore. Still, here are a few t
 * You can `/revert` back to a point in the story before it started failing.
 * If all else fails, just start a new story.
 
-To make this happen less often, try not to be redundant or use the same word twice. GPT's whole schtick is to complete sequences, so it tends to latch onto a pattern whenever it sees one.
+To make this happen less often, try not to be redundant or use the same word twice. LLMs' whole schtick is to complete sequences, so they tend to latch onto a pattern whenever they see one.
 
 _Can I write in first person like in AIdungeon?_
 
-No, AIdungeon converts first person to second person before feeding your input to its model, which was trained for second person narration. Writing in first person on wrAIter will probably result in a first person response.
+No, AIdungeon converts first person to second person before feeding your input to its model, which was trained for second person narration.
+Writing in first person on wrAIter will probably result in a first person response.
 
 _Does the AI learn from my inputs?_
 
 While the AI remembers the last thousand words of the story, it doesn't learn from it. Playing or saving a story won't affect the way it plays another.
 
 _Does the AI forget parts of the story?_
 
-Yes. Because the model can only take 1024 words in the input, the oldest events can be dropped to make the story fit. However, the context of the story (first paragraph) is never "forgotten".
+Yes. Because the model can only take 2048 words in the input (number depends on the model), the oldest events can be dropped to make the story fit. However, the context of the story (first paragraph) is never "forgotten".
 
-Until you hit 1024 words, longer stories get progressively better results.
+Until you hit 2048 words, longer stories get progressively better results.
 
 _Can I fine-tune the AI on a corpus of my choice?_
 
-I didn't bother with fine-tuning GPT-NEO. The model is just too large to fit into my machine or any free cloud GPU.
+I didn't bother with fine-tuning LLMs larger than 355M parameters. The models are just too large to fit into my machine or any free cloud GPU.
 So you're on your own if you want to try.
 
 _wrAIter is a terrible name._
@@ -92,5 +97,5 @@ No. First, wrAIter doesn't adjust based on your or other players' inputs. The mo
 
 ## Credits
 * [Latitude](https://github.com/Latitude-Archives/AIDungeon) for AIDungeon that I used as a starting point,
-* [EleutherAI](https://www.eleuther.ai/projects/gpt-neo/) for GPT-NEO,
+* [Hugging Face](https://huggingface.co/) for Language Models,
 * [coqui-ai](https://github.com/coqui-ai/TTS) for the TTS models.
diff --git a/generator/generator.py b/generator/generator.py
@@ -6,20 +6,24 @@ class Generator:
     def __init__(self,
                  model_name='KoboldAI/OPT-2.7B-Nerys-v2',
                  length=80,
-                 gpu=True):
+                 gpu=True,
+                 precision=16):
         """
         :model_name='KoboldAI/OPT-2.7B-Nerys-v2' : String, which model to use from huggingface
         :length=None : Number of tokens in generated text
         :gpu: use gpu (default: True)
         """
         self.device = 'cuda' if gpu else 'cpu'
 
-        self.model = AutoModelForCausalLM.from_pretrained(
-            model_name,
-            # load_in_8bit=gpu,
-            torch_dtype=torch.float16 if gpu else torch.float32
-        )
-        self.model.to(self.device)
+        if precision == 16:
+            self.model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16 if gpu else torch.float32)
+            self.model.to(self.device)
+        elif precision == 8:
+            self.model = AutoModelForCausalLM.from_pretrained(model_name, load_in_8bit=gpu)
+        elif precision == 4:
+            self.model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=gpu)
+        else:
+            raise ValueError(f'float precision {precision} not supported')
 
         self.enc = AutoTokenizer.from_pretrained(model_name)
 

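The branching on `precision` in the `generator/generator.py` hunk can be read as a mapping from the argument to the keyword arguments passed to `from_pretrained`. Below is a minimal sketch of that mapping, not the repository's code. The helper name `loader_kwargs` is hypothetical, and the dtypes are given by name so the snippet runs without `torch` installed. Note two assumptions about the library side: the `load_in_8bit` and `load_in_4bit` options depend on the `bitsandbytes` package, and newer `transformers` releases may prefer passing a quantization config object instead.

```python
def loader_kwargs(precision=16, gpu=True):
    """Return the keyword arguments the diff passes to from_pretrained.

    Hypothetical helper for illustration only; dtype names stand in for
    torch.float16 / torch.float32 so the sketch has no dependencies.
    """
    if precision == 16:
        # Half precision on GPU, full precision on CPU.
        return {'torch_dtype': 'float16' if gpu else 'float32'}
    if precision == 8:
        # Quantization is only requested when the GPU is enabled.
        return {'load_in_8bit': gpu}
    if precision == 4:
        return {'load_in_4bit': gpu}
    raise ValueError(f'float precision {precision} not supported')


if __name__ == '__main__':
    for p in (16, 8, 4):
        print(p, loader_kwargs(p, gpu=True))
```

One consequence visible in this form: with `gpu=False`, the 8-bit and 4-bit branches pass `False`, so the model would load unquantized. That matches the help text, which says the option is only available with GPU text generation.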
diff --git a/main.py b/main.py
@@ -19,7 +19,7 @@
 
 class Game:
     def __init__(self):
-        self.gen = Generator(model_name=args.model[0], gpu=not args.cpugpt)
+        self.gen = Generator(model_name=args.model[0], gpu=not args.cputext, precision=args.precision)
         self.tts = None if args.jupyter else Dub(gpu=not args.cputts)
         self.style = style_from_dict({
             Token.Separator: '#cc5454',
@@ -342,10 +342,13 @@ def pprint(self, highlight=None):
     parser.add_argument('-t', '--cputts', action='store_true',
                         default=False,
                         help='force TTS to run on CPU')
-    parser.add_argument('-g', '--cpugpt', action='store_true',
+    parser.add_argument('-x', '--cputext', action='store_true',
                         default=False,
                         help='force text generation to run on CPU')
-    parser.add_argument("--local_rank", type=int, default=0)
+    parser.add_argument('-p', "--precision", type=int, default=16, help='float precision, only available '
+                                                                        'with GPU enabled for text generation, '
+                                                                        'possible values are 4, 8, 16 (default 16), '
+                                                                        'lower values reduce VRAM usage')
 
     args = parser.parse_args()
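As a side note on the new `--precision` argument, here is a minimal, self-contained sketch of how the two renamed and added options parse. It is an illustration, not the code in `main.py`: the `build_parser` function is hypothetical, and the `choices=(4, 8, 16)` restriction is an addition that the commit does not make (the commit validates the value later, in `Generator.__init__`).

```python
import argparse


def build_parser():
    """Build a parser with only the two options discussed here (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('-x', '--cputext', action='store_true', default=False,
                        help='force text generation to run on CPU')
    # Adjacent string literals are concatenated with no separator, so each
    # piece of the help text ends with a space.
    parser.add_argument('-p', '--precision', type=int, default=16,
                        choices=(4, 8, 16),  # assumption: reject other values at parse time
                        help='float precision, only available '
                             'with GPU enabled for text generation, '
                             'possible values are 4, 8, 16 (default 16), '
                             'lower values reduce VRAM usage')
    return parser


if __name__ == '__main__':
    args = build_parser().parse_args(['--precision', '8'])
    print(args.precision, args.cputext)  # prints "8 False"
```

With `choices`, an unsupported value such as `--precision 32` fails at the command line with a usage message instead of raising `ValueError` after the parser has finished.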