LLM backend
LLM: This is not a lightweight wrapper around your favorite RP fine-tune. Characters have state, memories, needs, and intentions, all maintained by the LLM, often through meta-analysis of its own output. The world ('Context') also has multiple evolving aspects to maintain.
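To give a rough idea of what "maintained by the LLM" means here, the sketch below is purely illustrative; the names, fields, and prompt wording are made up and do not correspond to this repo's actual code.

```python
# Illustrative sketch only: a hypothetical character record that gets updated
# by asking the LLM to analyze its own last reply. Field names are invented
# for illustration and don't match this repo's data structures.
character = {
    "name": "Samantha",
    "memories": [],        # salient events, appended over time
    "needs": ["rest"],     # current physical/emotional drives
    "intentions": [],      # short-term plans the character is pursuing
}

def meta_analysis_prompt(last_reply: str) -> str:
    # Ask the model to critique its own output and report state changes.
    return (
        f"Here is {character['name']}'s last reply:\n{last_reply}\n\n"
        "List any new memories, changed needs, or new intentions as JSON."
    )

# The JSON the model returns would be parsed and merged back into `character`,
# so state evolves turn by turn rather than living only in the chat history.
```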
I generally don't see good results with less than a 6-bit quant, even of the biggest models. I've gotten pretty good results with 8Bs like (I run these at 16 bit, but an 8-bit quant would probably work):
- dolphin-2.9-llama3-8b
- Hermes-2-Theta-Llama-3-8B
- llama-3-Smaug-8B
I usually run with:
- dolphin-2.9.1-yi-1.5-34b-exl2 (my own 8-bit quant). It's slower, of course, but still faster than you can follow without stepping (RTX 6000 Ada).
Oh, and yeah, sorry, there's no capability to follow an HF path; you have to download models yourself. I should fix that. If you need it, post an issue. Same with OpenAI/Anthropic (Claude)/Mistral: I have interfaces for all of them, but stripped them out, along with lots of other stuff, to make this standalone port.
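If you're not sure how to get a model onto disk, something like the sketch below works. This is just a minimal example using huggingface_hub; the repo ID and target directory are placeholders, not paths this project expects.

```python
# Minimal sketch: download a model snapshot to a local directory so you can
# point the server at it on disk. Repo ID and local_dir are placeholders.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="cognitivecomputations/dolphin-2.9-llama3-8b",  # pick your own repo
    local_dir="models/dolphin-2.9-llama3-8b",               # wherever you keep models
)
```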
TabbyAPI: ATWAP uses the TabbyAPI on port 5000, but doesn't work well with Tabby itself. It looks like most models derived from Llama-3 use the same chat_template, one that references, BUT DOESN'T DEFINE, 'add_generation_prompt'. That's probably the problem: TabbyAPI treats the undefined variable as True, while the hf tokenizer treats it as False. For my prompts, at least, Llama-3 works better with False (i.e., no trailing empty Assistant message). So if you want to use Tabby or Ollama, you could probably just edit the chat_template jinja in your tokenizer_config.json to define the variable as False. And if you don't understand what I just said, I probably don't either. Not my area.
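If you want to see what your model's template actually emits either way, a quick check like the one below shows the difference. This is a rough sketch using the Hugging Face tokenizer; the model path is a placeholder for whatever you have downloaded locally.

```python
# Rough sketch: compare what the chat template produces with and without the
# trailing assistant header. The model path is a placeholder.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("models/dolphin-2.9-llama3-8b")
msgs = [{"role": "user", "content": "Hello"}]

# True appends an empty assistant turn for the model to complete;
# False stops after the last message, which is what my prompts expect.
print(tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True))
print(tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False))
```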
Sorry about the custom server wrapper with no GGUF support; I'll fix that in upcoming releases. Support for llama.cpp is pretty easy too. Holler if you want it.