LLM backend
LLM: This is not a lightweight wrapper on your favorite RP fine-tune. Characters have state, memories, needs, and intentions, all maintained by the LLM, often through meta-analysis of its own output. That means lots of prompts. I've tried to be robust across model variations, but in general bigger models work better. The world ('Context') also has multiple evolving aspects to maintain. On the other hand, ATWAP only needs ~3k context at the moment.
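As a rough illustration of the "meta-analysis of its own output" idea (this is not ATWAP's actual code; the `chat` helper, prompt wording, and state fields are all made up for the sketch), a character turn could be a two-pass call: one prompt to produce dialog and action, and a second prompt asking the model to read what it just wrote and update the character's state:

```python
import json

def character_turn(chat, character, scene):
    """Illustrative two-pass turn: generate, then meta-analyze the output.
    `chat(system, user)` is assumed to be any function returning the LLM's reply."""
    # Pass 1: produce the character's dialog and action for this scene.
    output = chat(
        system=f"You are {character['name']}. Stay in character.",
        user=f"Scene so far:\n{scene}\n\n"
             f"Current state:\n{json.dumps(character['state'])}\n\n"
             "Write your next dialog and action.",
    )

    # Pass 2: meta-analysis - the model reads its own output and returns
    # updated memories, needs, and intentions as JSON.
    analysis = chat(
        system="You maintain a character's internal state for a simulation.",
        user=f"Previous state:\n{json.dumps(character['state'])}\n\n"
             f"The character just said/did:\n{output}\n\n"
             "Return the updated state as JSON with keys: memories, needs, intentions.",
    )
    # Real code would need to guard against the model returning non-JSON here.
    character['state'] = json.loads(analysis)
    return output
```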
I usually run with:
- dolphin-2.9.1-yi-1.5-34b-exl2 (my own 8-bit quant) - faster than you can follow without stepping (RTX 6000 Ada). Wonderful dialog, good action mix, insightful characters.
Llama-3-70B is probably good too. I generally don't see good results with less than a 6-bit quant, even of the biggest models.
I've sometimes gotten OK results with 8Bs; if a scenario isn't going well, just kill it and start over. I run them at 16 bit, but an 8-bit quant would probably work:
- dolphin-2.9-llama3-8b (exl2, 8 bit) - yay!
- Hermes-2-Theta-Llama-3-8B (exl2, 8 bit) - runs fast and well
- llama-3-Smaug-8B
Oh, and yeah, sorry, there's no capability to follow an hf path; you have to download models yourself. I should fix that - if you need it, post an issue. Same with OpenAI/Anthropic (Claude)/Mistral: I have interfaces for all of them, but stripped them out along with lots of other stuff to make this a standalone port.
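If you haven't downloaded a model yet, huggingface_hub's `snapshot_download` is one way to do it; the repo id and local path below are just examples, substitute whatever model and directory you actually use:

```python
from huggingface_hub import snapshot_download

# Example only: pull a model to a local directory the server can load from.
snapshot_download(
    repo_id="cognitivecomputations/dolphin-2.9-llama3-8b",
    local_dir="models/dolphin-2.9-llama3-8b",
)
```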
TabbyAPI: ATWAP uses the TabbyAPI interface on port 5000, but doesn't work well with Tabby itself. Most models derived from Llama-3 seem to use the same chat_template, one that references, BUT DOESN'T DEFINE, 'add_generation_prompt'. That's probably the problem: TabbyAPI treats the undefined variable as True, while the hf tokenizer treats it as False. For my prompts, at least, Llama-3 works better with False (i.e., no trailing empty Assistant message). So if you want to use Tabby or Ollama, you could probably just edit the chat_template jinja in your model's tokenizer_config.json to define the variable as False. And if you don't understand what I just said, I probably don't either; not my area.
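A minimal sketch of that edit, if you'd rather patch the file with a script than hand-edit the jinja (the path is a placeholder, and the prepended snippet is a conservative reading of the fix: it only gives add_generation_prompt an explicit default of false when the caller leaves it undefined):

```python
import json

# Placeholder path - point this at the tokenizer_config.json of the model you serve.
path = "models/my-llama-3-model/tokenizer_config.json"

# Jinja prefix that defines add_generation_prompt as false when the caller omits it.
default = (
    "{% if add_generation_prompt is not defined %}"
    "{% set add_generation_prompt = false %}"
    "{% endif %}"
)

with open(path) as f:
    cfg = json.load(f)

# Prepend the default only once, then write the file back.
if not cfg["chat_template"].startswith(default):
    cfg["chat_template"] = default + cfg["chat_template"]

with open(path, "w") as f:
    json.dump(cfg, f, indent=2, ensure_ascii=False)
```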
Sorry about the custom server wrapper with no GGUF support; I'll fix that in an upcoming release. Support for llama.cpp is pretty easy too - holler if you want it.
LLMs I've tried:
- llama-3-8B-Instruct - OK
- Hermes-2-Theta-Llama-3-8B - OK
- llama-3-70B-Instruct-exl2 (8 bit) - OK
- dolphin-2.9-llama3-8b - OK
- llama-3-70B-Instruct-GPTQ - great dialog!
- Smaug-Mixtral-8.0bpw-exl2 - no; exception, jinja error on server
- dolphin-2.9.1-yi-1.5-34b-exl2 - OK
- command-r-plus-6.0bpw-exl2 - can't load (needs explicit VRAM allocation across GPUs)
- llama-3-Smaug-8B - OK
- Yi-1.5-34B-Chat-exl2 - OK
- mistral-7B-v0.2 - can't load, TBD