Tune down max_seq_len #197
-
This happens because the configured sequence length is lower than the maximum input length. The max_input_len parameter in the config defines the maximum number of tokens the model will process in one forward pass; sequences longer than max_input_len are transparently split up and processed in multiple passes. I guess there should be some logic to make sure max_input_len is never larger than max_seq_len, since exceeding it only makes some buffers larger than they need to be and, in the case of the autosplit loader, causes the error you're seeing. In the meantime, setting max_input_len no higher than max_seq_len should work around it.
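For reference, a minimal sketch of that workaround, assuming the exllamav2 Python API (the ExLlamaV2Config / ExLlamaV2Cache / load_autosplit names and attributes are assumptions, not confirmed by this thread), might look like:

```python
# Hedged sketch: cap max_input_len (and the matching attention buffer) so it
# never exceeds the requested max_seq_len before building the cache and
# loading with the autosplit loader.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "TheBloke_Llama-2-70B-Chat-GPTQ"  # model directory from the question
config.prepare()

config.max_seq_len = 100                                              # desired context length
config.max_input_len = min(config.max_input_len, config.max_seq_len)  # keep chunk size <= context
config.max_attention_size = config.max_input_len ** 2                 # shrink scratch buffers to match

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # cache is allocated for max_seq_len tokens
model.load_autosplit(cache)               # loader no longer runs a 2048-token pass against a 100-token cache
```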
-
I am playing around with ExLlama in combination with the model "TheBloke_Llama-2-70B-Chat-GPTQ/". I want to configure the model with a maximum sequence length of 100 tokens.
If I load the model from my own script, I receive a RuntimeError:
start (0) + length (2048) exceeds dimension size (100).
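(The exact snippet is not preserved in the thread. A minimal sketch of a script that reproduces this error, assuming the exllamav2 Python API and the autosplit loader, might look like the following; the class and attribute names are assumptions, not taken from the original post.)

```python
# Rough reconstruction of the failing case: max_seq_len is lowered to 100
# while max_input_len keeps its default of 2048, which is what triggers the
# error when the autosplit loader runs a full-size forward pass.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "TheBloke_Llama-2-70B-Chat-GPTQ"
config.prepare()
config.max_seq_len = 100                  # cap the context at 100 tokens

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # cache sized to max_seq_len = 100
model.load_autosplit(cache)               # RuntimeError: start (0) + length (2048) exceeds dimension size (100)
```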
When I use examples/chat.py and provide -l 100 as an argument, it works like a charm and the model doesn't accept more than 100 tokens. Any idea why this doesn't work in my script and why it throws an error?