Tune down max_seq_len #197
-
This happens because the configured sequence length is lower than the maximum input length. The max_input_len parameter in the config defines the maximum number of tokens the model will process in one forward pass; sequences longer than max_input_len are transparently split up and processed in multiple passes. I guess there should be some logic to make sure max_input_len is never larger than max_seq_len, since exceeding it only makes some buffers larger than they need to be and, in the case of the autosplit loader, causes the error you're seeing. In the meantime, setting max_input_len no higher than max_seq_len should work around it.
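For reference, a minimal sketch of that workaround, assuming the exllamav2 Python API (the ExLlamaV2Config / ExLlamaV2Cache / load_autosplit names and attributes are assumptions, not confirmed by this thread), might look like:

```python
# Hedged sketch: cap max_input_len (and the matching attention buffer) so it
# never exceeds the requested max_seq_len before building the cache and
# loading with the autosplit loader.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "TheBloke_Llama-2-70B-Chat-GPTQ"  # model directory from the question
config.prepare()

config.max_seq_len = 100                                              # desired context length
config.max_input_len = min(config.max_input_len, config.max_seq_len)  # keep chunk size <= context
config.max_attention_size = config.max_input_len ** 2                 # shrink scratch buffers to match

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # cache is allocated for max_seq_len tokens
model.load_autosplit(cache)               # loader no longer runs a 2048-token pass against a 100-token cache
```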
-
I am playing around with ExLlama in combination with the model "TheBloke_Llama-2-70B-Chat-GPTQ/". I want to configure the model with a maximum sequence length of 100 tokens.
If I load the model from my own script, I receive a RuntimeError:
start (0) + length (2048) exceeds dimension size (100).
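(The exact snippet is not preserved in the thread. A minimal sketch of a script that reproduces this error, assuming the exllamav2 Python API and the autosplit loader, might look like the following; the class and attribute names are assumptions, not taken from the original post.)

```python
# Rough reconstruction of the failing case: max_seq_len is lowered to 100
# while max_input_len keeps its default of 2048, which is what triggers the
# error when the autosplit loader runs a full-size forward pass.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "TheBloke_Llama-2-70B-Chat-GPTQ"
config.prepare()
config.max_seq_len = 100                  # cap the context at 100 tokens

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # cache sized to max_seq_len = 100
model.load_autosplit(cache)               # RuntimeError: start (0) + length (2048) exceeds dimension size (100)
```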
When I use examples/chat.py and provide -l 100 as an argument, it works like a charm and the model doesn't accept more than 100 tokens. Any idea why this doesn't work in my script and why it throws an error?