Allow Tokens to Span Multiple Terminals in CFG #683

lapp0 · 2024-01-23T18:55:42Z

What behavior of the library made you think about the improvement?

Currently, every generated token must fit inside a single terminal: it can be part of a terminal or a complete terminal, but it cannot start in one terminal and end in another.

E.g. in the gpt2 tokenizer, `{"` is a valid token. However, if `{` and `"` are separate terminals, as in a typical JSON grammar, `{` is allowed in the initial state (`CFGFSM.allowed_token_ids(0)`) but `{"` is not.
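To make the restriction concrete, here is a minimal toy sketch. It does not use the library's code: the terminal table, the `fits_in_one_terminal` helper, and the four-token vocabulary are all hypothetical, and terminals are plain literals for brevity.

```python
# Hypothetical toy terminals for a JSON-like grammar (literals only, for brevity).
TERMINALS = {"LBRACE": "{", "QUOTE": '"', "COLON": ":"}

# Assumption: in the initial state the grammar expects only an opening brace.
EXPECTED_AT_START = ["LBRACE"]


def fits_in_one_terminal(token: str, expected: list[str]) -> bool:
    """True if the token is a non-empty prefix of (or equal to) ONE expected terminal."""
    return bool(token) and any(TERMINALS[name].startswith(token) for name in expected)


# A tiny stand-in vocabulary; a real tokenizer has tens of thousands of entries.
vocab = ["{", '"', '{"', '":']

allowed = [tok for tok in vocab if fits_in_one_terminal(tok, EXPECTED_AT_START)]
print(allowed)  # ['{']
```

Here `{"` is rejected even though the text it produces is grammatical, because it crosses the boundary between the brace and quote terminals.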

This approach is not only a technically incorrect representation of the grammar, it also hurts generation quality. For example, with the arithmetic grammar from README.md and mistralai/Mistral-7B-v0.2, the most probable second token is ` +` (a space-prefixed plus). Because the space is a separate terminal, that token isn't legal, so `+` (without the space) is selected instead. In scenarios like this, spaces are seldom produced even though they are grammatically valid and preferred by the model, because the model would have to emit the space as a standalone token.
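The effect on sampling can be sketched with a toy mask-and-renormalize step. The probabilities below are made up for illustration and were not measured from any model; the terminal list and helper name are also hypothetical.

```python
# Made-up next-token probabilities, for illustration only (not measured from any model).
probs = {" +": 0.60, "+": 0.25, " ": 0.10, "*": 0.05}

# Hypothetical literal terminals; whitespace is its own terminal.
TERMINALS = [" ", "+", "*"]


def fits_in_one_terminal(token: str) -> bool:
    """True if the token is a prefix of (or equal to) a single terminal literal."""
    return bool(token) and any(lit.startswith(token) for lit in TERMINALS)


# Mask tokens that cross a terminal boundary, then renormalize what is left.
kept = {tok: p for tok, p in probs.items() if fits_in_one_terminal(tok)}
total = sum(kept.values())
masked = {tok: p / total for tok, p in kept.items()}

best_unconstrained = max(probs, key=probs.get)
best_constrained = max(masked, key=masked.get)
print(repr(best_unconstrained), repr(best_constrained))  # ' +' '+'
```

Under these assumed numbers the model's first choice, ` +`, is masked out, and the bare `+` takes most of the remaining probability mass, so the space is rarely emitted.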

How would you like it to behave?

Permit the generation of any token that complies with the grammar's production rules and is valid in the context of the preceding sequence of tokens, regardless of whether it spans multiple terminals.
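One way to picture the requested behavior is a check that lets a token consume zero or more complete terminals and then end inside, or exactly at the end of, another. The sketch below is only an illustration under simplifying assumptions: a hand-written state table (`NEXT`) stands in for a real CFG parser, terminals are literals, and all names are hypothetical rather than part of the library.

```python
# Hypothetical literal terminals for a tiny arithmetic language.
TERMINALS = {"NUM": ["1", "2"], "OP": ["+", "*"], "WS": [" "]}

# Assumed stand-in for a parser: state -> {terminal name: next state}.
NEXT = {
    "start": {"NUM": "after_num"},
    "after_num": {"WS": "want_op", "OP": "after_op"},
    "want_op": {"OP": "after_op"},
    "after_op": {"WS": "start", "NUM": "after_num"},
}


def _consume(rest: str, state: str) -> bool:
    """Can `rest` be read as complete terminals followed by a terminal prefix?"""
    if not rest:
        return True  # everything was consumed on a terminal boundary
    for name, next_state in NEXT[state].items():
        for lit in TERMINALS[name]:
            if lit.startswith(rest):
                return True  # token ends inside (or exactly completes) this terminal
            if rest.startswith(lit) and _consume(rest[len(lit):], next_state):
                return True  # token covers this terminal and continues into the next
    return False


def token_allowed(token: str, state: str) -> bool:
    """Allow any non-empty token consistent with the grammar, even across terminals."""
    return bool(token) and _consume(token, state)


print(token_allowed(" +", "after_num"))  # True: WS followed by OP
print(token_allowed(" 1", "after_num"))  # False: an operator must come first
```

With literal terminals this reduces to simple prefix checks; regex terminals would need partial matching against each terminal's automaton instead of `str.startswith`.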

This will require careful engineering and benchmarking to ensure the new trie-of-RegexFSM described at the end of section 4.2 of the outlines paper works properly.

brandonwillard (Maintainer) · 2024-02-20T04:37:22Z

Again, see the code in the parsing module.

1 reply

ekagra-ranjan May 14, 2024

Hi @brandonwillard - could you please share what you mean by checking the parsing module? Does the parsing module fix this?
