---
Hi @lapp0, Thanks for the detailed proposal. I took the time to try reading and understanding it and have a couple questions.
---
Thanks so much for reviewing and helping me make it more clear! Please let me know if you have any other questions.
> These terminals don't impact the parsing state and can be consumed at any time. They're not included in
Sorry it's not super clear yet. Here is an example and a prototype implementation. Concatenation can't discover more than two tokens at a time (and I don't think it needs to), so here's an example of it concatenating "Hap" and "py Ne":
log output:
Here is the prototype implementation:
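The prototype itself isn't reproduced here; as a rough stand-in, a toy version of the terminal-spanning check might look like the following. The terminal set, function name, and the use of the third-party `regex` library's partial matching are all my assumptions, not the actual prototype:

```python
import regex  # third-party `regex` module, which supports partial matches

# Assumed toy terminal sequence for the "Happy New Year" example.
TERMINALS = [r"Happy", r"\s+", r"New", r"Year"]

def spans_terminals(text: str, idx: int = 0, pos: int = 0) -> bool:
    """Can `text` be consumed as a run of complete terminals followed by a
    (possibly empty) prefix of the next terminal?"""
    if pos == len(text):
        return True
    if idx == len(TERMINALS):
        return False
    m = regex.match(TERMINALS[idx], text, pos=pos, partial=True)
    if m is None:
        return False
    if m.partial:  # the text ends inside this terminal: a legal prefix
        return m.end() == len(text)
    return spans_terminals(text, idx + 1, m.end())  # terminal complete, continue

# "Hap" is a prefix of the first terminal; "Hap" + "py Ne" spans three terminals.
assert spans_terminals("Hap")
assert spans_terminals("Hap" + "py Ne")
```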
---
Converting these design/approach conversations to discussions...

---
Is there any recent movement here?

---
(Rough Draft)
Seeking comments, questions, and critique for the design below.
This is a problem I came across early in contributing to Outlines, and I've been thinking about it for a while. Here's an overview of my thinking.
Problem
The goal of this design spec is to "complete" `CFGFSM`. That is to say, we want to:
- ensure Outlines is never the bottleneck for `CFGFSM` generation,
- ensure the set of valid subsequent generated tokens is exactly equal to the set of subsequent strings legal within the grammar,
- ensure `CFGFSM` is LALR(1), and
- provide a path towards ambiguous grammars.
This design spec intends to ensure two criteria are met:
1) Every sequence of tokens which can comprise a valid generation can be produced.
Fixes #573
Consider the following grammar
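(The exact grammar is assumed here; a minimal illustrative version, with two separate single-character terminals, might be:)

```
start: A B
A: "A"
B: "B"
```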
On `main`, even if "AB" is in the tokenizer's vocabulary it cannot be generated; only "A" can be generated as the next token. The problem is that, prior to this PR, Outlines constructed and filtered the tokenizer's vocabulary based on an FSM for only the immediate next terminal, which "B" isn't a part of. With this change-set, tokens can span multiple terminals, and the "AB" token can be generated.
2) The set of valid generations produced by a `CFGFSM` using grammar G matches the set of sequences deemed valid under G with Lark's `LALR_Parser` + `ContextualLexer`.
Fixes #588
Consider the following grammar
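(Again, the exact grammar is assumed; a grammar with the described behavior might look like the following, where "AAB" is simultaneously a prefix of the terminal AABC and a complete AA followed by the start of BD:)

```
start: AABC | AA BD
AABC: "AABC"
AA: "AA"
BD: "BD"
```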
On `main`, if our current generation is "AAB", then it is impossible to generate "D". The problem lies in the fact that once "AAB" is generated, Outlines' parsing logic decides there is no need to consider that "AA" is a complete terminal, because "B" is still valid within the existing FSM. With this change-set, both "AABD" and "AABC" can be generated.
Solution
(Note: I will be incorrectly referring to Lark's lexical tokens as "terminals" throughout, to avoid confusion with LLM tokens.)
Lark Behavior
To achieve lookahead and multi-terminal LLM tokens, we must first know which terminals may succeed the current terminal. We can speculate about future states by simulating the application of state transitions to the parser's automata.
Lark makes this simple:
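For example, Lark's interactive LALR parser exposes the set of terminals which may be accepted in the current parse state (a minimal sketch, using the illustrative grammar from above):

```python
from lark import Lark

grammar = r"""
start: A B
A: "A"
B: "B"
"""
parser = Lark(grammar, parser="lalr")
interactive = parser.parse_interactive("")
print(interactive.accepts())  # terminal names legal next (here {'A'})
```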
We can construct the speculative terminal tree, and only need to update the tree when we consume a new terminal. Additionally, this is easy to cache using the push-down automaton's (PDA) stack as a key.
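A sketch of that tree construction, memoized on the parser's state stack (reusing `parser` from the previous snippet; the depth bound and cache shape are my own choices, not part of the proposal):

```python
from lark import Token

def speculative_terminal_tree(ip, depth, cache):
    """Map each acceptable terminal to the tree of terminals which may
    follow it, keyed on the PDA stack so repeated configurations reuse work."""
    key = (tuple(ip.parser_state.state_stack), depth)
    if key not in cache:
        tree = {}
        if depth > 0:
            for terminal in ip.accepts() - {"$END"}:
                child = ip.copy()
                child.feed_token(Token(terminal, ""))  # value is irrelevant to the transition
                tree[terminal] = speculative_terminal_tree(child, depth - 1, cache)
        cache[key] = tree
    return cache[key]

tree = speculative_terminal_tree(parser.parse_interactive(""), depth=2, cache={})
```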
RegexFSM Behavior
Every terminal has a corresponding `RegexFSM`, and each state within a `RegexFSM` has a precomputed set of legal LLM tokens. To determine the legal set of LLM tokens beyond the bounds of the current `RegexFSM`, we need to merge `RegexFSM`s as we traverse the parser automata. This requires the implementation of two operations (a toy sketch of the first follows the list):
- concatenation: `RegexFSM() + RegexFSM()`
- union: `RegexFSM() | RegexFSM()`
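As a rough illustration of what `+` must do at the automaton level, here is a toy deterministic FSM type of my own; the real `RegexFSM` additionally carries the token annotations discussed below:

```python
from dataclasses import dataclass

@dataclass
class ToyFSM:
    transitions: dict  # state -> {symbol: next_state}
    initial: int
    finals: set

def concat(a: ToyFSM, b: ToyFSM) -> ToyFSM:
    """a + b: relabel b's states past a's, then give every final state of
    `a` the outgoing transitions of b's initial state (an epsilon link,
    pre-resolved; assumes no symbol collisions, else determinize first)."""
    states = {a.initial, *a.transitions} | {d for t in a.transitions.values() for d in t.values()}
    offset = max(states) + 1
    trans = {s: dict(t) for s, t in a.transitions.items()}
    for s, t in b.transitions.items():
        trans[s + offset] = {sym: d + offset for sym, d in t.items()}
    for f in a.finals:
        trans.setdefault(f, {}).update(
            {sym: d + offset for sym, d in b.transitions.get(b.initial, {}).items()}
        )
    finals = {f + offset for f in b.finals}
    if b.initial in b.finals:  # b accepts the empty string: a's finals stay accepting
        finals |= a.finals
    return ToyFSM(trans, a.initial, finals)
```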
New RegexFSM Construction
Currently `RegexFSM` cannot be concatenated or unioned. All of its states are labelled with the LLM tokens valid to complete or partially complete the current terminal, but it has no knowledge of LLM tokens which may be legal if concatenated. We must design an efficient index which doesn't rely on recomputing the allowed-token cache from scratch for the combined `RegexFSM`. Where `RegexFSM` currently precomputes only the per-state set of legal LLM tokens, it must be updated to also precompute the prefix and suffix structures described under "New RegexFSM Merging" below.
Formalism
The goal is to find V', a subset of V, consisting of the strings formed by concatenating exactly one element of P with one element of each Si, such that the concatenation is in the set V.
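In set-builder notation (my restatement of the sentence above, with P the prefix set and S0, ..., Sn the suffix sets):

$$V' = \{\, p \cdot s_0 \cdot s_1 \cdots s_n \mid p \in P,\ s_i \in S_i \,\} \cap V$$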
New RegexFSM Merging
The naive approach is to take the cartesian product P x S0 x S1 x ... x Sn. We can improve on this with some precomputed vectors, as follows.
Base Case: Only two RegexFSMs concatenated
- For `RegexFSM` construction, we calculate a prefix-suffix vector from the vocabulary. This is calculated once per vocabulary (see the sketch after this list).
- For `RegexFSM` construction, we calculate the legal set of suffixes for a `RegexFSM` given its prefixes.
- Additionally, we must provide the set of suffixes the `RegexFSM` can produce.
- For `RegexFSM` concatenation, we calculate the set of legal tokens given each matched suffix.
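A sketch of the first precomputation (the name `build_prefix_suffix_vector` and the exact split policy are assumptions on my part):

```python
from collections import defaultdict

def build_prefix_suffix_vector(vocabulary):
    """For every LLM token, record each split into a non-empty prefix and
    suffix, mapping (prefix, suffix) -> token IDs. Computed once per vocabulary."""
    splits = defaultdict(list)
    for token_id, token in enumerate(vocabulary):
        for i in range(1, len(token)):
            splits[(token[:i], token[i:])].append(token_id)
    return splits

# e.g. the token "AB" contributes the split ("A", "B"), so an FSM whose
# terminal ends on "A" can be matched against suffix "B" of its successor.
```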
Extended Case: Concatenating more than two RegexFSMs
We can extend `vocabulary_vector` such that all prefixes are also labelled. This doesn't increase the size of the vector. Legal token IDs can be determined by the `vocabulary_vector`'s first N IDs, and the new legal prefix set can be determined by the IDs thereafter.
Analysis
Space Complexity (Construction)
Provided an FSM with S states and a tokenizer with vocabulary size V, `RegexFSM` space complexity was originally O(S * V). Provided a word length L, the new `RegexFSM` worst-case space complexity is O(S * V * L). In practice, Mistral's tokenizer has an average word length of 5.029875. I will analyze further mitigations during implementation; real-world benchmarking will be critical as well.
Time Complexity (Construction)
Constructing the index requires traversing the vocabulary in its entirety. Construction has the same theoretical time complexity as before, but with a worse constant factor, because the entire vocabulary must be traversed.
This will be mitigated through the use of a DAFSA.
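For intuition: a trie over the vocabulary already collapses shared prefixes, so FSM-state exploration visits each distinct prefix once rather than once per token; a DAFSA additionally merges shared suffixes. A minimal trie sketch (structure assumed):

```python
def build_trie(vocabulary):
    """Nested-dict trie: each distinct prefix becomes a single node, with
    token IDs stored at the node where each token ends."""
    root = {}
    for token_id, token in enumerate(vocabulary):
        node = root
        for char in token:
            node = node.setdefault(char, {})
        node.setdefault("$ids", []).append(token_id)
    return root
```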
Time Complexity (Concatenation)
Concatenation requires multiplying two vectors whose size is the number of (prefix, suffix) pairs in the vocabulary: O(S * V * L).
In practice, for Mistral's tokenizer, generating and applying the concatenated vocabulary mask could be performed 44,191 times per second on a single core of my i5.
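A sketch of that per-step work, assuming the per-suffix token masks are stored as a boolean matrix (array names and layout are my assumptions):

```python
import numpy as np

def concat_token_mask(suffix_legal: np.ndarray, token_by_suffix: np.ndarray) -> np.ndarray:
    """suffix_legal: bool (num_suffixes,) -- suffixes the left FSM can produce.
    token_by_suffix: bool (num_suffixes, vocab_size) -- tokens legal per matched suffix.
    Returns the vocabulary mask for the concatenated FSM; equivalent to one
    boolean matrix-vector product."""
    return token_by_suffix[suffix_legal].any(axis=0)
```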
Etc
Replaces #588, #573
Bonus:
After implementation, we can make subtle changes to the speculative terminal tree to convert it into an NDPDA. In other words, this implementation provides an easy path towards enabling ambiguous context-free grammars!