---
Hi @lapp0, Thanks for the detailed proposal. I took the time to try reading and understanding it and have a couple questions.
---
Thanks so much for reviewing and helping me make it more clear! Please let me know if you have any other questions.
> These terminals don't impact the parsing state and can be consumed at any time. They're not included in
Sorry it's not super clear yet. Here is an example and a prototype implementation. Concatenation can't discover more than two tokens at a time (and I don't think it needs to), so here's an example of it concatenating "Hap" and "py Ne":
log output:
Here is the prototype implementation:
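The prototype itself isn't reproduced here; as a rough stand-in, a toy version of the terminal-spanning check might look like the following. The terminal set, function name, and the use of the third-party `regex` library's partial matching are all my assumptions, not the actual prototype:

```python
import regex  # third-party `regex` module, which supports partial matches

# Assumed toy terminal sequence for the "Happy New Year" example.
TERMINALS = [r"Happy", r"\s+", r"New", r"Year"]

def spans_terminals(text: str, idx: int = 0, pos: int = 0) -> bool:
    """Can `text` be consumed as a run of complete terminals followed by a
    (possibly empty) prefix of the next terminal?"""
    if pos == len(text):
        return True
    if idx == len(TERMINALS):
        return False
    m = regex.match(TERMINALS[idx], text, pos=pos, partial=True)
    if m is None:
        return False
    if m.partial:  # the text ends inside this terminal: a legal prefix
        return m.end() == len(text)
    return spans_terminals(text, idx + 1, m.end())  # terminal complete, continue

# "Hap" is a prefix of the first terminal; "Hap" + "py Ne" spans three terminals.
assert spans_terminals("Hap")
assert spans_terminals("Hap" + "py Ne")
```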
---
Converting these design/approach conversations to discussions...

---
Is there any recent movement here?

---
(Rough Draft)
Seeking comments, questions, and critique for the design below.
This is a problem I came across early in contributing to Outlines, and I've been thinking about it for a while. Here's an overview of my thinking.
Problem
The goal of this design spec is to "complete" `CFGFSM`. That is to say, we want to:
- ensure Outlines is never the bottleneck for `CFGFSM` generation,
- ensure the set of valid subsequent generated tokens is exactly equal to the set of subsequent strings legal within the grammar,
- ensure `CFGFSM` is LALR(1), and
- provide a path towards ambiguous grammars.
This design spec intends to ensure two criteria are met:
1) Every sequence of tokens which can comprise a valid generation can be produced.
Fixes #573
Consider the following grammar
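(The exact grammar is assumed here; a minimal illustrative version, with two separate single-character terminals, might be:)

```
start: A B
A: "A"
B: "B"
```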
On `main`, even if "AB" is in the tokenizer's vocabulary it cannot be generated; only "A" can be generated as the next token. The problem is that, prior to this PR, Outlines constructed and filtered the tokenizer's vocabulary based on an FSM for only the immediate next terminal, which "B" isn't a part of. With this change-set, tokens can span multiple terminals, and the "AB" token can be generated.
2) The set of valid generations produced by a `CFGFSM` using grammar G matches the set of sequences deemed valid under G with Lark's `LALR_Parser` + `ContextualLexer`.
Fixes #588
Consider the following grammar
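(Again, the exact grammar is assumed; a grammar with the described behavior might look like the following, where "AAB" is simultaneously a prefix of the terminal AABC and a complete AA followed by the start of BD:)

```
start: AABC | AA BD
AABC: "AABC"
AA: "AA"
BD: "BD"
```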
On `main`, if our current generation is "AAB", then it is impossible to generate "D". The problem lies in the fact that once "AAB" is generated, Outlines' parsing logic decides there is no need to consider that "AA" is a complete terminal, because "B" is still valid within the existing FSM. With this change-set, both "AABD" and "AABC" can be generated.
Solution
(Note: I will be incorrectly referring to Lark's lexical tokens as "terminals" throughout, to avoid confusion with LLM tokens.)
Lark Behavior
To achieve lookahead and multi-terminal LLM tokens, we must first know which terminals may succeed the current terminal. We can speculate about future states by simulating the application of state transitions to the parser's automata.
Lark makes this simple:
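For example, Lark's interactive LALR parser exposes the set of terminals which may be accepted in the current parse state (a minimal sketch, using the illustrative grammar from above):

```python
from lark import Lark

grammar = r"""
start: A B
A: "A"
B: "B"
"""
parser = Lark(grammar, parser="lalr")
interactive = parser.parse_interactive("")
print(interactive.accepts())  # terminal names legal next (here {'A'})
```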
We can construct the speculative terminal tree, and only need to update the tree when we consume a new terminal. Additionally, this is easy to cache using the push-down automaton's (PDA) stack as a key.
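A sketch of that tree construction, memoized on the parser's state stack (reusing `parser` from the previous snippet; the depth bound and cache shape are my own choices, not part of the proposal):

```python
from lark import Token

def speculative_terminal_tree(ip, depth, cache):
    """Map each acceptable terminal to the tree of terminals which may
    follow it, keyed on the PDA stack so repeated configurations reuse work."""
    key = (tuple(ip.parser_state.state_stack), depth)
    if key not in cache:
        tree = {}
        if depth > 0:
            for terminal in ip.accepts() - {"$END"}:
                child = ip.copy()
                child.feed_token(Token(terminal, ""))  # value is irrelevant to the transition
                tree[terminal] = speculative_terminal_tree(child, depth - 1, cache)
        cache[key] = tree
    return cache[key]

tree = speculative_terminal_tree(parser.parse_interactive(""), depth=2, cache={})
```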
RegexFSM Behavior
Every terminal has a corresponding `RegexFSM`, and each state within a `RegexFSM` has a precomputed set of legal LLM tokens. To determine the legal set of LLM tokens beyond the bounds of the current `RegexFSM`, we need to merge `RegexFSM`s as we traverse the parser automata. This requires the implementation of two operations (a toy sketch of the first follows the list):
- concatenation: `RegexFSM() + RegexFSM()`
- union: `RegexFSM() | RegexFSM()`
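As a rough illustration of what `+` must do at the automaton level, here is a toy deterministic FSM type of my own; the real `RegexFSM` additionally carries the token annotations discussed below:

```python
from dataclasses import dataclass

@dataclass
class ToyFSM:
    transitions: dict  # state -> {symbol: next_state}
    initial: int
    finals: set

def concat(a: ToyFSM, b: ToyFSM) -> ToyFSM:
    """a + b: relabel b's states past a's, then give every final state of
    `a` the outgoing transitions of b's initial state (an epsilon link,
    pre-resolved; assumes no symbol collisions, else determinize first)."""
    states = {a.initial, *a.transitions} | {d for t in a.transitions.values() for d in t.values()}
    offset = max(states) + 1
    trans = {s: dict(t) for s, t in a.transitions.items()}
    for s, t in b.transitions.items():
        trans[s + offset] = {sym: d + offset for sym, d in t.items()}
    for f in a.finals:
        trans.setdefault(f, {}).update(
            {sym: d + offset for sym, d in b.transitions.get(b.initial, {}).items()}
        )
    finals = {f + offset for f in b.finals}
    if b.initial in b.finals:  # b accepts the empty string: a's finals stay accepting
        finals |= a.finals
    return ToyFSM(trans, a.initial, finals)
```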
New RegexFSM Construction
Currently `RegexFSM` cannot be concatenated or unioned. All of its states are labelled with the LLM tokens valid to complete or partially complete the current terminal, but it has no knowledge of LLM tokens which may be legal if concatenated. We must design an efficient index which doesn't rely on recomputing the allowed-token cache from scratch for the combined `RegexFSM`. Where `RegexFSM` currently precomputes only the per-state set of legal LLM tokens, it must be updated to also precompute the prefix and suffix structures described under "New RegexFSM Merging" below.
Formalism
The goal is to find V', a subset of V, consisting of the strings formed by concatenating exactly one element of P with one element of each Si, such that the concatenation is in the set V.
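In set-builder notation (my restatement of the sentence above, with P the prefix set and S0, ..., Sn the suffix sets):

$$V' = \{\, p \cdot s_0 \cdot s_1 \cdots s_n \mid p \in P,\ s_i \in S_i \,\} \cap V$$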
New RegexFSM Merging
The naive approach is to take the cartesian product P x S0 x S1 x ... x Sn. We can improve on this with some precomputed vectors, as follows.
Base Case: Only two RegexFSMs concatenated
- For `RegexFSM` construction, we calculate a prefix-suffix vector from the vocabulary. This is calculated once per vocabulary (see the sketch after this list).
- For `RegexFSM` construction, we calculate the legal set of suffixes for a `RegexFSM` given its prefixes.
- Additionally, we must provide the set of suffixes the `RegexFSM` can produce.
- For `RegexFSM` concatenation, we calculate the set of legal tokens given each matched suffix.
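A sketch of the first precomputation (the name `build_prefix_suffix_vector` and the exact split policy are assumptions on my part):

```python
from collections import defaultdict

def build_prefix_suffix_vector(vocabulary):
    """For every LLM token, record each split into a non-empty prefix and
    suffix, mapping (prefix, suffix) -> token IDs. Computed once per vocabulary."""
    splits = defaultdict(list)
    for token_id, token in enumerate(vocabulary):
        for i in range(1, len(token)):
            splits[(token[:i], token[i:])].append(token_id)
    return splits

# e.g. the token "AB" contributes the split ("A", "B"), so an FSM whose
# terminal ends on "A" can be matched against suffix "B" of its successor.
```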
Extended Case: Concatenating more than two RegexFSMs
We can extend `vocabulary_vector` such that all prefixes are also labelled. This doesn't increase the size of the vector. Legal token IDs can be determined by the `vocabulary_vector`'s first N IDs, and the new legal prefix set can be determined by the IDs thereafter.
Analysis
Space Complexity (Construction)
Provided an FSM with S states and a tokenizer with vocabulary size V, `RegexFSM` space complexity was originally O(S * V). Provided a word length L, the new `RegexFSM` worst-case space complexity is O(S * V * L). In practice, Mistral's tokenizer has an average word length of 5.029875. I will analyze further mitigations during implementation; real-world benchmarking will be critical as well.
Time Complexity (Construction)
Constructing the index requires traversing the vocabulary in its entirety. Construction has the same theoretical time complexity as before, but with a worse constant factor, because the entire vocabulary must be traversed.
This will be mitigated through the use of a DAFSA.
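For intuition: a trie over the vocabulary already collapses shared prefixes, so FSM-state exploration visits each distinct prefix once rather than once per token; a DAFSA additionally merges shared suffixes. A minimal trie sketch (structure assumed):

```python
def build_trie(vocabulary):
    """Nested-dict trie: each distinct prefix becomes a single node, with
    token IDs stored at the node where each token ends."""
    root = {}
    for token_id, token in enumerate(vocabulary):
        node = root
        for char in token:
            node = node.setdefault(char, {})
        node.setdefault("$ids", []).append(token_id)
    return root
```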
Time Complexity (Concatenation)
Concatenation requires multiplying two vectors whose size is the number of (prefix, suffix) pairs in the vocabulary: O(S * V * L).
In practice, for Mistral's tokenizer, generating and applying the concatenated vocabulary mask could be performed 44,191 times per second on a single core of my i5.
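A sketch of that per-step work, assuming the per-suffix token masks are stored as a boolean matrix (array names and layout are my assumptions):

```python
import numpy as np

def concat_token_mask(suffix_legal: np.ndarray, token_by_suffix: np.ndarray) -> np.ndarray:
    """suffix_legal: bool (num_suffixes,) -- suffixes the left FSM can produce.
    token_by_suffix: bool (num_suffixes, vocab_size) -- tokens legal per matched suffix.
    Returns the vocabulary mask for the concatenated FSM; equivalent to one
    boolean matrix-vector product."""
    return token_by_suffix[suffix_legal].any(axis=0)
```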
Etc
Replaces #588, #573
Bonus:
After implementation, we can make subtle changes to the speculative terminal tree to convert it into an NDPDA. In other words, this implementation provides an easy path towards enabling ambiguous context-free grammars!