-
This is quite a topic. A more generic approach would probably be to handle this at the tokenizer level, for which I see two solutions:
For the latter, I have been trying a few things with the help of Claude, but maintaining it long term might still prove difficult, especially with the various platforms' specificities. If neither is deemed good enough, we could:
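As an aside, here is roughly what the tokenizer-level route could look like, sketched with the HF `tokenizers` library (the file path and token list below are examples, not eole's actual code): tokens registered as special are matched before the subword model runs, so they always survive intact.

```python
# Sketch of a tokenizer-level fix using the HF `tokenizers` library
# (illustrative only; the file path and token list are examples).
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
# Tokens registered as special are extracted before the BPE model runs,
# so they always come out as single, intact pieces.
tok.add_special_tokens(["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"])
print(tok.encode("<|start_header_id|>user<|end_header_id|>").tokens)
```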
-
I had a similar situation with Llama 2 modified for TowerInstruct. I think we need a solution that covers all cases, and the best I can see is this: for each model we define a map_table (like the sketch below). In the tokenize transform we perform a simple sentence.replace by looping over this table at the beginning of tokenize_string, and before returning "segmented" we reverse the replace in the tokens list. This preserves these tokens without touching the vocab, which remains as in the HF model.
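A minimal sketch of this approach, assuming tokenize_string wraps a base tokenizer that returns a list of subword tokens (the placeholder strings and function names are illustrative, not eole's actual API):

```python
# Illustrative map_table: each special token maps to a placeholder that the
# underlying tokenizer is assumed to keep as a single piece (hypothetical).
map_table = {
    "<|start_header_id|>": "｟start_header_id｠",
    "<|end_header_id|>": "｟end_header_id｠",
    "<|eot_id|>": "｟eot_id｠",
}

def tokenize_string(sentence, base_tokenize):
    # Forward replace: hide the special tokens behind placeholders.
    for special, placeholder in map_table.items():
        sentence = sentence.replace(special, placeholder)
    segmented = base_tokenize(sentence)  # list of subword tokens
    # Reverse replace: restore the special tokens in the token list.
    reverse = {placeholder: special for special, placeholder in map_table.items()}
    return [reverse.get(token, token) for token in segmented]
```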
-
Following up on #102 and #103 here to keep this topic in one place. On one hand, I am not a super fan of this explicit approach. That being said, let's try and make it work. When iterating on the draft inference server (#42), I tested some "embedded inference config" tricks which could probably be useful here, if we want to automatically populate our configuration (see the sketch below).
Note: we might want to rename this. We still might need some conditions on the tokens' formatting though, if we want to replace them automatically. In any case, we need to converge on a v0 of the inference server, so we can probably dig up #42 and try such logic there.
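For instance, the special tokens could be read straight from the HF tokenizer metadata rather than maintained by hand. A sketch only (the model ID is an example, and this is not eole's actual loading code):

```python
# Sketch: derive the special-token list from the HF tokenizer itself,
# so nothing has to be hard-coded per model (model ID is an example).
from transformers import AutoTokenizer

hf_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
special_tokens = sorted(set(hf_tok.all_special_tokens))
print(special_tokens)  # e.g. ['<|begin_of_text|>', '<|eot_id|>', ...]
```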
-
Not a super fan of an additional inference.json file. Maybe we can include that in the config.json file, e.g. as sketched below.
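Something like this, for example (the keys and values are hypothetical, not eole's actual schema):

```python
# Sketch: fold the inference settings into the existing config.json
# instead of shipping a separate inference.json (hypothetical keys).
import json

with open("config.json") as f:
    config = json.load(f)

config["inference"] = {
    "special_tokens": ["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```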
-
Quick question: are these special tokens a way to translate texts with XML tags using an LLM? I have been using the attention to restore XML tags when working with OpenNMT-py seq2seq models, and it worked great. Now I am trying to find a way to preserve the XML tags when translating with Llama 3.1. Would the special tokens help there? Best, Kai
-
Valentin and I noticed that eole's tokenizer wasn't correctly preserving most of Llama 3's special tokens, such as <|start_header_id|>, <|end_header_id|> or <|eot_id|>. Valentin found a solution that I've pushed here:
l-k-11235#1
It works well, but it's not robust. How can we find a better way to handle special tokens?
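For reference, a quick way to check the expected behavior against the HF tokenizer (a sketch; the checkpoint is gated, so it assumes you have access): each special token should round-trip as a single ID.

```python
# Sketch: sanity-check that each special token is kept as one piece,
# using the HF tokenizer as the reference implementation.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
for special in ["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"]:
    ids = tok.encode(special, add_special_tokens=False)
    assert len(ids) == 1, f"{special} was split into {len(ids)} pieces: {ids}"
```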