-
This is quite a topic. A more generic approach would probably be to handle this at the tokenizer level, for which I see two solutions:
For the latter, I have been trying a few things with the help of Claude, but maintaining it long term might still prove difficult, especially with the various platforms' specificities. If neither is deemed good enough, we could:
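As an aside, here is roughly what the tokenizer-level route could look like, sketched with the HF `tokenizers` library (the file path and token list below are examples, not eole's actual code): tokens registered as special are matched before the subword model runs, so they always survive intact.

```python
# Sketch of a tokenizer-level fix using the HF `tokenizers` library
# (illustrative only; the file path and token list are examples).
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
# Tokens registered as special are extracted before the BPE model runs,
# so they always come out as single, intact pieces.
tok.add_special_tokens(["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"])
print(tok.encode("<|start_header_id|>user<|end_header_id|>").tokens)
```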
-
I had a similar situation with Llama 2 modified for TowerInstruct. I think we need a solution that covers all cases, and the best I can see is this: for each model we define a map_table (like the sketch below). In the tokenize transform we perform a simple sentence.replace by looping over this table at the beginning of tokenize_string, and before returning "segmented" we reverse the replace in the tokens list. This preserves these tokens without touching the vocab, which remains as in the HF model.
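A minimal sketch of this approach, assuming tokenize_string wraps a base tokenizer that returns a list of subword tokens (the placeholder strings and function names are illustrative, not eole's actual API):

```python
# Illustrative map_table: each special token maps to a placeholder that the
# underlying tokenizer is assumed to keep as a single piece (hypothetical).
map_table = {
    "<|start_header_id|>": "｟start_header_id｠",
    "<|end_header_id|>": "｟end_header_id｠",
    "<|eot_id|>": "｟eot_id｠",
}

def tokenize_string(sentence, base_tokenize):
    # Forward replace: hide the special tokens behind placeholders.
    for special, placeholder in map_table.items():
        sentence = sentence.replace(special, placeholder)
    segmented = base_tokenize(sentence)  # list of subword tokens
    # Reverse replace: restore the special tokens in the token list.
    reverse = {placeholder: special for special, placeholder in map_table.items()}
    return [reverse.get(token, token) for token in segmented]
```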
-
Following up on #102 and #103 here to keep this topic in one place. On one hand, I am not a super fan of this explicit approach. That being said, let's try and make it work. When iterating on the draft inference server (#42), I tested some "embedded inference config" tricks which could probably be useful here, if we want to automatically populate our configuration (see the sketch below).
Note: we might want to rename this. We still might need some conditions on the tokens' formatting though, if we want to replace them automatically. In any case, we need to converge on a v0 of the inference server, so we can probably dig up #42 and try such logic there.
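For instance, the special tokens could be read straight from the HF tokenizer metadata rather than maintained by hand. A sketch only (the model ID is an example, and this is not eole's actual loading code):

```python
# Sketch: derive the special-token list from the HF tokenizer itself,
# so nothing has to be hard-coded per model (model ID is an example).
from transformers import AutoTokenizer

hf_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
special_tokens = sorted(set(hf_tok.all_special_tokens))
print(special_tokens)  # e.g. ['<|begin_of_text|>', '<|eot_id|>', ...]
```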
-
Not a super fan of an additional inference.json file. Maybe we can include that in the config.json file, e.g. as sketched below.
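Something like this, for example (the keys and values are hypothetical, not eole's actual schema):

```python
# Sketch: fold the inference settings into the existing config.json
# instead of shipping a separate inference.json (hypothetical keys).
import json

with open("config.json") as f:
    config = json.load(f)

config["inference"] = {
    "special_tokens": ["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```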
-
Quick question: are these special tokens a way to translate texts with XML tags using an LLM? I have been using the attention to restore XML tags when working with OpenNMT-py seq2seq models, and it worked great. Now I am trying to find a way to preserve the XML tags when translating with Llama 3.1. Would the special tokens help there? Best, Kai
-
Valentin and I noticed that eole's tokenizer wasn't correctly preserving most of Llama 3's special tokens, such as <|start_header_id|>, <|end_header_id|> or <|eot_id|>. Valentin found a solution that I've pushed here:
l-k-11235#1
It works well, but it's not robust. How can we find a better way to handle special tokens?
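For reference, a quick way to check the expected behavior against the HF tokenizer (a sketch; the checkpoint is gated, so it assumes you have access): each special token should round-trip as a single ID.

```python
# Sketch: sanity-check that each special token is kept as one piece,
# using the HF tokenizer as the reference implementation.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
for special in ["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"]:
    ids = tok.encode(special, add_special_tokens=False)
    assert len(ids) == 1, f"{special} was split into {len(ids)} pieces: {ids}"
```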