<|endoftext|> token isn't encoded correctly #140
I noticed that you've found that '<|endoftext|>' is encoded in two different ways. If you look in the config files, there is a difference between the number of entries in "encoder.json" (50,257 entries, indices 0-50256) and the number of merge rules in "vocab.bpe" (roughly 50,000). Part of the explanation for this discrepancy may be that "encoder.json" contains additional entries for special tokens used by the BPE tokenizer.

Taking "<|endoftext|>" as an example: when it is treated as plain text, the tokenizer breaks the string down into a sequence of subwords learned during BPE training. In your example this yields [27, 91, 437, 1659, 5239, 91, 29], which correspond to the pieces "<", "|", "end", "of", "text", "|", ">".

One reason BPE tokenizers keep this dual representation for special tokens is efficiency: the subword breakdown preserves the token's surface structure, but a single reserved index (50256 for "<|endoftext|>") allows faster lookup and unambiguous handling within the model.

I hope this helps.
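To make the dual representation concrete, here is a minimal sketch using OpenAI's tiktoken library, which ships the same GPT-2 BPE vocabulary (using tiktoken here is my assumption; this repo may use its own encoder implementation):

```python
# Sketch using tiktoken (an assumption -- the original repo may use
# its own encoder); it loads the same GPT-2 BPE vocabulary.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# Treated as plain text, the special token is split into learned subwords.
as_text = enc.encode("<|endoftext|>", disallowed_special=())
print(as_text)                              # [27, 91, 437, 1659, 5239, 91, 29]
print([enc.decode([t]) for t in as_text])   # ['<', '|', 'end', 'of', 'text', '|', '>']

# Treated as a special token, it maps to a single reserved index.
as_special = enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})
print(as_special)                           # [50256]
```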
The GPT-2 tokenizer does not differentiate special tokens from regular tokens during encoding, as mentioned in this issue. However, in implementations like Hugging Face's (as seen here), special tokens are treated separately when splitting text into chunks.
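A short sketch of the Hugging Face behavior described above (this assumes the `transformers` package, which is not part of the original repo):

```python
# Sketch of Hugging Face's handling: text is split around registered
# special tokens before BPE runs, so "<|endoftext|>" survives intact.
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")

# The special token comes back as the single reserved id 50256,
# not as the seven-subword breakdown shown earlier.
print(tok.encode("<|endoftext|>"))          # [50256]

# Embedded in surrounding text, 50256 still appears as one id in the
# middle; only the text on either side goes through BPE merging.
print(tok.encode("hello<|endoftext|>hi"))   # [..., 50256, ...]
```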