<|endoftext|> token isn't encoded correctly #140

Open
ttumiel opened this issue Jun 27, 2024 · 2 comments

Comments

@ttumiel

ttumiel commented Jun 27, 2024

import torch
from mingpt.bpe import BPETokenizer

tokenizer = BPETokenizer()
print(tokenizer("<|endoftext|>")) # tensor([[  27,   91,  437, 1659, 5239,   91,   29]])
print(tokenizer.decode(torch.tensor([50256]))) # '<|endoftext|>'
print(tokenizer(tokenizer.decode(torch.tensor([50256])))) # tensor([[  27,   91,  437, 1659, 5239,   91,   29]])
@sorgina13

I noticed that you've found that '<|endoftext|>' is encoded in two different ways. If you look at the config files, there is a difference between the number of entries in encoder.json (50,257 token ids) and the number of merge rules in vocab.bpe (50,000). Part of the explanation for this discrepancy may be that encoder.json contains additional entries for special tokens used by the BPE tokenizer. Taking "<|endoftext|>" as an example:

Subword Breakdown: The tokenizer breaks the special token "<|endoftext|>" down into a sequence of subwords it learned during BPE training. In your example this gives [27, 91, 437, 1659, 5239, 91, 29], which represent the individual characters and character groups inside the string.
Unique Index: The tokenizer also assigns a unique index (50256 in your case) to the entire special token, which allows for efficient encoding and decoding during text processing. In fact it is the last token id in encoder.json (see the sketch below).
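
For illustration, here is a minimal sketch (my own, not from the repo) of the two representations side by side. It assumes BPETokenizer's default setup downloads GPT-2's encoder.json and exposes the raw token-to-id mapping as tokenizer.encoder.encoder, which is how mingpt/bpe.py is currently laid out:

from mingpt.bpe import BPETokenizer

tokenizer = BPETokenizer()

# Byte-level BPE treats "<|endoftext|>" as ordinary text and splits it into subwords.
print(tokenizer("<|endoftext|>"))  # tensor([[  27,   91,  437, 1659, 5239,   91,   29]])

# encoder.json also stores the whole string as a single entry with its own id.
print(tokenizer.encoder.encoder["<|endoftext|>"])  # 50256
print(len(tokenizer.encoder.encoder))              # 50257, so 50256 is the last id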

There are a couple of reasons why BPE tokenizers might use this dual representation for special tokens:

Efficiency: While the subword breakdown provides some information about the special token's structure, it might not be the most efficient way to represent it during processing. Using a unique index allows for faster lookup and manipulation within the model.
Clarity: Having a separate index for the special token makes it easier to identify and handle these tokens within the encoded sequence. This can be helpful for tasks like identifying sentence boundaries or performing specific operations on special tokens.

I hope this helps.

@zhuzihan728

zhuzihan728 commented Sep 28, 2024

The GPT-2 tokenizer does not differentiate special tokens from regular tokens during encoding, as mentioned in this issue.

However, in implementations like Hugging Face's (as seen here), special tokens are treated separately when splitting text into chunks.
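
To illustrate that second point, here is a rough sketch (my own code, not from minGPT or Hugging Face) of the "split on special tokens first" idea: pull the special token out of the text before the byte-level BPE ever sees it, and map it directly to its reserved id. The 50256 id comes from encoder.json; the example output is what I would expect for GPT-2's vocabulary, not something taken from the repo:

import re
import torch
from mingpt.bpe import BPETokenizer

SPECIAL_TOKENS = {"<|endoftext|>": 50256}

def encode_with_specials(tokenizer, text):
    # Split the text on special tokens, keeping them via the capture group.
    pattern = "(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")"
    ids = []
    for chunk in re.split(pattern, text):
        if chunk in SPECIAL_TOKENS:
            ids.append(SPECIAL_TOKENS[chunk])        # special token -> its reserved id
        elif chunk:                                   # skip empty strings from the split
            ids.extend(tokenizer(chunk)[0].tolist())  # regular text -> byte-level BPE
    return torch.tensor([ids])

tokenizer = BPETokenizer()
print(encode_with_specials(tokenizer, "hello<|endoftext|>"))
# e.g. tensor([[31373, 50256]]) instead of BPE-splitting the marker itself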
