<|endoftext|> token isn't encoded correctly #140

Open
ttumiel opened this issue Jun 27, 2024 · 2 comments

Comments

@ttumiel

ttumiel commented Jun 27, 2024

import torch
from mingpt.bpe import BPETokenizer

tokenizer = BPETokenizer()
print(tokenizer("<|endoftext|>")) # tensor([[  27,   91,  437, 1659, 5239,   91,   29]])
print(tokenizer.decode(torch.tensor([50256]))) # '<|endoftext|>'
print(tokenizer(tokenizer.decode(torch.tensor([50256])))) # tensor([[  27,   91,  437, 1659, 5239,   91,   29]])
@sorgina13

I noticed that you've found that '<|endoftext|>' is encoded in two different ways. If you look at the config files, there is a difference between the number of entries in encoder.json (50,257 token ids) and the number of merge rules in vocab.bpe (50,000). Part of the explanation for this discrepancy may be that encoder.json contains additional entries for special tokens used by the BPE tokenizer. Taking "<|endoftext|>" as an example:

Subword Breakdown: The tokenizer breaks the special token "<|endoftext|>" down into a sequence of subwords it learned during BPE training. In your example this gives [27, 91, 437, 1659, 5239, 91, 29], which represent the individual characters and character groups inside the string.
Unique Index: The tokenizer also assigns a unique index (50256 in your case) to the entire special token, which allows for efficient encoding and decoding during text processing. In fact it is the last token id in encoder.json (see the sketch below).
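
For illustration, here is a minimal sketch (my own, not from the repo) of the two representations side by side. It assumes BPETokenizer's default setup downloads GPT-2's encoder.json and exposes the raw token-to-id mapping as tokenizer.encoder.encoder, which is how mingpt/bpe.py is currently laid out:

from mingpt.bpe import BPETokenizer

tokenizer = BPETokenizer()

# Byte-level BPE treats "<|endoftext|>" as ordinary text and splits it into subwords.
print(tokenizer("<|endoftext|>"))  # tensor([[  27,   91,  437, 1659, 5239,   91,   29]])

# encoder.json also stores the whole string as a single entry with its own id.
print(tokenizer.encoder.encoder["<|endoftext|>"])  # 50256
print(len(tokenizer.encoder.encoder))              # 50257, so 50256 is the last id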

There are a couple of reasons why BPE tokenizers might use this dual representation for special tokens:

Efficiency: While the subword breakdown provides some information about the special token's structure, it might not be the most efficient way to represent it during processing. Using a unique index allows for faster lookup and manipulation within the model.
Clarity: Having a separate index for the special token makes it easier to identify and handle these tokens within the encoded sequence. This can be helpful for tasks like identifying sentence boundaries or performing specific operations on special tokens.

I hope this helps.

@zhuzihan728

zhuzihan728 commented Sep 28, 2024

The GPT-2 tokenizer does not differentiate special tokens from regular tokens during encoding, as mentioned in this issue.

However, in implementations like Hugging Face's (as seen here), special tokens are treated separately when splitting text into chunks.
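
To illustrate that second point, here is a rough sketch (my own code, not from minGPT or Hugging Face) of the "split on special tokens first" idea: pull the special token out of the text before the byte-level BPE ever sees it, and map it directly to its reserved id. The 50256 id comes from encoder.json; the example output is what I would expect for GPT-2's vocabulary, not something taken from the repo:

import re
import torch
from mingpt.bpe import BPETokenizer

SPECIAL_TOKENS = {"<|endoftext|>": 50256}

def encode_with_specials(tokenizer, text):
    # Split the text on special tokens, keeping them via the capture group.
    pattern = "(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")"
    ids = []
    for chunk in re.split(pattern, text):
        if chunk in SPECIAL_TOKENS:
            ids.append(SPECIAL_TOKENS[chunk])        # special token -> its reserved id
        elif chunk:                                   # skip empty strings from the split
            ids.extend(tokenizer(chunk)[0].tolist())  # regular text -> byte-level BPE
    return torch.tensor([ids])

tokenizer = BPETokenizer()
print(encode_with_specials(tokenizer, "hello<|endoftext|>"))
# e.g. tensor([[31373, 50256]]) instead of BPE-splitting the marker itself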
