
With unigram algorithm, constant piece at end of each sentence does not become a token #1047

jogardi opened this issue Aug 29, 2024 · 0 comments

jogardi commented Aug 29, 2024

Hi, thanks for your great work on this. I noticed a subtle issue while playing with synthetic examples.

I generate synthetic data where each sentence is a random string followed by a constant piece. The BPE algorithm makes this constant piece a token as expected, but the unigram algorithm does not add it to the vocabulary.

import io

import numpy as np
import sentencepiece as spm

constant_piece = 'helloWorld'

def rand_str(n=10):
    return ''.join(
        np.random.choice(list('bcegijklmnoqruvwxyz'), n)
    )

data = [rand_str() + constant_piece for _ in range(1000)]
model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(data),
    model_writer=model,
    vocab_size=1000,
    minloglevel=5,
)
sp = spm.SentencePieceProcessor(model_proto=model.getvalue())

ex = data[20]
print([
    sp.IdToPiece(x)
    for x in sp.encode(ex, emit_unk_piece=True)
])

outputs: ['▁uy', 'vx', 'yf', 'p', 'gmn', 'he', 'llo', 'W', 'or', 'ld']
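
For comparison, here is a minimal sketch of the BPE run mentioned above (reusing data, ex, io, and spm from the snippet; the only change is model_type, which defaults to 'unigram'):

bpe_model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(data),
    model_writer=bpe_model,
    vocab_size=1000,
    model_type='bpe',  # the default is 'unigram'
    minloglevel=5,
)
sp_bpe = spm.SentencePieceProcessor(model_proto=bpe_model.getvalue())
# With BPE, the constant suffix comes out as a single piece, as described above.
print([sp_bpe.IdToPiece(x) for x in sp_bpe.encode(ex)])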

It mostly just produces arbitrary tokens. I think it gets 'he', 'llo', 'or', and 'ld' not because it noticed the repeating pattern but just by coincidentally seeing those substrings in the random strings. If I change constant_piece to '123456' (digits never appear in the random strings), then I get no tokens for the repeating pattern and only tokens for the random string: ['▁', 'gll', 'imq', 'xc', 'df', '1', '2', '3', '4', '5', '6']

This happens specifically because constant_piece is at the end. If I change the data so that constant_piece is at the beginning of each sentence, data = [constant_piece + rand_str() for _ in range(1000)], then I get the expected result ['▁123456', 'uzb', 'ek', 'hoe', 'wr'].
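
Rather than eyeballing one segmentation, the missing piece can be checked directly in the vocabulary; piece_to_id returns the unk id for out-of-vocabulary pieces. A quick sketch against the unigram model trained above:

# The constant piece is word-internal in the suffix setup, so if it had been
# learned it would appear without the '▁' meta symbol. Given the segmentation
# above, this should print True (i.e. the piece maps to unk).
print(sp.piece_to_id(constant_piece) == sp.unk_id())
# In the prefix variant the piece starts a word, so one would check
# sp.piece_to_id('▁' + constant_piece) instead.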

TL;DR:
Unexpected result under the following conditions:

  • the same string at the end of each sentence in the training data
  • using the unigram algorithm
@taku910 taku910 added the bug label Sep 22, 2024