
With unigram algorithm, constant piece at end of each sentence does not become a token #1047

jogardi opened this issue Aug 29, 2024 · 0 comments

jogardi commented Aug 29, 2024

Hi, thanks for your great work on this. I noticed a subtle issue while playing with synthetic examples.

I generate synthetic data where each sentence is a random string followed by a constant piece. The BPE algorithm makes this constant piece a token as expected, but the unigram algorithm does not add it to the vocabulary.

import io

import numpy as np
import sentencepiece as spm

constant_piece = 'helloWorld'

def rand_str(n=10):
    return ''.join(
        np.random.choice(list('bcegijklmnoqruvwxyz'), n)
    )

data = [rand_str() + constant_piece for _ in range(1000)]
model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(data),
    model_writer=model,
    vocab_size=1000,
    minloglevel=5,
)
sp = spm.SentencePieceProcessor(model_proto=model.getvalue())

ex = data[20]
print([
    sp.IdToPiece(x)
    for x in sp.encode(ex, emit_unk_piece=True)
])

outputs: ['▁uy', 'vx', 'yf', 'p', 'gmn', 'he', 'llo', 'W', 'or', 'ld']
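
For comparison, here is a minimal sketch of the BPE run mentioned above (reusing data, ex, io, and spm from the snippet; the only change is model_type, which defaults to 'unigram'):

bpe_model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(data),
    model_writer=bpe_model,
    vocab_size=1000,
    model_type='bpe',  # the default is 'unigram'
    minloglevel=5,
)
sp_bpe = spm.SentencePieceProcessor(model_proto=bpe_model.getvalue())
# With BPE, the constant suffix comes out as a single piece, as described above.
print([sp_bpe.IdToPiece(x) for x in sp_bpe.encode(ex)])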

It mostly just produces arbitrary tokens. I think it gets 'he', 'llo', 'or', and 'ld' not because it noticed the repeating pattern but just by coincidentally seeing those substrings in the random strings. If I change constant_piece to '123456' (digits never appear in the random strings), then I get no tokens for the repeating pattern and only tokens for the random string: ['▁', 'gll', 'imq', 'xc', 'df', '1', '2', '3', '4', '5', '6']

This happens specifically because constant_piece is at the end. If I change the data so that constant_piece is at the beginning of each sentence, data = [constant_piece + rand_str() for _ in range(1000)], then I get the expected result ['▁123456', 'uzb', 'ek', 'hoe', 'wr'].
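
Rather than eyeballing one segmentation, the missing piece can be checked directly in the vocabulary; piece_to_id returns the unk id for out-of-vocabulary pieces. A quick sketch against the unigram model trained above:

# The constant piece is word-internal in the suffix setup, so if it had been
# learned it would appear without the '▁' meta symbol. Given the segmentation
# above, this should print True (i.e. the piece maps to unk).
print(sp.piece_to_id(constant_piece) == sp.unk_id())
# In the prefix variant the piece starts a word, so one would check
# sp.piece_to_id('▁' + constant_piece) instead.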

TL;DR:
Unexpected result under the following conditions:

  • the same string at the end of each sentence in the training data
  • using the unigram algorithm
@taku910 taku910 added the bug label Sep 22, 2024