When using `model_type="word"` as an argument to `spm.SentencePieceTrainer.train`, tokens listed in `user_defined_symbols` (for example `user_defined_symbols=["<s>", "</s>", "."]`) still appear to be encoded to the `unk_id`. The same setup works with the BPE and char model types.
Is this intended behavior for word models?