Some tokenizers use a special character to represent whitespace. Byte-level BPE tokenizers, for instance, use Ġ. Whenever we match strings against tokens, we therefore need to make sure we handle this leading space properly.

This is hardly an isolated phenomenon: 66% of the tokens created by a BPE tokenizer include (leading) spaces:
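A minimal sketch of how this fraction can be measured, assuming the GPT-2 tokenizer from `transformers`; the `with_spaces` list is a hypothetical reconstruction of the variable used in the snippet below:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
vocab = tokenizer.get_vocab()

# Hypothetical reconstruction: collect every vocabulary entry whose
# string form starts with the Ġ whitespace marker.
with_spaces = [token for token in vocab if token.startswith("Ġ")]

print(len(with_spaces) / len(vocab))
# ≈ 0.66, the figure quoted above (the exact value depends on the tokenizer)
```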
By default, encoding and then decoding a string is idempotent:
```python
print(with_spaces[0])
# Ġail
print(tokenizer.decode(tokenizer.encode(with_spaces[0])))
# Ġail
print(tokenizer.decode(tokenizer.encode(" ail")))
# ail
print(tokenizer.decode(tokenizer.encode("ail")))
# ail
```
When matching text to tokens, we need to replace this character with a space. This can be done with the tokenizers' convert_tokens_to_string method:
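A minimal sketch, reusing the GPT-2 tokenizer from above (the original snippet isn't shown):

```python
# Tokenize a string that maps to a single Ġ-prefixed token, then map
# the token back to its surface form, restoring the leading space.
tokens = tokenizer.tokenize(" ail")
print(tokens)
# ['Ġail']
print(repr(tokenizer.convert_tokens_to_string(tokens)))
# ' ail'
```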
This may be useful in #161, for instance.
How this may affect generation
Here is an example where adding a space at the end of a prompt affects the sequence being generated:
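The original example isn't shown; below is a minimal sketch of the kind of comparison involved, assuming a GPT-2 model and a made-up prompt. With greedy decoding, the trailing space is the only difference between the two runs:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

for prompt in ["The weather today is", "The weather today is "]:
    inputs = tokenizer(prompt, return_tensors="pt")
    # Greedy decoding, so any divergence comes from the prompt alone
    output = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    print(repr(tokenizer.decode(output[0])))
```

Because the trailing space changes how the prompt boundary is tokenized (the continuation can no longer start with a Ġ-prefixed token), the two prompts can yield different completions.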