Some tokenizers use a special character to represent whitespace. Byte-level BPE tokenizers, for instance, use Ġ. Whenever we match strings against tokens, we therefore need to make sure we handle this leading space properly.

This is hardly an isolated phenomenon: 66% of the tokens created by a BPE tokenizer include (leading) spaces:
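A minimal sketch of how this fraction can be measured, assuming the GPT-2 tokenizer from `transformers`; the `with_spaces` list is a hypothetical reconstruction of the variable used in the snippet below:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
vocab = tokenizer.get_vocab()

# Hypothetical reconstruction: collect every vocabulary entry whose
# string form starts with the Ġ whitespace marker.
with_spaces = [token for token in vocab if token.startswith("Ġ")]

print(len(with_spaces) / len(vocab))
# ≈ 0.66, the figure quoted above (the exact value depends on the tokenizer)
```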
By default, encoding and then decoding a string is idempotent:
```python
print(with_spaces[0])
# Ġail
print(tokenizer.decode(tokenizer.encode(with_spaces[0])))
# Ġail
print(tokenizer.decode(tokenizer.encode(" ail")))
# ail
print(tokenizer.decode(tokenizer.encode("ail")))
# ail
```
When matching text to tokens, we need to replace this character with a space. This can be done with the tokenizers' convert_tokens_to_string method:
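A minimal sketch, reusing the GPT-2 tokenizer from above (the original snippet isn't shown):

```python
# Tokenize a string that maps to a single Ġ-prefixed token, then map
# the token back to its surface form, restoring the leading space.
tokens = tokenizer.tokenize(" ail")
print(tokens)
# ['Ġail']
print(repr(tokenizer.convert_tokens_to_string(tokens)))
# ' ail'
```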
This may be useful in #161, for instance.
How this may affect generation
Here is an example where adding a space at the end of a prompt affects the sequence being generated:
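The original example isn't shown; below is a minimal sketch of the kind of comparison involved, assuming a GPT-2 model and a made-up prompt. With greedy decoding, the trailing space is the only difference between the two runs:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

for prompt in ["The weather today is", "The weather today is "]:
    inputs = tokenizer(prompt, return_tensors="pt")
    # Greedy decoding, so any divergence comes from the prompt alone
    output = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    print(repr(tokenizer.decode(output[0])))
```

Because the trailing space changes how the prompt boundary is tokenized (the continuation can no longer start with a Ġ-prefixed token), the two prompts can yield different completions.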