Explanations and visualisations:
- Jurafsky-Martin 2.3
- Andrej Karpathy, Let's build the GPT Tokenizer YouTube video
- Hugging Face Tokenizers library
- Tiktokenizer
Source: Khan Academy
- Text is segmented into tokens (analogous to frames in speech processing and pixels in image processing)
- How should we split texts into tokens?
- Word as a token: too naive, leads to an excessively large vocabulary
- fast and faster are treated as being as distinct as fast and water
- what is a word?
- much less clear in languages other than English
- Follows from Zipf's law: most words are rare (see the frequency-count sketch after this block)
- Follows from information theory: rare words are long
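The Zipfian skew is easy to check empirically. A minimal Python sketch, assuming a placeholder plain-text file corpus.txt (any sizeable text will do):

```python
# Count word frequencies and see how much of the vocabulary is made of rare words.
# "corpus.txt" is a placeholder name for any plain-text corpus.
from collections import Counter
import re

with open("corpus.txt", encoding="utf-8") as f:
    words = re.findall(r"\w+", f.read().lower())

counts = Counter(words)
types = len(counts)                                   # distinct word forms
hapaxes = sum(1 for c in counts.values() if c == 1)   # word forms seen exactly once

print(f"tokens: {len(words)}, types: {types}")
print(f"types occurring exactly once: {hapaxes / types:.1%}")
# On natural-language corpora this fraction is typically large: the Zipfian
# long tail means most word types are rare.
```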
- Subword tokenization as a solution: split words into smaller segments
- New problem: how should words be split? What should the subword units be?
- One possibility: linguistic morphology
- Other possibilities: substrings that are not necessarily linguistic units (subword tokens)
- Byte-Pair Encoding (BPE)
- Starts with Unicode characters as symbols and pre-tokenization (word-level)
- Iterates over data, in each iteration creates one new symbol
- Each new symbol is introduced as a replacement for the most frequent bigram of symbols (toy training loop sketched below)
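A toy sketch of the BPE training loop on a small example (illustrative only: made-up word frequencies, no end-of-word marker, no tie-breaking rules):

```python
# Toy BPE training sketch (illustrative, not the reference implementation).
# Pre-tokenized corpus: word -> frequency; each word is a tuple of symbols (characters).
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in vocab.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

# Tiny example corpus (frequencies are made up for illustration).
vocab = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

for step in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best}")   # each merge adds one new symbol to the vocabulary
```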
- WordPiece
- Starts with Unicode characters as symbols and pre-tokenization (word-level)
- Iterates over data, in each iteration creates one new symbol
- Each new symbol is introduced as a replacement for the bigram of symbols with the highest association score, similar to pointwise mutual information (score sketched below)
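A sketch of the WordPiece merge criterion, assuming the commonly cited score count(ab) / (count(a) · count(b)); the original implementation differs in details (e.g. likelihood-based training and the ## continuation marker):

```python
# Sketch of the WordPiece-style merge choice on the same toy corpus as above.
from collections import Counter

def best_wordpiece_pair(vocab):
    pair_counts, symbol_counts = Counter(), Counter()
    for word, freq in vocab.items():
        for sym in word:
            symbol_counts[sym] += freq
        for a, b in zip(word, word[1:]):
            pair_counts[(a, b)] += freq

    # Unlike BPE (raw pair frequency), normalise by the frequencies of the parts.
    def score(pair):
        a, b = pair
        return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])

    return max(pair_counts, key=score)

vocab = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
print(best_wordpiece_pair(vocab))  # favours pairs whose parts rarely occur apart
```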
- Alternative, top-down approach (the basis of Morfessor and the Unigram model below): start with all possible splits in theory; in practice, with a sample of all possible splits
- Iteratively eliminate the symbols that contribute least to the log probability of the data
- Morfessor
- More popular in earlier work on morphological segmentation
- Can be tuned to put more weight on minimising either vocabulary or data size
- Unigram model
- Currently very popular
- Vocabulary size is an explicit hyper-parameter (segmentation step sketched below)
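A sketch of the segmentation step of the unigram model: with a fixed subword vocabulary and made-up piece probabilities, a Viterbi search over prefixes picks the split with the highest product of unigram probabilities. Training (not shown) alternates re-estimating these probabilities with pruning the pieces whose removal costs the least likelihood, until the target vocabulary size is reached.

```python
# Unigram-model segmentation sketch; the vocabulary and probabilities are made up.
import math

vocab_logp = {p: math.log(prob) for p, prob in {
    "un": 0.05, "fold": 0.04, "able": 0.06, "unf": 0.001, "old": 0.03,
    "a": 0.08, "b": 0.05, "l": 0.06, "e": 0.09, "u": 0.04,
    "n": 0.07, "f": 0.05, "o": 0.06, "d": 0.05,
}.items()}

def segment(word):
    # best[i] = (best log-probability of word[:i], best segmentation of word[:i])
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for i in range(1, len(word) + 1):
        for j in range(max(0, i - 8), i):          # cap piece length at 8 for speed
            piece = word[j:i]
            if piece in vocab_logp and best[j][1] is not None:
                cand = best[j][0] + vocab_logp[piece]
                if cand > best[i][0]:
                    best[i] = (cand, best[j][1] + [piece])
    return best[len(word)][1]

print(segment("unfoldable"))   # e.g. ['un', 'fold', 'able'] with these toy probabilities
```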
- Following from information theory: the shorter the symbols, the more often they recur
- If there are regular patterns, they will recur; in this sense, structure is recurrence
- If symbols are short and frequently recurring -> small vocabulary and more evidence for estimating probabilities, but longer encoded data
- If symbols are long -> large vocabulary and little evidence for estimating probabilities, but shorter encoded data
- The goal of subword segmentation: find the symbols that minimise both sizes (data and vocabulary); the two extremes are illustrated below
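The two extremes of the trade-off, made concrete on a toy text:

```python
# Encode the same text with characters vs whole words and compare
# vocabulary size against encoded sequence length.
text = ("the newest model is the widest model "
        "the new model is the lowest model ") * 100   # toy corpus

for name, symbols in [("characters", list(text)), ("words", text.split())]:
    vocab = set(symbols)
    print(f"{name:10s}  vocabulary size: {len(vocab):3d}   sequence length: {len(symbols)}")

# Characters: tiny vocabulary, every symbol frequent, but very long sequences.
# Words: short sequences, but on real corpora the vocabulary keeps growing and
# many word types are rare. Subword vocabularies sit between these extremes.
```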
- BPE is good for consistent, more regular data
- Unigram is better for noisy data
- WordPiece tends to merge more complete lexical items (roots)
- The vocabulary size is often decided as a function of the data size, sometimes as a proportion of the word-level vocabulary
- BPE and Unigram are implemented in the SentencePiece library
- WordPiece is used for BERT
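A minimal usage sketch, assuming the sentencepiece and transformers Python packages are installed; corpus.txt is a placeholder for real training data:

```python
import sentencepiece as spm

# Train a Unigram model with an explicit vocabulary size; model_type="bpe" selects BPE.
# vocab_size must be attainable for the given corpus.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="toy", vocab_size=4000, model_type="unigram"
)
sp = spm.SentencePieceProcessor(model_file="toy.model")
print(sp.encode("The newest model is the widest.", out_type=str))

# BERT's tokenizer is a WordPiece model; continuation pieces are marked with "##".
from transformers import AutoTokenizer
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert_tok.tokenize("tokenization"))   # typically something like ['token', '##ization']
```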