
8. Subword tokenization

Explanations and visualisations:

 

Why is text segmentation not trivial?

[Figure: data types. Source: Khan Academy]

 

[Figure: nlpflow]

  • Text is segmented into tokens (compare: frames in sound processing, pixels in image processing)
  • How should we split texts into tokens?
  • Word as a token: too naive, it overestimates the size of the vocabulary
    • fast and faster are treated as just as distinct as fast and water
    • what is a word?
    • much less clear in languages other than English

 

The problem of out-of-vocabulary (OOV) words

  • Follows from Zipf's law: most words are rare
  • Follows from information theory: rare words are long
  • Subword tokenization as a solution: split words into smaller segments (see the sketch after this list)
  • New problem: How to split words? What should be subword units?
  • One possibility: linguistic morphology
  • Other possibilities: substrings that are not necessarily linguistic units (subword tokens)
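
As a toy illustration of how subword units address the OOV problem, the sketch below segments an unseen word with a greedy longest-match over a small, made-up subword vocabulary. The vocabulary, the word, and the `[UNK]` marker are illustrative assumptions, not part of these notes.

```python
# A word-level vocabulary cannot represent an unseen word at all, while a
# small subword vocabulary can still cover it by splitting it into pieces.
word_vocab = {"fast", "faster", "water"}
subword_vocab = {"un", "fast", "en", "ed", "er", "a", "t", "w"}

def greedy_segment(word, vocab):
    """Greedy longest-match segmentation; '[UNK]' marks uncoverable parts."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append("[UNK]")
            i += 1
    return pieces

print("unfastened" in word_vocab)                   # False: OOV at word level
print(greedy_segment("unfastened", subword_vocab))  # ['un', 'fast', 'en', 'ed']
```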

 

Compression algorithms

[Figure: subword options]

 

  • Byte-Pair Encoding (BPE), sketched in code after this list
    • Starts with Unicode characters as symbols and pre-tokenization (word-level)
    • Iterates over data, in each iteration creates one new symbol
    • Each new symbol is introduced as a replacement for the most frequent bigram of symbols
  • WordPiece
    • Starts with Unicode characters as symbols and pre-tokenization (word-level)
    • Iterates over data, in each iteration creates one new symbol
    • Each new symbol is introduced as a replacement for the bigram of symbols with the highest association score (similar to mutual information)
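
A minimal sketch of the BPE merge loop on a toy word-frequency table; the corpus, the number of merges, and the end-of-word marker `</w>` are illustrative assumptions, not part of these notes. WordPiece would differ only in the selection criterion: an association score instead of raw bigram frequency.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Pre-tokenized corpus: word -> frequency, each word split into characters.
corpus = {"fast": 10, "faster": 6, "fastest": 3, "water": 5}
vocab = {tuple(word) + ("</w>",): freq for word, freq in corpus.items()}

merges = []
for _ in range(8):  # number of merges = number of new symbols (a hyper-parameter)
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # BPE: most frequent bigram of symbols
    vocab = merge_pair(best, vocab)
    merges.append(best)

# The learned merge rules, in order of creation:
# starts with ('s', 't'), ('f', 'a'), ('fa', 'st'), so 'fast' emerges as one symbol.
print(merges)
```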

 

Probability models

[Figure: data-represent]

  • Start with all possible splits (in theory; in practice, with a sample of all possible splits)
  • Eliminate symbols that contribute least to increasing the log probability of the data
  • Morfessor
    • More popular in earlier work on morphological segmentation
    • Can be tuned to put more weight on minimising either vocabulary or data size
  • Unigram model (see the segmentation sketch after this list)
    • Currently very popular
    • Vocabulary size is an explicit hyper-parameter
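
A toy sketch of how a trained unigram model segments a word: pick the split that maximises the sum of subword log probabilities (Viterbi over split points). The vocabulary and probabilities below are invented for illustration; real training additionally re-estimates the probabilities (EM) and prunes the symbols that contribute least to the data log probability.

```python
import math

# Hypothetical subword vocabulary with unigram probabilities.
logp = {s: math.log(p) for s, p in {
    "fast": 0.05, "er": 0.04, "est": 0.03, "f": 0.01, "a": 0.02,
    "s": 0.02, "t": 0.02, "e": 0.02, "r": 0.01,
}.items()}

def segment(word):
    """Best-scoring segmentation of `word` under the unigram model."""
    n = len(word)
    best = [(-math.inf, None)] * (n + 1)  # best[i] = (score, split point) for word[:i]
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp:
                score = best[start][0] + logp[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # Backtrack from the end of the word to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return list(reversed(pieces))

print(segment("faster"))   # ['fast', 'er']
print(segment("fastest"))  # ['fast', 'est']
```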

 

The trade-off between data (=text) size and vocabulary size

  • Following from information theory: the shorter the symbol, the more often it re-occurs
  • If there are regular patterns, they will re-occur; in this sense, structure is recurrence
  • If symbols are short and re-occurring -> small vocabulary, more evidence for estimating probabilities, but longer data
  • If symbols are long -> big vocabulary, little evidence for estimating probabilities, but shorter data
  • The goal of subword segmentation: find the optimal symbols, minimising both sizes (data and vocabulary); a toy illustration follows below
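
A toy illustration of the trade-off, on an invented sentence: character-level tokens give a tiny vocabulary but a long token sequence, word-level tokens the opposite; subword tokenization sits in between.

```python
text = "faster tokenizers tokenize faster"

char_tokens = list(text.replace(" ", "_"))  # character-level tokenization
word_tokens = text.split()                  # word-level tokenization

print(len(set(char_tokens)), len(char_tokens))  # small vocabulary, long data
print(len(set(word_tokens)), len(word_tokens))  # big vocabulary, short data
```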

 

Practical tips

  • BPE works well for consistent, more regular data
  • Unigram works better for noisy data
  • WordPiece merges more lexical items (roots)
  • Vocabulary size is often decided as a function of the data size, sometimes as a proportion of the word-level vocabulary
  • BPE and Unigram are implemented in the SentencePiece library (see the usage sketch below)
  • WordPiece is used for BERT
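
A minimal usage sketch for SentencePiece, assuming a plain-text training file named corpus.txt (one sentence per line) and illustrative hyper-parameter values that are not taken from these notes:

```python
# pip install sentencepiece
import sentencepiece as spm

# Train a subword model; model_type can be "unigram" (default) or "bpe",
# the two algorithms discussed above. Vocabulary size is an explicit choice.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="subword",
    vocab_size=8000,
    model_type="unigram",
)

# Load the trained model and segment new text into subword tokens.
sp = spm.SentencePieceProcessor(model_file="subword.model")
print(sp.encode("faster tokenization", out_type=str))
```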