Quick notes
- Learning contextual representations (similar to what ELMo did)
- Using the encoder from a Transformer network
- Can be fine-tuned with just one additional output layer
  - question answering
  - language inference
Problem: Language Models only use left context or right context, but language understanding is bidirectional
Why are LMs unidirectional?
- Directionality is needed to generate a well-formed probability distribution
- Words can "see themselves" in a bidirectional encoder (see the mask sketch below)
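
A minimal NumPy sketch (not from the original notes) of the issue: a left-to-right LM uses a causal, lower-triangular attention mask, while a bidirectional encoder attends over the full sequence, so each position can see its own input token and predicting it becomes trivial unless that token is hidden from the input.

```python
import numpy as np

seq_len = 5

# Left-to-right LM: causal (lower-triangular) mask.
# Position i may only attend to positions <= i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Bidirectional encoder: full mask.
# Every position attends to every position, including itself, so a plain
# "predict token i" objective leaks the answer; the Masked LM trick hides
# the token from the input instead.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)

print(causal_mask.astype(int))
print(bidirectional_mask.astype(int))
```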
Train on Wikipedia + BookCorpus with two pre-training tasks:
- Masked LM
- Next Sentence Prediction
Mask out $k\%$ of the input words, and then predict the masked words (where $k = 15\%$)
- Too little masking: Too expensive to train (few prediction targets per pass, so more steps are needed)
- Too much masking: Not enough context left to condition on
In detail:
Rather than always replacing the chosen words with the [MASK] token:
- 80% of the time: Replace the word with the [MASK] token
- 10% of the time: Replace the word with a random word
- 10% of the time: Keep the word unchanged
- The purpose of this is to bias the representation towards the actual observed word (see the sketch below)
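
A minimal Python sketch of the selection-and-replacement rule above, assuming a toy whitespace tokenizer and vocabulary rather than BERT's actual WordPiece pipeline:

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    """Return (corrupted_tokens, targets).

    targets[i] is the original token if position i was chosen for
    prediction, else None (unchosen positions contribute no loss).
    """
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:          # choose ~15% of positions
            targets.append(tok)                  # always predict the original word
            r = random.random()
            if r < 0.8:                          # 80%: replace with [MASK]
                corrupted.append("[MASK]")
            elif r < 0.9:                        # 10%: replace with a random word
                corrupted.append(random.choice(vocab))
            else:                                # 10%: keep the word unchanged
                corrupted.append(tok)
        else:
            corrupted.append(tok)
            targets.append(None)
    return corrupted, targets

# Example usage with a toy vocabulary.
vocab = ["the", "man", "went", "to", "store", "dog", "cat"]
tokens = "the man went to the store".split()
print(mask_tokens(tokens, vocab))
```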
To learn the relationship between sentences
Predict whether (pair construction is sketched below):
- IsNextSentence: Sentence B is the actual sentence that follows Sentence A (A $\rightarrow$ B)
- NotNextSentence: Sentence B is just a random sentence (A $\rightarrow$ random)
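
A rough sketch of how 50/50 IsNextSentence / NotNextSentence training pairs can be built; the helper and data layout are illustrative assumptions, not BERT's actual data pipeline:

```python
import random

def make_nsp_pair(documents):
    """Return (sentence_a, sentence_b, is_next) for Next Sentence Prediction.

    documents: list of documents, each a list of sentences.
    50% of the time B is the true next sentence, 50% a random one.
    (Simplified: the random sentence could in principle come from the
    same document; the real pipeline avoids that.)
    """
    doc = random.choice(documents)
    idx = random.randrange(len(doc) - 1)
    sentence_a = doc[idx]
    if random.random() < 0.5:
        return sentence_a, doc[idx + 1], True             # IsNextSentence
    random_doc = random.choice(documents)
    return sentence_a, random.choice(random_doc), False   # NotNextSentence

docs = [
    ["the man went to the store", "he bought a gallon of milk"],
    ["penguins are flightless birds", "they live mostly in the southern hemisphere"],
]
print(make_nsp_pair(docs))
```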
- Transformer encoder
2 model sizes
- BERT-Base: 12-layer (Transformer blocks), 768-hidden, 12-head
- BERT-Large: 24-layer, 1024-hidden, 16-head
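
The same two configurations written out as a plain config dict (the feed-forward size is 4x the hidden size and the parameter counts are the figures reported in the paper):

```python
# Published BERT configurations, for quick reference.
BERT_CONFIGS = {
    "bert-base":  {"num_layers": 12, "hidden_size": 768,  "num_heads": 12,
                   "feedforward_size": 3072, "parameters": "110M"},
    "bert-large": {"num_layers": 24, "hidden_size": 1024, "num_heads": 16,
                   "feedforward_size": 4096, "parameters": "340M"},
}

for name, cfg in BERT_CONFIGS.items():
    print(name, cfg)
```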
Input Embedding = the sum of three embeddings:
- Token Embeddings - Token embeddings are over WordPiece sub-word units
- Segment Embeddings - A learned segment embedding marks which sentence each token belongs to
  - Everything from [CLS] up to the first [SEP] shares one segment embedding
  - The second sentence, through the final [SEP], shares a second segment embedding
- Position Embeddings - Positional embeddings play the same role as in other Transformer architectures (BERT learns them; see the sketch below)
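
A PyTorch-style sketch of summing the three embedding tables; the module name and default sizes (30522-token vocabulary, 512 max positions) are illustrative assumptions in the spirit of BERT-Base:

```python
import torch
import torch.nn as nn

class BertInputEmbedding(nn.Module):
    """Input embedding = token + segment + position (BERT then applies LayerNorm/dropout)."""

    def __init__(self, vocab_size=30522, hidden=768, max_len=512, num_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)      # WordPiece token embeddings
        self.segment = nn.Embedding(num_segments, hidden)  # sentence A vs sentence B
        self.position = nn.Embedding(max_len, hidden)      # learned positions
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = (self.token(token_ids)
             + self.segment(segment_ids)
             + self.position(positions))                   # broadcasts over the batch
        return self.norm(x)

# [CLS] my dog is cute [SEP] he likes play ##ing [SEP]  -> 11 tokens
token_ids = torch.randint(0, 30522, (1, 11))
segment_ids = torch.tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]])
print(BertInputEmbedding()(token_ids, segment_ids).shape)  # torch.Size([1, 11, 768])
```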
Simply learn a classifier on top of the final layer for each task you fine-tune on (see the sketch below)
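
A hedged fine-tuning sketch using the Hugging Face transformers package (not mentioned in these notes), which wraps exactly this pattern: the pre-trained encoder plus one new classification layer on the [CLS] representation, all trained end to end:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Pre-trained encoder + one new linear layer on the [CLS] representation.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = tokenizer(["a great movie", "a terrible movie"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

# One fine-tuning step: both the new head and all encoder weights are updated.
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(outputs.loss.item(), outputs.logits.shape)
```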
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
- Small pre-trained models can also win 13 NLP tasks: Google's ALBERT tops the GLUE benchmark with three major modifications
- TDLS: BERT, Pre-trained Deep Bidirectional Transformers for Language Understanding (algorithm) - TODO
- BERT to the rescue! - Towards Data Science
- terrifyzhao/bert-utils - Generate sentence vector, document classification, document similarity
- hanxiao/bert-as-service: Mapping a variable-length sentence to a fixed-length vector using BERT model
- CyberZHG/keras-bert: Implementation of BERT that could load official pre-trained models for feature extraction and prediction
- bojone/bert4keras: Our light reimplement of bert for keras
- kaushaltrivedi/fast-bert: Super easy library for BERT based NLP models
- brightmart/albert_zh: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS, massive Chinese pre-trained ALBERT models
- bhoov/exbert: A Visual Analysis Tool to Explore Learned Representations in Transformers Models
- tomohideshibata/BERT-related-papers: BERT-related papers