
BERT: Bidirectional Encoder Representations from Transformers

Quick notes

  • Learns contextual representations (similar to what ELMo did)
  • Uses the encoder from a Transformer network
  • Can be fine-tuned with just one additional output layer
    • question answering
    • language inference

Background

Problem: language models use only left context or right context, but language understanding is bidirectional

Why are LMs unidirectional?

  1. Directionality is needed to generate a well-formed probability distribution
  2. Words can "see themselves" in a bidirectional encoder, so predicting each word from its full context becomes trivial

Pre-training Tasks

Train on Wikipedia + BookCorpus

  1. Masked LM
  2. Next Sentence Prediction

1. Masked LM

Mask out $k\%$ of the input words, then predict the masked words (BERT uses $k = 15\%$)

  • Too little masking: Too expensive to train
  • Too much masking: Not enough context

In detail:

Rather than always replacing the chosen words with the [MASK] token (a code sketch of this rule follows the list):

  • 80% of the time: Replace the word with the [MASK] token
  • 10% of the time: Replace the word with a random word
  • 10% of the time: Keep the word unchanged
    • The purpose is to bias the representation towards the actual observed word
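
A minimal sketch of this 80/10/10 rule (the helper name `mask_tokens` and the toy vocabulary are illustrative, not taken from the BERT codebase):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Select ~15% of tokens as prediction targets, then replace 80% of them
    with [MASK], 10% with a random word, and keep 10% unchanged."""
    masked = list(tokens)
    targets = {}  # position -> original token (labels the model must predict)
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = token
            r = random.random()
            if r < 0.8:                       # 80%: replace with [MASK]
                masked[i] = "[MASK]"
            elif r < 0.9:                     # 10%: replace with a random word
                masked[i] = random.choice(vocab)
            # else: 10%: keep the original word unchanged
    return masked, targets

# Example usage
vocab = ["the", "man", "went", "to", "store", "dog", "milk"]
tokens = ["the", "man", "went", "to", "the", "store"]
print(mask_tokens(tokens, vocab))
```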

2. Next Sentence Prediction

To learn relationships between sentences (a toy pair-construction sketch follows the list below)

Predict whether:

  1. IsNextSentence: Sentence B is the actual sentence that follows Sentence A (A $\rightarrow$ B)
  2. NotNextSentence: Sentence B is just a random sentence (A $\rightarrow$ random)
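
A toy sketch of how NSP training pairs can be built (function and variable names are illustrative; in the actual setup the random sentence is drawn from the whole corpus):

```python
import random

def make_nsp_example(sentences, idx):
    """Pair sentence idx with its true successor 50% of the time
    (IsNextSentence), otherwise with a random sentence (NotNextSentence)."""
    sent_a = sentences[idx]
    if random.random() < 0.5 and idx + 1 < len(sentences):
        sent_b, label = sentences[idx + 1], "IsNextSentence"
    else:
        sent_b, label = random.choice(sentences), "NotNextSentence"
    return sent_a, sent_b, label

sentences = ["the man went to the store .",
             "he bought a gallon of milk .",
             "penguins are flightless birds ."]
print(make_nsp_example(sentences, 0))
```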

Model

  • Transformer encoder

2 model sizes (example configurations sketched below)

  • BERT-Base: 12-layer (Transformer blocks), 768-hidden, 12-head
  • BERT-Large: 24-layer, 1024-hidden, 16-head
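
As a rough sketch, assuming the Hugging Face transformers library (not mentioned in these notes), the two sizes correspond to configurations like:

```python
from transformers import BertConfig

# BERT-Base: 12 layers, hidden size 768, 12 attention heads
base_config = BertConfig(num_hidden_layers=12, hidden_size=768,
                         num_attention_heads=12, intermediate_size=3072)

# BERT-Large: 24 layers, hidden size 1024, 16 attention heads
large_config = BertConfig(num_hidden_layers=24, hidden_size=1024,
                          num_attention_heads=16, intermediate_size=4096)
```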

Sentence Pair Embedding

Input Embedding = sum of three embeddings (sketched in code below)

  • Token Embeddings - WordPiece token embeddings
  • Segment Embeddings - a learned embedding marking which sentence each token belongs to
    • Tokens from [CLS] through the first [SEP] share one segment embedding
    • Tokens of the second sentence up to the final [SEP] share the other
  • Position Embeddings - position embeddings as in other Transformer architectures
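
A minimal PyTorch sketch of this sum (class name and toy token ids are illustrative; layer norm and dropout are omitted):

```python
import torch
import torch.nn as nn

class BertInputEmbedding(nn.Module):
    """BERT input representation: token + segment + position embeddings."""
    def __init__(self, vocab_size, hidden_size=768, max_len=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_size)    # WordPiece tokens
        self.segment_emb = nn.Embedding(2, hidden_size)           # sentence A vs. B
        self.position_emb = nn.Embedding(max_len, hidden_size)    # learned positions

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token_emb(token_ids)
                + self.segment_emb(segment_ids)
                + self.position_emb(positions))

# Toy ids for "[CLS] sentence A [SEP] sentence B [SEP]"
token_ids = torch.tensor([[101, 7592, 102, 2088, 102]])
segment_ids = torch.tensor([[0, 0, 0, 1, 1]])   # 0 = sentence A, 1 = sentence B
emb = BertInputEmbedding(vocab_size=30522)(token_ids, segment_ids)
print(emb.shape)   # torch.Size([1, 5, 768])
```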

Fine-tuning

Simply learn a classifier on top of the final layer for each task you fine-tune on
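
A minimal sketch using the Hugging Face transformers library (an assumption, not part of these notes), which adds exactly one classification head on top of the pre-trained encoder:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load the pre-trained encoder and add a fresh task-specific classification head.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

# A single sentence-pair example (e.g. for language inference).
inputs = tokenizer("The man went to the store.",
                   "He bought a gallon of milk.",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits   # only the output layer is task-specific
print(logits.shape)                   # torch.Size([1, 2])
```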

Resources

Tutorial

Application

Implementation