
BERT: Bidirectional Encoder Representations from Transformers

Quick notes

  • Learns contextual representations (similar to what ELMo did)
  • Uses the encoder from a Transformer network
  • Can be fine-tuned with just one additional output layer
    • question answering
    • language inference

Background

Problem: language models use only left context or right context, but language understanding is bidirectional

Why are LMs unidirectional?

  1. Directionality is needed to generate a well-formed probability distribution
  2. Words can "see themselves" in a bidirectional encoder, so predicting each word from its full context becomes trivial

Pre-training Tasks

Train on Wikipedia + BookCorpus

  1. Masked LM
  2. Next Sentence Prediction

1. Masked LM

Mask out $k\%$ of the input words, then predict the masked words (BERT uses $k = 15\%$)

  • Too little masking: Too expensive to train
  • Too much masking: Not enough context

In detail:

Rather than always replacing the chosen words with the [MASK] token (a code sketch of this rule follows the list):

  • 80% of the time: Replace the word with the [MASK] token
  • 10% of the time: Replace the word with a random word
  • 10% of the time: Keep the word unchanged
    • The purpose is to bias the representation towards the actual observed word
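
A minimal sketch of this 80/10/10 rule (the helper name `mask_tokens` and the toy vocabulary are illustrative, not taken from the BERT codebase):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Select ~15% of tokens as prediction targets, then replace 80% of them
    with [MASK], 10% with a random word, and keep 10% unchanged."""
    masked = list(tokens)
    targets = {}  # position -> original token (labels the model must predict)
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = token
            r = random.random()
            if r < 0.8:                       # 80%: replace with [MASK]
                masked[i] = "[MASK]"
            elif r < 0.9:                     # 10%: replace with a random word
                masked[i] = random.choice(vocab)
            # else: 10%: keep the original word unchanged
    return masked, targets

# Example usage
vocab = ["the", "man", "went", "to", "store", "dog", "milk"]
tokens = ["the", "man", "went", "to", "the", "store"]
print(mask_tokens(tokens, vocab))
```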

2. Next Sentence Prediction

To learn relationships between sentences (a toy pair-construction sketch follows the list below)

Predict whether:

  1. IsNextSentence: Sentence B is the actual sentence that follows Sentence A (A $\rightarrow$ B)
  2. NotNextSentence: Sentence B is just a random sentence (A $\rightarrow$ random)
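
A toy sketch of how NSP training pairs can be built (function and variable names are illustrative; in the actual setup the random sentence is drawn from the whole corpus):

```python
import random

def make_nsp_example(sentences, idx):
    """Pair sentence idx with its true successor 50% of the time
    (IsNextSentence), otherwise with a random sentence (NotNextSentence)."""
    sent_a = sentences[idx]
    if random.random() < 0.5 and idx + 1 < len(sentences):
        sent_b, label = sentences[idx + 1], "IsNextSentence"
    else:
        sent_b, label = random.choice(sentences), "NotNextSentence"
    return sent_a, sent_b, label

sentences = ["the man went to the store .",
             "he bought a gallon of milk .",
             "penguins are flightless birds ."]
print(make_nsp_example(sentences, 0))
```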

Model

  • Transformer encoder

2 model sizes (example configurations sketched below)

  • BERT-Base: 12-layer (Transformer blocks), 768-hidden, 12-head
  • BERT-Large: 24-layer, 1024-hidden, 16-head
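
As a rough sketch, assuming the Hugging Face transformers library (not mentioned in these notes), the two sizes correspond to configurations like:

```python
from transformers import BertConfig

# BERT-Base: 12 layers, hidden size 768, 12 attention heads
base_config = BertConfig(num_hidden_layers=12, hidden_size=768,
                         num_attention_heads=12, intermediate_size=3072)

# BERT-Large: 24 layers, hidden size 1024, 16 attention heads
large_config = BertConfig(num_hidden_layers=24, hidden_size=1024,
                          num_attention_heads=16, intermediate_size=4096)
```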

Sentence Pair Embedding

Input Embedding = sum of three embeddings (sketched in code below)

  • Token Embeddings - WordPiece token embeddings
  • Segment Embeddings - a learned embedding marking which sentence each token belongs to
    • Tokens from [CLS] through the first [SEP] share one segment embedding
    • Tokens of the second sentence up to the final [SEP] share the other
  • Position Embeddings - position embeddings as in other Transformer architectures
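
A minimal PyTorch sketch of this sum (class name and toy token ids are illustrative; layer norm and dropout are omitted):

```python
import torch
import torch.nn as nn

class BertInputEmbedding(nn.Module):
    """BERT input representation: token + segment + position embeddings."""
    def __init__(self, vocab_size, hidden_size=768, max_len=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_size)    # WordPiece tokens
        self.segment_emb = nn.Embedding(2, hidden_size)           # sentence A vs. B
        self.position_emb = nn.Embedding(max_len, hidden_size)    # learned positions

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token_emb(token_ids)
                + self.segment_emb(segment_ids)
                + self.position_emb(positions))

# Toy ids for "[CLS] sentence A [SEP] sentence B [SEP]"
token_ids = torch.tensor([[101, 7592, 102, 2088, 102]])
segment_ids = torch.tensor([[0, 0, 0, 1, 1]])   # 0 = sentence A, 1 = sentence B
emb = BertInputEmbedding(vocab_size=30522)(token_ids, segment_ids)
print(emb.shape)   # torch.Size([1, 5, 768])
```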

Fine-tuning

Simply learn a classifier on top of the final layer for each task you fine-tune on
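
A minimal sketch using the Hugging Face transformers library (an assumption, not part of these notes), which adds exactly one classification head on top of the pre-trained encoder:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load the pre-trained encoder and add a fresh task-specific classification head.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

# A single sentence-pair example (e.g. for language inference).
inputs = tokenizer("The man went to the store.",
                   "He bought a gallon of milk.",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits   # only the output layer is task-specific
print(logits.shape)                   # torch.Size([1, 2])
```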

Resources

Tutorial

Application

Implementation