In this lab we will build a Question Answering model for SQuAD based on BERT using AllenNLP.
Make sure that the script `setup_dependencies.sh` installed every package and downloaded every data file. In particular, make sure that `pytorch-transformers` is installed and that your `resources` folder contains the following files:

- `bert-base-uncased/pytorch_model.bin` (pretrained BERT model)
- `bert-base-uncased/config.json` (pretrained BERT model parameters)
- `bert-base-uncased/vocab.txt` (vocabulary for the pretrained model)
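If you want to double-check, here is a small optional sanity check in Python; it assumes you run it from the repository root, which is our assumption about your setup:

```python
# Verify that the pretrained BERT resources listed above are in place.
import os

for name in ("pytorch_model.bin", "config.json", "vocab.txt"):
    path = os.path.join("resources", "bert-base-uncased", name)
    print(path, "OK" if os.path.exists(path) else "MISSING - rerun setup_dependencies.sh")
```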
We will fine-tune the BERT model on the SQuAD 2.0 dataset. In this tutorial we will use just a small part of it, since we will be running things on your laptop; you are more than welcome to try out the full dataset on a GPU-enabled machine too.
The portion of the data that we will be using in this tutorial is available in the folder `data/squad/`. We will use the file `train.json` for model training and `test.json` for the evaluation phase.
BERT is a large-scale language model trained with a masked language modelling objective (together with a next-sentence prediction objective). In this tutorial we will show you how to fine-tune a BERT model, pretrained on millions of text documents, to complete a question answering task like SQuAD. In particular, we will rely on BERT's encoder to represent both the question and the reference paragraph. Imagine that you have the following example:
- Paragraph:
The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.
- Question: In what country is Normandy located?
We will encode the question and the paragraph following the BERT encoding scheme:

`[CLS] in what country is normandy located ? [SEP] the norman ##s ( norman : no ##ur ##man ##ds ; french : norman ##ds ; latin : norman ##ni ) were the people who in the 10th and 11th centuries (...)`
More details about this fine-tuning procedure and the BERT encoding scheme can be found in the original paper by Devlin et al. (2018). We created an AllenNLP dataset reader that is able to process the SQuAD dataset examples and format them following the BERT encoding scheme.
Note: BERT by default performs word-piece tokenization. See how a word like `normans` gets split into `norman ##s`, while `nourmands` gets split into more than two word pieces: `no ##ur ##man ##ds`. The dataset reader provided takes care of that.
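If you want to see the encoding and the word-piece behaviour yourself, here is a small sketch using the `pytorch-transformers` tokenizer; the vocabulary path follows the `resources/` layout described earlier and is an assumption about your setup:

```python
# Reproduce the BERT encoding scheme: [CLS] question [SEP] paragraph [SEP]
from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("resources/bert-base-uncased/")

question = "In what country is Normandy located?"
paragraph = ("The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) "
             "were the people who in the 10th and 11th centuries gave their name to Normandy ...")

wordpieces = (["[CLS]"] + tokenizer.tokenize(question) + ["[SEP]"]
              + tokenizer.tokenize(paragraph) + ["[SEP]"])
print(wordpieces)  # note word pieces such as 'norman', '##s' and 'no', '##ur', '##man', '##ds'
input_ids = tokenizer.convert_tokens_to_ids(wordpieces)
```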
In this exercise you will create a model that predicts the boundary of the answer span given an encoded representation generated by BERT. The task merely consists of predicting two integer values:

- `start_position`: start position of the answer in the reference document
- `end_position`: end position of the answer in the reference document
To complete the exercise, we provide you with a basic template of the QA model in the file `athnlp/models/qa_bert.py`; every method in it should be implemented. The model definition contains four main methods (a rough skeleton is sketched after the list):
- `__init__`: the constructor of the main class, used to initialise all the model parameters. You are supposed to initialise the layer used to predict the span here.
- `forward`: the forward pass of the model. We want to encode the input representation using BERT and then use a linear layer to predict the start and the end of the answer.
- `decode`: given the model predictions, converts them to tokens for visualisation and evaluation purposes.
- `get_metrics`: evaluates the metrics start position accuracy, end position accuracy and span position accuracy.
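Here is that rough, non-authoritative skeleton. The class name, constructor arguments and field names (`self.bert_model`, `self.qa_outputs`) are our own guesses for illustration, not the actual template, so check the real stubs in `athnlp/models/qa_bert.py`:

```python
from typing import Dict

import torch
from allennlp.models.model import Model


class BertQaSketch(Model):
    """Illustrative outline only; the real template in athnlp/models/qa_bert.py may differ."""

    def __init__(self, vocab, bert_model, hidden_size: int = 768) -> None:
        super().__init__(vocab)
        self.bert_model = bert_model
        # Layer used to predict the span: one score per token for "start" and one for "end".
        self.qa_outputs = torch.nn.Linear(hidden_size, 2)

    def forward(self, tokens, span_start=None, span_end=None) -> Dict[str, torch.Tensor]:
        raise NotImplementedError  # encode with BERT, then predict start/end (sketched later)

    def decode(self, output_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        raise NotImplementedError  # convert predicted indices back to answer tokens

    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        raise NotImplementedError  # start/end/span position accuracies
```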
In this tutorial, we will be using the BERT API provided by `pytorch-transformers`. In particular, we are interested in using the class `BertModel`. Using the configuration file that we created (see `resources/bert-base-uncased/config.json` for details), the BERT model will generate the following outputs:
- `last_hidden_state`: Sequence of hidden states at the output of the last layer of the model. `torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`.
- `pooler_output`: Last-layer hidden state of the first token of the sequence (the classification token), further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during BERT pretraining. `torch.FloatTensor` of shape `(batch_size, hidden_size)`.
- `attentions`: Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads. List of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.
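As a quick orientation, here is a minimal sketch of calling `BertModel` directly and unpacking these outputs; the resource paths and the toy input string are our assumptions for illustration:

```python
import torch
from pytorch_transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("resources/bert-base-uncased/")
bert = BertModel.from_pretrained("resources/bert-base-uncased/")
bert.eval()

text = "[CLS] in what country is normandy located ? [SEP] the normans were norse raiders [SEP]"
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))])

with torch.no_grad():
    outputs = bert(input_ids)

last_hidden_state = outputs[0]  # (batch_size, sequence_length, hidden_size)
pooler_output = outputs[1]      # (batch_size, hidden_size)
# With "output_attentions": true in config.json (see the second exercise below),
# `outputs` additionally contains the per-layer attention tensors.
```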
Given the output of the BERT model, it is possible to use two different strategies to predict the answer span:

- learn a Linear layer that predicts the answer span given the average of the hidden states contained in `last_hidden_state`;
- learn a Linear layer that predicts the answer span given the `pooler_output`.
You are free to experiment with both strategies, but we recommend the first one. The reason behind our preference is that the BERT `pooler_output` is usually not a good summary of the semantic content of the input, because it is used during the original BERT training phase for a different task.
For the sake of consistency, we have already provided a specific signature for `forward` that your model implementation should follow. The required inputs and outputs are defined as follows:
- `tokens`: `Dict[str, torch.LongTensor]`. From a `TextField` (that has a bert-pretrained token indexer).
- `span_start`: `torch.IntTensor`, optional (default = None). A tensor of shape `(batch_size, 1)` which contains the `start_position` of the answer in the passage, or 0 if impossible. This is an *inclusive* token index. If this is given, we will compute a loss that gets included in the output dictionary.
- `span_end`: `torch.IntTensor`, optional (default = None). A tensor of shape `(batch_size, 1)` which contains the `end_position` of the answer in the passage, or 0 if impossible. This is an *inclusive* token index. If this is given, we will compute a loss that gets included in the output dictionary.
The forward pass should return an output dictionary consisting of:

- `logits`: `torch.FloatTensor`. A tensor of shape `(batch_size, num_tokens)` representing unnormalized log probabilities of the label.
- `start_probs`: `torch.FloatTensor`. A tensor of shape `(batch_size, num_tokens)` representing probabilities of the label, obtained by applying a softmax to the predicted logits.
- `end_probs`: `torch.FloatTensor`. A tensor of shape `(batch_size, num_tokens)` representing probabilities of the label, obtained by applying a softmax to the predicted logits.
- `best_span`: `torch.LongTensor`. A tensor of shape `(batch_size, 2)` representing the predicted start and end positions of the answer for each element of the batch. We suggest using the function `get_best_span` already implemented in AllenNLP.
- `loss`: `torch.FloatTensor`, optional. The loss to be optimised; this will be the average of the losses computed for the start position predictions and for the end position predictions.
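To make these requirements concrete, here is a hedged sketch of one way the forward pass could produce the outputs above, projecting every token's hidden state in `last_hidden_state` to a start score and an end score (the per-token scheme used in the original BERT paper for SQuAD). The field names `self.bert_model` and `self.qa_outputs` are the hypothetical ones from the skeleton earlier, and the key under which the token ids arrive depends on your token indexer configuration:

```python
import torch.nn.functional as F
# get_best_span is the AllenNLP helper mentioned above; the import path may vary with your AllenNLP version.
from allennlp.models.reading_comprehension.util import get_best_span


def forward(self, tokens, span_start=None, span_end=None):
    input_ids = tokens["tokens"]  # (batch_size, num_tokens); key depends on your token indexer
    last_hidden_state, pooler_output = self.bert_model(input_ids)[:2]

    span_scores = self.qa_outputs(last_hidden_state)            # (batch_size, num_tokens, 2)
    start_logits, end_logits = [t.squeeze(-1) for t in span_scores.split(1, dim=-1)]

    output = {
        "logits": start_logits,                                  # one reasonable packing choice
        "start_probs": F.softmax(start_logits, dim=-1),
        "end_probs": F.softmax(end_logits, dim=-1),
        "best_span": get_best_span(start_logits, end_logits),    # (batch_size, 2)
    }
    if span_start is not None and span_end is not None:
        start_loss = F.cross_entropy(start_logits, span_start.squeeze(-1).long())
        end_loss = F.cross_entropy(end_logits, span_end.squeeze(-1).long())
        output["loss"] = (start_loss + end_loss) / 2.0
    return output
```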
Note: We recommend that you train and predict with the built-in `allennlp train` / `allennlp predict` commands. If you need to debug your code, you can programmatically execute the training process from `athnlp/qa.py`.
We will report performance using the official SQuAD evaluation metrics (please see Rajpurkar and Jia et al., 2018 for details).
BERT incorporates a stack of multi-head attention layers (12 layers) which are used to learn a contextualised representation
of every token in the input utterance. In this second exercise we want to add an additional output to our model
that represents the attention values for every layer of the BERT model. The default implementation of `BertModel` does not return the attention scores generated by BERT. In order to have access to the attention scores, you need to add the following key-value pair to the BERT configuration file (`resources/bert-base-uncased/config.json`): `"output_attentions": true`.
In your AllenNLP model implementation you will add a new key to the output dictionary:
- `question_passage_attentions`: list of `torch.FloatTensor` of shape `(batch_size, num_heads, sequence_length, sequence_length)`
Use some of the test examples to visualise the model attentions. You might want to visualise the attention values for just the last layer of BERT, or you can create a grid containing the attention scores for all 12 BERT layers. In order to visualise the attention values, you can reuse the code provided with the Neural Machine Translation predictor in Lab 5 and adapt it for BERT.
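As a starting point, here is a minimal matplotlib sketch (the function and variable names are ours) that plots one head of the last layer, given the list of attention tensors added to the output dictionary and the corresponding word pieces:

```python
import matplotlib.pyplot as plt


def plot_last_layer_attention(question_passage_attentions, wordpieces, head=0):
    """Heat map of one attention head from the last BERT layer (first batch element)."""
    weights = question_passage_attentions[-1][0, head].detach().cpu().numpy()
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.imshow(weights, cmap="viridis")
    ax.set_xticks(range(len(wordpieces)))
    ax.set_yticks(range(len(wordpieces)))
    ax.set_xticklabels(wordpieces, rotation=90, fontsize=6)
    ax.set_yticklabels(wordpieces, fontsize=6)
    ax.set_xlabel("attended-to word piece")
    ax.set_ylabel("attending word piece")
    plt.tight_layout()
    plt.show()
```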