In this lab we will build a Question Answering model for SQuAD based on BERT using AllenNLP.
Make sure that the script `setup_dependencies.sh` installed every package and downloaded every data file. In particular, make sure that `pytorch-transformers` is installed and that your `resources` folder contains the following files:

- `bert-base-uncased/pytorch_model.bin` (pretrained BERT model)
- `bert-base-uncased/config.json` (pretrained BERT model parameters)
- `bert-base-uncased/vocab.txt` (vocabulary for the pretrained model)
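If you want to double-check, here is a small optional sanity check in Python; it assumes you run it from the repository root, which is our assumption about your setup:

```python
# Verify that the pretrained BERT resources listed above are in place.
import os

for name in ("pytorch_model.bin", "config.json", "vocab.txt"):
    path = os.path.join("resources", "bert-base-uncased", name)
    print(path, "OK" if os.path.exists(path) else "MISSING - rerun setup_dependencies.sh")
```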
We will fine-tune the BERT model on the SQuAD 2.0 dataset. In this tutorial we will use just a small part of it, since we will be running things on your laptop; you are more than welcome to try out the full dataset on a GPU-enabled machine too.
The portion of the data that we will be using in this tutorial is available in the folder `data/squad/`. We will use the file `train.json` for model training and `test.json` for the evaluation phase.
BERT is a large-scale language model trained with a masked language modelling objective (together with a next-sentence prediction objective). In this tutorial we will show you how to fine-tune a BERT model, pretrained on millions of text documents, to complete a question answering task like SQuAD. In particular, we will rely on BERT's encoder to represent both the question and the reference paragraph. Imagine that you have the following example:
- Paragraph:
The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.
- Question: In what country is Normandy located?
We will encode the question and the paragraph following the BERT encoding scheme:

`[CLS] in what country is normandy located ? [SEP] the norman ##s ( norman : no ##ur ##man ##ds ; french : norman ##ds ; latin : norman ##ni ) were the people who in the 10th and 11th centuries (...)`
More details about this fine-tuning procedure and the BERT encoding scheme can be found in the original paper by Devlin et al. (2018). We created an AllenNLP dataset reader that is able to process the SQuAD dataset examples and format them following the BERT encoding scheme.
Note: BERT by default performs word-piece tokenization. See how a word like `normans` gets split into `norman ##s`, while `nourmands` gets split into more than two word pieces: `no ##ur ##man ##ds`. The dataset reader provided takes care of that.
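If you want to see the encoding and the word-piece behaviour yourself, here is a small sketch using the `pytorch-transformers` tokenizer; the vocabulary path follows the `resources/` layout described earlier and is an assumption about your setup:

```python
# Reproduce the BERT encoding scheme: [CLS] question [SEP] paragraph [SEP]
from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("resources/bert-base-uncased/")

question = "In what country is Normandy located?"
paragraph = ("The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) "
             "were the people who in the 10th and 11th centuries gave their name to Normandy ...")

wordpieces = (["[CLS]"] + tokenizer.tokenize(question) + ["[SEP]"]
              + tokenizer.tokenize(paragraph) + ["[SEP]"])
print(wordpieces)  # note word pieces such as 'norman', '##s' and 'no', '##ur', '##man', '##ds'
input_ids = tokenizer.convert_tokens_to_ids(wordpieces)
```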
In this exercise you will create a model that predicts the boundary of the answer span given an encoded representation generated by BERT. The task merely consists of predicting two integer values:

- `start_position`: start position of the answer in the reference document
- `end_position`: end position of the answer in the reference document
To complete the exercise, we provide you with a basic template of the QA model in the file `athnlp/models/qa_bert.py`; every method in it should be implemented. The model definition contains four main methods (a rough skeleton is sketched after the list):
- `__init__`: the constructor of the main class, used to initialise all the model parameters. You are supposed to initialise the layer used to predict the span here.
- `forward`: the forward pass of the model. We want to encode the input representation using BERT and then use a linear layer to predict the start and the end of the answer.
- `decode`: given the model predictions, converts them to tokens for visualisation and evaluation purposes.
- `get_metrics`: evaluates the metrics start position accuracy, end position accuracy and span position accuracy.
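Here is that rough, non-authoritative skeleton. The class name, constructor arguments and field names (`self.bert_model`, `self.qa_outputs`) are our own guesses for illustration, not the actual template, so check the real stubs in `athnlp/models/qa_bert.py`:

```python
from typing import Dict

import torch
from allennlp.models.model import Model


class BertQaSketch(Model):
    """Illustrative outline only; the real template in athnlp/models/qa_bert.py may differ."""

    def __init__(self, vocab, bert_model, hidden_size: int = 768) -> None:
        super().__init__(vocab)
        self.bert_model = bert_model
        # Layer used to predict the span: one score per token for "start" and one for "end".
        self.qa_outputs = torch.nn.Linear(hidden_size, 2)

    def forward(self, tokens, span_start=None, span_end=None) -> Dict[str, torch.Tensor]:
        raise NotImplementedError  # encode with BERT, then predict start/end (sketched later)

    def decode(self, output_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        raise NotImplementedError  # convert predicted indices back to answer tokens

    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        raise NotImplementedError  # start/end/span position accuracies
```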
In this tutorial, we will be using the BERT API provided by `pytorch-transformers`. In particular, we are interested in using the class `BertModel`. Using the configuration file that we created (see `resources/bert-base-uncased/config.json` for details), the BERT model will generate the following outputs:
- `last_hidden_state`: Sequence of hidden states at the output of the last layer of the model. `torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`.
- `pooler_output`: Last-layer hidden state of the first token of the sequence (the classification token), further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during BERT pretraining. `torch.FloatTensor` of shape `(batch_size, hidden_size)`.
- `attentions`: Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads. List of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.
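As a quick orientation, here is a minimal sketch of calling `BertModel` directly and unpacking these outputs; the resource paths and the toy input string are our assumptions for illustration:

```python
import torch
from pytorch_transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("resources/bert-base-uncased/")
bert = BertModel.from_pretrained("resources/bert-base-uncased/")
bert.eval()

text = "[CLS] in what country is normandy located ? [SEP] the normans were norse raiders [SEP]"
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))])

with torch.no_grad():
    outputs = bert(input_ids)

last_hidden_state = outputs[0]  # (batch_size, sequence_length, hidden_size)
pooler_output = outputs[1]      # (batch_size, hidden_size)
# With "output_attentions": true in config.json (see the second exercise below),
# `outputs` additionally contains the per-layer attention tensors.
```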
Given the output of the BERT model, it is possible to use two different strategies to predict the answer span:

- learn a Linear layer that predicts the answer span given the average of the hidden states contained in `last_hidden_state`;
- learn a Linear layer that predicts the answer span given the `pooler_output`.
You are free to experiment with both strategies, but we recommend the first one. The reason behind our preference is that the BERT `pooler_output` is usually not a good summary of the semantic content of the input, because it is used during the original BERT training phase for a different task.
For the sake of consistency, we have already provided a specific signature for `forward` that your model implementation should follow. The required inputs and outputs are defined as follows:
- `tokens`: `Dict[str, torch.LongTensor]`. From a `TextField` (that has a bert-pretrained token indexer).
- `span_start`: `torch.IntTensor`, optional (default = None). A tensor of shape `(batch_size, 1)` which contains the `start_position` of the answer in the passage, or 0 if impossible. This is an *inclusive* token index. If this is given, we will compute a loss that gets included in the output dictionary.
- `span_end`: `torch.IntTensor`, optional (default = None). A tensor of shape `(batch_size, 1)` which contains the `end_position` of the answer in the passage, or 0 if impossible. This is an *inclusive* token index. If this is given, we will compute a loss that gets included in the output dictionary.
The forward pass should return an output dictionary consisting of:

- `logits`: `torch.FloatTensor`. A tensor of shape `(batch_size, num_tokens)` representing unnormalized log probabilities of the label.
- `start_probs`: `torch.FloatTensor`. A tensor of shape `(batch_size, num_tokens)` representing probabilities of the label, obtained by applying a softmax to the predicted logits.
- `end_probs`: `torch.FloatTensor`. A tensor of shape `(batch_size, num_tokens)` representing probabilities of the label, obtained by applying a softmax to the predicted logits.
- `best_span`: `torch.LongTensor`. A tensor of shape `(batch_size, 2)` representing the predicted start and end positions of the answer for each element of the batch. We suggest using the function `get_best_span` already implemented in AllenNLP.
- `loss`: `torch.FloatTensor`, optional. The loss to be optimised; this will be the average of the losses computed for the start position predictions and for the end position predictions.
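To make these requirements concrete, here is a hedged sketch of one way the forward pass could produce the outputs above, projecting every token's hidden state in `last_hidden_state` to a start score and an end score (the per-token scheme used in the original BERT paper for SQuAD). The field names `self.bert_model` and `self.qa_outputs` are the hypothetical ones from the skeleton earlier, and the key under which the token ids arrive depends on your token indexer configuration:

```python
import torch.nn.functional as F
# get_best_span is the AllenNLP helper mentioned above; the import path may vary with your AllenNLP version.
from allennlp.models.reading_comprehension.util import get_best_span


def forward(self, tokens, span_start=None, span_end=None):
    input_ids = tokens["tokens"]  # (batch_size, num_tokens); key depends on your token indexer
    last_hidden_state, pooler_output = self.bert_model(input_ids)[:2]

    span_scores = self.qa_outputs(last_hidden_state)            # (batch_size, num_tokens, 2)
    start_logits, end_logits = [t.squeeze(-1) for t in span_scores.split(1, dim=-1)]

    output = {
        "logits": start_logits,                                  # one reasonable packing choice
        "start_probs": F.softmax(start_logits, dim=-1),
        "end_probs": F.softmax(end_logits, dim=-1),
        "best_span": get_best_span(start_logits, end_logits),    # (batch_size, 2)
    }
    if span_start is not None and span_end is not None:
        start_loss = F.cross_entropy(start_logits, span_start.squeeze(-1).long())
        end_loss = F.cross_entropy(end_logits, span_end.squeeze(-1).long())
        output["loss"] = (start_loss + end_loss) / 2.0
    return output
```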
Note: We recommend that you train and predict with the built-in `allennlp train` / `allennlp predict` commands. If you need to debug your code, you can programmatically execute the training process from `athnlp/qa.py`.
We will report performance using the official SQuAD evaluation metrics (please see Rajpurkar and Jia et al., 2018 for details).
BERT incorporates a stack of multi-head attention layers (12 layers) which are used to learn a contextualised representation
of every token in the input utterance. In this second exercise we want to add an additional output to our model
that represents the attention values for every layer of the BERT model. The default implementation of `BertModel` does not return the attention scores generated by BERT. In order to have access to the attention scores, you need to add the following key-value pair to the BERT configuration file (`resources/bert-base-uncased/config.json`): `"output_attentions": true`.
In your AllenNLP model implementation you will add a new key to the output dictionary:
- `question_passage_attentions`: list of `torch.FloatTensor` of shape `(batch_size, num_heads, sequence_length, sequence_length)`
Use some of the test examples to visualise the model attentions. You might want to visualise the attention values for just the last layer of BERT, or you can create a grid containing the attention scores for all 12 BERT layers. In order to visualise the attention values, you can reuse the code provided with the Neural Machine Translation predictor in Lab 5 and adapt it for BERT.
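As a starting point, here is a minimal matplotlib sketch (the function and variable names are ours) that plots one head of the last layer, given the list of attention tensors added to the output dictionary and the corresponding word pieces:

```python
import matplotlib.pyplot as plt


def plot_last_layer_attention(question_passage_attentions, wordpieces, head=0):
    """Heat map of one attention head from the last BERT layer (first batch element)."""
    weights = question_passage_attentions[-1][0, head].detach().cpu().numpy()
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.imshow(weights, cmap="viridis")
    ax.set_xticks(range(len(wordpieces)))
    ax.set_yticks(range(len(wordpieces)))
    ax.set_xticklabels(wordpieces, rotation=90, fontsize=6)
    ax.set_yticklabels(wordpieces, fontsize=6)
    ax.set_xlabel("attended-to word piece")
    ax.set_ylabel("attending word piece")
    plt.tight_layout()
    plt.show()
```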