To do:
- create a section with the code to reproduce the MRR scores
- train with other Hugging Face models such as Electra
- continue the training with more training pairs
Code for training a re-ranking model on the MS MARCO dataset using the Hugging Face library.
Table of contents |
---|
Task |
My model |
Results |
Getting started |
Description from the official repository of the task:
Given a query q and the 1000 most relevant passages P = p1, p2, p3, ..., p1000, as retrieved by BM25, a successful system is expected to rerank the most relevant passage as high as possible. For this task, not all 1000 relevant items have a human-labeled relevant passage. Evaluation will be done using MRR.
All the data are available on the official repository of the task.
The leaderboard is available on this link.
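The MRR metric named in the task description can be sketched as follows. This is a minimal illustration, not the official evaluation script: the function name `mrr_at_k`, the dictionary layout, and the choice to average over the queries present in `rankings` are assumptions made for the example.

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean reciprocal rank truncated at depth k.

    rankings: dict mapping query id -> list of passage ids, best first
    relevant: dict mapping query id -> set of relevant passage ids
    (both layouts are assumptions made for this sketch)
    """
    total = 0.0
    for qid, ranked in rankings.items():
        # Only the first relevant passage within the top k counts.
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in relevant.get(qid, set()):
                total += 1.0 / rank
                break
    return total / len(rankings) if rankings else 0.0


# Toy data: q1 finds its relevant passage at rank 2, q2 at rank 1, q3 never.
rankings = {"q1": ["p3", "p1", "p7"], "q2": ["p9", "p4"], "q3": ["p5", "p2"]}
relevant = {"q1": {"p1"}, "q2": {"p9"}, "q3": {"p8"}}
score = mrr_at_k(rankings, relevant)
print(round(score, 3))  # (1/2 + 1 + 0) / 3 = 0.5
```

A query whose relevant passage falls outside the top k contributes 0, so the score depends on both the ranking quality and the cutoff.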
The model is composed of a Hugging Face transformer model, a dense layer and a classification layer. The idea is to compare several transformer models on the re-ranking task.
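To make the dense layer plus classification layer concrete, here is a framework-agnostic NumPy sketch of a forward pass on top of pooled transformer outputs. It is not the repository's Scorer class: the function name `relevance_head`, the ReLU activation, the layer sizes, and the random weights are all assumptions chosen for illustration.

```python
import numpy as np


def relevance_head(pooled, w_dense, b_dense, w_cls, b_cls):
    """Dense layer followed by a classification layer (hypothetical sketch).

    pooled: (batch, hidden) array, one embedding per query-passage pair
    returns: (batch,) probability assigned to the "relevant" class
    """
    hidden = np.maximum(pooled @ w_dense + b_dense, 0.0)  # dense layer, ReLU assumed
    logits = hidden @ w_cls + b_cls                        # classification layer
    logits = logits - logits.max(axis=1, keepdims=True)    # stabilise the softmax
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return probs[:, 1]  # index 1 is taken to be the "relevant" class (assumption)


rng = np.random.default_rng(0)
hidden_size, dense_units, num_classes = 1024, 256, 2  # illustrative sizes
pooled = rng.normal(size=(3, hidden_size))             # stand-in for transformer output
w_dense = rng.normal(scale=0.02, size=(hidden_size, dense_units))
b_dense = np.zeros(dense_units)
w_cls = rng.normal(scale=0.02, size=(dense_units, num_classes))
b_cls = np.zeros(num_classes)

scores = relevance_head(pooled, w_dense, b_dense, w_cls, b_cls)
print(scores.shape)  # (3,)
```

Keeping the head separate from the encoder is what allows the same scoring layers to be placed on top of different transformer backbones for comparison.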
These are the results and model weights after training with the following parameters:
learning_rate = 3e-6
batch_size = 6
num_samples = 50000 (number of triples, corresponding to 100000 training query-passage pairs)
Training took around 6 hours on a Google Colab GPU.
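Each MS MARCO training triple holds a query, a relevant passage and a non-relevant passage, which is why 50000 triples yield 100000 labeled query-passage pairs. The sketch below shows that expansion on tab-separated input; the function name `triples_to_pairs` and the sample rows are hypothetical, and the repository's own data loader may work differently.

```python
import csv
import io


def triples_to_pairs(lines, num_samples):
    """Yield (query, passage, label) pairs from the first num_samples triples.

    lines: iterable of tab-separated rows "query<TAB>positive<TAB>negative"
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for i, (query, positive, negative) in enumerate(reader):
        if i >= num_samples:
            break
        yield query, positive, 1  # relevant pair
        yield query, negative, 0  # non-relevant pair


# Two made-up triples standing in for a triples .tsv file.
sample = io.StringIO(
    "what is bm25\tBM25 is a ranking function.\tParis is in France.\n"
    "capital of france\tParis is the capital of France.\tBM25 is a ranking function.\n"
)
pairs = list(triples_to_pairs(sample, num_samples=2))
print(len(pairs))  # 4: two pairs per triple
```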
Model | bert-large-cased | albert-large-v2 | roberta-large |
---|---|---|---|
Classification accuracy | 0.90 | 0.91 | 0.93 |
MRR@10 | 0.278 | 0.291 | 0.303 |
Saved weights | .h5 file | .h5 file | .h5 file |
The MRR scores obtained are below those of the leaderboard, which can be explained by the low number of training steps. My result for bert-large-cased is consistent with the evolution of the MRR score as a function of the number of training pairs reported in this paper (see figure below). We can therefore assume that with more training steps, my final MRR score for bert-large-cased would match the one obtained in this paper (paper github). In addition, roberta-large seems to give better results, so it would be interesting to continue its training with more training pairs.
python3 -m venv msmarco_env
source msmarco_env/bin/activate
pip install -r requirements.txt
You can download preprocessed data for MRR evaluation and for training on a small part of the dataset here. Then put the files into the corresponding folders in data/
To get the full dataset, refer to the official repository.
To reproduce the data preprocessing, refer to this repository. It contains most of the necessary steps.
python train.py --model_name 'roberta-large' \
--train_path "data/train/triples.train.tiny.tsv" \
--batch_size 6 \
--num_samples 50000 \
--learning_rate 3e-6 \
--n_queries_to_evaluate 1000
To see the full usage of train.py, run python train.py --help.
Download the .h5 file (link above) for the model you want to use, then put it into the model/saved_weights/ folder.
Here is a usage example with roberta-large:
''' import '''
import tensorflow as tf
import numpy as np
from transformers import TFAutoModel, AutoTokenizer
from model.scorer import Scorer
''' parameters '''
model_name = 'roberta-large'
max_length = 256
num_classes = 2
weights_path = 'model/saved_weights/model_roberta-large_mrr_0.303.h5'
''' load model '''
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = Scorer(tokenizer, TFAutoModel, max_length, num_classes)
model.from_pretrained(model_name) # need to optimize this step by loading config instead of weights
model(tf.zeros([1, 3, max_length], tf.int32)) # dummy forward pass to build the model variables before loading the weights
model.load_weights(weights_path)
model.compile(run_eagerly=True)
''' score passages '''
query = 'query'
passages = ['relevant passage', 'non-relevant passage']
scores = model.score_query_passages(query, passages, 2)
print(scores)
# [1, 0]
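Once every passage has a score, re-ranking is a sort. The sketch below assumes one numeric score per passage with higher meaning more relevant, which the example above does not state explicitly; `rank_passages` is a hypothetical helper and the score values are made up.

```python
def rank_passages(passages, scores, top_k=None):
    """Return (passage, score) tuples sorted from most to least relevant.

    Assumes scores[i] belongs to passages[i] and that higher is better.
    """
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    return [(passages[i], scores[i]) for i in order]


candidates = ["non-relevant passage", "relevant passage"]
example_scores = [0.12, 0.87]  # made-up values, not model output
ranked = rank_passages(candidates, example_scores)
for passage, score in ranked:
    print(f"{score:.2f}  {passage}")
```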