This is the PyTerrier plugin for the ANCE dense passage retriever.
This repository can be installed using pip:
pip install --upgrade git+https://github.com/terrierteam/pyterrier_ance.git
You will need FAISS (cpu or gpu) installed:
On Colab:
!pip install faiss-cpu
On Anaconda:
# CPU-only version
$ conda install -c pytorch faiss-cpu
# GPU(+CPU) version
$ conda install -c pytorch faiss-gpu
For ANCE, the CPU version is sufficient for small-scale usage.
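Before indexing, it can help to confirm that FAISS is importable in the current environment. The check below is a minimal sketch using only the standard library; the helper name `faiss_available` is hypothetical and not part of pyterrier_ance.

```python
import importlib.util


def faiss_available() -> bool:
    """Return True if a module named `faiss` can be found (faiss-cpu or faiss-gpu)."""
    return importlib.util.find_spec("faiss") is not None


if faiss_available():
    print("FAISS found")
else:
    print("FAISS missing: install faiss-cpu or faiss-gpu first")
```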
You will need a pre-trained ANCE checkpoint. There are several available from the ANCE repository.
The checkpoint files have since been deleted from the main ANCE repository. You can instead download an ANCE checkpoint from Google Drive (the link is provided in an issue on the main repository).
Then, indexing is as easy as instantiating the indexer, pointing it at the (unzipped) checkpoint and the directory in which you wish to create an index:
import pyterrier as pt
import pyterrier_ance
dataset = pt.get_dataset("irds:vaswani")
indexer = pyterrier_ance.ANCEIndexer("/path/to/checkpoint", "/path/to/anceindex")
indexer.index(dataset.get_corpus_iter())
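To index your own collection rather than a built-in dataset, you can pass the indexer an iterator of your own. The sketch below assumes the indexer expects dicts with `docno` and `text` keys, as `get_corpus_iter()` yields for Vaswani. The helper `corpus_iter` and the sample documents are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple


def corpus_iter(docs: Iterable[Tuple[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield one dict per document with 'docno' and 'text' keys (assumed format)."""
    for docno, text in docs:
        yield {"docno": str(docno), "text": text}


# Hypothetical sample documents, for illustration only.
my_docs = [("d1", "dense retrieval with ANCE"), ("d2", "sparse retrieval with BM25")]

# With an ANCEIndexer instance named `indexer`, indexing would then be:
# indexer.index(corpus_iter(my_docs))
```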
You can instantiate the retrieval transformer, again by specifying the checkpoint location and the index location:
anceretr = pyterrier_ance.ANCERetrieval("/path/to/checkpoint", "/path/to/anceindex")
Thereafter, you can use it in the normal PyTerrier way, for instance in an experiment:
pt.Experiment(
[anceretr],
dataset.get_topics(),
dataset.get_qrels(),
eval_metrics=["map"]
)
You can also use ANCE to score text (e.g., as a re-ranker) using ANCETextScorer.
ance_text_scorer = pyterrier_ance.ANCETextScorer("/path/to/checkpoint")
# You'll need to use this in a retrieval pipeline that includes the document text, e.g.:
# bm25 >> pt.text.get_text(dataset, 'text') >> ance_text_scorer
If your documents are longer than passages, you should split them into passages before indexing, and aggregate the passage scores (for example, with max passage) during retrieval:
# indexing
import pyterrier as pt
import pyterrier_ance
dataset = pt.get_dataset("irds:vaswani")
indexer = pt.text.sliding("text", prepend_attr=None) >> pyterrier_ance.ANCEIndexer("/path/to/checkpoint", "/path/to/anceindex")
indexer.index(dataset.get_corpus_iter())
# retrieval
ance_maxp = pyterrier_ance.ANCERetrieval("/path/to/checkpoint", "/path/to/anceindex") >> pt.text.max_passage()
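To illustrate what passaging and max-passage aggregation do conceptually, here is a plain-Python sketch. It is not PyTerrier's implementation: the function names are hypothetical, and the window length, stride and the `%p` passage-id suffix are assumptions chosen to resemble the behaviour of `pt.text.sliding` and `pt.text.max_passage`.

```python
from typing import Dict, List, Tuple


def sliding_passages(docno: str, text: str, length: int = 150, stride: int = 75) -> List[Tuple[str, str]]:
    """Split a document into overlapping word windows named '<docno>%p<i>'."""
    words = text.split()
    if len(words) <= length:
        return [(f"{docno}%p0", " ".join(words))]
    passages = []
    for i, start in enumerate(range(0, len(words), stride)):
        passages.append((f"{docno}%p{i}", " ".join(words[start:start + length])))
        if start + length >= len(words):
            break  # the last window already reaches the end of the document
    return passages


def max_passage(passage_scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Score each document by its best passage, then rank documents by that score."""
    best: Dict[str, float] = {}
    for pid, score in passage_scores.items():
        docno = pid.rsplit("%p", 1)[0]  # strip the passage suffix
        best[docno] = max(score, best.get(docno, float("-inf")))
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


print(sliding_passages("d1", "a b c d e f g", length=4, stride=2))
print(max_passage({"d1%p0": 0.2, "d1%p1": 0.9, "d2%p0": 0.5}))
```

Each document is represented by its highest-scoring passage, so a long document with a single highly relevant passage can still rank well.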
Check out the notebooks, which can also be run on Colab:
The Terrier data repository contains ANCE indices for several corpora, including Vaswani and MSMARCO Passage v1.
We use a forked copy of ANCE that makes it pip-installable and addresses other quibbles.
- [Xiong20]: Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, Arnold Overwijk. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. https://arxiv.org/pdf/2007.00808.pdf
- [Macdonald20]: Craig Macdonald, Nicola Tonellotto. Declarative Experimentation in Information Retrieval using PyTerrier. In Proceedings of ICTIR 2020. https://arxiv.org/abs/2007.14271
- Craig Macdonald, University of Glasgow
- Nicola Tonellotto, University of Pisa
- Sean MacAvaney, University of Glasgow
- Dany Haddad, University of Texas