# Pyserini: SLIM on MS MARCO V1 Passage Ranking

This guide describes how to reproduce the SLIM experiments in the following paper:

> Minghan Li, Sheng-Chieh Lin, Xueguang Ma, and Jimmy Lin. [SLIM: Sparsified Late Interaction for Multi-Vector Retrieval with Inverted Indexes.](https://arxiv.org/abs/2302.06587) arXiv:2302.06587, 2023.

The training code is provided here.

Due to a naming conflict with the "slim" version of Lucene indexes, we use `slimr` to denote our model, which stands for "slim retrieval".

To reproduce the non-distilled version of SLIM, we run retrieval using the `castorini/slimr-msmarco-passage` model available on Hugging Face's model hub:

```bash
python -m pyserini.search.lucene \
  --index msmarco-v1-passage-slimr \
  --topics msmarco-passage-dev-subset \
  --encoder castorini/slimr-msmarco-passage \
  --encoded-corpus scipy-sparse-vectors.msmarco-v1-passage-slimr \
  --output runs/run.msmarco-passage.slimr.tsv \
  --output-format msmarco \
  --batch 36 --threads 12 \
  --hits 1000 \
  --impact --min-idf 3
```

Here, we are using the transformer model to encode the queries on the fly on the CPU. Note that the important option here is `--impact`, which specifies impact scoring. With these impact scores, query evaluation is already slower than bag-of-words BM25; on top of that, we're adding neural inference on the CPU.
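
To build intuition for what impact scoring does at query time, here is a minimal, self-contained sketch (not Pyserini's actual implementation): queries and documents are both sparse term-weight vectors, and a document's score is the sum, over shared terms, of query weight times document impact, accumulated via an inverted index. All term weights below are hypothetical.

```python
# A toy impact-scored inverted index; not Pyserini's code.
from collections import defaultdict

def build_impact_index(docs):
    """docs: {doc_id: {term: impact}} -> inverted index {term: [(doc_id, impact)]}."""
    index = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, impact in weights.items():
            index[term].append((doc_id, impact))
    return index

def impact_search(index, query_weights, k=10):
    """Score each document as sum of query_weight * doc_impact over shared terms."""
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        for doc_id, impact in index.get(term, []):
            scores[doc_id] += q_weight * impact
    return sorted(scores.items(), key=lambda x: -x[1])[:k]

index = build_impact_index({
    "d1": {"slim": 3.0, "retrieval": 2.0},
    "d2": {"retrieval": 4.0, "bm25": 1.0},
})
print(impact_search(index, {"slim": 2.0, "retrieval": 1.0}))
# d1 scores 2*3 + 1*2 = 8.0; d2 scores 1*4 = 4.0
```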

The output is in MS MARCO output format, so we can directly evaluate:

```bash
$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.slimr.tsv

#####################
MRR @10: 0.3581149656615276
QueriesRanked: 6980
#####################
```
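
For reference, the MRR@10 figure reported above is just the mean reciprocal rank of the first relevant passage within each query's top 10 results. Here is a minimal sketch of the metric itself (not the `msmarco_passage_eval` script):

```python
# Mean Reciprocal Rank at cutoff 10; toy data, not the eval script.
def mrr_at_10(run, qrels):
    """run: {qid: [docid, ...] in rank order}; qrels: {qid: set of relevant docids}."""
    total = 0.0
    for qid, ranking in run.items():
        for rank, docid in enumerate(ranking[:10], start=1):
            if docid in qrels.get(qid, set()):
                total += 1.0 / rank  # reciprocal rank of first relevant hit
                break                # queries with no hit in top 10 contribute 0
    return total / len(run)

# First query hits at rank 2, second query has no relevant hit: (0.5 + 0.0) / 2 = 0.25
print(mrr_at_10({"q1": ["d9", "d3"], "q2": ["d7"]},
                {"q1": {"d3"}, "q2": {"d1"}}))
```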

For the distilled version, we follow a similar retrieval and evaluation procedure:

## Retrieval

```bash
python -m pyserini.search.lucene \
  --index msmarco-v1-passage-slimr-pp \
  --topics msmarco-passage-dev-subset \
  --encoder castorini/slimr-pp-msmarco-passage \
  --encoded-corpus scipy-sparse-vectors.msmarco-v1-passage-slimr-pp \
  --output runs/run.msmarco-passage.slimr-pp.tsv \
  --output-format msmarco \
  --batch 36 --threads 12 \
  --hits 1000 \
  --impact --min-idf 3
```

## Evaluation

```bash
$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.slimr-pp.tsv

#####################
MRR @10: 0.40315936689862253
QueriesRanked: 6974
#####################
```

The final QueriesRanked count is less than 6980 because the aggressive pruning from `--min-idf 3` completely prunes some queries' representations, so those queries return no ranked list. To avoid this, use a smaller `--min-idf` value, which, however, might increase search latency.
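
To illustrate how an IDF threshold can empty a query's representation, here is a toy sketch (not Pyserini's code, and the document frequencies and IDF formula below are hypothetical): tokens whose IDF falls below the threshold are dropped, so a query made up entirely of common tokens retrieves nothing.

```python
# Toy min-IDF pruning; illustrative only.
import math

def prune_by_idf(query_tokens, doc_freq, num_docs, min_idf):
    """Keep only tokens whose IDF meets the threshold."""
    kept = []
    for token in query_tokens:
        idf = math.log(num_docs / doc_freq.get(token, 1))
        if idf >= min_idf:
            kept.append(token)
    return kept

# Hypothetical document frequencies over a 1M-passage corpus.
doc_freq = {"the": 900_000, "of": 850_000, "slim": 120}
print(prune_by_idf(["the", "of"], doc_freq, 1_000_000, min_idf=3))    # [] -> no results for this query
print(prune_by_idf(["slim", "the"], doc_freq, 1_000_000, min_idf=3))  # ['slim']
```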

## Reproduction Log*