Pyserini: BM25 Baseline for MS MARCO Passage Retrieval

This guide contains instructions for running BM25 baselines on the MS MARCO passage ranking task, which is nearly identical to a similar guide in Anserini, except that everything is in Python here (no Java). Note that there is a separate guide for the MS MARCO document ranking task.

Setup Note: If you're provisioning an Ubuntu VM on your own hardware or in the cloud (e.g., AWS or GCP), allocate enough resources up front: more than 6 GB of RAM and roughly 70-100 GB of SSD storage is sufficient for this task. This avoids having to go back and reconfigure the machine repeatedly. Any configuration that works for Anserini on this task will also work with Pyserini.

Data Prep

This guide requires the development installation, since it relies on additional resources that are not shipped with the Python module; for the (more limited) runs that work directly with the Python module installed via pip, see this guide.

We're going to use collections/msmarco-passage/ as the working directory. First, we need to download and extract the MS MARCO passage dataset:

mkdir collections/msmarco-passage

wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P collections/msmarco-passage

# Alternative mirror:
# wget https://www.dropbox.com/s/9f54jg2f71ray3b/collectionandqueries.tar.gz -P collections/msmarco-passage

tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage

To confirm, collectionandqueries.tar.gz should have an MD5 checksum of 31644046b18952c1386cd4564ba2ae69.
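If you'd rather verify the checksum from Python instead of with a command-line tool, here's a minimal sketch using only the standard library:

import hashlib

def md5sum(path, chunk_size=1 << 20):
    # Compute the MD5 hex digest of a file, reading it in 1 MB chunks.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

print(md5sum('collections/msmarco-passage/collectionandqueries.tar.gz'))
# Expected: 31644046b18952c1386cd4564ba2ae69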

Next, we need to convert the MS MARCO tsv collection into Anserini's jsonl format (one JSON object per line):

python tools/scripts/msmarco/convert_collection_to_jsonl.py \
 --collection-path collections/msmarco-passage/collection.tsv \
 --output-folder collections/msmarco-passage/collection_jsonl

The above script should generate 9 jsonl files in collections/msmarco-passage/collection_jsonl, each with 1M lines (except for the last one, which should have 841,823 lines).
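To sanity-check the conversion, the sketch below counts the lines in each output file and peeks at the first document; each line should be a JSON object with "id" and "contents" fields:

import glob
import json

# List whatever files the conversion script wrote into the output folder.
files = sorted(glob.glob('collections/msmarco-passage/collection_jsonl/*'))
for path in files:
    with open(path) as f:
        print(path, sum(1 for _ in f))

# Peek at the first document of the first file.
with open(files[0]) as f:
    print(json.loads(f.readline()))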

We can now index these docs as a JsonCollection using Anserini:

python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator \
 -threads 9 -input collections/msmarco-passage/collection_jsonl \
 -index indexes/lucene-index-msmarco-passage -storePositions -storeDocvectors -storeRaw

Note that the indexing program simply dispatches command-line arguments to an underlying Java program, and so we use the Java single dash convention, e.g., -index and not --index.

Upon completion, we should have an index with 8,841,823 documents. The indexing speed may vary; on a modern desktop with an SSD, indexing takes a couple of minutes.
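As a quick check from Python, we can open the index with Pyserini's IndexReader and print its statistics. A minimal sketch (note that in newer Pyserini releases the import path is pyserini.index.lucene):

from pyserini.index import IndexReader

reader = IndexReader('indexes/lucene-index-msmarco-passage')
# stats() reports counts such as the number of documents and unique terms.
print(reader.stats())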

Performing Retrieval on the Dev Queries

The 6980 queries in the development set are already stored in the repo. Let's take a peek:

$ head tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt
1048585	what is paula deen's brother
2	 Androgen receptor define
524332	treating tension headaches without medication
1048642	what is paranoid sc
524447	treatment of varicose veins in legs
786674	what is prime rate in canada
1048876	who plays young dr mallard on ncis
1048917	what is operating system misconfiguration
786786	what is priority pass
524699	tricare service number
$ wc tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt
    6980   48335  290193 tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt

Each line contains a tab-delimited (query id, query) pair. Conveniently, Pyserini already knows how to load and iterate through these pairs. We can now perform retrieval using these queries:

python -m pyserini.search --topics msmarco-passage-dev-subset \
 --index indexes/lucene-index-msmarco-passage \
 --output runs/run.msmarco-passage.bm25tuned.txt \
 --bm25 --output-format msmarco --hits 1000 --k1 0.82 --b 0.68

Here, we set the BM25 parameters to k1=0.82, b=0.68 (tuned by grid search). The option --output-format msmarco says to generate output in the MS MARCO output format. The option --hits specifies the number of documents to return per query. Thus, the output file should have approximately 6980 × 1000 = 6.9M lines.

Retrieval speed will vary by hardware: on a reasonably modern CPU with an SSD, we might get around 13 qps (queries per second), so the entire run should finish in under ten minutes (using a single thread). We can perform multi-threaded retrieval with the --threads and --batch-size arguments; for example, with --threads 16 --batch-size 64 on a CPU with sufficient cores, the entire run finishes in a couple of minutes.
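The same retrieval can also be scripted directly with Pyserini's Python API. Below is a minimal single-threaded sketch (the output filename here is arbitrary; in newer Pyserini releases the searcher class is pyserini.search.lucene.LuceneSearcher):

from pyserini.search import SimpleSearcher, get_topics

searcher = SimpleSearcher('indexes/lucene-index-msmarco-passage')
searcher.set_bm25(k1=0.82, b=0.68)

topics = get_topics('msmarco-passage-dev-subset')
with open('runs/run.msmarco-passage.bm25tuned.api.txt', 'w') as out:
    for qid in topics:
        hits = searcher.search(topics[qid]['title'], k=1000)
        for rank, hit in enumerate(hits, start=1):
            # MS MARCO output format: qid <tab> docid <tab> rank
            out.write(f'{qid}\t{hit.docid}\t{rank}\n')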

After the run finishes, we can evaluate the results using the official MS MARCO evaluation script:

$ python tools/scripts/msmarco/msmarco_passage_eval.py \
   tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.bm25tuned.txt
#####################
MRR @10: 0.18741227770955546
QueriesRanked: 6980
#####################
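To make the metric concrete, here is a minimal sketch of the MRR@10 computation over the same two files (the official script additionally performs format and sanity checks that are omitted here):

from collections import defaultdict

# Relevant passage ids per query, from the qrels file (qid 0 docid 1).
qrels = defaultdict(set)
with open('tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt') as f:
    for line in f:
        qid, _, docid, _ = line.split()
        qrels[qid].add(docid)
num_queries = len(qrels)

# For each query, find the best rank (<= 10) at which a relevant passage appears.
best_rank = {}
with open('runs/run.msmarco-passage.bm25tuned.txt') as f:
    for line in f:
        qid, docid, rank = line.split()
        if int(rank) <= 10 and docid in qrels[qid]:
            best_rank[qid] = min(int(rank), best_rank.get(qid, 11))

# Queries with no relevant passage in the top 10 contribute 0 to the sum.
print('MRR @10:', sum(1.0 / r for r in best_rank.values()) / num_queries)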

We can also use the official TREC evaluation tool, trec_eval, to compute metrics other than MRR@10. For that, we first need to convert the run file (and the qrels) into TREC format:

$ python -m pyserini.eval.convert_msmarco_run_to_trec_run \
   --input runs/run.msmarco-passage.bm25tuned.txt --output runs/run.msmarco-passage.bm25tuned.trec
$ python tools/scripts/msmarco/convert_msmarco_to_trec_qrels.py \
   --input tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt --output collections/msmarco-passage/qrels.dev.small.trec
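For reference, the run conversion simply rewrites each (qid, docid, rank) line into the six-column TREC run format, filling in a placeholder score since the MS MARCO format carries none. A minimal sketch of the idea (the tag in the last column is arbitrary):

# MS MARCO format: qid <tab> docid <tab> rank
# TREC format:     qid Q0 docid rank score tag
with open('runs/run.msmarco-passage.bm25tuned.txt') as f_in, \
     open('runs/run.msmarco-passage.bm25tuned.trec', 'w') as f_out:
    for line in f_in:
        qid, docid, rank = line.split()
        f_out.write(f'{qid} Q0 {docid} {rank} {1.0 / int(rank)} pyserini\n')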

And then run the trec_eval tool:

$ tools/eval/trec_eval.9.0.4/trec_eval -c -mrecall.1000 -mmap \
   collections/msmarco-passage/qrels.dev.small.trec runs/run.msmarco-passage.bm25tuned.trec
map                   	all	0.1957
recall_1000           	all	0.8573

Average precision or AP (also called mean average precision, MAP) and recall@1000 (recall at rank 1000) are the two metrics we care about the most. AP captures aspects of both precision and recall in a single value and is the metric most commonly used by information retrieval researchers. Recall@1000, on the other hand, provides an upper bound on the effectiveness of downstream reranking modules (i.e., a reranker cannot help if there isn't a relevant document in the candidate list).
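To make these definitions concrete, here is a toy sketch of how AP and recall@1000 are computed for a single ranked list (trec_eval then averages these values over all queries); the docids and judgments below are made up for illustration:

def average_precision(ranking, relevant, k=1000):
    # Sum of the precision values at each rank where a relevant doc is retrieved,
    # divided by the total number of relevant docs.
    hits, precisions = 0, []
    for rank, docid in enumerate(ranking[:k], start=1):
        if docid in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant)

def recall_at_k(ranking, relevant, k=1000):
    return sum(1 for docid in ranking[:k] if docid in relevant) / len(relevant)

# Toy example: the single relevant passage appears at rank 3.
ranking = ['d7', 'd2', 'd5', 'd9', 'd1']
relevant = {'d5'}
print(average_precision(ranking, relevant))  # 1/3 ≈ 0.333
print(recall_at_k(ranking, relevant))        # 1.0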

Reproduction Log*