This page contains instructions for running BM25 baselines on the MS MARCO document ranking task. Note that there is a separate MS MARCO passage ranking task.
We're going to use the repository's root directory as the working directory. First, we need to download and extract the MS MARCO document dataset:
mkdir collections/msmarco-doc
wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.trec.gz -P collections/msmarco-doc
# Alternative mirror:
# wget https://www.dropbox.com/s/w6caao3sfx9nluo/msmarco-docs.trec.gz -P collections/msmarco-doc
To confirm, msmarco-docs.trec.gz should have MD5 checksum of d4863e4f342982b51b9a8fc668b2d0c0.
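If no command-line MD5 tool is at hand, the check can be done in a few lines of Python. This is only a minimal sketch: the helper name md5_of_file is made up for illustration, and the path assumes the download location used above.

```python
import hashlib
import os

# Checksum quoted in the text above.
EXPECTED_MD5 = "d4863e4f342982b51b9a8fc668b2d0c0"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that the multi-gigabyte archive never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Assumed location: the directory passed to wget -P above.
    path = "collections/msmarco-doc/msmarco-docs.trec.gz"
    if os.path.exists(path):
        actual = md5_of_file(path)
        print("checksum OK" if actual == EXPECTED_MD5 else "MISMATCH: " + actual)
    else:
        print("file not found: " + path)
```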
There's no need to uncompress the file, as Anserini can directly index gzipped files. Build the index with the following command:
sh target/appassembler/bin/IndexCollection -threads 1 -collection CleanTrecCollection \
-generator DefaultLuceneDocumentGenerator -input collections/msmarco-doc \
-index indexes/msmarco-doc/lucene-index-msmarco -storePositions -storeDocvectors -storeRaw
On a modern desktop with an SSD, indexing takes around 40 minutes. There should be a total of 3,213,835 documents indexed.
After indexing finishes, we can do a retrieval run. The dev queries are already stored in our repo:
target/appassembler/bin/SearchCollection -topicreader TsvInt \
-index indexes/msmarco-doc/lucene-index-msmarco \
-topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
-output runs/run.msmarco-doc.dev.bm25.txt -bm25
On a modern desktop with an SSD, the run takes around 12 minutes.
After the run completes, we can evaluate with trec_eval:
$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -mrecall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc.dev.bm25.txt
map all 0.2310
recall_1000 all 0.8856
Let's compare to the baselines provided by Microsoft. First, download:
wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docdev-top100.gz -P runs
gunzip runs/msmarco-docdev-top100.gz
Then, run trec_eval to compare.
Note that to be fair, we restrict evaluation to top 100 hits per topic (which is what Microsoft provides):
$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -M 100 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/msmarco-docdev-top100
map all 0.2219
$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -M 100 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc.dev.bm25.txt
map all 0.2303
We see that "out of the box" Anserini is already better (MAP of 0.2303 vs. 0.2219 over the top 100 hits)!
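To make the effect of the top-100 cutoff concrete, here is a minimal sketch of (mean) average precision with a per-topic cutoff. It is not trec_eval's code: the function names and the dict-based run/qrels layout are assumptions made for illustration. It averages over every topic in the qrels, on the understanding that this is what the -c flag asks for.

```python
def average_precision(ranked_docids, relevant, max_docs=None):
    """Average precision for one topic.

    ranked_docids: docids in rank order; relevant: set of relevant docids.
    max_docs: if given, only the top max_docs hits are evaluated.
    """
    if max_docs is not None:
        ranked_docids = ranked_docids[:max_docs]
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docid in enumerate(ranked_docids, start=1):
        if docid in relevant:
            hits += 1
            precision_sum += hits / rank
    # Normalize by all relevant documents, retrieved or not.
    return precision_sum / len(relevant)


def mean_average_precision(run, qrels, max_docs=None):
    """MAP over all topics in qrels; topics missing from the run score 0."""
    if not qrels:
        return 0.0
    total = sum(
        average_precision(run.get(qid, []), rel, max_docs)
        for qid, rel in qrels.items()
    )
    return total / len(qrels)
```

Truncating a run can only remove relevant hits, so MAP over the top 100 is never higher than MAP over the full 1000 for the same run, which matches the small drop from 0.2310 to 0.2303 seen above.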
This dataset is part of the MS MARCO Document Ranking Leaderboard. Let's try to replicate runs on there!
A few minor details to pay attention to: the official metric is MRR@100, so we want to only return the top 100 hits, and the submission files to the leaderboard have a slightly different format.
So, we use SearchMsmarco instead of SearchCollection:
sh target/appassembler/bin/SearchMsmarco -hits 100 -threads 1 \
-index indexes/msmarco-doc/lucene-index-msmarco/ \
-queries src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
-output runs/run.msmarco-doc.leaderboard-dev.bm25base.txt -k1 0.9 -b 0.4
The command above uses the default BM25 parameters (k1=0.9, b=0.4), and note we set -hits 100.
Command for evaluation:
$ python tools/scripts/msmarco/msmarco_doc_eval.py --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt --run runs/run.msmarco-doc.leaderboard-dev.bm25base.txt
#####################
MRR @100: 0.23005723505603573
QueriesRanked: 5193
#####################
The above run corresponds to "Anserini's BM25, default parameters (k1=0.9, b=0.4)" on the leaderboard.
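For readers unfamiliar with the metric, the following is a minimal sketch of how MRR@100 can be computed. It is not the official msmarco_doc_eval.py: the function name, the in-memory dict layout, and the choice to average over all judged queries are assumptions made for illustration.

```python
def mrr_at_k(run, qrels, k=100):
    """Mean reciprocal rank of the first relevant hit within the top k.

    run:   dict mapping query id -> list of docids in rank order
    qrels: dict mapping query id -> set of relevant docids
    Queries with no relevant hit in the top k contribute 0.
    """
    if not qrels:
        return 0.0
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, docid in enumerate(run.get(qid, [])[:k], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(qrels)
```

Because only the first relevant hit matters, returning more than 100 hits per query cannot change MRR@100, which is why -hits 100 is sufficient here.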
Here's the invocation for BM25 with parameters optimized for recall@100 (k1=4.46, b=0.82):
sh target/appassembler/bin/SearchMsmarco -hits 100 -threads 1 \
-index indexes/msmarco-doc/lucene-index-msmarco/ \
-queries src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
-output runs/run.msmarco-doc.leaderboard-dev.bm25tuned.txt -k1 4.46 -b 0.82
Command for evaluation:
$ python tools/scripts/msmarco/msmarco_doc_eval.py --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt --run runs/run.msmarco-doc.leaderboard-dev.bm25tuned.txt
#####################
MRR @100: 0.2770296928568702
QueriesRanked: 5193
#####################
More details on tuning BM25 parameters below...
It is well known that BM25 parameter tuning is important.
The setting of k1=0.9, b=0.4 is often used as a default.
Let's try to do better!
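As a reminder of what the two parameters control, here is a sketch of the textbook BM25 weight for a single query term. This is an illustration under assumptions: the function name and argument names are made up here, and Lucene's actual implementation (which Anserini relies on) may differ in details such as the idf variant and constant factors.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=0.9, b=0.4):
    """Textbook BM25 contribution of one query term to one document.

    k1 controls how quickly repeated occurrences of the term saturate;
    b controls how strongly long documents are penalized (0 = not at all).
    """
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


if __name__ == "__main__":
    # Hypothetical statistics, chosen only to show the trend.
    for k1, b in [(0.9, 0.4), (3.8, 0.87)]:
        short = bm25_term_score(5, 500, 1000, 1_000_000, 1000, k1, b)
        long_ = bm25_term_score(5, 4000, 1000, 1_000_000, 1000, k1, b)
        print(f"k1={k1} b={b}: short doc {short:.3f}, long doc {long_:.3f}")
```

Larger k1 lets term frequency keep contributing for longer before it saturates, and larger b penalizes long documents more strongly.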
We tuned BM25 using the queries found here: these are five different sets of 10k samples from the training queries (using the shuf command).
The basic approach is a grid search over parameter values in increments of 0.1.
We tuned on each individual set and then averaged parameter values across all five sets (this has the effect of regularization).
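The procedure just described can be sketched as follows. This is not the actual tuning script: the function names are hypothetical, the grid ranges are not stated in the text and must be chosen by the caller, and each evaluate callable stands in for "run retrieval on one query set with the given (k1, b) and return the target metric".

```python
import itertools


def grid(start, stop, step=0.1):
    """Values from start to stop (inclusive) in steps of 0.1 by default."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 1) for i in range(n + 1)]


def best_params(evaluate, k1_values, b_values):
    """Exhaustive grid search: the (k1, b) pair with the highest metric."""
    return max(itertools.product(k1_values, b_values),
               key=lambda p: evaluate(*p))


def tune(evaluators, k1_values, b_values):
    """Tune on each query set separately, then average the winners."""
    best = [best_params(ev, k1_values, b_values) for ev in evaluators]
    k1 = sum(p[0] for p in best) / len(best)
    b = sum(p[1] for p in best) / len(best)
    return round(k1, 2), round(b, 2)
```

Averaging explains why the final settings need not lie on the 0.1 grid: five per-set winners on the grid can average to a value such as 4.46.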
In separate trials, we optimized for:
- recall@100, since Anserini output serves as input to downstream rerankers (e.g., based on BERT), and we want to maximize the number of relevant documents the rerankers have to work with;
- MRR@10, for the case where Anserini output is directly presented to users (i.e., no downstream reranking).
It turns out that optimizing for MRR@10 and MAP yields the same settings.
Here's the comparison between different parameter settings:
Setting | MRR@100 | MAP | Recall@1000 |
---|---|---|---|
Default (k1=0.9, b=0.4) | 0.2301 | 0.2310 | 0.8856 |
Optimized for MRR@100/MAP (k1=3.8, b=0.87) | 0.2784 | 0.2789 | 0.9326 |
Optimized for recall@100 (k1=4.46, b=0.82) | 0.2770 | 0.2775 | 0.9357 |
As expected, BM25 tuning makes a big difference!
Note that MRR@100 is computed with the leaderboard eval script (with 100 hits per query), while the other two metrics are computed with trec_eval (with 1000 hits per query).
So, we need to use different search programs, for example:
$ target/appassembler/bin/SearchCollection -topicreader TsvInt \
-index indexes/msmarco-doc/lucene-index-msmarco \
-topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
-output runs/run.msmarco-doc.dev.opt-mrr.txt -bm25 -bm25.k1 3.8 -bm25.b 0.87
$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -mrecall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc.dev.opt-mrr.txt
map all 0.2789
recall_1000 all 0.9326
$ sh target/appassembler/bin/SearchMsmarco -hits 100 -threads 1 \
-index indexes/msmarco-doc/lucene-index-msmarco/ \
-queries src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
-output runs/run.msmarco-doc.leaderboard-dev.opt-mrr.txt -k1 3.8 -b 0.87
$ python tools/scripts/msmarco/msmarco_doc_eval.py --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt --run runs/run.msmarco-doc.leaderboard-dev.opt-mrr.txt
#####################
MRR @100: 0.27836767424339787
QueriesRanked: 5193
#####################
That's it!
- Results replicated by @edwinzhng on 2020-01-14 (commit 3964169)
- Results replicated by @nikhilro on 2020-01-21 (commit 631589e)
- Results replicated by @yuki617 on 2020-03-29 (commit 074723c)
- Results replicated by @HangCui0510 on 2020-04-23 (commit 0ae567d)
- Results replicated by @x65han on 2020-04-25 (commit f5496b9)
- Results replicated by @y276lin on 2020-04-26 (commit 8f48f8e)
- Results replicated by @stephaniewhoo on 2020-04-26 (commit 8f48f8e)
- Results replicated by @YimingDou on 2020-05-14 (commit 3b0a642)
- Results replicated by @richard3983 on 2020-05-14 (commit a65646f)
- Results replicated by @MXueguang on 2020-05-20 (commit 3b2751e)
- Results replicated by @shaneding on 2020-05-23 (commit b6e0367)
- Results replicated by @kelvin-jiang on 2020-05-24 (commit b6e0367)
- Results replicated by @adamyy on 2020-05-28 (commit a1ecfa4)
- Results replicated by @TianchengY on 2020-05-28 (commit 2947a16)
- Results replicated by @stariqmi on 2020-05-28 (commit 4914305)
- Results replicated by @justinborromeo on 2020-06-11 (commit 7954eab)
- Results replicated by @yxzhu16 on 2020-07-03 (commit 68ace26)
- Results replicated by @LizzyZhang-tutu on 2020-07-13 (commit 8c98d5b)
- Results replicated by @estella98 on 2020-08-05 (commit 99092a8)
- Results replicated by @tangsaidi on 2020-08-19 (commit aba846)
- Results replicated by @qguo96 on 2020-09-07 (commit e16b3c1)
- Results replicated by @yuxuan-ji on 2020-09-08 (commit 0f9a8ec)
- Results replicated by @wiltan-uw on 2020-09-09 (commit 93d913f)
- Results replicated by @JeffreyCA on 2020-09-13 (commit bc2628b)
- Results replicated by @jhuang265 on 2020-10-15 (commit 66711b9)
- Results replicated by @rayyang29 on 2020-10-27 (commit ad8cc5a)
- Results replicated by @Dahlia-Chehata on 2020-11-12 (commit 22c0ad3)
- Results replicated by @rakeeb123 on 2020-12-07 (commit f50dcce)
- Results replicated by @jrzhang12 on 2021-01-02 (commit be4e44d)
- Results replicated by @HEC2018 on 2021-01-04 (commit 4de21ec)
- Results replicated by @KaiSun314 on 2021-01-08 (commit 113f1c7)
- Results replicated by @yemiliey on 2021-01-18 (commit 179c242)
- Results replicated by @larryli1999 on 2021-01-22 (commit 3f9af5)