Add more Rocchio conditions for MS MARCO v1 and v2 (#1921)
Additional changes:
+ Tweaks to experiments-msmarco-passage.md and experiments-msmarco-doc.md
+ Fixed (some) incorrect dates on when tuning was performed for MS MARCO v1/v2 doc/passage (and d2q-T5)
+ Added missing tuned2 conditions to dl19-doc
+ Added missing ax/bm25prf conditions to dl20-doc and msmarco-doc
+ Fixed bug in neg Rocchio condition on passage d2q (-rerankCutoff 1000)
lintool committed Jun 28, 2022
1 parent e90bb3c commit 8010d5c
Showing 43 changed files with 1,269 additions and 232 deletions.
125 changes: 75 additions & 50 deletions docs/experiments-msmarco-doc.md
@@ -3,7 +3,7 @@
This page contains instructions for running BM25 baselines on the [MS MARCO *document* ranking task](https://microsoft.github.io/msmarco/).
Note that there is a separate [MS MARCO *passage* ranking task](experiments-msmarco-passage.md).

**Setup Note:** If you're provisioning an Ubuntu VM on your own machine or in the cloud (e.g., AWS or GCP), allocate enough resources up front (RAM > 8 GB and > 100 GB of SSD storage), since tasks such as building the index can take a while; this saves you from having to reconfigure the machine repeatedly.
This exercise will require a machine with >8 GB RAM and at least 40 GB free disk space.

If you're a Waterloo undergraduate going through this guide as the [screening exercise](https://github.com/lintool/guide/blob/master/ura.md) for joining my research group, make sure you do the [passage ranking exercise](experiments-msmarco-passage.md) first.
Similarly, try to understand what you're actually doing, instead of simply [cargo culting](https://en.wikipedia.org/wiki/Cargo_cult_programming) (i.e., blindly copying and pasting commands into a shell).
@@ -13,7 +13,7 @@
We're going to use the repository's root directory as the working directory.
First, we need to download and extract the MS MARCO document dataset:

```bash
mkdir collections/msmarco-doc

wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.trec.gz -P collections/msmarco-doc
@@ -30,10 +30,14 @@ To confirm, `msmarco-docs.trec.gz` should have MD5 checksum of `d4863e4f342982b5...`
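
For example, here's a quick way to verify the checksum (assuming GNU coreutils' `md5sum` is available; on macOS, `md5` plays the same role):

```bash
# Compute the MD5 checksum of the downloaded collection and compare it
# against the expected value noted above.
md5sum collections/msmarco-doc/msmarco-docs.trec.gz
```
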
There's no need to uncompress the file, as Anserini can directly index gzipped files.
Build the index with the following command:

```bash
target/appassembler/bin/IndexCollection \
  -collection CleanTrecCollection \
  -input collections/msmarco-doc \
  -index indexes/msmarco-doc/lucene-index-msmarco \
  -generator DefaultLuceneDocumentGenerator \
  -threads 1 \
  -storePositions -storeDocvectors -storeRaw
```

On a modern desktop with an SSD, indexing takes around 40 minutes.
@@ -45,11 +49,14 @@ There should be a total of 3,213,835 documents indexed.
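
As a rough sanity check that the index was actually built (purely illustrative; the exact size depends on the Anserini version and the `-store*` options used), you can look at its footprint on disk:

```bash
# Report the total on-disk size of the index directory; with
# -storePositions, -storeDocvectors, and -storeRaw enabled it will be sizable.
du -sh indexes/msmarco-doc/lucene-index-msmarco
```
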
After indexing finishes, we can do a retrieval run.
The dev queries are already stored in our repo:

```bash
target/appassembler/bin/SearchCollection \
  -index indexes/msmarco-doc/lucene-index-msmarco \
  -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
  -topicreader TsvInt \
  -output runs/run.msmarco-doc.dev.bm25.txt \
  -parallelism 4 \
  -bm25 -hits 1000
```

Retrieval speed will vary by machine:
@@ -58,28 +65,31 @@ Adjust the parallelism by changing the `-parallelism` argument.

After the run completes, we can evaluate with `trec_eval`:

```bash
$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -mrecall.1000 \
    src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc.dev.bm25.txt
map all 0.2310
recall_1000 all 0.8856
```

Let's compare to the baselines provided by Microsoft.
First, download:

```bash
wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docdev-top100.gz -P runs
gunzip runs/msmarco-docdev-top100.gz
```

Then, run `trec_eval` to compare.
Note that to be fair, we restrict evaluation to top 100 hits per topic (which is what Microsoft provides):

```bash
$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -M 100 \
    src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/msmarco-docdev-top100
map all 0.2219

$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -M 100 \
    src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc.dev.bm25.txt
map all 0.2303
```

@@ -91,18 +101,22 @@ Let's try to reproduce runs there!
A few minor details to pay attention to: the official metric is MRR@100, so we want to only return the top 100 hits, and the submission files to the leaderboard have a slightly different format.

```bash
target/appassembler/bin/SearchCollection \
  -index indexes/msmarco-doc/lucene-index-msmarco \
  -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
  -topicreader TsvInt \
  -output runs/run.msmarco-doc.leaderboard-dev.bm25base.txt -format msmarco \
  -parallelism 4 \
  -bm25 -bm25.k1 0.9 -bm25.b 0.4 -hits 100
```

The command above uses the default BM25 parameters (`k1=0.9`, `b=0.4`), and note we set `-hits 100`.
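
To get a feel for the `-format msmarco` output, peek at the first few lines of the run file; the ids and ranks in the comment below are made up, but each line should hold a query id, a document id, and a rank, tab-separated:

```bash
# Inspect the head of the run in MS MARCO submission format.
head -3 runs/run.msmarco-doc.leaderboard-dev.bm25base.txt
# Illustrative output (fabricated ids, real runs will differ):
# 174249   D1234567   1
# 174249   D7654321   2
# 174249   D2345678   3
```
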
Command for evaluation:

```bash
$ python tools/scripts/msmarco/msmarco_doc_eval.py \
    --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \
    --run runs/run.msmarco-doc.leaderboard-dev.bm25base.txt
#####################
MRR @100: 0.23005723505603573
QueriesRanked: 5193
@@ -114,17 +128,21 @@ The above run corresponds to "Anserini's BM25, default parameters (k1=0.9, b=0.4)"...
Here's the invocation for BM25 with parameters optimized for recall@100 (`k1=4.46`, `b=0.82`):

```bash
target/appassembler/bin/SearchCollection \
  -index indexes/msmarco-doc/lucene-index-msmarco \
  -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
  -topicreader TsvInt \
  -output runs/run.msmarco-doc.leaderboard-dev.bm25tuned.txt -format msmarco \
  -parallelism 4 \
  -bm25 -bm25.k1 4.46 -bm25.b 0.82 -hits 100
```

Command for evaluation:

```bash
$ python tools/scripts/msmarco/msmarco_doc_eval.py \
    --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \
    --run runs/run.msmarco-doc.leaderboard-dev.bm25tuned.txt
#####################
MRR @100: 0.2770296928568702
QueriesRanked: 5193
@@ -139,7 +157,7 @@ It is well known that BM25 parameter tuning is important.
The setting of `k1=0.9`, `b=0.4` is often used as a default.

Let's try to do better!
We tuned BM25 using the queries found [here](https://github.com/castorini/anserini-data/tree/master/MSMARCO): these are five different sets of 10k samples from the training queries (using the `shuf` command).
The basic approach is grid search of parameter values in tenth increments.
We tuned on each individual set and then averaged parameter values across all five sets (this has the effect of regularization).
In separate trials, we optimized for:
@@ -151,35 +169,42 @@ It turns out that optimizing for MRR@10 and MAP yields the same settings.
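
The tuning scripts themselves are not part of this commit, but a minimal sketch of the sweep described above might look like the following; the sample-query and qrels file names are hypothetical placeholders, while the `SearchCollection` and `trec_eval` invocations mirror the commands used elsewhere on this page:

```bash
# Grid search over BM25 parameters in tenth increments on one 10k sample of
# training queries (file paths below are placeholders, not part of the repo).
for k1 in $(seq 0.6 0.1 5.0); do
  for b in $(seq 0.1 0.1 1.0); do
    run=runs/run.msmarco-doc.tune.k1_${k1}.b_${b}.txt
    target/appassembler/bin/SearchCollection \
      -index indexes/msmarco-doc/lucene-index-msmarco \
      -topics collections/msmarco-doc/queries.train.sample0.tsv \
      -topicreader TsvInt \
      -output ${run} \
      -parallelism 4 \
      -bm25 -bm25.k1 ${k1} -bm25.b ${b} -hits 1000
    # Record MAP for this (k1, b) pair; repeat per sample and average the
    # best settings across the five samples.
    tools/eval/trec_eval.9.0.4/trec_eval -c -mmap \
      collections/msmarco-doc/qrels.train.sample0.txt ${run} | grep "^map"
  done
done
```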

Here's the comparison between different parameter settings:

| Setting                                         | MRR@100 |    MAP | Recall@1000 |
|:------------------------------------------------|--------:|-------:|------------:|
| Default (`k1=0.9`, `b=0.4`)                      |  0.2301 | 0.2310 |      0.8856 |
| Optimized for MRR@100/MAP (`k1=3.8`, `b=0.87`)   |  0.2784 | 0.2789 |      0.9326 |
| Optimized for recall@100 (`k1=4.46`, `b=0.82`)   |  0.2770 | 0.2775 |      0.9357 |

As expected, BM25 tuning makes a big difference!

Note that MRR@100 is computed with the leaderboard eval script (with 100 hits per query), while the other two metrics are computed with `trec_eval` (with 1000 hits per query).
So, we need separate retrieval runs (and different evaluation programs), for example:

```bash
$ target/appassembler/bin/SearchCollection \
  -index indexes/msmarco-doc/lucene-index-msmarco \
  -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
  -topicreader TsvInt \
  -output runs/run.msmarco-doc.dev.opt-mrr.txt \
  -parallelism 4 \
  -bm25 -bm25.k1 3.8 -bm25.b 0.87 -hits 1000

$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -mrecall.1000 \
    src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc.dev.opt-mrr.txt
map all 0.2789
recall_1000 all 0.9326

$ target/appassembler/bin/SearchCollection \
  -index indexes/msmarco-doc/lucene-index-msmarco \
  -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
  -topicreader TsvInt \
  -output runs/run.msmarco-doc.leaderboard-dev.opt-mrr.txt -format msmarco \
  -parallelism 4 \
  -bm25 -bm25.k1 3.8 -bm25.b 0.87 -hits 100

$ python tools/scripts/msmarco/msmarco_doc_eval.py \
    --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \
    --run runs/run.msmarco-doc.leaderboard-dev.opt-mrr.txt
#####################
MRR @100: 0.27836767424339787
QueriesRanked: 5193
56 changes: 29 additions & 27 deletions docs/experiments-msmarco-passage.md
@@ -2,9 +2,8 @@

This page contains instructions for running BM25 baselines on the [MS MARCO *passage* ranking task](https://microsoft.github.io/msmarco/).
Note that there is a separate [MS MARCO *document* ranking task](experiments-msmarco-doc.md).
We also have a [separate page](experiments-doc2query.md) describing document expansion experiments (doc2query) for this task.

**Setup Note:** If you're provisioning an Ubuntu VM on your own machine or in the cloud (e.g., AWS or GCP) for this task, allocate enough resources up front (RAM > 6 GB and roughly 100 GB of SSD storage), since some of the steps take a while to finish; this saves you from having to reconfigure the machine repeatedly.
This exercise will require a machine with >8 GB RAM and at least 15 GB free disk space.

If you're a Waterloo undergraduate going through this guide as the [screening exercise](https://github.com/lintool/guide/blob/master/ura.md) for joining my research group, try to understand what you're actually doing, instead of simply [cargo culting](https://en.wikipedia.org/wiki/Cargo_cult_programming) (i.e., blindly copying and pasting commands into a shell).
In particular, you'll want to pay attention to the "What's going on here?" sections.
@@ -58,8 +57,8 @@ Next, we need to convert the MS MARCO tsv collection into Anserini's jsonl files:

```bash
python tools/scripts/msmarco/convert_collection_to_jsonl.py \
  --collection-path collections/msmarco-passage/collection.tsv \
  --output-folder collections/msmarco-passage/collection_jsonl
```

The above script should generate 9 jsonl files in `collections/msmarco-passage/collection_jsonl`, each with 1M lines (except for the last one, which should have 841,823 lines).
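
Each line holds one passage as a JSON object with an `id` and a `contents` field. A quick spot-check (the shard name `docs00.json` is an assumption about how the conversion script names its output files):

```bash
# Count lines across the generated shards and look at the first record.
wc -l collections/msmarco-passage/collection_jsonl/*
head -1 collections/msmarco-passage/collection_jsonl/docs00.json
# Expected shape (passage text elided):
# {"id": "0", "contents": "..."}
```
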
@@ -70,9 +69,12 @@
We can now index these docs as a `JsonCollection` using Anserini:

```bash
target/appassembler/bin/IndexCollection \
  -collection JsonCollection \
  -input collections/msmarco-passage/collection_jsonl \
  -index indexes/msmarco-passage/lucene-index-msmarco \
  -generator DefaultLuceneDocumentGenerator \
  -threads 9 -storePositions -storeDocvectors -storeRaw
```

Upon completion, we should have an index with 8,841,823 documents.
@@ -85,9 +87,9 @@ Since the dev set contains too many queries (100k+), it would take a long time to retrieve...

```bash
python tools/scripts/msmarco/filter_queries.py \
  --qrels collections/msmarco-passage/qrels.dev.small.tsv \
  --queries collections/msmarco-passage/queries.dev.tsv \
  --output collections/msmarco-passage/queries.dev.small.tsv
```

The output queries file should contain 6980 lines.
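
A quick way to confirm the filtering worked (the file is tab-separated, one query id and query text per line):

```bash
# Should print 6980.
wc -l < collections/msmarco-passage/queries.dev.small.tsv
# Peek at a couple of the retained dev queries.
head -2 collections/msmarco-passage/queries.dev.small.tsv
```
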
@@ -119,11 +121,13 @@ These queries are taken from Bing search logs, so they're "realistic" web queries.
We can now perform a retrieval run using this smaller set of queries:
```bash
target/appassembler/bin/SearchCollection \
  -index indexes/msmarco-passage/lucene-index-msmarco \
  -topics collections/msmarco-passage/queries.dev.small.tsv \
  -topicreader TsvInt \
  -output runs/run.msmarco-passage.dev.small.tsv -format msmarco \
  -parallelism 4 \
  -bm25 -bm25.k1 0.82 -bm25.b 0.68 -hits 1000
```
The above command uses BM25 with tuned parameters `k1=0.82`, `b=0.68`.
@@ -244,19 +248,19 @@ For that we first need to convert runs and qrels files to the TREC format:

```bash
python tools/scripts/msmarco/convert_msmarco_to_trec_run.py \
  --input runs/run.msmarco-passage.dev.small.tsv \
  --output runs/run.msmarco-passage.dev.small.trec

python tools/scripts/msmarco/convert_msmarco_to_trec_qrels.py \
  --input collections/msmarco-passage/qrels.dev.small.tsv \
  --output collections/msmarco-passage/qrels.dev.small.trec
```

And run the `trec_eval` tool:

```bash
tools/eval/trec_eval.9.0.4/trec_eval -c -mrecall.1000 -mmap \
    collections/msmarco-passage/qrels.dev.small.trec runs/run.msmarco-passage.dev.small.trec
```

The output should be:
@@ -296,13 +300,11 @@ It turns out that optimizing for MRR@10 and MAP yields the same settings.

Here's the comparison between the Anserini default and optimized parameters:

| Setting                                          | MRR@10 |    MAP | Recall@1000 |
|:--------------------------------------------------|-------:|-------:|------------:|
| Default (`k1=0.9`, `b=0.4`)                        | 0.1840 | 0.1926 |      0.8526 |
| Optimized for recall@1000 (`k1=0.82`, `b=0.68`)    | 0.1874 | 0.1957 |      0.8573 |
| Optimized for MRR@10/MAP (`k1=0.60`, `b=0.62`)     | 0.1892 | 0.1972 |      0.8555 |

To reproduce these results, the `SearchMsmarco` class above takes `k1` and `b` parameters as command-line arguments, e.g., `-k1 0.60 -b 0.62` (note that the default setting is `k1=0.82` and `b=0.68`).

As mentioned above, the BM25 run with `k1=0.82`, `b=0.68` corresponds to the entry "BM25 (Lucene8, tuned)" dated 2019/06/26 on the [MS MARCO Passage Ranking Leaderboard](https://microsoft.github.io/msmarco/).
The BM25 run with default parameters `k1=0.9`, `b=0.4` roughly corresponds to the entry "BM25 (Anserini)" dated 2019/04/10 (but Anserini was using Lucene 7.6 at the time).