Update doc for MS MARCO uniCOIL experiments (#1191)
Bring pages into alignment with repro matrix and results presented in SIGIR 2022 paper.
lintool authored Jun 1, 2022
1 parent e641d4d commit b7bcf51
Showing 3 changed files with 96 additions and 157 deletions.
14 changes: 14 additions & 0 deletions README.md
The following guides provide step-by-step instructions:

+ Reproducing [uniCOIL + TCT-ColBERTv2 experiments on the MS MARCO V2 Collections](docs/experiments-msmarco-v2-hybrid.md)

### Available Corpora

| Corpora | Size | Checksum |
|:--------|-----:|:---------|
| [MS MARCO V1 passage: uniCOIL (noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil-noexp.tar) | 2.7 GB | `f17ddd8c7c00ff121c3c3b147d2e17d8` |
| [MS MARCO V1 passage: uniCOIL (d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil.tar) | 3.4 GB | `78eef752c78c8691f7d61600ceed306f` |
| [MS MARCO V1 doc: uniCOIL (noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil-noexp.tar) | 11 GB | `11b226e1cacd9c8ae0a660fd14cdd710` |
| [MS MARCO V1 doc: uniCOIL (d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil.tar) | 19 GB | `6a00e2c0c375cb1e52c83ae5ac377ebb` |
| [MS MARCO V2 passage: uniCOIL (noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_passage_unicoil_noexp_0shot.tar) | 24 GB | `d9cc1ed3049746e68a2c91bf90e5212d` |
| [MS MARCO V2 passage: uniCOIL (d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_passage_unicoil_0shot.tar) | 41 GB | `1949a00bfd5e1f1a230a04bbc1f01539` |
| [MS MARCO V2 doc: uniCOIL (noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_doc_segmented_unicoil_noexp_0shot_v2.tar) | 55 GB | `97ba262c497164de1054f357caea0c63` |
| [MS MARCO V2 doc: uniCOIL (d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_doc_segmented_unicoil_0shot_v2.tar) | 72 GB | `c5639748c2cbad0152e10b0ebde3b804` |


## FAQs

+ [How do I configure search?](docs/usage-interactive-search.md#how-do-i-configure-search) (Guide to Interactive Search)
172 changes: 53 additions & 119 deletions docs/experiments-msmarco-v2-unicoil.md
# Pyserini: uniCOIL w/ doc2query-T5 on MS MARCO V2

This guide describes how to reproduce retrieval experiments with the uniCOIL model on the MS MARCO V2 collections.
Details about our model can be found in the following paper:

> Jimmy Lin and Xueguang Ma. [A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques.](https://arxiv.org/abs/2106.14807) _arXiv:2106.14807_.
And further detailed in:

> Xueguang Ma, Ronak Pradeep, Rodrigo Nogueira, and Jimmy Lin. [Document Expansions and Learned Sparse Lexical Representations for MS MARCO V1 and V2.](https://cs.uwaterloo.ca/~jimmylin/publications/Ma_etal_SIGIR2022.pdf) _Proceedings of the 45th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022)_, July 2022.

Here, we start with versions of the MS MARCO V2 corpora that have already been processed with uniCOIL, i.e., we have applied model inference on every document and stored the output sparse vectors.
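Concretely, each corpus file is JSONL in Anserini's `JsonVectorCollection` format: one record per document mapping a document id to a sparse vector of term weights. A rough illustration follows; the document id, terms, and weights are invented for this sketch and do not come from the actual corpus:

```python
import json

# Hypothetical record in the uniCOIL-processed corpus (JsonVectorCollection):
# each JSONL line maps a document id to a sparse term -> impact-weight vector,
# where the weights are the quantized outputs of uniCOIL inference.
doc = {
    "id": "msmarco_passage_00_12345",  # invented id for illustration
    "vector": {"hotel": 112, "cheap": 87, "rome": 131, "stay": 45},
}

line = json.dumps(doc)  # one such record per line in the corpus files
print(line)
```

Because the term weighting is precomputed, indexing reduces to ingesting these vectors with `--impact --pretokenized`, as in the commands below.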

Quick links:

+ [Passage Ranking (No Expansion)](#passage-ranking-no-expansion)
+ [Passage Ranking (With doc2query-T5 Expansion)](#passage-ranking-with-doc2query-t5-expansion)
+ [Document Ranking (No Expansion)](#document-ranking-no-expansion)
+ [Document Ranking (With doc2query-T5 Expansion)](#document-ranking-with-doc2query-t5-expansion)

## Passage Ranking (No Expansion)

For the TREC 2021 Deep Learning Track, we did not have time to train a new uniCOIL model on V2 data, nor to finish doc2query-T5 expansions.
Thus, we applied uniCOIL without expansions in a zero-shot manner using the model trained on the MS MARCO V1 passage corpus.

To reproduce these runs directly from our pre-built indexes, see our [two-click reproduction matrix for MS MARCO V2 passage](https://castorini.github.io/pyserini/2cr/msmarco-v2-passage.html).
The passage ranking experiments here correspond to row (3a) for pre-encoded queries, and a corresponding condition for on-the-fly query inference.

To build the indexes from scratch, download the sparse representation of the corpus generated by uniCOIL:

```bash
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_passage_unicoil_noexp_0shot.tar -P collections/

tar -xvf collections/msmarco_v2_passage_unicoil_noexp_0shot.tar -C collections/
```

To confirm, `msmarco_v2_passage_unicoil_noexp_0shot.tar` is 24 GB and has an MD5 checksum of `d9cc1ed3049746e68a2c91bf90e5212d`.
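To verify the download yourself, compute the file's MD5 checksum and compare it against the published value. A small Python sketch, equivalent to the `md5sum` command-line tool:

```python
import hashlib

def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute an MD5 checksum by streaming the file in 1 MB chunks,
    so multi-GB tarballs are never loaded into memory at once."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the published checksum before indexing, e.g.:
#   md5sum("collections/msmarco_v2_passage_unicoil_noexp_0shot.tar")
# should equal "d9cc1ed3049746e68a2c91bf90e5212d"
```

The same check applies to every tarball on this page; the expected checksums are listed in the "Available Corpora" table in the README.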

To index the sparse vectors:

```bash
python -m pyserini.index.lucene \
--collection JsonVectorCollection \
--input collections/msmarco_v2_passage_unicoil_noexp_0shot/ \
--index indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot/ \
--generator DefaultLuceneDocumentGenerator \
--threads 32 \
--impact --pretokenized
```

To perform retrieval:

```bash
python -m pyserini.search.lucene \
```

To reproduce the Anserini results, use pre-encoded queries with `--topics msmarc

## Passage Ranking (With doc2query-T5 Expansion)

After the TREC 2021 Deep Learning Track submissions, we were able to complete doc2query-T5 expansions.

To reproduce these runs directly from our pre-built indexes, see our [two-click reproduction matrix for MS MARCO V2 passage](https://castorini.github.io/pyserini/2cr/msmarco-v2-passage.html).
The passage ranking experiments here correspond to row (3b) for pre-encoded queries, and a corresponding condition for on-the-fly query inference.

To build the indexes from scratch, download the sparse representation of the corpus generated by uniCOIL:

```bash
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_passage_unicoil_0shot.tar -P collections/

tar -xvf collections/msmarco_v2_passage_unicoil_0shot.tar -C collections/
```

To confirm, `msmarco_v2_passage_unicoil_0shot.tar` is 41 GB and has an MD5 checksum of `1949a00bfd5e1f1a230a04bbc1f01539`.

To index the sparse vectors:

```bash
python -m pyserini.index.lucene \
--collection JsonVectorCollection \
--input collections/msmarco_v2_passage_unicoil_0shot/ \
--index indexes/lucene-index.msmarco-v2-passage-unicoil-0shot/ \
--generator DefaultLuceneDocumentGenerator \
--threads 32 \
--impact --pretokenized
```

To perform retrieval:

```bash
python -m pyserini.search.lucene \
```

To reproduce the Anserini results, use pre-encoded queries with `--topics msmarc

## Document Ranking (No Expansion)

For the TREC 2021 Deep Learning Track, we did not have time to train a new uniCOIL model on V2 data, nor to finish doc2query-T5 expansions.
Thus, we applied uniCOIL without expansions in a zero-shot manner using the model trained on the MS MARCO V1 passage corpus.
When performing inference on the documents using the uniCOIL model here, we prepended the document title to provide context.
This is more effective than not prepending the title, a condition we also tried.

To reproduce these runs directly from our pre-built indexes, see our [two-click reproduction matrix for MS MARCO V2 doc](https://castorini.github.io/pyserini/2cr/msmarco-v2-doc.html).
The document ranking experiments here correspond to row (3a) for pre-encoded queries, and a corresponding condition for on-the-fly query inference.

To build the indexes from scratch, download the sparse representation of the corpus generated by uniCOIL:

```bash
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_doc_segmented_unicoil_noexp_0shot_v2.tar -P collections/

tar -xvf collections/msmarco_v2_doc_segmented_unicoil_noexp_0shot_v2.tar -C collections/
```

To confirm, `msmarco_v2_doc_segmented_unicoil_noexp_0shot_v2.tar` is 55 GB and has an MD5 checksum of `97ba262c497164de1054f357caea0c63`.

To index the sparse vectors:

```bash
python -m pyserini.index.lucene \
--collection JsonVectorCollection \
--input collections/msmarco_v2_doc_segmented_unicoil_noexp_0shot_v2/ \
--index indexes/lucene-index.msmarco-doc-v2-segmented-unicoil-noexp-0shot-v2/ \
--generator DefaultLuceneDocumentGenerator \
--threads 32 \
--impact --pretokenized
```

To perform retrieval:

```bash
python -m pyserini.search.lucene \
--index indexes/lucene-index.msmarco-doc-v2-segmented-unicoil-noexp-0shot-v2/ \
--topics msmarco-v2-doc-dev \
--encoder castorini/unicoil-noexp-msmarco-passage \
--output runs/run.msmarco-doc-v2-segmented-unicoil-noexp-0shot-v2.dev.txt \
--batch 144 --threads 36 \
--hits 10000 --max-passage --max-passage-hits 1000 \
--impact
```

For the document corpus, since we are searching the segmented version, we retrieve the top 10k _segments_ and perform MaxP to obtain the top 1000 _documents_.

To evaluate, using `trec_eval`:

```bash
$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-v2-doc-dev \
    runs/run.msmarco-doc-v2-segmented-unicoil-noexp-0shot-v2.dev.txt

Results:
map all 0.2206
recip_rank all 0.2232

$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-v2-doc-dev \
    runs/run.msmarco-doc-v2-segmented-unicoil-noexp-0shot-v2.dev.txt

Results:
recall_100 all 0.7460
recall_1000 all 0.8987
```
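In the retrieval command above, `--hits 10000 --max-passage --max-passage-hits 1000` retrieves segments and collapses them into documents via MaxP: a document is scored by its best-scoring segment. A minimal sketch of that aggregation, assuming segment ids of the form `docid#segment` (an assumption for illustration; Pyserini's actual id convention may differ):

```python
from collections import defaultdict

def maxp(segment_hits):
    """Collapse ranked segment hits into document hits via MaxP:
    each document's score is the max score over its segments."""
    best = defaultdict(float)
    for seg_id, score in segment_hits:
        doc_id = seg_id.split("#", 1)[0]  # strip the segment suffix
        best[doc_id] = max(best[doc_id], score)
    # Re-rank documents by their best segment score, descending.
    return sorted(best.items(), key=lambda kv: -kv[1])

hits = [("d1#0", 3.2), ("d1#1", 4.7), ("d2#0", 4.1)]
print(maxp(hits))  # [('d1', 4.7), ('d2', 4.1)]
```

Retrieving 10k segments before aggregation leaves enough distinct documents to fill the top 1000 after duplicates from the same document are collapsed.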

We evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics.
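The cutoff (the `-M 100` flag above) means only each query's top 100 hits count toward MAP and MRR. For intuition, reciprocal rank under a cutoff can be sketched as:

```python
def mrr_at_k(ranked_doc_ids, relevant, k=100):
    """Reciprocal rank of the first relevant document within the top k hits
    (0 if none appears); averaging this over queries gives MRR@k."""
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

print(mrr_at_k(["d3", "d7", "d1"], {"d7"}))  # 0.5
```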
To reproduce the Anserini results, use pre-encoded queries with `--topics msmarc

## Document Ranking (With doc2query-T5 Expansion)

After the TREC 2021 Deep Learning Track submissions, we were able to complete doc2query-T5 expansions.
When performing inference on the documents using the uniCOIL model here, we prepended the document title to provide context.
This is more effective than not prepending the title, a condition we also tried.

To reproduce these runs directly from our pre-built indexes, see our [two-click reproduction matrix for MS MARCO V2 doc](https://castorini.github.io/pyserini/2cr/msmarco-v2-doc.html).
The document ranking experiments here correspond to row (3b) for pre-encoded queries, and a corresponding condition for on-the-fly query inference.

To build the indexes from scratch, download the sparse representation of the corpus generated by uniCOIL:

```bash
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_doc_segmented_unicoil_0shot_v2.tar -P collections/

tar -xvf collections/msmarco_v2_doc_segmented_unicoil_0shot_v2.tar -C collections/
```

To confirm, `msmarco_v2_doc_segmented_unicoil_0shot_v2.tar` is 72 GB and has an MD5 checksum of `c5639748c2cbad0152e10b0ebde3b804`.

To index the sparse vectors:

```bash
python -m pyserini.index.lucene \
--collection JsonVectorCollection \
--input collections/msmarco_v2_doc_segmented_unicoil_0shot_v2/ \
--index indexes/lucene-index.msmarco-doc-v2-segmented-unicoil-0shot-v2/ \
--generator DefaultLuceneDocumentGenerator \
--threads 32 \
--impact --pretokenized
```

To perform retrieval:

```bash
python -m pyserini.search.lucene \
--index indexes/lucene-index.msmarco-doc-v2-segmented-unicoil-0shot-v2/ \
--topics msmarco-v2-doc-dev \
--encoder castorini/unicoil-msmarco-passage \
--output runs/run.msmarco-doc-v2-segmented-unicoil-0shot-v2.dev.txt \
--batch 144 --threads 36 \
--hits 10000 --max-passage --max-passage-hits 1000 \
--impact
```