Update doc for MS MARCO uniCOIL experiments (#1191)
Bring pages into alignment with repro matrix and results presented in SIGIR 2022 paper.
lintool authored Jun 1, 2022
1 parent e641d4d commit b7bcf51
Showing 3 changed files with 96 additions and 157 deletions.
14 changes: 14 additions & 0 deletions README.md
The following guides provide step-by-step instructions:

+ Reproducing [uniCOIL + TCT-ColBERTv2 experiments on the MS MARCO V2 Collections](docs/experiments-msmarco-v2-hybrid.md)

### Available Corpora

| Corpora | Size | Checksum |
|:--------|-----:|:---------|
| [MS MARCO V1 passage: uniCOIL (noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil-noexp.tar) | 2.7 GB | `f17ddd8c7c00ff121c3c3b147d2e17d8` |
| [MS MARCO V1 passage: uniCOIL (d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil.tar) | 3.4 GB | `78eef752c78c8691f7d61600ceed306f` |
| [MS MARCO V1 doc: uniCOIL (noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil-noexp.tar) | 11 GB | `11b226e1cacd9c8ae0a660fd14cdd710` |
| [MS MARCO V1 doc: uniCOIL (d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil.tar) | 19 GB | `6a00e2c0c375cb1e52c83ae5ac377ebb` |
| [MS MARCO V2 passage: uniCOIL (noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_passage_unicoil_noexp_0shot.tar) | 24 GB | `d9cc1ed3049746e68a2c91bf90e5212d` |
| [MS MARCO V2 passage: uniCOIL (d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_passage_unicoil_0shot.tar) | 41 GB | `1949a00bfd5e1f1a230a04bbc1f01539` |
| [MS MARCO V2 doc: uniCOIL (noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_doc_segmented_unicoil_noexp_0shot_v2.tar) | 55 GB | `97ba262c497164de1054f357caea0c63` |
| [MS MARCO V2 doc: uniCOIL (d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_doc_segmented_unicoil_0shot_v2.tar) | 72 GB | `c5639748c2cbad0152e10b0ebde3b804` |


## FAQs

+ [How do I configure search?](docs/usage-interactive-search.md#how-do-i-configure-search) (Guide to Interactive Search)
172 changes: 53 additions & 119 deletions docs/experiments-msmarco-v2-unicoil.md
# Pyserini: uniCOIL w/ doc2query-T5 on MS MARCO V2

This guide describes how to reproduce retrieval experiments with the uniCOIL model on the MS MARCO V2 collections.
Details about our model can be found in the following paper:

> Jimmy Lin and Xueguang Ma. [A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques.](https://arxiv.org/abs/2106.14807) _arXiv:2106.14807_.
And further detailed in:

> Xueguang Ma, Ronak Pradeep, Rodrigo Nogueira, and Jimmy Lin. [Document Expansions and Learned Sparse Lexical Representations for MS MARCO V1 and V2.](https://cs.uwaterloo.ca/~jimmylin/publications/Ma_etal_SIGIR2022.pdf) _Proceedings of the 45th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022)_, July 2022.

Here, we start with versions of the MS MARCO V2 corpora that have already been processed with uniCOIL, i.e., we have applied model inference on every document and stored the output sparse vectors.
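Concretely, each corpus file is JSONL in Anserini's `JsonVectorCollection` format: one record per document mapping a document id to a sparse vector of term weights. A rough illustration follows; the document id, terms, and weights are invented for this sketch and do not come from the actual corpus:

```python
import json

# Hypothetical record in the uniCOIL-processed corpus (JsonVectorCollection):
# each JSONL line maps a document id to a sparse term -> impact-weight vector,
# where the weights are the quantized outputs of uniCOIL inference.
doc = {
    "id": "msmarco_passage_00_12345",  # invented id for illustration
    "vector": {"hotel": 112, "cheap": 87, "rome": 131, "stay": 45},
}

line = json.dumps(doc)  # one such record per line in the corpus files
print(line)
```

Because the term weighting is precomputed, indexing reduces to ingesting these vectors with `--impact --pretokenized`, as in the commands below.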

Quick links:

+ [Passage Ranking (No Expansion)](#passage-ranking-no-expansion)
+ [Passage Ranking (With doc2query-T5 Expansion)](#passage-ranking-with-doc2query-t5-expansion)
+ [Document Ranking (No Expansion)](#document-ranking-no-expansion)
+ [Document Ranking (With doc2query-T5 Expansion)](#document-ranking-with-doc2query-t5-expansion)

## Passage Ranking (No Expansion)

For the TREC 2021 Deep Learning Track, we did not have time to train a new uniCOIL model on V2 data, nor to finish doc2query-T5 expansions.
Thus, we applied uniCOIL without expansions in a zero-shot manner using the model trained on the MS MARCO V1 passage corpus.

To reproduce these runs directly from our pre-built indexes, see our [two-click reproduction matrix for MS MARCO V2 passage](https://castorini.github.io/pyserini/2cr/msmarco-v2-passage.html).
The passage ranking experiments here correspond to row (3a) for pre-encoded queries, and a corresponding condition for on-the-fly query inference.

To build the indexes from scratch, download the sparse representation of the corpus generated by uniCOIL:

```bash
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_passage_unicoil_noexp_0shot.tar -P collections/

tar -xvf collections/msmarco_v2_passage_unicoil_noexp_0shot.tar -C collections/
```

To confirm, `msmarco_v2_passage_unicoil_noexp_0shot.tar` is 24 GB and has an MD5 checksum of `d9cc1ed3049746e68a2c91bf90e5212d`.
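To verify the download yourself, compute the file's MD5 checksum and compare it against the published value. A small Python sketch, equivalent to the `md5sum` command-line tool:

```python
import hashlib

def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute an MD5 checksum by streaming the file in 1 MB chunks,
    so multi-GB tarballs are never loaded into memory at once."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the published checksum before indexing, e.g.:
#   md5sum("collections/msmarco_v2_passage_unicoil_noexp_0shot.tar")
# should equal "d9cc1ed3049746e68a2c91bf90e5212d"
```

The same check applies to every tarball on this page; the expected checksums are listed in the "Available Corpora" table in the README.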

To index the sparse vectors:

```bash
python -m pyserini.index.lucene \
--collection JsonVectorCollection \
--input collections/msmarco_v2_passage_unicoil_noexp_0shot/ \
--index indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot/ \
--generator DefaultLuceneDocumentGenerator \
--threads 32 \
--impact --pretokenized
```

To perform retrieval:

```bash
python -m pyserini.search.lucene \
```

To reproduce the Anserini results, use pre-encoded queries with `--topics msmarc

## Passage Ranking (With doc2query-T5 Expansion)

After the TREC 2021 Deep Learning Track submissions, we were able to complete doc2query-T5 expansions.

To reproduce these runs directly from our pre-built indexes, see our [two-click reproduction matrix for MS MARCO V2 passage](https://castorini.github.io/pyserini/2cr/msmarco-v2-passage.html).
The passage ranking experiments here correspond to row (3b) for pre-encoded queries, and a corresponding condition for on-the-fly query inference.

To build the indexes from scratch, download the sparse representation of the corpus generated by uniCOIL:

```bash
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_passage_unicoil_0shot.tar -P collections/

tar -xvf collections/msmarco_v2_passage_unicoil_0shot.tar -C collections/
```

To confirm, `msmarco_v2_passage_unicoil_0shot.tar` is 41 GB and has an MD5 checksum of `1949a00bfd5e1f1a230a04bbc1f01539`.

To index the sparse vectors:

```bash
python -m pyserini.index.lucene \
--collection JsonVectorCollection \
--input collections/msmarco_v2_passage_unicoil_0shot/ \
--index indexes/lucene-index.msmarco-v2-passage-unicoil-0shot/ \
--generator DefaultLuceneDocumentGenerator \
--threads 32 \
--impact --pretokenized
```

To perform retrieval:

```bash
python -m pyserini.search.lucene \
```

To reproduce the Anserini results, use pre-encoded queries with `--topics msmarc

## Document Ranking (No Expansion)

For the TREC 2021 Deep Learning Track, we did not have time to train a new uniCOIL model on V2 data, nor to finish doc2query-T5 expansions.
Thus, we applied uniCOIL without expansions in a zero-shot manner using the model trained on the MS MARCO V1 passage corpus.
When performing inference on the documents using the uniCOIL model here, we prepended the document title to provide context.
This is more effective than not prepending the title, a condition we also tried.

To reproduce these runs directly from our pre-built indexes, see our [two-click reproduction matrix for MS MARCO V2 doc](https://castorini.github.io/pyserini/2cr/msmarco-v2-doc.html).
The document ranking experiments here correspond to row (3a) for pre-encoded queries, and a corresponding condition for on-the-fly query inference.

To build the indexes from scratch, download the sparse representation of the corpus generated by uniCOIL:

```bash
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_doc_segmented_unicoil_noexp_0shot_v2.tar -P collections/

tar -xvf collections/msmarco_v2_doc_segmented_unicoil_noexp_0shot_v2.tar -C collections/
```

To confirm, `msmarco_v2_doc_segmented_unicoil_noexp_0shot_v2.tar` is 55 GB and has an MD5 checksum of `97ba262c497164de1054f357caea0c63`.

To index the sparse vectors:

```bash
python -m pyserini.index.lucene \
--collection JsonVectorCollection \
--input collections/msmarco_v2_doc_segmented_unicoil_noexp_0shot_v2/ \
--index indexes/lucene-index.msmarco-doc-v2-segmented-unicoil-noexp-0shot-v2/ \
--generator DefaultLuceneDocumentGenerator \
--threads 32 \
--impact --pretokenized
```

To perform retrieval:

```bash
python -m pyserini.search.lucene \
--index indexes/lucene-index.msmarco-doc-v2-segmented-unicoil-noexp-0shot-v2/ \
--topics msmarco-v2-doc-dev \
--encoder castorini/unicoil-noexp-msmarco-passage \
--output runs/run.msmarco-doc-v2-segmented-unicoil-noexp-0shot-v2.dev.txt \
--batch 144 --threads 36 \
--hits 10000 --max-passage --max-passage-hits 1000 \
--impact
```

For the document corpus, since we are searching the segmented version, we retrieve the top 10k _segments_ and perform MaxP to obtain the top 1000 _documents_.

To evaluate, using `trec_eval`:

```bash
$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-v2-doc-dev \
    runs/run.msmarco-doc-v2-segmented-unicoil-noexp-0shot-v2.dev.txt

Results:
map all 0.2206
recip_rank all 0.2232

$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-v2-doc-dev \
    runs/run.msmarco-doc-v2-segmented-unicoil-noexp-0shot-v2.dev.txt

Results:
recall_100 all 0.7460
recall_1000 all 0.8987
```
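In the retrieval command above, `--hits 10000 --max-passage --max-passage-hits 1000` retrieves segments and collapses them into documents via MaxP: a document is scored by its best-scoring segment. A minimal sketch of that aggregation, assuming segment ids of the form `docid#segment` (an assumption for illustration; Pyserini's actual id convention may differ):

```python
from collections import defaultdict

def maxp(segment_hits):
    """Collapse ranked segment hits into document hits via MaxP:
    each document's score is the max score over its segments."""
    best = defaultdict(float)
    for seg_id, score in segment_hits:
        doc_id = seg_id.split("#", 1)[0]  # strip the segment suffix
        best[doc_id] = max(best[doc_id], score)
    # Re-rank documents by their best segment score, descending.
    return sorted(best.items(), key=lambda kv: -kv[1])

hits = [("d1#0", 3.2), ("d1#1", 4.7), ("d2#0", 4.1)]
print(maxp(hits))  # [('d1', 4.7), ('d2', 4.1)]
```

Retrieving 10k segments before aggregation leaves enough distinct documents to fill the top 1000 after duplicates from the same document are collapsed.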

We evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics.
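The cutoff (the `-M 100` flag above) means only each query's top 100 hits count toward MAP and MRR. For intuition, reciprocal rank under a cutoff can be sketched as:

```python
def mrr_at_k(ranked_doc_ids, relevant, k=100):
    """Reciprocal rank of the first relevant document within the top k hits
    (0 if none appears); averaging this over queries gives MRR@k."""
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

print(mrr_at_k(["d3", "d7", "d1"], {"d7"}))  # 0.5
```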
To reproduce the Anserini results, use pre-encoded queries with `--topics msmarc

## Document Ranking (With doc2query-T5 Expansion)

After the TREC 2021 Deep Learning Track submissions, we were able to complete doc2query-T5 expansions.
When performing inference on the documents using the uniCOIL model here, we prepended the document title to provide context.
This is more effective than not prepending the title, a condition we also tried.

To reproduce these runs directly from our pre-built indexes, see our [two-click reproduction matrix for MS MARCO V2 doc](https://castorini.github.io/pyserini/2cr/msmarco-v2-doc.html).
The document ranking experiments here correspond to row (3b) for pre-encoded queries, and a corresponding condition for on-the-fly query inference.

To build the indexes from scratch, download the sparse representation of the corpus generated by uniCOIL:

```bash
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_doc_segmented_unicoil_0shot_v2.tar -P collections/

tar -xvf collections/msmarco_v2_doc_segmented_unicoil_0shot_v2.tar -C collections/
```

To confirm, `msmarco_v2_doc_segmented_unicoil_0shot_v2.tar` is 72 GB and has an MD5 checksum of `c5639748c2cbad0152e10b0ebde3b804`.

To index the sparse vectors:

```bash
python -m pyserini.index.lucene \
--collection JsonVectorCollection \
--input collections/msmarco_v2_doc_segmented_unicoil_0shot_v2/ \
--index indexes/lucene-index.msmarco-doc-v2-segmented-unicoil-0shot-v2/ \
--generator DefaultLuceneDocumentGenerator \
--threads 32 \
--impact --pretokenized
```

To perform retrieval:

```bash
python -m pyserini.search.lucene \
--index indexes/lucene-index.msmarco-doc-v2-segmented-unicoil-0shot-v2/ \
--topics msmarco-v2-doc-dev \
--encoder castorini/unicoil-msmarco-passage \
--output runs/run.msmarco-doc-v2-segmented-unicoil-0shot-v2.dev.txt \
--batch 144 --threads 36 \
--hits 10000 --max-passage --max-passage-hits 1000 \
--impact
```