
Update uniCOIL + TILDE docs (#1645)
lintool authored Oct 9, 2021
1 parent f8b7cd9 commit fc2ddb0
Showing 4 changed files with 32 additions and 50 deletions.
4 changes: 2 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -90,8 +90,8 @@ For the most part, manual copying and pasting of commands into a shell is required
+ [Reproducing doc2query results](docs/experiments-doc2query.md) (MS MARCO passage ranking and TREC-CAR)
+ [Reproducing docTTTTTquery results](docs/experiments-docTTTTTquery.md) (MS MARCO passage and document ranking)
+ [Reproducing DeepImpact for the MS MARCO Passage Ranking Task](docs/experiments-msmarco-passage-deepimpact.md)
+ [Reproducing uniCOIL experiments with doc2query-T5 expansions for MS MARCO (V1)](docs/experiments-msmarco-unicoil.md)
+ [Reproducing uniCOIL experiments with TILDE expansions for MS MARCO (V1) Passage Ranking](docs/experiments-msmarco-passage-unicoil-tilde-expansion.md)
+ [Reproducing uniCOIL experiments with doc2query-T5 expansions for MS MARCO V1](docs/experiments-msmarco-unicoil.md)
+ [Reproducing uniCOIL experiments with TILDE expansions for MS MARCO V1 Passage Ranking](docs/experiments-msmarco-passage-unicoil-tilde-expansion.md)
+ [Reproducing BM25 baselines on the MS MARCO V2 Collections](docs/experiments-msmarco-v2.md)

### Other Experiments
58 changes: 20 additions & 38 deletions docs/experiments-msmarco-passage-unicoil-tilde-expansion.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Anserini: uniCOIL (w/ TILDE) for MS MARCO Passage Ranking
# Anserini: uniCOIL w/ TILDE for MS MARCO V1 Passage Ranking

This page describes how to reproduce experiments using uniCOIL with TILDE document expansion, as described in the following paper:
This page describes how to reproduce experiments using uniCOIL with TILDE document expansion on the MS MARCO passage corpus, as described in the following paper:

> Shengyao Zhuang and Guido Zuccon. [Fast Passage Re-ranking with Contextualized Exact Term
Matching and Efficient Passage Expansion.](https://arxiv.org/pdf/2108.08513) _arXiv:2108.08513_.
@@ -10,12 +10,12 @@ The original uniCOIL model is described here:
> Jimmy Lin and Xueguang Ma. [A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques.](https://arxiv.org/abs/2106.14807) _arXiv:2106.14807_.
In the original uniCOIL paper, doc2query-T5 is used to perform document expansion, which is slow and expensive.
As an alternative, Zhuang and Zuccon proposed to use the TILDE model to expand the corpus, resulting in a faster and cheaper document expansion process.
For details of how to use TILDE to expand documents, please see [this guide](https://github.com/ielab/TILDE).
As an alternative, Zhuang and Zuccon proposed to use the TILDE model to expand the documents instead, resulting in a faster and cheaper process that is just as effective.
For details of how to use TILDE to expand documents, please refer to the [TILDE repo](https://github.com/ielab/TILDE).
For additional details on the original uniCOIL design (with doc2query-T5 expansion), please refer to the [COIL repo](https://github.com/luyug/COIL/tree/main/uniCOIL).

In this guide, we start with a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
In this guide, we start with a version of the MS MARCO passage corpus that has already been processed with uniCOIL + TILDE, i.e., gone through document expansion and term re-weighting.
Thus, no neural inference is involved.
For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).

Note that Pyserini provides [a comparable reproduction guide](https://github.com/castorini/pyserini/blob/master/docs/experiments-unicoil-tilde-expansion.md), so if you don't like Java, you can get _exactly_ the same results from Python.

@@ -25,81 +25,63 @@ We're going to use the repository's root directory as the working directory.
First, we need to download and extract the MS MARCO passage dataset with uniCOIL processing:

```bash
# Alternate mirrors of the same data, pick one:
wget https://git.uwaterloo.ca/jimmylin/unicoil/-/raw/master/msmarco-passage-unicoil-tilde-expansion-b8.tar -P collections/

# Alternate mirror
wget https://vault.cs.uwaterloo.ca/s/6LECmLdiaBoPwrL/download -O collections/msmarco-passage-unicoil-tilde-expansion-b8.tar

tar -xvf collections/msmarco-passage-unicoil-tilde-expansion-b8.tar -C collections/
tar xvf collections/msmarco-passage-unicoil-tilde-expansion-b8.tar -C collections/
```

To confirm, `msmarco-passage-unicoil-tilde-expansion-b8.tar` should have MD5 checksum of `be0a786033140ebb7a984a3e155c19ae`.
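
A quick way to verify (a sketch; on macOS, use `md5` instead of `md5sum`):

```bash
# The printed hash should match the checksum given above.
md5sum collections/msmarco-passage-unicoil-tilde-expansion-b8.tar
```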


## Indexing

We can now index these docs as a `JsonVectorCollection` using Anserini:

```bash
sh target/appassembler/bin/IndexCollection -collection JsonVectorCollection \
-input collections/msmarco-passage-unicoil-tilde-expansion-b8/ \
-index indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion-b8 \
-index indexes/lucene-index.msmarco-passage.unicoil-tilde-expansion-b8 \
-generator DefaultLuceneDocumentGenerator -impact -pretokenized \
-threads 12 -storeRaw -optimize
-threads 12
```

The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 document lengths into Lucene's norms (which it does by default), and the second tells it not to apply any additional tokenization to the uniCOIL tokens.
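
To see why these flags are needed, it helps to peek at the processed input itself: each document is a JSON record whose terms are already tokenized and carry integer impact weights. A minimal sketch, assuming the extracted directory contains plain JSON-lines files (pipe through `zcat` first if they turn out to be gzipped):

```bash
# Pretty-print the first document of the first file in the extracted corpus.
first_file=$(ls collections/msmarco-passage-unicoil-tilde-expansion-b8/ | head -1)
head -n 1 "collections/msmarco-passage-unicoil-tilde-expansion-b8/${first_file}" | python -m json.tool
```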

Upon completion, we should have an index with 8,841,823 documents.
The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around ten minutes.

The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around 20 minutes.

## Retrieval

To ensure that the tokenization in the index aligns exactly with the queries, we use pre-tokenized queries.
The queries are already stored in the repo, so we can run retrieval directly:

```bash
target/appassembler/bin/SearchCollection -index indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion-b8 \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.unicoil.tilde.expansion.tsv.gz \
-output runs/run.msmarco-passage-unicoil-tilde-expansion-b8.trec \
target/appassembler/bin/SearchCollection -index indexes/lucene-index.msmarco-passage.unicoil-tilde-expansion-b8 \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv.gz \
-output runs/run.msmarco-passage.unicoil-tilde-expansion-b8.tsv -format msmarco \
-impact -pretokenized
```
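
To see what the pre-tokenized queries look like, you can peek at the file shipped in the repo (a sketch, assuming `zcat` and a POSIX shell):

```bash
# Each line is a query id and its pre-tokenized query terms, tab-separated.
zcat src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv.gz | head -3
```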

The queries are also available to download at the following locations:

```bash
wget https://git.uwaterloo.ca/jimmylin/unicoil/-/raw/master/topics.msmarco-passage.dev-subset.unicoil.tilde.expansion.tsv.gz -P collections/
wget https://vault.cs.uwaterloo.ca/s/QGoHeBm4YsAgt6H/download -O collections/topics.msmarco-passage.dev-subset.unicoil.tilde.expansion.tsv.gz

# MD5 checksum:
```

Query evaluation is much slower than with bag-of-words BM25; a complete run can take around 15 min.
Note that, mirroring the indexing options, we specify `-impact -pretokenized` here also.
Query evaluation is much slower than with bag-of-words BM25; a complete run takes around 30 minutes (on a single thread).
Note that, mirroring the indexing options, we also specify `-impact -pretokenized` here.

The output is in TREC output format.
Let's convert to MS MARCO output format and then evaluate:
With `-format msmarco`, runs are already in the MS MARCO output format, so we can evaluate directly:

```bash
python tools/scripts/msmarco/convert_trec_to_msmarco_run.py \
--input runs/run.msmarco-passage-unicoil-tilde-expansion-b8.trec \
--output runs/run.msmarco-passage-unicoil-tilde-expansion-b8.txt --quiet

python tools/scripts/msmarco/msmarco_passage_eval.py \
tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-unicoil-tilde-expansion-b8.txt
tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.unicoil-tilde-expansion-b8.tsv
```

The results should be as follows:

```
#####################
MRR @10: 0.34965502342293175
MRR @10: 0.34957184927457136
QueriesRanked: 6980
#####################
```

This corresponds to the effectiveness reported in the paper.


## Reproduction Log[*](reproducibility.md)
20 changes: 10 additions & 10 deletions docs/experiments-msmarco-unicoil.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Anserini: uniCOIL (w/ doc2query-T5) for MS MARCO (V1)
# Anserini: uniCOIL w/ doc2query-T5 for MS MARCO V1

This page describes how to reproduce the uniCOIL experiments in the following paper:

@@ -34,7 +34,7 @@ We can now index these docs as a `JsonVectorCollection` using Anserini:
```bash
sh target/appassembler/bin/IndexCollection -collection JsonVectorCollection \
-input collections/msmarco-passage-unicoil-b8/ \
-index indexes/lucene-index.msmarco-passage-unicoil-b8 \
-index indexes/lucene-index.msmarco-passage.unicoil-b8 \
-generator DefaultLuceneDocumentGenerator -impact -pretokenized \
-threads 12
```
@@ -50,20 +50,20 @@ To ensure that the tokenization in the index aligns exactly with the queries, we use pre-tokenized queries.
The queries are already stored in the repo, so we can run retrieval directly:

```bash
target/appassembler/bin/SearchCollection -index indexes/lucene-index.msmarco-passage-unicoil-b8 \
target/appassembler/bin/SearchCollection -index indexes/lucene-index.msmarco-passage.unicoil-b8 \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.unicoil.tsv.gz \
-output runs/run.msmarco-passage-unicoil-b8.tsv -format msmarco \
-output runs/run.msmarco-passage.unicoil-b8.tsv -format msmarco \
-impact -pretokenized
```

Query evaluation is much slower than with bag-of-words BM25; a complete run takes around 30 minutes (on a single thread).
Note that, mirroring the indexing options, we specify `-impact -pretokenized` here also.
Note that, mirroring the indexing options, we also specify `-impact -pretokenized` here.

With `-format msmarco`, runs are already in the MS MARCO output format, so we can evaluate directly:

```bash
python tools/scripts/msmarco/msmarco_passage_eval.py \
tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-unicoil-b8.txt
tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.unicoil-b8.tsv
```
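
If you want to inspect the raw run file, the MS MARCO output format is just a tab-separated list of query id, passage id, and rank (a sketch):

```bash
# Each line: query id, passage id, rank.
head -3 runs/run.msmarco-passage.unicoil-b8.tsv
```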

The results should be as follows:
@@ -101,7 +101,7 @@ We can now index these docs as a `JsonVectorCollection` using Anserini:
```bash
sh target/appassembler/bin/IndexCollection -collection JsonVectorCollection \
-input collections/msmarco-doc-per-passage-expansion-unicoil-d2q-b8/ \
-index indexes/lucene-index.msmarco-doc-per-passage-expansion-unicoil-d2q-b8 \
-index indexes/lucene-index.msmarco-doc-per-passage-expansion.unicoil-d2q-b8 \
-generator DefaultLuceneDocumentGenerator -impact -pretokenized \
-threads 12
```
@@ -116,9 +116,9 @@ The indexing speed may vary; on a modern desktop with an SSD (using 12 threads,
We can now run retrieval:

```bash
target/appassembler/bin/SearchCollection -index indexes/lucene-index.msmarco-doc-per-passage-expansion-unicoil-d2q-b8 \
target/appassembler/bin/SearchCollection -index indexes/lucene-index.msmarco-doc-per-passage-expansion.unicoil-d2q-b8 \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.unicoil.tsv.gz \
-output runs/run.msmarco-doc-unicoil-d2q-b8.tsv -format msmarco \
-output runs/run.msmarco-doc.unicoil-d2q-b8.tsv -format msmarco \
-hits 1000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 \
-impact -pretokenized
```
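
Here, the `-selectMaxPassage` options aggregate passage-level hits into document-level results: passage ids share their document id as a prefix (split on the `#` delimiter), each document is scored by its best passage, and the top 100 documents per query are kept. As a quick sanity check (a sketch), the final run should contain plain document ids with no `#` delimiter left:

```bash
# Expect 0: passage ids have been collapsed back to document ids.
grep -c '#' runs/run.msmarco-doc.unicoil-d2q-b8.tsv
```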
@@ -130,7 +130,7 @@ With `-format msmarco`, runs are already in the MS MARCO output format, so we can evaluate directly:

```bash
python tools/scripts/msmarco/msmarco_doc_eval.py --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \
--run runs/run.msmarco-doc-unicoil-d2q-b8.tsv
--run runs/run.msmarco-doc.unicoil-d2q-b8.tsv
```

The results should be as follows:
Binary file not shown.
