DeepImpact reproduction guide #1564

Merged (1 commit) on Jun 17, 2021
README.md (3 changes: 2 additions & 1 deletion)

@@ -106,9 +106,10 @@ For the most part, manual copying and pasting of commands into a shell is required

  + [Guide to BM25 baselines for the MS MARCO Passage Ranking Task](docs/experiments-msmarco-passage.md)
  + [Guide to BM25 baselines for the MS MARCO Document Ranking Task](docs/experiments-msmarco-doc.md)
- + [Guide to reproducing baselines MS MARCO Document Ranking Leaderboard](docs/experiments-msmarco-doc-leaderboard.md)
+ + [Guide to reproducing baselines for the MS MARCO Document Ranking Leaderboard](docs/experiments-msmarco-doc-leaderboard.md)
  + [Guide to reproducing doc2query results](docs/experiments-doc2query.md) (MS MARCO passage ranking and TREC-CAR)
  + [Guide to reproducing docTTTTTquery results](docs/experiments-docTTTTTquery.md) (MS MARCO passage and document ranking)
+ + [Guide to reproducing DeepImpact for the MS MARCO Passage Ranking Task](docs/experiments-msmarco-passage-deepimpact.md)

### Other Experiments

docs/experiments-msmarco-passage-deepimpact.md (88 changes: 88 additions & 0 deletions, new file)
@@ -0,0 +1,88 @@
# Anserini: DeepImpact for MS MARCO Passage Ranking

This page describes how to reproduce the DeepImpact experiments in the following paper:

> Antonio Mallia, Omar Khattab, Nicola Tonellotto, and Torsten Suel. [Learning Passage Impacts for Inverted Indexes.](https://arxiv.org/abs/2104.12016) _arXiv:2104.12016_.

Here, we start with a version of the MS MARCO passage corpus that has already been processed with DeepImpact, i.e., gone through document expansion and term reweighting.
Thus, no neural inference is involved.


## Data Prep

We're going to use the repository's root directory as the working directory.
First, we need to download and extract the MS MARCO passage dataset with DeepImpact processing:

```bash
wget https://git.uwaterloo.ca/jimmylin/deep-impact/raw/master/msmarco-passage-deepimpact-b8.tar.gz -P collections/
tar -xzvf collections/msmarco-passage-deepimpact-b8.tar.gz -C collections/
```

To confirm, `msmarco-passage-deepimpact-b8.tar.gz` should have an MD5 checksum of `8ea0ebdd707d5853a87940e5bdfd9b00`.
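
As a quick sanity check, a sketch assuming GNU coreutils' `md5sum` (on macOS, `md5 -q` plays the same role):

```bash
# Verify the integrity of the downloaded tarball.
md5sum collections/msmarco-passage-deepimpact-b8.tar.gz
# Expected output:
# 8ea0ebdd707d5853a87940e5bdfd9b00  collections/msmarco-passage-deepimpact-b8.tar.gz
```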


## Indexing

We can now index these docs as a `JsonVectorCollection` using Anserini:

```bash
sh target/appassembler/bin/IndexCollection -collection JsonVectorCollection \
-input collections/msmarco-passage-deepimpact-b8/ \
-index indexes/lucene-index.msmarco-passage-deepimpact-b8 \
-generator DefaultLuceneDocumentGenerator -impact -pretokenized \
-threads 18 -storeRaw
```

The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 document lengths into Lucene's norms (the default behavior), and the second tells it not to apply any additional tokenization to the DeepImpact tokens.
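
For reference, each document in a `JsonVectorCollection` is a JSON object whose `vector` field maps each term to its (here, quantized) impact weight. To peek at the raw collection files without assuming their exact names, a minimal sketch:

```bash
# List the extracted files, then dump the first few hundred bytes of one of them.
# Expect JSON with an "id" and a "vector" of term -> integer weight pairs.
# (If the files turn out to be gzipped, pipe through zcat instead.)
ls collections/msmarco-passage-deepimpact-b8/
head -c 500 "$(find collections/msmarco-passage-deepimpact-b8/ -type f | head -n 1)"
```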

Upon completion, we should have an index with 8,841,823 documents.
The indexing speed may vary; on a modern desktop with an SSD (using 18 threads, per above), indexing takes around ten minutes.
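
As a rough plausibility check after indexing (exact sizes will vary with the Lucene version and indexing settings):

```bash
# Confirm the index directory was created and note its on-disk size.
du -sh indexes/lucene-index.msmarco-passage-deepimpact-b8
```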


## Retrieval

To ensure that the tokenization in the index aligns exactly with the queries, we use pre-tokenized queries.
First, fetch the MS MARCO passage ranking dev set queries:

```bash
wget https://git.uwaterloo.ca/jimmylin/deep-impact/raw/master/topics.msmarco-passage.dev-subset.deep-impact.tsv -P collections/
```
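
Peeking at the first few lines shows the expected format, one query per line as `qid<TAB>pre-tokenized terms` (this is what the `TsvInt` topic reader consumes below):

```bash
# Inspect the pre-tokenized queries; terms are space-separated DeepImpact tokens.
head -n 3 collections/topics.msmarco-passage.dev-subset.deep-impact.tsv
```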

We can now run retrieval:

```bash
target/appassembler/bin/SearchCollection -index indexes/lucene-index.msmarco-passage-deepimpact-b8 \
-topicreader TsvInt -topics collections/topics.msmarco-passage.dev-subset.deep-impact.tsv \
-output runs/run.msmarco-passage-deepimpact-b8.trec \
-impact -pretokenized
```

Query evaluation is much slower than with bag-of-words BM25; a complete run can take around half an hour.
Note that, mirroring the indexing options, we specify `-impact -pretokenized` here also.
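
Before converting, it is worth a quick sanity check on the raw run; a minimal sketch (the line count assumes the default of 1,000 hits per query):

```bash
# Each line of a TREC run file is: qid Q0 docid rank score tag.
head -n 3 runs/run.msmarco-passage-deepimpact-b8.trec

# With 6,980 dev queries and up to 1,000 hits each, expect roughly 6.98M lines.
wc -l runs/run.msmarco-passage-deepimpact-b8.trec
```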

The output is in standard TREC run format.
Let's convert it to MS MARCO output format and then evaluate (the qrels file referenced below ships with the MS MARCO passage dataset; see the BM25 baselines guide above if `collections/msmarco-passage/` has not been populated yet):

```bash
python tools/scripts/msmarco/convert_trec_to_msmarco_run.py \
--input runs/run.msmarco-passage-deepimpact-b8.trec \
--output runs/run.msmarco-passage-deepimpact-b8.txt --quiet
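
# (Optional sanity check, a sketch: the converted run uses the simpler
# three-column MS MARCO format, qid<TAB>docid<TAB>rank.)
head -n 3 runs/run.msmarco-passage-deepimpact-b8.txt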

python tools/scripts/msmarco/msmarco_passage_eval.py \
collections/msmarco-passage/qrels.dev.small.tsv runs/run.msmarco-passage-deepimpact-b8.txt
```

The results should be as follows:

```
#####################
MRR @10: 0.3252764133351524
QueriesRanked: 6980
#####################
```

The final evaluation metric is very close to the one reported in the paper (0.326).


## Reproduction Log[*](reproducibility.md)