Refactor IRST documentation + minor tweaks to other docs (#1219)
lintool authored Jun 23, 2022
1 parent accc8c5 commit b198f88
Showing 6 changed files with 103 additions and 289 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -12,7 +12,7 @@ Retrieval using dense representations is provided via integration with Facebook's

Pyserini is primarily designed to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture.
Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections.
With Pyserini, it's easy to reproduce runs on a number of standard IR test collections!
A low-effort way to try things out is to look at our [online notebooks](https://github.com/castorini/anserini-notebooks), which will allow you to get started with just a few clicks.

For additional details, [our paper](https://dl.acm.org/doi/10.1145/3404835.3463238) in SIGIR 2021 provides a nice overview.
@@ -591,13 +591,13 @@ The following guides provide step-by-step instructions:

### Sparse Retrieval

+ Reproducing [Robust04 baselines for ad hoc retrieval](docs/experiments-robust04.md)
+ Reproducing the [BM25 baseline for MS MARCO V1 Passage Ranking](docs/experiments-msmarco-passage.md)
+ Reproducing the [BM25 baseline for MS MARCO V1 Document Ranking](docs/experiments-msmarco-doc.md)
+ Reproducing the [multi-field BM25 baseline for MS MARCO V1 Document Ranking from Elasticsearch](docs/experiments-elastic.md)
+ Reproducing [BM25 baselines on the MS MARCO V2 Collections](docs/experiments-msmarco-v2.md)
+ Reproducing LTR filtering experiments: [MS MARCO V1 Passage](docs/experiments-ltr-msmarco-passage-reranking.md), [MS MARCO V1 Document](docs/experiments-ltr-msmarco-document-reranking.md)
+ Reproducing IRST experiments on the [MS MARCO V1 Collections](docs/experiments-msmarco-irst.md)
+ Reproducing DeepImpact: [MS MARCO V1 Passage](docs/experiments-deepimpact.md)
+ Reproducing uniCOIL with doc2query-T5: [MS MARCO V1](docs/experiments-unicoil.md), [MS MARCO V2](docs/experiments-msmarco-v2-unicoil.md)
+ Reproducing uniCOIL with TILDE: [MS MARCO V1 Passage](docs/experiments-unicoil-tilde-expansion.md), [MS MARCO V2 Passage](docs/experiments-msmarco-v2-unicoil-tilde-expansion.md)
2 changes: 1 addition & 1 deletion docs/experiments-msmarco-doc.md
@@ -7,7 +7,7 @@ Note that there is a separate guide for the [MS MARCO *passage* ranking task](ex

## Data Prep

The guide requires the [development installation](https://github.com/castorini/pyserini/blob/master/docs/installation.md#development-installation) for additional resources that are not shipped with the Python module.

We're going to use the repository's root directory as the working directory.
First, we need to download and extract the MS MARCO document dataset:
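
A typical version of this step looks like the following sketch; the URL and target directory are assumptions based on the standard MS MARCO distribution, not taken from this guide:

```bash
# Assumption: the standard MS MARCO document-ranking distribution and a
# collections/msmarco-doc/ target directory (adjust to match the guide).
mkdir -p collections/msmarco-doc
wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.trec.gz -P collections/msmarco-doc
```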
242 changes: 97 additions & 145 deletions docs/experiments-msmarco-irst.md
@@ -1,227 +1,179 @@
# Pyserini: IRST on MS MARCO V1 Collections

This guide describes how to reproduce the IRST (Information Retrieval as Statistical Translation) experiments on the MS MARCO V1 collections, as described in the following paper:

> Yuqi Liu, Chengcheng Hu, and Jimmy Lin. [Another Look at Information Retrieval as Statistical Translation.](https://cs.uwaterloo.ca/~jimmylin/publications/Liu_etal_SIGIR2022.pdf) _Proceedings of the 45th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022)_, July 2022.

Below, we discuss passage ranking and two document ranking conditions: full documents and segmented documents.

## Passage Ranking

Here, we start directly from our pre-built indexes and already-trained IRST models, so data prep and indexing can be skipped entirely.
The IBM model we use is referenced in [Boytsov et al. (2021)](https://arxiv.org/abs/2102.06815).
For training the model from scratch, consult the [guide in FlexNeuART](https://github.com/oaqa/FlexNeuART/tree/master/demo).
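
As rough intuition only (a schematic of classic translation-based retrieval, not the exact formulation from the paper; in particular, reading `--alpha` as the interpolation weight and `--max-sim` as the max aggregation is our assumption), a document $d$ is scored for a query $q$ by "translating" its terms $w$ into query terms $t$ with IBM Model 1 probabilities $T(t \mid w)$, interpolated with the document language model $\hat{p}(\cdot \mid d)$:

$$
s(q, d) = \sum_{t \in q} \log \Big[ \alpha\, \hat{p}(t \mid d) + (1 - \alpha) \operatorname*{agg}_{w \in d} T(t \mid w)\, \hat{p}(w \mid d) \Big],
\qquad \operatorname{agg} \in \{ \textstyle\sum, \max \}
$$

The Sum and Max variants below correspond, roughly, to the two choices of aggregation.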

The following commands will reproduce the results in Table 1 of our paper:

**IRST (Sum)**

```bash
python -m pyserini.search.lucene.irst \
  --topics msmarco-passage-dev-subset \
  --index msmarco-v1-passage \
  --output runs/run.irst-sum.passage.dev.txt \
  --alpha 0.1
```

**IRST (Max)**

```bash
python -m pyserini.search.lucene.irst \
  --topics msmarco-passage-dev-subset \
  --index msmarco-v1-passage \
  --output runs/run.irst-max.passage.dev.txt \
  --alpha 0.3 \
  --max-sim
```

The option `--topics` specifies the topic set.
The choices are listed below, with an example of generating the TREC DL runs right after the list:

+ MS MARCO V1 passage dev queries: `msmarco-passage-dev-subset` (per above)
+ TREC DL 2019 passage: `dl19-passage`
+ TREC DL 2020 passage: `dl20`
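
For example, to generate the TREC DL 2019 and TREC DL 2020 runs that are evaluated below, reuse the IRST (Sum) command, swapping only the topics and the output file names (the names here simply match the evaluation commands that follow):

```bash
python -m pyserini.search.lucene.irst \
  --topics dl19-passage \
  --index msmarco-v1-passage \
  --output runs/run.irst-sum.passage.dl19.txt \
  --alpha 0.1

python -m pyserini.search.lucene.irst \
  --topics dl20 \
  --index msmarco-v1-passage \
  --output runs/run.irst-sum.passage.dl20.txt \
  --alpha 0.1
```

Add `--max-sim` (with `--alpha 0.3`, per above) for the IRST (Max) variant.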

To evaluate results, use `trec_eval`.
For MS MARCO V1 passage:

```bash
python -m pyserini.eval.trec_eval -c -M 10 -m ndcg_cut.10 -m map -m recip_rank \
  msmarco-passage-dev-subset runs/run.irst-sum.passage.dev.txt
```

For TREC DL 2019, note that we need to specify `-l 2`, so that binary metrics such as MAP count only passages judged relevance grade 2 or higher as relevant:

```bash
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.10 -l 2 \
  dl19-passage runs/run.irst-sum.passage.dl19.txt
```

Similarly, for TREC DL 2020:

```bash
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.10 -l 2 \
  dl20-passage runs/run.irst-sum.passage.dl20.txt
```

The results should match Table 1 from our paper, repeated below:

| | MS MARCO Dev | TREC 2019 | | TREC 2020 | |
|:-----------------------------|-------------:|----------:|------:|----------:|------:|
| | MRR@10 | nDCG@10 | MAP | nDCG@10 | MAP |
| (1a) BM25 (k1= 0.82, b=0.68) | 0.188 | 0.497 | 0.290 | 0.488 | 0.288 |
| (2a) BM25 + IRST (Sum) | 0.221 | 0.526 | 0.328 | 0.558 | 0.352 |
| (2b) BM25 + IRST (Max) | 0.215 | 0.537 | 0.329 | 0.547 | 0.336 |

The BM25 baseline is provided for reference.

## Document Ranking

In the paper, we explore two different conditions for document ranking: full documents and segmented documents.
Note that we did not train a separate IBM model on the MS MARCO document data; instead, we reuse the IBM Model 1 trained on the MS MARCO passage data.

For full documents:

**IRST (Sum)**

```bash
python -m pyserini.search.lucene.irst \
  --topics msmarco-doc-dev \
  --index msmarco-v1-doc \
  --output runs/run.irst-sum.doc-full.dev.txt \
  --alpha 0.3 \
  --hits 1000
```

**IRST (Max)**

```bash
python -m pyserini.search.lucene.irst \
  --topics msmarco-doc-dev \
  --index msmarco-v1-doc \
  --output runs/run.irst-max.doc-full.dev.txt \
  --alpha 0.3 \
  --hits 1000 \
  --max-sim
```

For segmented documents, each document is first split into passages and each passage is treated as a unit of indexing; we use the MaxP technique during ranking, i.e., a document is scored by its highest-scoring passage.

**IRST (Sum)**

```bash
python -m pyserini.search.lucene.irst \
  --topics msmarco-doc-dev \
  --index msmarco-v1-doc-segmented \
  --output runs/run.irst-sum.doc-seg.dev.txt \
  --alpha 0.3 \
  --segments \
  --hits 10000
```

**IRST (Max)**

```bash
python -m pyserini.search.lucene.irst \
  --topics msmarco-doc-dev \
  --index msmarco-v1-doc-segmented \
  --output runs/run.irst-max.doc-seg.dev.txt \
  --alpha 0.3 \
  --hits 10000 \
  --segments \
  --max-sim
```

The option `--topics` specifies the topic set.
The choices are listed below, again with an example of generating the TREC DL runs right after the list:

+ MS MARCO V1 doc dev queries: `msmarco-doc-dev` (per above)
+ TREC DL 2019 doc: `dl19-doc`
+ TREC DL 2020 doc: `dl20`
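
For example, to generate the full-document TREC DL 2019 and TREC DL 2020 runs evaluated below, reuse the IRST (Sum) command from above, swapping only the topics and the output file names (which match the evaluation commands that follow):

```bash
python -m pyserini.search.lucene.irst \
  --topics dl19-doc \
  --index msmarco-v1-doc \
  --output runs/run.irst-sum.doc-full.dl19.txt \
  --alpha 0.3 \
  --hits 1000

python -m pyserini.search.lucene.irst \
  --topics dl20 \
  --index msmarco-v1-doc \
  --output runs/run.irst-sum.doc-full.dl20.txt \
  --alpha 0.3 \
  --hits 1000
```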

To evaluate results, use `trec_eval`.
For MS MARCO V1 doc:

```bash
python -m pyserini.eval.trec_eval -c -M 100 -m ndcg_cut.10 -m map -m recip_rank \
  msmarco-doc-dev runs/run.irst-sum.doc-full.dev.txt
```

For TREC DL 2019:

```bash
python -m pyserini.eval.trec_eval -c -M 100 -m map -m ndcg_cut.10 \
  dl19-doc runs/run.irst-sum.doc-full.dl19.txt
```

Similarly, for TREC DL 2020:

```bash
python -m pyserini.eval.trec_eval -c -M 100 -m map -m ndcg_cut.10 \
  dl20-doc runs/run.irst-sum.doc-full.dl20.txt
```

The results should match Table 2 from our paper, repeated below:

| | MS MARCO Dev | TREC 2019 | | TREC 2020 | |
|:-----------------------------|-------------:|----------:|------:|----------:|------:|
| | MRR@100 | nDCG@10 | MAP | nDCG@10 | MAP |
| **Document (Full)** | | | | | |
| (2a) BM25 (k1= 0.82, b=0.68) | 0.249 | 0.510 | 0.241 | 0.528 | 0.378 |
| (2b) BM25 + IRST (Sum) | 0.302 | 0.549 | 0.252 | 0.556 | 0.383 |
| (2c) BM25 + IRST (Max) | 0.252 | 0.491 | 0.220 | 0.502 | 0.337 |
| **Document (Segmented)** | | | | | |
| (3a) BM25 (k1= 0.82, b=0.68) | 0.269 | 0.529 | 0.240 | 0.531 | 0.362 |
| (3b) BM25 + IRST (Sum) | 0.296 | 0.560 | 0.271 | 0.534 | 0.376 |
| (3c) BM25 + IRST (Max) | 0.259 | 0.520 | 0.243 | 0.509 | 0.350 |

The BM25 baselines are provided for reference.

## Reproduction Log[*](reproducibility.md)
2 changes: 1 addition & 1 deletion docs/experiments-msmarco-passage.md
@@ -7,7 +7,7 @@ Note that there is a separate guide for the [MS MARCO *document* ranking task](e

## Data Prep

The guide requires the [development installation](https://github.com/castorini/pyserini/blob/master/docs/installation.md#development-installation) for additional resources that are not shipped with the Python module.

We're going to use `collections/msmarco-passage/` as the working directory.
First, we need to download and extract the MS MARCO passage dataset:
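
A sketch of the typical download step is below; the URL is an assumption based on the standard MS MARCO distribution, not taken from this guide:

```bash
# Assumption: the standard MS MARCO passage-ranking bundle (collection + queries).
mkdir -p collections/msmarco-passage
wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P collections/msmarco-passage

# Unpack into the working directory used throughout the guide.
tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage
```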
4 changes: 2 additions & 2 deletions docs/experiments-trec2021-clinical-trials.md
@@ -4,12 +4,12 @@ This guide contains instructions for running BM25 and RM3 baselines on the [TREC

## Data Prep

The guide requires the [development installation](https://github.com/castorini/pyserini/blob/master/docs/installation.md#development-installation) for additional resources that are not shipped with the Python module.

We're going to use the repository's root directory as the working directory.
First, we need to download and extract the Clinical Trials documents and topics.

```bash
mkdir collections/trec-ct

wget http://www.trec-cds.org/2021_data/ClinicalTrials.2021-04-27.part1.zip -P collections/trec-ct
```