PyGaggle: Neural Ranking Baselines on MS MARCO Passage Retrieval - Entire Dev Set
This page contains instructions for running various neural reranking baselines on the MS MARCO passage ranking task. We will run on the entire dev set. Note that there is also a separate MS MARCO document ranking task and a separate MS MARCO passage ranking task - Subset.
Prior to running this, we suggest looking at our first-stage BM25 ranking instructions. We re-rank the BM25 run files, which contain ~1000 passages per query, using both monoBERT and monoT5. Both are pointwise rerankers: each passage is scored independently, using BERT or T5 respectively.
Since it can take days to run these models on all 6,980 queries from the MS MARCO dev set, we will use Compute Canada for replication.
Please follow this guide to create an account on Compute Canada.
After that, follow the same guide to create a virtual environment so that you can easily install Python packages.
Note: Don't forget to update `pip` and `setuptools`.
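As a rough sketch of the environment setup (the Python module version and the environment name `~/pygaggle_env` are assumptions; adjust them to your cluster):

```bash
# Load a Python module (exact versions available vary by cluster)
module load python/3.8

# Create and activate a virtual environment
virtualenv --no-download ~/pygaggle_env
source ~/pygaggle_env/bin/activate

# Update pip and setuptools inside the environment
pip install --upgrade pip setuptools
```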
When running experiments for the first time, submit jobs interactively so you can debug and confirm that your code runs correctly. Once everything works, submit the experiment as a batch script so it can run unattended; this is essential unless you want to babysit jobs for days.
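Compute Canada clusters schedule jobs with Slurm: `salloc` opens an interactive session for debugging (e.g. `salloc --account=def-someuser --gres=gpu:1 --mem=32G --time=3:00:00`), and `sbatch` submits a script for unattended runs. Below is a minimal sketch of a batch script; the account name, resource requests, and paths are placeholders you will need to adjust:

```bash
#!/bin/bash
#SBATCH --account=def-someuser   # replace with your allocation
#SBATCH --gres=gpu:1             # request one GPU
#SBATCH --mem=32G
#SBATCH --time=72:00:00          # re-ranking the full dev set can take days

module load java
source ~/pygaggle_env/bin/activate   # virtual environment from the setup above
cd ~/scratch/pygaggle

# Place one of the re-ranking commands from the sections below here.
```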
After you enter the compute node, let's install PyGaggle under the `~/scratch` directory.
Note 1: Run the following instructions at the root of this repo.
Note 2: Make sure that you have access to a GPU.
Note 3: PyGaggle must be installed from source, with the `anserini-eval` submodule pulled. To do this, first clone the repository recursively:
```bash
git clone --recursive https://github.com/castorini/pygaggle.git
cd pygaggle
```
Then load the Java module:

```bash
module load java
```
Then install PyTorch:

```bash
pip install torch
```
Then install PyGaggle with the following command:

```bash
pip install -r requirements.txt
```
Note: On Compute Canada, you may have to install TensorFlow separately with the following command:

```bash
pip install tensorflow_gpu
```
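Before launching a long re-ranking job, it is worth confirming that PyTorch can actually see the GPU:

```bash
python -c "import torch; print(torch.cuda.is_available())"
```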
- monoBERT-Large: Passage Re-ranking with BERT (Nogueira et al., 2019)
- monoT5-base: Document Ranking with a Pretrained Sequence-to-Sequence Model (Nogueira et al., 2020)
We're first going to download the queries, qrels, and run files corresponding to the entire MS MARCO dev set. The run file is generated by following the BM25 ranking instructions. We'll store all these files in the `data/msmarco_ans_entire` directory.
You can download these three files from this repository.
- `queries.dev.small.tsv`: 6,980 queries from the MS MARCO dev set.
- `qrels.dev.small.tsv`: 7,437 pairs of queries and relevant passage ids from the MS MARCO dev set.
- `run.bm25.dev.small.tsv`: approximately 6,980,000 pairs of dev set queries and passages retrieved using BM25.
Note: Please rename `run.bm25.dev.small.tsv` to `run.dev.small.tsv`.
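For example, assuming the three files were downloaded into the current directory, the following puts them in place and performs the rename:

```bash
mkdir -p data/msmarco_ans_entire
mv queries.dev.small.tsv qrels.dev.small.tsv data/msmarco_ans_entire/
mv run.bm25.dev.small.tsv data/msmarco_ans_entire/run.dev.small.tsv
```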
As a sanity check, we can evaluate the first-stage retrieved documents using the official MS MARCO evaluation script.
```bash
python tools/scripts/msmarco/msmarco_passage_eval.py data/msmarco_ans_entire/qrels.dev.small.tsv data/msmarco_ans_entire/run.dev.small.tsv
```
The output should be:
```
#####################
MRR @10: 0.18736452221767383
QueriesRanked: 6980
#####################
```
Let's download and extract the pre-built MS MARCO index into `indexes`:
```bash
wget https://git.uwaterloo.ca/jimmylin/anserini-indexes/raw/master/index-msmarco-passage-20191117-0ed488.tar.gz -P indexes
tar xvfz indexes/index-msmarco-passage-20191117-0ed488.tar.gz -C indexes
```
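Scratch space is quota-limited, so you may want to delete the tarball once extraction succeeds:

```bash
rm indexes/index-msmarco-passage-20191117-0ed488.tar.gz
```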
Now, we can begin re-ranking the set.
First, let's evaluate using monoBERT!
```bash
python -um pygaggle.run.evaluate_passage_ranker --split dev \
                                                --method seq_class_transformer \
                                                --model castorini/monobert-large-msmarco \
                                                --dataset data/msmarco_ans_entire/ \
                                                --index-dir indexes/index-msmarco-passage-20191117-0ed488 \
                                                --task msmarco \
                                                --output-file runs/run.monobert.ans_entire.dev.tsv
```
Upon completion, the following output will be visible:
```
precision@1   0.2533
recall@3      0.45093
recall@50     0.80609
recall@1000   0.86289
mrr           0.38789
mrr@10        0.37922
```
Re-ranking the entire MS MARCO dev set takes approximately 57 hours on a single V100; the type of GPU will directly influence your inference time.
It is possible that the default batch size results in a GPU out-of-memory (OOM) error.
In this case, assigning a batch size (using the `--batch-size` option) smaller than the default (96) should help!
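For instance, here is the same monoBERT command with the batch size lowered from the default 96 to 32 (any smaller value that fits your GPU's memory works):

```bash
python -um pygaggle.run.evaluate_passage_ranker --split dev \
                                                --method seq_class_transformer \
                                                --model castorini/monobert-large-msmarco \
                                                --dataset data/msmarco_ans_entire/ \
                                                --index-dir indexes/index-msmarco-passage-20191117-0ed488 \
                                                --task msmarco \
                                                --batch-size 32 \
                                                --output-file runs/run.monobert.ans_entire.dev.tsv
```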
The re-ranked run file `run.monobert.ans_entire.dev.tsv` will also be available in the `runs` directory upon completion.
We can use the official MS MARCO evaluation script to verify the MRR@10:
```bash
python tools/scripts/msmarco/msmarco_passage_eval.py data/msmarco_ans_entire/qrels.dev.small.tsv runs/run.monobert.ans_entire.dev.tsv
```
You should see the same result. Great, let's move on to monoT5!
We use the monoT5-base variant as it is the easiest to run without access to larger GPUs/TPUs. Let us now re-rank the set:
```bash
python -um pygaggle.run.evaluate_passage_ranker --split dev \
                                                --method t5 \
                                                --model castorini/monot5-base-msmarco \
                                                --dataset data/msmarco_ans_entire \
                                                --model-type t5-base \
                                                --task msmarco \
                                                --index-dir indexes/index-msmarco-passage-20191117-0ed488 \
                                                --batch-size 32 \
                                                --output-file runs/run.monot5.ans_entire.dev.tsv
```
The following output will be visible after it has finished:
```
precision@1   0.25129
recall@3      0.45362
recall@50     0.80709
recall@1000   0.86289
mrr           0.38839
mrr@10        0.37986
```
Re-ranking the entire MS MARCO dev set with monoT5 takes approximately 26 hours on a single V100. It is worth noting again that you might need to adjust the batch size to best fit the GPU at hand.
Upon completion, the re-ranked run file `run.monot5.ans_entire.dev.tsv` will be available in the `runs` directory.
We can use the official MS MARCO evaluation script to verify the MRR@10:
```bash
python tools/scripts/msmarco/msmarco_passage_eval.py data/msmarco_ans_entire/qrels.dev.small.tsv runs/run.monot5.ans_entire.dev.tsv
```
You should see the same result.
If you were able to replicate these results, please submit a PR adding to the replication log! Please mention in your PR if you find any difference!
- Results replicated by @qguo96 on 2020-10-08 (commit `3d4b7c0`) (Tesla V100 on Compute Canada)
- Results replicated by @stephaniewhoo on 2020-10-25 (commit `e815051`) (Tesla V100 on Compute Canada)
- Results replicated by @rayyang29 on 2020-11-16 (commit `d840b0c`) (Tesla V100 on Compute Canada)
- Results replicated by @Dahlia-Chehata on 2021-01-10 (commit `623285a`) (Tesla V100 on Compute Canada)
- Results replicated by @KaiSun314 on 2021-01-16 (commit `1414e32`) (Tesla V100 on Compute Canada)