LTR document exp + regression test #844

Merged · 17 commits · Oct 31, 2021
125 changes: 125 additions & 0 deletions docs/experiments-ltr-msmarco-document-reranking.md
@@ -0,0 +1,125 @@
# Pyserini: Learning-To-Rank Reranking Baseline for MS MARCO Document

This guide contains instructions for running a learning-to-rank (LTR) reranking baseline on the [MS MARCO *document* reranking task](https://microsoft.github.io/msmarco/).
Learning-to-rank serves as a second-stage reranker after BM25 retrieval.
Note that we use a sliding-window segmentation and the maxP strategy here: documents are split into overlapping passage segments, and a document's final score is the maximum score over its segments.

## Data Preprocessing

We're going to use the repository's root directory as the working directory.

First, we need to download and extract the MS MARCO document dataset:

```bash
mkdir collections/msmarco-doc
wget https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/T5-doc/msmarco-docs.tsv.gz -P collections/msmarco-doc
wget https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/T5-doc/msmarco_doc_passage_ids.txt -P collections/msmarco-doc
```

We need to generate a collection of passage segments. Here, we use a segment size of 3 and a stride of 1.
```bash
python scripts/ltr_msmarco/convert_msmarco_passage_doc_to_anserini.py \
--original_docs_path collections/msmarco-doc/msmarco-docs.tsv.gz \
--doc_ids_path collections/msmarco-doc/msmarco_doc_passage_ids.txt \
--output_docs_path collections/msmarco-doc/msmarco_pass_doc.jsonl
```
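To make the windowing concrete, here is a minimal sketch of what a segment size of 3 with stride 1 means; the function name and the unit of segmentation are illustrative assumptions, not the actual API of `convert_msmarco_passage_doc_to_anserini.py`:

```python
def sliding_window_segments(units, size=3, stride=1):
    """Split a list of text units (e.g. sentences) into overlapping windows."""
    n = max(len(units) - size + 1, 1)
    return [" ".join(units[i:i + size]) for i in range(0, n, stride)]

# A 4-unit document yields two 3-unit segments with stride 1.
print(sliding_window_segments(["s1", "s2", "s3", "s4"]))
```

With stride 1, consecutive segments overlap by `size - 1` units, so no passage boundary can split relevant content in a way that hides it from every segment.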

Let's first retrieve the top 10,000 bag-of-words (BM25) hits per query over the segments as our LTR reranking candidates.
```bash
python scripts/ltr_msmarco/convert_collection_to_jsonl.py --collection-path collections/msmarco-doc/msmarco_pass_doc.jsonl --output-folder collections/msmarco-doc/msmarco_pass_doc/

python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator \
-threads 21 -input collections/msmarco-doc/msmarco_pass_doc \
-index indexes/lucene-index-msmarco-doc-passage -storePositions -storeDocvectors -storeRaw

python -m pyserini.search --topics msmarco-doc-dev \
--index indexes/lucene-index-msmarco-doc-passage \
--output collections/msmarco-doc/run.msmarco-pass-doc.bm25.txt \
--bm25 --output-format trec --hits 10000
```
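The run file written above uses the six-column TREC format (`qid Q0 docid rank score tag`). A minimal, illustrative parser for such files might look like the sketch below; the segment-id value in the sample line is made up for illustration:

```python
from collections import defaultdict

def parse_trec_run(lines):
    """Group a TREC-format run (qid Q0 docid rank score tag) by query id."""
    run = defaultdict(list)
    for line in lines:
        qid, _, docid, rank, score, _tag = line.split()
        run[qid].append((docid, int(rank), float(score)))
    return dict(run)

sample = ["174249 Q0 D1555982#3 1 21.5 bm25"]
print(parse_trec_run(sample))  # -> {'174249': [('D1555982#3', 1, 21.5)]}
```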

Now, we prepare queries for LTR:
```bash
mkdir collections/msmarco-ltr-document

python scripts/ltr_msmarco/convert_queries.py \
--input tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
--output collections/msmarco-ltr-document/queries.dev.small.json

```

Download pretrained IBM models:

```bash
wget https://www.dropbox.com/s/vlrfcz3vmr4nt0q/ibm_model.tar.gz -P collections/msmarco-ltr-document/
tar -xzvf collections/msmarco-ltr-document/ibm_model.tar.gz -C collections/msmarco-ltr-document/
```

Download our pretrained LTR model:

```bash
wget https://www.dropbox.com/s/ffl2bfw4cd5ngyz/msmarco-passage-ltr-mrr-v1.tar.gz -P runs/
tar -xzvf runs/msmarco-passage-ltr-mrr-v1.tar.gz -C runs
```

Get our prebuilt LTR document index:
```bash
python -c "from pyserini.search import SimpleSearcher; SimpleSearcher.from_prebuilt_index('msmarco-document-ltr')"
```

Now we have everything ready and can run inference. The LTR model outputs rankings at the segment level; we then use another script to produce document-level results with the maxP strategy.
```bash
python scripts/ltr_msmarco/ltr_inference.py \
--input collections/msmarco-doc/run.msmarco-pass-doc.bm25.txt \
--input-format trec \
--model runs/msmarco-passage-ltr-mrr-v1 \
--data document \
--ibm-model collections/msmarco-ltr-document/ibm_model/ \
--queries collections/msmarco-ltr-document \
--index ~/.cache/pyserini/indexes/index-msmarco-document-ltr-20211027-3e4c283 \
--output runs/run.ltr.doc-pas.trec

python scripts/ltr_msmarco/generate_document_score_withmaxP.py \
--input runs/run.ltr.doc-pas.trec \
--output runs/run.ltr.doc_level.tsv
```
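The maxP aggregation performed by `generate_document_score_withmaxP.py` scores each document by the best score among its segments. A sketch of the idea, assuming for illustration that segment ids take the form `docid#segment` (the script's actual id format may differ):

```python
from collections import defaultdict

def maxp(segment_scores):
    """Aggregate (segment_id, score) pairs to document scores via max-pooling."""
    doc_scores = defaultdict(lambda: float("-inf"))
    for seg_id, score in segment_scores:
        docid = seg_id.rsplit("#", 1)[0]
        doc_scores[docid] = max(doc_scores[docid], score)
    # Rank documents by their best segment score.
    return sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)

print(maxp([("D1#0", 1.2), ("D1#1", 3.4), ("D2#0", 2.0)]))
# -> [('D1', 3.4), ('D2', 2.0)]
```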

```bash
python tools/scripts/msmarco/msmarco_doc_eval.py \
--judgments tools/topics-and-qrels/qrels.msmarco-doc.dev.txt \
--run runs/run.ltr.doc_level.tsv

```
The above evaluation should give results as below:
```bash
#####################
MRR @100: 0.3090492928920076
QueriesRanked: 5193
#####################
```
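For reference, MRR@100 averages, over all ranked queries, the reciprocal rank of the first relevant document within the top 100 (contributing 0 when none is found). A schematic implementation:

```python
def mrr_at_k(run, qrels, k=100):
    """run: qid -> ranked doc ids; qrels: qid -> set of relevant doc ids."""
    total = 0.0
    for qid, docs in run.items():
        relevant = qrels.get(qid, set())
        for rank, docid in enumerate(docs[:k], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break
    return total / len(run)

# First query: first relevant hit at rank 2; second query: no relevant hit.
print(mrr_at_k({"q1": ["d2", "d1"], "q2": ["d9"]}, {"q1": {"d1"}}))  # -> 0.25
```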

## Building the Index From Scratch

```bash
python scripts/ltr_msmarco/convert_passage_doc.py \
--input collections/msmarco-doc/msmarco_pass_doc.jsonl \
--output collections/msmarco-ltr-document/ltr_msmarco_pass_doc.jsonl \
--proc_qty 10
```

The above script will convert the collection to json files with `text_unlemm`, `analyzed`, `text_bert_tok` and `raw` fields.
Next, we need to convert the MS MARCO json collection into Anserini's jsonl files (which have one json object per line):

```bash
python scripts/ltr_msmarco/convert_collection_to_jsonl.py \
--collection-path collections/msmarco-ltr-document/ltr_msmarco_pass_doc.jsonl \
--output-folder collections/msmarco-ltr-document/ltr_msmarco_pass_doc_jsonl
```
We can now index these docs as a `JsonCollection` using Anserini with the `-pretokenized` option:

```bash
python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator \
-threads 21 -input collections/msmarco-ltr-document/ltr_msmarco_pass_doc_jsonl \
-index indexes/lucene-index-msmarco-document-ltr -storePositions -storeDocvectors -storeRaw -pretokenized
```

Note that the `-pretokenized` option tells Anserini to use a whitespace analyzer, so that our preprocessed tokenization is not broken.
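To illustrate why this matters: fields such as `text_bert_tok` already contain pretokenized tokens (e.g. BERT word pieces), and a whitespace analyzer reduces to a plain split that preserves them exactly, whereas a standard analyzer would re-tokenize, stem, and strip pieces like `##s`. The token string below is a made-up example:

```python
# Pretokenized field content: tokens are already separated by single spaces.
text_bert_tok = "bird ##s eye view of pompe ##ii"

# Whitespace analysis keeps every word piece intact.
print(text_bert_tok.split())
# -> ['bird', '##s', 'eye', 'view', 'of', 'pompe', '##ii']
```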
16 changes: 10 additions & 6 deletions docs/experiments-ltr-msmarco-passage-reranking.md
@@ -11,15 +11,15 @@ Next, we're going to use `collections/msmarco-ltr-passage/` as the working direc
```bash
mkdir collections/msmarco-ltr-passage/

python scripts/ltr_msmarco-passage/convert_queries.py \
python scripts/ltr_msmarco/convert_queries.py \
--input collections/msmarco-passage/queries.eval.small.tsv \
--output collections/msmarco-ltr-passage/queries.eval.small.json

python scripts/ltr_msmarco-passage/convert_queries.py \
python scripts/ltr_msmarco/convert_queries.py \
--input collections/msmarco-passage/queries.dev.small.tsv \
--output collections/msmarco-ltr-passage/queries.dev.small.json

python scripts/ltr_msmarco-passage/convert_queries.py \
python scripts/ltr_msmarco/convert_queries.py \
--input collections/msmarco-passage/queries.train.tsv \
--output collections/msmarco-ltr-passage/queries.train.json
```
@@ -52,12 +52,16 @@ tar -xzvf runs/msmarco-passage-ltr-mrr-v1.tar.gz -C runs
Next we can run our inference script to get our reranking result.

```bash
python -m pyserini.ltr.search_msmarco_passage \
python scripts/ltr_msmarco/ltr_inference.py \
--input runs/run.msmarco-passage.bm25tuned.txt \
--input-format tsv \
--model runs/msmarco-passage-ltr-mrr-v1 \
--index ~/.cache/pyserini/indexes/index-msmarco-passage-ltr-20210519-e25e33f.a5de642c268ac1ed5892c069bdc29ae3 \
--data passage \
--ibm-model collections/msmarco-ltr-passage/ibm_model/ \
--queries collections/msmarco-ltr-passage \
--output runs/run.ltr.msmarco-passage.tsv

```

Here, our model is trained to maximize MRR@10.
@@ -106,7 +110,7 @@ On the other hand, recall@1000 provides the upper bound effectiveness of downstr
Equivalently, we can preprocess collection and queries with our scripts:

```bash
python scripts/ltr_msmarco-passage/convert_passage.py \
python scripts/ltr_msmarco/convert_passage.py \
--input collections/msmarco-passage/collection.tsv \
--output collections/msmarco-ltr-passage/ltr_collection.json
```
@@ -115,7 +119,7 @@ The above script will convert the collection and queries to json files with `tex
Next, we need to convert the MS MARCO json collection into Anserini's jsonl files (which have one json object per line):

```bash
python scripts/ltr_msmarco-passage/convert_collection_to_jsonl.py \
python scripts/ltr_msmarco/convert_collection_to_jsonl.py \
--collection-path collections/msmarco-ltr-passage/ltr_collection.json \
--output-folder collections/msmarco-ltr-passage/ltr_collection_jsonl
```
12 changes: 6 additions & 6 deletions docs/experiments-ltr-msmarco-passage-training.md
@@ -15,15 +15,15 @@ Next, we're going to use `collections/msmarco-ltr-passage/` as the working direc
```bash
mkdir collections/msmarco-ltr-passage/

python scripts/ltr_msmarco-passage/convert_queries.py \
python scripts/ltr_msmarco/convert_queries.py \
--input collections/msmarco-passage/queries.eval.small.tsv \
--output collections/msmarco-ltr-passage/queries.eval.small.json

python scripts/ltr_msmarco-passage/convert_queries.py \
python scripts/ltr_msmarco/convert_queries.py \
--input collections/msmarco-passage/queries.dev.small.tsv \
--output collections/msmarco-ltr-passage/queries.dev.small.json

python scripts/ltr_msmarco-passage/convert_queries.py \
python scripts/ltr_msmarco/convert_queries.py \
--input collections/msmarco-passage/queries.train.tsv \
--output collections/msmarco-ltr-passage/queries.train.json
```
@@ -47,7 +47,7 @@ Download pretrained IBM models:

## Training the Model From Scratch
```bash
python scripts/ltr_msmarco-passage/train_ltr_model.py \
python scripts/ltr_msmarco/train_ltr_model.py \
--index ~/.cache/pyserini/indexes/index-msmarco-passage-ltr-20210519-e25e33f.a5de642c268ac1ed5892c069bdc29ae3
```

@@ -73,9 +73,9 @@ Number of negative samples used in training can be changed by `--neg-sample`, by
## Change the Optimization Goal of Your Trained Model
The script trains a model which optimizes MRR@10 by default.

You can change the `mrr_at_10` of [this function](../scripts/ltr_msmarco-passage/train_ltr_model.py#L621) and [here](../scripts/ltr_msmarco-passage/train_ltr_model.py#L358) to `recall_at_20` to train a model which optimizes recall@20.
You can change `mrr_at_10` in [this function](../scripts/ltr_msmarco/train_ltr_model.py#L621) and [here](../scripts/ltr_msmarco/train_ltr_model.py#L358) to `recall_at_20` to train a model that optimizes recall@20.

You can also self defined a function format like [this](../scripts/ltr_msmarco-passage/train_ltr_model.py#L300) and change corresponding places mentioned above to have different optimization goal.
You can also define your own function, following the format of [this example](../scripts/ltr_msmarco/train_ltr_model.py#L300), and change the corresponding places mentioned above to use a different optimization goal.

## Reproduction Log[*](reproducibility.md)
+ Results reproduced by [@Dahlia-Chehata](https://github.com/Dahlia-Chehata) on 2021-07-18 (commit [`a6b6545`](https://github.com/castorini/pyserini/commit/a6b6545c0133c03d50d5c33fb2fea7c527de04bb))
63 changes: 63 additions & 0 deletions integrations/sparse/test_ltr_msmarco_document.py
@@ -0,0 +1,63 @@
#
# Pyserini: Reproducible IR research with sparse and dense representations
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import unittest
import subprocess
import os
from shutil import rmtree
from pyserini.search import SimpleSearcher
from random import randint
from urllib.request import urlretrieve
import tarfile
import sys

class TestLtrMsmarcoDocument(unittest.TestCase):
def test_reranking(self):
if(os.path.isdir('ltr_test')):
rmtree('ltr_test')
os.mkdir('ltr_test')
inp = 'run.msmarco-pass-doc.bm25.txt'
outp = 'run.ltr.msmarco-passage.test.trec'
outp_tsv = 'run.ltr.msmarco-passage.test.tsv'
#Download candidate
os.system('wget https://www.dropbox.com/s/sxf16jcjtw1q9z7/run.msmarco-pass-doc.bm25.txt -P ltr_test')
#Download prebuilt index
SimpleSearcher.from_prebuilt_index('msmarco-document-ltr')
#Pre-trained ltr model
model_url = 'https://www.dropbox.com/s/ffl2bfw4cd5ngyz/msmarco-passage-ltr-mrr-v1.tar.gz'
model_tar_name = 'msmarco-passage-ltr-mrr-v1.tar.gz'
os.system(f'wget {model_url} -P ltr_test/')
os.system(f'tar -xzvf ltr_test/{model_tar_name} -C ltr_test')
#ibm model
ibm_model_url = 'https://www.dropbox.com/s/vlrfcz3vmr4nt0q/ibm_model.tar.gz'
ibm_model_tar_name = 'ibm_model.tar.gz'
os.system(f'wget {ibm_model_url} -P ltr_test/')
#queries process
os.system(f'tar -xzvf ltr_test/{ibm_model_tar_name} -C ltr_test')
os.system('python scripts/ltr_msmarco/convert_queries.py --input tools/topics-and-qrels/topics.msmarco-doc.dev.txt --output ltr_test/queries.dev.small.json')
os.system(f'python scripts/ltr_msmarco/ltr_inference.py --input ltr_test/{inp} --input-format trec --data document --model ltr_test/msmarco-passage-ltr-mrr-v1 --index ~/.cache/pyserini/indexes/index-msmarco-document-ltr-20211027-3e4c283.2718874ab44f6d383e84ad20f3790460 --ibm-model ltr_test/ibm_model/ --queries ltr_test --output ltr_test/{outp}')
#convert trec to tsv withmaxP
os.system(f'python scripts/ltr_msmarco/generate_document_score_withmaxP.py --input ltr_test/{outp} --output ltr_test/{outp_tsv}')


result = subprocess.check_output(f'python tools/scripts/msmarco/msmarco_doc_eval.py --judgments tools/topics-and-qrels/qrels.msmarco-doc.dev.txt --run ltr_test/{outp_tsv}', shell=True).decode(sys.stdout.encoding)
a,b = result.find('#####################\nMRR @100:'), result.find('\nQueriesRanked: 5193\n#####################\n')
mrr = result[a+32:b]
self.assertAlmostEqual(float(mrr),0.3090492928920076, delta=0.000001)
rmtree('ltr_test')

if __name__ == '__main__':
unittest.main()
4 changes: 2 additions & 2 deletions integrations/sparse/test_ltr_msmarco_passage.py
@@ -46,8 +46,8 @@ def test_reranking(self):
os.system(f'wget {ibm_model_url} -P ltr_test/')
os.system(f'tar -xzvf ltr_test/{ibm_model_tar_name} -C ltr_test')
#queries process
os.system('python scripts/ltr_msmarco-passage/convert_queries.py --input tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt --output ltr_test/queries.dev.small.json')
os.system(f'python -m pyserini.ltr.search_msmarco_passage --input ltr_test/{inp} --input-format tsv --model ltr_test/msmarco-passage-ltr-mrr-v1 --index ~/.cache/pyserini/indexes/index-msmarco-passage-ltr-20210519-e25e33f.a5de642c268ac1ed5892c069bdc29ae3 --ibm-model ltr_test/ibm_model/ --queries ltr_test --output ltr_test/{outp}')
os.system('python scripts/ltr_msmarco/convert_queries.py --input tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt --output ltr_test/queries.dev.small.json')
os.system(f'python scripts/ltr_msmarco/ltr_inference.py --input ltr_test/{inp} --input-format tsv --model ltr_test/msmarco-passage-ltr-mrr-v1 --data passage --index ~/.cache/pyserini/indexes/index-msmarco-passage-ltr-20210519-e25e33f.a5de642c268ac1ed5892c069bdc29ae3 --ibm-model ltr_test/ibm_model/ --queries ltr_test --output-format tsv --output ltr_test/{outp}')
result = subprocess.check_output(f'python tools/scripts/msmarco/msmarco_passage_eval.py tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt ltr_test/{outp}', shell=True).decode(sys.stdout.encoding)
a,b = result.find('#####################\nMRR @10:'), result.find('\nQueriesRanked: 6980\n#####################\n')
mrr = result[a+31:b]
@@ -14,5 +14,5 @@
# limitations under the License.
#

from ._search_msmarco_passage import MsmarcoPassageLtrSearcher
__all__ = ['MsmarcoPassageLtrSearcher']
from ._search_msmarco import MsmarcoLtrSearcher
__all__ = ['MsmarcoLtrSearcher']
@@ -25,19 +25,23 @@
import time
from tqdm import tqdm
import pickle
from pyserini.index import IndexReader

from pyserini.ltr._base import *


logger = logging.getLogger(__name__)

class MsmarcoPassageLtrSearcher:
def __init__(self, model: str, ibm_model:str, index:str):
class MsmarcoLtrSearcher:
def __init__(self, model: str, ibm_model:str, index:str, data: str):
self.model = model
self.ibm_model = ibm_model
self.fe = FeatureExtractor(index, max(multiprocessing.cpu_count()//2, 1))
self.index_reader = IndexReader(index)
self.data = data

def add_fe(self):
#self.fe.add(RunList('collections/msmarco-ltr-passage/run.monot5.run_list.whole.trec','t5'))
for qfield, ifield in [('analyzed', 'contents'),
('text_unlemm', 'text_unlemm'),
('text_bert_tok', 'text_bert_tok')]:
@@ -141,24 +145,19 @@ def add_fe(self):
self.fe.add(OrderedQueryPairs(15, field=ifield, qfield=qfield))

start = time.time()
self.fe.add(
IbmModel1(f"{self.ibm_model}/title_unlemm", "text_unlemm", "title_unlemm",
"text_unlemm"))
self.fe.add(IbmModel1(f"{self.ibm_model}/title_unlemm", "text_unlemm", "title_unlemm", "text_unlemm"))
end = time.time()
print('IBM model Load takes %.2f seconds' % (end - start))
start = end
self.fe.add(IbmModel1(f"{self.ibm_model}url_unlemm", "text_unlemm", "url_unlemm",
"text_unlemm"))
self.fe.add(IbmModel1(f"{self.ibm_model}url_unlemm", "text_unlemm", "url_unlemm", "text_unlemm"))
end = time.time()
print('IBM model Load takes %.2f seconds' % (end - start))
start = end
self.fe.add(
IbmModel1(f"{self.ibm_model}body", "text_unlemm", "body", "text_unlemm"))
self.fe.add(IbmModel1(f"{self.ibm_model}body", "text_unlemm", "body", "text_unlemm"))
end = time.time()
print('IBM model Load takes %.2f seconds' % (end - start))
start = end
self.fe.add(IbmModel1(f"{self.ibm_model}text_bert_tok", "text_bert_tok",
"text_bert_tok", "text_bert_tok"))
self.fe.add(IbmModel1(f"{self.ibm_model}text_bert_tok", "text_bert_tok", "text_bert_tok", "text_bert_tok"))
end = time.time()
print('IBM model Load takes %.2f seconds' % (end - start))
start = end
@@ -176,8 +175,13 @@ def batch_extract(self, df, queries, fe):
"query_dict": queries[qid]
}
for t in group.reset_index().itertuples():
task["docIds"].append(t.pid)
task_infos.append((qid, t.pid, t.rel))
if (self.data == 'document'):
if (self.index_reader.doc(t.pid) is not None):
task["docIds"].append(t.pid)
task_infos.append((qid, t.pid, t.rel))
else:
task["docIds"].append(t.pid)
task_infos.append((qid, t.pid, t.rel))
tasks.append(task)
group_lst.append((qid, len(task['docIds'])))
if len(tasks) == 1000:
12 changes: 12 additions & 0 deletions pyserini/prebuilt_index_info.py
@@ -101,6 +101,18 @@
"unique_terms": 2660824,
"downloaded": False
},
"msmarco-document-ltr": {
"description": "MS MARCO document corpus (4 extra preprocessed fields) used for LTR pipeline",
"filename": "index-msmarco-document-ltr-20211027-3e4c283.tar.gz",
"urls": [
"https://www.dropbox.com/s/5tr2otncs9rttbp/index-msmarco-document-ltr-20211027-3e4c283.tar.gz?dl=1" # too big for UWaterloo GitLab
],
"md5": "2718874ab44f6d383e84ad20f3790460",
"size compressed (bytes)": 46052436658,
"total_terms": 1232004740,
"documents": 20545628,
"downloaded": False
},
"msmarco-doc": {
"description": "Lucene index of the MS MARCO document corpus",
"filename": "index-msmarco-doc-20201117-f87c94.tar.gz",