LTR document exp + regression test #844

Merged · 17 commits · Oct 31, 2021
125 changes: 125 additions & 0 deletions docs/experiments-ltr-msmarco-document-reranking.md
@@ -0,0 +1,125 @@
# Pyserini: Learning-To-Rank Reranking Baseline for MS MARCO Document

This guide contains instructions for running a learning-to-rank (LTR) reranking baseline on the [MS MARCO *document* reranking task](https://microsoft.github.io/msmarco/).
Learning-to-rank serves as a second-stage reranker after BM25 retrieval.
Note that we use a sliding-window segmentation and the maxP strategy here: documents are split into overlapping passage segments, and a document's final score is the maximum score over its segments.

## Data Preprocessing

We're going to use the repository's root directory as the working directory.

First, we need to download and extract the MS MARCO document dataset:

```bash
mkdir collections/msmarco-doc
wget https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/T5-doc/msmarco-docs.tsv.gz -P collections/msmarco-doc
wget https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/T5-doc/msmarco_doc_passage_ids.txt -P collections/msmarco-doc
```

We need to generate a collection of passage segments. Here, we use a segment size of 3 and a stride of 1.
```bash
python scripts/ltr_msmarco/convert_msmarco_passage_doc_to_anserini.py \
--original_docs_path collections/msmarco-doc/msmarco-docs.tsv.gz \
--doc_ids_path collections/msmarco-doc/msmarco_doc_passage_ids.txt \
--output_docs_path collections/msmarco-doc/msmarco_pass_doc.jsonl
```
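To make the windowing concrete, here is a minimal sketch of what a segment size of 3 with stride 1 means; the function name and the unit of segmentation are illustrative assumptions, not the actual API of `convert_msmarco_passage_doc_to_anserini.py`:

```python
def sliding_window_segments(units, size=3, stride=1):
    """Split a list of text units (e.g. sentences) into overlapping windows."""
    n = max(len(units) - size + 1, 1)
    return [" ".join(units[i:i + size]) for i in range(0, n, stride)]

# A 4-unit document yields two 3-unit segments with stride 1.
print(sliding_window_segments(["s1", "s2", "s3", "s4"]))
```

With stride 1, consecutive segments overlap by `size - 1` units, so no passage boundary can split relevant content in a way that hides it from every segment.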

Let's first retrieve the top 10,000 bag-of-words (BM25) hits per query over the segments as our LTR reranking candidates.
```bash
python scripts/ltr_msmarco/convert_collection_to_jsonl.py --collection-path collections/msmarco-doc/msmarco_pass_doc.jsonl --output-folder collections/msmarco-doc/msmarco_pass_doc/

python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator \
-threads 21 -input collections/msmarco-doc/msmarco_pass_doc \
-index indexes/lucene-index-msmarco-doc-passage -storePositions -storeDocvectors -storeRaw

python -m pyserini.search --topics msmarco-doc-dev \
--index indexes/lucene-index-msmarco-doc-passage \
--output collections/msmarco-doc/run.msmarco-pass-doc.bm25.txt \
--bm25 --output-format trec --hits 10000
```
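The run file written above uses the six-column TREC format (`qid Q0 docid rank score tag`). A minimal, illustrative parser for such files might look like the sketch below; the segment-id value in the sample line is made up for illustration:

```python
from collections import defaultdict

def parse_trec_run(lines):
    """Group a TREC-format run (qid Q0 docid rank score tag) by query id."""
    run = defaultdict(list)
    for line in lines:
        qid, _, docid, rank, score, _tag = line.split()
        run[qid].append((docid, int(rank), float(score)))
    return dict(run)

sample = ["174249 Q0 D1555982#3 1 21.5 bm25"]
print(parse_trec_run(sample))  # -> {'174249': [('D1555982#3', 1, 21.5)]}
```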

Now, we prepare queries for LTR:
```bash
mkdir collections/msmarco-ltr-document

python scripts/ltr_msmarco/convert_queries.py \
--input tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
--output collections/msmarco-ltr-document/queries.dev.small.json

```

Download pretrained IBM models:

```bash
wget https://www.dropbox.com/s/vlrfcz3vmr4nt0q/ibm_model.tar.gz -P collections/msmarco-ltr-document/
tar -xzvf collections/msmarco-ltr-document/ibm_model.tar.gz -C collections/msmarco-ltr-document/
```

Download our pretrained LTR model:

```bash
wget https://www.dropbox.com/s/ffl2bfw4cd5ngyz/msmarco-passage-ltr-mrr-v1.tar.gz -P runs/
tar -xzvf runs/msmarco-passage-ltr-mrr-v1.tar.gz -C runs
```

Get our prebuilt LTR document index:
```bash
python -c "from pyserini.search import SimpleSearcher; SimpleSearcher.from_prebuilt_index('msmarco-document-ltr')"
```

Now we have everything ready and can run inference. The LTR model outputs rankings at the segment level; we then use another script to produce document-level results with the maxP strategy.
```bash
python scripts/ltr_msmarco/ltr_inference.py \
--input collections/msmarco-doc/run.msmarco-pass-doc.bm25.txt \
--input-format trec \
--model runs/msmarco-passage-ltr-mrr-v1 \
--data document \
--ibm-model collections/msmarco-ltr-document/ibm_model/ \
--queries collections/msmarco-ltr-document \
--index ~/.cache/pyserini/indexes/index-msmarco-document-ltr-20211027-3e4c283 \
--output runs/run.ltr.doc-pas.trec

python scripts/ltr_msmarco/generate_document_score_withmaxP.py \
--input runs/run.ltr.doc-pas.trec \
--output runs/run.ltr.doc_level.tsv
```
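The maxP aggregation performed by `generate_document_score_withmaxP.py` scores each document by the best score among its segments. A sketch of the idea, assuming for illustration that segment ids take the form `docid#segment` (the script's actual id format may differ):

```python
from collections import defaultdict

def maxp(segment_scores):
    """Aggregate (segment_id, score) pairs to document scores via max-pooling."""
    doc_scores = defaultdict(lambda: float("-inf"))
    for seg_id, score in segment_scores:
        docid = seg_id.rsplit("#", 1)[0]
        doc_scores[docid] = max(doc_scores[docid], score)
    # Rank documents by their best segment score.
    return sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)

print(maxp([("D1#0", 1.2), ("D1#1", 3.4), ("D2#0", 2.0)]))
# -> [('D1', 3.4), ('D2', 2.0)]
```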

```bash
python tools/scripts/msmarco/msmarco_doc_eval.py \
--judgments tools/topics-and-qrels/qrels.msmarco-doc.dev.txt \
--run runs/run.ltr.doc_level.tsv

```
The above evaluation should give results as below:
```bash
#####################
MRR @100: 0.3090492928920076
QueriesRanked: 5193
#####################
```
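For reference, MRR@100 averages, over all ranked queries, the reciprocal rank of the first relevant document within the top 100 (contributing 0 when none is found). A schematic implementation:

```python
def mrr_at_k(run, qrels, k=100):
    """run: qid -> ranked doc ids; qrels: qid -> set of relevant doc ids."""
    total = 0.0
    for qid, docs in run.items():
        relevant = qrels.get(qid, set())
        for rank, docid in enumerate(docs[:k], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break
    return total / len(run)

# First query: first relevant hit at rank 2; second query: no relevant hit.
print(mrr_at_k({"q1": ["d2", "d1"], "q2": ["d9"]}, {"q1": {"d1"}}))  # -> 0.25
```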

## Building the Index From Scratch

```bash
python scripts/ltr_msmarco/convert_passage_doc.py \
--input collections/msmarco-doc/msmarco_pass_doc.jsonl \
--output collections/msmarco-ltr-document/ltr_msmarco_pass_doc.jsonl \
--proc_qty 10
```

The above script will convert the collection to json files with `text_unlemm`, `analyzed`, `text_bert_tok` and `raw` fields.
Next, we need to convert the MS MARCO json collection into Anserini's jsonl files (which have one json object per line):

```bash
python scripts/ltr_msmarco/convert_collection_to_jsonl.py \
--collection-path collections/msmarco-ltr-document/ltr_msmarco_pass_doc.jsonl \
--output-folder collections/msmarco-ltr-document/ltr_msmarco_pass_doc_jsonl
```
We can now index these docs as a `JsonCollection` using Anserini with the `-pretokenized` option:

```bash
python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator \
-threads 21 -input collections/msmarco-ltr-document/ltr_msmarco_pass_doc_jsonl \
-index indexes/lucene-index-msmarco-document-ltr -storePositions -storeDocvectors -storeRaw -pretokenized
```

Note that the `-pretokenized` option tells Anserini to use a whitespace analyzer, so that our preprocessed tokenization is not broken.
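To illustrate why this matters: fields such as `text_bert_tok` already contain pretokenized tokens (e.g. BERT word pieces), and a whitespace analyzer reduces to a plain split that preserves them exactly, whereas a standard analyzer would re-tokenize, stem, and strip pieces like `##s`. The token string below is a made-up example:

```python
# Pretokenized field content: tokens are already separated by single spaces.
text_bert_tok = "bird ##s eye view of pompe ##ii"

# Whitespace analysis keeps every word piece intact.
print(text_bert_tok.split())
# -> ['bird', '##s', 'eye', 'view', 'of', 'pompe', '##ii']
```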
16 changes: 10 additions & 6 deletions docs/experiments-ltr-msmarco-passage-reranking.md
@@ -11,15 +11,15 @@ Next, we're going to use `collections/msmarco-ltr-passage/` as the working direc
```bash
mkdir collections/msmarco-ltr-passage/

python scripts/ltr_msmarco-passage/convert_queries.py \
python scripts/ltr_msmarco/convert_queries.py \
--input collections/msmarco-passage/queries.eval.small.tsv \
--output collections/msmarco-ltr-passage/queries.eval.small.json

python scripts/ltr_msmarco-passage/convert_queries.py \
python scripts/ltr_msmarco/convert_queries.py \
--input collections/msmarco-passage/queries.dev.small.tsv \
--output collections/msmarco-ltr-passage/queries.dev.small.json

python scripts/ltr_msmarco-passage/convert_queries.py \
python scripts/ltr_msmarco/convert_queries.py \
--input collections/msmarco-passage/queries.train.tsv \
--output collections/msmarco-ltr-passage/queries.train.json
```
@@ -52,12 +52,16 @@ tar -xzvf runs/msmarco-passage-ltr-mrr-v1.tar.gz -C runs
Next we can run our inference script to get our reranking result.

```bash
python -m pyserini.ltr.search_msmarco_passage \
python scripts/ltr_msmarco/ltr_inference.py \
--input runs/run.msmarco-passage.bm25tuned.txt \
--input-format tsv \
--model runs/msmarco-passage-ltr-mrr-v1 \
--index ~/.cache/pyserini/indexes/index-msmarco-passage-ltr-20210519-e25e33f.a5de642c268ac1ed5892c069bdc29ae3 \
--data passage \
--ibm-model collections/msmarco-ltr-passage/ibm_model/ \
--queries collections/msmarco-ltr-passage \
--output runs/run.ltr.msmarco-passage.tsv

```

Here, our model is trained to maximize MRR@10.
@@ -106,7 +110,7 @@ On the other hand, recall@1000 provides the upper bound effectiveness of downstr
Equivalently, we can preprocess collection and queries with our scripts:

```bash
python scripts/ltr_msmarco-passage/convert_passage.py \
python scripts/ltr_msmarco/convert_passage.py \
--input collections/msmarco-passage/collection.tsv \
--output collections/msmarco-ltr-passage/ltr_collection.json
```
@@ -115,7 +119,7 @@ The above script will convert the collection and queries to json files with `tex
Next, we need to convert the MS MARCO json collection into Anserini's jsonl files (which have one json object per line):

```bash
python scripts/ltr_msmarco-passage/convert_collection_to_jsonl.py \
python scripts/ltr_msmarco/convert_collection_to_jsonl.py \
--collection-path collections/msmarco-ltr-passage/ltr_collection.json \
--output-folder collections/msmarco-ltr-passage/ltr_collection_jsonl
```
12 changes: 6 additions & 6 deletions docs/experiments-ltr-msmarco-passage-training.md
@@ -15,15 +15,15 @@ Next, we're going to use `collections/msmarco-ltr-passage/` as the working direc
```bash
mkdir collections/msmarco-ltr-passage/

python scripts/ltr_msmarco-passage/convert_queries.py \
python scripts/ltr_msmarco/convert_queries.py \
--input collections/msmarco-passage/queries.eval.small.tsv \
--output collections/msmarco-ltr-passage/queries.eval.small.json

python scripts/ltr_msmarco-passage/convert_queries.py \
python scripts/ltr_msmarco/convert_queries.py \
--input collections/msmarco-passage/queries.dev.small.tsv \
--output collections/msmarco-ltr-passage/queries.dev.small.json

python scripts/ltr_msmarco-passage/convert_queries.py \
python scripts/ltr_msmarco/convert_queries.py \
--input collections/msmarco-passage/queries.train.tsv \
--output collections/msmarco-ltr-passage/queries.train.json
```
@@ -47,7 +47,7 @@ Download pretrained IBM models:

## Training the Model From Scratch
```bash
python scripts/ltr_msmarco-passage/train_ltr_model.py \
python scripts/ltr_msmarco/train_ltr_model.py \
--index ~/.cache/pyserini/indexes/index-msmarco-passage-ltr-20210519-e25e33f.a5de642c268ac1ed5892c069bdc29ae3
```

@@ -73,9 +73,9 @@ Number of negative samples used in training can be changed by `--neg-sample`, by
## Change the Optimization Goal of Your Trained Model
The script trains a model which optimizes MRR@10 by default.

You can change the `mrr_at_10` of [this function](../scripts/ltr_msmarco-passage/train_ltr_model.py#L621) and [here](../scripts/ltr_msmarco-passage/train_ltr_model.py#L358) to `recall_at_20` to train a model which optimizes recall@20.
You can change `mrr_at_10` in [this function](../scripts/ltr_msmarco/train_ltr_model.py#L621) and [here](../scripts/ltr_msmarco/train_ltr_model.py#L358) to `recall_at_20` to train a model that optimizes recall@20.

You can also self defined a function format like [this](../scripts/ltr_msmarco-passage/train_ltr_model.py#L300) and change corresponding places mentioned above to have different optimization goal.
You can also define your own function, following the format of [this example](../scripts/ltr_msmarco/train_ltr_model.py#L300), and change the corresponding places mentioned above to use a different optimization goal.

## Reproduction Log[*](reproducibility.md)
+ Results reproduced by [@Dahlia-Chehata](https://github.com/Dahlia-Chehata) on 2021-07-18 (commit [`a6b6545`](https://github.com/castorini/pyserini/commit/a6b6545c0133c03d50d5c33fb2fea7c527de04bb))
63 changes: 63 additions & 0 deletions integrations/sparse/test_ltr_msmarco_document.py
@@ -0,0 +1,63 @@
#
# Pyserini: Reproducible IR research with sparse and dense representations
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import unittest
import subprocess
import os
from shutil import rmtree
from pyserini.search import SimpleSearcher
from random import randint
from urllib.request import urlretrieve
import tarfile
import sys

class TestLtrMsmarcoDocument(unittest.TestCase):
def test_reranking(self):
if(os.path.isdir('ltr_test')):
rmtree('ltr_test')
os.mkdir('ltr_test')
inp = 'run.msmarco-pass-doc.bm25.txt'
outp = 'run.ltr.msmarco-passage.test.trec'
outp_tsv = 'run.ltr.msmarco-passage.test.tsv'
#Download candidate
os.system('wget https://www.dropbox.com/s/sxf16jcjtw1q9z7/run.msmarco-pass-doc.bm25.txt -P ltr_test')
#Download prebuilt index
SimpleSearcher.from_prebuilt_index('msmarco-document-ltr')
#Pre-trained ltr model
model_url = 'https://www.dropbox.com/s/ffl2bfw4cd5ngyz/msmarco-passage-ltr-mrr-v1.tar.gz'
model_tar_name = 'msmarco-passage-ltr-mrr-v1.tar.gz'
os.system(f'wget {model_url} -P ltr_test/')
os.system(f'tar -xzvf ltr_test/{model_tar_name} -C ltr_test')
#ibm model
ibm_model_url = 'https://www.dropbox.com/s/vlrfcz3vmr4nt0q/ibm_model.tar.gz'
ibm_model_tar_name = 'ibm_model.tar.gz'
os.system(f'wget {ibm_model_url} -P ltr_test/')
#queries process
os.system(f'tar -xzvf ltr_test/{ibm_model_tar_name} -C ltr_test')
os.system('python scripts/ltr_msmarco/convert_queries.py --input tools/topics-and-qrels/topics.msmarco-doc.dev.txt --output ltr_test/queries.dev.small.json')
os.system(f'python scripts/ltr_msmarco/ltr_inference.py --input ltr_test/{inp} --input-format trec --data document --model ltr_test/msmarco-passage-ltr-mrr-v1 --index ~/.cache/pyserini/indexes/index-msmarco-document-ltr-20211027-3e4c283.2718874ab44f6d383e84ad20f3790460 --ibm-model ltr_test/ibm_model/ --queries ltr_test --output ltr_test/{outp}')
#convert trec to tsv withmaxP
os.system(f'python scripts/ltr_msmarco/generate_document_score_withmaxP.py --input ltr_test/{outp} --output ltr_test/{outp_tsv}')


result = subprocess.check_output(f'python tools/scripts/msmarco/msmarco_doc_eval.py --judgments tools/topics-and-qrels/qrels.msmarco-doc.dev.txt --run ltr_test/{outp_tsv}', shell=True).decode(sys.stdout.encoding)
a,b = result.find('#####################\nMRR @100:'), result.find('\nQueriesRanked: 5193\n#####################\n')
mrr = result[a+32:b]
self.assertAlmostEqual(float(mrr),0.3090492928920076, delta=0.000001)
rmtree('ltr_test')

if __name__ == '__main__':
unittest.main()
4 changes: 2 additions & 2 deletions integrations/sparse/test_ltr_msmarco_passage.py
@@ -46,8 +46,8 @@ def test_reranking(self):
os.system(f'wget {ibm_model_url} -P ltr_test/')
os.system(f'tar -xzvf ltr_test/{ibm_model_tar_name} -C ltr_test')
#queries process
os.system('python scripts/ltr_msmarco-passage/convert_queries.py --input tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt --output ltr_test/queries.dev.small.json')
os.system(f'python -m pyserini.ltr.search_msmarco_passage --input ltr_test/{inp} --input-format tsv --model ltr_test/msmarco-passage-ltr-mrr-v1 --index ~/.cache/pyserini/indexes/index-msmarco-passage-ltr-20210519-e25e33f.a5de642c268ac1ed5892c069bdc29ae3 --ibm-model ltr_test/ibm_model/ --queries ltr_test --output ltr_test/{outp}')
os.system('python scripts/ltr_msmarco/convert_queries.py --input tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt --output ltr_test/queries.dev.small.json')
os.system(f'python scripts/ltr_msmarco/ltr_inference.py --input ltr_test/{inp} --input-format tsv --model ltr_test/msmarco-passage-ltr-mrr-v1 --data passage --index ~/.cache/pyserini/indexes/index-msmarco-passage-ltr-20210519-e25e33f.a5de642c268ac1ed5892c069bdc29ae3 --ibm-model ltr_test/ibm_model/ --queries ltr_test --output-format tsv --output ltr_test/{outp}')
result = subprocess.check_output(f'python tools/scripts/msmarco/msmarco_passage_eval.py tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt ltr_test/{outp}', shell=True).decode(sys.stdout.encoding)
a,b = result.find('#####################\nMRR @10:'), result.find('\nQueriesRanked: 6980\n#####################\n')
mrr = result[a+31:b]
@@ -14,5 +14,5 @@
# limitations under the License.
#

from ._search_msmarco_passage import MsmarcoPassageLtrSearcher
__all__ = ['MsmarcoPassageLtrSearcher']
from ._search_msmarco import MsmarcoLtrSearcher
__all__ = ['MsmarcoLtrSearcher']
@@ -25,19 +25,23 @@
import time
from tqdm import tqdm
import pickle
from pyserini.index import IndexReader

from pyserini.ltr._base import *


logger = logging.getLogger(__name__)

class MsmarcoPassageLtrSearcher:
def __init__(self, model: str, ibm_model:str, index:str):
class MsmarcoLtrSearcher:
def __init__(self, model: str, ibm_model:str, index:str, data: str):
self.model = model
self.ibm_model = ibm_model
self.fe = FeatureExtractor(index, max(multiprocessing.cpu_count()//2, 1))
self.index_reader = IndexReader(index)
self.data = data

def add_fe(self):
#self.fe.add(RunList('collections/msmarco-ltr-passage/run.monot5.run_list.whole.trec','t5'))
for qfield, ifield in [('analyzed', 'contents'),
('text_unlemm', 'text_unlemm'),
('text_bert_tok', 'text_bert_tok')]:
@@ -141,24 +145,19 @@ def add_fe(self):
self.fe.add(OrderedQueryPairs(15, field=ifield, qfield=qfield))

start = time.time()
self.fe.add(
IbmModel1(f"{self.ibm_model}/title_unlemm", "text_unlemm", "title_unlemm",
"text_unlemm"))
self.fe.add(IbmModel1(f"{self.ibm_model}/title_unlemm", "text_unlemm", "title_unlemm", "text_unlemm"))
end = time.time()
print('IBM model Load takes %.2f seconds' % (end - start))
start = end
self.fe.add(IbmModel1(f"{self.ibm_model}url_unlemm", "text_unlemm", "url_unlemm",
"text_unlemm"))
self.fe.add(IbmModel1(f"{self.ibm_model}url_unlemm", "text_unlemm", "url_unlemm", "text_unlemm"))
end = time.time()
print('IBM model Load takes %.2f seconds' % (end - start))
start = end
self.fe.add(
IbmModel1(f"{self.ibm_model}body", "text_unlemm", "body", "text_unlemm"))
self.fe.add(IbmModel1(f"{self.ibm_model}body", "text_unlemm", "body", "text_unlemm"))
end = time.time()
print('IBM model Load takes %.2f seconds' % (end - start))
start = end
self.fe.add(IbmModel1(f"{self.ibm_model}text_bert_tok", "text_bert_tok",
"text_bert_tok", "text_bert_tok"))
self.fe.add(IbmModel1(f"{self.ibm_model}text_bert_tok", "text_bert_tok", "text_bert_tok", "text_bert_tok"))
end = time.time()
print('IBM model Load takes %.2f seconds' % (end - start))
start = end
@@ -176,8 +175,13 @@ def batch_extract(self, df, queries, fe):
"query_dict": queries[qid]
}
for t in group.reset_index().itertuples():
task["docIds"].append(t.pid)
task_infos.append((qid, t.pid, t.rel))
if (self.data == 'document'):
if (self.index_reader.doc(t.pid) is not None):
task["docIds"].append(t.pid)
task_infos.append((qid, t.pid, t.rel))
else:
task["docIds"].append(t.pid)
task_infos.append((qid, t.pid, t.rel))
tasks.append(task)
group_lst.append((qid, len(task['docIds'])))
if len(tasks) == 1000:
12 changes: 12 additions & 0 deletions pyserini/prebuilt_index_info.py
@@ -101,6 +101,18 @@
"unique_terms": 2660824,
"downloaded": False
},
"msmarco-document-ltr": {
"description": "MS MARCO document corpus (4 extra preprocessed fields) used for LTR pipeline",
"filename": "index-msmarco-document-ltr-20211027-3e4c283.tar.gz",
"urls": [
"https://www.dropbox.com/s/5tr2otncs9rttbp/index-msmarco-document-ltr-20211027-3e4c283.tar.gz?dl=1" # too big for UWaterloo GitLab
],
"md5": "2718874ab44f6d383e84ad20f3790460",
"size compressed (bytes)": 46052436658,
"total_terms": 1232004740,
"documents": 20545628,
"downloaded": False
},
"msmarco-doc": {
"description": "Lucene index of the MS MARCO document corpus",
"filename": "index-msmarco-doc-20201117-f87c94.tar.gz",