This README refers to the old version (v1) of mMARCO. We highly recommend you use the latest dataset and models from the main README.md
A multilingual version of MS MARCO passage ranking dataset
This repository presents a neural machine translation-based method for translating the MS MARCO passage ranking dataset. The code available here is the same used in our paper mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset.
As described in our work, we made available the MS MARCO passage ranking dataset translated to 8 languages (Chinese, French, German, Indonesian, Italian, Portuguese, Russian and Spanish). The translated passages collection and the queries set (training and validation) are available at https://huggingface.co/datasets/unicamp-dl/mmarco/tree/main/data/v1.1.
Our available fine-tuned models are:
Model | Description | MRR@10* |
---|---|---|
ptT5-base-pt-msmarco | a PTT5 model fine-tuned on Portuguese MS MARCO | 0.188 |
ptT5-base-en-pt-msmarco | a PTT5 model fine-tuned on English and Portuguese MS MARCO | 0.343 |
mT5-base-en-pt-msmarco | a mT5 model fine-tuned on both English and Portuguese MS MARCO | 0.375 |
mT5-base-multi-msmarco | a mT5 model fine-tuned on mMARCO | 0.366 |
mMiniLM-pt-msmarco | a mMiniLM model fine-tuned on Portuguese MS MARCO | - |
mMiniLM-en-pt-msmarco | a mMiniLM model fine-tuned on both English and Portuguese MS MARCO | 0.375 |
mMiniLM-multi-msmarco | a mMiniLM model fine-tuned on mMARCO | 0.363 |
* MRR@10 on English MS MARCO
We translate MS MARCO passage ranking dataset, a large-scale IR dataset comprising more than half million anonymized questions that were sampled from Bing's search query logs.
To translate the MS MARCO dataset, we use MarianNMT an open-source neural machine translation framework originally written in C++ for fast training and translation. The Language Technology Research Group at the University of Helsinki made available more than a thousand language pairs for translation, supported by HuggingFace framework.
In order to allow other users to translate the MS MARCO passage ranking dataset to other languages (or a dataset of your own will), we provide the translate.py
script. This script expects a .tsv file, in which each line follows a document_id \t document_text
format.
python translate.py --model_name_or_path Helsinki-NLP/opus-mt-{src}-{tgt} --target_language tgt_code--input_file collection.tsv --output_dir translated_data/
After translating, it is necessary to reassemble the file, as the documents were split into sentences.
python create_translated_collection.py --input_file translated_data/translated_file --output_file translated_{tgt}_collection
Translating the entire passages collection of MS MARCO took about 80 hours using a Tesla V100.
The steps reported here are the same used for any language from mMARCO.
Using pygaggle scripts, we convert the mMARCO Portuguese collection into json files:
python pygaggle/tools/scripts/msmarco/convert_collection_to_jsonl.py \
--collection-path path/to/portuguese_collection.tsv \
--output-folder collections/portuguese-msmarco-passage/collection_jsonl
Indexing using Pyserini
Now we can index the Portuguese collection using Pyserini:
python -m pyserini.index -collection JsonCollection \
-generator DefaultLuceneDocumentGenerator \
-threads 1 -input collections/portuguese-msmarco-passage/collection_jsonl/ \
-index indexes/portuguese-lucene-index-msmarco \
-storePositions -storeDocvectors -storeRaw -language portuguese
As the original English set, the built index should have 8,841,823 documents.
Using a pygaggle script, we select only the queries that are in the qrels file:
python pygaggle/tools/scripts/msmarco/filter_queries.py \
--qrels path/to/qrels.dev.small.tsv \
--queries path/to/portuguese_queries.dev.tsv \
--output collections/portuguese-msmarco-passage/portuguese_queries.dev.small.tsv
This script results a file with 6980 queries. Now we can retrieve from our index:
python -m pyserini.search --topics collections/portuguese-msmarco-passage/portuguese_queries.dev.small.tsv \
--index indexes/portuguese-lucene-index-msmarco --language portuguese \
--output runs/run.portuguese-msmarco-passage.dev.small.tsv \
--bm25 --output-format msmarco --hits 1000 --k1 0.82 --b 0.68
Using the official MS MARCO evaluation script:
python pygaggle/tools/scripts/msmarco/msmarco_passage_eval.py \
path/to/qrels.dev.small.tsv runs/run.portuguese-msmarco-passage.dev.small.tsv
The output should be like:
#####################
MRR @10: 0.14122873743575773
QueriesRanked: 6980
#####################
Finally, we can re-rank our BM25 initial run using ptt5-base-en-pt-msmarco-10k (or each one of the previous listed models):
python reranker.py --model_name_or_path=unicamp-dl/ptt5-base-en-pt-msmarco-10k \
--initial_run runs/run.portuguese-msmarco-passage.dev.small.tsv \
--corpus path/to/portuguese_collection.tsv --queries portuguese_queries.dev.small.tsv \
--output_run runs/run.mt5-reranked-portuguese-msmarco-passage.dev.small.tsv
Using the official MS MARCO evaluation script to evaluate the re-ranked results:
python pygaggle/tools/scripts/msmarco/msmarco_passage_eval.py \
path/to/qrels.dev.small.tsv runs/run.mt5-reranked-portuguese-msmarco-passage.dev.small.tsv
The output should be like:
#####################
MRR @10: 0.2832968344931086
QueriesRanked: 6980
#####################
If you extend or use this work, please cite the paper where it was introduced:
@misc{bonifacio2021mmarco,
title={mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset},
author={Luiz Henrique Bonifacio and Israel Campiotti and Roberto Lotufo and Rodrigo Nogueira},
year={2021},
eprint={2108.13897},
archivePrefix={arXiv},
primaryClass={cs.CL}
}