This guide explains how to perform retrieval using the Pyserini library. Depending on your use case (retrieving from a local index or from the MSMARCOv2 index), you’ll use either retrieve_local.sh or retrieve_ms2.sh.
Pyserini simplifies retrieval by providing prebuilt indices, so you do not need to train your own retrieval models and can focus on using and analyzing retrieval results. After obtaining retrieval results, you can enrich them with raw text and other metadata.
- Install Required Libraries:
  - FAISS and PyTorch: See here for more reference. The example below is for CUDA 11.4.
      conda create --name faiss_1.7.4 python=3.10
      conda activate faiss_1.7.4
      conda install faiss-gpu=1.7.4 mkl=2021 pytorch pytorch-cuda numpy -c pytorch -c nvidia
  - Java (for Lucene):
      conda install -c conda-forge openjdk=11 maven -y
  - Pyserini:
      pip install pyserini
- Set Environment Variables: If you encounter RuntimeError: Unable to find libjvm.so, set JAVA_HOME to the Java installation that contains libjvm.so:
    export JAVA_HOME=/path/to/java/home
- Input Dataset Format: The input dataset file should be a JSON file containing a list of dictionaries. Each dictionary must have a "question" key whose value is the query to search for. For example:
    [
      {"question": "What is the capital of France?", "other_key": "value"},
      {"question": "Who won the first Nobel Prize in Physics?", "other_key": "value"}
    ]
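Before running either script, it can help to confirm that a dataset file matches the layout described above. The following is a minimal sketch, not part of the provided scripts; the function name load_questions is hypothetical.

```python
import json
from pathlib import Path


def load_questions(dataset_path):
    """Load a dataset file and return its questions (hypothetical helper).

    Raises ValueError if the file is not a JSON list of dictionaries
    that each contain a "question" key.
    """
    with Path(dataset_path).open(encoding="utf-8") as f:
        records = json.load(f)

    if not isinstance(records, list):
        raise ValueError("dataset must be a JSON list of dictionaries")

    questions = []
    for i, record in enumerate(records):
        if not isinstance(record, dict) or "question" not in record:
            raise ValueError(f"entry {i} is missing the 'question' key")
        questions.append(record["question"])
    return questions
```

Extra keys such as "other_key" are ignored by this check, so they can carry answers or other metadata.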
- Local Index Retrieval: Use retrieve_local.sh for custom datasets and local FAISS indices.
- MSMARCOv2 Index Retrieval: Use retrieve_ms2.sh for prebuilt MSMARCOv2 Lucene indices.
Use retrieve_local.sh to retrieve results from a local FAISS index for custom datasets.
bash retrieve_local.sh <corpus_name> <dataset_type> <dataset_path> <dataset_name>
- <corpus_name>: Index name (e.g., wiki, web, or wiki-web).
- <dataset_type>: Dataset source (local or pyserini).
- <dataset_path>: Path to the dataset (for local datasets).
- <dataset_name>: Name of the dataset (e.g., msmarcoqa).
bash retrieve_local.sh wiki local ../data/custom_dataset.json custom-dataset
This script:
- Converts the raw dataset to a topics file.
- Runs FAISS retrieval on the local index.
- Converts retrieval results to DPR format.
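The first step, converting the raw dataset to a topics file, might look roughly like the sketch below. It assumes a tab-separated topics file with one "id, tab, query" line per question, a layout Pyserini can read. The exact format written by retrieve_local.sh is not shown in this guide, and the function name write_topics and the use of list positions as query ids are assumptions.

```python
def write_topics(questions, topics_path):
    """Write one `id<TAB>query` line per question (hypothetical helper).

    Query ids are the 0-based positions in `questions` (an assumption).
    Returns the number of topics written.
    """
    count = 0
    with open(topics_path, "w", encoding="utf-8") as out:
        for qid, question in enumerate(questions):
            # Collapse tabs and newlines so each query stays on one TSV line.
            query = " ".join(str(question).split())
            out.write(f"{qid}\t{query}\n")
            count += 1
    return count
```

Using positions as ids keeps the mapping back to the original dataset entries trivial when the results are converted later.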
Use retrieve_ms2.sh to retrieve results from the MSMARCOv2 Lucene index.
bash retrieve_ms2.sh <dataset_short_name>
- <dataset_short_name>: Name of the dataset (e.g., nq-test, msmarcoqa, hotpot).
bash retrieve_ms2.sh nq-test
This script:
- Converts the dataset to a topics file.
- Performs BM25 retrieval using the MSMARCOv2 Lucene index.
- Converts the retrieval results to DPR format.
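The last step, converting retrieval results to DPR format, can be sketched as follows. This assumes the retrieval output is a TREC-style run file (six whitespace-separated columns: query id, Q0, document id, rank, score, tag) and that the DPR-style output is a list of entries with "question" and "ctxs" keys. The script's actual input and output fields are not shown in this guide, and the function name run_to_dpr is hypothetical.

```python
from collections import defaultdict


def run_to_dpr(run_lines, questions):
    """Group TREC-style run lines into DPR-style entries (hypothetical helper).

    run_lines: iterable of "qid Q0 docid rank score tag" strings.
    questions: dict mapping query id (str) to question text.
    """
    hits = defaultdict(list)
    for line in run_lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, _q0, docid, rank, score, _tag = parts
        hits[qid].append((int(rank), docid, float(score)))

    results = []
    for qid, question in questions.items():
        ranked = sorted(hits.get(qid, []))  # order passages by rank
        results.append({
            "question": question,
            # Only id and score are known here; raw text and other
            # metadata would be attached in a later step.
            "ctxs": [{"id": docid, "score": score} for _rank, docid, score in ranked],
        })
    return results
```

Questions with no retrieved passages still get an entry with an empty "ctxs" list, so the output stays aligned with the input dataset.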
- RuntimeError: Unable to find libjvm.so: Ensure Java is installed and JAVA_HOME is set.
- Missing files or directories: Verify that all dataset and index paths are correct.
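To check the libjvm.so issue quickly, the sketch below searches a Java installation directory for the library. It is an illustration only and not part of the provided scripts. The function name find_libjvm is hypothetical, and the location of libjvm.so inside JAVA_HOME varies between Java distributions, which is why the whole directory tree is searched.

```python
import os
from pathlib import Path


def find_libjvm(java_home=None):
    """Return the path to libjvm.so under java_home, or None (hypothetical helper).

    Falls back to the JAVA_HOME environment variable when no path is given.
    """
    java_home = java_home or os.environ.get("JAVA_HOME")
    if not java_home:
        return None
    root = Path(java_home)
    if not root.is_dir():
        return None
    # The library's subdirectory differs between distributions, so search recursively.
    matches = sorted(root.rglob("libjvm.so"))
    return matches[0] if matches else None
```

If this returns None for your JAVA_HOME, the variable is probably unset or points at the wrong directory.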