[NeurIPS 2023] Model-enhanced Vector Index (Paper)
[Option 1] Create conda environment:
conda env create -f environment.yml
conda activate mevi
[Option 2] Use docker:
docker pull hugozhl/nci:latest
[1] Download and preprocess:
bash dataprocess/msmarco_passage/download_data.sh
python dataprocess/msmarco_passage/prepare_origin.py \
--data_dir data/marco --origin
[2] Tokenize documents:
# tokenize for T5-ANCE and AR2
# T5-ANCE
python dataprocess/msmarco_passage/prepare_passage_tokenized.py \
--output_dir data/marco/ance \
--document_path data/marco/raw/corpus.tsv \
--dataset marco --model ance
rm data/marco/ance/all_document_indices_*.pkl
rm data/marco/ance/all_document_tokens_*.bin
rm data/marco/ance/all_document_masks_*.bin
# AR2
python dataprocess/msmarco_passage/prepare_passage_tokenized.py \
--output_dir data/marco/ar2 \
--document_path data/marco/raw/corpus.tsv \
--dataset marco --model ar2
rm data/marco/ar2/all_document_indices_*.pkl
rm data/marco/ar2/all_document_tokens_*.bin
rm data/marco/ar2/all_document_masks_*.bin
[3] Query generation for augmentation:
We used the docT5query checkpoint as in NCI. The QG data is only for training.
Please download the finetuned docT5query ckpt to data/marco/ckpts/doc2query-t5-base-msmarco
# MUST download the finetuned docT5query ckpt before running the scripts
python dataprocess/msmarco_passage/doc2query.py --data_dir data/marco
# if the qg data has bad quality, e.g. empty query or many duplicate queries, add another script below
python dataprocess/msmarco_passage/complement_qg10.py --data_dir data/marco # Optional
[4] Generate document embeddings and construct RQ
For T5-ANCE, please download T5-ANCE checkpoint to data/marco/ckpts/t5-ance
.
For AR2, please download AR2 checkpoint to data/marco/ckpts/ar2g_marco_finetune.pkl
and coCondenser checkpoint to data/marco/ckpts/co-condenser-marco-retriever
# MUST download the checkpoints before running the scripts
export DOCUMENT_ENCODER=ance
# export DOCUMENT_ENCODER=ar2 # use this line for ar2
bash MEVI/marco_generate_embedding_n_rq.sh
Train the RQ-based NCI.
export DOCUMENT_ENCODER=ance
# export DOCUMENT_ENCODER=ar2 # use this line for ar2
export WANDB_TOKEN="your wandb token"
bash MEVI/marco_train_nci_rq.sh
First generate query embeddings.
# for T5-ANCE
python MEVI/generate.py \
--query_file data/marco/origin/dev_mevi_dedup.tsv \
--model_path data/marco/ckpts/t5-ance \
--tokenizer_path data/marco/ckpts/t5-ance \
--query_embedding_path data/marco/ance/query_emb.bin \
--gpus 0,1,2,3,4,5,6,7 --gen_query
# for AR2
python MEVI/generate.py \
--query_file data/marco/origin/dev_mevi_dedup.tsv \
--model_path data/marco/ckpts/ar2g_marco_finetune.pkl \
--tokenizer_path bert-base-uncased \
--query_embedding_path data/marco/ar2/query_emb.bin \
--gpus 0,1,2,3,4,5,6,7 --gen_query
Then use faiss for ANN search.
# for T5-ANCE; if for AR2, change the ance directory to ar2 directory
python MEVI/faiss_search.py \
--query_path data/marco/ance/query_emb.bin \
--doc_path data/marco/ance/docemb.bin \
--output_path data/marco/ance/hnsw256.txt \
--raw_query_path data/marco/origin/dev_mevi_dedup.tsv \
--param HNSW256
Please download our checkpoint for MSMARCO Passage or train from scratch before evaluation, and put the checkpoint in data/marco/ckpts
. If using the downloaded checkpoint, please also download the corresponding RQ files.
# MUST download or train a ckpt before running the scripts
export DOCUMENT_ENCODER=ance
# export DOCUMENT_ENCODER=ar2 # use this line for ar2
bash MEVI/marco_eval_nci_rq.sh
Ensemble the results from the twin-tower model and the sequence-to-sequence model.
export DOCUMENT_ENCODER=ance
# export DOCUMENT_ENCODER=ar2 # use this line for ar2
bash MEVI/marco_ensemble.sh
[1] Download and preprocess:
bash dataprocess/NQ_dpr/download_data.sh
python dataprocess/NQ_dpr/preprocess.py --data_dir data/nq_dpr
[2] Tokenize documents:
# use AR2
python dataprocess/NQ_dpr/tokenize_passage_ar2.py \
--output_dir data/nq_dpr \
--document_path data/nq_dpr/corpus.tsv
rm data/nq_dpr/all_document_indices_*.pkl
rm data/nq_dpr/all_document_tokens_*.bin
rm data/nq_dpr/all_document_masks_*.bin
[3] Query generation for augmentation:
We used the docT5query checkpoint as in NCI. The QG data is only for training. Please refer to the QG section for MSMARCO Passage.
# download finetuned docT5query ckpt to data/marco/ckpts/doc2query-t5-base-msmarco
python dataprocess/NQ_dpr/doc2query.py \
--data_dir data/nq_dpr --n_gen_query 1 \
--ckpt_path data/marco/ckpts/doc2query-t5-base-msmarco
[4] Generate document embeddings and construct RQ
Please download AR2 checkpoint to data/marco/ckpts/ar2g_nq_finetune.pkl
and ERNIE checkpoint to data/marco/ckpts/ernie-2.0-base-en
# MUST download the checkpoints before running the scripts
bash MEVI/nqdpr_generate_embedding_n_rq.sh
[5] Tokenize query
Since NQ has too many augmented queries, to eliminate runtime memory usage, we tokenize query to enable memmap.
python dataprocess/NQ_dpr/tokenize_query.py \
--output_dir data/nq_dpr \
--tok_train 1 --tok_corpus 1 --tok_qg 1
[6] Get answers
We sort the answers for fast evaluation. (Time-consuming! Please download the processed binary files if necessary.)
python dataprocess/NQ_dpr/get_answers.py \
--data_dir data/nq_dpr \
--dev 1 --test 1
python dataprocess/NQ_dpr/get_inverse_answers.py \
--data_dir data/nq_dpr \
--dev 1 --test 1
Train the RQ-based NCI.
export WANDB_TOKEN="your wandb token"
bash MEVI/nqdpr_train_nci_rq.sh
First generate query embeddings.
python MEVI/generate.py \
--query_file data/nq_dpr/nq-test.qa.csv \
--model_path data/marco/ckpts/ar2g_nq_finetune.pkl \
--tokenizer_path bert-base-uncased \
--query_embedding_path data/nq_dpr/query_emb.bin \
--gpus 0,1,2,3,4,5,6,7 --gen_query
Then use faiss for ANN search.
python MEVI/faiss_search.py \
--query_path data/nq_dpr/query_emb.bin \
--doc_path data/nq_dpr/docemb.bin \
--output_path data/nq_dpr/hnsw256.txt \
--raw_query_path data/nq_dpr/nq-test.qa.csv \
--param HNSW256
Please download our checkpoint for NQ or train from scratch before evaluation, and put the checkpoint in data/marco/ckpts
. If using the downloaded checkpoint, please also download the corresponding RQ files.
# MUST download or train a ckpt before running the scripts
bash MEVI/nqdpr_eval_nci_rq.sh
Ensemble the results from the twin-tower model and the sequence-to-sequence model.
bash MEVI/nqdpr_ensemble.sh
If you find this work useful, please cite our paper.
We learned a lot and borrowed some codes from the following projects when building MEVI.