
Dense Index of MARCO V2 and KILT #974

Answered by MXueguang
paulowoicho asked this question in Q&A

Hi @paulowoicho,
The OOM here is probably caused by loading the entire corpus into memory at the start, before sharding.
I would suggest splitting the raw collection in advance,
i.e. use the Linux CLI tool split to break converted_collection into multiple smaller files,
then run the encoding on each file separately, for example:

python3 -m pyserini.encode input --corpus converted_collection/split00.jsonl \
                            --fields url title text \
                            output --embeddings indexes/dense_0/split00 \
                            --to-faiss \
                            encoder --encoder castorini/ance-msmarco-doc-maxp \
                            --fields url title text \
 …
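
For the splitting step suggested above, here is a minimal sketch. It assumes GNU coreutils split (the -d and --additional-suffix options are not available in every split implementation) and that the collection is a single JSONL file with one document per line. The function name split_collection, the paths, and the chunk size are hypothetical, chosen only for illustration.

```shell
# split_collection INPUT OUT_DIR LINES_PER_FILE
# Hypothetical helper: splits a JSONL file on line boundaries into
# OUT_DIR/split00.jsonl, OUT_DIR/split01.jsonl, ...
# Assumes GNU split; splitting by line count keeps each JSON document intact.
split_collection() {
    local input="$1" out_dir="$2" lines_per_file="$3"
    mkdir -p "$out_dir"
    # -l: lines per output file
    # -d: numeric suffixes (00, 01, ...) instead of aa, ab, ...
    # --additional-suffix: keep the .jsonl extension on each piece
    split -l "$lines_per_file" -d --additional-suffix=.jsonl \
        "$input" "$out_dir/split"
}
```

A call might look like `split_collection collection.jsonl converted_collection 1000000`, where the input file name and the line count are placeholders; pick a count small enough that one piece fits comfortably in memory. Each resulting file can then be passed to --corpus in its own encoding run, with a separate --embeddings output directory per piece.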

Answer selected by paulowoicho

This discussion was converted from issue #973 on January 29, 2022 21:24.