Add MIRACL #198

Closed
Muennighoff opened this issue Jan 9, 2024 · 14 comments

@Muennighoff
Contributor

No description provided.

@KennethEnevoldsen added the enhancement label Mar 5, 2024
@imenelydiaker
Contributor

I may be able to do this. Is anyone working on it, @Muennighoff @KennethEnevoldsen?

Should I wait until PR #233 is merged? I think that PR addresses BEIR specifically, so there may be no need to wait.

Queries and qrels: https://huggingface.co/datasets/miracl/miracl
Corpus: https://huggingface.co/datasets/miracl/miracl-corpus

@Muennighoff
Contributor Author

Parts of MIRACL are already in MTEB, I think. Ideally we unify them so that there is just one MIRACL task from which the different datasets can be selected.

@imenelydiaker
Contributor

> Parts of MIRACL are already in MTEB, I think. Ideally we unify them so that there is just one MIRACL task from which the different datasets can be selected.

Yep, the Jina team added some of it (de and es) here: https://github.com/embeddings-benchmark/mteb/blob/main/mteb/tasks/Retrieval/MIRACLRetrieval.py
And the Korean version is here: https://github.com/embeddings-benchmark/mteb/blob/main/mteb/tasks/Retrieval/KoMiracl.py

The original dataset is here, with all these languages included:

Maybe we can just use the original dataset with all provided languages?
Do you prefer duplicating the data into an MTEB data repository to make sure it will always be available for the benchmark, or should we just load the dataset from its original HF repo and transform it on load?
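
The second option (load from the original repo, transform on load) could look roughly like the sketch below. It is only an illustration: the row fields (`query_id`, `query`, `positive_passages`, `negative_passages`, `docid`, `title`, `text`) are assumed to match the MIRACL dataset card, the output layout (corpus, queries, and qrels dictionaries) is an assumption about what a retrieval task expects, and `miracl_rows_to_retrieval_dicts` and `sample_rows` are hypothetical names. It also builds the corpus only from judged passages, whereas the full corpus lives in the separate `miracl/miracl-corpus` repository.

```python
from collections import defaultdict


def miracl_rows_to_retrieval_dicts(rows):
    """Convert MIRACL-style rows into corpus / queries / qrels dictionaries.

    Assumed (not verified here) row layout: each row has a query plus lists
    of positive and negative passages, each passage carrying a docid.
    """
    corpus, queries, qrels = {}, {}, defaultdict(dict)
    for row in rows:
        qid = row["query_id"]
        queries[qid] = row["query"]
        # Positives get relevance 1, negatives relevance 0.
        for label, field in ((1, "positive_passages"), (0, "negative_passages")):
            for passage in row.get(field, []):
                docid = passage["docid"]
                corpus[docid] = {
                    "title": passage.get("title", ""),
                    "text": passage["text"],
                }
                qrels[qid][docid] = label
    return corpus, queries, dict(qrels)


# Hypothetical in-memory rows standing in for one language split.
sample_rows = [
    {
        "query_id": "q1",
        "query": "example query",
        "positive_passages": [{"docid": "d1#0", "title": "T1", "text": "relevant"}],
        "negative_passages": [{"docid": "d2#0", "title": "T2", "text": "irrelevant"}],
    }
]

corpus, queries, qrels = miracl_rows_to_retrieval_dicts(sample_rows)
```

With the first option, the same transformation would be run once and the result uploaded to an MTEB-owned repository, so nothing needs converting at load time.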

@KennethEnevoldsen
Contributor

> Do you prefer duplicating the data into an MTEB data repository to make sure it will always be available for the benchmark, or should we just load the dataset from its original HF repo and transform it on load?

Generally, the first option is more robust, but we currently have multiple datasets that use the second one. The second option is also easier to update (if the dataset is updated).

For MIRACL, though, I would just go for option 1 (permissive license) as long as there are no planned updates.

@thakur-nandan
Contributor

Hi @KennethEnevoldsen @imenelydiaker @Muennighoff,

@crystina-z and I are co-authors of the MIRACL benchmark.

We just saw the announcement on Twitter/X of building out the multilingual MTEB, and I saw this issue is open.

We would be happy to help you integrate MIRACL into MMTEB. Please ask us directly if you have any questions.

Thanks,
Nandan

@KennethEnevoldsen
Contributor

Hi @thakur-nandan, very happy to have you guys on board. It seems MIRACL has been partly added in a few different ways (for only some languages, and both as a reranking task and as a retrieval task). You guys might be interested in unifying those and adding the missing languages, if you have the time of course. If you do not, I will mark this thread with a "help wanted" label.

@crystina-z
Contributor

Hi @KennethEnevoldsen sounds good and we'd love to help! I can take the reranking task and @thakur-nandan will handle the retrieval task. We'll start this week and get back as soon as we can.

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Apr 11, 2024

Wonderful @crystina-z and @thakur-nandan.

@Muennighoff
Contributor Author

Are you still interested in adding this? Would be amazing! 🙌 cc @crystina-z @thakur-nandan @imenelydiaker

@imenelydiaker
Contributor

@crystina-z @thakur-nandan if you haven't started yet I can take it from here 😊

@thakur-nandan
Contributor

@imenelydiaker Thanks for your help. @crystina-z and I have already started looking into both the reranking and retrieval tasks and should have the PRs ready soon!

Regards,
Nandan Thakur

@crystina-z mentioned this issue May 6, 2024
@crystina-z
Contributor

Hi all! I just submitted #641 for the reranking part. Let me know what you think!

@thakur-nandan
Contributor

Submitted #642 for the retrieval part. I have not been able to successfully reproduce the mE5-small nDCG@10 numbers.
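
For readers unfamiliar with the metric, here is a minimal sketch of nDCG@10 for a single query. It assumes linear gain (the relevance label itself) and a log2 position discount; evaluation tools may differ in gain function or tie handling, which is one possible source of mismatched numbers. The function names and the toy data are illustrative and not taken from MTEB or the MIRACL code.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (linear gain, log2 discount)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: doc ids in the order the system returned them.
    relevance: dict mapping doc id -> graded relevance (unjudged docs count as 0).
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


# Toy example: two relevant documents, one of them ranked third instead of second.
toy_relevance = {"d1": 1, "d3": 1}
toy_score = ndcg_at_k(["d1", "d2", "d3"], toy_relevance, k=10)
perfect_score = ndcg_at_k(["d1", "d3", "d2"], toy_relevance, k=10)
```

The benchmark-level number is then the mean of the per-query scores over all queries in a language split.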

@KennethEnevoldsen
Contributor

We are currently waiting for #833, which @imenelydiaker is working on, so I will add you to this issue as well.
