Add MIRACL #198
Comments
I may be able to do this. Is anyone working on it, @Muennighoff @KennethEnevoldsen? Should I wait until PR #233 is merged? I think that PR addresses BEIR specifically, so maybe there's no need to wait until it's merged. Queries and qrels: https://huggingface.co/datasets/miracl/miracl
Parts of MIRACL are already in MTEB, I think. Ideally we unify them so that there's just one MIRACL task from which the different datasets can be selected.
Yep, the Jina team added some of it (de and es) here: https://github.com/embeddings-benchmark/mteb/blob/main/mteb/tasks/Retrieval/MIRACLRetrieval.py The original dataset is here, with all these languages included:
Maybe we can just use the original dataset with all provided languages?
Generally, the first option is more robust, but we currently have multiple datasets that use the second one. The second option is also easier to update (if the dataset is updated). For MIRACL, though, I would just go for option 1 (permissive license) as long as there are no planned updates.
Hi @KennethEnevoldsen @imenelydiaker @Muennighoff, @crystina-z and I are co-authors of the MIRACL benchmark. We just saw the announcement on Twitter/X about building out the multilingual MTEB, and I saw that this issue is open. We would be happy to help you integrate MIRACL within MMTEB. Please ask us directly if you have any questions. Thanks,
Hi @thakur-nandan, very happy to have you guys on board. It seems MIRACL is partly added in a few different ways (partially for some languages, and as both a reranking task and a retrieval task). You guys might be interested in unifying those and adding the missing languages? If you have the time, of course. If you do not, I will mark this thread with a "help wanted".
Hi @KennethEnevoldsen, sounds good and we'd love to help! I can take the reranking task and @thakur-nandan will handle the retrieval task. We'll start this week and get back as soon as we can.
Wonderful @crystina-z and @thakur-nandan.
Are you still interested in adding this? Would be amazing! 🙌 cc @crystina-z @thakur-nandan @imenelydiaker
@crystina-z @thakur-nandan if you haven't started yet I can take it from here 😊
@imenelydiaker Thanks for your help. @crystina-z and I have already started looking into both the reranking and retrieval tasks and should have the PR soon! Regards,
Hi all! I just submitted #641 for the reranking part. lmk what you think!
Submitted #642 for the retrieval part. I have not been able to successfully reproduce the mE5-small nDCG@10 numbers.
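For readers unfamiliar with the metric mentioned above, here is a minimal sketch of how nDCG@10 can be computed for a single query. It is not the evaluation code used by MTEB or by the MIRACL authors; the function names and the toy qrels are hypothetical, and it assumes the linear-gain formulation (relevance divided by log2 of rank + 1), which is only one of the common variants.

```python
import math


def dcg_at_k(relevances, k=10):
    # Linear-gain DCG: the item at 0-based position i contributes rel / log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    # qrels maps doc id -> graded relevance for ONE query (hypothetical layout).
    # Documents missing from qrels are treated as non-relevant (gain 0).
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example: two relevant documents, one of them retrieved at rank 2.
toy_qrels = {"d1": 1, "d2": 1}
score = ndcg_at_k(["d3", "d1"], toy_qrels)
```

Small differences in conventions (gain function, how unjudged documents are handled, truncation of the ideal ranking) can shift scores, which is one of several possible reasons reported numbers may be hard to match.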
We are currently waiting for #833, which is being worked on by @imenelydiaker, so I will add you to this issue as well.