Add MIRACL #198

Muennighoff · 2024-01-09T08:29:51Z

No description provided.

imenelydiaker · 2024-03-05T21:25:29Z

I may be able to do this. Is anyone working on it, @Muennighoff @KennethEnevoldsen?

Should I wait until PR #233 is merged? I think that PR addresses BEIR specifically, so there may be no need to wait for it.

Queries and qrels: https://huggingface.co/datasets/miracl/miracl
Corpus: https://huggingface.co/datasets/miracl/miracl-corpus

Muennighoff · 2024-03-05T23:53:14Z

Parts of MIRACL are already in MTEB, I think. Ideally we unify them so that there is just one MIRACL task from which the different datasets can be selected.

imenelydiaker · 2024-03-06T08:39:41Z

> Parts of MIRACL are already in MTEB, I think. Ideally we unify them so that there is just one MIRACL task from which the different datasets can be selected.

Yep, the Jina team added some of it (de and es) here: https://github.com/embeddings-benchmark/mteb/blob/main/mteb/tasks/Retrieval/MIRACLRetrieval.py
And the Korean version is here: https://github.com/embeddings-benchmark/mteb/blob/main/mteb/tasks/Retrieval/KoMiracl.py

The original dataset is here, with all these languages included:

Queries and qrels: https://huggingface.co/datasets/miracl/miracl
Corpus: https://huggingface.co/datasets/miracl/miracl-corpus

Maybe we can just use the original dataset with all provided languages?
Do you prefer duplicating the data into an MTEB data repository to make sure it will always be available for the benchmark, or should we just load the dataset from its original HF repo and transform it on load?

KennethEnevoldsen · 2024-03-06T10:19:31Z

> Do you prefer duplicating the data into an MTEB data repository to make sure it will always be available for the benchmark, or should we just load the dataset from its original HF repo and transform it on load?

Generally, the first option is more robust, but we currently have multiple datasets that use the second one. The second option is also easier to keep up to date (if the dataset is updated).

For MIRACL, though, I would just go for option 1 (permissive license), as long as there are no planned updates.
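
For reference, the "transform on load" option could look roughly like the sketch below. It is only an illustration under stated assumptions: the field names (`query_id`, `query`, `positive_passages`, `negative_passages`, `docid`) are assumed to match the records in the miracl/miracl repo and should be checked against the dataset card, the helper name `miracl_to_retrieval` is hypothetical, and the toy records are made up so the snippet runs without downloading anything.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical helper: turn MIRACL-style records into the two dictionaries a
# retrieval evaluation usually needs (queries and relevance judgments).
# The record field names are assumptions; verify them against the dataset card.
def miracl_to_retrieval(
    records: Iterable[dict],
) -> Tuple[Dict[str, str], Dict[str, Dict[str, int]]]:
    queries: Dict[str, str] = {}
    relevant_docs: Dict[str, Dict[str, int]] = {}
    for rec in records:
        qid = rec["query_id"]
        queries[qid] = rec["query"]
        judged = relevant_docs.setdefault(qid, {})
        # Judged-relevant passages get label 1.
        for passage in rec.get("positive_passages", []):
            judged[passage["docid"]] = 1
        # Judged-non-relevant passages get label 0, without overwriting a positive.
        for passage in rec.get("negative_passages", []):
            judged.setdefault(passage["docid"], 0)
    return queries, relevant_docs


# Made-up records for illustration only (not real MIRACL data).
toy_records = [
    {
        "query_id": "q1",
        "query": "example question",
        "positive_passages": [{"docid": "d1#0", "title": "T1", "text": "..."}],
        "negative_passages": [{"docid": "d2#3", "title": "T2", "text": "..."}],
    },
]

queries, relevant_docs = miracl_to_retrieval(toy_records)
print(queries)
print(relevant_docs)
```

With option 1, the same transformation would be run once and the result uploaded to an MTEB-owned repository, rather than being applied every time the task loads.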

thakur-nandan · 2024-04-11T17:05:53Z

Hi @KennethEnevoldsen @imenelydiaker @Muennighoff,

@crystina-z and I are co-authors of the MIRACL benchmark.

We just saw the announcement on Twitter/X of building out the multilingual MTEB, and I saw this issue is open.

We would be happy to help you to integrate MIRACL within MMTEB. Please ask us directly if you have any questions.

Thanks,
Nandan

KennethEnevoldsen · 2024-04-11T17:23:48Z

Hi @thakur-nandan, very happy to have you guys on board. It seems MIRACL has been partly added in a few different ways (for only some languages, and both as a reranking task and as a retrieval task). Would you be interested in unifying those and adding the missing languages? Only if you have the time, of course. If not, I will mark this thread as "help wanted".

crystina-z · 2024-04-11T17:41:09Z

Hi @KennethEnevoldsen, sounds good and we'd love to help! I can take the reranking task and @thakur-nandan will handle the retrieval task. We'll start this week and get back to you as soon as we can.

KennethEnevoldsen · 2024-04-11T20:11:26Z

Wonderful @crystina-z and @thakur-nandan.

Muennighoff · 2024-05-01T15:30:23Z

Are you still interested in adding this? Would be amazing! 🙌 cc @crystina-z @thakur-nandan @imenelydiaker

imenelydiaker · 2024-05-01T15:50:35Z

@crystina-z @thakur-nandan if you haven't started yet I can take it from here 😊

thakur-nandan · 2024-05-01T22:07:18Z

@imenelydiaker Thanks for your help. @crystina-z and I have already started looking into both the reranking and retrieval tasks and should have the PRs up soon!

Regards,
Nandan Thakur

crystina-z · 2024-05-06T22:02:14Z

Hi all! I just submitted #641 for the reranking part. Let me know what you think!

thakur-nandan · 2024-05-06T23:08:08Z

Submitted #642 for the retrieval part. I have not been able to successfully reproduce the mE5-small nDCG@10 numbers.
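
For anyone comparing numbers, here is a minimal sketch of how nDCG@10 can be computed for a single query. It assumes linear gain and a log2(rank + 1) discount, which may differ in detail from the implementation the benchmark uses; the function name `ndcg_at_k` and the toy document IDs are hypothetical.

```python
import math
from typing import Dict, List

def ndcg_at_k(ranked_doc_ids: List[str], relevance: Dict[str, int], k: int = 10) -> float:
    """nDCG@k for one query, assuming linear gain and a log2(rank + 1) discount."""
    # Gains of the top-k retrieved documents; unjudged documents count as 0.
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    # Ideal ordering: all judged documents sorted by relevance, truncated to k.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: two relevant documents, retrieved at ranks 2 and 3.
toy_relevance = {"d1": 1, "d3": 1}
score = ndcg_at_k(["d2", "d1", "d3"], toy_relevance)
print(round(score, 4))
```

Averaging this value over all queries gives the per-dataset score. Differences in tie handling, gain function, or how unjudged documents are treated are common reasons why reported numbers fail to match.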

KennethEnevoldsen · 2024-06-05T18:28:44Z

We are currently waiting for #833, which @imenelydiaker is working on, so I will add you to this issue as well.

KennethEnevoldsen added the "enhancement" (New feature or request) label Mar 5, 2024

izhx mentioned this issue Apr 13, 2024: Aggregating MMTEB datasets #354 (Open)

imenelydiaker mentioned this issue Apr 24, 2024: Add miracl fr reranking dataset #552 (Closed, 10 tasks)

crystina-z mentioned this issue May 6, 2024: MIRACL reranking #641 (Merged, 10 tasks)

thakur-nandan mentioned this issue May 6, 2024: Adding MIRACL Retrieval #642 (Merged, 10 tasks)

KennethEnevoldsen assigned imenelydiaker Jun 5, 2024

isaac-chung closed this as completed Aug 14, 2024
