Added support for Scandinavian Languages #124

KennethEnevoldsen · 2023-07-24T20:00:49Z

Addition

Added all relevant tasks for Da, No, and Sv I could find. Norwegian includes Norwegian Bokmål and Norwegian Nynorsk. Danish also includes a bitext between Danish and one of its dialects.
Added a minor fix to the Bitext task

Task selection:

Wherever possible I try to stay aligned with ScandEval, i.e. where two versions of a dataset exist, I use the same one as ScandEval.
I do not include machine-translated datasets (this excludes the QA datasets in ScandEval)
Where possible I use the test split. If there are no splits I use the train split.
I do not include non-public datasets such as DaNewsroom
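The split rule above can be sketched as a small helper. This is only an illustration under stated assumptions: `choose_split` is a hypothetical name and not part of MTEB, and a dataset "without splits" is assumed to expose a single `train` split.

```python
def choose_split(available_splits):
    """Pick the evaluation split: prefer "test", otherwise fall back to "train".

    Hypothetical helper for illustration; not part of MTEB.
    """
    if "test" in available_splits:
        return "test"
    if "train" in available_splits:
        # Assumption: datasets published without explicit splits expose only "train".
        return "train"
    raise ValueError(f"No usable split among: {sorted(available_splits)}")


# Example: a dataset with a test split vs. one that only has train.
split_with_test = choose_split(["train", "validation", "test"])
split_train_only = choose_split(["train"])
```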

Potential problems

I can't get non-zero performance on the retrieval task; see "Adding a new retrieval task" #122. (A simple solution is to remove it for the first PR.)
Currently, sentiment classification tasks are overrepresented.

Other

I have added the dataset_transform method. It might be worth making it a general method in AbsTask. That way, the loading workflow is independent of the transformation workflow (which would allow you to e.g. add splits without having to fix the custom transformations of each dataset). Currently, it is not implemented in the main function.
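To make the idea concrete, here is a minimal sketch of how loading and transformation could be kept separate. It is not the actual AbsTask code: `SketchTask`, `RenameColumnsTask`, the `raw_loader` argument, and the column names are all hypothetical, and the hook is called `dataset_transform` here purely for illustration.

```python
class SketchTask:
    """Hypothetical stand-in for a task base class; not MTEB's AbsTask."""

    def __init__(self, raw_loader):
        # raw_loader: a callable returning {split_name: list of row dicts}
        self._raw_loader = raw_loader
        self.dataset = None
        self.data_loaded = False

    def load_data(self):
        """Generic loading workflow, shared by all tasks."""
        if self.data_loaded:
            return
        self.dataset = self._raw_loader()
        self.dataset_transform()  # task-specific hook, run once after loading
        self.data_loaded = True

    def dataset_transform(self):
        """Default hook: leave the dataset unchanged."""


class RenameColumnsTask(SketchTask):
    """Example task that only customises the transformation step."""

    def dataset_transform(self):
        # Map dataset-specific column names onto the names the evaluator expects.
        self.dataset = {
            split: [{"text": row["sentence"], "label": row["sentiment"]} for row in rows]
            for split, rows in self.dataset.items()
        }


task = RenameColumnsTask(lambda: {"train": [{"sentence": "god film", "sentiment": 1}]})
task.load_data()
```

With this layout, a change to the loading step (for example which splits are fetched) would not require touching each task's custom transformation.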

Fixes #126

Muennighoff

Amazing work!

1. I like the dataset_transform. I think we can leave it like you did for now & maybe we'll make it a generic function later
2. We still need to import all tasks in the __init__.py files for each task directory I think
3. Why did you choose 16 samples_per_label for all Classification tasks? @NouamaneTazi @loicmagne do you remember how we selected the different samples_per_label values for CLF tasks? Was it based on how big the dataset is / the number of labels?
4. Can you run a few models on the new tasks and share the result files here? I can add a new leaderboard tab for some of these languages where we have a few datasets. Maybe three new CLF tabs for Danish, Norwegian, Swedish? Or is it more useful if it's just one new CLF tab for "Scandinavia"?
5. Do we have results on SweFAQRetrieval from prior work & are we able to reproduce it?

KennethEnevoldsen · 2023-07-25T16:41:50Z

1. Sure, will keep it as is.
2. ~~Will get on that later today~~ Done
3. It was arbitrary. I can choose based on similarly sized datasets? (Edit: Most of the datasets are quite small. I changed the largest one to 32, but kept the rest at 16.)
4. Sure thing. I have already done it, but will rerun it with the update from point 3. I would use Danish, Norwegian, and Swedish. Potentially you could also add Scandinavian as a meaningful multilingual group (but it shouldn't replace the individual languages).
5. The Swedish sentence transformer has results here. They obtain an accuracy of 50-70 (I think they might have framed it as a binary classification task between a given potential answer and the answer). I have tried their model as well (got a score of 0), which seems to indicate that the error is on my end.
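For readers unfamiliar with samples_per_label: as I understand it, the value caps how many training examples per label are drawn before a classifier is fitted on the embeddings. The sketch below illustrates that kind of undersampling under this assumption; `undersample` and its parameters are hypothetical and not the benchmark's implementation.

```python
import random
from collections import defaultdict


def undersample(texts, labels, samples_per_label, seed=42):
    """Keep at most `samples_per_label` randomly chosen examples per label.

    Hypothetical helper for illustration; labels with fewer examples keep all of them.
    """
    rng = random.Random(seed)
    indices = list(range(len(texts)))
    rng.shuffle(indices)

    counts = defaultdict(int)
    kept_texts, kept_labels = [], []
    for i in indices:
        if counts[labels[i]] < samples_per_label:
            kept_texts.append(texts[i])
            kept_labels.append(labels[i])
            counts[labels[i]] += 1
    return kept_texts, kept_labels


# Toy data: 100 examples spread over 3 labels.
toy_texts = [f"example {i}" for i in range(100)]
toy_labels = [i % 3 for i in range(100)]
small_texts, small_labels = undersample(toy_texts, toy_labels, samples_per_label=16)
```

With small datasets, a low value such as 16 keeps the few-shot setup comparable across tasks, while a larger dataset can afford a higher value such as 32.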

Muennighoff · 2023-07-27T07:45:29Z

1. 2. 3. Sounds great!
4. Nice let me know when you have the files! Then I will add them to the LB & we can merge this 👍
5. Maybe let's remove this task for now then like you suggested? We can add it later if we manage to fix the bug 👍

KennethEnevoldsen · 2023-07-28T02:50:39Z

Perfect. I have sent the results by mail.

I have also removed the task and fixed #126 (which was a good thing, as it revealed a few errors). Assuming you agree, I think this is good to merge.

Muennighoff

LGTM; Have added all tasks to the leaderboard: https://huggingface.co/spaces/mteb/leaderboard

Let me know if the leaderboard looks okay to you? & then will merge 👍

KennethEnevoldsen · 2023-07-28T23:22:59Z

The only change I would make is to change the name of Bitext from other to Danish (it is Danish + a Danish dialect). Otherwise, I think it looks good!

Edit: Actually, if you wish, I am creating an aggregated site for the Scandinavian subsection here (still working on it). Feel free to link to it. I plan to add Finnish, Icelandic and Faroese in the future as well (and to add them to MTEB).

Edit: Oh it seems like ScalaNbClassification is in Swedish instead of Norwegian

Muennighoff · 2023-07-29T07:05:34Z

> The only change I would make is to change the name of Bitext from other to Danish (it is Danish + a Danish dialect). Otherwise, I think it looks good!
>
> Edit: Actually if you wish I am creating an aggregated site for the Scandinavian subsection here (still working on it). Feel free to link to it. Plan to also add Finnish, Icelandic and Faroese as well in the future (as well as adding them to MTEB).
>
> Edit: Oh it seems like ScalaNbClassification is in Swedish instead of Norwegian

Fixed & added the link! If you want to link it in a different way, let me know - You can also edit the app.py of the leaderboard directly if you want to.

Also FYI all your scores are in this repository: https://huggingface.co/datasets/mteb/results
I renamed some of them to account for the name change in SweRecClassification.

Merging now 🚀

KennethEnevoldsen added 3 commits July 24, 2023 12:53

Make sure that main score is added to bitext mining tasks

a2fe7c6

Added scandinavian languages: da, no, sv

68a4e5a

Updated readme with scandinavian tasks

ffb751b

Muennighoff reviewed Jul 25, 2023

KennethEnevoldsen added 6 commits July 25, 2023 09:44

Changes n samples for the nordic lang CLF

343530a

Added scandinavian models to init

9893d04

Added error logs to gitignore

92909c2

fix import error

774022d

fix dataset columns

4c0cd73

rename dataset columns

effb04f

KennethEnevoldsen added 6 commits July 27, 2023 10:31

remove swefaq

4af0e58

fix: Added functionality to raise error

d6415b3

fix: Updated names

81b677e

fix: Removed no as a language

006f28d

Added missing data transformation

378605e

Fix spelling error

85b759d

KennethEnevoldsen requested a review from Muennighoff July 28, 2023 19:09

Muennighoff approved these changes Jul 28, 2023

Muennighoff merged commit acb0f59 into embeddings-benchmark:main Jul 29, 2023

KennethEnevoldsen deleted the add-scandinavian-lang branch July 30, 2023 03:12

KennethEnevoldsen mentioned this pull request Jul 30, 2023

Bump version ID and update PyPI #128

Merged

Muennighoff mentioned this pull request May 20, 2024

Integrate with MTEB? kaistAI/InstructIR#3

Open

Muennighoff mentioned this pull request May 31, 2024

Integrate with MTEB? gowitheflow-1998/RAR-b#4

Closed

Muennighoff mentioned this pull request Jul 10, 2024

Integrate with MTEB? CoIR-team/coir#4

Closed
