Added support for Scandinavian Languages #124

Conversation

KennethEnevoldsen
Contributor

@KennethEnevoldsen KennethEnevoldsen commented Jul 24, 2023

Addition

  • Added all relevant tasks for Da, No, and Sv I could find. Norwegian includes Norwegian Bokmål and Norwegian Nynorsk. Danish also includes a bitext between Danish and one of its dialects.
  • Added a minor fix to the Bitext task

Task selection:

  • Wherever possible I stay aligned with ScandEval, i.e. where two versions of a dataset exist, I use the same one as ScandEval.
  • I do not include machine-translated datasets (this excludes the QA datasets in ScandEval)
  • Where possible I use the test split. If there are no splits I use the train split.
  • I do not include non-public datasets such as DaNewsroom
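
The split rule above can be sketched as a small helper. This is only an illustration: `choose_eval_split` is a hypothetical name that is not part of this PR, and it assumes that a dataset published without explicit splits exposes its data under a single `train` split.

```python
def choose_eval_split(available_splits):
    """Pick the evaluation split: prefer "test", otherwise fall back to "train".

    Hypothetical helper for illustration; not part of the PR.
    """
    names = set(available_splits)
    if "test" in names:
        return "test"
    if "train" in names:
        # Assumption: a dataset without explicit splits only exposes "train".
        return "train"
    raise ValueError(f"No usable split among: {sorted(names)}")
```

For example, `choose_eval_split(["train", "validation", "test"])` picks `test`, while a dataset that only offers `train` is evaluated on `train`.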

Potential problems

  • Can't get a non-zero performance on the retrieval task. See Adding a new retrieval task #122 (A solution is simply to remove it for the first PR)
  • Currently, sentiment classification tasks are overrepresented.

Other

  • I have added the transform_dataset method. Might be relevant to make that a general method in the AbsTask. In that way, the loading workflow is independent of the transformation workflow (would allow you to e.g. add splits without having to fix the custom transformations of each of the datasets). Currently, it is not implemented in the main function.
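
A minimal sketch of the separation described above, under stated assumptions: `AbsTaskSketch`, `fetch_raw` and `ToyClassification` are hypothetical names, the hook is called `transform_dataset` after the description above (the name in the merged code may differ), and none of this is the actual MTEB `AbsTask`.

```python
class AbsTaskSketch:
    """Hypothetical stand-in for a task base class; not the real MTEB AbsTask."""

    def __init__(self):
        self.dataset = None
        self.data_loaded = False

    def fetch_raw(self):
        """Return the raw dataset as {split_name: [row, ...]}; task-specific."""
        raise NotImplementedError

    def transform_dataset(self):
        """Default hook: leave the loaded dataset untouched."""

    def load_data(self):
        # Loading is generic; only the transformation differs per task.
        if self.data_loaded:
            return
        self.dataset = self.fetch_raw()
        self.transform_dataset()
        self.data_loaded = True


class ToyClassification(AbsTaskSketch):
    def fetch_raw(self):
        # Toy in-memory data standing in for a downloaded dataset.
        return {"train": [{"review": "god film", "sentiment": "positiv"}]}

    def transform_dataset(self):
        # Rename columns to the names the evaluator is assumed to expect.
        self.dataset = {
            split: [{"text": r["review"], "label": r["sentiment"]} for r in rows]
            for split, rows in self.dataset.items()
        }
```

With this layout, a change to the loading logic (for example, how splits are selected) lives in `load_data` alone and does not require touching each task's custom transformation.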

Fixes #126

Contributor

@Muennighoff Muennighoff left a comment

Amazing work!

  • I like the dataset_transform. I think we can leave it like you did for now & maybe we'll make it a generic function later
  • We still need to import all tasks in the __init__.py files for each task directory I think
  • Why did you choose 16 samples_per_label for all Classification tasks? @NouamaneTazi @loicmagne do you remember how we selected the different samples_per_label values for CLF tasks? Was it based on how big the dataset is / the number of labels?
  • Can you run a few models on the new tasks and share the result files here? I can add a new leaderboard tab for some of these languages where we have a few datasets. Maybe three new CLF tabs for Danish, Norwegian, Swedish? Or is it more useful if it's just one new CLF tab for "Scandinavia"?
  • Do we have results on SweFAQRetrieval from prior work & are we able to reproduce it?

@KennethEnevoldsen
Contributor Author

KennethEnevoldsen commented Jul 25, 2023

  1. Sure will keep it as is
  2. Will get on that later today. Edit: Done.
  3. It was arbitrary. I can choose based on similarly sized datasets? (Edit: Most of the datasets are quite small. I changed the largest one to 32, but kept the rest at 16.)
  4. Sure thing. I have already done it, but will rerun it with the update from point 3. I would use Danish, Norwegian, and Swedish. Potentially you could also add Scandinavian as a meaningful multilingual group (but it shouldn't replace the individual languages).
  5. The Swedish sentence transformer has results here. They obtain an accuracy of 50-70 (I think they might have framed it as a binary classification task between a given potential answer and the answer). I have tried their model as well and got a score of 0, which suggests the error is on my end.
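
For context on point 3, here is a rough sketch of what a per-label sample cap does, assuming `samples_per_label` limits how many training examples of each class are kept before a classifier is fit on the embeddings. `undersample_per_label` is a hypothetical helper, not the MTEB implementation.

```python
import random
from collections import defaultdict


def undersample_per_label(texts, labels, samples_per_label, seed=42):
    """Keep at most `samples_per_label` randomly chosen examples per label.

    Hypothetical helper for illustration; not the MTEB implementation.
    """
    rng = random.Random(seed)
    order = list(range(len(texts)))
    rng.shuffle(order)  # visit examples in random order

    counts = defaultdict(int)
    keep = []
    for i in order:
        if counts[labels[i]] < samples_per_label:
            counts[labels[i]] += 1
            keep.append(i)
    return [texts[i] for i in keep], [labels[i] for i in keep]
```

With two balanced labels and `samples_per_label=16`, the training subset has 32 examples, so a small dataset is barely affected while a large one is cut down heavily. That is one reason the value might be scaled with dataset size.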

@Muennighoff
Contributor

Muennighoff commented Jul 27, 2023

1. 2. 3. Sounds great!
4. Nice let me know when you have the files! Then I will add them to the LB & we can merge this 👍
5. Maybe let's remove this task for now then like you suggested? We can add it later if we manage to fix the bug 👍

@KennethEnevoldsen
Contributor Author

KennethEnevoldsen commented Jul 28, 2023

Perfect. I have sent the results by mail.

I have also removed the task and fixed #126 (which was a good thing, as it revealed a few errors). Assuming you agree, I think this is all good to merge.

Contributor

@Muennighoff Muennighoff left a comment

LGTM; Have added all tasks to the leaderboard: https://huggingface.co/spaces/mteb/leaderboard

Let me know if the leaderboard looks okay to you? & then will merge 👍

@KennethEnevoldsen
Contributor Author

KennethEnevoldsen commented Jul 28, 2023

The only change I would make is to change the name of Bitext from other to Danish (it is Danish + a Danish dialect). Otherwise, I think it looks good!

Edit: Actually, if you wish, I am creating an aggregated site for the Scandinavian subsection here (still working on it). Feel free to link to it. I plan to add Finnish, Icelandic and Faroese in the future (as well as adding them to MTEB).

Edit: Oh it seems like ScalaNbClassification is in Swedish instead of Norwegian

@Muennighoff
Contributor

Fixed & added the link! If you want to link it in a different way, let me know - You can also edit the app.py of the leaderboard directly if you want to.

Also FYI all your scores are in this repository: https://huggingface.co/datasets/mteb/results
I renamed some of them to account for the name change in SweRecClassification.

Merging now 🚀

@Muennighoff Muennighoff merged commit acb0f59 into embeddings-benchmark:main Jul 29, 2023
@KennethEnevoldsen KennethEnevoldsen deleted the add-scandinavian-lang branch July 30, 2023 03:12
Development

Successfully merging this pull request may close these issues.

Allow toggle for raising errors
2 participants