feat: Added new language code standard #326

KennethEnevoldsen · 2024-04-08T14:47:56Z

This adds a new language code standard that combines an ISO 639-3 language code with an ISO code for the script.
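For illustration, a combined code could be parsed as in the sketch below. It assumes a `<language>-<script>` string such as `eng-Latn` (three lowercase letters followed by a four-letter, title-cased script code). The separator, the `LanguageCode` type and the `parse_language_code` helper are hypothetical names chosen for this example, not necessarily what the PR implements.

```python
import re
from typing import NamedTuple


class LanguageCode(NamedTuple):
    """Hypothetical container for a combined language/script code."""

    language: str  # three lowercase letters (ISO 639-3 shape), e.g. "eng"
    script: str  # four letters, title-cased, e.g. "Latn"


# Only the *shape* of the code is checked here; membership in the official
# code tables is not verified.
_CODE_PATTERN = re.compile(r"^([a-z]{3})-([A-Z][a-z]{3})$")


def parse_language_code(code: str) -> LanguageCode:
    """Split an assumed '<language>-<script>' string into its two parts."""
    match = _CODE_PATTERN.match(code)
    if match is None:
        raise ValueError(f"Not a '<language>-<script>' code: {code!r}")
    return LanguageCode(language=match.group(1), script=match.group(2))
```

A two-letter code such as "en" is rejected by this check, which is the kind of inconsistency a single standard is meant to rule out.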

I believe the benchmarks are currently preserved as they should be. I also believe all tasks run as intended (I have tested a few of the relevant candidates).

A few things we would probably need to fix:

- MTEB(task_langs=[...]) currently uses just the user-specified splits; we should instead switch it to use the standard language codes.
  - Currently the above filter also only filters within tasks (not monolingual tasks). I believe it should do both.
  - While we are changing it, we could also add a few other filters based on domains etc.
- The above will require updating the benchmark lists (shouldn't be too big).
- This change does not lead to changes (at least as I understand it) in the results, but it probably should. We have already discussed standardizing the format of the results.
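A minimal sketch of the proposed filter, one that works on standard language codes and applies both within multilingual tasks and across monolingual tasks, might look like the following. The `Task` dataclass, the "default" subset name and the `filter_tasks` function are hypothetical stand-ins for this example and are not the MTEB API.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for a task and its language metadata."""

    name: str
    # Subset name -> standard language codes covered by that subset.
    # A monolingual task is modelled here as a single "default" subset.
    eval_langs: dict[str, list[str]]


def filter_tasks(tasks: list[Task], task_langs: list[str]) -> list[Task]:
    """Keep only the subsets (and tasks) that cover a requested language."""
    wanted = set(task_langs)
    selected = []
    for task in tasks:
        # Filter within the task: keep subsets that include a wanted code.
        kept = {
            subset: codes
            for subset, codes in task.eval_langs.items()
            if wanted.intersection(codes)
        }
        # Filter across tasks: drop tasks with no matching subset at all,
        # which is what removes non-matching monolingual tasks.
        if kept:
            selected.append(Task(name=task.name, eval_langs=kept))
    return selected
```

In this sketch a subset is kept when any of its codes is requested, so asking for French keeps a French-English subset. Requiring all codes of a subset to match would be an equally plausible design.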

The suggestions above would be major breaking changes that alter the way the benchmark interface works as well as how results are stored.

supersedes: #319
fixes: #251

KennethEnevoldsen · 2024-04-10T08:01:39Z

@MartinBernstorff, @Muennighoff and @imenelydiaker it would be great to have this PR reviewed before it diverges too much.

mteb/abstasks/languages.py

MartinBernstorff

Good work @kenneth! Huge diff though, so might very well have missed something, but the stuff I noticed all seems very minor.

mteb/abstasks/languages.py

MartinBernstorff · 2024-04-10T08:10:40Z

mteb/abstasks/AbsTask.py

@@ -58,3 +58,24 @@ def evaluate(self, model, split="test"):
        :param split: Which datasplit to be used.
        """
        raise NotImplementedError
+
+    @property
+    def languages(self) -> set[str]:


My impression is that providing these "shortcuts" tends to be annoying for maintenance, but your mileage may vary.

Annoying for maintenance and nice for use. My guess is that it removes confusion caused by the mapping dicts (leading to fewer issues). All a guess, though. I will need it for:

> MTEB(task_langs=[...]), currently uses just the user-specified splits, we should instead transfer it to use the standard language codes.

So rather than implementing it in the MTEB object, I believe it is better placed here.
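The "shortcut" under discussion is a property that hides the mapping dicts from callers. A rough sketch of the idea, assuming `<language>-<script>` codes and a hypothetical `SketchTask` class with an `eval_langs` attribute (this is not the body of the property added in the diff), could be:

```python
class SketchTask:
    """Hypothetical task class; not mteb's AbsTask."""

    def __init__(self, eval_langs):
        # Either a list of codes (monolingual task) or a mapping from
        # subset name to a list of codes (multilingual task).
        self.eval_langs = eval_langs

    @property
    def languages(self) -> set[str]:
        """Return the language part of every code the task covers."""
        if isinstance(self.eval_langs, dict):
            codes = [
                code for subset in self.eval_langs.values() for code in subset
            ]
        else:
            codes = list(self.eval_langs)
        # Assumption: the script suffix is dropped, so "fra-Latn" -> "fra".
        return {code.split("-")[0] for code in codes}
```

With a property like this, a language filter can ask each task for its languages without knowing whether the task is monolingual or multilingual.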

mteb/abstasks/TaskMetadata.py

mteb/evaluation/MTEB.py

KennethEnevoldsen · 2024-04-10T17:38:03Z

I will merge this in now due to the many merge conflicts I keep having to solve. Once it is in we can basically launch MMTEB as the other changes are only nice to have, but not need to have and don't cause issues for people adding datasets

Muennighoff · 2024-04-10T17:45:37Z

> I will merge this in now due to the many merge conflicts I keep having to solve. Once it is in we can basically launch MMTEB as the other changes are only nice to have, but not need to have and don't cause issues for people adding datasets

Good with me! Only had a quick look but looks great, really nice work!

Muennighoff · 2024-04-10T18:31:19Z

Do we have to change the language codes in all existing result files?

KennethEnevoldsen · 2024-04-10T18:58:43Z

@Muennighoff at the moment I don't believe it changes the results files at all: multilingual datasets still use the hf_name (e.g. "fr-en"), and monolingual datasets don't include a lang-id. The same is the case for MTEB(task_langs="fr-en"), which should use the same mappings as before and still match the hf_name.
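To make the point about unchanged names concrete, the sketch below assumes the metadata maps each Hugging Face subset name to the new standardized codes while selection still matches on the subset name. `EVAL_LANGS` and `subsets_to_run` are hypothetical names used only for this example.

```python
# Hypothetical metadata: Hugging Face subset name -> standardized codes.
# The keys are unchanged, so anything keyed by them stays valid.
EVAL_LANGS = {
    "fr-en": ["fra-Latn", "eng-Latn"],
    "de-en": ["deu-Latn", "eng-Latn"],
}


def subsets_to_run(eval_langs: dict[str, list[str]], task_langs) -> list[str]:
    """Select subsets by their Hugging Face name, as before the change."""
    requested = [task_langs] if isinstance(task_langs, str) else list(task_langs)
    return [name for name in eval_langs if name in requested]
```

Under these assumptions, requesting "fr-en" selects the same subset as before, while a standardized code such as "fra-Latn" matches nothing until the filter is moved over to the new codes.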

KennethEnevoldsen added 7 commits April 5, 2024 14:34

fix: Added initial language code suggestion

6408b12

docs: updated task metadata description

6653d97

fix: changed folder structure to iso 639-3 codes

f67af13

fix: Updated all language tags

7c8ab2f

clean: removed accidental results commit

0a4969c

fix: Add trusting of remote code to remove warning

861b6e1

fix: Added formatting

26e4fb4

KennethEnevoldsen requested review from MartinBernstorff, imenelydiaker and Muennighoff April 8, 2024 14:48

KennethEnevoldsen added 4 commits April 8, 2024 16:48

fix: trust remote code the flores dataset

5d18841

Merge branch 'main' of https://github.com/embeddings-benchmark/mteb i…

2628f26

…nto language_codes_minor

docs: Added point for language rewrite

93eb2fd

fix: reran linter after merge

37fe5fe

MartinBernstorff reviewed Apr 10, 2024

mteb/abstasks/languages.py (outdated, resolved)

MartinBernstorff approved these changes Apr 10, 2024

KennethEnevoldsen added 3 commits April 10, 2024 10:24

Merge branch 'main' of https://github.com/embeddings-benchmark/mteb i…

e1d894e

…nto language_codes_minor

fix: Added corrections from review

e353979

Merge branch 'main' of https://github.com/embeddings-benchmark/mteb i…

a8eeff5

…nto language_codes_minor

KennethEnevoldsen added 2 commits April 10, 2024 19:59

fix: Updated languages for newly added datasets

192d38c

docs: added points for new annotations

b0a9537

KennethEnevoldsen enabled auto-merge (squash) April 10, 2024 18:00

KennethEnevoldsen merged commit f0daece into main Apr 10, 2024
5 checks passed

KennethEnevoldsen mentioned this pull request Apr 10, 2024

Add OpenSubtitles Bitext Mining dataset #330

Closed

9 tasks

KennethEnevoldsen deleted the language_codes_minor branch April 10, 2024 18:55
