
Settle on one language standard #251

Closed
KennethEnevoldsen opened this issue Mar 18, 2024 · 10 comments · Fixed by #326

@KennethEnevoldsen
Contributor

Currently, the repository uses multiple language standards. In #245 we discussed which standard would be the ideal one to use.

I see multiple options:

  1. ISO 639-1 (two-letter codes). Not comprehensive.
  2. ISO 639-3 (three-letter codes). Attempts to be comprehensive, also covering ancient languages.
  3. BCP 47. A "best current practice" standard which consists of (decomposition sketched below):
  • Simple language tag: "en" for English, using the ISO 639-1 code (or ISO 639-3).
  • Language and region: "en-US" for American English, combining ISO 639-1 and ISO 3166-1 codes.
  • Language, script, and region: "zh-Hant-TW" for Traditional Chinese as used in Taiwan, combining ISO 639-1, ISO 15924, and ISO 3166-1 codes.
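
As a minimal sketch (plain string splitting, not a full BCP 47 parser), such a tag decomposes like this:

```python
# Minimal sketch, not a full BCP 47 parser: split a tag into its subtags.
def split_tag(tag: str) -> dict:
    """Decompose tags like "en", "en-US", or "zh-Hant-TW"."""
    parts = tag.split("-")
    result = {"language": parts[0], "script": None, "region": None}
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():    # ISO 15924 script, e.g. "Hant"
            result["script"] = part
        elif len(part) == 2 and part.isalpha():  # ISO 3166-1 region, e.g. "TW"
            result["region"] = part
    return result

print(split_tag("zh-Hant-TW"))
# {'language': 'zh', 'script': 'Hant', 'region': 'TW'}
```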

Generally I would recommend 2) (or potentially 3) using the ISO 639-3 standard for the language subtag). Let me know what you think.

@Muennighoff
Contributor

If we want it to apply to all existing datasets, then I think Flores makes 1) / 2) impossible, as it differentiates some languages purely on a script basis, such as zho_Hant & zho_Hans (https://github.com/embeddings-benchmark/mteb/blob/main/mteb/tasks/BitextMining/multilingual/FloresBitextMining.py)
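
To illustrate the ambiguity (the two codes below are from the linked file; treating them as {lang}_{script} is my reading of the naming scheme):

```python
# Illustration of why a bare trigram is ambiguous here: Flores-style subsets
# attach an ISO 15924 script code to the ISO 639-3 language code.
flores_like_subsets = [
    "zho_Hans",  # Chinese, Simplified script
    "zho_Hant",  # Chinese, Traditional script
]
# Both reduce to the same ISO 639-3 code, so script info would be lost:
assert {s.split("_")[0] for s in flores_like_subsets} == {"zho"}
```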

@KennethEnevoldsen
Contributor Author

KennethEnevoldsen commented Mar 18, 2024

Ah, then it might be relevant to use that system: "language_script", where the language is ISO 639-3, potentially with an added _region to allow for the specification of dialects.

@avidale

avidale commented Mar 18, 2024

> Ah, then it might be relevant to use that system: "language_script", where the language is ISO 639-3, potentially with an added _region to allow for the specification of dialects.

As far as I understand, for most languages, there are already ISO 639-3 codes for all major dialects (English seems to be one of the rare exceptions), so the extra region addition might not be needed.

@imenelydiaker
Contributor

imenelydiaker commented Mar 18, 2024

I'd go for option 2 (a trigram for languages). But we should keep the script we added for language mappings. What we noticed when creating the French benchmark was that, regardless of MTEB's language code standard, some datasets used other codes, so we had to map them to MTEB's.

Indeed Flores makes things complicated 🤔

@KennethEnevoldsen
Contributor Author

It sounds like we are going with three-letter language codes. However, there are languages such as da-bornholm, a regional dialect of Danish, which does not have a three-letter code. Should we just code that as "dan" (the code for Danish) and then add the dialect in the description/metadata?

@imenelydiaker
Contributor

> It sounds like we are going with three-letter language codes. However, there are languages such as da-bornholm, a regional dialect of Danish, which does not have a three-letter code. Should we just code that as "dan" (the code for Danish) and then add the dialect in the description/metadata?

Yes, adding the dialect in the metadata is a good idea; dialects are basically subsets of a language.

@KennethEnevoldsen
Contributor Author

Perfect, it seems like we agree. I will update the language codes once we have the major PRs in: #265, #260

@KennethEnevoldsen
Contributor Author

KennethEnevoldsen commented Apr 3, 2024

@MartinBernstorff, @imenelydiaker, @Muennighoff I believe we are at a point where we could implement this. I just want to talk the idea over first; here are my initial thoughts:

Current formats
We want a general approach for representing languages: either a three-letter language code ("eng") for single-language tasks, a tuple (("eng", "dan")) for bitext mining, or, for some tasks (e.g. language classification), multiple languages ("dan", "nob", "swe", ...).

For some tasks these language tags are relevant to how the dataset is loaded (e.g. CrosslingualTask and MultilingualTask), while for others they are not.

Proposed solution
For all cases where the language tag is only relevant for documentation, converting it to the three-letter code should be without issue ("en" -> "eng"); these can take the form list[str | tuple[str, str]].

For tasks where it is of importance, a mapping function is implemented which, for a given language (or language pair), returns the corresponding code for dataset loading (e.g. "eng" -> "en" or ("eng", "dan") -> "dan-eng").
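
A minimal sketch of such a mapping (names like LANG_MAP and to_dataset_code are illustrative, not the actual MTEB API):

```python
# Hypothetical sketch of the proposed mapping; names are illustrative,
# not the actual MTEB API.
LANG_MAP: dict[str | tuple[str, str], str] = {
    "eng": "en",                # ISO 639-3 -> code used by the dataset
    ("eng", "dan"): "dan-eng",  # language pair -> bitext subset name
}

def to_dataset_code(lang: str | tuple[str, str]) -> str:
    """Return the dataset-loading code for a language or language pair."""
    return LANG_MAP[lang]

assert to_dataset_code("eng") == "en"
assert to_dataset_code(("eng", "dan")) == "dan-eng"
```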

Issues
I could see one potential issue here: e.g. a bitext-mining dataset that names its subsets {lang}{script}-{lang}{script}, which could then contain the same two languages in multiple scripts (e.g. for Japanese).

A solution would be to simply have a list of languages (or really, subsets) for dataset loading, where each of these then has a mapping to its languages. E.g.

{"en": ["eng"], ...} or {"dan-eng": [("eng", "dan")], ...}

Here you are really annotating the subsets rather than the task itself (which contains all of the splits).

For datasets without splits, though, I am unsure how to structure this, as you don't have a subset (you could do {None: ["{lang}"]}), but that seems a bit complex for just annotating the language.
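
Putting the pieces together, the annotation could look something like this (a sketch of the idea, not a settled schema):

```python
# Sketch of the subset -> languages annotation discussed above (not a settled
# schema). Keys are the subset names used for dataset loading; values are the
# language (or language-pair) annotations using ISO 639-3 codes.
multilingual = {"en": ["eng"], "da": ["dan"]}
bitext = {"dan-eng": [("eng", "dan")]}
# For a dataset without subsets, None could stand in for the missing subset name:
monolingual = {None: ["eng"]}
```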

I am unsure what the best trade-offs are here, but since it is quite a refactor I would like to discuss it before I spend the time going through the entire benchmark.

@Muennighoff
Contributor

Just adding that another consideration is how to display it in the results JSON file. Currently it is a dictionary with a language key first that corresponds to the language names as used by the dataset. If a dataset only has one language, i.e. it's not multilingual, there's no language key. It may make sense to always have a language key to standardize the format, i.e. even if it's an English-only or German-only dataset, still have an en: or a de: key (or whatever language code).
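
Illustratively (a sketch of the idea, not the actual schema; the split and metric names below are placeholders):

```python
# Sketch only: the exact schema is what's being discussed, and the split/metric
# names below are placeholders.
# Current: a monolingual dataset's scores sit directly under the split ...
current = {"test": {"main_score": 0.55}}
# ... while a multilingual dataset nests them under a language key:
multilingual = {"test": {"en": {"main_score": 0.55}, "de": {"main_score": 0.48}}}
# Proposed: always include the language key, even for monolingual datasets:
proposed = {"test": {"en": {"main_score": 0.55}}}
```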

@KennethEnevoldsen
Contributor Author

Totally agree @Muennighoff, it would be great to normalise the format of the results more.
