
Settle on one language standard #251

Closed
KennethEnevoldsen opened this issue Mar 18, 2024 · 10 comments · Fixed by #326

@KennethEnevoldsen
Contributor

Currently, the repository uses multiple language standards. In #245 we discussed which standard would be the ideal one to use.

I see multiple options:

  1. ISO 639-1 (two-letter codes). Not comprehensive.
  2. ISO 639-3 (three-letter codes). Attempts to be comprehensive, also covering ancient languages.
  3. BCP 47. A "best current practice" standard which consists of (decomposition sketched below):
  • Simple language tag: "en" for English, using the ISO 639-1 code (or ISO 639-3).
  • Language and region: "en-US" for American English, combining ISO 639-1 and ISO 3166-1 codes.
  • Language, script, and region: "zh-Hant-TW" for Traditional Chinese as used in Taiwan, combining ISO 639-1, ISO 15924, and ISO 3166-1 codes.
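
As a minimal sketch (plain string splitting, not a full BCP 47 parser), such a tag decomposes like this:

```python
# Minimal sketch, not a full BCP 47 parser: split a tag into its subtags.
def split_tag(tag: str) -> dict:
    """Decompose tags like "en", "en-US", or "zh-Hant-TW"."""
    parts = tag.split("-")
    result = {"language": parts[0], "script": None, "region": None}
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():    # ISO 15924 script, e.g. "Hant"
            result["script"] = part
        elif len(part) == 2 and part.isalpha():  # ISO 3166-1 region, e.g. "TW"
            result["region"] = part
    return result

print(split_tag("zh-Hant-TW"))
# {'language': 'zh', 'script': 'Hant', 'region': 'TW'}
```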

Generally I would recommend 2) (or potentially 3) using the ISO 639-3 standard for the language subtag). Let me know what you think.

@Muennighoff
Contributor

If we want it to apply to all existing datasets, then I think Flores makes 1) / 2) impossible, as it differentiates some languages purely on a script basis, such as zho_Hant & zho_Hans (https://github.com/embeddings-benchmark/mteb/blob/main/mteb/tasks/BitextMining/multilingual/FloresBitextMining.py)
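
To illustrate the ambiguity (the two codes below are from the linked file; treating them as {lang}_{script} is my reading of the naming scheme):

```python
# Illustration of why a bare trigram is ambiguous here: Flores-style subsets
# attach an ISO 15924 script code to the ISO 639-3 language code.
flores_like_subsets = [
    "zho_Hans",  # Chinese, Simplified script
    "zho_Hant",  # Chinese, Traditional script
]
# Both reduce to the same ISO 639-3 code, so script info would be lost:
assert {s.split("_")[0] for s in flores_like_subsets} == {"zho"}
```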

@KennethEnevoldsen
Contributor Author

KennethEnevoldsen commented Mar 18, 2024

Ah, then it might be relevant to use that system: "language_script", where the language is ISO 639-3, potentially with an added _region to allow for the specification of dialects.

@avidale

avidale commented Mar 18, 2024

> Ah, then it might be relevant to use that system: "language_script", where the language is ISO 639-3, potentially with an added _region to allow for the specification of dialects.

As far as I understand, for most languages, there are already ISO 639-3 codes for all major dialects (English seems to be one of the rare exceptions), so the extra region addition might not be needed.

@imenelydiaker
Contributor

imenelydiaker commented Mar 18, 2024

I'd go for option 2 (a trigram for languages). But we should keep the script we added for language mappings. What we noticed when creating the French benchmark was that, regardless of MTEB's language code standard, some datasets used other codes, so we had to map them to MTEB's.

Indeed Flores makes things complicated 🤔

@KennethEnevoldsen
Contributor Author

It sounds like we are going with three-letter language codes. However, there are languages such as da-bornholm, a regional dialect of Danish, which does not have a three-letter code. Should we just code that as "dan" (the code for Danish) and then add the dialect in the description/metadata?

@imenelydiaker
Contributor

> It sounds like we are going with three-letter language codes. However, there are languages such as da-bornholm, a regional dialect of Danish, which does not have a three-letter code. Should we just code that as "dan" (the code for Danish) and then add the dialect in the description/metadata?

Yes, adding the dialect in the metadata is a good idea; dialects are basically subsets of a language.

@KennethEnevoldsen
Contributor Author

Perfect, it seems like we agree. I will update the language codes once we have the major PRs in: #265, #260

@KennethEnevoldsen
Contributor Author

KennethEnevoldsen commented Apr 3, 2024

@MartinBernstorff, @imenelydiaker, @Muennighoff I believe we are at a point where we could implement this. I just want to talk the idea over first; here are my initial thoughts:

Current formats
We want a general approach for representing languages: either a three-letter language code ("eng") for single-language tasks, a tuple (("eng", "dan")) for bitext mining, or, for some tasks (e.g. language classification), multiple languages ("dan", "nob", "swe", ...).

For some tasks these language tags are relevant to how the dataset is loaded (e.g. CrosslingualTask and MultilingualTask), while for others they are not.

Proposed solution
For all cases where the language tag is only relevant for documentation, converting it to the three-letter code should be without issue ("en" -> "eng"); these can take the form list[str | tuple[str, str]].

For tasks where it is of importance, a mapping function is implemented which, for a given language (or language pair), returns the corresponding code for dataset loading (e.g. "eng" -> "en" or ("eng", "dan") -> "dan-eng").
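
A minimal sketch of such a mapping (names like LANG_MAP and to_dataset_code are illustrative, not the actual MTEB API):

```python
# Hypothetical sketch of the proposed mapping; names are illustrative,
# not the actual MTEB API.
LANG_MAP: dict[str | tuple[str, str], str] = {
    "eng": "en",                # ISO 639-3 -> code used by the dataset
    ("eng", "dan"): "dan-eng",  # language pair -> bitext subset name
}

def to_dataset_code(lang: str | tuple[str, str]) -> str:
    """Return the dataset-loading code for a language or language pair."""
    return LANG_MAP[lang]

assert to_dataset_code("eng") == "en"
assert to_dataset_code(("eng", "dan")) == "dan-eng"
```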

Issues
I could see one potential issue here: e.g. a bitext-mining dataset that names its subsets {lang}{script}-{lang}{script}, which could then contain the same two languages in multiple scripts (e.g. for Japanese).

A solution would be to simply have a list of languages (or really, subsets) for dataset loading, where each of these then has a mapping to its languages. E.g.

{"en": ["eng"], ...} or {"dan-eng": [("eng", "dan")], ...}

Here you are really annotating the subsets rather than the task itself (which contains all of the splits).

For datasets without splits, though, I am unsure how to structure this, as you don't have a subset (you could do {None: ["{lang}"]}), but that seems a bit complex for just annotating the language.
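
Putting the pieces together, the annotation could look something like this (a sketch of the idea, not a settled schema):

```python
# Sketch of the subset -> languages annotation discussed above (not a settled
# schema). Keys are the subset names used for dataset loading; values are the
# language (or language-pair) annotations using ISO 639-3 codes.
multilingual = {"en": ["eng"], "da": ["dan"]}
bitext = {"dan-eng": [("eng", "dan")]}
# For a dataset without subsets, None could stand in for the missing subset name:
monolingual = {None: ["eng"]}
```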

I am unsure what the best trade-offs are here, but since it is quite a refactor I would like to discuss it before I spend the time going through the entire benchmark.

@Muennighoff
Contributor

Just adding that another consideration is how to display it in the results JSON file. Currently it is a dictionary with a language key first that corresponds to the language names as used by the dataset. If a dataset only has one language, i.e. it's not multilingual, there's no language key. It may make sense to always have a language key to standardize the format, i.e. even if it's an English-only or German-only dataset, still have an en: or a de: key (or whatever language code).
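
Illustratively (a sketch of the idea, not the actual schema; the split and metric names below are placeholders):

```python
# Sketch only: the exact schema is what's being discussed, and the split/metric
# names below are placeholders.
# Current: a monolingual dataset's scores sit directly under the split ...
current = {"test": {"main_score": 0.55}}
# ... while a multilingual dataset nests them under a language key:
multilingual = {"test": {"en": {"main_score": 0.55}, "de": {"main_score": 0.48}}}
# Proposed: always include the language key, even for monolingual datasets:
proposed = {"test": {"en": {"main_score": 0.55}}}
```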

@KennethEnevoldsen
Contributor Author

Totally agree @Muennighoff, it would be great to normalise the format of the results more.
