Settle on one language standard #251
Currently, the repository uses multiple language standards. In #245 it was discussed which standard was the ideal one to use.

I have multiple options:

Generally I would recommend 2) (potentially 3) using the ISO 639-3 standard. Let me know what you guys think.

Comments
If we want it to apply to all existing datasets, then I think Flores makes 1) / 2) impossible, as it differentiates some languages purely on a script basis, such as …
Ahh, then it might be relevant to use that system: "language_script", where language is ISO 639-3, potentially with an added _region to allow for specification of dialects.
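A minimal sketch of what such tags could look like, assuming the "language_script(_region)" shape described above (the helper and the bornholm example are purely illustrative, not mteb's API):

```python
from typing import Optional

# Illustrative sketch only: composing tags of the form "language_script",
# with an optional region/dialect suffix for finer granularity.
def make_tag(lang: str, script: str, region: Optional[str] = None) -> str:
    """Compose a tag such as 'eng_Latn', or 'dan_Latn_bornholm' with a region."""
    return "_".join(part for part in (lang, script, region) if part)

assert make_tag("eng", "Latn") == "eng_Latn"  # FLORES-200 style code
assert make_tag("dan", "Latn", "bornholm") == "dan_Latn_bornholm"
```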
As far as I understand, for most languages there are already ISO 639-3 codes for all major dialects (English seems to be one of the rare exceptions), so the extra region addition might not be needed.
I'd go for option 2 (a trigram for languages), but we should keep the script we added for language mappings. What we noticed when creating the French benchmark was that, regardless of MTEB language code standards, some datasets used other codes, so we had to map them to MTEB. Indeed, Flores makes things complicated 🤔
It sounds like we are going with three-letter language codes. However, there are cases such as da-bornholm, which is a regional dialect of Danish and does not have a three-letter code. Should we just code that as "dan" (the code for Danish) and then add the dialect in the description/metadata?
Yes, adding the dialect in the metadata is a good idea; basically, dialects are subsets of a language.
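A rough sketch of what that could look like, assuming a hypothetical metadata dict (the task name and field names here are illustrative, not mteb's actual schema):

```python
# Hypothetical metadata shape: the language code stays a valid ISO 639-3
# trigram, and the dialect is recorded separately in the metadata.
task_metadata = {
    "name": "BornholmskParallel",  # illustrative task name
    "languages": ["dan"],          # ISO 639-3 code for Danish
    "dialect": ["da-bornholm"],    # dialect noted in metadata instead
}
```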
@MartinBernstorff, @imenelydiaker, @Muennighoff I believe we are at a point where we could implement this. I just want to talk the idea over, but here are my initial thoughts:

Current formats

For some tasks these language tags are relevant to how the dataset is loaded (e.g. CrosslingualTask and MultilingualTask), while for others they are not.

Proposed solution

For tasks where it is of importance, a mapping function is implemented which, for a given language (or language pair), returns the corresponding code for the dataset loading (e.g. "eng" -> "en" or ("eng", "dan") -> "dan-eng").

Issues

A solution would be to simply have a list of languages (or really subsets) for the dataset loading, and each of these then has a mapping to languages, e.g. {"en": ["eng"], ...} or {"dan-eng": [("eng", "dan")], ...}. Here you are really annotating the subsets rather than the task itself (which contains all of the splits). For datasets without splits, though, I am unsure how to structure this, as you don't have a subset (you could do {None: ["{lang}"]}), but that seems a bit complex for just annotating the language.

Unsure what the best trade-offs are here, but since it is quite a refactor I would like to discuss it before I spend the time going through the entire benchmark.
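To make the proposed annotation concrete, here is a rough sketch using the shapes from the comment above (the lookup helper is hypothetical, purely for illustration):

```python
from typing import Optional

# Per-subset annotation, as proposed above: dataset-side subset codes map
# to the standardized ISO 639-3 codes (or language pairs).
langs_mono = {"en": ["eng"]}                 # MultilingualTask subset
langs_cross = {"dan-eng": [("eng", "dan")]}  # CrosslingualTask subset
langs_flat = {None: ["eng"]}                 # dataset without subsets

def dataset_code(mapping: dict, iso: object) -> Optional[str]:
    """Hypothetical reverse lookup: dataset-side code for an ISO code/pair."""
    for code, annotated in mapping.items():
        if iso in annotated:
            return code
    return None

assert dataset_code(langs_mono, "eng") == "en"
assert dataset_code(langs_cross, ("eng", "dan")) == "dan-eng"
```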
Just adding that another consideration is how to display it in the JSON results file. Currently it is a dictionary with a language key first that corresponds to the language names as used by the dataset. If a dataset only has one language, i.e. it's not multilingual, there's no language key. It may make sense to always have a language key to standardize the format, i.e. even if it's an English-only or German-only dataset, still have an en: or a de: key (or whatever language code).
Totally agree @Muennighoff, it would be great to normalise the format of the results more.
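As a sketch of the normalization discussed in the two comments above (scores and keys are made up; only the structure is the point):

```python
# Current layout: a monolingual result file has no language key.
current = {"test": {"accuracy": 0.71}}

# Proposed layout: always key by language, even for monolingual datasets,
# so multilingual and monolingual results share one structure.
proposed = {"en": {"test": {"accuracy": 0.71}}}
```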