-
Notifications
You must be signed in to change notification settings - Fork 290
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Added new language code standard #326
Conversation
@MartinBernstorff, @Muennighoff and @imenelydiaker it would be great to have this PR review before it diverge too much |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work @kenneth! Huge diff though, so might very well have missed something, but the stuff I noticed all seems very minor.
@@ -58,3 +58,24 @@ def evaluate(self, model, split="test"): | |||
:param split: Which datasplit to be used. | |||
""" | |||
raise NotImplementedError | |||
|
|||
@property | |||
def languages(self) -> set[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My impression is that providing these "shortcuts" tend to be annoying for maintenance, but your mileage may vary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Annoying for maintenance and nice for use. My guess would be that it removes confusion due to the mapping dicts (leading to less issues). All a guess though. I will need to for:
MTEB(task_langs=[...]), currently uses just the user-specified splits, we should instead transfer it to use the standard language codes.
So rather than implement it in the MTEB object I believe it is better places here.
…nto language_codes_minor
…nto language_codes_minor
I will merge this in now due to the many merge conflicts I keep having to solve. Once it is in we can basically launch MMTEB as the other changes are only nice to have, but not need to have and don't cause issues for people adding datasets |
Good with me! Only had a quick look but looks great, really nice work! |
Do we have to change the language codes in all existing result files? |
@Muennighoff atm. I don't believe it changes the results file at all (still uses the hf_name for the multilingual dataset e.g. "fr-en") and for the monolingual datasets, it doesn't include a lang-id. The same is the case for MTEB(task_langs="fr-en"), which should use the same mappings as before and still match the hf-name. |
This adds a new language code standard using an iso 639-3 code as well as an iso code for scripts.
I believe the benchmarks are currently preserved as they should be. I also believe all tasks run as intended (I have tested a few of the relevant candidates).
A few things we would probably need to fix:
The suggestions above will be major breaking which changes the way the benchmark interface works as well as how results are stored.
supersedes: #319
fixes: #251