What's New
Lighteval becomes massively multilingual!
We now have extensive coverage in many languages, as well as new templates to manage multilinguality more easily.
- Add 3 NLI tasks supporting 26 unique languages. #329 by @hynky1999
- Add 3 COPA tasks supporting about 20 unique languages. #330 by @hynky1999
- Add Hellaswag tasks supporting about 36 unique languages. #332 by @hynky1999
  - mlmm_hellaswag
  - hellaswag_{tha/tur}
- Add reading comprehension (RC) tasks supporting about 130 unique languages/scripts. #333 by @hynky1999
- Add general knowledge (GK) tasks supporting about 35 unique languages/scripts. #338 by @hynky1999
  - meta_mmlu
  - mlmm_mmlu
  - rummlu
  - mmlu_ara_mcf
  - tur_leaderboard_mmlu
  - cmmlu
  - mmlu
  - ceval
  - mlmm_arc_challenge
  - alghafa_arc_easy
  - community_arc
  - community_truthfulqa
  - exams
  - m3exams
  - thai_exams
  - xcsqa
  - alghafa_piqa
  - mera_openbookqa
  - alghafa_openbookqa
  - alghafa_sciqa
  - mathlogic_qa
  - agieval
  - mera_worldtree
- Misc Tasks #339 by @hynky1999
  - openai_mmlu_tasks
  - turkish_mmlu_tasks
  - lumi arc
  - hindi/swahili/arabic (from alghafa) arc
  - cmath
  - mgsm
  - xcodah
  - xstory
  - xwinograd + tr winograd
  - mlqa
  - mkqa
  - mintaka
  - mlqa_tasks
  - french triviaqa
  - chegeka
  - acva
  - french_boolq
  - hindi_boolq
Other Tasks
- Serbian LLM Benchmark Task by @DeanChugall in #340
- iroko bench by @hynky1999 in #357
Features
- Support for evaluating OpenAI models by @NathanHB in #359
- New documentation and README by @NathanHB in #327
- Refactor LLM-as-a-judge by @NathanHB in #337
- Selecting tasks using their superset by @hynky1999 in #308
- Nicer output on task search failure by @hynky1999 in #357
- Adds tasks templating by @hynky1999 in #335
- Support for multilingual generative metrics by @hynky1999 in #293
- Class implementations of faithfulness and extractiveness metrics by @chuandudx in #323
- Translation literals by @hynky1999 in #356
Bug Fixes
- Math normalization: do not crash on invalid format by @guipenedo in #331
- Skipping push to hub test by @clefourrier in #334
- Fix Metrics import path in community task template file. by @chuandudx in #309
- Allow kwargs for BERTScore compute function and remove unused var by @chuandudx in #311
- Fixes sampling for vllm when num_samples==1 by @edbeeching in #343
- Fix the dataset loading for custom tasks by @clefourrier in #364
- Fix: missing property tag in inference endpoints by @clefourrier in #368
- Fix Tokenization + misc fixes by @hynky1999 in #354
- Fix BLEURT evaluation errors by @chuandudx in #316
- Adds Baseline workflow + fixes by @hynky1999 in #363
Significant community contributions
The following contributors have made significant changes to the library since the last release:
- @hynky1999
  - Support for multilingual generative metrics (#293)
  - Adds tasks templating (#335)
  - Multilingual NLI Tasks (#329)
  - Multilingual COPA tasks (#330)
  - Multilingual Hellaswag tasks (#332)
  - Multilingual Reading Comprehension tasks (#333)
  - Multilingual General Knowledge tasks (#338)
  - Selecting tasks using their superset (#308)
  - Fix Tokenization + misc fixes (#354)
  - Misc-multilingual tasks (#339)
  - add iroko bench + nicer output on task search failure (#357)
  - Translation literals (#356)
  - selected tasks for multilingual evaluation (#371)
  - Adds Baseline workflow + fixes (#363)
- @DeanChugall
  - Serbian LLM Benchmark Task (#340)
- @NathanHB
New Contributors
- @chuandudx made their first contribution in #323
- @edbeeching made their first contribution in #343
- @DeanChugall made their first contribution in #340
- @Stopwolf made their first contribution in #225
- @martinscooper made their first contribution in #366
Full Changelog: v0.5.0...v0.6.0