What's New

Lighteval becomes massively multilingual!

We now have extensive coverage in many languages, as well as new templates to manage multilinguality more easily.

Add 3 NLI tasks supporting 26 unique languages. #329 by @hynky1999
- xnli
- xnli2.0
- indic_xnli
- cmnli + ocnli
- rcb
Add 3 COPA tasks supporting about 20 unique languages. #330 by @hynky1999
- xcopa
- indic-copa
- parus
Add Hellaswag tasks supporting about 36 unique languages. #332 by @hynky1999
- mlmm_hellaswag
- hellaswag_{tha/tur}
Add reading comprehension (RC) tasks supporting about 130 unique languages/scripts. #333 by @hynky1999
- xquad
- thaiqa
- sber_squad
- arcd
- kenswquad
- chinese_squad
- cmrc2018
- indicqa
- fquad_v2
- tydiqa
- belebele
Add general knowledge (GK) tasks supporting about 35 unique languages/scripts. #338 by @hynky1999
- meta_mmlu
- mlmm_mmlu
- rummlu
- mmlu_ara_mcf
- tur_leaderboard_mmlu
- cmmlu
- mmlu
- ceval
- mlmm_arc_challenge
- alghafa_arc_easy
- community_arc
- community_truthfulqa
- exams
- m3exams
- thai_exams
- xcsqa
- alghafa_piqa
- mera_openbookqa
- alghafa_openbookqa
- alghafa_sciqa
- mathlogic_qa
- agieval
- mera_worldtree
Add miscellaneous tasks. #339 by @hynky1999
- openai_mmlu_tasks
- turkish_mmlu_tasks
- lumi arc
- hindi/swahili/arabic (from alghafa) arc
- cmath
- mgsm
- xcodah
- xstory
- xwinograd + tr winograd
- mlqa
- mkqa
- mintaka
- mlqa_tasks
- french triviaqa
- chegeka
- acva
- french_boolq
- hindi_boolq
Serbian LLM Benchmark Task by @DeanChugall in #340
IrokoBench by @hynky1999 in #357

Evaluate OpenAI models by @NathanHB in #359
New Doc and README by @NathanHB in #327
Refactor LLM as a Judge by @NathanHB in #337
Selecting tasks using their superset by @hynky1999 in #308
Nicer output on task search failure by @hynky1999 in #357
Adds tasks templating by @hynky1999 in #335
Support for multilingual generative metrics by @hynky1999 in #293
Class implementations of faithfulness and extractiveness metrics by @chuandudx in #323
Translation literals by @hynky1999 in #356

Math normalization: do not crash on invalid format by @guipenedo in #331
Skipping push to hub test by @clefourrier in #334
Fix Metrics import path in community task template file. by @chuandudx in #309
Allow kwargs for BERTScore compute function and remove unused var by @chuandudx in #311
Fixes sampling for vllm when num_samples==1 by @edbeeching in #343
Fix the dataset loading for custom tasks by @clefourrier in #364
Fix: missing property tag in inference endpoints by @clefourrier in #368
Fix Tokenization + misc fixes by @hynky1999 in #354
Fix BLEURT evaluation errors by @chuandudx in #316
Adds Baseline workflow + fixes by @hynky1999 in #363


Full Changelog: v0.5.0...v0.6.0