From ce201f778f01a7c0d815c49224f2457cc1a33d45 Mon Sep 17 00:00:00 2001
From: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Date: Fri, 27 Sep 2024 00:54:26 +0500
Subject: [PATCH] add mmlu readme (#2282)

---
 lm_eval/tasks/mmlu/README.md | 73 ++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)
 create mode 100644 lm_eval/tasks/mmlu/README.md

diff --git a/lm_eval/tasks/mmlu/README.md b/lm_eval/tasks/mmlu/README.md
new file mode 100644
index 0000000000..a3425d5176
--- /dev/null
+++ b/lm_eval/tasks/mmlu/README.md
@@ -0,0 +1,73 @@

# MMLU

### Paper

Title: `Measuring Massive Multitask Language Understanding`

Paper: https://arxiv.org/abs/2009.03300

Abstract: `The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.`

Homepage: `https://github.com/hendrycks/test`

Note: The `Flan` variants are derived from [here](https://github.com/jasonwei20/flan-2) and are described in Appendix D.1 of [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416).

### Citation

```
@article{hendryckstest2021,
    title={Measuring Massive Multitask Language Understanding},
    author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
    journal={Proceedings of the International Conference on Learning Representations (ICLR)},
    year={2021}
}

@article{hendrycks2021ethics,
    title={Aligning AI With Shared Human Values},
    author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
    journal={Proceedings of the International Conference on Learning Representations (ICLR)},
    year={2021}
}
```

### Groups, Tags, and Tasks

#### Groups

* `mmlu`: `Original multiple-choice MMLU benchmark`
* `mmlu_continuation`: `MMLU but with continuation prompts`
* `mmlu_generation`: `MMLU variant in which the model generates the answer letter`

`mmlu` is the original benchmark as implemented by Hendrycks et al., with the choices in context and the answer letters (e.g. `A`, `B`, `C`, `D`) as the continuation.
`mmlu_continuation` is a cloze-style variant: the choices are omitted from the context and the full text of the answer choice is the continuation.
`mmlu_generation` is a generative variant, similar to the original except that the model is asked to generate the correct answer letter. (Usage sketches for these groups and subgroups are given at the end of this README.)

#### Subgroups

* `mmlu_stem`
* `mmlu_humanities`
* `mmlu_social_sciences`
* `mmlu_other`

Subgroup variants are prefixed with the subgroup name, e.g. `mmlu_stem_continuation`.

### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
  * [x] Have you referenced the original paper that introduced the task?
  * [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?

If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?

### Changelog

* ver 1 (PR #497): switch to the original implementation.
* ver 2 (PR #2116): add a missing newline in the description.
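
As a minimal usage sketch for the three groups above, via the harness's Python API. This assumes a recent lm-evaluation-harness (v0.4+); the model name and the 5-shot setting are illustrative choices, not requirements of these tasks:

```python
# Minimal sketch: score the three MMLU variants via the harness's Python API.
# Assumptions: lm-evaluation-harness v0.4+ is installed; the model name and
# few-shot count below are illustrative, not prescribed by this README.
from lm_eval import simple_evaluate

out = simple_evaluate(
    model="hf",                                      # HuggingFace-backed models
    model_args="pretrained=EleutherAI/pythia-1.4b",  # any HF causal LM
    tasks=["mmlu", "mmlu_continuation", "mmlu_generation"],
    num_fewshot=5,                                   # MMLU is commonly reported 5-shot
)
for name, metrics in out["results"].items():
    print(name, metrics)
```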
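
A single subgroup variant can be scored the same way (same assumptions as above). Selecting the subgroup task directly avoids running the full 57-subject suite when only one slice is of interest:

```python
# Sketch: evaluate one subgroup variant only (same assumptions as above).
from lm_eval import simple_evaluate

out = simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-1.4b",
    tasks=["mmlu_stem_continuation"],  # subgroup name + variant suffix
)
print(out["results"]["mmlu_stem_continuation"])
```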