This is the repo for the paper *Multiple-Choice Questions are Efficient and Robust LLM Evaluators*.
In `data.tar.gz`, there are three folders and five jsonl files:
- `gsm8k-mc`
- `math-mc`
- `pythonio-mc`
- `gsm8k-test-candidates.jsonl`
- `gsm8k-train-candidates.jsonl`
- `math-test-candidates.jsonl`
- `math-train-candidates.jsonl`
- `pythonio-candidates.jsonl`
In each of the three folders, there is one `test.jsonl` and one `train.jsonl`, which are the multiple-choice questions used in our paper.
The other five jsonl files contain the complete candidate answer pool that we generated for each problem, represented as a list of strings:
```json
{
  "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
  "candidates": ["72", "84", "96", "292", "4896", "36", "60", "144", "48", "6", "1800", "30040"]
}
```
For all questions, the first candidate in the list is the ground-truth answer.
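Below is a minimal sketch of how the candidate files can be read and turned into a four-way multiple-choice item. The distractor sampling here is purely illustrative; it is not necessarily the construction procedure used to build the `-mc` datasets.

```python
# Illustrative only: read the candidate pool and assemble one four-way
# multiple-choice item. The distractor sampling below is an assumption,
# not the paper's exact construction procedure.
import json
import random

with open("gsm8k-test-candidates.jsonl") as f:
    records = [json.loads(line) for line in f]

record = records[0]
# The first candidate is the ground truth; the rest are distractors.
answer, distractors = record["candidates"][0], record["candidates"][1:]
options = [answer] + random.sample(distractors, 3)
random.shuffle(options)

print(record["question"])
for letter, option in zip("ABCD", options):
    print(f"{letter}. {option}")
print("Answer:", "ABCD"[options.index(answer)])
```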
To evaluate a model on one of the three datasets, run:

```bash
python run_mc.py --dataset gsm8k --model google/flan-t5-small
```
where the dataset can be one of `gsm8k`, `math`, and `pythonio`. The model argument can be a model name on Hugging Face or a local directory.
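For reference, here is a minimal sketch of how a single multiple-choice item might be scored with a Hugging Face seq2seq model such as flan-t5, by comparing the likelihood the model assigns to each option letter. This is one reasonable scoring scheme, not necessarily how `run_mc.py` implements evaluation.

```python
# Illustrative only: score one multiple-choice item with a seq2seq model by
# ranking the likelihood of each option letter. This is not necessarily the
# scoring scheme implemented in run_mc.py.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def answer_mc(question: str, options: dict[str, str]) -> str:
    """Return the option letter to which the model assigns the highest likelihood."""
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items())
    inputs = tokenizer(prompt, return_tensors="pt")
    scores = {}
    for letter in options:
        labels = tokenizer(letter, return_tensors="pt").input_ids
        with torch.no_grad():
            # The loss is the mean cross-entropy over the label tokens;
            # lower loss means the letter is more likely under the model.
            loss = model(**inputs, labels=labels).loss
        scores[letter] = -loss.item()
    return max(scores, key=scores.get)

print(answer_mc(
    "Natalia sold clips to 48 of her friends in April, and then she sold half "
    "as many clips in May. How many clips did Natalia sell altogether?",
    {"A": "72", "B": "84", "C": "96", "D": "36"},
))
```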
If you find our work useful, please cite:

```bibtex
@misc{zhang2024multiplechoice,
      title={Multiple-Choice Questions are Efficient and Robust LLM Evaluators},
      author={Ziyin Zhang and Lizhen Xu and Zhaokun Jiang and Hongkun Hao and Rui Wang},
      year={2024},
      eprint={2405.11966},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```