This is the repo for the paper *Multiple-Choice Questions are Efficient and Robust LLM Evaluators*.
In `data.tar.gz`, there are three folders and five jsonl files:
- `gsm8k-mc`
- `math-mc`
- `pythonio-mc`
- `gsm8k-test-candidates.jsonl`
- `gsm8k-train-candidates.jsonl`
- `math-test-candidates.jsonl`
- `math-train-candidates.jsonl`
- `pythonio-candidates.jsonl`
In each of the three folders, there is one `test.jsonl` and one `train.jsonl`, which are the multiple-choice questions used in our paper.
The other five jsonl files contain the complete candidate answer pool that we generated for each problem, represented as a list of strings:
```json
{
  "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
  "candidates": ["72", "84", "96", "292", "4896", "36", "60", "144", "48", "6", "1800", "30040"]
}
```
For all questions, the first candidate in the list is the ground-truth answer.
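Below is a minimal sketch of how the candidate files can be read and turned into a four-way multiple-choice item. The distractor sampling here is purely illustrative; it is not necessarily the construction procedure used to build the `-mc` datasets.

```python
# Illustrative only: read the candidate pool and assemble one four-way
# multiple-choice item. The distractor sampling below is an assumption,
# not the paper's exact construction procedure.
import json
import random

with open("gsm8k-test-candidates.jsonl") as f:
    records = [json.loads(line) for line in f]

record = records[0]
# The first candidate is the ground truth; the rest are distractors.
answer, distractors = record["candidates"][0], record["candidates"][1:]
options = [answer] + random.sample(distractors, 3)
random.shuffle(options)

print(record["question"])
for letter, option in zip("ABCD", options):
    print(f"{letter}. {option}")
print("Answer:", "ABCD"[options.index(answer)])
```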
To evaluate a model on one of the three datasets, run:

```bash
python run_mc.py --dataset gsm8k --model google/flan-t5-small
```
where the dataset can be one of `gsm8k`, `math`, and `pythonio`. The model argument can be a model name on Hugging Face or a local directory.
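For reference, here is a minimal sketch of how a single multiple-choice item might be scored with a Hugging Face seq2seq model such as flan-t5, by comparing the likelihood the model assigns to each option letter. This is one reasonable scoring scheme, not necessarily how `run_mc.py` implements evaluation.

```python
# Illustrative only: score one multiple-choice item with a seq2seq model by
# ranking the likelihood of each option letter. This is not necessarily the
# scoring scheme implemented in run_mc.py.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def answer_mc(question: str, options: dict[str, str]) -> str:
    """Return the option letter to which the model assigns the highest likelihood."""
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items())
    inputs = tokenizer(prompt, return_tensors="pt")
    scores = {}
    for letter in options:
        labels = tokenizer(letter, return_tensors="pt").input_ids
        with torch.no_grad():
            # The loss is the mean cross-entropy over the label tokens;
            # lower loss means the letter is more likely under the model.
            loss = model(**inputs, labels=labels).loss
        scores[letter] = -loss.item()
    return max(scores, key=scores.get)

print(answer_mc(
    "Natalia sold clips to 48 of her friends in April, and then she sold half "
    "as many clips in May. How many clips did Natalia sell altogether?",
    {"A": "72", "B": "84", "C": "96", "D": "36"},
))
```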
If you find our work useful, please cite:

```bibtex
@misc{zhang2024multiplechoice,
      title={Multiple-Choice Questions are Efficient and Robust LLM Evaluators},
      author={Ziyin Zhang and Lizhen Xu and Zhaokun Jiang and Hongkun Hao and Rui Wang},
      year={2024},
      eprint={2405.11966},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```