
VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation

[Paper] | [Huggingface]

VarBench is a new benchmark with dynamically valued variables, designed to mitigate benchmark data contamination. The following tasks are currently supported: GSM8K, CommonsenseQA, AI2 Reasoning Challenge (ARC), and TruthfulQA. We plan to extend VarBench to other complex tasks such as AGIEval and MMLU. To keep results comparable, we use EleutherAI's lm-evaluation-harness.

  1. Constructing VarBench
  2. Extracting Variables
  3. Alternative Perturbation
  4. Citation

Constructing VarBench

The default extracted variables are stored in ./gen_data/${dataset}/${split}_${dataset}_${model}.jsonl, where dataset is one of gsm8k, truthfulqa, csqa, or arc_challenge; split is one of test, dev, or validation (depending on the dataset); and model defaults to gpt4o in this work. In this section we use these files to construct new test sets.

Note: These files have been manually corrected and verified, so please avoid overwriting them.
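
As a concrete illustration of the naming scheme, the sketch below (not part of the repository) composes the default paths; note that the ARC-Challenge annotations are nested one directory deeper, under ./gen_data/arc/challenge/.

# Illustrative sketch only: how the default annotation paths are composed.
from pathlib import Path

def default_annotation_path(dataset: str, split: str, model: str = "gpt4o") -> Path:
    # ./gen_data/${dataset}/${split}_${dataset}_${model}.jsonl
    return Path("gen_data") / dataset / f"{split}_{dataset}_{model}.jsonl"

print(default_annotation_path("gsm8k", "test"))             # gen_data/gsm8k/test_gsm8k_gpt4o.jsonl
print(default_annotation_path("csqa", "dev"))               # gen_data/csqa/dev_csqa_gpt4o.jsonl
print(default_annotation_path("truthfulqa", "validation"))  # gen_data/truthfulqa/validation_truthfulqa_gpt4o.jsonl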

Constructing GSM+

First, we sample variable values. By selecting different random seeds, we can create a new set of values for the variables defined in the extraction step (see Extracting Variables below).

python sample.py \
    --data_path ./gen_data/gsm8k/test_gsm8k_gpt4o.jsonl \
    --save_dir ./gen_data/gsm8k/sample_42 \
    --task generate_test_set_gsm8k \
    --seed 42
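
To build several GSM+ variants at once, the same command can be repeated with different seeds. A minimal sketch (not repository code), assuming the CLI above stays unchanged:

# Sketch: generate multiple GSM+ variants by re-running sample.py with different seeds.
import subprocess

for seed in (0, 1, 2, 42):
    subprocess.run(
        [
            "python", "sample.py",
            "--data_path", "./gen_data/gsm8k/test_gsm8k_gpt4o.jsonl",
            "--save_dir", f"./gen_data/gsm8k/sample_{seed}",
            "--task", "generate_test_set_gsm8k",
            "--seed", str(seed),
        ],
        check=True,
    )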

Constructing TruthfulQA+

python sample.py \
    --data_path ./gen_data/truthfulqa/validation_truthfulqa_gpt4o.jsonl \
    --save_dir ./gen_data/truthfulqa/sample_42 \
    --task generate_dev_set_csqa \
    --seed 42

Constructing CommonsenseQA+ (CSQA+)

python sample.py \
    --data_path ./gen_data/csqa/dev_csqa_gpt4o.jsonl \
    --save_dir ./gen_data/csqa/sample_42 \
    --task generate_dev_set_csqa \
    --seed 42

Constructing ARC+

python sample.py \
    --data_path ./gen_data/arc/challenge/test_arc_challenge_gpt4o.jsonl \
    --save_dir ./gen_data/arc/challenge/sample_42 \
    --task generate_test_set_arc \
    --seed 42

Extracting Variables

This step is optional if you only wish to select new variable values for the benchmark.

The first step is to extract variables from the original GSM8K test set to create a delexicalized version, and to construct a code solution for each problem.

python generate.py \
    --model_name_or_path "gpt-4o" \
    --top_p 0.3 \
    --save_dir ./gen_data/gsm8k \
    --save_filename gsm8k_test_gpt4o.jsonl \
    --task generate_gsm8k

This step adds three components to each data point, under the keys variables, question_delex, and func, respectively.

Note that the save_filename differs from the default test_gsm8k_gpt4o.jsonl, to avoid overwriting the original annotation.
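
To sanity-check the output, you can inspect the first record of the generated file. A small sketch, assuming standard JSON Lines format (one JSON object per line):

# Sketch: confirm the extraction step added the expected keys.
import json

with open("./gen_data/gsm8k/gsm8k_test_gpt4o.jsonl") as f:
    record = json.loads(f.readline())

for key in ("variables", "question_delex", "func"):
    print(key, "present:", key in record)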

We then prompt GPT-4o again to create a value range for each variable. By default, the variables are loaded from gsm8k_test_gpt4o.jsonl, which was generated in the previous step.

python generate.py \
    --model_name_or_path "gpt-4o" \
    --top_p 0.3 \
    --save_dir ./gen_data/gsm8k \
    --save_filename gsm8k_test_gpt4o_range.jsonl \
    --task generate_input_range

This step adds one component to each data point, under the key input_range.
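
Conceptually, these fields are what make the perturbation possible: new values are sampled from input_range, substituted into question_delex, and passed to the code solution in func to recompute the gold answer. The sketch below is illustrative only; the exact key formats and the interface used by sample.py may differ.

# Conceptual sketch, not repository code. Assumptions:
#   - input_range maps each variable name to an integer (low, high) pair
#   - question_delex uses Python str.format placeholders such as "{x}"
#   - func is the source of a Python function named solution(**values)
import random

def build_perturbed_example(example: dict, seed: int = 42):
    rng = random.Random(seed)
    values = {name: rng.randint(lo, hi)
              for name, (lo, hi) in example["input_range"].items()}
    question = example["question_delex"].format(**values)
    namespace = {}
    exec(example["func"], namespace)          # defines the assumed solution() function
    answer = namespace["solution"](**values)  # recompute the gold answer
    return question, answer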

Alternative Perturbation

Alternative perturbations are created with the other_process.py script.

  • For gsm8k, we conduct paraphrasing:
python other_process.py \
	--task paraphrase_gsm8k \
	--save_dir ./gen_data/gsm8k/paraphrase/ \
	--save_filename "test.jsonl" \
	--seed 2
  • For arc and csqa, we conduct shuffling (see the sketch after this list):
python other_process.py \
	--task shuffle_arc \
	--save_dir "./gen_data/arc/challenge/shuffle/" \
	--save_filename "test.jsonl" \
	--seed 40

For csqa, however, we conduct shuffling during sampling by changing line 144 in sample.py to True.

  • For truthfulqa, we rewrite the questions by setting new_question on line 293 of sample.py to True.
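
If the shuffling for arc and csqa refers to permuting the answer choices (as is typical for multiple-choice perturbations), the core idea can be sketched as follows; this is illustrative, not the repository implementation:

# Illustrative sketch: shuffle the answer choices of a multiple-choice item
# and re-derive the index of the correct one.
import random

def shuffle_choices(choices: list, answer_index: int, seed: int = 40):
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    new_answer_index = order.index(answer_index)  # where the original correct choice landed
    return shuffled, new_answer_index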

Citation

If you use this benchmark or the accompanying code, please cite the following paper.

@inproceedings{qian2024varbench,
    title = "{VarBench}: Robust Language Model Benchmarking Through Dynamic Variable Perturbation",
    author = "Qian, Kun  and
      Wan, Shunji and
      Tang, Claudia and
      Wang, Youzhi and
      Zhang, Xuanming and
      Chen, Maximillian  and
      Yu, Zhou",
    booktitle = "arXiv preprint",
    month = jun,
    year = "2024",
}
