
VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation

[Paper] | [Huggingface]

VarBench is a new benchmark with dynamically valued variables, designed to mitigate benchmark data contamination. The following tasks are currently supported: GSM8K, CommonsenseQA, AI2 Reasoning Challenge (ARC), and TruthfulQA. We plan to extend VarBench to other complex tasks such as AGIEval and MMLU. To keep results comparable, we use EleutherAI's lm-evaluation-harness.

  1. Constructing VarBench
  2. Extracting Variables
  3. Alternative Perturbation
  4. Citation

Constructing VarBench

The default extracted variables are stored in ./gen_data/${dataset}/${split}_${dataset}_${model}.jsonl, where dataset is one of gsm8k, truthfulqa, csqa, or arc_challenge; split is one of test, dev, or validation (depending on the dataset); and model defaults to gpt4o in this work. In this section we use these files to construct new test sets.

Note: These files have been manually corrected and verified, so please avoid overwriting them.
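
As a concrete illustration of the naming scheme, the sketch below (not part of the repository) composes the default paths; note that the ARC-Challenge annotations are nested one directory deeper, under ./gen_data/arc/challenge/.

# Illustrative sketch only: how the default annotation paths are composed.
from pathlib import Path

def default_annotation_path(dataset: str, split: str, model: str = "gpt4o") -> Path:
    # ./gen_data/${dataset}/${split}_${dataset}_${model}.jsonl
    return Path("gen_data") / dataset / f"{split}_{dataset}_{model}.jsonl"

print(default_annotation_path("gsm8k", "test"))             # gen_data/gsm8k/test_gsm8k_gpt4o.jsonl
print(default_annotation_path("csqa", "dev"))               # gen_data/csqa/dev_csqa_gpt4o.jsonl
print(default_annotation_path("truthfulqa", "validation"))  # gen_data/truthfulqa/validation_truthfulqa_gpt4o.jsonl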

Constructing GSM+

First, we sample variable values. By selecting different random seeds, we can create a new set of values for the variables defined in the extraction step (see Extracting Variables below).

python sample.py \
    --data_path ./gen_data/gsm8k/test_gsm8k_gpt4o.jsonl \
    --save_dir ./gen_data/gsm8k/sample_42 \
    --task generate_test_set_gsm8k \
    --seed 42
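
To build several GSM+ variants at once, the same command can be repeated with different seeds. A minimal sketch (not repository code), assuming the CLI above stays unchanged:

# Sketch: generate multiple GSM+ variants by re-running sample.py with different seeds.
import subprocess

for seed in (0, 1, 2, 42):
    subprocess.run(
        [
            "python", "sample.py",
            "--data_path", "./gen_data/gsm8k/test_gsm8k_gpt4o.jsonl",
            "--save_dir", f"./gen_data/gsm8k/sample_{seed}",
            "--task", "generate_test_set_gsm8k",
            "--seed", str(seed),
        ],
        check=True,
    )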

Constructing TruthfulQA+

python sample.py \
    --data_path ./gen_data/truthfulqa/validation_truthfulqa_gpt4o.jsonl \
    --save_dir ./gen_data/truthfulqa/sample_42 \
    --task generate_dev_set_csqa \
    --seed 42

Constructing CommonsenseQA+ (CSQA+)

python sample.py \
    --data_path ./gen_data/csqa/dev_csqa_gpt4o.jsonl \
    --save_dir ./gen_data/csqa/sample_42 \
    --task generate_dev_set_csqa \
    --seed 42

Constructing ARC+

python sample.py \
    --data_path ./gen_data/arc/challenge/test_arc_challenge_gpt4o.jsonl \
    --save_dir ./gen_data/arc/challenge/sample_42 \
    --task generate_test_set_arc \
    --seed 42

Extracting Variables

This step is optional if you only wish to select new variable values for the benchmark.

The first step is to extract variables from the original GSM8K test set to create a delexicalized version, and to construct a code solution for each problem.

python generate.py \
    --model_name_or_path "gpt-4o" \
    --top_p 0.3 \
    --save_dir ./gen_data/gsm8k \
    --save_filename gsm8k_test_gpt4o.jsonl \
    --task generate_gsm8k

This step adds three components to each data point, under the keys variables, question_delex, and func, respectively.

Note that the save_filename differs from the default test_gsm8k_gpt4o.jsonl, to avoid overwriting the original annotation.
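
To sanity-check the output, you can inspect the first record of the generated file. A small sketch, assuming standard JSON Lines format (one JSON object per line):

# Sketch: confirm the extraction step added the expected keys.
import json

with open("./gen_data/gsm8k/gsm8k_test_gpt4o.jsonl") as f:
    record = json.loads(f.readline())

for key in ("variables", "question_delex", "func"):
    print(key, "present:", key in record)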

We then prompt GPT-4o again to create a value range for each variable. By default, the variables are loaded from gsm8k_test_gpt4o.jsonl, which was generated in the previous step.

python generate.py \
    --model_name_or_path "gpt-4o" \
    --top_p 0.3 \
    --save_dir ./gen_data/gsm8k \
    --save_filename gsm8k_test_gpt4o_range.jsonl \
    --task generate_input_range

This step adds one component to each data point, under the key input_range.
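
Conceptually, these fields are what make the perturbation possible: new values are sampled from input_range, substituted into question_delex, and passed to the code solution in func to recompute the gold answer. The sketch below is illustrative only; the exact key formats and the interface used by sample.py may differ.

# Conceptual sketch, not repository code. Assumptions:
#   - input_range maps each variable name to an integer (low, high) pair
#   - question_delex uses Python str.format placeholders such as "{x}"
#   - func is the source of a Python function named solution(**values)
import random

def build_perturbed_example(example: dict, seed: int = 42):
    rng = random.Random(seed)
    values = {name: rng.randint(lo, hi)
              for name, (lo, hi) in example["input_range"].items()}
    question = example["question_delex"].format(**values)
    namespace = {}
    exec(example["func"], namespace)          # defines the assumed solution() function
    answer = namespace["solution"](**values)  # recompute the gold answer
    return question, answer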

Alternative Perturbation

Alternative perturbations are created with the other_process.py script.

  • For gsm8k, we conduct paraphrasing:
python other_process.py \
	--task paraphrase_gsm8k \
	--save_dir ./gen_data/gsm8k/paraphrase/ \
	--save_filename "test.jsonl" \
	--seed 2
  • For arc and csqa, we conduct shuffling (see the sketch after this list):
python other_process.py \
	--task shuffle_arc \
	--save_dir "./gen_data/arc/challenge/shuffle/" \
	--save_filename "test.jsonl" \
	--seed 40

For csqa, however, we conduct shuffling during sampling by changing line 144 in sample.py to True.

  • For truthfulqa, we rewrite the questions by setting new_question on line 293 of sample.py to True.
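
If the shuffling for arc and csqa refers to permuting the answer choices (as is typical for multiple-choice perturbations), the core idea can be sketched as follows; this is illustrative, not the repository implementation:

# Illustrative sketch: shuffle the answer choices of a multiple-choice item
# and re-derive the index of the correct one.
import random

def shuffle_choices(choices: list, answer_index: int, seed: int = 40):
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    new_answer_index = order.index(answer_index)  # where the original correct choice landed
    return shuffled, new_answer_index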

Citation

If you use this benchmark or the accompanying code, please cite the following paper.

@inproceedings{qian2024varbench,
    title = "{VarBench}: Robust Language Model Benchmarking Through Dynamic Variable Perturbation",
    author = "Qian, Kun  and
      Wan, Shunji and
      Tang, Claudia and
      Wang, Youzhi and
      Zhang, Xuanming and
      Chen, Maximillian  and
      Yu, Zhou",
    booktitle = "arXiv preprint",
    month = jun,
    year = "2024",
}
