This repository contains the data and code for the paper Execution-Based Evaluation for Open-Domain Code Generation.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
If you find our paper or code useful, please cite:
@article{wang2022execution,
title={Execution-Based Evaluation for Open-Domain Code Generation},
author={Wang, Zhiruo and Zhou, Shuyan and Fried, Daniel and Neubig, Graham},
journal={arXiv preprint arXiv:2212.10481},
year={2022}
}
To install the required dependencies, run:
pip install -r requirements.txt
We split the dataset by the natural language of the corresponding intent.
.
├── README.md
├── data
│ ├── en_test.jsonl
│ ├── es_test.jsonl
│ ├── ja_test.jsonl
│ └── ru_test.jsonl
Each line contains a serialized JSON object; an example looks like:
{
'task_id': 3844801,
'intent': "check if all elements in list `myList` are identical",
'prompt': "def f_3844801(myList):\n\treturn ",
'canonical_solution': "all(x == myList[0] for x in myList)",
'suffix': "",
'test_start': "\ndef check(candidate):",
'test': [
"\n assert candidate([1,2,3]) == False\n",
"\n assert candidate([1,1,1,1,1,1]) == True\n",
"\n assert candidate([1]) == True\n",
"\n assert candidate(['k','k','k','k','k']) == True\n",
"\n assert candidate([None,'%$#ga',3]) == False\n"
],
'entry_point': "f_3844801",
}
where:
- task_id: the ID of the original StackOverflow post from which the sample is constructed;
- intent: the natural language description, rewritten by human annotators to be sufficiently specific;
- prompt: the function prefix (definition, input arguments, etc.) needed to properly execute the code snippet;
- canonical_solution: the reference solution to the coding problem, verified by human annotators;
- suffix: the function suffix (return values, if any) needed to properly execute the code;
- test_start: the definition of the test function, including library imports if required by the program;
- test: the list of test cases created by human annotators;
- entry_point: the name of the function to be tested by check during evaluation.
To correctly execute the (canonical) code snippets, one needs to install all involved libraries, as listed in the ./library/ directory.
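For illustration, the following minimal Python sketch loads one example from the English split and runs its canonical solution against its test cases by concatenating the fields above. The bare exec call is only illustrative; the official eval_passk.py presumably adds timeouts and error handling.

import json

# Load the English test split.
with open("data/en_test.jsonl") as f:
    examples = [json.loads(line) for line in f]

ex = examples[0]
# Assemble an executable program: function prefix + solution + suffix,
# followed by the test harness and a call to check(entry_point).
program = (
    ex["prompt"] + ex["canonical_solution"] + ex["suffix"]
    + ex["test_start"] + "".join(ex["test"])
    + "\ncheck({})\n".format(ex["entry_point"])
)
exec(program)  # raises AssertionError if any test case fails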
We provide code to evaluate two state-of-the-art code generation models, Codex and CodeGen. To perform the NL-to-Code generation task and collect model predictions:
For Codex, run
python nl2code_codex.py --language en \
--model_name "code-davinci-002" \
--openai_api_key ${YOUR_API_KEY}
Change the model_name argument to "code-cushman-001" or "code-davinci-001" to try other model variants.
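For reference, nl2code_codex.py relies on the OpenAI completion API; a minimal sketch of such a request with the legacy openai Python package (the decoding parameters and stop sequences below are illustrative, not the script's actual settings) looks like:

import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]  # or pass --openai_api_key to the script
prompt = "def f_3844801(myList):\n\treturn "
response = openai.Completion.create(
    model="code-davinci-002",
    prompt=prompt,
    max_tokens=128,
    temperature=0.2,
    stop=["\ndef ", "\nclass ", "\nif __name__"],
)
completion = response["choices"][0]["text"]
print(prompt + completion)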
For CodeGen, run
python nl2code_codegen.py --language en \
--model_size 350M --model_data mono
Other valid options for model_size include "2B", "6B", and "16B", which correspond to the 2.7B, 6.1B, and 16.1B CodeGen models. For model_data, other options include "multi" and "nl".
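For reference, the CodeGen checkpoints are also available on Hugging Face; a minimal sketch of sampling a completion with the 350M mono model via transformers (independent of nl2code_codegen.py, whose decoding settings may differ) is:

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "def f_3844801(myList):\n\treturn "
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(prompt + completion)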
Our default evaluation metric is the execution pass rate. Before evaluation, make sure your environment has all required libraries installed and that they can be imported as in the code samples. To do this, you can run:
pip install -r ./library/requirements.txt
python ./library/imports.py
Then, perform the execution-based evaluation by running:
python eval_passk.py --language en --prediction_path ${MODEL_PRED_PATH}
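Pass rates are commonly reported as pass@k; assuming the standard unbiased estimator from the Codex paper (Chen et al., 2021), it can be computed from n sampled completions per problem, c of which pass all tests:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n generated samples, c of which pass all test cases."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g., 10 sampled completions for a problem, 3 of which pass:
print(pass_at_k(n=10, c=3, k=1))  # 0.3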
We also support five other non-execution metrics: BLEU, ROUGE, METEOR, ChrF, and CodeBLEU. For example, to evaluate with the BLEU metric, run:
python eval_nonex.py --language en --prediction_path ${MODEL_PRED_PATH} --eval_metric bleu
Specify the eval_metric argument as "rouge", "meteor", "chrf", or "codebleu" to use the other metrics.
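As a rough illustration of such a surface-form metric (using NLTK with simple whitespace tokenization; eval_nonex.py may tokenize code differently), BLEU between a prediction and the canonical solution can be computed as follows:

from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

reference = "all(x == myList[0] for x in myList)".split()
hypothesis = "len(set(myList)) == 1".split()
score = sentence_bleu(
    [reference], hypothesis, smoothing_function=SmoothingFunction().method1
)
print(round(score, 4))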
To evaluate on the subset of open-domain or closed-domain samples, you only need to add one more argument at evaluation time (when running eval_passk.py or eval_nonex.py):
--library_usage "open" # or "closed"
To include more prompt-solution pairs for in-context learning, specify the num_examples argument at inference time (when running nl2code_codex.py or nl2code_codegen.py):
--num_examples 1 # 2, 3, ...
To add exemplar test cases in the prompt inputs, specify the num_tests argument at inference time:
--num_tests 1 # 2, 3, ...
To use a different number of test cases for execution-based evaluation, specify the num_tests_eval argument when running eval_passk.py, for example:
python eval_passk.py --language en --prediction_path ${MODEL_PRED_PATH} --num_tests_eval 1
Our paper explores three methods to create function names in the wrapping context:
- "id":
f_${ID}
, simple string formatting using the StackOverflow post ID - "constant":
function
, using the same string constant for all samples - "intent": heuristic-based extraction from the paired NL intent
To experiment with different function names, specify the function_name argument at inference time:
--function_name "intent" # "id" "constant"
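For the "intent" option, one illustrative heuristic (not necessarily the exact rule used in the paper) is to build the function name from the first few words of the intent:

import re

def intent_to_function_name(intent: str, max_words: int = 4) -> str:
    """Illustrative heuristic: join the first few alphanumeric words of the intent."""
    words = re.findall(r"[a-z0-9]+", intent.lower())
    return "_".join(words[:max_words]) or "function"

print(intent_to_function_name("check if all elements in list `myList` are identical"))
# -> check_if_all_elements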
We also provide code to compare execution-based and non-execution evaluation metrics on a sample-wise basis. Taking execution and the BLEU score as an example, one can run:
python metric_corr.py --language en \
--prediction_file ${MODEL_PRED_PATH} \
--eval_metric "bleu"
To get visualizations as violin plots or stacked histograms, add --do_plot_violin or --do_plot_stacked_hist.
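As a sketch of such a sample-wise comparison (metric_corr.py may compute a different statistic), one can correlate per-sample execution outcomes with per-sample BLEU scores, e.g. with SciPy:

from scipy.stats import spearmanr

# Hypothetical per-sample values: 0/1 execution outcomes and BLEU scores.
exec_pass = [1, 0, 1, 1, 0, 0]
bleu_scores = [0.62, 0.18, 0.40, 0.75, 0.30, 0.05]

rho, pval = spearmanr(exec_pass, bleu_scores)
print(f"Spearman rho = {rho:.3f} (p = {pval:.3f})")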