The code repair dataset is located in `data/code_repair_data.jsonl`. The fields of the data are explained below:
| Field | Explanation |
| --- | --- |
| id | The local ID in the task |
| src_uid | Unique identifier of the problem |
| description | The original problem description in natural language |
| input_specification | Description of the form of the input data |
| output_specification | Description of the form of the output data |
| sample_inputs | Sample inputs |
| sample_outputs | Sample outputs |
| notes | Additional notes for the problem |
| source_code | Buggy code submitted by a human |
| execute_outcome | The execution outcome of the buggy code |
| lang_cluster | The programming language the buggy code is written in |
| lang | The specific programming language version of the buggy code |
| difficulty | Difficulty of the problem |
| human_solution | Accepted human solution |
| testcases | List of test cases for the problem |
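The snippet below is a minimal sketch of loading the dataset and inspecting one record; the field names follow the table above, and the path assumes the default `data/code_repair_data.jsonl` location.

```python
import json

# Minimal sketch: load the code repair dataset and look at one record.
# Assumes the default location data/code_repair_data.jsonl described above.
records = []
with open("data/code_repair_data.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            records.append(json.loads(line))

example = records[0]
print(example["src_uid"], example["lang_cluster"], example["difficulty"])
print(example["source_code"][:200])  # first 200 characters of the buggy submission
```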
```
cd code_repair
```

- Install `python>=3.9` (we use `python==3.9`).
- Install `pytorch` (we use `pytorch==2.1.1`) based on your CUDA version.
- Install the remaining dependencies:

  ```
  pip install -r requirement.txt
  ```

- Install Perl:

  ```
  conda install -c conda-forge perl
  ```

  Validate the correctness of the installation:

  ```
  perl -v
  touch myscript.pl
  perl myscript.pl
  ```
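If you prefer to check the whole environment from Python, the sketch below is one informal way to do it; it assumes the versions listed above and a Perl interpreter on PATH, and only reports what it finds.

```python
import shutil
import subprocess
import sys

# Informal post-install sanity check (assumes python==3.9, pytorch==2.1.1, and perl on PATH).
print("python:", sys.version.split()[0])

try:
    import torch
    print("pytorch:", torch.__version__, "| cuda available:", torch.cuda.is_available())
except ImportError:
    print("pytorch is not installed")

if shutil.which("perl"):
    out = subprocess.run(["perl", "-v"], capture_output=True, text=True).stdout
    print(next((line for line in out.splitlines() if line.strip()), "perl -v produced no output"))
else:
    print("perl is not installed")
```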
Programs written in D and Delphi need to be run on Windows and require the following dependencies to be installed:

- Download dmd 2.105.0 for Windows and unzip it to a suitable location. Replace `d_path` in `run.py` accordingly.
- Download Delphi 7 and install it to a suitable location. Replace `delphi_path` in `run.py` accordingly.

Programs written in other languages need to be run using ExecEval (under the project root directory), and the following dependencies need to be installed:

- Install docker-ce.
- Build the ExecEval image:

  ```
  cd ExecEval
  docker build . -t exec-eval:1.0
  ```
Run the inference scripts to get the inference results of the targeted LLMs. The inference results `code_repair_result_{model_name}.jsonl` will be saved under the `inference/results` folder. The inference logs `code_repair_log_{model_name}.log` will be saved under the `inference/logs` folder.
We provide the following closed-source LLM inference scripts for you:

| Model Name | Model Version | Script Name |
| --- | --- | --- |
| PaLM 2 | text-bison-001 | run_palm2.py |
| GPT-4 | gpt-4-0613 | run_gpt.py |
| GPT-3.5 | gpt-3.5-turbo-0613 | run_gpt.py |
For PaLM 2, you can run the following command, replacing `your_palm_api_key` with your own Google API key.

```
python run_palm.py \
  --api_key your_palm_api_key \
  --data_load_name code_repair_data.jsonl \
  --candidate_num 1 \
  --result_save_name code_repair_run_palm.jsonl \
  --log_file_name code_repair_run_palm.log
```
For GPT-4 and GPT-3.5, you can run the following command, replacing `your_openai_apikey` with your own OpenAI API key and `model_specific_version` with the specific model version.

```
python run_gpt.py \
  --api_key your_openai_apikey \
  --model model_specific_version \
  --data_load_name code_repair_data.jsonl \
  --candidate_num 1 \
  --result_save_name code_repair_run_{model_name}.jsonl \
  --log_file_name code_repair_run_{model_name}.log
```
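If you want to see what such an inference call boils down to, the sketch below sends one buggy program to the OpenAI chat completions API and asks for a fixed version. It is only an illustration, not the contents of `run_gpt.py`; the prompt wording and the `repair_code` helper are assumptions.

```python
import json
from openai import OpenAI

# Illustrative sketch only (not the actual run_gpt.py): ask a chat model to
# repair one buggy submission from the dataset. The prompt wording is an assumption.
client = OpenAI(api_key="your_openai_apikey")

def repair_code(record, model="gpt-3.5-turbo-0613"):
    prompt = (
        f"Problem description:\n{record['description']}\n\n"
        f"Buggy {record['lang_cluster']} code:\n{record['source_code']}\n\n"
        "Please return a corrected version of the code."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

with open("data/code_repair_data.jsonl", "r", encoding="utf-8") as f:
    record = json.loads(f.readline())
print(repair_code(record))
```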
We provide the following open-source LLM inference scripts for you:

| Model Name | Model Checkpoint | Script Name |
| --- | --- | --- |
| Code LLaMA | codellama/CodeLlama-34b-Instruct-hf | run_codellama.py |
| LLaMA 2 | meta-llama/Llama-2-70b-chat-hf | run_llama2.py |
| StarCoder | HuggingFaceH4/starchat-beta | run_starcoder.py |
| Vicuna | lmsys/vicuna-13b-v1.5-16k | run_vicuna.py |
| WizardCoder | WizardLM/WizardCoder-15B-V1.0 | run_wizardcoder.py |
For HuggingFace models, you can run the following command, replacing `access_token` with your own HuggingFace access token, `cache_dir` with the path to a directory in which the downloaded pretrained model and tokenizer should be cached, and `your_model_ckpt` with the specific model checkpoint.

```
python run_{model_name}.py \
  --access_token access_token \
  --cache_dir cache_dir \
  --checkpoint your_model_ckpt \
  --data_load_name code_repair_data.jsonl \
  --candidate_num 1 \
  --result_save_name code_repair_run_{model_name}.jsonl \
  --log_file_name code_repair_run_{model_name}.log
```
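For the open-source models, the sketch below shows roughly what such a script does with the `transformers` library: load a checkpoint, build a repair prompt, and generate. It is an illustration rather than the repository's `run_{model_name}.py`; the checkpoint, cache directory, token placeholder, and prompt wording are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch only (not one of the run_{model_name}.py scripts):
# load an open-source checkpoint and generate a repair for one buggy snippet.
# The checkpoint, cache_dir, access token placeholder, and prompt are assumptions.
checkpoint = "codellama/CodeLlama-34b-Instruct-hf"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, token="your_hf_access_token", cache_dir="cache_dir")
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    token="your_hf_access_token",
    cache_dir="cache_dir",
    torch_dtype=torch.float16,
    device_map="auto",
)

buggy_code = "print(int(input()) + int(input())"  # missing closing parenthesis
prompt = f"Fix the following Python code and return only the corrected code:\n{buggy_code}\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```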
The code ready for testing should be stored line by line in `your_codes.jsonl`, and the file should be placed in `your_codes_dir`. A typical code record is shown below and should contain at least the following keys:

```
{
  "lang_cluster": "{lang_cluster}",
  "lang": "{lang}",
  "source_code": "{source_code}",
  "src_uid": "{src_uid}",
  "difficulty": 800,
  "testcases": "[{'input': 'input1', 'output': ['output1']}, {'input': 'input2', 'output': ['output2']}]"
}
```
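One way to produce `your_codes.jsonl` is to copy these keys over from your inference results and substitute the model's repaired code for `source_code`. The sketch below does that; the input file name and the `repaired_code` field it reads are assumptions, so adapt them to however your inference results store the generated code.

```python
import json

# Minimal sketch: build your_codes.jsonl from one inference results file.
# The input path and the "repaired_code" field name are assumptions.
with open("inference/results/code_repair_result_gpt4.jsonl", "r", encoding="utf-8") as fin, \
     open("your_codes_dir/your_codes.jsonl", "w", encoding="utf-8") as fout:
    for line in fin:
        result = json.loads(line)
        record = {
            "lang_cluster": result["lang_cluster"],
            "lang": result["lang"],
            "source_code": result["repaired_code"],  # the model's fixed code (assumed field name)
            "src_uid": result["src_uid"],
            "difficulty": result["difficulty"],
            "testcases": result["testcases"],
        }
        fout.write(json.dumps(record) + "\n")
```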
- For all programming languages except Perl, D, and Delphi, an example of the most typical usage is:

  - Start the ExecEval server:

    ```
    docker run -it -p x:y -e NUM_WORKERS=n exec-eval:1.0
    ```

    This exposes port y (default 5000), which is used within the docker container and can be set by the environment variable GUNICORN_PORT, as http://localhost:x on the local machine. It is recommended not to use all CPUs: if the CPUs reach 100% load, the execution speed of the submitted code may become unpredictable, and some CPUs should be kept free for the evaluation script. A valid example assuming fewer CPUs are available: `docker run -it -p 5000:5000 -e NUM_WORKERS=5 exec-eval:1.0`

  - Run the execution script:

    ```
    python run_execeval.py --codes_dir your_codes_dir --results_dir your_results_dir --code_filename your_codes.jsonl
    ```

    The results of the run are written to `your_results_dir` as a jsonl file that mirrors the input jsonl, with each entry augmented by the result of each test case run, stored in the `testcases` field (see the sketch after this list).

- For Perl, D, and Delphi, an example of the most typical usage is:

  ```
  python run.py --code_path your_codes_{program_language}.jsonl --output_path result/results.json --cmd_path your_cmd_path
  ```

  Please point `--code_path` at your Perl/D/Delphi code files. The execution results are saved to `--output_path`, which records `accepted`, `wrong`, and `error` results for each key, and each output records the possible error outputs and the type of error.
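The sketch below peeks at one record in `your_results_dir` to see how the test case entries are augmented after execution. The exact per-testcase verdict key (an ExecEval-style `exec_outcome` such as PASSED) is an assumption; inspect one record and adjust the key names if they differ.

```python
import ast
import json

# Peek at one execution output record. The "exec_outcome" verdict key is an
# assumption about the ExecEval result schema; adjust if your records differ.
with open("your_results_dir/your_codes.jsonl", "r", encoding="utf-8") as f:
    record = json.loads(f.readline())

testcases = record["testcases"]
if isinstance(testcases, str):  # the field may be stored as a string, as in the example record above
    testcases = ast.literal_eval(testcases)

for i, tc in enumerate(testcases):
    print(f"testcase {i}: input={tc['input']!r} -> {tc.get('exec_outcome', tc)}")
```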
After the execution, we provide a scorer script to count the number of correct solutions across different languages and difficulties. Please put all your execution results into `--result_dir`, including the D/Perl/Delphi results and the rest. Then run the following command to count the results generated by `{model_name}`:

```
python score_code_repair.py --result_dir your_result_dir --model_name model_name
```
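If you want a quick, informal breakdown before running `score_code_repair.py`, the sketch below groups accepted solutions by language and difficulty. It is not the scorer script; as above, the `exec_outcome` verdict key is an assumption about the result schema.

```python
import ast
import json
from collections import defaultdict

# Informal tally of accepted solutions per (language, difficulty).
# Not score_code_repair.py; the "exec_outcome" key is an assumed schema detail.
def is_accepted(record):
    testcases = record["testcases"]
    if isinstance(testcases, str):
        testcases = ast.literal_eval(testcases)
    return all(tc.get("exec_outcome") == "PASSED" for tc in testcases)

stats = defaultdict(lambda: [0, 0])  # (lang_cluster, difficulty) -> [accepted, total]
with open("your_results_dir/your_codes.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        key = (record["lang_cluster"], record["difficulty"])
        stats[key][1] += 1
        stats[key][0] += int(is_accepted(record))

for (lang, difficulty), (accepted, total) in sorted(stats.items()):
    print(f"{lang:10s} difficulty {difficulty}: {accepted}/{total} accepted")
```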