This is the repository for the paper "Constrained Decoding for Secure Code Generation".
🏠 Home Page • 🏆 Leaderboard • 📄 Paper
There is a disconnect between benchmarks for Code LLMs that evaluate security and those that assess correctness. Existing benchmarks such as HumanEval and MBPP only evaluate correctness, while others such as the Copilot dataset and SecurityEval only target security. To bridge this gap, we present CodeGuard+, along with two new metrics, to measure Code LLMs' ability to generate code that is both secure and correct. Currently, CodeGuard+ supports Python and C/C++, with 91 prompts covering 34 CWEs.
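As a rough illustration of what a joint security-and-correctness metric can look like, the sketch below applies a pass@k-style unbiased estimator that only credits generations passing both the unit tests and the security checks. This is an assumption about the metric's general form for illustration only; see the paper for the exact definitions of the two metrics.

```python
# Hedged sketch: a pass@k-style estimator that only counts generations that are
# BOTH functionally correct and free of security findings. Illustrative only;
# the paper defines the actual metrics.
from math import comb

def joint_pass_at_k(n: int, s: int, k: int) -> float:
    """n: generations per prompt, s: generations that are secure AND correct, k: selection budget."""
    if n - s < k:
        return 1.0
    return 1.0 - comb(n - s, k) / comb(n, k)

# Example: 10 generations per prompt, 4 of them both secure and correct, k = 1
print(joint_pass_at_k(10, 4, 1))  # 0.4
```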
The directory structure of this repository is as follows:
.
|-- data                    # All prompts
|   |-- base                # Basic prompts
|   |-- perturbed           # Perturbed prompts (pending)
|-- inference               # Code for inference
|-- unit_test               # Unit tests for each prompt
|   |-- CWE
|   |   |-- prompt
|   |   |   |-- functional.py   # Individual unit test
|-- requirements.txt        # Python packages needed by prompts and unit tests
Our benchmark CodeGuard+ is adapted from the Copilot dataset, SecurityEval, and the official CodeQL repository. It currently includes 91 prompts covering 34 CWEs, along with corresponding unit tests and CodeQL queries. You can find the prompts and CodeQL queries in data and the unit tests in unit_test.
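For example, here is a minimal sketch of how the unit tests could be run in bulk, assuming the unit_test/CWE/prompt/functional.py layout shown in the directory tree above (the loop below is illustrative and not part of the repository):

```python
# Hedged sketch: walk unit_test/ and run every functional.py with pytest.
# Assumes the unit_test/<CWE>/<prompt>/functional.py layout shown above.
import pathlib
import subprocess

for test_file in sorted(pathlib.Path("unit_test").rglob("functional.py")):
    print(f"Running {test_file} ...")
    subprocess.run(["python", "-m", "pytest", str(test_file)], check=False)
```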
pip install -r requirements.txt
bash setup_codeql.sh
- Verify Docker is installed;
- Run bash setup_sonar.sh;
- Once the Sonar server process starts, access the Sonar server GUI by opening http://{IP of your server}:9000 and log in (the default username and password are admin);
- Click the A near the top right corner of the webpage, and then click My Account;
- Click on Security and use the interface to create a new token of type User Token (see the sketch after this list for a command-line alternative);
- Modify sonar_eval.py with the token you just generated and your desired scan path.
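If you prefer to create the token from the command line instead of the web UI, a request like the following should work against a default local SonarQube instance (the server URL, credentials, and token name below are placeholders; adjust them to your setup):

```python
# Hedged sketch: generate a SonarQube user token via the Web API instead of the GUI.
# Assumes a default local server and the default admin credentials from the steps above.
import requests

SONAR_URL = "http://localhost:9000"   # replace with http://{IP of your server}:9000
resp = requests.post(
    f"{SONAR_URL}/api/user_tokens/generate",
    auth=("admin", "admin"),           # default credentials; change them after first login
    data={"name": "codeguard-token"},  # placeholder token name
)
resp.raise_for_status()
print(resp.json()["token"])            # paste this value into sonar_eval.py
```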
To run inference and evaluation, run the script eval.sh using the following command template:
bash eval.sh {model_dir} {output_name} {eval_type} {decoding_method}
One example of running inference and evaluation for Llama3-8B is:
bash eval.sh meta-llama/Meta-Llama-3-8B llama3-nucleus base nucleus
The default setting for inference is Nucleus Sampling with temperature=0.4 and top_p=0.95 to generate 10 programs. Please check generate.py, codeql_eval.py, sonar_eval.py, and correctness_eval.py for more details about arguments, customized inference, and customized evaluation.
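For reference, a minimal sketch of this default decoding configuration with Hugging Face Transformers is shown below. It illustrates the documented settings (temperature=0.4, top_p=0.95, 10 samples) but is not the exact code in generate.py; the prompt file path is a placeholder.

```python
# Hedged sketch of the default nucleus-sampling configuration described above.
# Not the exact code in generate.py; the prompt file path is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = open("data/base/example_prompt.py").read()  # placeholder prompt file
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    do_sample=True,           # nucleus sampling
    temperature=0.4,
    top_p=0.95,
    num_return_sequences=10,  # 10 programs per prompt
    max_new_tokens=512,
    pad_token_id=tokenizer.eos_token_id,
)
completions = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```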
This repository is still under construction. Thank you for your patience!
@article{fu2024constrained,
title={Constrained Decoding for Secure Code Generation},
author={Yanjun Fu and Ethan Baker and Yu Ding and Yizheng Chen},
year={2024},
journal={arXiv preprint arXiv:2405.00218}
}