The dataset integrates 5941 security patches (multi-language, single-commit, and multi-commit patches).
Dataset | Patches | Commits | Language | Refs |
---|---|---|---|---|
CVE-Details | 2224 | 1816 | multi | 1 |
SecBench | 659 | 659 | multi | 2 |
SAP | 1127 | 565 | Java | 3 |
Big-Vul | 4047 | 3433 | C/C++ | 4 |
The vulnerabilities in the dataset span 146 different vulnerability types and 20 programming languages. See the paper for more details.
Dataset Schema
cve_id
: The Common Vulnerabilities and Exposures (CVE) identifier.
project
: GitHub project name.
sha
: Commit key or identifier of the version in the project repository.
cwe_id
: The Common Weakness Enumeration (CWE) identifier of the vulnerability.
score
: Severity score of the vulnerability.
files
: Set of files changed by the patch. Schema: {path: ..., additions: ..., deletions: ..., changes: ..., status: ...}
github
: Commit link.
parents
: Commit keys of the previous software version.
date
: Date of the changes.
author
: Author of the changes.
ext_files
: Extensions of the files.
lang
: Programming language.
summary
: Summary of the vulnerability.
message
: Commit message.
comments
: Developers' comments. Schema: {author: ..., date: ..., body: ...}
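As a rough illustration of the schema above, the sketch below models one record with Python dataclasses. The class names (`PatchRecord`, `FileChange`, `Comment`), the field types, and the `from_dict` helper are assumptions made for this example and are not part of the dataset. It also assumes a record has already been loaded as a plain dictionary, since the storage format is not described here. All sample values, including the commit URL, are placeholders rather than real data.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class FileChange:
    # Mirrors the `files` schema: {path, additions, deletions, changes, status}
    path: str
    additions: int
    deletions: int
    changes: int
    status: str


@dataclass
class Comment:
    # Mirrors the `comments` schema: {author, date, body}
    author: str
    date: str
    body: str


@dataclass
class PatchRecord:
    # Field names follow the schema; the types are assumptions.
    cve_id: str
    project: str
    sha: str
    cwe_id: str
    score: float
    files: List[FileChange]
    github: str
    parents: List[str]
    date: str
    author: str
    ext_files: List[str]
    lang: str
    summary: str
    message: str
    comments: List[Comment]

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "PatchRecord":
        # Convert the nested `files` and `comments` entries; copy the rest as-is.
        data = dict(raw)
        data["files"] = [FileChange(**f) for f in raw.get("files", [])]
        data["comments"] = [Comment(**c) for c in raw.get("comments", [])]
        return cls(**data)


# Placeholder record: every value below is made up for illustration.
sample = {
    "cve_id": "CVE-0000-0000",
    "project": "example/project",
    "sha": "0" * 40,
    "cwe_id": "CWE-79",
    "score": 4.3,
    "files": [
        {"path": "src/view.php", "additions": 3, "deletions": 1,
         "changes": 4, "status": "modified"},
    ],
    "github": "https://github.com/example/project/commit/" + "0" * 40,
    "parents": ["1" * 40],
    "date": "2020-01-01",
    "author": "example-author",
    "ext_files": ["php"],
    "lang": "PHP",
    "summary": "Placeholder summary.",
    "message": "Placeholder commit message.",
    "comments": [],
}

record = PatchRecord.from_dict(sample)
print(record.cve_id, record.lang, record.files[0].changes)
```

Keeping `files` and `comments` as their own small types makes the two nested schemas explicit, so a missing or misspelled key fails at load time instead of later during analysis.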
Programming Language
Language | Commits |
---|---|
C/C++ | 3944 |
Java | 1369 |
PHP | 1350 |
Visit the paper for more details about the other programming languages supported.
CWE/Weakness
CWE | Commits |
---|---|
CWE-79 | 870 |
CWE-20 | 712 |
CWE-119 | 705 |
CWE-200 | 419 |
CWE-125 | 380 |
Visit the paper for more details about the other weaknesses supported.
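Per-language and per-CWE counts like the ones in the tables above could be recomputed from loaded records along these lines. This is only a sketch: the `count_by` helper is hypothetical, records are assumed to be dictionaries keyed by the schema field names, and the three sample records are placeholders, not dataset entries.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def count_by(records: Iterable[Dict[str, str]], key: str) -> List[Tuple[str, int]]:
    """Count records per value of `key`, most frequent first."""
    # Records with a missing or empty value for `key` are skipped.
    counts = Counter(r[key] for r in records if r.get(key))
    return counts.most_common()


# Placeholder records (not real dataset entries); only the fields used here.
records = [
    {"sha": "a" * 40, "lang": "PHP", "cwe_id": "CWE-79"},
    {"sha": "b" * 40, "lang": "Java", "cwe_id": "CWE-79"},
    {"sha": "c" * 40, "lang": "PHP", "cwe_id": "CWE-20"},
]

by_lang = count_by(records, "lang")
by_cwe = count_by(records, "cwe_id")
print(by_lang)  # [('PHP', 2), ('Java', 1)]
print(by_cwe)   # [('CWE-79', 2), ('CWE-20', 1)]
```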