A project based on BERT to detect GLIBC vulnerabilities.
PwnBERT is a BERT-based vulnerability detection tool designed to identify and analyze Pwn-related vulnerabilities (e.g. UAF, heap overflow) in C code. By combining natural language processing techniques with security domain knowledge, this project aims to provide an efficient and reliable way for developers and security researchers to identify potential security risks and strengthen code security. In the future, we will try to apply PwnBERT to actual open-source projects and kernels to see whether it can detect vulnerabilities in real-world programs.
Generally speaking, PwnBERT is a tool that helps find and analyze vulnerabilities in C programs that attackers could exploit. It combines natural language processing with security expertise to make it easier for developers and security researchers to identify potential security risks and make code more secure. The goal is to create a reliable and efficient solution for identifying and preventing potential security threats.
Why should you pay attention to this project?
- It uses the OpenAI API (ChatGPT) to acquire its training set.
- It uses AI training to detect complex vulns in code, instead of identifying them via structural analysis.
- It's made by me, ALL BY MYSELF! (I know it does not sound like a good reason, but I will put it up here anyway :) )
Currently, we have finished the main fine-tuning of our BERT model, which enables PwnBERT to identify Off-by-one vulns (after all, we still decided to use DistilBert :( ), and the accuracy is quite satisfactory. Further optimization will follow.
CodeBERT is a state-of-the-art neural model for code representation learning. It is based on the Transformer architecture and is pre-trained on a large corpus of code. CodeBERT can be fine-tuned on various downstream tasks such as code classification, code retrieval, and code generation. By leveraging the pre-trained model, CodeBERT can effectively capture the semantic and syntactic information of code, which makes it a powerful tool for code analysis and understanding. In PwnBERT, we use CodeBERT to assist in identifying and analyzing Pwn-related vulnerabilities in C language.
In our project, we separated our plan into a few stages:
- Make the training set from a large amount of code (generated by the OpenAI API)
  - use elaborately designed prompts to generate specific code sections
  - fine-tune using those samples & CodeBERT
- Use data samples collected from websites like CVE to make a training set
  - collect a vuln list
  - retrieve real vuln code related to the vulns in the vuln list
  - .....
In this part we will use the OpenAI API to generate our training set, then fine-tune our BERT model using those samples. We chose this approach first because we were not sure whether we could collect enough data for training, or whether the data we collected would be related to the vulns we want to detect.
In this part, we send our prompt to ChatGPT through the OpenAI API, then extract the code from the response in collect_generated_code(amount_of_time). You can test our code by following these steps:
- $ touch config.py
  This creates a config file that main.py will use later.
- $ echo 'OPEN_AI_KEY = "#YOUR_API_KEY"' > config.py
  Change #YOUR_API_KEY to your OpenAI API key.
- $ python3 main.py
  This runs the Python file.
Running this file produces your training sets and eval sets. Remember to adjust generate_tokens() in the main function.
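The internals of collect_generated_code are not shown here. As a rough sketch of the extraction step, assuming ChatGPT wraps the generated C code in Markdown code fences, a helper like the one below could pull the code out of a reply. The name extract_code_blocks is hypothetical and not part of the project.

```python
import re
from typing import List

# A Markdown code fence, built from single backticks so it can be reused below.
FENCE = "`" * 3

# Fence, optional language tag, newline, then the shortest body up to the next fence.
CODE_BLOCK = re.compile(
    FENCE + r"[ \t]*[A-Za-z0-9_+-]*[ \t]*\n(.*?)" + FENCE,
    re.DOTALL,  # let "." match newlines so multi-line code is captured
)


def extract_code_blocks(reply: str) -> List[str]:
    """Return the body of every fenced code block in a chat reply (hypothetical helper)."""
    return [match.group(1).strip() for match in CODE_BLOCK.finditer(reply)]


if __name__ == "__main__":
    # A made-up reply that mimics a chat answer containing one C snippet.
    reply = (
        "Here is a sample:\n"
        + FENCE + "c\n"
        + "char buf[8];\nbuf[8] = 0;\n"
        + FENCE + "\nHope this helps."
    )
    for block in extract_code_blocks(reply):
        print(block)
```

Replies without any fenced block yield an empty list, so the caller can decide whether to retry the request or skip the sample.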
Due to some problems we have not solved yet, we decided to create train_v2.py for training.
Generally speaking, this is a Python script for fine-tuning the DistilBert model for sequence classification using PyTorch and the transformers library. The script defines a CodeDataset class that inherits from PyTorch's Dataset class and represents a dataset of code files. CodeDataset loads code files from two directories, one containing vulnerable code and the other containing non-vulnerable code, and preprocesses the code with DistilBertTokenizer to generate token IDs, attention masks, and labels.
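To illustrate the directory-loading part of such a dataset, here is a minimal standard-library sketch. It is not the actual CodeDataset: the function name load_labeled_samples and the label convention (1 for vulnerable, 0 for non-vulnerable) are assumptions, and the tokenization step is left out.

```python
from pathlib import Path
from typing import List, Tuple

# Assumed label convention; the real script may use different values.
VULNERABLE, NON_VULNERABLE = 1, 0


def load_labeled_samples(vuln_dir: str, nonvuln_dir: str) -> List[Tuple[str, int]]:
    """Read every file in the two directories and pair its text with a label."""
    samples: List[Tuple[str, int]] = []
    for directory, label in ((vuln_dir, VULNERABLE), (nonvuln_dir, NON_VULNERABLE)):
        # Sort for a stable order so repeated runs see the samples in the same sequence.
        for path in sorted(Path(directory).iterdir()):
            if path.is_file():
                text = path.read_text(encoding="utf-8", errors="replace")
                samples.append((text, label))
    return samples
```

In a Dataset subclass, __len__ would return the number of such pairs and __getitem__ would pass each text through the tokenizer to obtain the token IDs and attention mask alongside the label.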
The script then defines a compute_metrics function that calculates the accuracy of the model on the evaluation dataset. The main function of the script, finetune_pwnbert, loads the DistilBertForSequenceClassification model from the transformers library, initializes it with a specified number of output labels, and fine-tunes it on the training dataset. It takes four directory paths as input (the training and evaluation sets of vulnerable and non-vulnerable code) and saves the fine-tuned model and tokenizer in the specified output directory.
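An accuracy metric of this kind can be sketched in a few lines. The version below is an assumption about the shape of compute_metrics, not the project's actual code: it takes a (logits, labels) pair, picks the highest-scoring class per sample, and reports the share of correct predictions. It uses plain Python so it runs without extra dependencies; the real script may well use NumPy instead.

```python
from typing import Dict, Sequence, Tuple


def compute_metrics(
    eval_pred: Tuple[Sequence[Sequence[float]], Sequence[int]]
) -> Dict[str, float]:
    """Accuracy: fraction of samples whose highest-scoring class equals the label."""
    logits, labels = eval_pred
    # Index of the largest logit in each row is the predicted class.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    # Guard against an empty evaluation set.
    return {"accuracy": correct / len(labels) if len(labels) else 0.0}


if __name__ == "__main__":
    # Three made-up samples with two classes each; the last one is misclassified.
    logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.4]]
    labels = [1, 0, 0]
    print(compute_metrics((logits, labels)))
```

Accuracy alone can be misleading when the vulnerable and non-vulnerable sets differ a lot in size, so precision and recall may be worth adding in the same function.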
According to our test results, PwnBERT identifies vulnerabilities relatively effectively (average eval_loss = MISSING DATA HERE). However, because we are using BERT, PwnBERT can only identify Off-by-one vulns that look like the samples from the training set (we are not sure whether it actually comprehends them).
In this stage, we will focus on collecting real and more diverse training samples. For instance, we will use CVE as our main source of data and try to collect as much data as possible. We will also try to collect data from other sources, such as GitHub and Stack Overflow.
However, after tons of research and attempts, we found this absolutely amazing repo called juliet-test-suite-c, which basically saved our lives.
Basically, the Juliet Test Suite for C is a comprehensive suite of test cases designed to help developers and researchers evaluate the effectiveness of static analysis and other security-related tools for detecting, diagnosing, and mitigating software vulnerabilities in C programs. Developed by the National Security Agency (NSA) Center for Assured Software (CAS) and the U.S. Department of Homeland Security (DHS), the suite contains over 81,000 test cases covering a wide range of CWEs (Common Weakness Enumerations).
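Since the suite is organized by CWE, one way to pick out samples for a given weakness is to group test case files by the CWE number at the start of their names. The sketch below assumes file names begin with a prefix like CWE121_; the function name group_by_cwe and the example file names are illustrative only.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumed naming convention: the file name starts with "CWE<number>_".
CWE_PREFIX = re.compile(r"^CWE(\d+)_")


def group_by_cwe(file_names: Iterable[str]) -> Dict[int, List[str]]:
    """Map each CWE number to the file names that start with that CWE prefix."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for name in file_names:
        match = CWE_PREFIX.match(name)
        if match:  # silently skip helper files that carry no CWE prefix
            groups[int(match.group(1))].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Illustrative names only; check the repo for the real file list.
    names = [
        "CWE121_Stack_Based_Buffer_Overflow__CWE193_char_alloca_cpy_01.c",
        "CWE416_Use_After_Free__malloc_free_char_01.c",
        "io.c",
    ]
    for cwe, files in sorted(group_by_cwe(names).items()):
        print(cwe, files)
```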
Main updates after Mar 22, 2023:
- v1.1, Mar 20: Started to use concurrent.futures for acceleration purposes.
- v1.2, Mar 22: Created PwnBERT.py, made major adjustments to the directory structure (because I need to import them), fixed minor bugs, and added new stuff to generate_code_segments/ and tokenize_codes.
- v1.2.1: Fixed bugs that might significantly affect the code.
- v1.2.5(idk): Fixed Mega bug.
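For readers curious about the concurrent.futures change in v1.1, here is a minimal sketch of how sample generation could be spread over a thread pool. It is an assumption about the general approach, not the project's code: generate_in_parallel and fake_generate are hypothetical names, and fake_generate stands in for the real network call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def generate_in_parallel(
    generate_one: Callable[[int], str], amount: int, workers: int = 4
) -> List[str]:
    """Call generate_one(0), ..., generate_one(amount - 1) on a thread pool.

    pool.map returns results in input order, regardless of which call finishes first.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(generate_one, range(amount)))


def fake_generate(index: int) -> str:
    # Hypothetical stand-in for a request to the code generation API.
    return f"/* sample {index} */"


if __name__ == "__main__":
    for sample in generate_in_parallel(fake_generate, amount=3):
        print(sample)
```

Threads suit this kind of work because the time is spent waiting on network responses rather than on CPU-bound computation.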