Malware hidden in Portable Document Format (PDF) files is a serious threat to the average Internet user: a PDF can execute deliberately embedded JavaScript, which gives attackers a way to conceal malicious scripts and data. While several machine learning-based models exist for PDF malware detection, the use of transformers to statically analyze PDFs for malware has not yet been explored. Thanks to their attention mechanisms and their ability to process data in parallel, transformers hold great potential for analyzing large quantities of data in detail without being excessively computationally demanding. By preprocessing PDFs as byte strings, generating meaningful word embeddings using one-hot encoding and variable n-grams, and feeding the results to a fine-tuned transformer model, we have produced a model that classifies a testing set of PDFs as malicious or benign with 96.67% accuracy. These results suggest that this is a feasible method of performing robust static analysis on PDF files. However, it is important to continue refining the current model and to explore additional ways of improving its accuracy and precision on a varied dataset.
This repository contains the scripts, models, and data related to this research project. The data used for this project comes from the CIC-Evasive-PDFMal2022 dataset, which can be requested here.
csv_generator.py: generates the relevant CSV file for training/validation data
demo.py: demonstrates model inferencing on a sample PDF
preprocessing.py: converts PDF to variable n-gram byte string
split.py: creates training/validation data split
train.py: runs model training on preprocessed data
val.py: runs model validation on inferences
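To make the preprocessing idea more concrete, here is a minimal sketch of turning raw bytes into n-gram tokens. It is not the code in preprocessing.py: the function names `byte_ngrams` and `pdf_to_ngram_string`, the hex rendering of each n-gram, and the reading of "variable n-gram" as "several n-gram sizes combined" are all assumptions made for illustration.

```python
from __future__ import annotations

from pathlib import Path


def byte_ngrams(data: bytes, n: int) -> list[str]:
    """Return the overlapping n-grams of `data`, each rendered as a hex string."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of n bytes across the data, one byte at a time.
    return [data[i:i + n].hex() for i in range(len(data) - n + 1)]


def pdf_to_ngram_string(path: str, sizes: tuple[int, ...] = (1, 2)) -> str:
    """Read a file as raw bytes and join its n-grams into one space-separated string.

    `sizes` is a hypothetical parameter: the n-gram lengths to combine.
    """
    data = Path(path).read_bytes()
    tokens: list[str] = []
    for n in sizes:
        tokens.extend(byte_ngrams(data, n))
    return " ".join(tokens)
```

Rendering bytes as hex keeps every token printable, so the resulting string can be stored in a CSV column and tokenized like ordinary text.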
In order to run the scripts within this repository, first set up a virtual environment using the following command.
python3 -m virtualenv venv
In order to activate the virtual environment, run the following command.
source venv/bin/activate
Once the virtual environment is activated, you can install all of the necessary dependencies using the following command.
pip install -r requirements.txt
Additionally, two directories named data and results should be placed at the root of the repository (these are included in the gitignore).
In order to run the demo, ensure that data/dummy.pdf (replace with whichever PDF you want to perform inferencing on) and results/model_weights.pth are placed correctly. To access a sample dummy.pdf, visit here, and to access the model, visit here. You can then run the following command to perform the demo.
python3 demo.py
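If the demo fails at startup, a misplaced file is a likely cause. The sketch below checks that the two files named above exist before demo.py is run. It is a hypothetical helper and not part of the repository: the name `missing_files` and the `REQUIRED_FILES` constant are assumptions.

```python
from __future__ import annotations

from pathlib import Path

# Paths taken from the instructions above, relative to the repository root.
REQUIRED_FILES = ("data/dummy.pdf", "results/model_weights.pth")


def missing_files(root: str = ".", required: tuple[str, ...] = REQUIRED_FILES) -> list[str]:
    """Return the required paths that do not exist as files under `root`."""
    root_path = Path(root)
    return [rel for rel in required if not (root_path / rel).is_file()]
```

Calling `missing_files()` from the repository root returns an empty list when both files are in place, and otherwise lists the paths that still need to be added.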
Preprocessing the CIC-Evasive-PDFMal2022 dataset takes a few stages. First, run split.py on the relevant zip files to generate the required training-validation split (90-10 is a recommended ratio). Use the following command.
python3 split.py -t 90
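For intuition, a 90-10 split can be sketched in a few lines of Python. This is not the logic of split.py, which is not reproduced here: the function name `split_files`, the `seed` parameter, and the reading of `-t 90` as "90 percent of files go to training" are assumptions.

```python
import random


def split_files(paths, train_percent=90, seed=0):
    """Shuffle `paths` reproducibly and split them into (training, validation) lists."""
    if not 0 <= train_percent <= 100:
        raise ValueError("train_percent must be between 0 and 100")
    # Sort first so the result depends only on the seed, not on input order.
    shuffled = sorted(paths)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]
```

Fixing the seed keeps the split stable between runs, which matters when the validation set is later used to compare models.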
Next, run csv_generator.py on both the training and validation datasets produced by the split in order to generate CSV files for them. Use the following command.
python3 csv_generator.py
This will output a training.csv and a testing.csv, which you can place in the data directory.
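As a rough picture of what such a CSV might hold, the sketch below writes one row per PDF with its preprocessed string and a label. The column names `text` and `label`, the function name `write_dataset_csv`, and the label convention (1 for malicious, 0 for benign) are assumptions for illustration; the actual layout produced by csv_generator.py may differ.

```python
import csv


def write_dataset_csv(rows, out_path):
    """Write (text, label) pairs to `out_path` with a header row; return the row count."""
    count = 0
    # newline="" lets the csv module control line endings itself.
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text", "label"])  # hypothetical column names
        for text, label in rows:
            writer.writerow([text, label])
            count += 1
    return count
```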
In order to run the training script, make sure that data/training.csv has been created (you can access a copy of the training data here). You can then begin training using the following command.
python3 train.py
In order to run the validation script, make sure that data/testing.csv has been created and that results/model_weights.pth is placed correctly (you can access a copy of the validation data here and a copy of the model here). You can then begin validation using the following command.
python3 val.py
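The abstract reports accuracy and mentions precision. For reference, both metrics can be computed from true and predicted labels as in the sketch below. This is not the code in val.py: the function name `accuracy_and_precision` and the convention that 1 marks a malicious PDF are assumptions.

```python
def accuracy_and_precision(y_true, y_pred, positive=1):
    """Return (accuracy, precision) for two equal-length sequences of labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one label is required")
    # Accuracy: fraction of all predictions that match the true label.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    # Precision: fraction of predicted positives that are truly positive.
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    predicted_pos = sum(1 for p in y_pred if p == positive)
    accuracy = correct / len(y_true)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    return accuracy, precision
```

Precision is worth tracking alongside accuracy here because a benign PDF flagged as malicious is a different kind of error from a missed malicious one.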
"NLP-based Malware Detection of PDFs" was developed by Benson Liu, Caolinn Hukill, Juliet Zhang, & Salma Alandary for ECE 188: Computer Security, taught at UCLA in Fall 2022. For any questions or additional information about this project, please contact the authors.