This repo provides the official implementation for "On deceiving malware classification with section injection", available at: https://arxiv.org/abs/2208.06092
Clone the project.
git clone https://github.com/adeilsonsilva/malware-injection
Copy your datasets to the data directory, as the container will have a volume attached to it.
Run the following script to build and run the image:
./run.sh
If you run this script, you're all set to use the machine learning models. To use GIST, you're better off using a virtual environment (it requires some quirky and outdated libraries).
python3 -m venv .
source bin/activate
pip3 install -r gist-requirements.txt
# * Do whatever you want *
deactivate # quit from venv
This repo depends on pe-modifier as a git submodule. Remember to initialize it by using:
git submodule update --init
Our models were trained using TensorFlow GPU 2.3.0, which uses CUDA 10.1 [Source]. To proceed with installation:
- You can use nvidia-docker to run the provided container with host GPUs, assuming you have everything set up locally. Check out nvidia-docker from its source or install it using this script.
- You can use this script to install all NVIDIA drivers and CUDA 10.1 locally on your machine.
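Whichever route you take, it can help to confirm that TensorFlow actually sees a GPU before training. The snippet below is a minimal sketch, not part of this repository: the helper name visible_gpus is hypothetical, and it assumes TensorFlow 2.1 or later (where tf.config.list_physical_devices is available). It returns an empty list when TensorFlow is not installed.

```python
def visible_gpus():
    """Return the names of the GPUs TensorFlow can see, or [] if there are none."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return []
    # Available in TensorFlow 2.1 and later (hypothetical helper, not repo code).
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    print(f"{len(gpus)} GPU(s) visible: {gpus}")
```

An empty list inside the container usually means the container was started without access to the host GPUs or that the CUDA libraries do not match the TensorFlow build.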
If you don't want to use Docker (you should!), make sure to install the following libraries:
python3
python3-pip
libfftw3-3
Then proceed with python requirements to use the machine learning models:
cd code
pip3 install -r requirements.txt
If you're interested in using the GIST algorithm, install its dependencies:
cd code
pip3 install -r gist-requirements.txt
cd ../dependencies/pyleargist-2.0.5/
python3 setup.py build
python3 setup.py install --user
This project is structured as separate scripts. They are all in the code directory; change to it if you are not using the Docker container.
The main scripts for training and handling data are inside src:
├── src
│ ├── gen_dataset_npz.py # Converts an existing dataset to npz
│ ├── gen_headerless_dataset.py # Generate a headerless version of the dataset
│ ├── gen_injected_dataset_npz.py # Generates an injected dataset (.npz)
│ └── run_ml_model.py # Main script used for training/testing.
You can also browse the models directory to see the architectures used:
├── models
│ ├── Augmenter.py # Module with code used for data augmentation (data injection/reordering)
│ ├── Chen2018.py # Module with Inception architecture
│ ├── Data.py # Main data handler module with various wrappers
│ ├── Le2018.py # Module with cnn/lstm variations
│ ├── Nataraj2011.py # Module with KNN
To cite the paper, kindly use the following BibTeX entry:
@misc{Silva2022,
doi = {10.48550/ARXIV.2208.06092},
url = {https://arxiv.org/abs/2208.06092},
author = {da Silva, Adeilson Antonio and Segundo, Mauricio Pamplona},
keywords = {Cryptography and Security (cs.CR), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {On deceiving malware classification with section injection},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution Non Commercial No Derivatives 4.0 International}
}
dataset/
├── benign
│ ├── sample1.exe
│ ├── ...
│ └── sampleN.exe
└── malware
├── sample1.exe
├── ...
└── sampleN.exe
If you are not working on the binary problem, check that your families are in the allowed list.
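For the binary layout shown above, collecting the samples and their labels can be sketched as follows. This is an illustration only, not the loader used by the repository: the function name list_samples and the label mapping (benign = 0, malware = 1) are assumptions.

```python
from pathlib import Path

# Assumed label mapping for the binary problem; not taken from the repository.
LABELS = {"benign": 0, "malware": 1}


def list_samples(root):
    """Return (path, label) pairs for every file under root/benign and root/malware."""
    samples = []
    for class_name, label in LABELS.items():
        class_dir = Path(root) / class_name
        if not class_dir.is_dir():
            raise FileNotFoundError(f"missing class directory: {class_dir}")
        # Sort for a reproducible order across runs and file systems.
        for path in sorted(class_dir.iterdir()):
            if path.is_file():
                samples.append((path, label))
    return samples
```

Calling list_samples("dataset") on the tree above would yield every benign sample followed by every malware sample, each paired with its integer label.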
import cv2
import numpy as np

# Load as image
image = cv2.imread(path_img, cv2.IMREAD_GRAYSCALE)
image_reshaped = image.reshape(image.shape[0]*image.shape[1], 1)
# Note: cv2.resize interprets its size argument as (columns, rows)
image_final = cv2.resize(image_reshaped, (height, width))
# Load as exe
bin_stream = np.fromfile(path_exe, dtype='uint8')
bin_stream_reshaped = bin_stream.reshape(bin_stream.shape[0], 1)
bin_final = cv2.resize(bin_stream_reshaped, (height, width))
- Those methods may produce different results. np.fromfile is not adequate for opening PNG images: it reads the raw bytes of the file, not the decoded pixel values. Use it strictly for opening binary (or txt) files, as per its documentation.
- When converting exes to images using Nataraj's method, some bytes at the end of the file might be discarded, so if you load both an image and an exe using the methods above, their results after reshaping/resizing might not be the same.
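The discarded bytes mentioned above come from laying a byte stream out in fixed-width rows: whatever does not fill the last row is dropped. The sketch below illustrates this under simplifying assumptions. The helper bytes_to_rows is hypothetical, and the width is passed in explicitly, whereas Nataraj's method chooses the width from the file size.

```python
def bytes_to_rows(data, width):
    """Split a byte string into full rows of `width` bytes, dropping any remainder."""
    if width <= 0:
        raise ValueError("width must be positive")
    height = len(data) // width      # only complete rows are kept
    used = height * width
    rows = [data[i:i + width] for i in range(0, used, width)]
    discarded = len(data) - used     # trailing bytes that do not fill a row
    return rows, discarded


# 10 bytes at width 4 give 2 full rows; the last 2 bytes are discarded.
rows, discarded = bytes_to_rows(bytes(range(10)), width=4)
print(len(rows), discarded)  # prints: 2 2
```

An image produced this way holds fewer bytes than the original exe whenever the file size is not a multiple of the width, which is one reason the two loading paths can disagree.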
Copyright 2022 Adeilson Silva
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.