MalNet is Malware Detector that uses deep machine learning algorithms to classify windows executable to malware or benign. The model constructed using Keras with 2 hidden layers of fully connected neurons (250 neurons and 100 neurons respectively) and was trained by EMBER Dataset (600K files) and achieved 95% accuracy in the test set (200K files)
First install the requirements
pip install -r requirements.txt
There is already trained model in MalNet\TrainedModel
you can use it to classify PE files.
Python predictor.py
If you want to train the model yourself:
Process the EMBER dataset and generate vectorized training input
Python MalNet\features_extractor.py
The vectorized version of the features will be generated in MalNel\Vectors
.
To train the model and save the parameters of the model.
Python MalNet\train.py
The model will saved in MalNet\TrainedModel
.
The apis_extractor.py
script parses through the dataset and generates the most 1000 imported malicious Windows APIs by the malwares to be used in the features vector. The generated APIs are in MalNet/apis.txt.
The features_processing.py
script is class to extract and process the features from the dataset (JSON) or from the raw executable files (EXE/DLL/SYS)
The features_extractor.py
script used to generate a vectorized version of the JSON data set.
The train.py
script is used to train the model.
The predict.py
script is used to classify PE files.
The features vector consists of 1,347 columns from different features of PE:
- 1000 columns for the predefined malicious APIs (listed in MalNet/apis.txt).
- 256 columns for the entropy of each byte.
- 21 columns for
virtual size
,raw size
,virtual address
,entropy
andcharacteristics
of the sections.text
,.data
andrsrc
. - 22 columns for different values from
Optional Header
andFile Header
such assizeof_code
,sizeof_headers
,sizeof_heap_commit
,symbols
... - 7 columns for strings description in the executable such as number of strings, average length, number of URLs, number of file paths, number of registry keys...
- 30 columns for
virtual size
andvirtual address
of the 15 entries of Data Directories. - 11 columns for some extra strong indicators of malwares such as the entropy of the whole file, indicator for section that has both READ and EXECUTE attributes (indicates self-modifying code) or whether the entry point is not on .text or .CODE section (otherwise it's probably executable hijacking)