botnet_active_learning

This repository contains code used for experiments in my BSc final thesis, “Multi-class Classification of Botnet Detection by Active Learning.”

Thesis in a Nutshell

Labeling malware samples and network traffic is costly in the cybersecurity industry.
Active learning reduces this cost by letting the model choose which samples to label, so that effective ML models can be built from a limited amount of labeled data.
This thesis benchmarks well-known query strategies to determine which strategy and parameters achieve the best results with the fewest labeled samples.
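
As a rough illustration of the pool-based loop that a query strategy plugs into, here is a minimal sketch. It is not the thesis code: `run_active_learning`, `train`, `query`, and `oracle` are hypothetical names, and the 1-D threshold model in the usage lines exists only to make the example runnable.

```python
from typing import Callable, List, Sequence, Tuple

Sample = float
Label = int


def run_active_learning(
    labeled: List[Tuple[Sample, Label]],
    pool: List[Sample],
    train: Callable[[Sequence[Tuple[Sample, Label]]], object],
    query: Callable[[object, Sequence[Sample]], int],
    oracle: Callable[[Sample], Label],
    n_queries: int,
):
    """Generic pool-based loop: train, pick one sample, label it, repeat."""
    labeled = list(labeled)  # work on copies so the caller's lists stay intact
    pool = list(pool)
    model = train(labeled)
    for _ in range(n_queries):
        if not pool:
            break
        idx = query(model, pool)                  # strategy picks an index into the pool
        sample = pool.pop(idx)                    # remove it from the unlabeled pool
        labeled.append((sample, oracle(sample)))  # ask the annotator for a label
        model = train(labeled)                    # retrain on the enlarged labeled set
    return model, labeled, pool


# Toy usage (all values hypothetical): a 1-D threshold "model" and a query
# that picks the sample closest to the current threshold, which is a simple
# form of uncertainty sampling.
def toy_train(data):
    zeros = [x for x, y in data if y == 0]
    ones = [x for x, y in data if y == 1]
    return (max(zeros) + min(ones)) / 2  # midpoint between the two classes


def toy_query(threshold, pool):
    return min(range(len(pool)), key=lambda i: abs(pool[i] - threshold))


def toy_oracle(x):
    return int(x >= 5.0)  # stand-in for a human annotator


model, labeled, pool = run_active_learning(
    labeled=[(0.0, 0), (10.0, 1)],
    pool=[1.0, 4.0, 6.0, 9.0],
    train=toy_train,
    query=toy_query,
    oracle=toy_oracle,
    n_queries=2,
)
```

In this toy run the loop queries the two samples nearest the decision boundary (4.0 and 6.0) and leaves the uninformative ones (1.0 and 9.0) unlabeled, which is the behavior a query strategy is meant to produce.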

Figure 1. Cycle of Active Learning

AL_cycle

Figure 2. Uncertainty Sampling vs. Query by Committee vs. Random Sampling

US_QbC

Figure 3. Ranked Batch-mode Sampling vs. Random Sampling

Ranked

Conclusion

  • Among the benchmarked strategies, Margin Sampling performed best in terms of stability and convergence speed.
  • If multiple instances must be queried in each iteration, Ranked Batch-mode Sampling with a small unlabeled pool may perform well.
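
For readers unfamiliar with Margin Sampling, the selection rule can be sketched in a few lines. This is a standalone illustration, not the code used in the experiments: the function names and the probability values are hypothetical, and the input is assumed to be per-class predicted probabilities for each unlabeled sample.

```python
from typing import List, Sequence


def margin(probs: Sequence[float]) -> float:
    """Difference between the two highest class probabilities."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]


def margin_sampling(
    pool_probs: Sequence[Sequence[float]], n_instances: int = 1
) -> List[int]:
    """Return indices of the n_instances samples with the smallest margin."""
    ranked = sorted(range(len(pool_probs)), key=lambda i: margin(pool_probs[i]))
    return ranked[:n_instances]


# Hypothetical predicted probabilities: four unlabeled samples, three classes.
pool_probs = [
    [0.90, 0.05, 0.05],  # confident prediction, large margin
    [0.40, 0.35, 0.25],  # ambiguous, small margin
    [0.50, 0.30, 0.20],  # moderately confident
    [0.34, 0.33, 0.33],  # most ambiguous, smallest margin
]
selected = margin_sampling(pool_probs, n_instances=2)
```

Here `selected` holds the indices of the two samples whose top two classes are hardest to tell apart. Note that taking the k smallest margins is only a naive way to pick a batch; Ranked Batch-mode Sampling additionally accounts for how similar the candidates are to already-chosen instances, which this sketch does not attempt.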

Setup Instructions (for those who want to run the code)

Environment

To get started, clone the repository:

git clone https://github.com/kei5uke/botnet-active-learning.git

Then, change your current directory and install the dependencies:

cd active_learning
pip install -r requirements.txt

Next, download the MedBIoT and N-BaIoT datasets and store them in the /dataset directory.
The expected file structure is shown in that directory, so place the dataset files accordingly.
You can find the datasets here:

Dataset Pickles

Only a small portion of each dataset is used in the experiments.
Run python3 Make_df_MedBIoT.py and python3 Make_df_N-BaIoT.py to generate the dataset pickles.

Experiment

Adjust the common variables in global_variable.py and the shared variables in each file.
You are then ready to run any of the experiment code in the /active_learning directory.