This project analyzes and classifies a real network traffic dataset to detect malicious and benign traffic records. It compares and tunes the performance of several Machine Learning algorithms to achieve the highest accuracy and the lowest false positive/negative rates.
The dataset used in this demo is: CTU-IoT-Malware-Capture-34-1.
- It is part of the Aposemat IoT-23 dataset.
- It is a labeled dataset of malicious and benign IoT network traffic.
- The dataset was created in the Avast AIC laboratory with funding from Avast Software.
The project is implemented in four distinct steps simulating the essential data processing and analysis phases.
- Each step is implemented in a corresponding notebook inside the notebooks directory.
- Intermediary data files are stored inside the data directory.
- Trained models are stored inside the models directory.
Corresponding notebook: initial-data-cleaning.ipynb
Implemented data exploration and cleaning tasks:
- Loading the raw dataset file into a pandas DataFrame.
- Exploring dataset summary and statistics.
- Fixing combined columns.
- Dropping irrelevant columns.
- Fixing unset values and validating data types.
- Checking the cleaned version of the dataset.
- Storing the cleaned dataset to a CSV file.
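The cleaning steps above can be sketched in pandas. A minimal sketch: the column names, the '-' unset-value marker, and the output path are illustrative assumptions here, not the notebook's actual schema.

```python
import numpy as np
import pandas as pd
from io import StringIO

# Tiny inline sample standing in for the raw log; in the notebook the data
# comes from the CTU-IoT-Malware-Capture-34-1 capture file instead.
raw = StringIO(
    "duration\torig_bytes\tlabel\n"
    "1.5\t100\tBenign\n"
    "-\t-\tMalicious\n"
)
df = pd.read_csv(raw, sep="\t")

# Replace the '-' placeholder used for unset values with NaN,
# then validate and fix the numeric dtypes.
df = df.replace("-", np.nan)
df["duration"] = pd.to_numeric(df["duration"])
df["orig_bytes"] = pd.to_numeric(df["orig_bytes"])

# Inspect the cleaned frame, then store it to a CSV file
# (the output path is an assumption).
print(df.dtypes)
# df.to_csv("data/cleaned.csv", index=False)
```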
Corresponding notebook: data-preprocessing.ipynb
Implemented data processing and transformation tasks:
- Loading the dataset file into a pandas DataFrame.
- Exploring dataset summary and statistics.
- Analyzing the target attribute.
- Encoding the target attribute using LabelEncoder.
- Handling outliers using the IQR (interquartile range) method.
- Handling missing values:
- Impute missing categorical features using KNeighborsClassifier.
- Impute missing numerical features using KNNImputer.
- Scaling numerical attributes using MinMaxScaler.
- Encoding categorical features: handling rare values and applying One-Hot Encoding.
- Checking the processed dataset and storing it to a CSV file.
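A compact sketch of the preprocessing pipeline on a toy frame. The column names and values are assumptions, and the IQR step shown clips outliers to the 1.5×IQR fences, which is one common variant of that technique.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Toy frame standing in for the cleaned dataset (column names are assumptions).
df = pd.DataFrame({
    "duration": [0.1, 0.2, np.nan, 0.3, 9.9],
    "proto": ["tcp", "udp", "tcp", "tcp", "udp"],
    "label": ["Benign", "Malicious", "Benign", "Malicious", "Benign"],
})

# Encode the target attribute.
df["label"] = LabelEncoder().fit_transform(df["label"])

# Handle outliers: clip numeric values to the 1.5*IQR fences.
q1, q3 = df["duration"].quantile([0.25, 0.75])
iqr = q3 - q1
df["duration"] = df["duration"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Impute missing numerical values with KNNImputer.
df[["duration"]] = KNNImputer(n_neighbors=2).fit_transform(df[["duration"]])

# Scale numerical attributes to [0, 1].
df[["duration"]] = MinMaxScaler().fit_transform(df[["duration"]])

# One-hot encode categorical features.
df = pd.get_dummies(df, columns=["proto"])
```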
Corresponding notebook: model-training.ipynb
Trained and analyzed classification models:
- Naive Bayes: ComplementNB
- Decision Tree: DecisionTreeClassifier
- Logistic Regression: LogisticRegression
- Random Forest: RandomForestClassifier
- Support Vector Classifier: SVC
- K-Nearest Neighbors: KNeighborsClassifier
- XGBoost: XGBClassifier
Evaluation method:
- Cross-Validation Technique: Stratified K-Folds Cross-Validator
- Number of folds: 5
- Shuffling: enabled
Results were analyzed and compared for each considered model.
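The evaluation setup above can be reproduced with scikit-learn's StratifiedKFold. In this sketch the synthetic data and the choice of DecisionTreeClassifier are stand-ins for the processed dataset and the seven models listed above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the processed dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Stratified 5-fold cross-validation with shuffling enabled.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=cv)

# One accuracy score per fold; the mean summarizes the model's performance.
print(scores.mean())
```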
Corresponding notebook: model-tuning.ipynb
Model tuning details:
- Tuned model: Support Vector Classifier - SVC
- Tuning method: GridSearchCV
- Results were analyzed and compared before and after tuning.
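A minimal GridSearchCV sketch for tuning the SVC. The parameter grid, scoring metric, and synthetic data are assumptions for illustration, not the notebook's actual search space.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic data standing in for the processed dataset.
X, y = make_classification(n_samples=150, n_features=8, random_state=0)

# Hypothetical search grid; the grid used in the notebook may differ.
param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}

# Exhaustive grid search, scored with the same stratified 5-fold CV setup.
search = GridSearchCV(
    SVC(),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X, y)

# Best hyperparameters and the corresponding cross-validated accuracy.
print(search.best_params_, search.best_score_)
```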