The aim of this project is to apply unsupervised machine learning to intrusion analysis and detection on a network traffic dataset. The dataset was first cleaned and processed, PCA was applied for dimensionality reduction, and the KMeans algorithm was then used to perform clustering and analysis.
The dataset used in this demo is KDD Cup 1999 (SA), provided by the sklearn library:
- Since the original KDD Cup '99 dataset was created to produce a large training set for supervised learning algorithms, it contains a large proportion of abnormal data that is unrealistic in the real world and inappropriate for unsupervised anomaly detection.
- For this reason, we used the transformed SA version provided by sklearn, which is obtained by selecting all the normal data and a small proportion of abnormal data.
- The original dataset is labeled with a class attribute, but this label was ignored since we are dealing with an unsupervised machine learning problem.
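The SA subset described above can be fetched directly through sklearn. A minimal sketch (the `percent10` flag selects the reduced version of the corpus, which sklearn uses by default):

```python
from sklearn.datasets import fetch_kddcup99

# Fetch the SA subset: all normal traffic plus a small proportion of
# abnormal connections. Downloads the data on first use.
data = fetch_kddcup99(subset="SA", percent10=True)

X, y = data.data, data.target
print(X.shape)       # 41 features per connection record
print(y[:3])         # labels are byte strings, e.g. b'normal.'
```

The labels in `y` are kept only for reference; the analysis itself ignores them, as noted above.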
The project is implemented in three distinct steps simulating the essential data processing and analysis phases.
- Each step is represented in a corresponding notebook inside the `notebooks` directory.
- Intermediate data files are stored as outputs/inputs for each processing phase.
- The data files were not uploaded to this repository due to upload size constraints. However, the whole analysis is easily reproducible.
Corresponding notebook: data-cleaning.ipynb
Implemented data cleaning tasks:
- Loading the dataset from the sklearn library.
- Exploring dataset summary and statistics.
- Decoding byte string objects.
- Dropping irrelevant columns.
- Checking null values.
- Checking the cleaned version of the dataset.
- Storing the cleaned dataset to a CSV file.
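The cleaning steps above can be sketched as follows. This is a minimal illustration on a toy DataFrame standing in for the raw SA data (`num_outbound_cmds` is a constant column in the original dataset; the other column names are real KDD features used here for illustration):

```python
import pandas as pd

# Toy stand-in for the raw SA data: sklearn returns categorical values
# as byte strings (e.g. b'tcp'), which need decoding to plain str.
df = pd.DataFrame({
    "protocol_type": [b"tcp", b"udp", b"tcp"],
    "service": [b"http", b"domain_u", b"smtp"],
    "src_bytes": [181, 105, 239],
    "num_outbound_cmds": [0, 0, 0],  # constant: carries no information
})

# Decode byte-string columns to regular strings.
for col in df.columns:
    if df[col].dtype == object and isinstance(df[col].iloc[0], bytes):
        df[col] = df[col].str.decode("utf-8")

# Drop irrelevant (here: constant) columns.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)

# Check null values, then store the cleaned dataset.
assert df.isnull().sum().sum() == 0
df.to_csv("cleaned.csv", index=False)
```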
Corresponding notebook: data-preprocessing.ipynb
Implemented data processing and transformation tasks:
- Loading dataset file into pandas DataFrame.
- Exploring dataset summary and statistics.
- Exploring categorical features and combining less-frequent values.
- Encoding categorical features using One-Hot Encoding.
- Implementing data normalization using sklearn's StandardScaler.
- Checking the processed dataset and storing it to a CSV file.
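A condensed sketch of the preprocessing steps above, again on a toy frame standing in for the cleaned data (the rarity cutoff of 2 occurrences is an illustrative choice, not the project's actual threshold):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the cleaned dataset.
df = pd.DataFrame({
    "service": ["http", "http", "smtp", "ftp", "X11", "IRC"],
    "src_bytes": [181, 239, 105, 2032, 45, 12],
    "dst_bytes": [5450, 486, 146, 0, 0, 0],
})

# Combine less-frequent categorical values into an 'other' bucket so
# one-hot encoding does not explode into many near-empty columns.
counts = df["service"].value_counts()
rare = counts[counts < 2].index
df["service"] = df["service"].where(~df["service"].isin(rare), "other")

# One-hot encode the categorical feature.
df = pd.get_dummies(df, columns=["service"])

# Normalize numeric features to zero mean and unit variance.
num_cols = ["src_bytes", "dst_bytes"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

df.to_csv("processed.csv", index=False)
```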
Corresponding notebook: data-analysis.ipynb
Implemented data analysis tasks:
- Loading dataset file into pandas DataFrame.
- Implementing dimensionality reduction using Principal Component Analysis (PCA).
- Selecting the best k value for the KMeans algorithm using the Elbow Method.
- Implementing the KMeans algorithm for data clustering.
- Performing anomaly detection based on KMeans clusters.