Troubleshooting an Intrusion Detection Dataset: the CICIDS2017 Case Study

This repository contains the code used for our paper. The code performs the labelling and benchmarking for the CICIDS 2017 dataset after it has been processed by our modified version of the CICFlowMeter tool.

Note that all of this is research code.

If you use the code in this repository, please cite our paper:

        @inproceedings{engelen2021troubleshooting,
        title={Troubleshooting an Intrusion Detection Dataset: the CICIDS2017 Case Study},
        author={Engelen, Gints and Rimmer, Vera and Joosen, Wouter},
        booktitle={2021 IEEE Security and Privacy Workshops (SPW)},
        pages={7--12},
        year={2021},
        organization={IEEE}
        }

Extended documentation of our paper can be found here.

How to use this repository

First, head over to the website of the CICIDS 2017 dataset and download the raw version of the dataset (PCAP file format). There are 5 files in total, one for each day.

Then, run our modified version of the CICFlowMeter tool on the data obtained in the previous step:

Start the CICFlowMeter tool
Under the "NetWork" menu option, select "Offline"
For "Pcap dir", choose the directory containing the 5 PCAP files of the CICIDS 2017 dataset
For "Output dir", choose the "UnlabelledDataset" directory of this WTMC2021-Code project
Keep the default values for the "Flow TimeOut" and "Activity Timeout" parameters (120000000 and 5000000 respectively)

This will generate 5 CSV files with the flows extracted from the raw PCAP files.
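
Before moving on to labelling, it can help to confirm that the expected files are actually present. The following is a minimal sketch of such a check; the helper name `list_flow_csvs` is hypothetical and not part of this repository, and it assumes the CSV files sit directly inside the chosen output directory.

```python
from pathlib import Path


def list_flow_csvs(output_dir, expected_count=5):
    """Return the CSV files found in output_dir, sorted by name.

    Hypothetical helper (not part of this repository). Raises a
    RuntimeError if the number of CSV files differs from expected_count.
    """
    csv_files = sorted(Path(output_dir).glob("*.csv"))
    if len(csv_files) != expected_count:
        raise RuntimeError(
            f"Expected {expected_count} CSV files in {output_dir}, "
            f"found {len(csv_files)}"
        )
    return csv_files
```

Calling `list_flow_csvs("UnlabelledDataset")` from the project root should then return five paths, one per day of the dataset.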

After this, check that the TIME_DIFFERENCE, INPUT_DIR, OUTPUT_DIR and PAYLOAD_FILTER_ACTIVE variables in the labelling_CSV_flows.py script match your setup, and then run the script (no command-line options are needed). This will label all the flows in the CSV files generated by the CICFlowMeter tool.
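
To give a rough idea of what settings like these could control, here is a simplified sketch of time-window labelling. It is not the logic of labelling_CSV_flows.py: the flow field names, the attack-window structure, the offset value, and the rule that a payload-less attack flow becomes "Attempted" are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical stand-ins for the script's settings. The real values and
# their exact meaning are defined in labelling_CSV_flows.py.
TIME_DIFFERENCE = timedelta(hours=5)
PAYLOAD_FILTER_ACTIVE = True


def label_flow(flow, attack_windows):
    """Return a label for one flow dict based on time window and endpoints."""
    # Shift the flow timestamp so it is comparable with the attack schedule.
    start = flow["timestamp"] + TIME_DIFFERENCE
    for window in attack_windows:
        in_window = window["start"] <= start <= window["end"]
        same_hosts = (flow["src_ip"] == window["attacker_ip"]
                      and flow["dst_ip"] == window["victim_ip"])
        if in_window and same_hosts:
            # Assumed rule: an attack flow that carries no payload is
            # marked as "Attempted" when the payload filter is on.
            if PAYLOAD_FILTER_ACTIVE and flow["payload_bytes"] == 0:
                return window["label"] + " - Attempted"
            return window["label"]
    return "BENIGN"
```

In this sketch, a flow that falls outside every attack window, or that involves other hosts, is labelled BENIGN.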

Then, run the MakeDataNumpyFriendly.py script, which will convert the labelled CSV files into a single numpy array. Note that, in our experiments, we chose to relabel all "Attempted" flows as BENIGN. If you wish to keep them separate, make sure to change the numerical labels in the convertToNumericalLabels(flows_list_of_dict) function.
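
The effect of that choice can be illustrated with a small sketch. This is not the repository's convertToNumericalLabels function: the function name `to_numerical_labels`, the `keep_attempted` flag, and the assumption that attempted flows carry a label ending in "Attempted" are hypothetical.

```python
def to_numerical_labels(labels, keep_attempted=False):
    """Map string labels to integers, with BENIGN fixed at 0.

    Hypothetical sketch; the repository's own mapping lives in
    convertToNumericalLabels(flows_list_of_dict).
    """
    mapping = {"BENIGN": 0}
    numeric = []
    for label in labels:
        if not keep_attempted and label.endswith("Attempted"):
            label = "BENIGN"  # fold attempted flows into the benign class
        if label not in mapping:
            # Assign the next free integer to a label seen for the first time.
            mapping[label] = len(mapping)
        numeric.append(mapping[label])
    return numeric, mapping
```

With `keep_attempted=True`, each "Attempted" label receives its own integer instead of being merged into class 0.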

Finally, run the Benchmarking_RF.py script to perform benchmarking on the dataset using a Random Forest classifier. Random seeds and various other options can be specified in the first few lines of the script.
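
For readers unfamiliar with this kind of benchmark, the sketch below shows the general shape of a seeded Random Forest evaluation with scikit-learn. It is not the content of Benchmarking_RF.py: the name `RANDOM_SEED`, the 75/25 stratified split, the number of trees, the macro F1 metric, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

RANDOM_SEED = 0  # hypothetical name; fixes both the split and the forest


def benchmark_random_forest(features, labels, seed=RANDOM_SEED):
    """Train a Random Forest on a stratified split and return macro F1."""
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.25, stratify=labels, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(x_train, y_train)
    return f1_score(y_test, model.predict(x_test), average="macro")


# Synthetic stand-in data: two well-separated classes with 4 features each.
rng = np.random.default_rng(RANDOM_SEED)
features = np.vstack([rng.normal(0.0, 1.0, (200, 4)),
                      rng.normal(5.0, 1.0, (200, 4))])
labels = np.array([0] * 200 + [1] * 200)
score = benchmark_random_forest(features, labels)
```

Fixing the seed in both the split and the classifier makes repeated runs on the same data return the same score, which is what makes results comparable across experiments.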
