Wednesday, May 22, 2019
8:03 PM
Abstract
This repository is a research journal for a Kaggle competition. By the time I was working on the problem, the competition had already finished.
The goal is to practice solving a data science problem with many variables (82 in this case) and figure out which ones are the most important.
In this research, an XGBoost classifier was used to build a model and predict whether or not a given computer will be infected by a virus. Then the "plot_importance" method was used to extract the conditions most essential for a PC to be hit by malware.
Data Description:
Dimensionality: 82 variables
Train dataset size: ~1M rows
Test dataset size:
Method
The ML method used in this exercise is a gradient-boosted decision tree classifier (XGBoost), with hyperparameters estimated using grid search.
Data Preparation
I started the research by getting an understanding of the data I would be working with:
The train dataset contains 82 variables; it is hard to explore each of them visually or compare them side by side, and a large set of pair plots would not be representative and would be difficult to compare with each other.
I can roughly separate all features into "categorical" and "numerical" variables to make the exploration more manageable.
Here are the top unbalanced "categorical" variables:
And top "numerical" variables (also filtered by parentage of missing values): Census_DeviceFamily, ProductName OsVer, Platform, Census_FlightRing, Census_OSArchitecture, and Processor are very unbalanced (more 90% are in the biggest bucket).
Among the categorical variables, MachineIdentifier seems to be useless for this analysis and can be dropped.
All highly unbalanced variables and variables with many missing values will be filtered out:
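Continuing the sketch above, the filter itself could look like this; the 90% and 30% cutoffs are assumptions for illustration:

```python
# Drop the identifier plus every variable whose biggest bucket holds more
# than 90% of the rows or whose missing share exceeds 30% (assumed cutoffs).
to_drop = {"MachineIdentifier"}
to_drop |= set(top_bucket_share[top_bucket_share > 0.9].index)
to_drop |= set(missing_share[missing_share > 0.3].index)

train = train.drop(columns=sorted(to_drop))
```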
To fit the XGBoost classifier, all remaining variables will be transformed into categories.
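A sketch of that transformation, assuming the target column is named HasDetections (the label in the original competition data):

```python
# Encode every remaining feature as a pandas category and keep the integer
# codes, which the XGBoost classifier can consume directly.
for col in train.columns:
    if col != "HasDetections":  # target column, left untouched
        train[col] = train[col].astype("category").cat.codes
```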
Results
A baseline score for the training dataset (using a dummy classifier) is AUC = 0.66.
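A sketch of how such a baseline can be computed; the "stratified" strategy and the 80/20 split are assumptions, since the journal does not record them:

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X = train.drop(columns=["HasDetections"])
y = train["HasDetections"]
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A dummy classifier ignores the features and predicts from the class
# distribution alone, giving a floor to compare the real model against.
dummy = DummyClassifier(strategy="stratified", random_state=42)
dummy.fit(X_train, y_train)
print(roc_auc_score(y_valid, dummy.predict_proba(X_valid)[:, 1]))
```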
Using grid search with cross-validation, the following parameters were found:
Best params: {'max_depth': 10, 'min_child_weight': 5, 'subsample': 0.8, 'colsample_bytree': 0.9, 'eta': 0.3, 'objective': 'binary:logistic', 'eval_metric': 'auc'}
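A sketch of the search; the candidate grids below are assumptions, only the winning values above come from the journal:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "max_depth": [6, 8, 10],
    "min_child_weight": [1, 5],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 0.9],
    "learning_rate": [0.1, 0.3],  # "eta" in native XGBoost terminology
}

search = GridSearchCV(
    XGBClassifier(objective="binary:logistic"),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
```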
AUC learning curve graph:
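A sketch of how such a curve can be produced with the 2019-era XGBoost API; n_estimators=200 is an assumption:

```python
import matplotlib.pyplot as plt
from xgboost import XGBClassifier

model = XGBClassifier(
    max_depth=10, min_child_weight=5, subsample=0.8, colsample_bytree=0.9,
    learning_rate=0.3, objective="binary:logistic", n_estimators=200,
)
model.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_valid, y_valid)],
    eval_metric="auc",  # passed to fit() in pre-1.6 XGBoost versions
    verbose=False,
)

# evals_result() stores per-round AUC for both sets - the learning curve.
history = model.evals_result()
plt.plot(history["validation_0"]["auc"], label="train")
plt.plot(history["validation_1"]["auc"], label="validation")
plt.xlabel("boosting round")
plt.ylabel("AUC")
plt.legend()
plt.show()
```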
Using the explored parameters in the final model, the following "importance plot" was produced:
According to the final model, the most important feature is AvSigVersion, which carries Windows Defender signature information. This makes sense, since antivirus quality is usually a key factor in preventing infection.
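The plot itself comes from XGBoost's built-in plot_importance; the importance_type="gain" choice below is an assumption (the library default is "weight"):

```python
import matplotlib.pyplot as plt
from xgboost import plot_importance

# Rank the top features of the fitted model by how much gain they contribute.
plot_importance(model, max_num_features=15, importance_type="gain")
plt.show()
```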
Conclusion
During this exercise, I tested different approaches to working with XGBoost: using sparse and dense matrices, excluding unbalanced and poorly represented variables, and performing feature selection with the XGBoost toolkit.
Even though I didn't achieve the original competition goal (evaluating the model on the test dataset), I found the results of my model meaningful and interpretable.