You can view the live demo here.
- Overview
- Objective
- Insights from EDA
- Feature Engineering (Summary)
- Precision-Recall Dilemma
- Techniques for Class Imbalance
- Models and Evaluation Metrics
- Model Selection - Conclusion
- Resources and References
Classification, a fundamental task in machine learning, involves predicting categories or labels for data points. In this project, I'll be taking on a classification task that involves predicting whether an employee will be promoted based on a range of features. What makes this task fun and challenging is the additional layer of complexity introduced by a highly imbalanced dataset. My goal here is to learn and apply the best techniques for imbalanced datasets and compare how they impact relevant model metrics.
I have framed this as a real-world scenario in which my assignment is to develop a robust employee promotion prediction model while addressing the challenges an imbalanced dataset poses. To achieve this, I'll pursue the following goals:
- Exploratory Data Analysis (EDA): Conduct a thorough exploration of the dataset to uncover patterns, trends, and potential insights. The purpose of EDA is to understand the distribution of features, identify potential outliers, and gather preliminary information about feature relevance.
- Handling Imbalanced Data: Imbalanced datasets can lead to biased model outcomes, and common classification metrics such as accuracy might not be the best measure of performance on an imbalanced dataset. In the notebook, I explore various techniques to handle this imbalance, such as oversampling, undersampling, and using specialized algorithms designed for imbalanced data.
The dataset contains about 54,000 rows and 14 columns. The CSV file, along with the column description, can be found on Kaggle. Below are some of the insights gained after performing an EDA.
- As pointed out earlier, the data is highly imbalanced. Below is the proportion of employees who are promoted to those who are not.
- While males constitute a larger portion of the workforce (about 60%), 8.9% of females have earned promotions, which is higher than the 8.3% promotion rate among males.
- The `number_of_training` column contains ten unique values (1-10) representing an employee's total number of trainings. Part of my EDA was to see how this column relates to whether an employee gets promoted. The plot below shows the proportion of promoted employees across each unique `number_of_training` value. Moving from left to right, we observe that the proportion of promoted employees decreases as `number_of_training` increases - a negative correlation.
- A positive correlation was observed between the number of promoted employees and `previous_year_rating`.
- The violin plot below shows that promoted employees on average have a higher `average_training_score`. Another observation is that an `average_training_score` of 60 seems to act as a benchmark for promotion. This holds roughly true if we consider non-promoted employees: most non-promoted employees have an `average_training_score` of around 50, as shown by the larger width of the violin plot. To further back this up, only 25% of promoted employees have an `average_training_score` below 60.
- Outliers: I have decided not to drop 'outliers' in this case for two reasons. First, the dataset contains very few positive-class samples to begin with, and second, the ranges of values in the columns are not far apart. For example, all numeric columns in the data roughly range from 1 to 99.
- Missing Values: Only two columns contain missing values - `education` and `previous_year_rating`. All NaNs in `education` were replaced with the word 'unknown', and those in `previous_year_rating` were replaced with zero.
- Encoding: Columns in which order matters were ordinal-encoded, while the rest were one-hot-encoded. This is because, with ordinal encoding, ML algorithms will assume that two nearby values are more similar than two distant values. A minimal preprocessing sketch is shown below.
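To make the imputation and encoding steps concrete, here is a minimal sketch using pandas and scikit-learn. The file name, the `is_promoted` target, and the `department`/`gender` column choices are assumptions about the Kaggle schema, not the exact notebook code.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.read_csv("employee_promotion.csv")  # hypothetical file name

# Impute missing values as described above:
# 'unknown' for education, zero for previous_year_rating.
df["education"] = df["education"].fillna("unknown")
df["previous_year_rating"] = df["previous_year_rating"].fillna(0)

# Ordinal-encode columns where order matters (pass an explicit `categories`
# list to OrdinalEncoder to control that order); one-hot-encode nominal ones.
ordinal_cols = ["education"]
nominal_cols = ["department", "gender"]  # assumed nominal columns

preprocess = ColumnTransformer(
    transformers=[
        ("ordinal", OrdinalEncoder(), ordinal_cols),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), nominal_cols),
    ],
    remainder="passthrough",  # remaining (numeric) columns pass through unchanged
)

X = preprocess.fit_transform(df.drop(columns=["is_promoted"]))
y = df["is_promoted"]
```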
A crossroads I find myself at is choosing whether to optimize for precision or recall. A false negative (FN) in this task means predicting that an employee won't be promoted (negative) when in reality they are promoted (positive). A false positive (FP) means predicting that an employee will be promoted when in reality they are not. Optimizing for precision means we want to avoid false positives as much as possible, while optimizing for recall means we want to avoid false negatives as much as possible. Both have their pros and cons, and since there are no business metrics here, I have chosen to go with the F1 score, which combines precision and recall.
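For reference, the F1 score is the harmonic mean of the two: F1 = 2 * precision * recall / (precision + recall). A tiny sketch with scikit-learn on made-up labels (1 = promoted, 0 = not promoted):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels purely for illustration.
y_true = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1, 1, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```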
Below are all of the techniques applied in the notebook:
- Random Oversampling: Randomly duplicating examples from the minority class and adding them to the training dataset.
- SMOTE (Synthetic Minority Oversampling Technique): SMOTE works by selecting examples that are close in the feature space, drawing a line between them, and creating a new synthetic sample at a point along that line.
- SMOTE + Tomek Links: A combination of over-sampling the minority class and under-sampling the majority class.
- Class Weights: Instead of resampling, weight errors on the minority (positive) class more heavily during training to control the balance between the positive and negative classes.
I have decided not to include undersampling techniques such as random undersampling and undersampling with Tomek links, due to the drastic reduction in the number of samples and the resulting poor performance.
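As a rough illustration of how these techniques can be wired up, here is a minimal sketch using imbalanced-learn and scikit-learn. It assumes an already-encoded `X_train`/`y_train` split (placeholder names); resampling is applied to the training data only.

```python
from imblearn.combine import SMOTETomek
from imblearn.over_sampling import RandomOverSampler, SMOTE
from sklearn.ensemble import RandomForestClassifier

# 1. Random oversampling: duplicate minority-class rows.
X_ros, y_ros = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

# 2. SMOTE: synthesise new minority samples along lines between neighbours.
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X_train, y_train)

# 3. SMOTE + Tomek links: oversample the minority class, then remove
#    majority samples that form Tomek links near the class boundary.
X_smt, y_smt = SMOTETomek(random_state=42).fit_resample(X_train, y_train)

# 4. Class weights: no resampling; the model penalises minority-class
#    errors more heavily during training.
clf = RandomForestClassifier(class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)
```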
Three different algorithms were used in the notebook: Logistic Regression, Random Forest, and XGBoost. Below is a performance comparison of the base models (without any sampling technique):
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| XGBoost | 0.942438 | 0.887955 | 0.349119 | 0.501186 |
| Random Forest | 0.935504 | 0.817035 | 0.285242 | 0.422857 |
| Tuned Random Forest | 0.931034 | 0.969136 | 0.172907 | 0.293458 |
| Logistic Regression | 0.921091 | 0.586345 | 0.160793 | 0.252377 |
The performance of tree-based algorithms such as Random Forest, LightGBM, and especially XGBoost is generally less affected by imbalanced data due to their ensembling nature. This is not absolute, however; there are cases where they perform poorly on imbalanced data.
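For context, a baseline comparison along these lines can be produced with a simple loop. The sketch below assumes a prepared `X_train`/`X_test`/`y_train`/`y_test` split and near-default hyperparameters, not the exact notebook configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from xgboost import XGBClassifier

# Base models without any under/oversampling.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(
        f"{name:20s}"
        f" acc={accuracy_score(y_test, pred):.3f}"
        f" prec={precision_score(y_test, pred):.3f}"
        f" rec={recall_score(y_test, pred):.3f}"
        f" f1={f1_score(y_test, pred):.3f}"
    )
```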
I have decided to use Random Forest as my base comparison model in order to see how the different sampling techniques affect the relevant metrics. Below is how the different sampling techniques compare:
| Random Forest with | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Random Oversampling | 0.929028 | 0.626459 | 0.354626 | 0.452883 |
| No Under/Oversampling | 0.937694 | 0.841945 | 0.305066 | 0.447858 |
| SMOTE Oversampling | 0.933224 | 0.716749 | 0.320485 | 0.442922 |
| Class Weight | 0.937238 | 0.845912 | 0.296256 | 0.438825 |
| SMOTE + Tomek | 0.932129 | 0.705000 | 0.310573 | 0.431193 |
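One way such a comparison can be set up is to pair each sampler with the same Random Forest inside an imbalanced-learn `Pipeline`, so resampling is only applied when the pipeline is fit on training data (the class-weight variant is simply `RandomForestClassifier(class_weight="balanced")` with no sampler). This is a sketch with placeholder variable names and hyperparameters, not the notebook's exact code.

```python
from imblearn.combine import SMOTETomek
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

samplers = {
    "No Under/Oversampling": None,
    "Random Oversampling": RandomOverSampler(random_state=42),
    "SMOTE Oversampling": SMOTE(random_state=42),
    "SMOTE + Tomek": SMOTETomek(random_state=42),
}

for name, sampler in samplers.items():
    # Build the pipeline: optional sampler step, then the same Random Forest.
    steps = ([("sampler", sampler)] if sampler is not None else []) + [
        ("rf", RandomForestClassifier(random_state=42)),
    ]
    pipe = Pipeline(steps)
    pipe.fit(X_train, y_train)  # resampling happens only on the training data
    print(f"{name:25s} f1={f1_score(y_test, pipe.predict(X_test)):.3f}")
```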
Optimizing for precision (and thereby tolerating more false negatives) might result in:
- Missing out on promoting a deserving employee, which can lead to demotivation, decreased job satisfaction, and potentially higher turnover.
- Loss of talented employees who may seek better opportunities elsewhere.
Optimizing for recall (and thereby tolerating more false positives) might result in:
- Promoting an employee who isn't ready for the new role, which could lead to underperformance and dissatisfaction in the new position.
- Misallocation of resources, as promotions require investments in training and development.
My aim for this project is not tied to any particular business plan; rather, it is to gain experience with the different techniques used for imbalanced data.