Data Source: Kaggle
The project aims to develop several Machine Learning models on the fictional "IBM HR Analytics Employee Attrition & Performance" dataset (1470 rows) and compare their predictions to find which one best predicts employee attrition.
Tools: Pandas, NumPy, Seaborn, Matplotlib, Scikit-Learn, TensorFlow, Keras
Because the models expect numeric input, the features (X) are split into two groups by data type. The categorical group (X_cat) is converted to numerical form with scikit-learn's OneHotEncoder, and the two groups are then concatenated back together.
Variables:
- Categorical (X_cat): every field except Attrition that has the object data type.
- Numerical (X_numerical): every field that has a numerical data type.
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder()
X_cat = onehotencoder.fit_transform(X_cat).toarray()
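The split, encode, and concatenate steps described above can be sketched end to end as follows. This is a minimal illustration, not the project's actual notebook code: the four-row table and its values are made up for the example, and selecting the categorical fields with `exclude="number"` is an assumption that stands in for "fields with the object data type".

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature table; the real dataset has 1470 rows and many more fields.
df = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "MonthlyIncome": [5993, 5130, 2090, 2909],
    "Department": ["Sales", "Research & Development", "Research & Development", "Sales"],
    "OverTime": ["Yes", "No", "Yes", "Yes"],
    "Attrition": ["Yes", "No", "Yes", "No"],
})

# Target: 1 if the employee left, 0 if they stayed.
y = (df["Attrition"] == "Yes").astype(int).to_numpy()
features = df.drop(columns=["Attrition"])

# Split the features by data type (assumption: every non-numeric field is categorical).
X_cat = features.select_dtypes(exclude="number")
X_numerical = features.select_dtypes(include="number")

# One-hot encode the categorical fields into a dense 0/1 array.
onehotencoder = OneHotEncoder()
X_cat_encoded = onehotencoder.fit_transform(X_cat).toarray()

# Concatenate both groups back into a single feature matrix.
X = np.concatenate([X_cat_encoded, X_numerical.to_numpy()], axis=1)
```

Here the two categorical fields each have two distinct values, so they expand into four 0/1 columns, and the final matrix has six columns in total.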
- Logistic Regression
- Random Forest
- Deep Learning Model
- Training: 1102 (75%)
- Test: 368 (25%)
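A 75/25 split with these row counts can be reproduced with scikit-learn's `train_test_split`. The sketch below uses random placeholder arrays of the same size as the dataset (1470 rows, 50 features); the `random_state` value is an assumption added only to make the example repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the dataset's dimensions (values are random, not real).
rng = np.random.default_rng(0)
X = rng.random((1470, 50))
y = rng.integers(0, 2, size=1470)

# Hold out 25% of the rows for testing; random_state is an illustrative choice.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(len(X_train), len(X_test))
```

scikit-learn rounds the test share up, so 25% of 1470 rows gives 368 test rows and leaves 1102 for training.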
- Logistic regression is best used to predict binary outputs with two possible values labeled "0" or "1".
- Logistic model output can be one of two classes: stayed/left, pass/fail, win/lose, etc.
- The logistic regression algorithm first applies a linear equation to the independent predictors to compute a value, then passes that value through the sigmoid function to obtain a probability between 0 and 1.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
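To make the "linear equation first" idea concrete, here is a small hand-rolled sketch of how a fitted logistic regression turns predictors into a class label. The weights, bias, and 0.5 threshold are hypothetical values chosen for illustration; in practice they are learned by `model.fit`.

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Linear equation over the predictors, followed by the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 (e.g. left) if the probability reaches the threshold, else 0 (stayed)."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical coefficients for two predictors.
weights = [0.5, -1.0]
bias = 0.0

label_a = predict_label([2.0, 0.0], weights, bias)  # z = 1.0, probability above 0.5
label_b = predict_label([1.0, 2.0], weights, bias)  # z = -1.5, probability below 0.5
```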
- Decision Trees are a supervised Machine Learning technique in which the data is split according to a certain condition/parameter.
- Random Forest Classifier is a type of ensemble algorithm.
- It creates a set of decision trees, each from a randomly selected subset of the training set.
- It then combines votes from different decision trees to decide the final class of the test object.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
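The two ideas behind the ensemble, random subsets and voting, can be sketched by hand with individual decision trees. This is a simplified illustration rather than how `RandomForestClassifier` is implemented: the helper names `fit_bagged_trees` and `predict_by_vote` are hypothetical, the data is synthetic, and scikit-learn's own forest averages per-tree class probabilities instead of counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_bagged_trees(X, y, n_trees=25, seed=0):
    """Train each tree on a random sample (with replacement) of the training rows."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap row indices
        tree = DecisionTreeClassifier(
            max_features="sqrt",  # random feature subset considered at each split
            random_state=int(rng.integers(1_000_000)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_by_vote(trees, X):
    """Majority vote over the trees' 0/1 predictions (odd n_trees avoids ties)."""
    votes = np.stack([tree.predict(X) for tree in trees])  # shape: (n_trees, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)

# Synthetic binary classification data, used only for demonstration.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
trees = fit_bagged_trees(X_demo, y_demo)
predictions = predict_by_vote(trees, X_demo)
train_accuracy = float((predictions == y_demo).mean())
```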
- Input layer = 50 features (from the table fields)
- Hidden layers = 3 (dense, 500 neurons each, ReLU activation function)
- Output layer = 1 neuron (sigmoid activation function)
- Epochs = 100
- Batch size = 50
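The architecture listed above could be expressed in Keras roughly as follows. The layer sizes and activations come from the list; the optimizer (`adam`), loss (`binary_crossentropy`), and metric are assumptions, since they are not stated here, and `build_model` is a hypothetical helper name.

```python
import tensorflow as tf

def build_model(n_features: int = 50) -> tf.keras.Model:
    """Three dense ReLU hidden layers of 500 units and a single sigmoid output."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(500, activation="relu"),
        tf.keras.layers.Dense(500, activation="relu"),
        tf.keras.layers.Dense(500, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    # Assumed training configuration for a binary target.
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = build_model()
model.summary()
```

Training with the settings above would then be a call such as `model.fit(X_train, y_train, epochs=100, batch_size=50)`, and thresholding `model.predict(X_test)` at 0.5 would give the 0/1 class labels.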
Confusion Matrix: Logistic Regression (left), Random Forest (middle), and Deep Learning (right)
Method | Accuracy (%) |
---|---|
Logistic Regression | 89 |
Random Forest | 85 |
Deep Learning | 83 |
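Each accuracy figure is the share of correct predictions, which is also the diagonal of the confusion matrix divided by the total count. The sketch below shows that relationship with scikit-learn on a tiny set of made-up labels; the ten label values are hypothetical and are not taken from the project's results.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels (1 = left, 0 = stayed).
y_true = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)

# Accuracy equals correct predictions (the diagonal) over all predictions.
accuracy_from_cm = cm.trace() / cm.sum()

print(cm)
print(f"Accuracy: {accuracy:.0%}")
```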
Of the 3 Machine Learning methods compared, Logistic Regression has the highest accuracy (89%) and is the most suitable for predicting employee attrition.
- https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset
- https://matplotlib.org/3.5.0/plot_types/index.html
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
- https://seaborn.pydata.org/generated/seaborn.heatmap.html
- https://seaborn.pydata.org/generated/seaborn.countplot.html
- https://seaborn.pydata.org/generated/seaborn.kdeplot.html
- https://www.tensorflow.org/api_docs/python/tf/keras/Sequential
- https://www.tensorflow.org/guide/keras/train_and_evaluate
- https://www.tensorflow.org/api_docs/python/tf/keras/Model
- https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense
- https://towardsdatascience.com/a-practical-guide-to-implementing-a-random-forest-classifier-in-python-979988d8a263