This project predicts whether a person's salary is greater than 50K or less than or equal to 50K.
-
The Adult-Census-Income dataset is from Kaggle:
-
Dataset Features:
Salary, age, workclass, fnlwgt, education, education_num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country
-
The Salary feature is the label we want to predict.
-
Transforming Features
Some features contain string values, which cannot be processed directly by numpy. Convert these categorical values into numerical values (a sketch follows).
Method: Label Encoding, One-Hot Encoding
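A minimal sketch with pandas and scikit-learn, assuming the Kaggle CSV has been saved as adult.csv and using column names from the feature list above; which columns receive which encoding in this project is an assumption:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    df = pd.read_csv("adult.csv")

    # Label Encoding: map each category of a column to an integer code.
    le = LabelEncoder()
    df["sex"] = le.fit_transform(df["sex"].astype(str))

    # One-Hot Encoding: one binary indicator column per category.
    df = pd.get_dummies(df, columns=["workclass", "occupation"], prefix=["workclass", "occupation"])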
-
Imputing Missing Values
Some features contain missing values such as NaN, "?", or blank strings. Impute missing values with the feature mode (sketched below).
Method: Feature Mode, Median, and Mean
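A minimal mode-imputation sketch with pandas; treating "?" as a missing-value marker and the choice of columns (names follow the feature list above) are assumptions:

    import numpy as np
    import pandas as pd

    df = pd.read_csv("adult.csv")

    # Normalize the different missing-value markers ("?", blanks) to NaN.
    df = df.replace(["?", " ?", ""], np.nan)

    # Fill each chosen column's NaNs with that column's mode (most frequent value).
    for col in ["workclass", "occupation", "native-country"]:
        df[col] = df[col].fillna(df[col].mode()[0])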
-
Dealing with Imbalanced Data
The Salary label is imbalanced: roughly 76% of the instances are <= 50K and only about 24% are > 50K (see the check below).
Method: Bagging and Undersampling
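The imbalance can be verified directly from the label column; in the Kaggle CSV the label column is named income (using that name here is an assumption about this project's code):

    import pandas as pd

    df = pd.read_csv("adult.csv")

    # Class proportions of the Salary label: roughly 0.76 for <=50K and 0.24 for >50K.
    print(df["income"].value_counts(normalize=True))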
-
Random Sampling of Training Data
From the training dataset, apply undersampling: randomly select a subset of the majority-class examples equal in size to the minority class, producing a balanced training set (sketched below).
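A minimal undersampling sketch with pandas; the label column name, label values, and random seed are assumptions:

    import pandas as pd

    def undersample(df, label_col="income", majority="<=50K", minority=">50K", seed=42):
        # Separate the majority and minority classes.
        maj = df[df[label_col] == majority]
        mino = df[df[label_col] == minority]
        # Randomly draw as many majority examples as there are minority examples.
        maj_sample = maj.sample(n=len(mino), random_state=seed)
        # Concatenate and shuffle to obtain a balanced training set.
        return pd.concat([maj_sample, mino]).sample(frac=1, random_state=seed)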
-
Building Classifiers
Classification algorithms selected (see the sketch after this list):
K-Nearest Neighbors
Support Vector Machine
Logistic Regression
Random Forest
Naive Bayes
Decision Tree
AdaBoost Decision Tree
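A sketch of instantiating the seven classifiers with scikit-learn; the hyperparameters (mostly left at defaults) are assumptions, not the project's actual settings:

    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    classifiers = {
        "knn": KNeighborsClassifier(),
        "svm": SVC(),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(),
        "naive_bayes": GaussianNB(),
        "decision_tree": DecisionTreeClassifier(),
        "adaboost_tree": AdaBoostClassifier(),  # decision-tree (stump) base estimators by default
    }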
-
Ensemble Learning
Each instance of the training and test data is classified as 0 (salary less than or equal to 50K annually) or 1 (salary greater than 50K annually) by each learned classifier.
The final prediction is made by taking a majority vote among the predictions of these classifiers (sketched below).
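A minimal majority-vote sketch, assuming each fitted classifier outputs 0/1 predictions; scikit-learn's VotingClassifier with voting="hard" is an equivalent route:

    import numpy as np

    def majority_vote(fitted_classifiers, X_test):
        # Stack the 0/1 predictions of every classifier: shape (n_classifiers, n_samples).
        preds = np.array([clf.predict(X_test) for clf in fitted_classifiers])
        # An instance is labeled 1 when more than half of the classifiers vote 1
        # (with 7 classifiers there are no ties).
        return (preds.sum(axis=0) > len(fitted_classifiers) / 2).astype(int)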
-
Bagging Decision Tree:
Confusion matrix (rows: truth, columns: prediction; entries are fractions of all test instances):
                 Predicted <=50K   Predicted >50K
Truth <=50K           0.63              0.13
Truth >50K            0.05              0.19
Prediction accuracy for instances labeled <= 50K is 0.83.
Prediction accuracy for instances labeled > 50K is 0.80.
Overall test accuracy is 0.82.
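(For reference, each per-class accuracy is the diagonal entry of the confusion matrix divided by its row sum, and the overall accuracy is the sum of the diagonal: here 0.63 / (0.63 + 0.13) ≈ 0.83, 0.19 / (0.05 + 0.19) ≈ 0.80, and 0.63 + 0.19 = 0.82.)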
-
Random Forest
Confusion matrix:
                 Predicted <=50K   Predicted >50K
Truth <=50K           0.62              0.15
Truth >50K            0.04              0.20
Prediction accuracy for instances labeled <= 50K is 0.81.
Prediction accuracy for instances labeled > 50K is 0.85.
Overall test accuracy is 0.82.
-
Logistic Regression
Confusion matrix:
                 Predicted <=50K   Predicted >50K
Truth <=50K           0.64              0.12
Truth >50K            0.04              0.19
Prediction accuracy for instances labeled <= 50K is 0.84.
Prediction accuracy for instances labeled > 50K is 0.82.
Overall test accuracy is 0.83.
-
K-Nearest Neighbors
Confusion matrix:
                 Predicted <=50K   Predicted >50K
Truth <=50K           0.61              0.15
Truth >50K            0.05              0.19
Prediction accuracy for instances labeled <= 50K is 0.80.
Prediction accuracy for instances labeled > 50K is 0.80.
Overall test accuracy is 0.80.
-
Support Vector Machine (SVM)
Confusion matrix:
                 Predicted <=50K   Predicted >50K
Truth <=50K           0.63              0.13
Truth >50K            0.05              0.19
Prediction accuracy for instances labeled <= 50K is 0.83.
Prediction accuracy for instances labeled > 50K is 0.80.
Overall test accuracy is 0.82.
-
Naïve Bayes
Confusion matrix:
                 Predicted <=50K   Predicted >50K
Truth <=50K           0.61              0.15
Truth >50K            0.04              0.20
Prediction accuracy for instances labeled <= 50K is 0.80.
Prediction accuracy for instances labeled > 50K is 0.84.
Overall test accuracy is 0.81.
-
Ensemble: Majority Vote of the 7 Learned Classifiers
Confusion matrix:
                 Predicted <=50K   Predicted >50K
Truth <=50K           0.63              0.15
Truth >50K            0.04              0.20
Prediction accuracy for instances labeled <= 50K is 0.82.
Prediction accuracy for instances labeled > 50K is 0.84.
Overall test accuracy is 0.83.