A binary classification model that classifies credit card transactions as fraudulent or legitimate.
The data is already available here. The dataset contains transactions made with credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days and contains 492 fraudulent transactions out of 284,807 total, so the dataset is highly imbalanced.
Initially, the dataset contains 1,081 duplicate rows. After removing them, 283,726 observations remain, of which the positive (fraudulent) class makes up only about 0.167% and the negative class about 99.833%. To reduce the imbalance, both undersampling and oversampling were applied; oversampling performed well, while undersampling performed poorly.
The dataset is trained using Decision Tree, Random Forest and Gradient Boosting classification models. Each model's performance, training time and shortcomings are reported below.
The table below shows the precision, recall, F1-score and accuracy for the three models.
Model | Precision (fraud) | Precision (non-fraud) | Recall (fraud) | Recall (non-fraud) | F1-score (fraud) | F1-score (non-fraud) | Accuracy (training) | Accuracy (testing)
---|---|---|---|---|---|---|---|---
Decision Tree | 0.53 | 1.00 | 0.53 | 1.00 | 0.53 | 1.00 | 1.00 | 0.9993
Random Forest | 0.94 | 1.00 | 0.53 | 1.00 | 0.68 | 1.00 | 1.00 | 0.9997
Gradient Boosting | 0.03 | 1.00 | 0.80 | 0.99 | 0.07 | 0.99 | 0.9849 | 0.9850
The most important feature for training the Decision Tree, Random Forest and Gradient Boosting models is `V14`.
The Random Forest grid search fits 2 folds for each of 12 candidates, totalling 24 fits. On the other hand, Gradient Boosting fits 2 folds for each of 9 candidates, totalling 18 fits.
Random Forest performed best with `max_depth = 30` and `n_estimators = 75`, where `n_estimators` was searched over 25, 50, 75 and `max_depth` over 10, 20, 30, 40. On the other hand, Gradient Boosting performed best with `max_depth = 4` and `n_estimators = 30`, where `n_estimators` was searched over 20, 25, 30 and `max_depth` over 2, 3, 4.
The mean test score of Gradient Boosting is lower than that of Random Forest, but its mean fit time is higher.
From the image we can see there is a positive correlation between training time and the number of estimators for the Random Forest classifier.