Early Diabetes Prediction based on patients' symptoms

Logistic Regression, Decision Tree, & Random Forest Classification Algorithms

This project aims to train machine learning classification models to predict whether a patient is at risk of developing diabetes.
Key indicators (symptoms) correlated with diabetes will be identified.
This project will be useful to clinicians, as it will help them understand how to reduce False Negative cases, that is, patients with diabetes who may otherwise go undiagnosed.

Data Source:

Data was collected from the UCI Machine Learning Repository - https://archive.ics.uci.edu/dataset/529/early+stage+diabetes+risk+prediction+dataset

PART 1: Data Collection & Cleaning

PART 2: Analysis of data using Visualization and Statistics

Questions to answer:

1. Is obesity related to diabetes status, or is it independent?

   Chi2ContingencyResult(statistic=2.3274739583333344, pvalue=0.12710799319896815, dof=1, expected_freq=array([[166.15384615,  33.84615385],
   [265.84615385,  54.15384615]]))

The p-value of 0.127 is above the usual 0.05 threshold, so there is no significant evidence of a relationship between obesity status and diabetes.
The expected-frequency array shows the counts that would be expected if the two variables were independent. Large deviations of the observed crosstab counts from these values indicate dependence.
Here the expected counts are close to the observed crosstab counts, so we cannot reject the hypothesis that obesity and class are independent.

2. Is gender related to diabetes status?

   Chi2ContingencyResult(statistic=103.03685927972558, pvalue=3.289703730553317e-24, dof=1, expected_freq=array([[126.15384615,  73.84615385],
   [201.84615385, 118.15384615]]))

The p-value is 3.29e-24, meaning there is significant evidence of a relationship between diabetes status and gender.
The expected (independent) values differ substantially from the crosstab values, which shows that diabetes status and gender are dependent (there is a relationship).
There are significantly more females with diabetes (173) than females without diabetes.
There are significantly more males without diabetes (181) than males with diabetes.
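To make the comparison between observed and expected counts concrete, here is a minimal sketch that recomputes the expected frequencies and the chi-square statistic by hand. The observed table is an assumption: it is reconstructed from the two counts quoted above (173 and 181) and from the row and column totals implied by the expected-frequency array, with rows taken as diabetes class (negative, positive) and columns as gender (male, female).

```python
# Observed counts reconstructed from the figures quoted above (an assumption):
# rows = diabetes class (negative, positive), columns = gender (male, female).
observed = [
    [181, 19],   # negative: 181 males, 200 - 181 = 19 females
    [147, 173],  # positive: 320 - 173 = 147 males, 173 females
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic with Yates' continuity correction, which
# scipy.stats.chi2_contingency applies by default to 2x2 tables.
statistic = sum(
    (abs(o - e) - 0.5) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

for row in expected:
    print([round(value, 2) for value in row])
print(f"chi-square statistic: {statistic:.2f}")
```

If the reconstruction is right, the expected counts and the statistic should agree with the `Chi2ContingencyResult` shown for this question.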

3. Is polyuria related to diabetes status?

   Chi2ContingencyResult(statistic=227.86583895496773, pvalue=1.7409117803442155e-51, dof=1, expected_freq=array([[100.76923077,  99.23076923],
   [161.23076923, 158.76923077]]))

The p-value of 1.74e-51 shows strong evidence of a relationship between polyuria status and diabetes.

4. Is gender related to polyuria status?

   Chi2ContingencyResult(statistic=36.49184228561174, pvalue=1.5330652930649977e-09, dof=1, expected_freq=array([[165.26153846,  96.73846154],
   [162.73846154,  95.26153846]]))

The p-value of 1.53e-09 shows strong evidence of a relationship between gender and polyuria status.
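The p-values quoted for the four chi-square tests can be cross-checked from the reported statistics. With one degree of freedom, the upper-tail probability of a chi-square statistic `x` equals `erfc(sqrt(x / 2))`. The sketch below uses only the standard library; the helper name `chi2_pvalue_1dof`, the dictionary labels, and the 0.05 significance level are illustrative assumptions, not part of the original notebook.

```python
import math


def chi2_pvalue_1dof(statistic: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Statistics copied from the Chi2ContingencyResult outputs for questions 1-4.
statistics = {
    "Q1 obesity": 2.3274739583333344,
    "Q2 gender": 103.03685927972558,
    "Q3 polyuria": 227.86583895496773,
    "Q4 gender vs polyuria": 36.49184228561174,
}

ALPHA = 0.05  # conventional significance level (an assumption)

p_values = {name: chi2_pvalue_1dof(stat) for name, stat in statistics.items()}

for name, p in p_values.items():
    verdict = "reject independence" if p < ALPHA else "fail to reject independence"
    print(f"{name}: p = {p:.3g} -> {verdict}")
```

Only the obesity test should fall above the threshold, which is consistent with the interpretation given for question 1.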

5. Is there a relationship between age and diabetic status?

The box plots do not show a large difference between the means and medians of age for patients with and without diabetes; the two distributions overlap substantially.

Correlation heatmap

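Gender is encoded as isfemale, so a positive correlation indicates an association with female patients and a negative correlation an association with male patients.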

Diabetes is positively correlated with being female (r = 0.449). A correlation coefficient measures the strength of an association, not a percentage increase in likelihood.
Polyuria is positively correlated with diabetes (r = 0.666).
Polydipsia is positively correlated with diabetes (r = 0.649).
Sudden weight loss is positively correlated with diabetes (r = 0.437).
Partial paresis is positively correlated with diabetes (r = 0.432).
Being female is negatively correlated with alopecia (r = -0.328).
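A correlation coefficient and a relative increase in likelihood are different quantities. The sketch below illustrates this for gender and diabetes. It assumes a 2x2 table reconstructed from the counts quoted in question 2 (173 females with diabetes, 181 males without) and the totals implied by that test's expected frequencies; the variable names are illustrative.

```python
import math

# Reconstructed gender-by-class counts (an assumption, see the lead-in).
neg_male, neg_female = 181, 19
pos_male, pos_female = 147, 173

# Phi coefficient: the Pearson correlation between two binary variables,
# here "is female" and "has diabetes".
numerator = neg_male * pos_female - neg_female * pos_male
denominator = math.sqrt(
    (neg_male + neg_female)
    * (pos_male + pos_female)
    * (neg_male + pos_male)
    * (neg_female + pos_female)
)
phi = numerator / denominator

# Share of each gender with a positive diabetes label, and their ratio.
rate_female = pos_female / (pos_female + neg_female)
rate_male = pos_male / (pos_male + neg_male)
risk_ratio = rate_female / rate_male

print(f"phi coefficient: {phi:.3f}")
print(f"positive rate, female: {rate_female:.3f}")
print(f"positive rate, male:   {rate_male:.3f}")
print(f"risk ratio (female / male): {risk_ratio:.2f}")
```

Under these assumed counts the phi coefficient comes out close to the 0.449 shown in the heatmap, while the share of positive cases among females is roughly twice that among males, so the two numbers should not be read interchangeably.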

Top 5 important features for predicting diabetes based on the decision tree model are:

Polydipsia
gender (Female)
Polyuria
alopecia
age

   index  feature             importance
   3      polydipsia          0.424208
   1      isfemale            0.131796
   2      polyuria            0.115480
   14     alopecia            0.074900
   0      age                 0.066944
   9      itching             0.049823
   11     delayed healing     0.042698
   6      polyphagia          0.033854
   10     irritability        0.022087
   7      genital thrush      0.016927
   13     muscle stiffness    0.010640
   8      visual blurring     0.006771
   12     partial paresis     0.001938
   4      sudden weight loss  0.001935
   5      weakness            0.000000
   15     obesity             0.000000
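The ranking can be reproduced directly from the importance values listed above. This is a minimal sketch using a plain dictionary rather than the fitted model; the variable names are illustrative.

```python
# Importances copied from the table above (feature name -> importance).
importances = {
    "age": 0.066944,
    "isfemale": 0.131796,
    "polyuria": 0.115480,
    "polydipsia": 0.424208,
    "sudden weight loss": 0.001935,
    "weakness": 0.000000,
    "polyphagia": 0.033854,
    "genital thrush": 0.016927,
    "visual blurring": 0.006771,
    "itching": 0.049823,
    "irritability": 0.022087,
    "delayed healing": 0.042698,
    "partial paresis": 0.001938,
    "muscle stiffness": 0.010640,
    "alopecia": 0.074900,
    "obesity": 0.000000,
}

# Sort features from most to least important and keep the first five.
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
top5 = [name for name, _ in ranked[:5]]
top5_share = sum(score for _, score in ranked[:5])

print(top5)
print(f"top-5 share of total importance: {top5_share:.3f}")
```

Sorted this way, alopecia ranks just ahead of age, and the five leading features account for most of the total importance, with polydipsia alone contributing over 40%.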

DummyClassifier model to establish a baseline

26 False Positive patients - patients who do not have diabetes but are predicted to have it - an issue because they would have to go through additional tests and anxiety.
24 False Negative patients - patients who have diabetes but were predicted not to have it - an issue because the disease is left undiagnosed, which is dangerous for the patient's health.
The accuracy of the Dummy classification model is 52%, which is not high enough, so more models are tried.

Logistic regression model

The logistic model has fewer False Positives (5) and False Negatives (5) than the dummy model.
The accuracy score has gone up from 52% in the dummy model to 90% in the logistic model.
Precision, recall and f1-scores have all increased as well.
This means that the logistic model is a better model for predicting diabetes than the dummy classifier model.

Decision Tree model

The Decision Tree model produced 4 False Positives and zero False Negatives.
The accuracy score of the Decision Tree model is the highest at 96%, compared to the logistic regression model at 90%.

Random Forest Model

The Random Forest model produced 1 False Negative and 4 False Positive cases.
The Random Forest classification model has an accuracy of 95%.
This is slightly lower than the Decision Tree classification model's accuracy of 96%.
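The reported error counts can be turned into accuracy figures as a consistency check. The test-set size is not stated here, so the sketch below assumes a hypothetical hold-out of 104 patients (20% of a 520-row dataset); `N_TEST` and the helper name `accuracy` are illustrative.

```python
# Hypothetical test-set size: NOT stated in this README. 104 is a guess that
# corresponds to a 20% hold-out of 520 rows.
N_TEST = 104

# (false positives, false negatives) as reported for each model.
errors = {
    "dummy": (26, 24),
    "logistic regression": (5, 5),
    "decision tree": (4, 0),
    "random forest": (4, 1),
}


def accuracy(false_pos: int, false_neg: int, n_test: int = N_TEST) -> float:
    """Fraction of correct predictions given the two error counts."""
    return (n_test - false_pos - false_neg) / n_test


accuracies = {model: accuracy(fp, fn) for model, (fp, fn) in errors.items()}

for model, acc in accuracies.items():
    print(f"{model}: {acc:.1%}")
```

Under that assumption the figures round to 52%, 90%, 96% and 95%, so the decision tree comes out about one percentage point ahead of the random forest. A different test-set size would shift these values.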

The overall best model in this analysis is the Decision Tree at 96% prediction accuracy.

Reference:

Early stage diabetes risk prediction dataset. (2020). UCI Machine Learning Repository. https://doi.org/10.24432/C5VG8H
