This project implements a diabetes classifier that uses the XGBoost library to train the model.
After training, the model predicts whether a person has diabetes or not.
The dataset contains more than 70,000 records collected from patients.
The dataset has 22 columns: Diabetes_binary, HighBP, High Cholesterol, Cholesterol Check, BMI, Smoker, Stroke, HeartDiseaseorAttack, Physical Activity, Fruits, Veggies, Heavy Alcohol Consumption, Any Health Care, No Doctor because of Cost, General Health, Mental Health, Physical Health, Difficulty Walking, Sex, Age, Education, Income.
XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient-boosting framework.
In prediction problems involving unstructured data (images, text, etc.), artificial neural networks tend to outperform other algorithms and frameworks.
However, for small-to-medium structured (tabular) data, decision-tree-based algorithms are widely considered among the best-performing options.
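
To make the gradient-boosting idea concrete, here is a minimal sketch in plain Python. It is not XGBoost's implementation and not this project's code: it boosts one-split decision stumps on a single feature with squared-error loss, and the toy BMI values, labels, and all function names are assumptions made up for illustration.

```python
# Toy gradient boosting with decision stumps (illustrative only;
# XGBoost itself uses regularised trees, second-order gradients, etc.).

def fit_stump(xs, residuals):
    """Pick the threshold on one feature that minimises squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not right:  # a split must leave samples on both sides
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all feature values identical: predict the mean
        mean = sum(residuals) / len(residuals)
        return (xs[0], mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosted(xs, ys, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict_boosted(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Hypothetical toy data: BMI values and a 0/1 diabetes label.
bmi = [18, 22, 25, 27, 31, 35, 38, 41]
label = [0, 0, 0, 0, 1, 1, 1, 1]

model = fit_boosted(bmi, label)
print(predict_boosted(model, 20), predict_boosted(model, 40))
```

Each round adds a small correction in the direction that reduces the remaining error, which is the core idea that XGBoost builds on with full trees and many features.
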
The project has 6 steps:
- Import libraries
- Get the data
- Preprocessing: load the dataset, rename columns, handle Null values, convert categorical features to numerical features with OneHotEncoding, and normalize with Min-Max.
- Build the XGBoost classifier model: create an XGBClassifier, train the model, print accuracy, plot confusion_matrix, and plot the precision-recall curve.
- Set hyperparameters (use GridSearchCV)
- Visualization
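
The preprocessing and evaluation steps listed above can be sketched in plain Python. This is only an illustration of what Min-Max scaling, one-hot encoding, a confusion matrix, and precision/recall compute; the project itself relies on library implementations, and the sample values and function names below are hypothetical.

```python
# Plain-Python illustration of the preprocessing and evaluation concepts.
# All data and helper names are made up for this sketch.

def min_max_scale(values):
    """Rescale numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Turn a categorical column into one 0/1 column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels (rows = true class)."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return [[tn, fp], [fn, tp]]


def precision_recall(y_true, y_pred):
    (tn, fp), (fn, tp) = confusion_matrix(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical sample values.
scaled_bmi = min_max_scale([20, 25, 30, 40])
categories, encoded = one_hot(["college", "school", "college"])

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
matrix = confusion_matrix(y_true, y_pred)
precision, recall = precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(scaled_bmi, categories, encoded)
print(matrix, precision, recall, accuracy)
```

The precision-recall curve plotted in the project comes from repeating the precision/recall computation at many probability thresholds instead of the single set of hard predictions used here.
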
Check the full description (in Persian)
If you have any questions, feel free to ask me:
📩 arminzolfagharid@gmail.com