The dataset for this project is from Kaggle.
- Step 1: Install and import necessary libraries
- Step 2: Reading and exploring data
- Step 3: Data cleaning and preprocessing
- Step 4: Descriptive Statistics
- Step 5: Data Analysis and visualization
- Step 6: Predictive models
- pandas
- matplotlib
- seaborn
- plotly.express and plotly.graph_objects
- numpy
- scipy.stats
- sklearn: linear regression, logistic regression, train_test_split, decision tree classifier
We need to install and import the libraries needed for our data analysis.
- Then we read the CSV file downloaded from Kaggle into the Jupyter notebook using `pd.read_csv`.
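
As a minimal sketch of the loading step: the snippet below reads a tiny inline sample instead of the real download so that it runs on its own. The sample rows are made up for illustration, and both the helper name `load_heart_data` and the file name `heart.csv` are assumptions, not something taken from the notebook.

```python
import io

import pandas as pd

# Made-up rows standing in for the Kaggle file; the header matches the attribute list.
SAMPLE_CSV = """Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
"""


def load_heart_data(source):
    """Read the heart dataset from a file path or a file-like object."""
    return pd.read_csv(source)


# With the real download this would be something like load_heart_data("heart.csv"),
# where the file name is an assumption.
df = load_heart_data(io.StringIO(SAMPLE_CSV))
print(df.head())
```
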
The attributes of the dataset are as follows:
- Age: age of the patient [years]
- Sex: sex of the patient [M: Male, F: Female]
- ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
- RestingBP: resting blood pressure [mm Hg]
- Cholesterol: serum cholesterol [mg/dl]
- FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
- RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
- MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
- ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
- Oldpeak: oldpeak = ST [Numeric value measured in depression]
- ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
- HeartDisease: output class [1: heart disease, 0: Normal]
- Next we will look at the shape and information of our dataset.
- Our dataset consists of 918 rows and 12 columns.
- Here is what our data info looks like:
- Moving on, we will check for null values and duplicated values.
- I found some inconsistencies in the `serum_cholesterol` feature, which has a value of 0 in 172 rows. That isn't possible, since serum cholesterol can't be 0.
- The same is true for `RestingBP`, but only 1 row has a 0 value.
- Since serum cholesterol is significant to our analysis, we can drop the rows with 0 values. We do the same for `RestingBP`.
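
The inspection and cleaning steps above could look roughly like this. It is a sketch on made-up rows, and it assumes the columns are named `Cholesterol` and `RestingBP` as in the attribute list; the variable names `zero_counts` and `clean` are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only; the last two rows are deliberate duplicates.
df = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54, 54],
    "RestingBP": [140, 160, 130, 0, 150, 150],
    "Cholesterol": [289, 180, 0, 214, 195, 195],
    "HeartDisease": [0, 1, 0, 1, 0, 0],
})

print(df.shape)               # (rows, columns)
df.info()                     # dtypes and non-null counts
print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # number of fully duplicated rows

# Count the physiologically impossible zeros before dropping them.
zero_counts = (df[["Cholesterol", "RestingBP"]] == 0).sum()
print(zero_counts)

# Keep only rows where both measurements are strictly positive.
clean = df[(df["Cholesterol"] > 0) & (df["RestingBP"] > 0)].reset_index(drop=True)
print(clean.shape)
```

Dropping rows is the approach described above; imputing the zeros (for example with the median) would be an alternative, but that is not what this project does.
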
Now that our dataset is clean, we will get into some statistical analysis.
- Here are the summary statistics of our dataset:
- From the above image we can deduce that the average age in our dataset is 53, with a median of 54 and a standard deviation of 9.5. We can also see that the youngest person in our dataset is 28 and the oldest is 77.
- Similarly, we can look at the other features of our dataset and note their summary statistics.
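
A summary like the one above typically comes from `DataFrame.describe()`. The sketch below uses made-up values chosen only to keep the example small; they are not the project's data.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "Age": [28, 45, 54, 54, 60, 77],
    "Oldpeak": [0.0, 0.0, 1.0, 1.5, 2.0, 2.5],
})

summary = df.describe()  # count, mean, std, min, quartiles, max per numeric column
print(summary)

# Individual statistics can be read off by row label; "50%" is the median.
age_mean = summary.loc["mean", "Age"]
age_median = summary.loc["50%", "Age"]
print(age_mean, age_median)
```
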
- Now let's look at the correlation between our features:
- The target variable has its highest correlation with `Oldpeak` and its lowest with `MaxHR`, which is actually negative. `Cholesterol` and `Age` have the next highest correlations with the target variable.
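
One way to rank features by their correlation with the target is sketched below. It assumes Pearson correlation (the pandas default) on the numeric columns only; the rows are made up, and `target_corr` is a hypothetical name.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Age": [40, 49, 37, 58, 62, 65],
    "Sex": ["M", "F", "M", "M", "F", "M"],
    "MaxHR": [172, 156, 168, 120, 110, 105],
    "Oldpeak": [0.0, 0.5, 0.0, 2.0, 1.5, 2.5],
    "HeartDisease": [0, 0, 0, 1, 1, 1],
})

# Correlation is only defined for numeric columns, so text columns are left out.
numeric = df.select_dtypes(include="number")

# Correlation of every numeric feature with the target, highest first.
target_corr = (
    numeric.corr()["HeartDisease"]
    .drop("HeartDisease")
    .sort_values(ascending=False)
)
print(target_corr)
```
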
Now let's look at some visualizations from our dataset.
- I first divided my features into categorical and numerical, then plotted them as count plots and distribution plots.
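
A simple way to make that split is by column dtype, as in the sketch below. The rows are made up, and the list names `categorical_cols` and `numerical_cols` are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Age": [40, 49, 37],
    "Sex": ["M", "F", "M"],
    "ChestPainType": ["ATA", "NAP", "ASY"],
    "Oldpeak": [0.0, 1.0, 0.0],
    "HeartDisease": [0, 1, 0],
})

# Numeric columns are candidates for distribution plots.
numerical_cols = df.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical and suits count plots.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numerical_cols)
print(categorical_cols)
```

Note that a dtype-based split puts 0/1 coded columns such as `FastingBS` and `HeartDisease` with the numerical features; they may need to be moved to the categorical list by hand.
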
- Here is a distribution plot for `Age`:
- Here we see a slightly left-skewed distribution: the mean age (53) sits just below the median (54).
- The majority of people in our dataset fall in the 60-70 age group.
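
Both observations can be checked numerically rather than by eye. The sketch below computes the sample skewness and the number of people per ten-year age group; the ages are made up and the bin edges are an assumption.

```python
import pandas as pd

# Made-up ages for illustration only.
ages = pd.Series([28, 41, 47, 52, 54, 55, 58, 60, 62, 63], name="Age")

# Positive skewness means a longer right tail, negative a longer left tail.
skewness = ages.skew()
print(skewness)

# Ten-year age groups: [20, 30), [30, 40), ..., [70, 80).
bins = [20, 30, 40, 50, 60, 70, 80]
age_groups = pd.cut(ages, bins=bins, right=False)
counts = age_groups.value_counts().sort_index()
print(counts)
```
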
- Let's look at a distribution plot for `Oldpeak`:
- `Oldpeak` (the ST depression induced by exercise relative to rest) is a sign of damage to the heart muscle: the higher the value, the more damage.
- For our dataset, the peak at 0 indicates that more than 300 people do not have heart damage, and around 80 have moderate damage with a value of 2.
- We can use this to identify those who are at risk for heart disease.
- Let's look at the serum cholesterol distribution.
- Research shows that high cholesterol is a risk factor for heart disease.
- The normal range for serum cholesterol is less than 200 mg/dl, borderline high is 200 to 239 mg/dl, and high is 240 mg/dl and above (Source).
- In our distribution plot, the peak is at around 200 mg/dl, which sits at the upper edge of the normal range.
- Those with serum cholesterol above 200 mg/dl, a small number of people according to our distribution plot, are at risk for heart disease.
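
The ranges quoted above translate directly into a small helper. This is only a sketch: the function name `cholesterol_category` is hypothetical, the readings are made up, and a value of exactly 240 mg/dl is assumed to count as high.

```python
from collections import Counter


def cholesterol_category(mg_dl):
    """Map a serum cholesterol reading in mg/dl to the ranges quoted above."""
    if mg_dl < 200:
        return "normal"
    if mg_dl < 240:
        return "borderline high"
    return "high"  # assumption: exactly 240 mg/dl is treated as high


# Made-up readings for illustration only.
readings = [180, 199, 200, 215, 239, 240, 289]
category_counts = Counter(cholesterol_category(r) for r in readings)
print(category_counts)
```

Counting people per category this way gives a direct check on how many fall above the 200 mg/dl mark.
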
- Let's look at how many people in our dataset have heart disease and how many don't.
- 52.28% don't have heart disease and 47.72% do.
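
Percentages like these can be computed with `value_counts(normalize=True)`. The target values below are made up for illustration, and `share` is a hypothetical name.

```python
import pandas as pd

# Made-up target values for illustration only (0: normal, 1: heart disease).
target = pd.Series([0, 0, 0, 1, 1], name="HeartDisease")

# Share of each class as a percentage, rounded to two decimals.
share = (target.value_counts(normalize=True) * 100).round(2)
print(share)
```
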
- Next, let's look at the gender distribution between people with heart disease and those without.
- As we can see, there are more males than females among people with heart disease, which suggests that men are at greater risk of heart disease.
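
Because raw counts also depend on how many men and women are in the sample, it can help to put the within-group rate next to the counts. The sketch below does this with `pd.crosstab` on made-up rows.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Sex": ["M", "M", "M", "M", "M", "M", "F", "F"],
    "HeartDisease": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Raw counts per sex and outcome.
counts = pd.crosstab(df["Sex"], df["HeartDisease"])
print(counts)

# Share of each outcome within each sex (each row sums to 1).
rates = pd.crosstab(df["Sex"], df["HeartDisease"], normalize="index")
print(rates)
```

In this toy sample there are three times as many males with heart disease as females, yet the rate is 50% in both groups, so the rate table is the fairer basis for comparing risk.
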
- Lastly, let's look at some box plots for serum cholesterol and old peak.
- In the first box plot, for cholesterol, people with no heart disease have a lower median than those with heart disease, and it's evident that females with heart disease have the highest median.
- As for old peak, people with heart disease show higher variability than those with no heart disease. Females have a lower median than males.
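
The numbers behind box plots like these are group medians and interquartile ranges, which can be computed directly. The sketch below uses made-up rows, and the names `medians` and `iqr` are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Sex": ["M", "M", "F", "F", "M", "M", "F", "F"],
    "HeartDisease": [0, 0, 0, 0, 1, 1, 1, 1],
    "Cholesterol": [200, 220, 190, 210, 230, 250, 260, 280],
    "Oldpeak": [0.0, 0.2, 0.0, 0.4, 1.0, 3.0, 0.5, 1.5],
})

# Median per outcome and sex: the centre line of each box.
medians = df.groupby(["HeartDisease", "Sex"])[["Cholesterol", "Oldpeak"]].median()
print(medians)

# Interquartile range of Oldpeak per outcome: the height of each box.
quartiles = df.groupby("HeartDisease")["Oldpeak"].quantile([0.25, 0.75]).unstack()
iqr = quartiles[0.75] - quartiles[0.25]
print(iqr)
```
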
After analyzing our data, we can conclude the following:
- Males have a higher risk of heart disease.
- Cholesterol is a risk factor, though its correlation with the target is weak, with a correlation coefficient of 0.1.
- Old peak is also strongly correlated with the target, with a correlation coefficient of 0.5.