Commit
Merge pull request #91 from UBC-MDS/remove-hardcoded-values
BChangs99 authored Dec 16, 2024
2 parents ca42db5 + eb13f54 commit 92f52d6
Showing 17 changed files with 137 additions and 117 deletions.
53 changes: 24 additions & 29 deletions index.html

Large diffs are not rendered by default.

66 changes: 47 additions & 19 deletions report/content/_analysis.qmd
@@ -17,19 +17,22 @@ if 'Unnamed: 0' in dst_cv_results.columns:
dt_test_accuracy_cv = round(dst_cv_results.loc[dst_cv_results[''] == 'test_accuracy', 'mean'].values[0] * 100, 2)
dt_train_accuracy_cv = round(dst_cv_results.loc[dst_cv_results[''] == 'train_accuracy', 'mean'].values[0] * 100, 2)
dt_test_recall_cv = round(dst_cv_results.loc[dst_cv_results[''] == 'test_recall', 'mean'].values[0] * 100, 2)
dt_test_precision_cv = round(dst_cv_results.loc[dst_cv_results[''] == 'test_precision', 'mean'].values[0] * 100, 2)
dt_test_f1_cv = round(dst_cv_results.loc[dst_cv_results[''] == 'test_f1', 'mean'].values[0] * 100, 2)
Markdown(dst_cv_results.to_markdown(index = False))
```
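The `.loc` lookup pattern used above can be exercised against a tiny stand-in for the CV results table; this is a minimal sketch with made-up values, assuming the index column has already been renamed to `''` as in the chunk above:

```python
import pandas as pd

# Hypothetical stand-in for dst_cv_results (values are illustrative only)
cv = pd.DataFrame({
    '': ['test_accuracy', 'train_accuracy'],
    'mean': [0.74, 1.0],
    'std': [0.059, 0.0],
})

# Same pattern as the report: filter by metric name, take the 'mean' cell,
# scale to a percentage, and round to two decimals
test_acc = round(cv.loc[cv[''] == 'test_accuracy', 'mean'].values[0] * 100, 2)
print(test_acc)  # 74.0
```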

-Cross-validation results show a high train accuracy of 1, indicating perfect fit on the training data, with a test accuracy of 0.74,
-suggesting good generalization. The model exhibits a precision of 0.709, recall of 0.737, and an F1-score of 0.718 on the test set,
-with relatively low variability (standard deviations of 0.059, 0.069, and 0.079, respectively).
-These results indicate a reasonable trade-off between precision and recall.
+Cross-validation results show a high train accuracy of 1, indicating perfect fit on the training data, with
+a test accuracy of `{python} dt_test_accuracy_cv`%, suggesting good generalization. The model exhibits a precision of `{python} dt_test_precision_cv`%,
+recall of `{python} dt_test_recall_cv`%, and an F1-score of `{python} dt_test_f1_cv`% on the test set,
+with relatively low variability. These results indicate a reasonable trade-off between precision and recall.

![Confusion Matrix of Decision Tree Model](../results/tables/decision_tree/decision_tree_confusion_matrix.png){#fig-conf-m-dt}

-Confusion matrix reveals a higher number of false positives (29) compared to false negatives (25),
-indicating some misclassification of the negative class (0).
+The confusion matrix reveals a higher number of false positives compared to false negatives,
+indicating some misclassification of the negative class (target = 0).

### Decision Tree: Test Results

@@ -38,14 +41,20 @@ indicating some misclassification of the negative class (0).
#| tbl-cap: "Test results of Decision Tree Model"
dt_precision_class_0 = round(dst_model_results.loc[dst_model_results[''] == '0', 'precision'].values[0] * 100, 2)
dt_recall_class_0 = round(dst_model_results.loc[dst_model_results[''] == '0', 'recall'].values[0] * 100, 2)
dt_f1_class_0 = round(dst_model_results.loc[dst_model_results[''] == '0', 'f1-score'].values[0] * 100, 2)
dt_precision_class_1 = round(dst_model_results.loc[dst_model_results[''] == '1', 'precision'].values[0] * 100, 2)
dt_recall_class_1 = round(dst_model_results.loc[dst_model_results[''] == '1', 'recall'].values[0] * 100, 2)
dt_f1_class_1 = round(dst_model_results.loc[dst_model_results[''] == '1', 'f1-score'].values[0] * 100, 2)
dt_accuracy_test = round(dst_model_results.loc[dst_model_results[''] == 'accuracy', 'precision'].values[0] * 100, 2)
Markdown(dst_model_results.to_markdown(index = False))
```

-Test results show balanced performance across classes, with precision for class 0 (0.727) and class 1 (0.771),
-and F1-scores of 0.777 and 0.701 for class 0 and 1, respectively. The model's overall accuracy is 0.744.
+Test results show balanced performance across classes, with precision for class 0 (`{python} dt_precision_class_0`%)
+and class 1 (`{python} dt_precision_class_1`%), and F1-scores of `{python} dt_f1_class_0`% and `{python} dt_f1_class_1`%
+for class 0 and 1, respectively. The model's overall accuracy is `{python} dt_accuracy_test`%.

### Logistic Regression: Cross-Validation Results
```{python}
@@ -58,28 +67,40 @@ if 'Unnamed: 0' in lg_cv_results.columns:
lg_test_accuracy_cv = round(lg_cv_results.loc[lg_cv_results[''] == 'test_accuracy', 'mean'].values[0] * 100, 2)
lg_train_accuracy_cv = round(lg_cv_results.loc[lg_cv_results[''] == 'train_accuracy', 'mean'].values[0] * 100, 2)
lg_test_recall_cv = round(lg_cv_results.loc[lg_cv_results[''] == 'test_recall', 'mean'].values[0] * 100, 2)
lg_test_precision_cv = round(lg_cv_results.loc[lg_cv_results[''] == 'test_precision', 'mean'].values[0] * 100, 2)
lg_test_f1_cv = round(lg_cv_results.loc[lg_cv_results[''] == 'test_f1', 'mean'].values[0] * 100, 2)
Markdown(lg_cv_results.to_markdown(index = False))
```

-Cross-validation results show strong performance with a test accuracy of 0.826 and train accuracy of 0.873.
-The model's test precision (0.822), recall (0.8), and F1-score (0.809) demonstrate a balance between precision and recall,
-with a slightly higher train performance. The standard deviations indicate minimal variability in performance across the folds.
+Cross-validation results show strong performance with a test accuracy of `{python} lg_test_accuracy_cv`%
+and train accuracy of `{python} lg_train_accuracy_cv`%.
+The model's test precision (`{python} lg_test_precision_cv`%), recall (`{python} lg_test_recall_cv`%), and F1-score (`{python} lg_test_f1_cv`%)
+demonstrate a balance between precision and recall, with a slightly higher train performance.
+The standard deviations indicate minimal variability in performance across the folds.

![Confusion Matrix of Logistic Regression Model](../results/tables/logistic_regression/logistic_regression_confusion_matrix.png){#fig-conf-m-lg}

-Confusion matrix indicates good performance with fewer false positives (17) compared to false negatives (19) for the positive class.
+The confusion matrix indicates good performance with fewer false positives compared to false negatives for the positive class.
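As a reminder of the terms being compared here, false positives and false negatives can be counted directly from paired labels; a minimal sketch with made-up predictions (1 = presence of heart disease):

```python
# Hypothetical true labels and model predictions, for illustration only
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

# A false positive predicts disease (1) for a healthy patient (0);
# a false negative predicts healthy (0) for a diseased patient (1).
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
print(fp, fn)  # 1 1
```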

### Logistic Regression: Coefficients

```{python}
lg_coeff = pd.read_csv("../results/tables/logistic_regression/logreg_coefficients.csv")
chest_pain_type_asymptomatic_coef = round(lg_coeff.loc[lg_coeff['Feature'] == 'onehotencoder__chest_pain_type_asymptomatic', 'Coefficient'].values[0], 2)
num_of_vessels_0_coef = round(lg_coeff.loc[lg_coeff['Feature'] == 'onehotencoder__num_of_vessels_0.0', 'Coefficient'].values[0], 2)
```

![Coefficients of Logistic Regression Model](../results/tables/logistic_regression/logreg_coefficients.png){#fig-coef-lg}

The coefficients of the Logistic Regression model reflect the impact of each feature on the likelihood of a positive outcome,
with positive coefficients indicating an increased likelihood and negative coefficients indicating a decreased likelihood.
-For example, features like `chest_pain_type_asymptomatic` (1.18), `thalassemia_reversable defect` (0.97), and `num_of_vessels_2.0` (0.91)
-are positively associated with the likelihood of heart disease, meaning that higher values of these features increase the chances of
-a positive outcome. In contrast, features such as `num_of_vessels_0.0` (-1.26), `chest_pain_type_non-anginal pain` (-0.93),
-and `thalassemia_normal` (-0.58) show a negative relationship, meaning they decrease the likelihood of heart disease. The coefficients for scaled features like `age`, `cholesterol`, and `resting_blood_pressure` (e.g., 0.34 for `cholesterol`) indicate their contribution to the prediction, with the magnitude of the coefficient reflecting the strength of their influence on the model's outcome. Features with larger absolute coefficient values, such as `num_of_vessels_0.0` and `chest_pain_type_asymptomatic`, have a more significant impact on the model's prediction.
+For example, features like `chest_pain_type_asymptomatic` (`{python} chest_pain_type_asymptomatic_coef`)
+are positively associated with the likelihood of heart disease, meaning that higher values of these features increase the chances of
+a positive outcome (presence of heart disease). In contrast, features such as `num_of_vessels_0.0` (`{python} num_of_vessels_0_coef`)
+show a negative relationship, meaning they decrease the likelihood of heart disease.
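Because logistic regression is linear in log-odds, a coefficient can be read as an odds ratio via exponentiation. A short sketch using 1.18, the value reported for `chest_pain_type_asymptomatic` in the earlier hardcoded text:

```python
import math

coef = 1.18  # coefficient for chest_pain_type_asymptomatic (from the report)

# exp(coefficient) gives the multiplicative change in the odds of heart
# disease for a one-unit increase in the feature
odds_ratio = math.exp(coef)
print(round(odds_ratio, 2))  # 3.25
```

So this feature being present multiplies the estimated odds of heart disease by roughly 3.25, all else held equal.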

### Logistic Regression: Test Results

@@ -89,12 +110,19 @@ For example, features like `chest_pain_type_asymptomatic` (1.18), `thalassemia_r
#|
lg_precision_class_0 = round(lg_model_results.loc[lg_model_results[''] == '0', 'precision'].values[0] * 100, 2)
lg_recall_class_0 = round(lg_model_results.loc[lg_model_results[''] == '0', 'recall'].values[0] * 100, 2)
lg_f1_class_0 = round(lg_model_results.loc[lg_model_results[''] == '0', 'f1-score'].values[0] * 100, 2)
lg_precision_class_1 = round(lg_model_results.loc[lg_model_results[''] == '1', 'precision'].values[0] * 100, 2)
lg_recall_class_1 = round(lg_model_results.loc[lg_model_results[''] == '1', 'recall'].values[0] * 100, 2)
lg_f1_class_1 = round(lg_model_results.loc[lg_model_results[''] == '1', 'f1-score'].values[0] * 100, 2)
lg_accuracy_test = round(lg_model_results.loc[lg_model_results[''] == 'accuracy', 'precision'].values[0] * 100, 2)
Markdown(lg_model_results.to_markdown(index = False))
```

-Test results show that Logistic Regression outperforms the Decision Tree in terms of overall accuracy (0.844).
-For precision and recall, class 0 achieves precision of 0.827 and recall of 0.896, while class 1 has precision of
-0.868 and recall of 0.786, leading to an F1-score of 0.86 for class 0 and 0.825 for class 1.
+Test results show that Logistic Regression outperforms the Decision Tree in terms of overall accuracy (`{python} lg_accuracy_test`%).
+For precision and recall, class 0 achieves precision of `{python} lg_precision_class_0`% and recall of `{python} lg_recall_class_0`%,
+while class 1 has precision of `{python} lg_precision_class_1`% and recall of `{python} lg_recall_class_1`%,
+leading to an F1-score of `{python} lg_f1_class_0`% for class 0 and `{python} lg_f1_class_1`% for class 1.
24 changes: 13 additions & 11 deletions report/content/_discussion.qmd
@@ -10,15 +10,17 @@ and demographic factors.
#### Overall Accuracy of the Classification Models

Both the logistic regression and decision tree models demonstrated strong predictive capabilities.
-The decision tree model achieved an accuracy of 73.33%, with a precision of 72.22% and recall of 81.25%
-for class 0 (absence of heart disease) and a precision of 75.0% and recall of 64.29% for class 1 (presence of heart disease).
-However, the decision tree model exhibited signs of potential overfitting, as indicated by the 100% training accuracy versus
-a test accuracy of 72.5%, suggesting that it might not generalize well to unseen data.

-In comparison, the logistic regression model outperformed the decision tree model, achieving an overall accuracy of 81.11%.
-It exhibited balanced precision (78.18% for class 0 and 85.71% for class 1) and recall (89.58% for class 0 and 71.43% for class 1),
-with a test accuracy of 87.5%, and a training accuracy of 89.6%. This higher performance suggests that logistic regression provides
-a more reliable model for predicting heart disease, with fewer issues related to overfitting and better generalization to new data.
+The decision tree model achieved an accuracy of `{python} dt_accuracy_test`%, with a precision of `{python} dt_precision_class_0`% and recall of `{python} dt_recall_class_0`%
+for class 0 (absence of heart disease) and a precision of `{python} dt_precision_class_1`% and recall of `{python} dt_recall_class_1`% for class 1 (presence of heart disease).
+However, the decision tree model exhibited signs of potential overfitting, as indicated by the `{python} dt_train_accuracy_cv`% training accuracy versus
+a test accuracy of `{python} dt_accuracy_test`%, suggesting that it might not generalize well to unseen data.

+In comparison, the logistic regression model outperformed the decision tree model, achieving an overall accuracy of `{python} lg_accuracy_test`%.
+It exhibited balanced precision (`{python} lg_precision_class_0`% for class 0 and `{python} lg_precision_class_1`% for class 1)
+and recall (`{python} lg_recall_class_0`% for class 0 and `{python} lg_recall_class_1`% for class 1),
+with a test accuracy of `{python} lg_accuracy_test`% and a training accuracy of `{python} lg_train_accuracy_cv`%.
+This higher performance suggests that logistic regression provides a more reliable model for predicting heart disease,
+with fewer issues related to overfitting and better generalization to new data.
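The overfitting signal described here is just the gap between training and test accuracy; a minimal helper makes the comparison explicit (accuracies in percent, values illustrative):

```python
def overfit_gap(train_acc: float, test_acc: float) -> float:
    """Return the train/test accuracy gap in percentage points."""
    return round(train_acc - test_acc, 2)

# Decision tree: perfect training fit vs. ~74% test accuracy -> large gap
print(overfit_gap(100.0, 74.0))  # 26.0
# Logistic regression: a much smaller gap, i.e. better generalization
print(overfit_gap(87.3, 82.6))   # 4.7
```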

#### Key Predictive Features for Heart Disease

@@ -36,9 +38,9 @@ confirming their relevance in the prediction of heart disease.

The ability to predict whether an individual might develop heart disease based on health indicators and demographic factors was
evaluated through both models. Logistic regression demonstrated a strong capability to predict the likelihood of heart disease,
-with particularly high recall for the absence of heart disease (89.58%), meaning the model was effective in identifying healthy
+with particularly high recall for the absence of heart disease (`{python} lg_recall_class_0`%), meaning the model was effective in identifying healthy
individuals. Conversely, the decision tree model showed a more balanced performance but struggled to consistently predict class 1
-(heart disease), as reflected by its lower recall (64.29%) for this group.
+(heart disease), as reflected by its lower recall (`{python} dt_recall_class_1`%) for this group.

Overall, the findings suggest that machine learning models, particularly logistic regression, hold promise for accurately
predicting heart disease diagnoses. However, there is still room for improvement in refining these models to reduce misclassifications,
53 changes: 24 additions & 29 deletions report/heart_disease_predictor_report.html

Large diffs are not rendered by default.

Binary file modified report/heart_disease_predictor_report.pdf
Binary file not shown.
Binary file modified results/eda_plot/feature_densities_by_diagnosis.png
Binary file modified results/models/decision_tree.pkl
Binary file not shown.
Binary file modified results/models/logistic_regression.pkl
Binary file not shown.
4 changes: 2 additions & 2 deletions results/tables/decision_tree/decision_tree_cv_results.csv
@@ -1,6 +1,6 @@
,mean,std
-fit_time,0.001,0.0
-score_time,0.002,0.0
+fit_time,0.002,0.002
+score_time,0.006,0.002
test_accuracy,0.74,0.059
train_accuracy,1.0,0.0
test_precision,0.709,0.069
@@ -1,6 +1,6 @@
,mean,std
-fit_time,0.002,0.001
-score_time,0.002,0.0
+fit_time,0.007,0.003
+score_time,0.004,0.0
test_accuracy,0.826,0.021
train_accuracy,0.873,0.009
test_precision,0.822,0.054
50 changes: 25 additions & 25 deletions results/tables/logistic_regression/logreg_coefficients.csv
@@ -1,26 +1,26 @@
Feature,Coefficient
-onehotencoder__chest_pain_type_asymptomatic,1.1838672562061838
-onehotencoder__thalassemia_reversable defect,0.9657968582775411
-onehotencoder__num_of_vessels_2.0,0.914273306990384
-onehotencoder__exercise_induced_angina_yes,0.6288076002699492
-standardscaler__sex,0.6045607179898266
-onehotencoder__num_of_vessels_1.0,0.5644600036491417
-onehotencoder__slope_flat,0.35035154466635776
-standardscaler__resting_blood_pressure,0.340495546509579
-standardscaler__cholesterol,0.3244733216396428
-standardscaler__st_depression,0.304389572995754
-onehotencoder__slope_downsloping,0.2405322386242731
-onehotencoder__rest_ecg_left ventricular hypertrophy,0.19607845057825268
-onehotencoder__chest_pain_type_atypical angina,0.1904812312471063
-onehotencoder__rest_ecg_ST-T wave abnormality,0.17303560432627466
-standardscaler__age,0.05633345043789224
-standardscaler__fasting_blood_sugar,-0.16843768707612017
-standardscaler__max_heart_rate,-0.18641022684835776
-onehotencoder__num_of_vessels_3.0,-0.22288813387853396
-onehotencoder__rest_ecg_normal,-0.3690988378526316
-onehotencoder__thalassemia_fixed defect,-0.388411089249939
-onehotencoder__chest_pain_type_typical angina,-0.442410277986553
-onehotencoder__thalassemia_normal,-0.5773705519757063
-onehotencoder__slope_upsloping,-0.590868566238734
-onehotencoder__chest_pain_type_non-anginal pain,-0.9319229924148412
-onehotencoder__num_of_vessels_0.0,-1.255829959709096
+onehotencoder__chest_pain_type_asymptomatic,1.1838672562033263
+onehotencoder__thalassemia_reversable defect,0.9657968582786678
+onehotencoder__num_of_vessels_2.0,0.9142733069914617
+onehotencoder__exercise_induced_angina_yes,0.6288076002703239
+standardscaler__sex,0.6045607179889861
+onehotencoder__num_of_vessels_1.0,0.5644600036470457
+onehotencoder__slope_flat,0.3503515446664692
+standardscaler__resting_blood_pressure,0.3404955465111282
+standardscaler__cholesterol,0.32447332163941606
+standardscaler__st_depression,0.3043895729937989
+onehotencoder__slope_downsloping,0.24053223862408943
+onehotencoder__rest_ecg_left ventricular hypertrophy,0.1960784505781805
+onehotencoder__chest_pain_type_atypical angina,0.19048123124637503
+onehotencoder__rest_ecg_ST-T wave abnormality,0.17303560432690512
+standardscaler__age,0.05633345043781977
+standardscaler__fasting_blood_sugar,-0.16843768707539558
+standardscaler__max_heart_rate,-0.1864102268478972
+onehotencoder__num_of_vessels_3.0,-0.22288813387665835
+onehotencoder__rest_ecg_normal,-0.3690988378536457
+onehotencoder__thalassemia_fixed defect,-0.38841108925053175
+onehotencoder__chest_pain_type_typical angina,-0.4424102779839873
+onehotencoder__thalassemia_normal,-0.5773705519766958
+onehotencoder__slope_upsloping,-0.590868566239119
+onehotencoder__chest_pain_type_non-anginal pain,-0.9319229924142742
+onehotencoder__num_of_vessels_0.0,-1.2558299597104094
Binary file modified results/tables/logistic_regression/logreg_coefficients.png
