Skip to content

Commit

Permalink
Merge pull request #90 from UBC-MDS/final-fixes
Browse files Browse the repository at this point in the history
 Final Fixes
  • Loading branch information
dd5124 authored Dec 15, 2024
2 parents 9277507 + 18bf244 commit ca42db5
Show file tree
Hide file tree
Showing 15 changed files with 187 additions and 62 deletions.
111 changes: 111 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,111 @@
# Changelog
*Modified from our original combined [issue](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/issues/64)*
## Feedback Sources

### [Peer Review Feedback](https://github.com/UBC-MDS/data-analysis-review-2024/issues/12#issuecomment-2537520503)

### [Milestone 1 Feedback](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/issues/35)

### Milestone 2 Feedback:

- Feedback from Milestone 2 was to change the docker-compose file to not specify latest. This was addressed in more details in this [Issue](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/issues/88)

---

In general, all feedback from Milestones and most from peer review were addressed in fixes. A minority few from peer reviews were addressed by explaining our rationale for we didn't follow their suggestions. These specific explanations each have their own GitHub Issues and are linked to within their respective bullet point below.

*Note: Some fixes may have been addressed together in one big PR*

## Feedback Fixes, Commits, and Pull Requests

### **Report**

- Ensure EDA is conducted exclusively on the training dataset to prevent data leakage.
- Addressed by:
- [This commit](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/pull/79/commits/53de6cb46fa01639ee9fe745a43c6669ea679050)
- [This commit](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/pull/79/commits/15cf4c5586995dd7154f4ee238a08d10b5813d93)

- [Expand the "Methods & Results" section with more explanatory notes and keep the "Discussion" focused on providing additional details. Discuss the motivation behind your scoring metrics in the context of "Heart Disease Prediction".](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/pull/89)

- [Provide more detail in the "Methods" section on how the dataset was split for training and testing (e.g., 70-30 split or cross-validation)](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/pull/89)

- [The rendered reports in the repository are different from those created by make all. They are not up-to-date.](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/pull/90)

- [References: Please ensure all softwares used in generating the analysis and report are properly attributed.](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/pull/89)

---

### **Documentation**

- [Pin `requests` version to `==2.24.0` instead of `>=2.24.0`.](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/commit/92e9e74bc2b7b78f7fcb1cf1958d242d9187b836)

- [In the dockerfile, the port is mapped to 34651. However, the message in terminal still shows 8888. This may confuse people who do not carefully read instructions. Is there any way to resolve it?](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/issues/75)

- [Split the quarto render commands into separate lines for clarity:](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/issues/74)

```bash
quarto render heart_disease_predictor_report.qmd --to html
quarto render heart_disease_predictor_report.qmd --to pdf
```

---

### **Code**

- [Ensure consistent script naming (e.g., some scripts use `_heart_disease_predictor`, others do not).](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/commit/c1df4330520f8f65728530e5e9b12ed9b8465860)

- [Implement more specific exceptions (e.g., `InvalidColumnTypeError`, `MissingValueError`) instead of generic ones like `ValueError`.](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/commit/d1e3849dfffb0aa2e32b599f7ac7f58c42afb12d)

- Move helper functions into separate modules for better organization.
- Addressed by:
- [This commit](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/commit/fc240272756a1cde597eacd3924520ff1713c34f)
- [This commit](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/commit/be9ba5c8b969124a002e793d4f22840f44945650)

- Modularize functions to improve code clarity and reusability & Abstract code from the `main()` function into separate functions to improve structure and readability.
- Addressed by:
- [This PR](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/pull/66)
- [This PR](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/pull/79)
- [This PR](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/pull/84)
- [This PR](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/pull/83)

- Unit Testing: Adding unit tests for each validation function would ensure the correctness of the logic when applied to different datasets. It can help catch edge cases and confirm that each function works as expected.
- Addressed by:
- [This PR](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/pull/66)
- [This PR](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/pull/79)
- [This PR](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/pull/84)
- [This PR](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/pull/83)

---

### Milestone 1 Specific Feedback:

#### The below 6 are all addressed in [this PR](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/pull/51)

- Add proper email to code of conduct
- Remove .ds_store file
- Discuss importance/limitation of findings in Abstract/summary
- Clearly identify question in introduction
- Add all references to the dataset in Introduction
- More clearly define target/response variable in Introduction

#### The below 3 are all addressed in [this PR](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/pull/89)

- Important methodology descriptions missing (e.g., did not explain in narrative what metric was being used for model parameter optimization)
- Some important results not displayed
- Findings from project need to be linked back to application domain and questions

---

### **Organization**

- [Move all environment files into a separate folder to tidy up the repository. #72](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/issues/72)
- [Remove redundant lock files](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/issues/71)
- [Organize files in the `report` folder into subfolders for better structure.](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/commit/704fb0f46221944bcf70efff16d43fae6b21e9d5)

---

### **Warnings**

- [Address and resolve the warning messages that appear when running `make all`. Follow the instructions provided in the messages to eliminate them.](https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor/commit/ecf10b313713e36ddaed93f853acd98397680faa)

---
16 changes: 8 additions & 8 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -85,8 +85,8 @@ ${MODEL_DIR}/logistic_regression.pkl \
${TBL_DIR}/logistic_regression/logistic_regression_confusion_matrix.png \
${TBL_DIR}/logistic_regression/logistic_regression_cv_results.csv \
${TBL_DIR}/logistic_regression/logreg_coefficients.csv \
${TBL_DIR}/logistic_regression/logreg_coefficients.png : scripts/fit_heart_disease_predictor.py
python scripts/fit_heart_disease_predictor.py \
${TBL_DIR}/logistic_regression/logreg_coefficients.png : scripts/fit_model.py
python scripts/fit_model.py \
--x-train=${PROC_DIR}/X_train_transformed.csv \
--y-train=${PROC_DIR}/y_train.csv \
--model=logistic_regression \
Expand All @@ -96,8 +96,8 @@ ${TBL_DIR}/logistic_regression/logreg_coefficients.png : scripts/fit_heart_disea
# Fit the decision tree model and generate data and figures
${MODEL_DIR}/decision_tree.pkl \
${TBL_DIR}/decision_tree/decision_tree_confusion_matrix.png \
${TBL_DIR}/decision_tree/decision_tree_cv_results.csv : scripts/fit_heart_disease_predictor.py
python scripts/fit_heart_disease_predictor.py \
${TBL_DIR}/decision_tree/decision_tree_cv_results.csv : scripts/fit_model.py
python scripts/fit_model.py \
--x-train=${PROC_DIR}/X_train_transformed.csv \
--y-train=${PROC_DIR}/y_train.csv \
--model=decision_tree \
Expand All @@ -109,16 +109,16 @@ ${TBL_DIR}/decision_tree/decision_tree_cv_results.csv : scripts/fit_heart_diseas
evals: ${TBL_DIR}/decision_tree/classification_report.csv ${TBL_DIR}/logistic_regression/classification_report.csv

# Generate classification report for Decision Tree
${TBL_DIR}/decision_tree/classification_report.csv : scripts/evaluate_heart_disease_predictor.py
python scripts/evaluate_heart_disease_predictor.py \
${TBL_DIR}/decision_tree/classification_report.csv : scripts/evaluate_model.py
python scripts/evaluate_model.py \
--x-test=${PROC_DIR}/X_test_transformed.csv \
--y-test=${PROC_DIR}/y_test.csv \
--pipeline-from=${MODEL_DIR}/decision_tree.pkl \
--results-to=${TBL_DIR}/decision_tree

# Generate classification report for Logistic Regression
${TBL_DIR}/logistic_regression/classification_report.csv : scripts/evaluate_heart_disease_predictor.py
python scripts/evaluate_heart_disease_predictor.py \
${TBL_DIR}/logistic_regression/classification_report.csv : scripts/evaluate_model.py
python scripts/evaluate_model.py \
--x-test=${PROC_DIR}/X_test_transformed.csv \
--y-test=${PROC_DIR}/y_test.csv \
--pipeline-from=${MODEL_DIR}/logistic_regression.pkl \
Expand Down
6 changes: 5 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,8 @@ Follow the instructions below to reproduce the analysis.

*NOTE 3*: If you are getting any errors with libraries or such, you may want to make sure the docker container and image are up to date. We've found deleting the image completely from your Docker Desktop the best method to ensure it has all been deleted, and the latest image will be pulled.

*NOTE 4*: If you'd rather build the environment locally, you can do so either using the already provided [`environment.yml`](environment.yml) file or with `conda-lock` with any of the 4 provided platforms in the root of this repository.

3. To run the analysis, regenerate the data, and generate the HTML and PDFs, open a terminal (in the docker jupyter lab) and run the following commands:

- `make clean` (To clean up (remove) all files generated by the analysis)
Expand All @@ -58,7 +60,9 @@ python -m pytest tests/test_save_classification_report.py

Or if you want to run them all at once, in the root folder enter:

```pytest```
```
pytest
```


#### Clean up
Expand Down
29 changes: 17 additions & 12 deletions index.html

Large diffs are not rendered by default.

29 changes: 17 additions & 12 deletions report/heart_disease_predictor_report.html

Large diffs are not rendered by default.

Binary file modified report/heart_disease_predictor_report.pdf
Binary file not shown.
Binary file modified results/eda_plot/feature_densities_by_diagnosis.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file modified results/models/decision_tree.pkl
Binary file not shown.
Binary file modified results/models/logistic_regression.pkl
Binary file not shown.
4 changes: 2 additions & 2 deletions results/tables/decision_tree/decision_tree_cv_results.csv
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
,mean,std
fit_time,0.004,0.001
score_time,0.006,0.001
fit_time,0.001,0.0
score_time,0.002,0.0
test_accuracy,0.74,0.059
train_accuracy,1.0,0.0
test_precision,0.709,0.069
Expand Down
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
,mean,std
fit_time,0.009,0.005
score_time,0.008,0.003
fit_time,0.002,0.001
score_time,0.002,0.0
test_accuracy,0.826,0.021
train_accuracy,0.873,0.009
test_precision,0.822,0.054
Expand Down
50 changes: 25 additions & 25 deletions results/tables/logistic_regression/logreg_coefficients.csv
Original file line number Diff line number Diff line change
@@ -1,26 +1,26 @@
Feature,Coefficient
onehotencoder__chest_pain_type_asymptomatic,1.1838672562033263
onehotencoder__thalassemia_reversable defect,0.9657968582786678
onehotencoder__num_of_vessels_2.0,0.9142733069914617
onehotencoder__exercise_induced_angina_yes,0.6288076002703239
standardscaler__sex,0.6045607179889861
onehotencoder__num_of_vessels_1.0,0.5644600036470457
onehotencoder__slope_flat,0.3503515446664692
standardscaler__resting_blood_pressure,0.3404955465111282
standardscaler__cholesterol,0.32447332163941606
standardscaler__st_depression,0.3043895729937989
onehotencoder__slope_downsloping,0.24053223862408943
onehotencoder__rest_ecg_left ventricular hypertrophy,0.1960784505781805
onehotencoder__chest_pain_type_atypical angina,0.19048123124637503
onehotencoder__rest_ecg_ST-T wave abnormality,0.17303560432690512
standardscaler__age,0.05633345043781977
standardscaler__fasting_blood_sugar,-0.16843768707539558
standardscaler__max_heart_rate,-0.1864102268478972
onehotencoder__num_of_vessels_3.0,-0.22288813387665835
onehotencoder__rest_ecg_normal,-0.3690988378536457
onehotencoder__thalassemia_fixed defect,-0.38841108925053175
onehotencoder__chest_pain_type_typical angina,-0.4424102779839873
onehotencoder__thalassemia_normal,-0.5773705519766958
onehotencoder__slope_upsloping,-0.590868566239119
onehotencoder__chest_pain_type_non-anginal pain,-0.9319229924142742
onehotencoder__num_of_vessels_0.0,-1.2558299597104094
onehotencoder__chest_pain_type_asymptomatic,1.1838672562061838
onehotencoder__thalassemia_reversable defect,0.9657968582775411
onehotencoder__num_of_vessels_2.0,0.914273306990384
onehotencoder__exercise_induced_angina_yes,0.6288076002699492
standardscaler__sex,0.6045607179898266
onehotencoder__num_of_vessels_1.0,0.5644600036491417
onehotencoder__slope_flat,0.35035154466635776
standardscaler__resting_blood_pressure,0.340495546509579
standardscaler__cholesterol,0.3244733216396428
standardscaler__st_depression,0.304389572995754
onehotencoder__slope_downsloping,0.2405322386242731
onehotencoder__rest_ecg_left ventricular hypertrophy,0.19607845057825268
onehotencoder__chest_pain_type_atypical angina,0.1904812312471063
onehotencoder__rest_ecg_ST-T wave abnormality,0.17303560432627466
standardscaler__age,0.05633345043789224
standardscaler__fasting_blood_sugar,-0.16843768707612017
standardscaler__max_heart_rate,-0.18641022684835776
onehotencoder__num_of_vessels_3.0,-0.22288813387853396
onehotencoder__rest_ecg_normal,-0.3690988378526316
onehotencoder__thalassemia_fixed defect,-0.388411089249939
onehotencoder__chest_pain_type_typical angina,-0.442410277986553
onehotencoder__thalassemia_normal,-0.5773705519757063
onehotencoder__slope_upsloping,-0.590868566238734
onehotencoder__chest_pain_type_non-anginal pain,-0.9319229924148412
onehotencoder__num_of_vessels_0.0,-1.255829959709096
Binary file modified results/tables/logistic_regression/logreg_coefficients.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
File renamed without changes.
File renamed without changes.

0 comments on commit ca42db5

Please sign in to comment.