DSCI522-2425-25-heart_disease_predictor

Authors: Anna Nandar, Brian Chang, Celine Habashy, Yeji Sohn

About

We built models using decision trees and logistic regression algorithms to predict the presence of heart disease based on health-related features. On an unseen dataset, our models achieved an overall accuracy of 84.4%. Logistic regression demonstrated better interpretability, with high precision and recall. Some features, such as fasting blood sugar, showed lower importance than anticipated. Moving forward, we plan to explore ensemble methods like Random Forest and Gradient Boosting to improve accuracy and consider incorporating additional clinical data for deeper insights.
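For a rough sense of what those models look like in code, the sketch below builds scikit-learn pipelines for a decision tree and a logistic regression classifier. It is illustrative only: the column names are the usual Cleveland feature names, and the preprocessing and hyperparameters shown here are assumptions, not the exact settings used in this project's scripts.

```python
# Illustrative sketch only; the real pipelines are defined in the project's scripts.
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature split; the real column lists come from the cleaned data.
numeric_cols = ["age", "trestbps", "chol", "thalach", "oldpeak"]
categorical_cols = ["cp", "fbs", "restecg", "exang", "slope", "thal"]

preprocessor = make_column_transformer(
    (StandardScaler(), numeric_cols),
    (OneHotEncoder(handle_unknown="ignore"), categorical_cols),
)

models = {
    "decision tree": make_pipeline(preprocessor, DecisionTreeClassifier(random_state=123)),
    "logistic regression": make_pipeline(preprocessor, LogisticRegression(max_iter=1000)),
}

# Each pipeline would then be fit and scored on the processed train/test splits, e.g.:
# models["logistic regression"].fit(X_train, y_train).score(X_test, y_test)
```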

The data set used in this project is the Cleveland database, sourced from the UCI Machine Learning Repository (Detrano et al., 1989); it can be found here. The dataset includes features such as age, chest pain type, blood pressure, cholesterol, and more, alongside a binary diagnosis label (presence or absence of heart disease).
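For reference, the same dataset can be pulled programmatically with the ucimlrepo package listed in the dependencies below. The snippet is a minimal sketch of that API (dataset id 45 is the UCI Heart Disease entry), not the project's own download script.

```python
# Minimal sketch of fetching the UCI Heart Disease data with ucimlrepo;
# the project's own download and validation scripts may differ in detail.
from ucimlrepo import fetch_ucirepo

heart_disease = fetch_ucirepo(id=45)  # id 45 = Heart Disease (Cleveland et al.)

X = heart_disease.data.features   # age, chest pain type, cholesterol, ...
y = heart_disease.data.targets    # raw target is 0-4; presence/absence is derived from it

print(X.shape)
print(y.value_counts())
```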

Report

The final report can be found rendered in HTML here.

Usage

Follow the instructions below to reproduce the analysis.

Setup

  1. Clone this GitHub repository: git clone https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor.git

  2. Navigate to the root of the cloned project folder in your IDE.

Running the analysis

  1. At the root of the project in a terminal, enter docker-compose up
  2. In the terminal, find the URL in the Docker Compose logs that starts with http://127.0.0.1:PORT_NUMBER/lab?token= and open it in your browser.

NOTE: Replace PORT_NUMBER with 34651 to reach the port mapped by the Docker container.

NOTE 2: If you are taken to an authentication screen, copy the token from the log line containing http://127.0.0.1:PORT_NUMBER/lab?token=... and paste it into the "login with a token" field on the login screen.

NOTE 3: If you run into library errors, make sure the Docker container and image are up to date. Deleting the image completely from Docker Desktop is the most reliable way to ensure the latest image is pulled on the next run.

NOTE 4: If you'd rather build the environment locally, you can do so using the provided environment.yml file, or with conda-lock and any of the four platform lock files in the root of this repository.

  3. To run the analysis, regenerate the data, and generate the HTML and PDF reports, open a terminal (in the Docker JupyterLab) and run the following commands:

    • make clean (removes all files generated by the analysis)
    • make all (generates all the files needed, including the report)

NOTE: See the "Running individual parts of the analysis using Make" section below to run individual parts only.

Running tests

At the root of the project in a terminal, enter:

python -m pytest tests/test_validate.py
python -m pytest tests/test_create_dir_if_not_exist.py
python -m pytest tests/test_load_data.py
python -m pytest tests/test_save_classification_report.py

Or if you want to run them all at once, in the root folder enter:

pytest
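The test modules follow standard pytest conventions: plain functions prefixed with test_ that each exercise one helper. The example below is a hypothetical illustration of that structure using a toy stand-in helper, not a copy of any test in this repository.

```python
# Hypothetical illustration of the pytest conventions used in tests/;
# create_results_dir is a toy stand-in, not a function from this project.
import os
import tempfile


def create_results_dir(path):
    """Create the directory at `path` if it does not already exist."""
    if not os.path.isdir(path):
        os.makedirs(path)
    return path


def test_creates_missing_directory():
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "results")
        create_results_dir(target)
        assert os.path.isdir(target)


def test_existing_directory_is_returned_unchanged():
    with tempfile.TemporaryDirectory() as tmp:
        assert create_results_dir(tmp) == tmp
```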

Clean up

  1. To make sure the Docker container is properly cleaned up, press Ctrl + C in the terminal where you launched the container, then run docker-compose rm.

Dependencies

Docker is used to create reproducible instances of this project. The Docker image is based on the quay.io/jupyter/minimal-notebook:notebook-7.0.6 image. Dependencies beyond this base image and the Conda dependencies listed below are specified in the Dockerfile.

Conda is also used to manage the software dependencies for this project. All dependencies are specified in the environment.yml.

Dependencies:
  - python=3.11
  - pip=24.3.1
  - pandas=2.2.2
  - ipykernel=6.29.5
  - nb_conda_kernels=2.5.1
  - scipy=1.14.1
  - matplotlib=3.9.3
  - scikit-learn=1.5.2
  - requests=2.32.3
  - seaborn=0.13.2
  - ucimlrepo=0.0.7
  - pandera=0.20.2
  - quarto=1.5.57
  - click=8.1.7
  - tabulate=0.9.0
  - lmodern (this is installed by the Dockerfile)
  - make (this is installed by the Dockerfile)
  - deepchecks=0.18.1
  - pytest=8.3.4

Developer Notes

Developer Dependencies

  1. conda (version 24.11.0 or higher)
  2. conda-lock (version 2.5.7 or higher)
  3. Docker

Adding a new dependency

  1. Add the dependency to the environment.yml file on a new branch.

  2. Run conda-lock -k explicit --file environment.yml -p linux-64 to update the conda-linux-64.lock file

  3. Re-build the Docker image locally to ensure it still runs.

  4. Test the container locally by running it and ensuring your new dependencies are working

  5. Push the changes to GitHub.

  6. Update your local docker-compose.yml file on your branch to use the new container image (line 3 of the docker-compose.yml file, where it starts with "image: ...").

Note: The setup currently always uses the latest Docker image anyway, so step 6 is not needed for Milestone 2.

  7. Send a pull request to merge the changes into the main branch.

Running individual parts of the analysis using Make

You may also generate individual parts of the analysis one at a time.

To generate all the raw, cleaned, and processed data

  • make data

To generate the EDA

  • make figures

To generate the model, model training results, and model training figures

  • make fits

To generate the final scoring results and the classification report figures

  • make evals

To regenerate the HTML and PDF reports from the QMD file:

  • make report/heart_disease_predictor_report.html report/heart_disease_predictor_report.pdf

License

This project is licensed under the MIT License.

References

Heart disease. UCI Machine Learning Repository. (n.d.). https://archive.ics.uci.edu/dataset/45/heart+disease

Detrano, R. C., Jánosi, A., Steinbrunn, W., Pfisterer, M. E., Schmid, J., Sandhu, S., Guppy, K., Lee, S., & Froelicher, V. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. The American Journal of Cardiology, 64(5), 304–310.

Van Rossum, G., & Drake, F. (2009). Python 3 Reference Manual. CreateSpace.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

The pandas development team. (2020). pandas-dev/pandas: Pandas. Zenodo. https://doi.org/10.5281/zenodo.3509134

Bantilan, N. (2020). Pandera: Statistical data validation of pandas dataframes. In M. Agarwal, C. Calloway, D. Niederhut, & D. Shupe (Eds.), Proceedings of the 19th Python in Science Conference (pp. 116–124). https://doi.org/10.25080/Majora-342d178e-010