Dataset taken from https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset.
Exploratory data analysis, model development and model explainability for the heart disease web application. The EDA and modelling (Logistic Regression, AutoML and Gradient-Boosted Trees (GBT)) were performed in Azure Databricks (files in Databricks_workspace) and tracked with MLflow.
Subsequently, the GBT model was replicated in the local environment in create_model.py, which saves the model and all datasets used to /data/. Model explainability was implemented with SHAP in explain_model.py.
If you wish to run with Docker, install it first:
Linux: To install Docker on Linux, follow the instructions for your specific distribution on the Docker website.
Windows: Install Docker Desktop by downloading it from the Docker Desktop for Windows page.
To install this application without using a Docker container, follow these steps:
- Clone this repository to your local machine:
git clone https://github.com/leo-cb/HeartDiseasePrediction_ModelDev.git
- Install dependencies:
pip install -r requirements.txt
To install this application using Docker, follow these steps:
- Clone this repository to your local machine:
git clone https://github.com/leo-cb/HeartDiseasePrediction_ModelDev.git
- Create the Docker image:
docker build -t heartdisease_modeldev .
To run the scripts without Docker, follow these steps:
- Execute create_model.py to create the GBT model and output it to /data/:
python create_model.py
- Execute explain_model.py to output the SHAP plots to /images/ and show them:
python explain_model.py --show-plots
To run the scripts with Docker, follow these steps:
- Execute create_model.py to create the GBT model and output it to /data/:
docker run -it heartdisease_modeldev:latest python create_model.py
- Execute explain_model.py to output the SHAP plots to /images/:
docker run -it heartdisease_modeldev:latest python explain_model.py
The following steps were taken:
Exploratory data analysis (EDA)
Files: Databricks_workspace/eda.ipynb
(Screenshots: Databricks workspace, feature importances)
Modelling was performed with Logistic Regression, AutoML and Gradient-Boosted Trees models in Azure Databricks with PySpark. Model tracking was performed with MLflow. The model chosen for production was the one with the highest AUC on the test set (a GBT with the 9 features with the highest feature importances).
Files: Databricks_workspace/model.py
(Screenshots: Logistic Regression in MLflow, GBT in MLflow, runs with different feature sets in MLflow, F1-score between different MLflow runs)
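For illustration, a minimal sketch of how one such GBT run could be trained and tracked with PySpark and MLflow follows. The dataset path, 9-feature subset and hyperparameters are assumptions for illustration only; Databricks_workspace/model.py is the source of truth.

    # Sketch only: train a GBT classifier with PySpark and log the run to MLflow.
    import mlflow
    import mlflow.spark
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import GBTClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv("heart.csv", header=True, inferSchema=True)  # assumed dataset path

    # Assumed 9-feature subset (the actual subset comes from the feature importance analysis)
    features = ["age", "sex", "cp", "thalach", "exang", "oldpeak", "slope", "ca", "thal"]
    assembled = VectorAssembler(inputCols=features, outputCol="features").transform(df)
    train, test = assembled.randomSplit([0.8, 0.2], seed=42)

    with mlflow.start_run(run_name="gbt_9_features"):
        model = GBTClassifier(labelCol="target", featuresCol="features", maxIter=50).fit(train)
        evaluator = BinaryClassificationEvaluator(labelCol="target", metricName="areaUnderROC")
        auc = evaluator.evaluate(model.transform(test))
        mlflow.log_param("features", features)
        mlflow.log_metric("test_auc", auc)
        mlflow.spark.log_model(model, "gbt_model")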
Local model creation (GBT)
Files: create_model.py
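In outline, the local replication could look like the following hedged sketch, assuming scikit-learn's GradientBoostingClassifier and pickle for persistence; the exact feature list, hyperparameters and file names used by create_model.py may differ.

    # Sketch only: train a local GBT on the selected features and persist the model and datasets to /data/.
    import pickle
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("data/heart.csv")  # assumed location of the Kaggle dataset
    features = ["age", "sex", "cp", "thalach", "exang", "oldpeak", "slope", "ca", "thal"]  # assumed top-9 features
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["target"], test_size=0.2, random_state=42
    )

    model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

    # Persist the model and the datasets used, so downstream scripts (e.g. explain_model.py) can reuse them.
    with open("data/model_gbt.pkl", "wb") as f:
        pickle.dump(model, f)
    X_train.assign(target=y_train).to_csv("data/train.csv", index=False)
    X_test.assign(target=y_test).to_csv("data/test.csv", index=False)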
Model explainability (SHAP)
Files: explain_model.py
(Plots: SHAP summary plot, SHAP bar plot)
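Roughly, the SHAP step amounts to the following sketch; the model and data paths here are assumptions carried over from the sketch above, and explain_model.py is the source of truth.

    # Sketch only: load the saved GBT model and data, compute SHAP values,
    # and write the summary and bar plots to /images/.
    import pickle
    import pandas as pd
    import shap
    import matplotlib.pyplot as plt

    with open("data/model_gbt.pkl", "rb") as f:  # assumed model path
        model = pickle.load(f)
    X = pd.read_csv("data/train.csv").drop(columns=["target"])  # assumed dataset path

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)

    shap.summary_plot(shap_values, X, show=False)
    plt.savefig("images/shap_summary.png", bbox_inches="tight")
    plt.clf()

    shap.summary_plot(shap_values, X, plot_type="bar", show=False)
    plt.savefig("images/shap_bar.png", bbox_inches="tight")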
Dockerization
Files: Dockerfile