Dataset taken from https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset.
Exploratory data analysis, model development and model explainability for the heart disease web application. The EDA and modelling (Logistic Regression, AutoML and Gradient-Boosted Trees (GBT)) were performed in Azure Databricks (files in Databricks_workspace) and tracked with MLflow.
Subsequently, the GBT model was replicated in the local environment in create_model.py, which saves the model and all datasets used to /data/. Model explainability was implemented with SHAP in explain_model.py.
If you wish to run with Docker, install it first:
Linux: To install Docker on Linux, follow the instructions for your specific distribution on the Docker website.
Windows: Install Docker Desktop by downloading it from the Docker Desktop for Windows page.
To install this application without using a Docker container, follow these steps:
- Clone this repository to your local machine:
git clone https://github.com/leo-cb/HeartDiseasePrediction_ModelDev.git
- Install dependencies:
pip install -r requirements.txt
To install this application using Docker, follow these steps:
- Clone this repository to your local machine:
git clone https://github.com/leo-cb/HeartDiseasePrediction_ModelDev.git
- Create the Docker image:
docker build -t heartdisease_modeldev .
To run the scripts without Docker, follow these steps:
- Execute create_model.py to create the GBT model and output it to /data/:
python create_model.py
- Execute explain_model.py to output the SHAP plots to /images/ and show them:
python explain_model.py --show-plots
To run the scripts with Docker, follow these steps:
- Execute create_model.py to create the GBT model and output it to /data/:
docker run -it heartdisease_modeldev:latest python create_model.py
- Execute explain_model.py to output the SHAP plots to /images/:
docker run -it heartdisease_modeldev:latest python explain_model.py
The following steps were taken:
Exploratory data analysis (EDA)
Files: Databricks_workspace/eda.ipynb
(Screenshots: Databricks workspace, feature importances)
Modelling was performed with Logistic Regression, AutoML and Gradient-Boosted Trees models in Azure Databricks with PySpark. Model tracking was performed with MLflow. The model chosen for production was the one with the highest AUC on the test set (a GBT with the 9 features with the highest feature importances).
Files: Databricks_workspace/model.py
(Screenshots: Logistic Regression in MLflow, GBT in MLflow, runs with different feature sets in MLflow, F1-score between different MLflow runs)
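For illustration, a minimal sketch of how one such GBT run could be trained and tracked with PySpark and MLflow follows. The dataset path, 9-feature subset and hyperparameters are assumptions for illustration only; Databricks_workspace/model.py is the source of truth.

    # Sketch only: train a GBT classifier with PySpark and log the run to MLflow.
    import mlflow
    import mlflow.spark
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import GBTClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv("heart.csv", header=True, inferSchema=True)  # assumed dataset path

    # Assumed 9-feature subset (the actual subset comes from the feature importance analysis)
    features = ["age", "sex", "cp", "thalach", "exang", "oldpeak", "slope", "ca", "thal"]
    assembled = VectorAssembler(inputCols=features, outputCol="features").transform(df)
    train, test = assembled.randomSplit([0.8, 0.2], seed=42)

    with mlflow.start_run(run_name="gbt_9_features"):
        model = GBTClassifier(labelCol="target", featuresCol="features", maxIter=50).fit(train)
        evaluator = BinaryClassificationEvaluator(labelCol="target", metricName="areaUnderROC")
        auc = evaluator.evaluate(model.transform(test))
        mlflow.log_param("features", features)
        mlflow.log_metric("test_auc", auc)
        mlflow.spark.log_model(model, "gbt_model")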
Local model creation (GBT)
Files: create_model.py
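In outline, the local replication could look like the following hedged sketch, assuming scikit-learn's GradientBoostingClassifier and pickle for persistence; the exact feature list, hyperparameters and file names used by create_model.py may differ.

    # Sketch only: train a local GBT on the selected features and persist the model and datasets to /data/.
    import pickle
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("data/heart.csv")  # assumed location of the Kaggle dataset
    features = ["age", "sex", "cp", "thalach", "exang", "oldpeak", "slope", "ca", "thal"]  # assumed top-9 features
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["target"], test_size=0.2, random_state=42
    )

    model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

    # Persist the model and the datasets used, so downstream scripts (e.g. explain_model.py) can reuse them.
    with open("data/model_gbt.pkl", "wb") as f:
        pickle.dump(model, f)
    X_train.assign(target=y_train).to_csv("data/train.csv", index=False)
    X_test.assign(target=y_test).to_csv("data/test.csv", index=False)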
Model explainability (SHAP)
Files: explain_model.py
(Plots: SHAP summary plot, SHAP bar plot)
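Roughly, the SHAP step amounts to the following sketch; the model and data paths here are assumptions carried over from the sketch above, and explain_model.py is the source of truth.

    # Sketch only: load the saved GBT model and data, compute SHAP values,
    # and write the summary and bar plots to /images/.
    import pickle
    import pandas as pd
    import shap
    import matplotlib.pyplot as plt

    with open("data/model_gbt.pkl", "rb") as f:  # assumed model path
        model = pickle.load(f)
    X = pd.read_csv("data/train.csv").drop(columns=["target"])  # assumed dataset path

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)

    shap.summary_plot(shap_values, X, show=False)
    plt.savefig("images/shap_summary.png", bbox_inches="tight")
    plt.clf()

    shap.summary_plot(shap_values, X, plot_type="bar", show=False)
    plt.savefig("images/shap_bar.png", bbox_inches="tight")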
Dockerization
Files: Dockerfile