This repository contains a completed capstone project for the Udacity "Applying AI to EHR Data" course, part of the AI for Healthcare Nanodegree program. It has been reviewed by Udacity instructors and meets the project specifications.
A hypothetical healthcare company is preparing for Phase III clinical trial testing of its novel diabetes drug. The drug requires administration and patient monitoring over a duration of 5-7 days in the hospital. Target patients are those who are likely to be in the hospital for this duration, so there will be no significant additional costs for drug administration and patient monitoring. The goal of this project is to use Electronic Health Record (EHR) data to build a regression model that predicts a patient's hospitalization time, and then use this model to select patients for the study.
A deep neural network regression model was built to predict the number of days of hospitalization for patients in this dataset. These predictions are then converted into a binary prediction of whether to include or exclude each patient from the clinical trial.
This project works with EHR data by transforming line-level data into an appropriate representation at the encounter level (per patient visit), and then applying filtering, preprocessing, and feature engineering of key medical code sets. The TensorFlow Feature Column API was used to prepare features, and TensorFlow Probability layers were used to create the regression model.
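To illustrate the line-level to encounter-level transformation, here is a minimal pandas sketch that collapses one row per medical/medication code entry into one row per encounter. The column names (`encounter_id`, `patient_nbr`, `ndc_code`) and the aggregation choices are assumptions for illustration only; the completed notebook implements this step against the project's actual schema.

```python
import pandas as pd

# Line-level EHR data: one row per medical/medication code entry per visit.
# Column names below are assumptions for illustration only.
line_df = pd.read_csv("src/data/final_project_dataset.csv")

# Aggregate to the encounter (patient visit) level: one row per encounter,
# collecting the set of NDC codes recorded during that visit.
encounter_df = (
    line_df
    .groupby(["encounter_id", "patient_nbr"], as_index=False)
    .agg(ndc_codes=("ndc_code", lambda codes: sorted(set(codes.dropna()))))
)

print(encounter_df.head())
```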
The completed regression model achieved a binary prediction accuracy of 0.77, precision of 0.71, recall of 0.61, and F1-score of 0.66. It can be further optimized by maximizing precision, recall, or F1-score, with a trade-off between precision and recall.
For a full discussion, please read the "Model Evaluation Metrics" section of `src/student_project_EY_completed.ipynb`.
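As a minimal sketch of how the regression output is turned into an include/exclude decision and scored, the snippet below thresholds predicted hospitalization days and computes classification metrics with scikit-learn. The 5-day threshold, column names, and toy data are assumptions chosen to match the 5-7 day trial window described above, not the notebook's exact code.

```python
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical regression outputs; column names and values are illustrative only.
pred_df = pd.DataFrame({
    "actual_days":    [3, 6, 7, 2, 5, 9],
    "predicted_days": [4.1, 5.8, 6.3, 2.7, 4.6, 8.2],
})

# Assume patients expected to stay at least 5 days are selected for the trial.
THRESHOLD_DAYS = 5
pred_df["label"] = (pred_df["actual_days"] >= THRESHOLD_DAYS).astype(int)
pred_df["score"] = (pred_df["predicted_days"] >= THRESHOLD_DAYS).astype(int)

print(confusion_matrix(pred_df["label"], pred_df["score"]))
print(classification_report(pred_df["label"], pred_df["score"]))
```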
To understand model biases across key demographic groups, model predictions were analyzed with the UChicago Aequitas toolkit.
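A typical Aequitas audit on the binarized predictions might look like the sketch below. The input columns (`score`, `label_value`, `race`, `gender`) and the reference groups are assumptions for illustration; see the bias-analysis section of the notebook for the actual analysis.

```python
import pandas as pd
from aequitas.group import Group
from aequitas.bias import Bias
from aequitas.plotting import Plot

# Aequitas expects binary 'score' and 'label_value' columns plus the
# demographic attribute columns to audit (assumed here: race and gender).
ae_df = pd.read_csv("out/pred_test_df3.csv")[["score", "label_value", "race", "gender"]]

# Group-level metrics (TPR, FPR, precision, ...) for each attribute value.
xtab, _ = Group().get_crosstabs(ae_df)

# Disparities relative to assumed reference groups.
bdf = Bias().get_disparity_predefined_groups(
    xtab,
    original_df=ae_df,
    ref_groups_dict={"race": "Caucasian", "gender": "Male"},
    alpha=0.05,
)

# Example visualization: false-positive rate by group.
Plot().plot_group_metric(xtab, "fpr")
```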
Udacity provided a synthetic dataset (denormalized at the line level, with augmentation) built off of the UC Irvine Diabetes readmission dataset.
The dataset can be found at `src/data/final_project_dataset.csv`.
- Set up your Anaconda environment.
- Clone the [Predict_Diabetic_Patient_Hospitalization_Duration](https://github.com/ElliotY-ML/Predict_Diabetic_Patient_Hospitalization_Duration.git) GitHub repo to your local machine.
- Open `src/student_project_EY_completed.ipynb` with Jupyter Notebook to explore EDA, feature transformations, feature columns, model training, inference, and bias analysis.
Using Anaconda consists of the following steps:

1. Install miniconda on your computer, by selecting the latest Python version for your operating system. If you already have `conda` or `miniconda` installed, you should be able to skip this step and move on to step 2.
2. Create and activate a new `conda` environment.
Download the latest version of miniconda that matches your system:

|        | Linux                   | Mac                     | Windows                |
|--------|-------------------------|-------------------------|------------------------|
| 64-bit | 64-bit (bash installer) | 64-bit (bash installer) | 64-bit (exe installer) |
| 32-bit | 32-bit (bash installer) |                         | 32-bit (exe installer) |
Install miniconda on your machine. Detailed instructions:
- Linux: https://docs.conda.io/en/latest/miniconda.html#linux-installers
- Mac: https://docs.conda.io/en/latest/miniconda.html#macosx-installers
- Windows: https://docs.conda.io/en/latest/miniconda.html#windows-installers
For Windows users, the following commands need to be executed from the Anaconda Prompt rather than a Windows terminal window. For Mac, a normal terminal window will work.
These instructions also assume you have `git` installed for working with GitHub from a terminal window. If you do not, you can install it first with the command:

```bash
conda install git
```
Create local environment
- Clone the repository, and navigate to the downloaded folder. This may take a minute or two to clone due to the included data files.

  ```bash
  git clone https://github.com/ElliotY-ML/Predict_Diabetic_Patient_Hospitalization_Duration.git
  cd Predict_Diabetic_Patient_Hospitalization_Duration
  ```
- Create (and activate) a new environment, named `udacity-ehr-env`, with Python 3.8. If prompted to proceed with the install (`Proceed [y]/n`), type y.

  - Linux or Mac:

    ```bash
    conda create -n udacity-ehr-env python=3.8
    source activate udacity-ehr-env
    ```

  - Windows:

    ```bash
    conda create --name udacity-ehr-env python=3.8
    activate udacity-ehr-env
    ```

  At this point your command line should look something like `(udacity-ehr-env) <User>:USER_DIR <user>$`. The `(udacity-ehr-env)` indicates that your environment has been activated, and you can proceed with further package installations.
- Install a few required pip packages, which are specified in the requirements text file. Be sure to run the command from the project root directory since the `requirements.txt` file is located there.

  ```bash
  pip install -r requirements.txt
  ```
The original Udacity project instructions can be read in `Udacity_Project_Instructions.md`.
Project Overview
- Project Instructions & Prerequisites
- Learning Objectives
- Data Preparation and Exploratory Data Analysis
- Create Categorical Features with TensorFlow Feature Columns
- Create Continuous/Numerical Features with TensorFlow Feature Columns
- Build Deep Learning Regression Model with Sequential API and TensorFlow Probability Layers
- Evaluating Potential Model Biases with Aequitas Toolkit
Begin by opening `/src/student_project_EY_completed.ipynb` with Jupyter Notebook.
Inputs:
- Udacity Dataset: `src/data/final_project_dataset.csv`
- Admission Type ID: `src/data_schema_references/IDs_mapping.csv`
- NDC Codes to Drugs Lookup Table: `src/data_schema_references/ndc_lookup_table.csv`
- Dataset Schema: `src/data_schema_references/project_data_schema.csv`
- NDC Codes to Drugs Lookup Table (copy): `src/medication_lookup_tables/final_ndc_lookup_table`
Output:
- Trained deep neural network regression model with TensorFlow Probability layers in the notebook
- Predictions output in `/out/pred_test_df3.csv`
- Data preparation begins in section 3. The project dataset is imported into a pandas DataFrame. There are medical code reference tables in `src/data_schema_references` that translate medical and medicine codes into descriptions. These are also imported into DataFrames.
- An exploratory data analysis is performed to understand the data and demographics.
- The dataset is then transformed from the line level into an aggregated encounter level. In other words, individual medical code entries are aggregated by individual patient visits.
- Select categorical and numerical features to use for the model.
- Split the dataset into a 60%/20%/20% train/validation/test split and ensure that the demographics are reflective of the overall dataset. The `patient_dataset_splitter` function was completed in the `student_utils.py` module (a hedged splitting sketch follows this list).
- Use the TensorFlow Feature Columns API to create categorical features and embedding columns for each feature. The `create_tf_categorical_feature_cols` function was completed in the `student_utils.py` module.
- Use the TensorFlow Feature Columns API to create numeric features. The `create_tf_numeric_feature` function was completed in the `student_utils.py` module.
- Build and train a deep learning regression model with Keras Dense layers and TensorFlow Probability layers (a feature-column and model sketch also follows this list).
- Convert regression output to classification for patient selection.
- Use scikit-learn `classification_report` and `confusion_matrix` functions to compare the patient selection performance of the trained model against actual hospitalization durations.
- Evaluate potential model biases with the Aequitas toolkit. Visualizations show whether there are model biases for gender and race demographics.
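As referenced in the splitting step above, here is a minimal sketch of a patient-level split. It assumes a `patient_nbr` column identifies patients and uses a simple random permutation; the actual `patient_dataset_splitter` in `student_utils.py` may differ, for example in how it keeps the demographic distribution representative.

```python
import numpy as np
import pandas as pd

def split_by_patient(df: pd.DataFrame, patient_col: str = "patient_nbr",
                     train_frac: float = 0.6, val_frac: float = 0.2, seed: int = 42):
    """Split at the patient level so no patient appears in more than one partition."""
    rng = np.random.default_rng(seed)
    patients = rng.permutation(df[patient_col].unique())
    n = len(patients)
    train_ids = set(patients[: int(train_frac * n)])
    val_ids = set(patients[int(train_frac * n): int((train_frac + val_frac) * n)])

    train_df = df[df[patient_col].isin(train_ids)]
    val_df = df[df[patient_col].isin(val_ids)]
    test_df = df[~df[patient_col].isin(train_ids | val_ids)]
    return train_df, val_df, test_df
```

Splitting by patient rather than by row prevents encounters from the same patient from leaking across the train, validation, and test sets.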
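For the feature-column and model-building steps, the sketch below shows one way to feed TensorFlow Feature Columns into a Keras Sequential model that ends in a TensorFlow Probability `DistributionLambda` layer. The feature names, vocabulary, layer sizes, and scale parameterization are assumptions for illustration, not the notebook's exact architecture.

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Assumed example features: one categorical code column and one numeric column.
diagnosis_col = tf.feature_column.categorical_column_with_vocabulary_list(
    "primary_diagnosis_code", vocabulary_list=["250.00", "401.9", "414.01"])
feature_columns = [
    tf.feature_column.embedding_column(diagnosis_col, dimension=10),
    tf.feature_column.numeric_column("age"),
]

model = tf.keras.Sequential([
    tf.keras.layers.DenseFeatures(feature_columns),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    # Two outputs parameterize a Normal distribution over hospitalization days.
    tf.keras.layers.Dense(2),
    tfp.layers.DistributionLambda(
        lambda t: tfd.Normal(loc=t[..., :1],
                             scale=1e-3 + tf.math.softplus(0.05 * t[..., 1:]))),
])

# Train by minimizing the negative log-likelihood of the observed durations.
negloglik = lambda y, rv_y: -rv_y.log_prob(y)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss=negloglik)
```

Because the final layer outputs a distribution rather than a point estimate, both the predicted mean and its uncertainty can inform which patients to include in the trial.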
This project is licensed under the MIT License - see the LICENSE.md file for details.