External Validation for AKI Predictive Models using PCORnet CDM data

This is the working directory building and validating predictive models for Acute Kidney Injury (AKI) based on PCORNET common data model across GPC sites.

by Xing Song, with Mei Liu, Russ Waitman Medical Informatics Division, Univeristy of Kansas Medical Center

Copyright (c) 2018 Univeristy of Kansas Medical Center
Share and Enjoy according to the terms of the MIT Open Source License.

Background

Acute Kidney Injury (AKI) is a common and highly lethal health problem, affecting 10-15% of all hospitalized patients and >50% of the intensive care unit (ICU) patients. In this application, we propose to build predictive models to identify patients at risk for hospital-acquired AKI and externally validate the models using the PCORnet (Patient Centered Outcomes Research Network)13 common data model (CDM) infrastructure (GPC#711,GPC#742). The project will be carried out with the following aims:

Aim 1 - KUMC: Building predictive models on single-site data. We will develop and internally cross-validate machine learning based predictive models for in-hospital AKI using electronic medical record (EMR) data from the University of Kansas Medical Center’s (KUMC) PCORnet CDM. As co-I of the PCORnet network Greater Plains Collaborative (GPC), PI of this project has direct access to the KUMC CDM for model development.
* Task 1.1: developing R-markdown implementation for data extraction and quality check
* Task 1.2: exploratory data analysis (e.g. strategies for data cleaning and representation, feature engineering)
* Task 1.3: benchmarking with replication of current state-of-art prediction model
* Task 1.4: developing new models
Aim 2 - GPC: Validating predictive models on multi-site data. We will implement an automated analytic package with built in data extraction and predictive modeling from Aim 1 for distributed execution within two PCORnet clinical data research networks (CDRNs), namely GPC led by Dr. Waitman and Veterans Health Administration (VHA) site led by Dr. Matheny in pSCANNER. All prototyping will be done on the KUMC CDM.
* Task 2.1: deploying R-markdown implementation codes for quanlity check, and reporting results to KUMC
* Task 2.2: deploying R-markdown implementation codes or stand-along SQL codes to extract line-item data and transfer to KUMC
--- Alternatively, site can choose to re-use GROUSE data without performing any local extractions.
* Task 2.3 (distributed analytics): deploying R-markdown implementaion codes for external validations of predictive models ( Not fully tested yet )

Aim 3 - KUMC: Interpreting the AKI prediction model by analyzing and visualizing top important predictors
* Task 3.1: rank features based on their importance in improving model performance
* Task 3.2: use SHAP value to evaluate marginal effects of top important predictors
* Task 3.3: create dashboard for visualizing feature ranking and marginal plots
Aim 4 - KUMC: Developing joint KL-divergence and adjMMD metrics to infer model transportability and potentially identify reasons imparing model transportability
* Task 4.1: identify overlapped and site-specific feature space
* Task 4.2: develop metrics, joint KL-divergence and adjMMD, to quantify joint distribution hetergenity, considering missing value patterns
* Task 4.3: invetigate the usability of the new metrics

Data Preprocessing

For each patient in the data set, we collected all variables that are currently populated in the PCORnet CDM schema over both source and target sites, which includes:

general demographic details (i.e., age, gender and race),
all structured clinical variables that are currently supported by PCORnet CDM Version 4,
- including diagnoses (ICD-9 and ICD-10 codes), procedures (ICD and CPT codes),
- lab tests (LOINC codes), medications (RXNORM and NDC codes),
- selected vital signs (e.g., blood pressure, height, weight, BMI)30.

All variables are time-stamped and every patient in the dataset was represented by a sequence of clinical events construed by clinical observation vectors aggregated on daily basis.

The initial feature set contained more than 30,000 distinct features. We performed an automated curation process as follows:

systematically identified extreme values of numerical variables (e.g., lab test results and vital signs) that are beyond 1st and 99th percentile as outliers and removed them;
performed one-hot-coding on categorical variables (e.g., diagnosis and procedure codes) to convert them into binary representations;
used the cumulative-exposure-days of medications as predictors instead of a binary indicator for the sheer existence of that medication;
when repeated measurements presented within certain time interval, we chose the most recent value;
when measurements are missing for a certain time interval, we performed a common sampling practice called sample-and-hold which carried the earlier available observation over;
introduced additional features such as lab value changes since last observation or daily blood pressure trends, which have been shown to be predictive of AKI10.

The discrete-time survival model (DTSA) required converting the encounter-level data into an Encounter-Period data set with discrete time interval indicator (i.e. day1, day2, day3,...). More details about this conversion can be found in the format_data() and get_dsurv_temporal() functions from /R/util.R. As shown in the figure below: ,

AKI patient at days of AKI onset contributed to positive outcomes, while earlier non-AKI days of AKI patients as well as daily outcomes of truely non-AKI patients (i.e. who never progressed to any stage of AKI) contributed to nefative outcomes.

Requirements

In order for sites to extract AKI cohort, run predictive models and generate final report, the following infrastructure requirement must be satisfied:

R program: R Program (>=3.3.0) is required and R studio (>= 1.0.136) is preferred to be installed as well for convenient report generation.
DBMS connection: Valid channel should also be established between R and DBMS so that communication between R and CDM database can be supported.
Dependencies: A list of core R packages as well as their dependencies are required. However, their installations have been included in the codes.

DBI (>=0.2-5): for communication between R and relational database
ROracle (>=1.3-1): an Oracle JDBC driver
RJDBC: a SQL sever driver
RPostgres: a Postgres driver
rmarkdown (>=1.10): for rendering report from .Rmd file (Note: installation may trip over dependencies digest and htmltools (>=0.3.5), when manually installation is required).
dplyr (>=0.7.5): for efficient data manipulation
tidyr (>=0.8.1): for efficient data manipulation
magrittr (>=1.5): to enable pipeline operation
stringr (>=1.3.1): for handling strings
knitr (>=1.11): help generate reports
kableExtra: for generating nice tables
ggplot2 (>=2.2.1): for generating nice plots
ggrepel: to avoid overlapping labels in plots
openxlsx (>=4.1.0): to save tables into multiple sheets within a single .xlsx file
RCurl: for linkable descriptions (when uploading giant mapping tables are not feasible)
XML: for linkable descriptions (when uploading giant mapping tables are not feasible)
xgboost: for effectively training the gradient boosting machine
pROC: for calculating receiver operating curve
PRROC: for calculating precision recall curve
ParBayesianOptimization: a parallel implementation for baysian optimizaion for xgboost
doParallel: provide backend for parallelization

Site Usage for Model Validation

The following instructions are for extracting cohort and generating final report from a DBMS data source (specified by DBMS_type) (available options are: Oracle, tSQL, PostgreSQL(not yet))

Part I: Study Cohort and Variable Extraction

Get AKI_CDM code

download the AKI_CDM repository as a .zip file, unzip and save folder as path-to-dir/AKI_CDM
OR
clone AKI_CDM repository (using git command):
i) navigate to the local directory path-to-dir where you want to save the project repository and
type command line: $ cd <path-to-dir>
ii) clone the AKI_CDM repository by typing command line: $ git clone https://github.com/kumc-bmi/AKI_CDM

Prepare configeration file config.csv and save in the AKI_CDM project folder
i) download the config_<DBMS_type>_example.csv file according to DBMS_type
ii) fill in the content accordingly (or you can manually create the file using the following format)

username password access cdm_db_name/sid cdm_db_schema temp_db_schema

your_username your_passwd host:port database name(tSQL)/SID(Oracle) current CDM schema default schema

iii) save as config.csv under the same directory

Extract AKI cohort and generate final report
i) setup working directory
- In r-studio environment, simply open R project AKI_CDM.Rproj within the folder OR
- In plain r environment, set working directory to where AKI_CDM locates by runing setwd("path-to-dir/AKI_CDM")

ii) edit r script render_report.R by specifying the following parameters:
- which_report: which report you want to render (default is ./report/AKI_CDM_EXT_VALID_p1_QA.Rmd, but there will be more options in the future)
- DBMS_type: what type of database the current CDM is built on (available options are: Oracle(default), tSQL)
- driver_type: what type of database connection driver is available (available options are: OCI or JDBC for oracle; JDBC for sql server)
- start_date, end_date: the start and end date of inclusion period, in "yyyy-mm-dd" format (e.g. "2010-01-01", with the quotes)

iii) run Part I of r script render_report.R after assigning correct values to the parameters in ii)

iv) collect and report all output files from /output folder
-- a. AKI_CDM_EXT_VALID_p1_QA.html - html report with description, figures and partial tables
-- b. AKI_CDM_EXT_VALID_p1_QA_TBL.xlsx - excel with full summary tables

Remark: all the counts (patient, encounter, record) are masked as "<11" if the number is below 11

Part II: Validate Existing Predictive Models and Retrain Predictive Models with Local Data (Not fully tested yet)

Validate the given predictive model trained on KUMC's data

i) download the predictive model package, "AKI_model_kumc.zip", from the securefile link shared by KUMC. Unzip the file and save everything under ./data/model_kumc (remark: make sure to save the files under the correct directory, as they will be called later using the corresponding path)

ii) continue to run Part II.0 of the r script render_report.R after completing Part I. Part II.0 will only depend on tables already extracted from Part I (saved locally in the folder ./data/raw/...), no parameter needs to be set up.

iii) continue to run Part II.1 of the r script render_report.R after completing Part II.0. Part II.1 will only depend on tables already extracted from Part II.0 (saved locally in the folder ./data/preproc/...), no parameter needs to be set up.

iv) collect and report the two new output files from /output folder
-- a. AKI_CDM_EXT_VALID_p2_1_Benchmark.html - html report with description, figures and partial tables
-- b. AKI_CDM_EXT_VALID_p2_1_Benchmark_TBL.xlsx - excel with full summary tables
Retrain the model using local data and validate on holdout set

i) download the data dictionary, "feature_dict.csv", from the securefile link shared by KUMC and save the file under "./ref/" (remark: make sure to save the file under the correct directory, as it will be called later using the corresponding path)

ii) continue to run (optionally, if already run for Part II.1) Part II.0 of the r script render_report.R after completing Part I. Part II.0 will only depend on tables already extracted from Part I (saved locally in the folder ./data/raw/...), no parameter needs to be set up.

iii) continue to run Part II.2 of the r script render_report.R after completing Part I. Part II.2 will only depend on tables already extracted from Part II.0 (saved locally in the folder ./data/preproc/...), no parameter needs to be set up.

iv) collect and report the two new output files from /output folder
-- a. AKI_CDM_EXT_VALID_p2_2_Retrain.html - html report with description, figures and partial tables
-- b. AKI_CDM_EXT_VALID_p2_2_Retrain_TBL.xlsx - excel with full summary tables

*Remark: As along as Part I is completed, Part II.1 and Part II.2 can be run independently, based on each site's memory and disk availability.

Run the distribution_analysis.R script to calculate the adjMMD and joint KL-divergence of distribution hetergenity among top important variables of each model. adjMMD is an effective metric which can be used to assess and explain model transportability.

Benchmarking

a. It takes about 2 ~ 3 hours to complete Part I (AKI_CDM_EXT_VALID_p1_QA.Rmd). At peak time, it will use about 30 ~ 40GB memory, especially when large tables like Precribing or Lab tables are loaded in. Total size of output for Part I is about 6MB.

b. It takes about 60 ~ 70 hours (6hr/task) to complete Part II.0 (AKI_CDM_EXT_VALID_p2_0_Preprocess.Rmd). At peak time, it will use about 50 ~ 60GB memory, especially at the preprocessing stage. Total size of intermediate tables and output for Part II.0 is about 2GB.

c. It takes about 25 ~ 30 hours to complete Part II.1 (AKI_CDM_EXT_VALID_p2_Benchmark.Rmd). At peak time, it will use about 30 ~ 40GB memory, especially at the preprocessing stage. Total size of intermediate tables and output for Part II.1 is about 600MB.

d. It takes about 40 ~ 50 hours to complete Part II.2 (AKI_CDM_EXT_VALID_p2_Retrain.Rmd). At peak time, it will use about 30 ~ 40GB memory, especially at the preprocessing stage. Total size of intermediate tables and output for Part II.2 is about 800MB.

SHAP Value Interpretation

We used Shapely Additive exPlanations (SHAP) values to evaluate the marginal effects of the shared top important variables of interests 34. Specifically, the SHAP values evaluated how the logarithmic odds ratio changed by including a factor of certain value for each individual patient. The SHAP values not only captured the global patterns of effects of each factor but also demonstrated the patient-level variations of the effects. We also estimated 95% bootstrapped confidence intervals of SHAP values for each selected feature based on 100 bootstrapped samples. Visit our SHAP value dashboard for more details.

Adjusted Maxinum Mean Discrepancy (adjMMD)

The Maximum Mean Discrepancy (MMD) has been widely used in transfer learning studies for maximizing the similarity among distributions of different domains. Here we modified the classic MMD by taking the missing pattern and feature importance into consideration, which is used to measure the similarities of distributions for the same feature between training and validation sites. Visit our paper Cross-Site Transportability of an Explainable Artificial Intelligence Model for Acute Kidney Injury Prediction (DOI:10.1038/s41467-020-19551-w) for more details.

updated 11/10/2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

External Validation for AKI Predictive Models using PCORnet CDM data

Background

Data Preprocessing

Requirements

Site Usage for Model Validation

Part I: Study Cohort and Variable Extraction

Part II: Validate Existing Predictive Models and Retrain Predictive Models with Local Data (Not fully tested yet)

Benchmarking

SHAP Value Interpretation

Adjusted Maxinum Mean Discrepancy (adjMMD)

About

Releases

Packages

Contributors 3

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 471 Commits
.ipynb_checkpoints		.ipynb_checkpoints
R		R
config		config
dashboard		dashboard
data		data
figure		figure
inst		inst
output		output
ref		ref
report		report
src		src
test		test
.gitignore		.gitignore
AKI_CDM.Rproj		AKI_CDM.Rproj
README.md		README.md
render_report.R		render_report.R

kumc-bmi/AKI_CDM

Folders and files

Latest commit

History

Repository files navigation

External Validation for AKI Predictive Models using PCORnet CDM data

Background

Data Preprocessing

Requirements

Site Usage for Model Validation

Part I: Study Cohort and Variable Extraction

Part II: Validate Existing Predictive Models and Retrain Predictive Models with Local Data (Not fully tested yet)

Benchmarking

SHAP Value Interpretation

Adjusted Maxinum Mean Discrepancy (adjMMD)

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages