Authors: Zhenbang Wu, Anant Dadu, Michael Nalls, Faraz Faghri, Jimeng Sun
Published at: NeurIPS 2024 Datasets and Benchmarks Track (Spotlight)
- [19 Dec 2024] The trained model weights are released
- [11 Dec 2024] A sample dataset with ~100 patients is added
- [11 Dec 2024] Code for dataset creation, model training, and response evaluation is released
- Core Dependencies
- Data Download
- Model Download
- Evaluate
- Train
- Dataset Creation
- Notes on Model Enhancements
- Citation
- python 3.9
- torch 2.3.0
- transformers 4.44.0
- peft 0.10.0
The MIMIC-Instr dataset will be hosted on PhysioNet once the preparation and review process is complete.
A sample dataset generated from the MIMIC-IV Demo database is available in the sample_data directory.
For early access to the full dataset, please reach out to Zhenbang Wu (zw12@illinois.edu) with your CITI training report.
The pre-trained model checkpoints can be found on the Hugging Face model hub: zzachw12/llemr-v1.
You can load the model using the following code snippet:
from peft import PeftModel
from src.model.init_llemr import init_llemr
# Define paths for the base model and LoRA weights
llm_pretrained_model_name_or_path = "lmsys/vicuna-7b-v1.5"
lora_name_or_path = "zzachw12/llemr-v1"
# Initialize the base model and tokenizer
model, tokenizer = init_llemr(llm_pretrained_model_name_or_path, hidden_size=1027)
# Integrate the LoRA weights into the model
model = PeftModel.from_pretrained(model, lora_name_or_path)
Note: This model requires pre-computed event embeddings generated by BiomedBERT. Follow the Evaluate section to preprocess the data, generate responses, and evaluate the model.
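As a quick sanity check after loading, the snippet below moves the model to a GPU and runs text-only generation. This is a minimal sketch that assumes the wrapper returned by init_llemr exposes the standard Hugging Face generate API; real inference additionally feeds the pre-computed event embeddings, as done in query_llemr.ipynb, and the prompt shown here is purely illustrative.

```python
import torch

# Move the PEFT-wrapped model to the GPU (if available) and switch to inference mode
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Text-only sanity check; full LLEMR inference also consumes pre-computed
# event embeddings (see query_llemr.ipynb)
prompt = "What abnormal lab results does this patient have?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```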
1. Download the MIMIC-Instr dataset from PhysioNet
2. Run steps 1, 4, 7, 8 in Dataset Creation to prepare the event sequence data and pre-compute the event embeddings
3. Generate the model response with query_llemr.ipynb
4. Compare the model response with the GPT-4 reference answer using eval.ipynb (requires the OpenAI Azure service); see the sketch after this list
5. Summarize the results with summary_eval.ipynb
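The comparison in eval.ipynb is an LLM-as-judge call to GPT-4 through the Azure OpenAI service. The snippet below is a minimal sketch of that pattern, assuming the openai Python client (v1+), environment variables for the endpoint and key, and a hypothetical deployment name and judging prompt; the actual prompt and scoring rubric are defined in eval.ipynb.

```python
import os
from openai import AzureOpenAI

# Azure OpenAI client; endpoint, key, and API version come from the environment
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def judge(question: str, reference: str, candidate: str) -> str:
    """Ask GPT-4 to grade a model response against the GPT-4 reference answer."""
    response = client.chat.completions.create(
        model="gpt-4",  # hypothetical Azure deployment name
        messages=[
            {"role": "system", "content": "You are an impartial judge of clinical QA answers."},
            {"role": "user", "content": (
                f"Question:\n{question}\n\nReference answer:\n{reference}\n\n"
                f"Candidate answer:\n{candidate}\n\n"
                "Rate the candidate from 1 to 10 and briefly justify the score."
            )},
        ],
        temperature=0,
    )
    return response.choices[0].message.content
```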
1. Download the MIMIC-Instr dataset from PhysioNet
2. Run steps 1, 4, 7, 8 in Dataset Creation to prepare the event sequence data and pre-compute the event embeddings
3. Run the training script train.py (see the LoRA sketch after this list):
   - CMD: sh src/train/train.sh
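For orientation, train.py fine-tunes the Vicuna backbone with LoRA adapters via peft (the released zzachw12/llemr-v1 checkpoint is such an adapter). The snippet below is only a sketch with hypothetical hyperparameters; the actual LoRA settings and how the adapter is attached are defined in train.py and src/train/train.sh.

```python
from peft import LoraConfig, get_peft_model
from src.model.init_llemr import init_llemr

# Hypothetical LoRA hyperparameters for illustration only; the real values live in train.py
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Llama/Vicuna attention projections
    task_type="CAUSAL_LM",
)

# Initialize the base model and attach a fresh, trainable adapter
model, tokenizer = init_llemr("lmsys/vicuna-7b-v1.5", hidden_size=1027)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```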
1. Download the MIMIC-IV dataset into the raw_data directory
2. Download the MIMIC-IV-Note dataset into the raw_data directory
3. Run the following Jupyter notebook to select the patient cohort: 01_cohort_selection.ipynb
4. Run the following Jupyter notebooks to prepare the event sequence data:
   - Extract events:
     - 02_event_static.ipynb
     - 02_event_hosp_diagnoses_icd.ipynb
     - 02_event_hosp_labevents.ipynb
     - 02_event_hosp_microbiologyevents.ipynb
     - 02_event_hosp_prescriptions.ipynb
     - 02_event_hosp_transfers.ipynb
     - 02_event_icu_chartevents.ipynb
     - 02_event_icu_inputevents.ipynb
     - 02_event_icu_outputevents.ipynb
     - 02_event_icu_procedureevents.ipynb
   - Merge events: 03_merge_events.ipynb
5. Run the following Jupyter notebooks to generate the instruction tuning data (only needed if you want to generate the instruction tuning data on your own):
   - Generate the schema alignment subset:
     - 04_template_qa_event.ipynb
     - 04_paraphrase_qa_event.ipynb (requires the OpenAI Azure service)
   - Generate the instruction following subset:
     - 04_generate_qa_note.ipynb (requires the OpenAI Azure service)
6. Split the data into train, validation, and test sets
7. Pre-compute the event embeddings with 06_precompute_event_embeddings.py (see the sketch after this list):
   - CMD: sh src/preprocess/precompute_event_embeddings.sh
8. Generate the GPT-4 reference answer with query_gpt4.ipynb
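For intuition, step 7 encodes each event's text with BiomedBERT and, per the Notes on Model Enhancements below, concatenates the event timestamp and numeric value (where available) to the text embedding. The snippet below is a rough sketch of that idea; the event text shown is hypothetical, and the exact feature layout and final embedding dimension (1027 in init_llemr) are determined by 06_precompute_event_embeddings.py.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# BiomedBERT-large encoder used in this repo (see Notes on Model Enhancements)
encoder_name = "microsoft/BiomedNLP-BiomedBERT-large-uncased-abstract"
tokenizer = AutoTokenizer.from_pretrained(encoder_name)
encoder = AutoModel.from_pretrained(encoder_name).eval()

@torch.no_grad()
def embed_event(text: str, timestamp: float, value: float = 0.0) -> torch.Tensor:
    """Encode one event as [CLS] text embedding ++ timestamp ++ numeric value (assumed layout)."""
    tokens = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    cls_embedding = encoder(**tokens).last_hidden_state[:, 0, :].squeeze(0)  # 1024-dim for the large model
    extras = torch.tensor([timestamp, value])
    return torch.cat([cls_embedding, extras])

# Hypothetical event: a creatinine lab result 3.5 hours after admission
emb = embed_event("labevents: Creatinine 1.4 mg/dL", timestamp=3.5, value=1.4)
```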
This repository incorporates several minor improvements over the original implementation described in the paper:
- Enhanced Event Encoder:
  - Replaced ClinicalBERT (emilyalsentzer/Bio_ClinicalBERT) with BiomedBERT-large (microsoft/BiomedNLP-BiomedBERT-large-uncased-abstract), improving the quality of event embeddings
- Improved Event Embedding:
  - Concatenated event timestamps and numeric values (where available) to the final event embeddings, resulting in better representation of time-sensitive and quantitative data
- Expanded Dataset:
  - Increased the size of the clinical reasoning subset to 100K examples, doubling the original 50K subset for more comprehensive coverage
- Unified Training Approach:
  - Adopted a single-step training process that integrates the schema alignment and clinical reasoning subsets simultaneously, streamlining the training pipeline
These changes collectively improve the model's ability to interpret and reason over EHR data compared with the original implementation described in the paper.
If you find this work useful, please cite:
@inproceedings{
wu2024instruction,
title={Instruction Tuning Large Language Models to Understand Electronic Health Records},
author={Zhenbang Wu and Anant Dadu and Michael Nalls and Faraz Faghri and Jimeng Sun},
booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=Dgy5WVgPd2}
}
* Note: The teaser image above the title was generated by ChatGPT.