Instruction Tuning Large Language Models to Understand Electronic Health Records

Authors: Zhenbang Wu, Anant Dadu, Michael Nalls, Faraz Faghri, Jimeng Sun

Published at: NeurIPS 2024 Datasets and Benchmarks Track (Spotlight)

[📑Paper] [🪧Poster] [📽️Slides]

Release

  • [19 Dec 2024] The trained model weights are released
  • [11 Dec 2024] A sample dataset with ~100 patients is added
  • [11 Dec 2024] Code for dataset creation, model training, and response evaluation is released

Contents

  • Core Dependencies
  • Data Download
  • Model Download
  • Evaluate
  • Train
  • Dataset Creation
  • Notes on Model Enhancements
  • Citation

Core Dependencies

python 3.9
torch 2.3.0
transformers 4.44.0
peft 0.10.0
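
A quick way to confirm that a local environment matches these pins is to print the installed versions; this is a convenience sketch, not part of the repository:

# Sketch: verify that installed versions match the pinned core dependencies.
import sys
from importlib.metadata import version

print("python", sys.version.split()[0])        # expect 3.9.x
for pkg in ("torch", "transformers", "peft"):
    print(pkg, version(pkg))                   # expect 2.3.0 / 4.44.0 / 0.10.0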

Data Download

The MIMIC-Instr dataset will be hosted on PhysioNet once the preparation and review process is complete.

A sample dataset generated from the MIMIC-IV Demo database is available in the sample_data directory.

For early access to the full dataset, please reach out to Zhenbang Wu (zw12@illinois.edu) with your CITI training report.

Model Download

The pre-trained model checkpoints can be found on the Hugging Face model hub: zzachw12/llemr-v1.

You can load the model using the following code snippet:

from peft import PeftModel
from src.model.init_llemr import init_llemr

# Define paths for the base model and LoRA weights
llm_pretrained_model_name_or_path = "lmsys/vicuna-7b-v1.5"
lora_name_or_path = "zzachw12/llemr-v1"

# Initialize the base model and tokenizer
model, tokenizer = init_llemr(llm_pretrained_model_name_or_path, hidden_size=1027)

# Integrate the LoRA weights into the model
model = PeftModel.from_pretrained(model, lora_name_or_path)

Note: This model requires pre-computed event embeddings generated by BiomedBERT. Follow the steps in Evaluate to preprocess the data, generate responses, and evaluate the model.
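
For orientation, here is a minimal sketch of how a single event embedding could be computed with BiomedBERT through Hugging Face transformers. The repository's 06_precompute_event_embeddings.py is the authoritative implementation; the event text, pooling choice, and feature layout below are assumptions.

# Sketch (not the repository script): embed one event description with BiomedBERT.
import torch
from transformers import AutoModel, AutoTokenizer

encoder_name = "microsoft/BiomedNLP-BiomedBERT-large-uncased-abstract"
tokenizer = AutoTokenizer.from_pretrained(encoder_name)
encoder = AutoModel.from_pretrained(encoder_name)

event_text = "Lab event: creatinine 1.4 mg/dL"   # hypothetical event description
inputs = tokenizer(event_text, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = encoder(**inputs)

# Mean-pool token embeddings into a single 1024-d event vector (pooling choice is an assumption).
event_embedding = outputs.last_hidden_state.mean(dim=1).squeeze(0)
print(event_embedding.shape)  # torch.Size([1024])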

Evaluate

  1. Download the MIMIC-Instr dataset from PhysioNet

  2. Run steps 1, 4, 7, and 8 in Dataset Creation to prepare the event sequence data and pre-compute the event embeddings

  3. Generate the model response with query_llemr.ipynb

  4. Compare the model responses against the GPT-4 reference answers using eval.ipynb (requires the Azure OpenAI service; an illustrative judging sketch follows this list)

  5. Summarize the results with summary_eval.ipynb
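
The repository's eval.ipynb defines the actual comparison. Purely as an illustration, a GPT-4-as-judge call through the Azure OpenAI service might look like the sketch below; the deployment name, API version, and prompt wording are assumptions.

# Illustrative sketch of a GPT-4-as-judge call via Azure OpenAI; not the repository's eval.ipynb.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # assumed API version
)

model_response = "..."      # response generated by LLEMR
reference_answer = "..."    # GPT-4 reference answer from the dataset

judge_prompt = (
    "Compare the candidate answer to the reference answer and rate its correctness from 1 to 10.\n"
    f"Reference: {reference_answer}\nCandidate: {model_response}"
)
completion = client.chat.completions.create(
    model="gpt-4",  # assumed Azure deployment name
    messages=[{"role": "user", "content": judge_prompt}],
)
print(completion.choices[0].message.content)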

Train

  1. Download the MIMIC-Instr dataset from PhysioNet

  2. Run steps 1, 4, 7, and 8 in Dataset Creation to prepare the event sequence data and pre-compute the event embeddings

  3. Run the training script train.py (a generic LoRA setup sketch follows this list):

    • CMD: sh src/train/train.sh
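
For readers unfamiliar with peft, the sketch below shows the general shape of attaching LoRA adapters to a causal LM for fine-tuning. The rank, target modules, and other hyperparameters here are assumptions; the actual configuration used by train.py and train.sh may differ.

# Sketch of a generic LoRA setup with peft; hyperparameters are illustrative, not the repository's.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")
lora_config = LoraConfig(
    r=16,                                   # assumed rank
    lora_alpha=32,                          # assumed scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # assumed target projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()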

Dataset Creation

  1. Download the MIMIC-IV dataset into the raw_data directory

  2. Download the MIMIC-IV-Note dataset into the raw_data directory

  3. Run the following Jupyter notebook to select the patient cohort: 01_cohort_selection.ipynb

  4. Run the Jupyter notebooks that prepare the event sequence data

  5. Run the Jupyter notebooks that generate the instruction tuning data

  6. Split the data into train, validation, and test sets (a patient-level split sketch follows this list)

  7. Pre-compute the event embeddings with 06_precompute_event_embeddings.py:

    • CMD: sh src/preprocess/precompute_event_embeddings.sh
  8. Generate the GPT-4 reference answers with query_gpt4.ipynb
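
One common way to split EHR instruction data is at the patient level, so that no patient appears in more than one split. The sketch below illustrates this under assumed column names and split ratios; the repository's split notebook is the authoritative implementation.

# Sketch: patient-level train/val/test split; column names and ratios are assumptions.
import numpy as np
import pandas as pd

instructions = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3, 4],       # hypothetical instruction-tuning rows
    "question": ["q"] * 6,
    "answer": ["a"] * 6,
})

rng = np.random.default_rng(seed=42)
patients = instructions["patient_id"].unique()
rng.shuffle(patients)

# 80/10/10 split over patients (ratios are assumptions).
n = len(patients)
train_ids = set(patients[: int(0.8 * n)])
val_ids = set(patients[int(0.8 * n): int(0.9 * n)])

train = instructions[instructions["patient_id"].isin(train_ids)]
val = instructions[instructions["patient_id"].isin(val_ids)]
test = instructions[~instructions["patient_id"].isin(train_ids | val_ids)]
print(len(train), len(val), len(test))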

Notes on Model Enhancements

This repository incorporates several minor improvements over the original implementation described in the paper:

  1. Enhanced Event Encoder:

    • Replaced ClinicalBERT (emilyalsentzer/Bio_ClinicalBERT) with BiomedBERT-large (microsoft/BiomedNLP-BiomedBERT-large-uncased-abstract), improving the quality of event embeddings
  2. Improved Event Embedding:

    • Concatenated event timestamps and numeric values (where available) to the final event embeddings, resulting in better representation of time-sensitive and quantitative data (see the concatenation sketch at the end of this section)
  3. Expanded Dataset:

    • Increased the size of the clinical reasoning subset to 100K examples, doubling the data from the original 50K subset for more comprehensive coverage.
  4. Unified Training Approach:

    • Adopted a single-step training process that integrates schema alignment and clinical reasoning subsets simultaneously, streamlining the training pipeline

Together, these changes improve the model's ability to interpret and reason over EHR data, yielding better performance than the original implementation described in the paper.
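
Enhancement 2 appends scalar features to each event's text embedding. A minimal sketch of that concatenation follows; the specific extra features and their order are assumptions (the hidden_size=1027 used when loading the model suggests three extra dimensions on top of BiomedBERT-large's 1024), and the preprocessing scripts define the real layout.

# Sketch: concatenate scalar event features onto a BiomedBERT text embedding.
# The choice and order of extra features are assumptions, not the repository's exact layout.
import torch

text_embedding = torch.randn(1024)            # BiomedBERT-large output for the event text
timestamp = torch.tensor([12.5])              # e.g., hours since admission (assumed encoding)
numeric_value = torch.tensor([1.4])           # the event's numeric value, if any
value_mask = torch.tensor([1.0])              # assumed indicator for "value present"

event_embedding = torch.cat([text_embedding, timestamp, numeric_value, value_mask])
print(event_embedding.shape)                  # torch.Size([1027]), matching hidden_size=1027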

Citation

If you find this work useful, please cite:

@inproceedings{
    wu2024instruction,
    title={Instruction Tuning Large Language Models to Understand Electronic Health Records},
    author={Zhenbang Wu and Anant Dadu and Michael Nalls and Faraz Faghri and Jimeng Sun},
    booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
    year={2024},
    url={https://openreview.net/forum?id=Dgy5WVgPd2}
}

* Note: The teaser image above the title is generated by ChatGPT.
