Instruction Tuning Large Language Models to Understand Electronic Health Records

Authors: Zhenbang Wu, Anant Dadu, Michael Nalls, Faraz Faghri, Jimeng Sun

Published at: NeurIPS 2024 Datasets and Benchmarks Track (Spotlight)

[📑Paper] [🪧Poster] [📽️Slides]

Release

  • [19 Dec 2024] The trained model weights are released
  • [11 Dec 2024] A sample dataset with ~100 patients is added
  • [11 Dec 2024] Code for dataset creation, model training, and response evaluation is released

Core Dependencies

python 3.9
torch 2.3.0
transformers 4.44.0
peft 0.10.0
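
If a pinned requirements file is convenient, the versions above can be captured in one (a sketch; the package names are the standard PyPI ones, and the repo may already ship its own environment file):

```
# requirements.txt (versions copied from the list above)
torch==2.3.0
transformers==4.44.0
peft==0.10.0
```

Install with `pip install -r requirements.txt` inside a Python 3.9 environment.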

Data Download

The MIMIC-Instr dataset will be hosted on PhysioNet once the preparation and review process is complete.

A sample dataset generated from the MIMIC-IV Demo database is available in the sample_data directory.

For early access to the full dataset, please reach out to Zhenbang Wu (zw12@illinois.edu) with your CITI training report.

Model Download

The pre-trained model checkpoints can be found on the Hugging Face model hub: zzachw12/llemr-v1.

You can load the model using the following code snippet:

from peft import PeftModel
from src.model.init_llemr import init_llemr

# Define paths for the base model and LoRA weights
llm_pretrained_model_name_or_path = "lmsys/vicuna-7b-v1.5"
lora_name_or_path = "zzachw12/llemr-v1"

# Initialize the base model and tokenizer
model, tokenizer = init_llemr(llm_pretrained_model_name_or_path, hidden_size=1027)

# Integrate the LoRA weights into the model
model = PeftModel.from_pretrained(model, lora_name_or_path)

Note: This model requires pre-computed event embeddings generated by BiomedBERT. Follow the steps in Evaluate to preprocess the data, generate responses, and evaluate the model.
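
Before embedding, structured events must be rendered as text for the BiomedBERT encoder. A minimal sketch of that conversion (the field names and template here are hypothetical; the actual logic lives in the preprocessing notebooks in this repo):

```python
# Sketch: converting structured clinical events into text strings
# before passing them to the BiomedBERT encoder. The event schema and
# the output template are assumptions, not the repo's exact format.

def event_to_text(event: dict) -> str:
    """Render one clinical event as a single text line for the encoder."""
    parts = [event["type"], event["name"]]
    if event.get("value") is not None:
        parts.append(str(event["value"]))
    if event.get("unit"):
        parts.append(event["unit"])
    return " ".join(parts)

events = [
    {"type": "labevent", "name": "Glucose", "value": 112, "unit": "mg/dL"},
    {"type": "medication", "name": "Metformin", "value": None, "unit": ""},
]

texts = [event_to_text(e) for e in events]
print(texts)  # ['labevent Glucose 112 mg/dL', 'medication Metformin']
```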

Evaluate

  1. Download the MIMIC-Instr dataset from PhysioNet

  2. Run steps 1, 4, 7, and 8 in Dataset Creation to prepare the event sequence data and pre-compute the event embeddings

  3. Generate the model response with query_llemr.ipynb

  4. Compare the model responses with the GPT-4 reference answers using eval.ipynb (requires the Azure OpenAI service)

  5. Summarize the results with summary_eval.ipynb
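
The summarization step aggregates the pairwise comparisons from the previous step. A minimal sketch of that kind of aggregation (the judgment labels and metric are hypothetical, not the exact logic of summary_eval.ipynb):

```python
# Sketch: aggregating pairwise judge outcomes into rates.
# The 'win'/'tie'/'loss' labels are an assumption about the eval output.
from collections import Counter

def summarize(judgments):
    """Return the fraction of each outcome ('win', 'tie', 'loss')."""
    counts = Counter(judgments)
    total = len(judgments)
    return {k: counts.get(k, 0) / total for k in ("win", "tie", "loss")}

print(summarize(["win", "win", "tie", "loss"]))
# {'win': 0.5, 'tie': 0.25, 'loss': 0.25}
```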

Train

  1. Download the MIMIC-Instr dataset from PhysioNet

  2. Run steps 1, 4, 7, and 8 in Dataset Creation to prepare the event sequence data and pre-compute the event embeddings

  3. Run the training script train.py:

    • CMD: sh src/train/train.sh

Dataset Creation

  1. Download the MIMIC-IV dataset into the raw_data directory

  2. Download the MIMIC-IV-Note dataset into the raw_data directory

  3. Run the Jupyter notebook 01_cohort_selection.ipynb to select the patient cohort

  4. Run the Jupyter notebooks that prepare the event sequence data

  5. Run the Jupyter notebooks that generate the instruction tuning data

  6. Split the data into train, validation, and test sets

  7. Pre-compute the event embeddings with 06_precompute_event_embeddings.py:

    • CMD: sh src/preprocess/precompute_event_embeddings.sh
  8. Generate the GPT-4 reference answer with query_gpt4.ipynb

Notes on Model Enhancements

This repository incorporates several minor improvements over the original implementation described in the paper:

  1. Enhanced Event Encoder:

    • Replaced ClinicalBERT (emilyalsentzer/Bio_ClinicalBERT) with BiomedBERT-large (microsoft/BiomedNLP-BiomedBERT-large-uncased-abstract), improving the quality of event embeddings
  2. Improved Event Embedding:

    • Concatenated event timestamps and numeric values (where available) to the final event embeddings, resulting in better representation of time-sensitive and quantitative data
  3. Expanded Dataset:

    • Increased the size of the clinical reasoning subset to 100K examples, doubling the data from the original 50K subset for more comprehensive coverage.
  4. Unified Training Approach:

    • Adopted a single-step training process that integrates schema alignment and clinical reasoning subsets simultaneously, streamlining the training pipeline

Together, these changes improve the model's ability to interpret and reason over EHR data compared with the version described in the paper.
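
Improvement 2 can be illustrated with a small sketch. The dimensions and feature layout here are assumptions: BiomedBERT-large produces 1024-dimensional embeddings, and the loading snippet above passes hidden_size=1027, which suggests a few extra scalar slots (e.g. timestamp, numeric value, and a presence flag); the exact layout is defined in the preprocessing code.

```python
# Sketch: appending scalar features (time, numeric value) to a text
# embedding. The three extra slots shown here are an assumption chosen
# to match the hidden_size=1027 used in the model-loading snippet.

def augment_embedding(text_emb, timestamp_hours, value=None):
    """Concatenate time and numeric-value features onto a text embedding."""
    has_value = value is not None
    extras = [
        timestamp_hours,              # event time, e.g. hours since admission
        value if has_value else 0.0,  # numeric value, zero-filled if absent
        1.0 if has_value else 0.0,    # indicator: numeric value present?
    ]
    return list(text_emb) + extras

emb = [0.0] * 1024  # stand-in for a BiomedBERT-large embedding
augmented = augment_embedding(emb, timestamp_hours=4.5, value=112.0)
print(len(augmented))  # 1027
```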

Citation

If you find this work useful, please cite:

@inproceedings{
    wu2024instruction,
    title={Instruction Tuning Large Language Models to Understand Electronic Health Records},
    author={Zhenbang Wu and Anant Dadu and Michael Nalls and Faraz Faghri and Jimeng Sun},
    booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
    year={2024},
    url={https://openreview.net/forum?id=Dgy5WVgPd2}
}

* Note: The teaser image above the title is generated by ChatGPT.