We expose the following functionalities discussed in the paper:
- Report cleaning with large language models
- Training an image classifier for detecting positive findings
- Finetuning LLaMA on predicted conditions and indications
- Generating reports with Pragmatic-LLaMA
- Evaluating generated reports
This code mainly targets MIMIC-CXR [1], but it can be adapted to other chest X-ray datasets with small changes; we discuss the relevant changes in each section. Make sure to obtain the required license if you are working with MIMIC-CXR.
There are two environments to install: one for the project in general (`requirements.txt`) and one specifically for evaluation (`eval_requirements.txt`). We need a separate environment for evaluation because of version conflicts introduced by RadGraph.
The input to our model is a tuple of (image, indication). Please follow the relevant dataset instructions to obtain them, especially the indication section. For MIMIC-CXR, you can use create_section_files.py to extract the indication, findings, and impression sections.
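For illustration, here is a minimal sketch of how the sectioned reports could be paired with image paths into (image, indication) tuples. The file names and column names (`study`, `study_id`, `image_path`, `indication`) are assumptions about your local setup, not the repository's API.

```python
# Minimal sketch: pair each study's indication section with its image path.
# Assumes a sectioned-report CSV (e.g., produced by create_section_files.py)
# with "study" and "indication" columns, and a separate CSV mapping study IDs
# to image paths -- adjust names to your local files.
import pandas as pd

sections = pd.read_csv("mimic_cxr_sectioned.csv")   # assumed sectioned-report output
images = pd.read_csv("study_to_image.csv")          # assumed mapping: study_id -> image_path

merged = sections.merge(images, left_on="study", right_on="study_id", how="inner")
pairs = [
    (row.image_path, row.indication)
    for row in merged.itertuples()
    if isinstance(row.indication, str) and row.indication.strip()
]
print(f"Built {len(pairs)} (image, indication) pairs")
```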
Our code uses CheXbert [2] to label reports and RadGraph [3] for evaluation. Please download the two model checkpoints here and put them under "./models/".
Please refer to the instructions under `/image_model/` for how to train an image classifier to detect positive findings according to the CheXbert labels.
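As a rough illustration of the label preparation this entails, the sketch below converts a CheXbert-style label CSV (1.0 = positive, 0.0 = negative, -1.0 = uncertain, blank = not mentioned) into binary positive-finding targets. The file names, the ID column, and the choice to treat uncertain or missing mentions as negative are assumptions for illustration; follow the `/image_model/` instructions for the actual training pipeline.

```python
# Sketch: turn CheXbert label output into multi-label binary targets
# (1 = condition reported as positive, 0 = otherwise).
# File names and the handling of uncertain (-1.0) / blank labels are
# assumptions for illustration only; "No Finding" is excluded here.
import pandas as pd

CONDITIONS = [
    "Enlarged Cardiomediastinum", "Cardiomegaly", "Lung Opacity", "Lung Lesion",
    "Edema", "Consolidation", "Pneumonia", "Atelectasis", "Pneumothorax",
    "Pleural Effusion", "Pleural Other", "Fracture", "Support Devices",
]

labels = pd.read_csv("chexbert_labels.csv")          # assumed per-report CheXbert output
targets = (labels[CONDITIONS] == 1.0).astype(int)    # positive mention -> 1, everything else -> 0
targets.insert(0, "study_id", labels["study_id"])    # assumed ID column
targets.to_csv("positive_finding_targets.csv", index=False)
```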
```bash
deepspeed --num_gpus=<insert> report_cleaning.py --chexbert_path <insert> --dataset_path <insert> --output_dir <insert>
```
This cleaning method works best on one sentence at a time. The process of splitting sentences and keeping track of their report IDs can be dataset-specific, so we leave that implementation to the user.
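A minimal sketch of that sentence-splitting step is below. It assumes the reports live in a CSV with `report_id` and `report` columns and that one sentence per row is an acceptable input format for report_cleaning.py; adapt both assumptions to your dataset and to the script's expected `--dataset_path` format.

```python
# Sketch: split each report into sentences while keeping track of report IDs,
# so cleaned sentences can later be re-assembled into full reports.
# Input/output column names and the one-sentence-per-row layout are assumptions.
import re
import pandas as pd

reports = pd.read_csv("reports.csv")  # assumed columns: report_id, report

rows = []
for _, r in reports.iterrows():
    # Naive split on sentence-ending punctuation; swap in a clinical-text
    # tokenizer if abbreviations (e.g., "Dr.", "a.m.") cause bad splits.
    sentences = re.split(r"(?<=[.!?])\s+", str(r["report"]).strip())
    for i, sent in enumerate(s for s in sentences if s):
        rows.append({"report_id": r["report_id"], "sentence_id": i, "sentence": sent})

pd.DataFrame(rows).to_csv("report_sentences.csv", index=False)
```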
We first format the indication and ground-truth CheXbert labels into a JSON file that can be used to finetune LLaMA.
```bash
python format_llama_input.py --indication_path <insert> --impression_path <insert> --outpath <insert>
```
Follow Alpaca's finetuning instructions to finetune a LLaMA model to generate radiology reports, passing the path to the above JSON file as --data_path.
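For reference, Alpaca's finetuning script expects a JSON list of instruction/input/output records. The sketch below shows roughly what the output of format_llama_input.py might look like; the exact prompt wording and how the indication and conditions are combined are assumptions, so defer to the script itself.

```python
# Sketch of an Alpaca-style training record built from an indication and the
# ground-truth CheXbert conditions. The field wording is illustrative only;
# the actual template is produced by format_llama_input.py.
import json

example = {
    "instruction": "Write the radiology report for this chest X-ray.",
    "input": "Indication: evaluate for pneumonia. Positive findings: Pleural Effusion, Atelectasis.",
    "output": "There is a small left pleural effusion with adjacent atelectasis. ...",
}

with open("llama_finetune_data.json", "w") as f:
    json.dump([example], f, indent=2)  # Alpaca's --data_path expects a JSON list of such records
```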
Provide the path to your finetuned Pragmatic-LLaMA model, the path to the indications, and the path to the directory containing the vision model and its tuned classification thresholds, and specify an output path for the predicted vision labels. Caching the vision labels saves time on subsequent runs over the same images, since the classifier does not need to be re-run.
```bash
python pragmatic_llama_inference.py --llama_path <insert> --indication_path <insert> --vision_path <insert> --image_path <insert> --vision_out_path <insert> --outpath <insert>
```
Once you have generated reports using Pragmatic-LLaMA or any other model, they can be evaluated using the command below. Note that --out_path should be a CSV file: if it is "filename.csv", the per-report scores are saved to "filename.csv", and an additional file "filename_avg.csv" is written containing the metric scores averaged over all evaluated reports. We gratefully borrow much of the evaluation code from Yu, Endo, and Krishnan et al. [4].
```bash
python evaluate.py --gt_path <insert> --gen_path <insert> --out_path <insert>
```
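evaluate.py already writes the averaged file itself; the sketch below only illustrates how the two outputs relate, assuming the per-report CSV has one numeric column per metric alongside non-numeric ID columns.

```python
# Sketch: filename_avg.csv holds the column-wise means of the per-report
# metric scores in filename.csv. This is for illustration only -- evaluate.py
# produces this file on its own.
import pandas as pd

per_report = pd.read_csv("filename.csv")                    # per-report metric scores
metric_cols = per_report.select_dtypes("number").columns    # assumes metrics are the numeric columns
per_report[metric_cols].mean().to_frame().T.to_csv("filename_avg.csv", index=False)
```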
[1] Johnson, Alistair, Tom Pollard, Roger Mark, Seth Berkowitz, and Steven Horng. "MIMIC-CXR Database" (version 2.0.0). PhysioNet (2019). https://doi.org/10.13026/C2JT1Q.
[2] Smit, Akshay, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Y. Ng, and Matthew P. Lungren. "CheXbert: combining automatic labelers and expert annotations for accurate radiology report labeling using BERT." arXiv preprint arXiv:2004.09167 (2020).
[3] Jain, Saahil, Ashwin Agrawal, Adriel Saporta, Steven QH Truong, Du Nguyen Duong, Tan Bui, Pierre Chambon et al. "RadGraph: Extracting clinical entities and relations from radiology reports." arXiv preprint arXiv:2106.14463 (2021).
[4] Yu, Feiyang, Mark Endo, Rayan Krishnan, Ian Pan, Andy Tsai, Eduardo Pontes Reis, Eduardo Kaiser Ururahy Nunes Fonseca et al. "Evaluating progress in automatic chest x-ray radiology report generation." Patterns 4, no. 9 (2023).