Self-Training Large Language and Vision Assistant for Medical Question-Answering [paper][HF Model]

Guohao Sun, Can Qin, Huazhu Fu, Linwei Wang, Zhiqiang Tao

The advancement of medical image understanding and reasoning critically depends on building high-quality visual instruction data, which is costly and labor-intensive to obtain, particularly in the medical domain. To mitigate this data-scarcity issue, we introduce the Self-Training Large Language and Vision Assistant for Medicine (STLLaVA-Med).

Figure: Medical data usage and performance comparison between LLaVA-Med and our method.

Figure: Self-training pipeline for transforming a general vision-language assistant into a medical expert.
- [2024.10.24] 🌟 We have released our checkpoints!
- [2024.09.20] 🌟 We will release our checkpoints soon!
- [2024.09.20] 🌟 Our paper has been accepted by EMNLP 2024 (main conference).
- [2024.06.10] 🌟 Our paper and code were released!
- Install Package

```bash
conda create -n stllava python=3.10 -y
conda activate stllava
pip install --upgrade pip  # enable PEP 660 support
cd STLLaVA-Med
pip install -e .
```

- Install additional packages for training cases

```bash
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
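A quick sanity check after installation (a minimal sketch; the `llava` package name is assumed from the upstream LLaVA codebase this project builds on):

```bash
# Confirm the editable install and GPU visibility.
# The `llava` module name is an assumption based on the upstream LLaVA codebase.
python -c "import llava; print('llava import OK')"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
```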
Visual instruction data

This project uses the visual instruction data (60k_inline_mention) provided by LLaVA-Med. However, because some image URLs are disabled, we filtered the original data into our own version for this project.
DPO data

This project auto-generates the preference dataset using the model itself, guided by GPT-4o. We sample 10k medical images from PMC-15M. You can download the dataset via STLLaVA-Med-DPO.
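For reference, one way to fetch the preference data locally with the Hugging Face CLI (a sketch; `<org>/STLLaVA-Med-DPO` is a placeholder for the actual dataset ID linked above):

```bash
# Download the auto-generated preference dataset from the Hugging Face Hub.
# Replace <org>/STLLaVA-Med-DPO with the actual dataset repo ID linked above.
huggingface-cli download <org>/STLLaVA-Med-DPO \
    --repo-type dataset \
    --local-dir ./data/stllava_med_dpo
```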
Training consists of two stages: (1) visual self-questioning instruction tuning, which teaches the model to ask questions and follow multimodal instructions; and (2) preference optimization.
Training script with DeepSpeed ZeRO-3 and LoRA: sqllava_med.sh.

- `--mm_projector_type cluster`: the prototype extractor & a two-layer MLP vision-language connector.
- `--mm_projector_type mlp2x_gelu`: a two-layer MLP vision-language connector.
- `--vision_tower openai/clip-vit-large-patch14-336`: CLIP ViT-L/14 336px.
- `--image_aspect_ratio pad`: pads non-square images to square instead of cropping them; this slightly reduces hallucination.
- `--version v1_sq`: training for visual self-questioning.
- `--vit_lora_enable`: optimize the vision encoder using ViT LoRA.
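For illustration, a minimal sketch of how these flags might appear in a ZeRO-3 launch; the entry point, base checkpoint, data paths, and remaining hyperparameters are assumptions (the authoritative command is in sqllava_med.sh):

```bash
# Illustrative only -- see sqllava_med.sh for the actual training command.
# Entry point, DeepSpeed config, and data paths below are assumed placeholders.
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --lora_enable True --vit_lora_enable \
    --model_name_or_path <base-vlm-checkpoint> \
    --version v1_sq \
    --data_path ./data/llava_med_instruct_60k_inline_mention_filtered.json \
    --image_folder ./data/images \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type cluster \
    --image_aspect_ratio pad \
    --bf16 True \
    --output_dir ./checkpoints/stllava-med-sq
```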
Training script with DeepSpeed ZeRO-3 and LoRA: dpo_finetune.sh.

- `--version v1`: the standard conversation template (no self-questioning), used for the preference optimization stage.
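Similarly, a hedged sketch of the preference-optimization launch; the entry point and data paths are assumptions, and dpo_finetune.sh remains the reference:

```bash
# Illustrative only -- see dpo_finetune.sh for the actual DPO command.
# Entry point and data paths below are assumed placeholders.
deepspeed llava/train/train_dpo.py \
    --deepspeed ./scripts/zero3.json \
    --lora_enable True \
    --model_name_or_path ./checkpoints/stllava-med-sq \
    --version v1 \
    --data_path ./data/stllava_med_dpo/preferences.json \
    --image_folder ./data/stllava_med_dpo/images \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --image_aspect_ratio pad \
    --output_dir ./checkpoints/stllava-med-dpo
```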
Please download the raw images of the datasets (VQA-RAD, SLAKE, PVQA) for the medical VQA tasks.

We evaluate models on a diverse set of 3 benchmarks. To ensure reproducibility, we evaluate with greedy decoding rather than beam search, keeping inference consistent with the real-time outputs of the chat demo.
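As a sketch of the greedy-decoding setup, here is a LLaVA-style VQA evaluation command; the module name, file layout, and conversation mode are assumptions borrowed from the upstream LLaVA evaluation scripts:

```bash
# Greedy decoding: temperature 0 and a single beam, matching the chat demo.
# Module name and paths are assumed from the upstream LLaVA eval scripts.
python -m llava.eval.model_vqa_loader \
    --model-path ./checkpoints/stllava-med-dpo \
    --question-file ./data/eval/vqa_rad/test_questions.jsonl \
    --image-folder ./data/eval/vqa_rad/images \
    --answers-file ./results/vqa_rad_answers.jsonl \
    --temperature 0 \
    --num_beams 1 \
    --conv-mode v1
```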
If you find this code useful for your research, please consider citing:

```bibtex
@inproceedings{Sun2024STLLaVAMedSL,
  title     = {STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering},
  author    = {Guohao Sun and Can Qin and Huazhu Fu and Linwei Wang and Zhiqiang Tao},
  booktitle = {EMNLP},
  year      = {2024},
}
```