Official code for the paper "ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning".
Zhiwei Hao, Jianyuan Guo, Li Shen, Yong Luo, Han Hu*, Yonggang Wen
Installation
conda create -n adem python=3.8 -y
conda activate adem
# install pytorch
conda install pytorch==1.13.1 torchvision==0.14.1 -c pytorch
# install dependencies
pip install -r requirements.txt
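As a quick sanity check after installation, the following minimal sketch (not part of the official scripts; the file name is hypothetical) verifies that the expected PyTorch build and GPUs are visible:

```python
# sanity_check.py -- hypothetical helper, not part of this repo
import torch
import torchvision

# The commands above install torch 1.13.1 and torchvision 0.14.1.
print("torch:", torch.__version__)              # expected: 1.13.1
print("torchvision:", torchvision.__version__)  # expected: 0.14.1
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())  # the training commands below assume 8 GPUs
```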
Data Preparation
The data preparation instructions below are borrowed from LaVIN.
- For ScienceQA, please prepare the dataset from the official repo.
- For Multimodal Chatbot, download the images in the train2014 split from MSCOCO and obtain the prepared 52k text-only and 158k text-image instruction-following data from here.
- Obtain the LLaMA weights through this form (official) or download LLaMA-7B and LLaMA-13B from HuggingFace (unofficial).
After that, the file structure should look like:
ADEM-VL/
  |-- adem
  |-- train.py
  ......
  |-- data/
      |-- problem.json
      |-- pid_splits.json
      |-- captions.json
      |-- all_data.json
      |-- images
          |-- train2014    # MSCOCO 2014
          |-- val2014      # MSCOCO 2014
          |-- train        # ScienceQA train images
          |-- val          # ScienceQA val images
          |-- test         # ScienceQA test images
      |-- weights
          |-- tokenizer.model
          |-- 7B
              |-- params.json
              |-- consolidated.00.pth
          |-- 13B
              |-- params.json
              |-- consolidated.00.pth
              |-- consolidated.01.pth
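As a rough check that this layout is in place, the minimal sketch below (a hypothetical helper, not part of the repo) walks the paths from the tree above; adjust DATA_ROOT to the directory you will pass as --data_root:

```python
# check_data_layout.py -- hypothetical helper, not part of this repo
import os

DATA_ROOT = "/path/to/data"  # same value you will pass as --data_root

expected = [
    "problem.json",
    "pid_splits.json",
    "captions.json",
    "all_data.json",
    "images/train2014",
    "images/val2014",
    "images/train",
    "images/val",
    "images/test",
    "weights/tokenizer.model",
    "weights/7B/params.json",
    "weights/7B/consolidated.00.pth",
    # add the weights/13B/ files if you also downloaded the 13B model
]

for rel in expected:
    path = os.path.join(DATA_ROOT, rel)
    status = "ok" if os.path.exists(path) else "MISSING"
    print(f"{status:7s} {path}")
```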
Reproduce the performance of ADEM-VL (LLaMA-7B).
ScienceQA
torchrun --nproc_per_node 8 train.py --data_root /path/to/data/ --clip_root /path/to/data/weights/clip/ --caption_file /path/to/data/captions.json --llama_model_path /path/to/data/weights/ --llm_model 7B --max_seq_len 512 --batch_size 2 --accum_iter 2 --epochs 20 --warmup_epochs 2 --blr 9e-3 --weight_decay 0.02 --adapter_dim 12 --alpha 0.1 --beta 0.01 --drop_ratio 0.1 --down_sample_num 256 64 --dataset sqa
COCO caption
torchrun --nproc_per_node 8 train.py --data_root /path/to/data/ --clip_root /path/to/data/weights/clip/ --caption_file /path/to/data/captions.json --llama_model_path /path/to/data/weights/ --llm_model 7B --max_seq_len 512 --batch_size 2 --accum_iter 2 --epochs 5 --warmup_epochs 0.1 --blr 9e-3 --weight_decay 0.02 --adapter_dim 12 --alpha 0.1 --beta 0.01 --drop_ratio 0.1 --down_sample_num 256 64 --dataset coco_caption
Instruction following
torchrun --nproc_per_node 8 train.py --data_root /path/to/data/ --clip_root /path/to/data/weights/clip/ --caption_file /path/to/data/captions.json --llama_model_path /path/to/data/weights/ --llm_model 7B --max_seq_len 512 --batch_size 2 --accum_iter 2 --epochs 15 --warmup_epochs 0.2 --blr 9e-3 --weight_decay 0.02 --adapter_dim 12 --alpha 0.1 --beta 0.01 --drop_ratio 0.1 --down_sample_num 256 64 --dataset instruction
To train on fewer GPUs, reduce the number of GPUs in the scripts and increase gradient accumulation via --accum_iter so that the total batch size stays at 32 (with 8 GPUs, batch_size 2, and accum_iter 2, the effective batch size is 8 × 2 × 2 = 32; with 4 GPUs, for example, set --accum_iter 4).
Evaluate the fine-tuned model on each task.
ScienceQA
python eval_sqa.py --data_root /path/to/data/ --clip_root /path/to/data/weights/clip/ --model 7B --adapter_path ./output_dir --alpha 0.1 --beta 0.01 --drop_ratio 0.1 --down_sample_num 256 64
COCO caption
# prepare required packages
pip install pycocoevalcap pycocotools
python eval_caption.py --data_root /path/to/data/ --clip_root /path/to/data/weights/clip/ --model 7B --adapter_path ./output_dir --alpha 0.1 --beta 0.01 --drop_ratio 0.1 --down_sample_num 256 64
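The BLEU-4 and CIDEr numbers reported below come from the pycocoevalcap toolkit installed above; eval_caption.py runs the evaluation itself. For reference only, the sketch below shows how pycocoevalcap scores a COCO-format prediction file (a JSON list of {"image_id", "caption"} entries) against caption annotations; the file names are placeholders, not outputs of this repo:

```python
# score_captions.py -- illustrative only; eval_caption.py performs the evaluation itself
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

annotation_file = "captions_val2014.json"  # COCO caption annotations (placeholder path)
results_file = "predictions.json"          # model predictions (placeholder path)

coco = COCO(annotation_file)
coco_result = coco.loadRes(results_file)

evaluator = COCOEvalCap(coco, coco_result)
# Restrict scoring to images that actually have predictions.
evaluator.params["image_id"] = coco_result.getImgIds()
evaluator.evaluate()

for metric, score in evaluator.eval.items():
    print(f"{metric}: {score:.3f}")  # includes Bleu_4 and CIDEr
```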
Instruction following
- MME
- Download MME images and eval_tool from the MME repo.
- Run the following command to obtain model predictions:
python eval_instruction.py --data_root /path/to/data/ --clip_root /path/to/data/weights/clip/ --model 7B --adapter_path ./output_dir --alpha 0.1 --beta 0.01 --drop_ratio 0.1 --down_sample_num 256 64
- Calculate the MME results by executing the calculation script that comes with the MME eval_tool.
- More tasks
Evaluation on more tasks can be performed in a similar way to MME using toolkits such as VLMEvalKit and vlm-evaluation.
Results
| Model | Task | Results | Weights | Training log |
|---|---|---|---|---|
| LLaMA-7B | ScienceQA | Averaged accuracy = 94.01 | [Link] | [Link] |
| LLaMA-7B | COCO caption | BLEU-4 = 38.5, CIDEr = 130.1 | [Link] | [Link] |
| LLaMA-7B | Instruction following | MME-P = 969.7, MME-C = 258.9 | [Link] | [Link] |
Citation
If you find this work helpful, please cite our paper:
@misc{hao2024ademvladaptiveembeddedfusion,
title={ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning},
author={Zhiwei Hao and Jianyuan Guo and Li Shen and Yong Luo and Han Hu and Yonggang Wen},
year={2024},
eprint={2410.17779},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.17779},
}
Acknowledgement
This repo borrows some data and code from LaVIN, MemVP, and BLIP. Thanks for their great work.