Official code for the paper "ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning".
Zhiwei Hao, Jianyuan Guo, Li Shen, Yong Luo, Han Hu*, Yonggang Wen
Installation
conda create -n adem python=3.8 -y
conda activate adem
# install pytorch
conda install pytorch==1.13.1 torchvision==0.14.1 -c pytorch
# install dependencies
pip install -r requirements.txt
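As a quick sanity check after installation, the following minimal sketch (not part of the official scripts; the file name is hypothetical) verifies that the expected PyTorch build and GPUs are visible:

```python
# sanity_check.py -- hypothetical helper, not part of this repo
import torch
import torchvision

# The commands above install torch 1.13.1 and torchvision 0.14.1.
print("torch:", torch.__version__)              # expected: 1.13.1
print("torchvision:", torchvision.__version__)  # expected: 0.14.1
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())  # the training commands below assume 8 GPUs
```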
Data Preparation
The data preparation instructions below are borrowed from LaVIN.
- For ScienceQA, please prepare the dataset from the official repo.
- For Multimodal Chatbot, download the images in the train2014 split from MSCOCO and obtain the prepared 52k text-only and 158k text-image instruction-following data from here.
- Obtain the LLaMA weights through this form (official) or download LLaMA-7B and LLaMA-13B from HuggingFace (unofficial).
After that, the file structure should look like:
ADEM-VL/
  |-- adem
  |-- train.py
  ......
  |-- data/
      |-- problem.json
      |-- pid_splits.json
      |-- captions.json
      |-- all_data.json
      |-- images
          |-- train2014    # MSCOCO 2014
          |-- val2014      # MSCOCO 2014
          |-- train        # ScienceQA train images
          |-- val          # ScienceQA val images
          |-- test         # ScienceQA test images
      |-- weights
          |-- tokenizer.model
          |-- 7B
              |-- params.json
              |-- consolidated.00.pth
          |-- 13B
              |-- params.json
              |-- consolidated.00.pth
              |-- consolidated.01.pth
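As a rough check that this layout is in place, the minimal sketch below (a hypothetical helper, not part of the repo) walks the paths from the tree above; adjust DATA_ROOT to the directory you will pass as --data_root:

```python
# check_data_layout.py -- hypothetical helper, not part of this repo
import os

DATA_ROOT = "/path/to/data"  # same value you will pass as --data_root

expected = [
    "problem.json",
    "pid_splits.json",
    "captions.json",
    "all_data.json",
    "images/train2014",
    "images/val2014",
    "images/train",
    "images/val",
    "images/test",
    "weights/tokenizer.model",
    "weights/7B/params.json",
    "weights/7B/consolidated.00.pth",
    # add the weights/13B/ files if you also downloaded the 13B model
]

for rel in expected:
    path = os.path.join(DATA_ROOT, rel)
    status = "ok" if os.path.exists(path) else "MISSING"
    print(f"{status:7s} {path}")
```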
Reproduce the performance of ADEM-VL (LLaMA-7B).
ScienceQA
torchrun --nproc_per_node 8 train.py --data_root /path/to/data/ --clip_root /path/to/data/weights/clip/ --caption_file /path/to/data/captions.json --llama_model_path /path/to/data/weights/ --llm_model 7B --max_seq_len 512 --batch_size 2 --accum_iter 2 --epochs 20 --warmup_epochs 2 --blr 9e-3 --weight_decay 0.02 --adapter_dim 12 --alpha 0.1 --beta 0.01 --drop_ratio 0.1 --down_sample_num 256 64 --dataset sqa
COCO caption
torchrun --nproc_per_node 8 train.py --data_root /path/to/data/ --clip_root /path/to/data/weights/clip/ --caption_file /path/to/data/captions.json --llama_model_path /path/to/data/weights/ --llm_model 7B --max_seq_len 512 --batch_size 2 --accum_iter 2 --epochs 5 --warmup_epochs 0.1 --blr 9e-3 --weight_decay 0.02 --adapter_dim 12 --alpha 0.1 --beta 0.01 --drop_ratio 0.1 --down_sample_num 256 64 --dataset coco_caption
Instruction following
torchrun --nproc_per_node 8 train.py --data_root /path/to/data/ --clip_root /path/to/data/weights/clip/ --caption_file /path/to/data/captions.json --llama_model_path /path/to/data/weights/ --llm_model 7B --max_seq_len 512 --batch_size 2 --accum_iter 2 --epochs 15 --warmup_epochs 0.2 --blr 9e-3 --weight_decay 0.02 --adapter_dim 12 --alpha 0.1 --beta 0.01 --drop_ratio 0.1 --down_sample_num 256 64 --dataset instruction
To train on fewer GPUs, reduce the number of GPUs in the scripts and increase gradient accumulation via --accum_iter so that the total batch size stays at 32 (with 8 GPUs, batch_size 2, and accum_iter 2, the effective batch size is 8 × 2 × 2 = 32; with 4 GPUs, for example, set --accum_iter 4).
Evaluate the fine-tuned model on each task.
ScienceQA
python eval_sqa.py --data_root /path/to/data/ --clip_root /path/to/data/weights/clip/ --model 7B --adapter_path ./output_dir --alpha 0.1 --beta 0.01 --drop_ratio 0.1 --down_sample_num 256 64
COCO caption
# prepare required packages
pip install pycocoevalcap pycocotools
python eval_caption.py --data_root /path/to/data/ --clip_root /path/to/data/weights/clip/ --model 7B --adapter_path ./output_dir --alpha 0.1 --beta 0.01 --drop_ratio 0.1 --down_sample_num 256 64
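The BLEU-4 and CIDEr numbers reported below come from the pycocoevalcap toolkit installed above; eval_caption.py runs the evaluation itself. For reference only, the sketch below shows how pycocoevalcap scores a COCO-format prediction file (a JSON list of {"image_id", "caption"} entries) against caption annotations; the file names are placeholders, not outputs of this repo:

```python
# score_captions.py -- illustrative only; eval_caption.py performs the evaluation itself
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

annotation_file = "captions_val2014.json"  # COCO caption annotations (placeholder path)
results_file = "predictions.json"          # model predictions (placeholder path)

coco = COCO(annotation_file)
coco_result = coco.loadRes(results_file)

evaluator = COCOEvalCap(coco, coco_result)
# Restrict scoring to images that actually have predictions.
evaluator.params["image_id"] = coco_result.getImgIds()
evaluator.evaluate()

for metric, score in evaluator.eval.items():
    print(f"{metric}: {score:.3f}")  # includes Bleu_4 and CIDEr
```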
Instruction following
- MME
- Download MME images and eval_tool from the MME repo.
- Run the following command to obtain model predictions:
python eval_instruction.py --data_root /path/to/data/ --clip_root /path/to/data/weights/clip/ --model 7B --adapter_path ./output_dir --alpha 0.1 --beta 0.01 --drop_ratio 0.1 --down_sample_num 256 64
- Calculate the MME results by executing the calculation script that comes with the MME eval_tool.
- More tasks
Evaluation on more tasks can be performed in a similar way to MME using toolkits such as VLMEvalKit and vlm-evaluation.
Results
| Model | Task | Results | Weights | Training log |
|---|---|---|---|---|
| LLaMA-7B | ScienceQA | Averaged accuracy = 94.01 | [Link] | [Link] |
| LLaMA-7B | COCO caption | BLEU-4 = 38.5, CIDEr = 130.1 | [Link] | [Link] |
| LLaMA-7B | Instruction following | MME-P = 969.7, MME-C = 258.9 | [Link] | [Link] |
Citation
If you find this work helpful, please cite our paper:
@misc{hao2024ademvladaptiveembeddedfusion,
title={ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning},
author={Zhiwei Hao and Jianyuan Guo and Li Shen and Yong Luo and Han Hu and Yonggang Wen},
year={2024},
eprint={2410.17779},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.17779},
}
Acknowledgement
This repo borrows some data and code from LaVIN, MemVP, and BLIP. Thanks for their great work.