ADEM-VL

Official code for the paper "ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning".

Zhiwei Hao, Jianyuan Guo, Li Shen, Yong Luo, Han Hu*, Yonggang Wen

Preparation

conda create -n adem python=3.8 -y
conda activate adem

# install pytorch
conda install pytorch==1.13.1 torchvision==0.14.1 -c pytorch

# install dependencies
pip install -r requirements.txt
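
Optionally, the installation can be verified with a short Python snippet (illustrative only, not part of the repo); run it inside the activated adem environment:

# optional sanity check of the environment created above
import torch
import torchvision

print(torch.__version__, torchvision.__version__)  # expect 1.13.1 and 0.14.1
print(torch.cuda.is_available())                   # should be True for GPU training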

Data Preparation

The data preparation instructions below are borrowed from LaVIN.

  • For ScienceQA, please prepare the dataset from the official repo.
  • For Multimodal Chatbot, download the images in train2014 split from MSCOCO, and obtain the prepared 52k text-only and 158k text-image instruction-following data from here.
  • Obtain the LLaMA weights from this form (official), or download LLaMA-7B and LLaMA-13B from HuggingFace (unofficial).

After that, the file structure should look like:

ADEM-VL/
  |-- adem
  |-- train.py
  ......
  |-- data/
      |-- problem.json
      |-- pid_splits.json
      |-- captions.json
      |-- all_data.json
      |-- images
          |-- train2014      # MSCOCO 2014
          |-- val2014        # MSCOCO 2014
          |-- train          # ScienceQA train image
          |-- val            # ScienceQA val image
          |-- test           # ScienceQA test image
      |-- weights
          |-- tokenizer.model
          |-- 7B
              |-- params.json
              |-- consolidated.00.pth
          |-- 13B
              |-- params.json
              |-- consolidated.00.pth
              |-- consolidated.01.pth
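
As a quick check that everything is in place, a short Python snippet like the following can be run from the ADEM-VL/ root (illustrative only; the paths are taken from the tree above, so adjust them if your layout differs):

# verify the expected data layout (paths follow the tree above)
from pathlib import Path

data_root = Path("data")
expected = [
    "problem.json", "pid_splits.json", "captions.json", "all_data.json",
    "images/train2014", "images/val2014", "images/train", "images/val", "images/test",
    "weights/tokenizer.model", "weights/7B/params.json", "weights/7B/consolidated.00.pth",
]
missing = [p for p in expected if not (data_root / p).exists()]
print("layout looks complete" if not missing else f"missing entries: {missing}")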

Fine-tuning

Reproduce the performance of ADEM-VL with the LLaMA-7B backbone.

ScienceQA

torchrun --nproc_per_node 8 train.py \
    --data_root /path/to/data/ \
    --clip_root /path/to/data/weights/clip/ \
    --caption_file /path/to/data/captions.json \
    --llama_model_path /path/to/data/weights/ \
    --llm_model 7B --max_seq_len 512 \
    --batch_size 2 --accum_iter 2 \
    --epochs 20 --warmup_epochs 2 \
    --blr 9e-3 --weight_decay 0.02 \
    --adapter_dim 12 --alpha 0.1 --beta 0.01 --drop_ratio 0.1 \
    --down_sample_num 256 64 \
    --dataset sqa

COCO caption

torchrun --nproc_per_node 8 train.py \
    --data_root /path/to/data/ \
    --clip_root /path/to/data/weights/clip/ \
    --caption_file /path/to/data/captions.json \
    --llama_model_path /path/to/data/weights/ \
    --llm_model 7B --max_seq_len 512 \
    --batch_size 2 --accum_iter 2 \
    --epochs 5 --warmup_epochs 0.1 \
    --blr 9e-3 --weight_decay 0.02 \
    --adapter_dim 12 --alpha 0.1 --beta 0.01 --drop_ratio 0.1 \
    --down_sample_num 256 64 \
    --dataset coco_caption

Instruction following

torchrun --nproc_per_node 8 train.py \
    --data_root /path/to/data/ \
    --clip_root /path/to/data/weights/clip/ \
    --caption_file /path/to/data/captions.json \
    --llama_model_path /path/to/data/weights/ \
    --llm_model 7B --max_seq_len 512 \
    --batch_size 2 --accum_iter 2 \
    --epochs 15 --warmup_epochs 0.2 \
    --blr 9e-3 --weight_decay 0.02 \
    --adapter_dim 12 --alpha 0.1 --beta 0.01 --drop_ratio 0.1 \
    --down_sample_num 256 64 \
    --dataset instruction

To train on fewer GPUs, reduce --nproc_per_node in the scripts and increase gradient accumulation via --accum_iter so that the total effective batch size (number of GPUs × --batch_size × --accum_iter) stays at 32, as worked out in the snippet below.
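
A small illustrative Python snippet (the numbers mirror the scripts above, which use --batch_size 2 and a total batch size of 32):

# effective batch size = num_gpus * per_gpu_batch * accum_iter
target_total = 32   # total batch size targeted by the scripts above
per_gpu_batch = 2   # --batch_size

for num_gpus in (8, 4, 2, 1):
    accum_iter = target_total // (num_gpus * per_gpu_batch)
    print(f"{num_gpus} GPUs -> set --accum_iter {accum_iter}")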

Evaluation

Evaluate the fine-tuned model on each task.

ScienceQA

python eval_sqa.py \
    --data_root /path/to/data/ \
    --clip_root /path/to/data/weights/clip/ \
    --model 7B --adapter_path ./output_dir \
    --alpha 0.1 --beta 0.01 --drop_ratio 0.1 \
    --down_sample_num 256 64

COCO caption

# prepare required packages
pip install pycocoevalcap pycocotools

python eval_caption.py \
    --data_root /path/to/data/ \
    --clip_root /path/to/data/weights/clip/ \
    --model 7B --adapter_path ./output_dir \
    --alpha 0.1 --beta 0.01 --drop_ratio 0.1 \
    --down_sample_num 256 64
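
For context, the two packages installed above are the standard COCO caption scorers. The sketch below shows how pycocoevalcap computes BLEU-4 and CIDEr from a COCO-format annotation file and a predictions JSON; the filenames are placeholders, and eval_caption.py already reports these metrics, so this is purely illustrative:

# illustrative scoring with pycocoevalcap (filenames are placeholders)
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

coco = COCO("captions_val2014.json")         # ground-truth captions in COCO format
coco_res = coco.loadRes("predictions.json")  # [{"image_id": ..., "caption": ...}, ...]

evaluator = COCOEvalCap(coco, coco_res)
evaluator.params["image_id"] = coco_res.getImgIds()  # score only images with predictions
evaluator.evaluate()

print(evaluator.eval["Bleu_4"], evaluator.eval["CIDEr"])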

Instruction following

  • MME
  1. Download MME images and eval_tool from the MME repo.
  2. Run the following command to obtain model predictions:
python eval_instruction.py \
    --data_root /path/to/data/ \
    --clip_root /path/to/data/weights/clip/ \
    --model 7B --adapter_path ./output_dir \
    --alpha 0.1 --beta 0.01 --drop_ratio 0.1 \
    --down_sample_num 256 64
  3. Calculate MME results by executing the calculation script that comes with the MME eval_tool.
  • More tasks

Evaluation on more tasks can be performed in a similar way to MME, using toolkits such as VLMEvalKit and vlm-evaluation.

Model Zoo

Model    | Task                  | Results                      | Weights | Training log
LLaMA-7B | ScienceQA             | Averaged accuracy = 94.01    | [Link]  | [Link]
LLaMA-7B | COCO caption          | BLEU-4 = 38.5, CIDEr = 130.1 | [Link]  | [Link]
LLaMA-7B | Instruction following | MME-P = 969.7, MME-C = 258.9 | [Link]  | [Link]

Citation

If you find this work helpful, please cite our paper:

@misc{hao2024ademvladaptiveembeddedfusion,
      title={ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning}, 
      author={Zhiwei Hao and Jianyuan Guo and Li Shen and Yong Luo and Han Hu and Yonggang Wen},
      year={2024},
      eprint={2410.17779},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.17779}, 
}

Acknowledgement

This repo borrows some data and code from LaVIN, MemVP, and BLIP. Thanks for their great work.
