
MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers (CVPRW 2024, Oral)
Tanvir Mahmud, Shentong Mo, Yapeng Tian, Diana Marculescu

MA-AVT is a new parameter-efficient audio-visual transformer employing deep modality alignment for multimodal semantic feature correspondence.

MA-AVT Illustration

Environment

To set up the environment, run

pip install -r requirements.txt
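
If you prefer an isolated environment, a minimal sketch (the environment name and Python version here are our own assumptions, not specified by the authors):

        # create and activate a fresh environment, then install the pinned dependencies
        conda create -n ma-avt python=3.8 -y
        conda activate ma-avt
        pip install -r requirements.txt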

Datasets

AVE

Data can be downloaded from Audio-visual event localization in unconstrained videos

VGGSound

Data can be downloaded from Vggsound: A large-scale audio-visual dataset

CREMA-D

Data can be downloaded from CREMA-D: Crowd-sourced emotional multimodal actors dataset
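
The training commands below read frames at 1 fps (--fps 1) from a LAVISH-style data directory (--data_dir). If your videos are not yet decoded into frames, a generic extraction sketch with ffmpeg (hypothetical paths; the exact frame layout expected by the dataloader is not documented here) is:

        # decode one video into JPEG frames at 1 frame per second
        ffmpeg -i /path/to/video.mp4 -vf fps=1 /path/to/frames/%04d.jpg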

Train

To train the MA-AVT model on VGGSound, run

        python train_avm_vit.py --id MA-AVT --dataset vggsound \
                --data_dir /path/to/LAVISH/data/VGGSound --batch_size 256 --epochs 50 \
                --num_class 309 --output_dir /path/to/outputs/ --fps 1 --lr 0.01 \
                --lr_step 15 --mode train --model ma_avt --vis_encoder_type vit \
                --vit_type base --pretrained --multiprocessing_distributed --ngpu 4 \
                --LSA --print_freq 10 --num_vis 24 --n_audio_tokens 5 --n_vis_tokens 5 \
                --n_shared_tokens 5 --bg_label -1 --bg_cls --bg_prob 0.2 --unimodal_token \
                --multimodal_token --contrastive blockwise_sep --port 23145
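
To train on AVE instead, a hedged variant of the same command (AVE has 28 event classes; the data path is hypothetical, and hyperparameters such as batch size and learning rate are carried over unverified):

        python train_avm_vit.py --id MA-AVT --dataset ave \
                --data_dir /path/to/AVE --batch_size 256 --epochs 50 \
                --num_class 28 --output_dir /path/to/outputs/ --fps 1 --lr 0.01 \
                --lr_step 15 --mode train --model ma_avt --vis_encoder_type vit \
                --vit_type base --pretrained --multiprocessing_distributed --ngpu 4 \
                --LSA --print_freq 10 --num_vis 24 --n_audio_tokens 5 --n_vis_tokens 5 \
                --n_shared_tokens 5 --bg_label -1 --bg_cls --bg_prob 0.2 --unimodal_token \
                --multimodal_token --contrastive blockwise_sep --port 23145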

Test

For testing and visualization, run

        python train_avm_vit.py --id MA-AVT --dataset vggsound \
                --data_dir /path/to/LAVISH/data/VGGSound --batch_size 256 --epochs 50 \
                --num_class 309 --output_dir /path/to/outputs/ --fps 1 --lr 0.01 \
                --lr_step 15 --mode test --model ma_avt --vis_encoder_type vit \
                --vit_type base --pretrained --multiprocessing_distributed --ngpu 4 \
                --LSA --print_freq 10 --num_vis 24 --n_audio_tokens 5 --n_vis_tokens 5 \
                --n_shared_tokens 5 --bg_label -1 --bg_cls --bg_prob 0.2 --unimodal_token \
                --multimodal_token --contrastive blockwise_sep --port 23145
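
For a single-GPU run, a plausible but untested variant is to drop --multiprocessing_distributed and set --ngpu 1 (whether the script supports non-distributed execution is an assumption on our part; the batch size is reduced here as a guess at single-GPU memory limits):

        python train_avm_vit.py --id MA-AVT --dataset vggsound \
                --data_dir /path/to/LAVISH/data/VGGSound --batch_size 64 --epochs 50 \
                --num_class 309 --output_dir /path/to/outputs/ --fps 1 --lr 0.01 \
                --lr_step 15 --mode test --model ma_avt --vis_encoder_type vit \
                --vit_type base --pretrained --ngpu 1 \
                --LSA --print_freq 10 --num_vis 24 --n_audio_tokens 5 --n_vis_tokens 5 \
                --n_shared_tokens 5 --bg_label -1 --bg_cls --bg_prob 0.2 --unimodal_token \
                --multimodal_token --contrastive blockwise_sep --port 23145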

👍 Acknowledgments

This codebase is based on LAVISH and OGM-GE. Thanks for their amazing work.

LICENSE

MA-AVT is licensed under a UT Austin Research LICENSE.

Citation

If you find this work useful, please consider citing our paper:

BibTeX

@misc{mahmud2024maavt,
      title={MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers},
      author={Tanvir Mahmud and Shentong Mo and Yapeng Tian and Diana Marculescu},
      year={2024},
      eprint={2406.04930},
      archivePrefix={arXiv}
}
