by Paul Waligora¹, Haseeb Aslam¹, Osama Zeeshan¹, Soufiane Belharbi¹, Alessandro Lameiras Koerich¹, Marco Pedersoli¹, Simon Bacon², Eric Granger¹
¹ LIVIA, Dept. of Systems Engineering, ÉTS, Montreal, Canada
² Dept. of Health, Kinesiology & Applied Physiology, Concordia University, Montreal, Canada
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems by leveraging the inter- and intra-modal relationships among, e.g., visual, textual, physiological, and auditory modalities. This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention. This framework can exploit the complementary nature of diverse modalities to improve predictive accuracy. Separate backbones first capture intra-modal spatiotemporal dependencies within each modality over video sequences. Subsequently, our JMT fusion architecture integrates the individual modality embeddings, allowing the model to effectively capture inter- and intra-modal relationships. Extensive experiments on two challenging expression recognition tasks – (1) dimensional emotion recognition on the Affwild2 dataset (with face and voice) and (2) pain estimation on the Biovid dataset (with face and biosensors) – indicate that our JMT fusion can provide a cost-effective solution for MMER. Empirical results show that MMER systems with the proposed fusion outperform relevant baseline and state-of-the-art methods.
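To make the fusion mechanism concrete, here is a minimal PyTorch sketch of joint cross-attention, where each modality's tokens act as queries over the concatenated (joint) sequence serving as keys and values. This is an illustration only, assuming two modalities and 512-d embeddings; the class and variable names are ours, not the repository's API.

import torch
import torch.nn as nn

class JointCrossAttention(nn.Module):
    """Each modality's tokens query the joint (concatenated) key/value sequence."""
    def __init__(self, dim=512, num_heads=1):
        super().__init__()
        self.attn_v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vis, aud):
        # vis, aud: (batch, seq_len, dim) unimodal embeddings
        joint = torch.cat([vis, aud], dim=1)   # joint representation = keys/values
        v, _ = self.attn_v(vis, joint, joint)  # visual tokens attend to the joint seq
        a, _ = self.attn_a(aud, joint, joint)  # audio tokens attend to the joint seq
        return torch.cat([vis + v, aud + a], dim=-1)  # residual + concat

fusion = JointCrossAttention()
vis = torch.randn(2, 32, 512)  # e.g., visual clip embeddings
aud = torch.randn(2, 32, 512)  # e.g., audio embeddings
out = fusion(vis, aud)         # (2, 32, 1024) fused sequence

In the full model, a fused sequence like this would feed a regression head, e.g., for valence/arousal or pain intensity.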
Code: PyTorch 1.9.0, developed for the 6th ABAW challenge.
@InProceedings{Waligora-abaw-24,
  title     = {Joint Multimodal Transformer for Emotion Recognition in the Wild},
  author    = {Waligora, P. and Aslam, H. and Zeeshan, O. and Belharbi, S. and
               Koerich, A. L. and Pedersoli, M. and Bacon, S. and Granger, E.},
  booktitle = {CVPRw},
  year      = {2024}
}
cd Transformer_fusion
conda create --name YOUR_ENV_NAME --file requirements.txt
conda activate YOUR_ENV_NAME
- Vision: I3D, R2D1
- Audio: ResNet18, wavLM (features)
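As a sketch of the intra-modal stage, the snippet below extracts clip-level visual features with torchvision's R(2+1)D-18 pretrained on Kinetics-400, standing in for the R2D1 backbone listed above; the loader and shapes are illustrative (torchvision 0.10-style API matching PyTorch 1.9), not the repository's code.

import torch
import torchvision

# R(2+1)D-18 with Kinetics-400 weights, as a stand-in for the R2D1 backbone
backbone = torchvision.models.video.r2plus1d_18(pretrained=True)
backbone.fc = torch.nn.Identity()  # drop the classifier; keep 512-d clip embeddings
backbone.eval()

clips = torch.randn(2, 3, 16, 112, 112)  # (batch, channels, frames, height, width)
with torch.no_grad():
    feats = backbone(clips)  # (2, 512) spatiotemporal features, one vector per clip

The following script trains the multimodal fusion model with all unimodal backbones frozen; pass the GPU id as the first argument (e.g., bash train.sh 0, where train.sh is whatever you name the script).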
#!/usr/bin/env bash
# Train the multimodal JMT fusion model (all unimodal backbones frozen).
CONDA_BASE=$(conda info --base)
source "$CONDA_BASE/etc/profile.d/conda.sh"
conda activate YOUR_ENV_NAME
# ==============================================================================
cudaid=$1  # GPU id, passed as the first argument
export CUDA_VISIBLE_DEVICES=$cudaid
python main.py \
--opt__name_optimizer sgd \
--opt__lr 0.0001 \
--opt__weight_decay 0.0 \
--opt__name_lr_scheduler mystep \
--opt__step_size 100 \
--opt__gamma 0.1 \
--v_dropout 0.0 \
--a_dropout 0.0 \
--num_heads 1 \
--num_layers 1 \
--freeze_vision_R2D1 True \
--freeze_vision_I3D True \
--freeze_audio_ResNet18 True \
--split DEFAULT \
--l_vision_backbones R2D1 \
--l_audio_backbones wavLM+ResNet18 \
--init_w_R2D1 KINETICS400 \
--init_w_I3D KINETICS400 \
--init_w_ResNet18 IMAGENET \
--goal TRAINING \
--train_params__take_n_videos 2 \
--val_params__take_n_videos 2 \
--R2D1_ft_dim_reduce MAX \
--joint_modalities TRANSFORMER \
--dump_best_model_every_time True \
--output_format SELF_ATTEN \
--intra_modal_fusion encoder_plus_self_attention \
--max_epochs 1 \
--train_params__seq_length 512 \
--train_params__subseq_length 32 \
--train_params__stride 1 \
--train_params__dilation 4 \
--train_params__batch_size 32 \
--train_params__num_workers 16 \
--train_params__pin_memory True \
--train_params__shuffle True \
--train_params__use_more_vision_data_augm False \
--train_params__use_more_audio_data_augm False \
--val_params__num_workers 8 \
--SEED 0 \
--Mode Training \
--exp_id 03_09_2024_10_20_28_318104__2676163
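The next script pretrains the R2D1 visual backbone on its own (--goal PRETRAINING, no audio backbone, take_n_videos set to -1 to use all videos) before fusion training; it is launched the same way, with the GPU id as the first argument.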
#!/usr/bin/env bash
# Pretrain the R2D1 visual backbone alone, before fusion training.
CONDA_BASE=$(conda info --base)
source "$CONDA_BASE/etc/profile.d/conda.sh"
conda activate YOUR_ENV_NAME
# ==============================================================================
cudaid=$1  # GPU id, passed as the first argument
export CUDA_VISIBLE_DEVICES=$cudaid
python main.py \
--opt__name_optimizer sgd \
--opt__lr 0.0001 \
--opt__weight_decay 0.0 \
--opt__name_lr_scheduler mystep \
--opt__step_size 100 \
--opt__gamma 0.1 \
--v_dropout 0.15 \
--a_dropout 0.15 \
--num_heads 1 \
--num_layers 1 \
--freeze_vision_R2D1 False \
--freeze_vision_I3D True \
--freeze_audio_ResNet18 True \
--split DEFAULT \
--l_vision_backbones R2D1 \
--l_audio_backbones None \
--init_w_R2D1 KINETICS400 \
--init_w_I3D KINETICS400 \
--init_w_ResNet18 IMAGENET \
--goal PRETRAINING \
--train_params__take_n_videos -1 \
--val_params__take_n_videos -1 \
--R2D1_ft_dim_reduce MAX \
--use_joint_representation True \
--dump_best_model_every_time True \
--output_format SELF_ATTEN \
--max_epochs 5 \
--train_params__seq_length 512 \
--train_params__subseq_length 32 \
--train_params__stride 1 \
--train_params__dilation 4 \
--train_params__batch_size 32 \
--train_params__num_workers 8 \
--train_params__pin_memory True \
--train_params__shuffle True \
--train_params__use_more_vision_data_augm False \
--train_params__use_more_audio_data_augm False \
--val_params__num_workers 8 \
--SEED 0 \
--Mode Training \
--exp_id 03_09_2024_10_20_28_318104__2676163
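Finally, evaluation on the test set: set --fd_exp to the absolute path of the experiment folder produced by training, and pass the GPU id as the first argument as before.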
#!/usr/bin/env bash
# Evaluate a trained experiment on the test set.
CONDA_BASE=$(conda info --base)
source "$CONDA_BASE/etc/profile.d/conda.sh"
conda activate YOUR_ENV_NAME
# ==============================================================================
cudaid=$1  # GPU id, passed as the first argument
export CUDA_VISIBLE_DEVICES=$cudaid
python main.py \
--Mode Eval \
--eval_set test \
--fd_exp ABSOLUTE_PATH_TO_THE_EXP_FOLDER