Code for the CVPR 2021 paper "Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing"
We aim to identify the audible and visible events in videos and localize them temporally. Note that the audio and visual events may be asynchronous.
Please refer to https://github.com/YapengTian/AVVP-ECCV20 for downloading the LLP Dataset and the preprocessed audio and visual features.
Put the downloaded r2plus1d_18, res152, and vggish features into the feats folder.
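
To sanity-check the download, you can load one video's features directly. Below is a minimal sketch, assuming each feature directory stores one .npy file per video; the video id and the shapes noted in the comments are illustrative:

```python
# Minimal sketch: inspecting the pre-extracted features for one video.
# Assumes each feature directory holds one .npy file per video; the
# video id and the per-segment dimensions noted below are illustrative.
import numpy as np

video_id = "sample_video"  # hypothetical id; use an id from the LLP annotation files
audio_feat = np.load(f"feats/vggish/{video_id}.npy")    # VGGish audio embeddings (128-d)
visual_feat = np.load(f"feats/res152/{video_id}.npy")   # ResNet-152 frame features (2048-d)
st_feat = np.load(f"feats/r2plus1d_18/{video_id}.npy")  # R(2+1)D-18 clip features (512-d)

print(audio_feat.shape, visual_feat.shape, st_feat.shape)
```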
Training consists of three stages.
We first train a base model using multiple instance learning (MIL) and our proposed contrastive learning.
cd step1_train_base_model
python main_avvp.py --mode train --audio_dir ../feats/vggish/ --video_dir ../feats/res152/ --st_dir ../feats/r2plus1d_18
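
For intuition, here is a rough sketch of an MIL-style objective: per-segment event probabilities are pooled into a video-level prediction and supervised with the weak video-level labels. This is illustrative only, not the exact loss implemented in main_avvp.py:

```python
# Rough sketch of MIL pooling over temporal segments
# (illustrative; not the exact loss used in main_avvp.py).
import torch
import torch.nn.functional as F

def mil_loss(segment_logits, video_labels):
    """segment_logits: (B, T, C) per-segment event logits.
    video_labels: (B, C) weak multi-hot video-level labels."""
    segment_probs = torch.sigmoid(segment_logits)
    # Pool segment predictions to a single video-level prediction.
    video_probs = segment_probs.mean(dim=1).clamp(1e-7, 1 - 1e-7)
    return F.binary_cross_entropy(video_probs, video_labels)
```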
We then freeze the trained model and evaluate each video after swapping its audio and visual tracks with those of other, unrelated videos.
cd step2_find_exchange
python main_avvp.py --mode estimate_labels --audio_dir ../feats/vggish/ --video_dir ../feats/res152/ --st_dir ../feats/r2plus1d_18 --model_save_dir ../step1_train_base_model/models/
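
Conceptually, the swapped-in track comes from a video that shares no events with the original, so if an event is still predicted after the swap, its evidence must come from the remaining modality. Below is a hedged conceptual sketch with hypothetical helper names and a hypothetical threshold rule; the real logic lives in step2_find_exchange/main_avvp.py:

```python
# Conceptual sketch of modality-aware label estimation via track swapping.
# All names, shapes, and the threshold rule are hypothetical; see
# step2_find_exchange/main_avvp.py for the actual implementation.
import torch

@torch.no_grad()
def estimate_modality_labels(model, audio, visual, foreign_audio, foreign_visual,
                             video_labels, threshold=0.5):
    """Assumes model(audio, visual) returns video-level logits of shape (C,)
    and video_labels is a (C,) multi-hot vector. The foreign tracks come
    from an unrelated video sharing no event labels with this one."""
    p_orig = torch.sigmoid(model(audio, visual))            # original prediction
    p_swap_a = torch.sigmoid(model(foreign_audio, visual))  # audio track replaced
    p_swap_v = torch.sigmoid(model(audio, foreign_visual))  # visual track replaced
    audio_labels = video_labels.clone()
    visual_labels = video_labels.clone()
    for c in torch.nonzero(video_labels).flatten():
        # Event still predicted without the real audio: evidence is visual,
        # so drop the event from the audio label set.
        if p_swap_a[c] > threshold * p_orig[c]:
            audio_labels[c] = 0
        # Symmetrically for the visual track.
        if p_swap_v[c] > threshold * p_orig[c]:
            visual_labels[c] = 0
    return audio_labels, visual_labels
```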
We then re-train the model from scratch using the modality-aware labels estimated in the previous stage.
cd step3_retrain
python main_avvp.py --mode retrain --audio_dir ../feats/vggish/ --video_dir ../feats/res152/ --st_dir ../feats/r2plus1d_18
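
A minimal sketch of what a modality-aware objective can look like, with each modality supervised by its own estimated labels (illustrative; the actual loss in step3_retrain/main_avvp.py may differ):

```python
# Minimal sketch of a modality-aware retraining objective
# (illustrative; the actual loss in step3_retrain/main_avvp.py may differ).
import torch
import torch.nn.functional as F

def modality_aware_loss(audio_logits, visual_logits, audio_labels, visual_labels):
    """audio_logits / visual_logits: (B, C) video-level logits per modality.
    audio_labels / visual_labels: (B, C) modality-aware labels from stage 2."""
    loss_a = F.binary_cross_entropy_with_logits(audio_logits, audio_labels)
    loss_v = F.binary_cross_entropy_with_logits(visual_logits, visual_labels)
    return loss_a + loss_v
```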
Please cite the following paper in your publications if it helps your research:
@inproceedings{wu2021explore,
title = {Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing},
author = {Wu, Yu and Yang, Yi},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2021}
}