This repository introduces the Weakly-supervised Audio-Visual Video Parsing (AVVP) and Audio-Visual Event Localization (AVEL) tasks, and collects related works.
- Weakly-supervised AVVP: Weakly-supervised Audio-Visual Video Parsing is a task that aims to parse a video into temporal event segments and label them as audible, visible, or both.
- AVE: Audio-Visual Event Localization (AVEL) defines an audio-visual event as an event that is both audible and visible in a video segment. It comprises fully- and weakly-supervised audio-visual event localization tasks and a cross-modality localization task. The former predicts the event label for each video segment, while the latter finds the position in one modality (visual/auditory) given a segment of synchronized content in the other modality (auditory/visual). The cross-modality localization task includes visual localization from audio (A2V) and audio localization from visual content (V2A).
- LLP: The Look, Listen, and Parse (LLP) dataset is the only available dataset for the AVVP task. LLP contains 11,849 YouTube video clips spanning 25 categories, for a total of 32.9 hours, collected from AudioSet. The dataset covers a wide range of video events (e.g., human speaking, singing, baby crying, dog barking, violin playing, car running, and vacuum cleaning) from diverse domains (e.g., human activities, animal activities, music performances, vehicle sounds, and domestic environments). Each video is 10s long and contains at least 1s of audio or visual events. There are 7,202 videos containing events from more than one event category, and each video has 1.64 different event categories on average. For 1,849 randomly selected videos, individual audio and visual events are annotated with second-wise temporal boundaries, yielding 6,626 event annotations in total (4,131 audio events and 2,495 visual events), from which 2,488 audio-visual event annotations are derived. The validation and testing sets contain 649 and 1,200 fully annotated videos, respectively. The training set consists of 10,000 videos with weak (video-level) labels; both annotation granularities are sketched after this list.
- AVE: The Audio-Visual Event (AVE) dataset is the only available dataset for the AVE task. It is a subset of AudioSet containing 4,143 videos covering 28 event categories. Videos in AVE are temporally labeled with audio-visual event boundaries, and each video contains at least one 2s-long audio-visual event. The dataset covers a wide range of audio-visual events (e.g., man speaking, woman speaking, dog barking, playing guitar, and frying food) from different domains, e.g., human activities, animal activities, music performances, and vehicle sounds. Each event category contains between 60 and 188 videos, and 66.4% of the videos in AVE contain audio-visual events that span the full 10 seconds.
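
For intuition, here is a minimal sketch of the two annotation granularities in LLP. The class and field names are illustrative, not the official annotation format:

```python
# Sketch of LLP's two label granularities (illustrative field names).
from dataclasses import dataclass
from typing import List

@dataclass
class WeakLabel:
    """Video-level label for the 10,000 training videos: event categories
    are known, but not their temporal extent or modality."""
    video_id: str
    events: List[str]   # e.g. ["Speech", "Vacuum_cleaner"]

@dataclass
class SegmentLabel:
    """Second-wise annotation for the 1,849 densely labeled videos (val/test)."""
    video_id: str
    event: str          # event category
    modality: str       # "audio", "visual", or "audio-visual"
    start: int          # start second within the 10s clip
    end: int            # end second (exclusive)

weak = WeakLabel("abc123", ["Speech", "Vacuum_cleaner"])
full = SegmentLabel("abc123", "Speech", "audio", start=2, end=7)
```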
- Weakly-supervised Audio-Visual Video Parsing
- F-scores are used as metrics at both the segment level and the event level on individual audio, visual, and audio-visual events. Segment-level F-scores evaluate snippet-wise event labeling performance. Event-level F-scores evaluate the ability to extract events by concatenating consecutive positive snippets of the same event category, using mIoU = 0.5 as the matching threshold (see the sketch after this list).
- Type@AV and Event@AV evaluate overall performance. Type@AV averages the audio, visual, and audio-visual event evaluation results. Event@AV computes the F-scores considering all audio and visual events for each sample.
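
For concreteness, below is a minimal sketch of how the segment-level and event-level F-scores could be computed for a single event category, assuming binary per-second predictions over a 10-segment clip. It illustrates the idea, not the official evaluation script (which handles event matching more carefully); Type@AV would then average such F-scores over the audio, visual, and audio-visual event types.

```python
# Sketch: segment-level and event-level F-scores for one event category.
from typing import List, Tuple

def to_events(segments: List[int]) -> List[Tuple[int, int]]:
    """Concatenate consecutive positive segments into (start, end) events."""
    events, start = [], None
    for i, s in enumerate(segments + [0]):   # sentinel closes a trailing run
        if s and start is None:
            start = i
        elif not s and start is not None:
            events.append((start, i))
            start = None
    return events

def f_score(tp: float, fp: float, fn: float) -> float:
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def segment_f(pred: List[int], gt: List[int]) -> float:
    tp = sum(p and g for p, g in zip(pred, gt))
    fp = sum(p and not g for p, g in zip(pred, gt))
    fn = sum(g and not p for p, g in zip(pred, gt))
    return f_score(tp, fp, fn)

def event_f(pred: List[int], gt: List[int], miou: float = 0.5) -> float:
    def iou(a, b):
        inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union
    p_ev, g_ev = to_events(pred), to_events(gt)
    # Simplified one-sided matching at mIoU = 0.5.
    tp = sum(any(iou(p, g) >= miou for g in g_ev) for p in p_ev)
    return f_score(tp, len(p_ev) - tp, len(g_ev) - tp)

pred = [0, 1, 1, 1, 0, 0, 1, 0, 0, 0]
gt   = [0, 1, 1, 0, 0, 0, 1, 1, 0, 0]
print(segment_f(pred, gt))   # 0.75
print(event_f(pred, gt))     # 1.0 (both predicted events match at IoU >= 0.5)
```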
- Audio-Visual Event Localization
- Supervised audio-visual event localization (SEL): The overall accuracy of the category prediction for each one-second segment is used as the evaluation metric. "Background" counts as a regular category in this classification task.
- Cross-modality localization (CML): The percentage of correct matchings is used as the evaluation metric. A matching is correct only if the matched audio/visual segment is exactly the same as its ground truth; otherwise, it counts as incorrect. Both metrics are sketched below.
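
Below is a minimal sketch of both AVE metrics, assuming integer class labels per one-second segment (with "Background" as an ordinary class index) and exact-position matching for CML. The function names are illustrative:

```python
# Sketch of the SEL and CML metrics (illustrative, not the official code).
from typing import List

def sel_accuracy(pred: List[int], gt: List[int]) -> float:
    """Overall per-segment classification accuracy; class 0 here denotes
    "Background", which counts like any other category."""
    return sum(p == g for p, g in zip(pred, gt)) / len(gt)

def cml_accuracy(pred_pos: List[int], gt_pos: List[int]) -> float:
    """Percentage of correct matchings: a localized segment position is
    correct only if it exactly equals the ground-truth position."""
    return sum(p == g for p, g in zip(pred_pos, gt_pos)) / len(gt_pos)

# 10s clip: segments 3-6 are ground-truth "dog barking" (class 5).
print(sel_accuracy([0, 0, 0, 5, 5, 5, 5, 0, 0, 0],
                   [0, 0, 0, 5, 5, 5, 0, 0, 0, 0]))  # 0.9
print(cml_accuracy([3, 7, 2], [3, 6, 2]))            # ~0.67
```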
- CoLeaF: CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing. in arXiv 2024.
- VAPLAN: Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-wise Pseudo Labeling. in arXiv 2024.
- DGM: Multimodal Imbalance-Aware Gradient Modulation for Weakly-Supervised Audio-Visual Video Parsing. in TCSVT 2024.
- LGFNet: Multi-Level Signal Fusion for Enhanced Weakly-Supervised Audio-Visual Video Parsing. in SPL 2024.
- LSLD: Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective. in NeurIPS 2023. code
- VALOR: Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser. in NeurIPS 2023. code
- AVFAS: Multi-Modal and Multi-Scale Temporal Fusion Architecture Search for Audio-Visual Video Parsing. in ACM MM 2023.
- CMPAE: Collecting Cross-Modal Presence-Absence Evidence for Weakly-Supervised Audio-Visual Event Perception. in CVPR 2023. code
- MGN: Semantic-Aware Multi-modal Grouping for Weakly-Supervised Audio-Visual Video Parsing. in NeurIPS 2022. code
- MM-Pyramid: MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing. in ACM MM 2022. code
- DHHN: DHHN: Dual Hierarchical Hybrid Network for Weakly-Supervised Audio-Visual Video Parsing. in ACM MM 2022.
- JoMoLD: Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing. in ECCV 2022. code
- Mbias: Investigating Modality Bias in Audio Visual Video Parsing. in arXiv 2022. code
- DAVPNet: Distributed Audio-Visual Parsing Based On Multimodal Transformer and Deep Joint Source Channel Coding. in ICASSP 2022.
- MTSM: Toward a perceptive pretraining framework for Audio-Visual Video Parsing. in Information Sciences 2022.
- CVCMS: Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing. in NeurIPS 2021.
- CML: Cross-Modal learning for Audio-Visual Video Parsing. in Interspeech 2021. code
- MA: Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing. in CVPR 2021. code
- HAN: Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing. in ECCV 2020. code
- AVE: Audio-Visual Event Localization in Unconstrained Videos. in ECCV 2018. code