SoundQ — Enhanced sound event localization and detection in real 360-degree audio-visual soundscapes.
- An audio-visual synthetic data generator with spatial audio and 360-degree video.
- A suite of scripts to perform `data_augmentation` on 360-degree audio and video:
  - Audio channel swapping (ACS), as per Wang et al.
  - Video pixel swapping (VPS), as per Wang et al.
- An enhanced audio-visual SELDNet model with performance comparable to the audio-only SELDNet23.
  - The model integrates Detic, but any other detection model can be integrated into the training pipeline. See the installation instructions.
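The two augmentations above work as matched spatial transforms: ACS permutes and sign-flips the ambisonic channels to rotate or mirror the acoustic scene, while VPS applies the equivalent pixel shift or flip to the equirectangular video frame. A minimal sketch of the idea — the function names, the ACN channel-order assumption (W, Y, Z, X), and the roll direction are illustrative choices, not the repo's actual API:

```python
import numpy as np

def acs_augment(foa, transform):
    """Audio channel swapping (ACS) on a first-order ambisonics clip.

    foa: array of shape (4, n_samples), assumed ACN channel order (W, Y, Z, X).
    Each transform permutes/negates the directional channels, which rotates
    or mirrors the sound scene without any resynthesis.
    """
    w, y, z, x = foa
    if transform == "azi_mirror":   # azimuth phi -> -phi (left/right mirror)
        return np.stack([w, -y, z, x])
    if transform == "azi_rot90":    # azimuth phi -> phi + 90 degrees
        return np.stack([w, x, z, -y])
    if transform == "ele_mirror":   # elevation theta -> -theta
        return np.stack([w, y, -z, x])
    raise ValueError(f"unknown transform: {transform}")

def vps_augment(frame, transform):
    """Video pixel swapping (VPS): the matching transform on an
    equirectangular frame of shape (H, W, C), whose width spans 360 degrees."""
    if transform == "azi_mirror":   # azimuth mirror -> horizontal flip
        return frame[:, ::-1]
    if transform == "azi_rot90":    # 90-degree azimuth rotation -> roll W/4 pixels
        return np.roll(frame, frame.shape[1] // 4, axis=1)
    if transform == "ele_mirror":   # elevation mirror -> vertical flip
        return frame[::-1]
    raise ValueError(f"unknown transform: {transform}")
```

Applying the same transform key to the audio and the paired video frame keeps the two modalities spatially consistent; the ground-truth DOA labels must be rotated or mirrored identically.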
We benchmark our model following the DCASE 2023 Challenge Task 3 SELD evaluation metrics. The table below includes only the best-performing system (as documented in the DCASE results); scores are reported on the test split of the development dataset.
| Model | Dataset | ER20° ↓ | F20° ↑ | LECD ↓ | LRCD ↑ |
|---|---|---|---|---|---|
| AO SELDNet23 (baseline) | Ambisonic* | 0.57 | 29.9 % | 21.6° | 47.7 % |
| AV SELDNet23 (baseline) | Ambisonic + Video | 1.07 | 14.3 % | 48.0° | 35.5 % |
| AV SELDNet23 (ours) | Ambisonic* + Video | 0.65 | 24.9 % | 18.7° | 37.5 % |
Legend: AO = audio-only, AV = audio-visual, FOA = first-order ambisonics format, * = FOA + Multi-ACCDOA
If you find our work useful, please cite our paper:
```bibtex
@article{roman2024enhanced,
  title={Enhanced Sound Event Localization and Detection in Real 360-degree audio-visual soundscapes},
  author={Roman, Adrian S and Balamurugan, Baladithya and Pothuganti, Rithik},
  journal={arXiv preprint arXiv:2401.17129},
  year={2024}
}
```