A key function of auditory cognition is the association of characteristic sounds with their corresponding semantics over time. Humans attempting to discriminate between fine-grained audio categories often replay the same discriminative sounds to increase their prediction confidence. We propose an end-to-end attention-based architecture that, through selective repetition, attends over the most discriminative sounds across the audio sequence. Our model initially uses the full audio sequence and iteratively refines the temporal segments to be replayed based on slot attention. At each playback, the selected segments are replayed using a smaller hop length, which yields higher-resolution features within these segments. We show that our method consistently achieves state-of-the-art performance across three audio-classification benchmarks: AudioSet, VGG-Sound, and EPIC-KITCHENS-100.
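As a rough illustration of the iterative playback idea (a minimal sketch, not the authors' implementation: the encoder and slot-attention module are replaced by a simple per-frame saliency stand-in, and the hop lengths and segment-halving rule are assumed values):

```python
# Minimal sketch of the playback loop: each playback re-computes features
# for a narrower segment at a smaller hop length (higher temporal resolution).
# The saliency-based segment selection is a stand-in for the paper's
# encoder + slot attention; hop lengths and segment halving are assumptions.
import torch
import torchaudio

def mel_features(wave, hop_length):
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=1024, hop_length=hop_length, n_mels=128
    )
    return mel(wave)  # (1, n_mels, time)

wave = torch.randn(1, 16000 * 10)       # 10 s of dummy mono audio
start, end = 0, wave.shape[-1]          # first playback sees the full sequence
for playback, hop in enumerate([320, 160, 80]):  # smaller hop each playback
    spec = mel_features(wave[..., start:end], hop)
    scores = spec.mean(dim=(0, 1))      # stand-in saliency per time frame
    centre = scores.argmax().item() * hop + start  # peak frame, in samples
    width = (end - start) // 2          # halve the replayed segment
    start = max(0, centre - width // 2)
    end = min(wave.shape[-1], centre + width // 2)
    print(f"playback {playback}: hop={hop}, next segment=({start}, {end})")
```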
Ensure that the following packages are installed on your machine. Single-line install with pip:

```
pip install torch torchvision torchaudio librosa h5py wandb fvcore simplejson psutil tensorboard
```

Note that, unlike PySlowFast, there is no need to add this repository to your `$PYTHONPATH`.
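As an optional sanity check, you can verify that the core dependencies import correctly:

```python
# Optional: verify that the main dependencies are importable.
import torch, torchvision, torchaudio, librosa, h5py
print(torch.__version__, torchaudio.__version__, librosa.__version__, h5py.__version__)
```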
- AudioSet: The following fork of audioset-processing was used to download the full dataset while using multiple processes to download individual video files (the original repository uses only a single main process, which makes dataset crawling very slow). Depending on the availability of videos, the `train` and `test` files should be adjusted, as the current repo does not re-sample files that do not exist; see the filtering sketch after this list.
- VGG-Sound: Apart from the download repository referenced by VGGSound, an alternative way of acquiring the dataset is the script from this issue.
- EPIC-KITCHENS-100: You can follow the same steps as in Auditory SlowFast for downloading and generating the dataset in `.hdf5` format.
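Since missing files are not re-sampled, the splits need to match the clips you actually downloaded. A minimal sketch of one way to filter a split, assuming the `.pkl` splits are pandas-readable DataFrames; the `filename` column, the paths, and the `.flac` extension are assumptions to adjust to your layout:

```python
# Sketch: drop annotation rows whose audio file was never downloaded.
import os
import pandas as pd

annotations = pd.read_pickle("data/AudioSet/train.pkl")  # hypothetical path
audio_dir = "/path/to/audioset/audio"

exists = annotations["filename"].apply(                  # hypothetical column
    lambda f: os.path.isfile(os.path.join(audio_dir, f"{f}.flac"))
)
annotations[exists].to_pickle("data/AudioSet/train_filtered.pkl")
print(f"kept {exists.sum()} of {len(annotations)} clips")
```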
Audio files for the datasets are expected in the following formats (a conversion sketch follows the list):

- AudioSet -> `.flac` (can be changed by editing Line 42 @ `datasets/audioloader_audioset.py`)
- VGG-Sound -> `.wav` (can be changed by editing Line 41 @ `datasets/audioloader_vggsound.py`)
- EPIC-KITCHENS-100 -> a single `.hdf5` file containing the entire dataset (see Auditory SlowFast for more info about creating the file)
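If your files were downloaded in a different format, here is a minimal conversion sketch using `librosa` together with `soundfile` (installed as a `librosa` dependency); the directory layout is an assumption:

```python
# Sketch: convert downloaded .wav files to the .flac format expected
# by the AudioSet loader. Source/destination paths are placeholders.
import pathlib
import librosa
import soundfile as sf

src_dir = pathlib.Path("/path/to/audioset/wav")
dst_dir = pathlib.Path("/path/to/audioset/audio")
dst_dir.mkdir(parents=True, exist_ok=True)

for wav in src_dir.glob("*.wav"):
    audio, sr = librosa.load(wav, sr=None)   # keep the original sample rate
    sf.write(dst_dir / f"{wav.stem}.flac", audio, sr)
```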
For training and testing you can run `python tools/run_net.py` with the following arguments, depending on the dataset (an example invocation is shown after the list):
- AudioSet
  - `--cfg`: The model configuration file to be used from `configs/AudioSet`.
  - `AUDIOSET.AUDIO_DATA_DIR`: The directory containing all AudioSet audio files.
  - `AUDIOSET.ANNOTATIONS_DIR`: The directory of the `train` and `test` splits in `.pkl` format. They can be found in `data/AudioSet`.
- VGG-Sound
  - `--cfg`: The model configuration file to be used from `configs/VGG-Sound`.
  - `VGGSOUND.AUDIO_DATA_DIR`: The directory containing all VGG-Sound audio files.
  - `VGGSOUND.ANNOTATIONS_DIR`: The directory of the `train` and `test` splits in `.pkl` format. They can be found in `data/VGG-Sound`.
- EPIC-KITCHENS-100
  - `--cfg`: The model configuration file to be used from `configs/EPIC-KITCHENS`.
  - `EPICKITCHENS.AUDIO_DATA_FILE`: The filepath of the EPIC-KITCHENS `.hdf5` data file.
  - `EPICKITCHENS.ANNOTATIONS_DIR`: The directory of the `train` and `test` splits in `.pkl` format. They can be found in `data/EPIC-KITCHENS`.
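For example, a training run on AudioSet might look like the following (the configuration filename is a placeholder; pick an actual file from `configs/AudioSet`):

```
python tools/run_net.py \
  --cfg configs/AudioSet/<config>.yaml \
  AUDIOSET.AUDIO_DATA_DIR /path/to/audioset/audio \
  AUDIOSET.ANNOTATIONS_DIR data/AudioSet
```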
The following notable arguments can also be used regardless of the dataset:

- `NUM_GPUS`: Explicitly defines the number of GPUs to be used.
- `OUTPUT_DIR`: The directory in which checkpoints and runs are saved.
- `TRAIN.CHECKPOINT_FILE_PATH`: The filepath of a checkpoint (either just the encoder, or both the encoder and decoder) used to initialize the model at the start of training.
For evaluation, the following should also be set (an example follows the list):

- `TRAIN.ENABLE`: Should be set to `False`.
- `TEST.ENABLE`: Should be set to `True`.
- `TEST.CHECKPOINT_FILE_PATH`: The filepath of a checkpoint (either just the encoder, or both the encoder and decoder) used to load the model during evaluation.
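For example, evaluating a checkpoint on VGG-Sound might look like this (the configuration filename and checkpoint path are placeholders):

```
python tools/run_net.py \
  --cfg configs/VGG-Sound/<config>.yaml \
  VGGSOUND.AUDIO_DATA_DIR /path/to/vggsound/audio \
  VGGSOUND.ANNOTATIONS_DIR data/VGG-Sound \
  TRAIN.ENABLE False \
  TEST.ENABLE True \
  TEST.CHECKPOINT_FILE_PATH /path/to/checkpoint
```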
This repository is built on top of auditory-slow-fast and PySlowFast, with the addition of batch-wise playbacks.
```
@article{stergiou2022playitback,
  title={Play It Back: Iterative Attention for Audio Recognition},
  author={Stergiou, Alexandros and Damen, Dima},
  journal={arXiv preprint},
  year={2022}
}
```
Apache 2.0