This repository provides a PyTorch implementation and pretrained models for SeLaVi, as described in the paper Labelling unlabelled videos from scratch with multi-modal self-supervision.
SeLaVi is an efficient and simple method for learning labels of multi-modal audio-visual data.
(1) Clustering does not come for free
Even very strong feature representations, such as a supervised-pretrained R(2+1)D-18 or a MIL-NCE S3D network, underperform our method, which learns the clusters.
(2) Truly multi-modal clustering yields robust clusters
Since our method treats each modality as an augmentation of the other, it learns to give stable predictions even if one modality is degraded.
We provide several baseline SeLaVi pretrained models with an R(2+1)D-18 video and ResNet-9 audio architecture, in torchvision format, trained on different datasets.
Method | Dataset | Clusters | Setting | Heads | NMI | Accuracy | url |
---|---|---|---|---|---|---|---|
SeLaVi | AVE | 28 | MA, G, DH | 10 | 66.2% | 57.9% | model |
SeLaVi | Kinetics-Sound | 32 | MA, G, DH | 10 | 47.5% | 41.2% | model |
SeLaVi | Kinetics | 400 | MA, G, DH | 10 | 27.1% | 7.8% | model |
SeLaVi | VGG-Sound | 309 | MA, G, DH | 10 | 55.9% | 31.0% | model |
MA = Modality Alignment, G = Gaussian Marginals, DH = Decorrelated Heads (see paper for details)
Model | NMI | aNMI | aRI | Accuracy | Purity | HMDB-51 (3-fold) | UCF-101 (3-fold) |
---|---|---|---|---|---|---|---|
SeLaVi VGG-Sound | 54.6% | 52.0% | 20.6% | 30.9% | 36.2% | 55.1% (55.4, 54.8, 55.1) | 86.1% (86.0, 85.9, 86.5) |
You can download the csv files for our clusters here: VGG-Sound, Kinetics. Note: as everywhere in the paper, we only take a single crop in space and time to generate these.
To interactively visualize the clusters we obtain for Kinetics and VGG-Sound, as we do on our homepage, run:
python3 cluster_vis/get_clusters_vggsounds.py --ckpt_path ${VGG_SOUND_CKPT_PATH};
python3 cluster_vis/get_clusters_kinetics.py --ckpt_path ${KINETICS_CKPT_PATH};
cd cluster_vis;
python3 preprocess.py --kinetics_path selavi_kinetics.pkl --vgg_sound_path selavi_vgg_sounds.pkl
# open index.html in your browser
This repo was tested with Ubuntu 16.04.5 LTS, Python 3.7.5, PyTorch 1.3.1, Torchvision 0.4.1, and CUDA 10.0.
- Install required packages using
  conda env create -f environment.yml
- Activate conda environment using
  conda activate lab_vid
- Ensure pre-training datasets (VGG-Sound, Kinetics, AVE) are pre-processed such that the folder structure is in the form:
  {dataset_name}/{train,val,test}/{class_name}/{video_name}.mp4
N.B. Kinetics-Sound is a subset of Kinetics.
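Before launching a long pre-training run, it can help to check that a dataset folder follows the layout above. The snippet below is a minimal sketch and is not part of this repository: the helper name `find_misplaced_videos` is hypothetical, and it only checks the path shape ({train,val,test}/{class_name}/{video_name}.mp4), not the video contents.

```python
from pathlib import Path

# Split names taken from the folder structure described above.
SPLITS = {"train", "val", "test"}


def find_misplaced_videos(dataset_root):
    """Return .mp4 paths (relative to dataset_root) that do not match
    {train,val,test}/{class_name}/{video_name}.mp4.

    Hypothetical helper for illustration; not part of the SeLaVi code.
    """
    root = Path(dataset_root)
    misplaced = []
    for video in sorted(root.rglob("*.mp4")):
        parts = video.relative_to(root).parts
        # Expect exactly three components: split, class name, file name.
        if len(parts) != 3 or parts[0] not in SPLITS:
            misplaced.append(video.relative_to(root).as_posix())
    return misplaced
```

An empty list means every `.mp4` file under the root sits at the expected depth inside one of the three split folders.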
SeLaVi is very simple to implement and experiment with. Our implementation consists of a main.py file that imports the following: the dataset definition dataset/AVideoDataset.py, the model architecture model.py, the Sinkhorn-Knopp algorithm src/sk_utils.py, and some miscellaneous training utilities utils.py.
For example, to train the SeLaVi baseline on a single node with 8 GPUs for 200 epochs on VGG-Sound, run:
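For readers unfamiliar with Sinkhorn-Knopp, the following is a minimal pure-Python sketch of the generic algorithm, not the code in src/sk_utils.py. It assumes a strictly positive matrix of per-sample cluster probabilities and, for simplicity, a uniform target for the cluster sizes; the Gaussian marginals (G) setting listed in the table above is not implemented here. The function name and signature are illustrative only.

```python
def sinkhorn_knopp(probs, n_iters=100):
    """Rescale a positive N x K matrix so that every row sums to 1 and
    every column sums (approximately) to N / K.

    Illustrative sketch with a uniform column marginal; not the
    implementation used by SeLaVi.
    """
    n, k = len(probs), len(probs[0])
    q = [row[:] for row in probs]  # work on a copy
    col_target = n / k
    for _ in range(n_iters):
        # Column step: scale each column to the target cluster size.
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] * col_target / col_sums[j] for j in range(k)]
             for i in range(n)]
        # Row step: make each sample's assignment a distribution.
        q = [[v / sum(row) for v in row] for row in q]
    return q


def hard_labels(q):
    """Pick the most likely cluster for every sample."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]
```

Alternating the two normalisation steps converges for strictly positive inputs; the balanced matrix can then be turned into pseudo-labels with `hard_labels`.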
python -m torch.distributed.launch --nproc_per_node=8 main.py \
--root_dir /path/to/VGGSound \
--epochs 200 \
--batch_size 16 \
--base_lr 1e-2 \
--ds_name vgg_sound \
--use_mlp True \
--mlp_dim 309 \
--headcount 10 \
--match True \
--distribution gauss \
--ind_groups 2
Distributed training is available via Slurm. We provide a customizable SBATCH script to reproduce our SeLaVi models. For example, to train SeLaVi on 8 nodes and 64 GPUs with a batch size of 1024 for 200 epochs, run:
sbatch ./scripts/master.sh
Note that you might need to remove the copyright header from the sbatch file to launch it.
Set up the dist_url parameter: We refer the user to the PyTorch distributed documentation (env or file or tcp) for setting the distributed initialization method (parameter dist_url) correctly. In the provided sbatch files, we use the tcp init method (see * for example).
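As a rough illustration of the three initialization methods mentioned above, the sketch below builds the corresponding URL strings in the forms documented for `torch.distributed.init_process_group` (`env://`, `tcp://host:port`, `file://path`). The helper `make_dist_url` is hypothetical and not part of this repository, and the host, port, and path values are placeholders; how main.py consumes the resulting string is not shown here.

```python
def make_dist_url(method, host=None, port=None, path=None):
    """Build an init-method URL for PyTorch distributed training.

    Hypothetical helper for illustration only.
    """
    if method == "env":
        # Rank, world size, and master address come from environment variables.
        return "env://"
    if method == "tcp":
        if host is None or port is None:
            raise ValueError("tcp init needs the rank-0 host and a free port")
        return f"tcp://{host}:{port}"
    if method == "file":
        if path is None:
            raise ValueError("file init needs a path on a shared filesystem")
        return f"file://{path}"
    raise ValueError(f"unknown init method: {method!r}")


# Placeholder address and port; replace with the rank-0 node of your job.
example_url = make_dist_url("tcp", host="10.1.1.20", port=23456)
```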
To evaluate the clustering quality of SeLaVi pretraining:
python3 get_clusters.py \
--dataset {vggsound, kinetics, ave, kinetics_sound} \
--root_dir /path/to/dataset \
--weights_path ${WEIGHTS_PATH} \
--output_dir ${OUTPUT_DIR} \
--exp_desc ${EXP_DESC} \
--mode train \
--headcount ${HEADCOUNT}
python3 clustering_metircs.py \
--path ${OUTPUT_DIR}/${EXP_DESC}.pkl \
--ncentroids ${NUM_CLS}
# Set NUM_CLS={kinetics: 400, ave: 28, vggsound: 309, kinetics_sound: 32}
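To make two of the reported metrics concrete, here is a minimal standard-library sketch of NMI and purity computed from ground-truth labels and predicted cluster ids. It is not the repository's metrics script, and it assumes NMI is normalised by the arithmetic mean of the two entropies, which may differ from the convention used for the numbers in the tables above.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(labels_true, labels_pred):
    """Normalised mutual information with arithmetic-mean normalisation
    (an assumption of this sketch)."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    count_t, count_p = Counter(labels_true), Counter(labels_pred)
    mi = sum((c / n) * math.log(c * n / (count_t[t] * count_p[p]))
             for (t, p), c in joint.items())
    denom = (entropy(labels_true) + entropy(labels_pred)) / 2
    return mi / denom if denom > 0 else 1.0


def purity(labels_true, labels_pred):
    """Fraction of samples that carry the majority label of their cluster."""
    majority = {}
    for (p, _t), c in Counter(zip(labels_pred, labels_true)).items():
        majority[p] = max(majority.get(p, 0), c)
    return sum(majority.values()) / len(labels_true)
```

Both metrics are invariant to how the cluster ids are numbered, so no matching between clusters and classes is needed before computing them.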
To evaluate SeLaVi pretraining on video action recognition:
python3 finetune_video.py \
--dataset {ucf101, hmdb51} \
--root_dir /path/to/dataset \
--fold {1,2,3} \
--batch_size 32 \
--workers 10 \
--weights_path ${WEIGHTS_PATH} \
--output_dir ${OUTPUT_DIR} \
--num_clusters ${NUM_CLUSTERS}
To evaluate SeLaVi pretraining on video action retrieval:
python3 video_retrieval.py \
--dataset {ucf101, hmdb51} \
--root_dir /path/to/dataset \
--fold {1,2,3} \
--batch_size 32 \
--workers 10 \
--weights_path ${WEIGHTS_PATH} \
--output_dir ${OUTPUT_DIR}
To plot distributions using the provided script:
python3 plot_distributions.py
If you find this repository useful in your research, please cite:
@inproceedings{asano2020labelling,
title={Labelling unlabelled videos from scratch with multi-modal self-supervision},
author={Yuki M. Asano and Mandela Patrick and Christian Rupprecht and Andrea Vedaldi},
year={2020},
booktitle={NeurIPS}
}