Skip to content

Self-Supervised Learning by Cross-Modal Audio-Video Clustering (NeurIPS 2020)

License

Notifications You must be signed in to change notification settings

HumamAlwassel/XDC

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PWC PWC PWC PWC

Self-Supervised Learning by Cross-Modal Audio-Video Clustering

[Paper] [Project Website]

This repository holds the pretrained models for the Cross-Modal Deep Clustering (XDC) method presented as a spotlight in NeurIPS 2020.

Self-Supervised Learning by Cross-Modal Audio-Video Clustering. Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, Du Tran. In NeurIPS, 2020.

Load Pretrained Models

We provide the following pretrained R(2+1)D-18 video models. We report the average top-1 video-level accuracy over all splits on UCF101 and HMDB51 after full-finetuning.

Pretraining Name Description UCF101 HMDB51 Weights
r2plus1d_18_xdc_ig65m_kinetics XDC pretrained on IG-Kinetics 95.5 68.9 [PyTorch] [Caffe2]
r2plus1d_18_xdc_ig65m_random XDC pretrained on IG-Random 94.6 66.5 [PyTorch] [Caffe2]
r2plus1d_18_xdc_audioset XDC pretrained on AudioSet 93.0 63.7 [PyTorch] [Caffe2]
r2plus1d_18_fs_kinetics fully-supervised pretraining on Kinetics 94.2 65.1 [PyTorch] [Caffe2]
r2plus1d_18_fs_imagenet fully-supervised pretraining on ImageNet 84.0 48.1 [PyTorch] [Caffe2]

There are two ways to load the XDC pretrained models in PyTorch: (1) via PyTorch Hub or (2) via source code.

Via PyTorch Hub (Recommended)

⚠️ [Known Issue] Using this way to load XDC models breaks for torchvision v0.13 or higher due to backward incompatible changes introduced in torchvision. Please make sure to use trochvision v0.12 or earlier when loading XDC models via the torch.hub.load() API. Loading models via source code still works as expected.

You can load all our pretrained models using torch.hub.load() API.

import torch

model = torch.hub.load('HumamAlwassel/XDC', 'xdc_video_encoder', 
                        pretraining='r2plus1d_18_xdc_ig65m_kinetics',
                        num_classes=42)

Use the parameter pretraining to specify the pretrained model to load from the table above (default pretrained model is r2plus1d_18_xdc_ig65m_kinetics). Pretrained weights of all layers except the FC classifier layer are loaded. The FC layer (of size 512 x num_classes) is randomly-initialized. Specify the keyword argument num_classes based on your application (default is 400). Run print(torch.hub.help('HumamAlwassel/XDC', 'xdc_video_encoder')) for the model documentation. Learn more about PyTorch Hub here.

Via Source Code

Clone this repo and create the conda environment.

git clone https://github.com/HumamAlwassel/XDC.git
cd XDC
conda env create -f environment.yml
conda activate xdc

Load the pretrained models from the file xdc.py.

from xdc import xdc_video_encoder

model = xdc_video_encoder(pretraining='r2plus1d_18_xdc_ig65m_kinetics',
                          num_classes=42)

Feature Extraction and Model Finetuning

Please refer to the Facebook Video Model Zoo (VMZ) repo for PyTorch/Caffe2 scripts for feature extraction and model finetuning on datasets such as UCF101 and HMDB51.

Please cite this work if you find XDC useful for your research.

@inproceedings{alwassel_2020_xdc,
  title={Self-Supervised Learning by Cross-Modal Audio-Video Clustering},
  author={Alwassel, Humam and Mahajan, Dhruv and Korbar, Bruno and 
          Torresani, Lorenzo and Ghanem, Bernard and Tran, Du},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2020}
}

About

Self-Supervised Learning by Cross-Modal Audio-Video Clustering (NeurIPS 2020)

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages