Source code for the paper "Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion".
The dataset lists, some visualizations, and the pretrained weights are still being prepared.
Video datasets are usually scene-dominated. We propose to decouple the scene and the motion (DSM) with two simple operations, so that the model pays more attention to motion information.
The generated triplet is shown below:
With DSM pretraining, the model learns to focus on the motion region (not necessarily the actor) without a single label.
Please refer to dataset.md for details.
- Python3
- PyTorch 1.1+
- PIL
- Intel (on-the-fly decoding)
- datasets
- list
- hmdb51: the train/val lists of HMDB51
- ucf101: the train/val lists of UCF101
- kinetics-400: the train/val lists of kinetics-400
- diving48: the train/val lists of diving48
- list
- experiments
- logs: detailed experiment records
- gradientes: gradient checks
- visualization:
- src
- data: load data
- loss: the losses evaluated in this paper
- model: network architectures
- scripts: train/eval scripts
- augment: detailed implementation of the spatio-temporal augmentation
- utils
- feature_extract.py: feature extractor for a given pretrained model
- main.py: the main function for fine-tuning
- trainer.py
- option.py
- pt.py: self-supervised pretraining
- ft.py: supervised fine-tuning
bash scripts/kinetics/pt.sh
bash scripts/ucf101/pt.sh
bash scripts/hmdb51/ft.sh
bash scripts/ucf101/ft.sh
bash scripts/kinetics/ft.sh
Following the common practice of TSN and Non-local, the final video-level result is the average over 10 sampled temporal windows with corner crops, which gives better results than clip-level evaluation. Refer to test.py for details.
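As a rough illustration of the video-level averaging described above, the sketch below averages per-clip class scores into a single prediction. It is not taken from test.py: the helper name `video_level_prediction`, the number of views, and the toy scores are all assumptions made for the example.

```python
import numpy as np


def video_level_prediction(clip_scores):
    """Average per-clip class scores into one video-level prediction.

    clip_scores has shape [num_views, num_classes], with one row per
    (temporal window, crop) pair. Hypothetical helper, not from test.py.
    """
    clip_scores = np.asarray(clip_scores, dtype=np.float64)
    video_scores = clip_scores.mean(axis=0)  # average over all views
    return int(video_scores.argmax()), video_scores


# Toy input: 30 views (assumed), 5 classes. Class 2 scores 0.6 in every view.
scores = np.full((30, 5), 0.1)
scores[:, 2] = 0.6
# In 5 views class 0 scores higher than class 2, so those clips alone
# would be misclassified at clip level.
scores[:5, 0] = 0.9

label, avg = video_level_prediction(scores)
print(label, avg)
```

In this toy case five individual clips would predict class 0, but the averaged scores still select class 2, which is the motivation for reporting video-level rather than clip-level accuracy.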
bash scripts/hmdb51/pt_and_ft_hmdb51.sh
Note: more training options and ablation studies can be found in scripts.
As STCR can easily be extended to other video representation tasks, we provide a script for feature extraction.
python feature_extractor.py
The features are saved as a single NumPy file of shape [video_nums, features_dim] for further visualization.
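A minimal sketch of reading such a file back, assuming it was written with `np.save`. The file name `features.npy` and the stand-in array are hypothetical; the real output path depends on the extractor script.

```python
import os
import tempfile

import numpy as np

# Stand-in for the extractor output: 4 videos, 8-dimensional features.
features = np.arange(32, dtype=np.float32).reshape(4, 8)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.npy")  # hypothetical file name
    np.save(path, features)
    loaded = np.load(path)

video_nums, features_dim = loaded.shape
print(video_nums, features_dim)

# L2-normalize each row so that dot products become cosine similarities,
# which is convenient for visualization or nearest-neighbor search.
norms = np.linalg.norm(loaded, axis=1, keepdims=True)
normalized = loaded / np.maximum(norms, 1e-12)
```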
Modify lines 60-62 in reterival.py.
python reterival.py
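For readers unfamiliar with the @k metric, the sketch below shows one common way to compute it from extracted features: a query counts as a hit at k if any of its k nearest gallery items (by cosine similarity) shares its class. This is an assumption about the protocol, not a copy of reterival.py, and the function name `recall_at_k` and the toy data are hypothetical.

```python
import numpy as np


def recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels, ks=(1, 5)):
    """Fraction of queries with a same-class item among the top-k neighbors."""
    query_labels = np.asarray(query_labels)
    gallery_labels = np.asarray(gallery_labels)
    # Cosine similarity between every query and every gallery feature.
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sim = q @ g.T
    # Gallery indices sorted from most to least similar, per query.
    order = np.argsort(-sim, axis=1)
    hits = gallery_labels[order] == query_labels[:, None]
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}


# Toy data: 4 gallery items and 2 queries in a 2-D feature space.
gallery = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
gallery_y = [0, 1, 1, 0]
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
queries_y = [0, 0]

result = recall_at_k(queries, queries_y, gallery, gallery_y, ks=(1, 2))
print(result)
```

Here the first query finds a same-class neighbor at rank 1 and the second only at rank 2, so the toy recall is 0.5 at k=1 and 1.0 at k=2.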
Method | UCF101 | HMDB51 |
---|---|---|
Random Initialization | 47.9 | 29.6 |
MoCo Baseline | 62.3 | 36.5 |
DSM(Triplet) | 70.7 | 48.5 |
DSM | 74.8 | 52.5 |
Method | @1 | @5 | @10 | @20 | @50 |
---|---|---|---|---|---|
DSM | 16.8 | 33.4 | 43.4 | 54.6 | 70.7 |
Method | @1 | @5 | @10 | @20 | @50 |
---|---|---|---|---|---|
DSM | 8.2 | 25.9 | 38.1 | 52.0 | 75.0 |
This work is partly based on STN, UEL and MoCo.
If you use our code in your research or wish to refer to the baseline results, please use the following BibTeX entry.
@inproceedings{wang2021enhancing,
title={Enhancing unsupervised video representation learning by decoupling the scene and the motion},
author={Wang, Jinpeng and Gao, Yuting and Li, Ke and Hu, Jianguo and Jiang, Xinyang and Guo, Xiaowei and Ji, Rongrong and Sun, Xing},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={35},
number={11},
pages={10129--10137},
year={2021}
}