This repository provides the official code for the CVPR'19 paper Collaborative Spatiotemporal Feature Learning for Video Action Recognition. The models introduced in the paper were originally implemented in TensorFlow; here we reimplement them in PyTorch on top of the MMAction2 framework.
Follow the instructions below to set up the Python environment.
conda create -n cost python=3.9 -y
conda activate cost
conda install pytorch=1.12.0 torchvision=0.13.0 cudatoolkit=11.3 -c pytorch -y
pip install mmcv-full==1.6.1 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.12.0/index.html
pip install mmaction2==0.24.1
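As a quick sanity check (not part of the original setup steps), the following one-liner should print the installed library versions and whether CUDA is available:

python -c "import torch, mmcv, mmaction; print(torch.__version__, mmcv.__version__, mmaction.__version__, torch.cuda.is_available())"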
Use the scripts in tools/data to download and set up the Kinetics-400 and Moments in Time datasets. In our experiments, we download the Kinetics-400 dataset from cvdfoundation/kinetics-dataset; after filtering out corrupted videos, we obtain 241,181 training videos and 19,877 validation videos. For Moments in Time, we use the v1 release with 802,244 training videos and 33,900 validation videos from 339 categories. Place the datasets under data/ with the following structure (the list-file format is described after the tree).
data
├── kinetics400
│   ├── annotations
│   │   ├── kinetics_test.csv
│   │   ├── kinetics_train.csv
│   │   └── kinetics_val.csv
│   ├── kinetics400_train_list_videos.txt
│   ├── kinetics400_val_list_videos.txt
│   ├── videos_train
│   └── videos_val
└── mit
    ├── annotations
    │   ├── license.txt
    │   ├── moments_categories.txt
    │   ├── README.txt
    │   ├── trainingSet.csv
    │   └── validationSet.csv
    ├── mit_train_list_videos.txt
    ├── mit_val_list_videos.txt
    └── videos
        ├── training
        └── validation
You can use the following command to train a model.
./tools/run.sh ${CONFIG_FILE} ${GPU_IDS} ${SEED}
Example: train the CoST(b) model on Moments in Time using 8 GPUs with seed 0.
./tools/run.sh configs/cost/costb_r50_8x8x1_48e_mit_rgb.py 0,1,2,3,4,5,6,7 0
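Following standard MMAction2 behavior, logs and checkpoints are written under work_dirs/ in a subdirectory named after the config, e.g. work_dirs/costb_r50_8x8x1_48e_mit_rgb/ for the command above; the test example below loads its best checkpoint from there.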
You can use the following command to test a model.
./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${NUM_GPUS} [optional arguments]
Example: test the CoST(b) model on Moments in Time using 4 GPUs.
./tools/dist_test.sh configs/cost/costb_r50_8x8x1_48e_mit_rgb.py \
work_dirs/costb_r50_8x8x1_48e_mit_rgb/best_top1_acc_epoch_48.pth \
4 --eval top_k_accuracy
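Optional arguments are forwarded to MMAction2's tools/test.py. For instance, under standard MMAction2 0.x behavior (not specific to this repository), appending --out results.pkl additionally dumps the per-video predictions to a pickle file.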
All the following models are trained on 8 RTX 3090 GPUs with a seed of 0. We report top-1/top-5 accuracy on the validation sets of Kinetics-400 and Moments in Time. The 1-clip and 10-clip settings correspond to the val_pipeline and test_pipeline in the configuration files, respectively; an illustrative sketch of the two pipelines follows. Note that we achieve higher accuracy on Moments in Time and lower accuracy on Kinetics-400 than reported in the paper, which may be due to minor differences in implementation details (e.g. video preprocessing).
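For reference, the two settings typically differ only in how many clips are sampled per video. The snippet below is an illustrative sketch of what such pipelines look like in MMAction2 0.x configs; the exact values are defined in the files under configs/cost/, which are authoritative.

# Illustrative sketch only; see configs/cost/*.py for the actual pipelines.
val_pipeline = [  # 1 clip per video (the "1 clip" column)
    dict(type='DecordInit'),
    dict(type='SampleFrames', clip_len=8, frame_interval=8, num_clips=1, test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='CenterCrop', crop_size=224),
    # Normalize / FormatShape / Collect steps omitted for brevity
]
test_pipeline = [  # 10 clips per video (the "10 clips" column), otherwise identical
    dict(type='DecordInit'),
    dict(type='SampleFrames', clip_len=8, frame_interval=8, num_clips=10, test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='CenterCrop', crop_size=224),
    # Normalize / FormatShape / Collect steps omitted for brevity
]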
Results on Kinetics-400:

Method | Backbone | Config | Input Size | Our Acc (1 clip) | Our Acc (10 clips) | Paper Acc (10 clips) |
---|---|---|---|---|---|---|
C2D | ResNet-50 | c2d_r50_8x8x1_160e_kinetics400_rgb.py | 8x224x224 | 66.03/85.63 | 72.28/89.99 | 71.5/89.8 |
I3D¹ | ResNet-50 | i3d_r50_8x8x1_160e_kinetics400_rgb.py | 8x224x224 | 66.86/86.17 | 73.40/90.91 | 73.3/90.4 |
CoST(a) | ResNet-50 | costa_r50_8x8x1_160e_kinetics400_rgb.py | 8x224x224 | 66.20/86.13 | 72.89/90.46 | 73.6/90.8 |
CoST(b) | ResNet-50 | costb_r50_8x8x1_160e_kinetics400_rgb.py | 8x224x224 | 66.90/86.25 | 73.85/90.98 | 74.1/91.2 |
Results on Moments in Time:

Method | Backbone | Config | Input Size | Our Acc (1 clip) | Our Acc (10 clips) | Paper Acc (10 clips) |
---|---|---|---|---|---|---|
C2D | ResNet-50 | c2d_r50_8x8x1_48e_mit_rgb.py | 8x224x224 | 28.21/54.65 | 30.17/56.75 | 27.9/54.6 |
I3D¹ | ResNet-50 | i3d_r50_8x8x1_48e_mit_rgb.py | 8x224x224 | 29.27/56.13 | 31.04/58.28 | 29.0/55.3 |
CoST(a) | ResNet-50 | costa_r50_8x8x1_48e_mit_rgb.py | 8x224x224 | 29.10/55.73 | 30.80/57.94 | 29.3/55.8 |
CoST(b) | ResNet-50 | costb_r50_8x8x1_48e_mit_rgb.py | 8x224x224 | 30.23/57.41 | 32.23/59.84 | 30.1/57.2 |
¹ The C3D model described in the paper is essentially a variant of I3D.
This project is released under the Apache 2.0 license. The scripts in tools/ are adapted from MMAction2 and are covered by the original license.
@inproceedings{li2019collaborative,
title={Collaborative Spatiotemporal Feature Learning for Video Action Recognition},
author={Li, Chao and Zhong, Qiaoyong and Xie, Di and Pu, Shiliang},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={7872--7881},
year={2019}
}
CoST is built upon MMAction2. We appreciate all contributors to this excellent framework.