This repository contains PyTorch implementations of state-of-the-art (SOTA) video captioning models from 2015 to 2020 on the MSVD and MSRVTT datasets. Details are given in the table below.
Model | Datasets | Paper name | Year | Status | Remarks |
---|---|---|---|---|---|
Mean Pooling | MSVD, MSRVTT | Translating videos to natural language using deep recurrent neural networks[1] | 2015 | Implemented | No temporal modeling |
S2VT | MSVD, MSRVTT | Sequence to Sequence - Video to Text[2] | 2015 | Implemented | Single LSTM acts as both encoder and decoder |
SA-LSTM | MSVD, MSRVTT | Describing videos by exploiting temporal structure[3] | 2015 | Implemented | Good baseline with temporal attention (see the sketch below) |
RecNet | MSVD, MSRVTT | Reconstruction Network for Video Captioning[4] | 2018 | Implemented | Results did not improve over SA-LSTM with either the global or the local reconstruction loss |
MARN | MSVD, MSRVTT | Memory-Attended Recurrent Network for Video Captioning[5] | 2019 | Implemented | Memory requirement grows linearly with vocabulary size |
ORG-TRL | MSVD, MSRVTT | Object Relational Graph with Teacher-Recommended Learning for Video Captioning[6] | 2020 | In progress | Leverages a GCN for object-relational features |
*More recent models will be added in the future.
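The main architectural difference between the earliest baselines above is how the per-frame features are aggregated before decoding: Mean Pooling simply averages them, while SA-LSTM re-weights them at every decoding step with soft temporal attention. The snippet below is not code from this repository, only a minimal PyTorch sketch of the two aggregation strategies; the module names, tensor shapes, and attention size are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanPoolEncoder(nn.Module):
    """Average the per-frame features into a single video vector (no temporal modeling)."""
    def forward(self, frame_feats):            # (batch, n_frames, feat_dim)
        return frame_feats.mean(dim=1)         # (batch, feat_dim)

class TemporalAttention(nn.Module):
    """SA-LSTM-style soft attention: re-weight frame features using the decoder state."""
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, frame_feats, dec_hidden):
        # frame_feats: (batch, n_frames, feat_dim), dec_hidden: (batch, hidden_dim)
        energy = torch.tanh(self.w_feat(frame_feats) + self.w_hidden(dec_hidden).unsqueeze(1))
        alpha = F.softmax(self.v(energy).squeeze(-1), dim=1)      # (batch, n_frames)
        context = (alpha.unsqueeze(-1) * frame_feats).sum(dim=1)  # (batch, feat_dim)
        return context, alpha

if __name__ == "__main__":
    feats = torch.randn(2, 28, 1536)                 # e.g. 28 Inception-v4 frame features
    print(MeanPoolEncoder()(feats).shape)            # torch.Size([2, 1536])
    ctx, _ = TemporalAttention(1536, 512)(feats, torch.randn(2, 512))
    print(ctx.shape)                                 # torch.Size([2, 1536])
```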
- Ubuntu 18.04
- CUDA 11.0
- Nvidia GeForce RTX 2080Ti
- Java 8
- Python 3.8.5
- PyTorch 1.7.0
- Other Python libraries specified in requirements.txt
$ virtualenv .env
$ source .env/bin/activate
(.env) $ pip install --upgrade pip
(.env) $ pip install -r requirements.txt
Extract features from the network you want to use and place them at <PROJECT ROOT>/<DATASET>/features/<DATASET>_APPEARANCE_<NETWORK>_<FRAME_LENGTH>.hdf5. To extract the features yourself, follow the repository here, or simply download the already extracted features from the table below and place them in <PROJECT ROOT>/<DATASET>/features/ (a short example of reading these files follows the table).
Dataset | Feature Type | Inception-v4 | InceptionResNetV2 | ResNet-101 | ResNeXt-101 |
---|---|---|---|---|---|
MSVD | Appearance | link | link | link | - |
MSR-VTT | Appearance | link | link | link | - |
MSVD | Motion | - | - | - | link |
MSR-VTT | Motion | - | - | - | link |
MSVD | Object | - | - | link | - |
MSR-VTT | Object | - | - | link | - |
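Once the HDF5 files are in place, they can be inspected with h5py. The snippet below is a hedged sketch rather than code from this repository: the file name is just an example following the naming scheme above, and the internal layout (one dataset per video ID) is an assumption, so adapt the keys to what h5py actually reports.

```python
import h5py

# Hypothetical path following <DATASET>_APPEARANCE_<NETWORK>_<FRAME_LENGTH>.hdf5;
# the network name and frame length here are only examples.
feature_file = "MSVD/features/MSVD_APPEARANCE_INCEPTIONV4_28.hdf5"

with h5py.File(feature_file, "r") as f:
    # List whatever keys the file contains (assumed to be one dataset per video).
    video_ids = list(f.keys())
    print(f"{len(video_ids)} videos, first key: {video_ids[0]}")

    # Assumed layout: each dataset is an (n_frames, feat_dim) float array.
    feats = f[video_ids[0]][()]
    print("feature shape:", feats.shape)
```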
You can change hyperparameters by modifying config.py.
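As a rough idea of what such a configuration file typically groups together, here is a hypothetical excerpt; the attribute names and values below are assumptions, and the actual contents of config.py may differ.

```python
# Hypothetical excerpt of the kind of settings a config.py usually collects;
# the real attribute names and defaults in this repository may differ.
class TrainConfig:
    dataset = "MSVD"              # or "MSRVTT"
    feature_type = "APPEARANCE"   # which HDF5 features to load
    batch_size = 64
    hidden_size = 512
    embedding_size = 512
    learning_rate = 1e-4
    max_epochs = 50
    max_caption_length = 20
```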
Clone the evaluation code from the official coco-caption repo:
(.env) $ git clone https://github.com/tylin/coco-caption.git
(.env) $ mv coco-caption/pycocoevalcap .
(.env) $ rm -rf coco-caption
Or simply copy the pycocoevalcap folder and its contents into the project root.
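With pycocoevalcap in the project root, the individual scorers can also be called directly. A minimal sketch is shown below (the captions are toy examples; METEOR needs the Java 8 dependency listed above, and both dicts map a video ID to a list of tokenized caption strings):

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# Toy example: reference captions (gts) and generated captions (res), keyed by video ID.
gts = {"vid1": ["a man is playing a guitar", "a person plays the guitar"]}
res = {"vid1": ["a man plays a guitar"]}

scorers = [
    ("BLEU", Bleu(4)),       # returns a list of BLEU-1..BLEU-4 scores
    ("METEOR", Meteor()),    # requires Java on the PATH
    ("ROUGE_L", Rouge()),
    ("CIDEr", Cider()),
]

for name, scorer in scorers:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)
```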
For both training and inference, follow the demo given in video_captioning.ipynb.
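At inference time a caption is decoded token by token. The loop below is only an illustrative sketch of greedy decoding with a generic LSTM decoder, not the notebook's actual code; the decoder interface, feature dimensions, and special-token IDs are all assumptions.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Tiny LSTM decoder, defined here only so the greedy-decoding loop below is runnable."""
    def __init__(self, vocab_size, feat_dim=1536, hidden_dim=512, embed_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(feat_dim, hidden_dim)
        self.lstm_cell = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def init_state(self, video_feat):                 # (batch, feat_dim) pooled video feature
        h = torch.tanh(self.init_h(video_feat))
        return h, torch.zeros_like(h)

    def step(self, token, state):                     # one decoding step
        h, c = self.lstm_cell(self.embed(token), state)
        return self.out(h), (h, c)

def greedy_decode(decoder, video_feat, bos_id=1, eos_id=2, max_len=20):
    """Pick the most likely word at each step until <eos> or max_len (assumed token IDs)."""
    state = decoder.init_state(video_feat)
    token = torch.full((video_feat.size(0),), bos_id, dtype=torch.long)
    caption = []
    for _ in range(max_len):
        logits, state = decoder.step(token, state)
        token = logits.argmax(dim=-1)                 # greedy choice of the next word
        caption.append(token)
        if (token == eos_id).all():                   # stop once every caption emitted <eos>
            break
    return torch.stack(caption, dim=1)                # (batch, length) word indices

if __name__ == "__main__":
    dec = CaptionDecoder(vocab_size=1000)
    print(greedy_decode(dec, torch.randn(2, 1536)).shape)
```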
*MSVD
Model | Feature extractor | BLEU-4 | METEOR | ROUGE-L | CIDEr | Checkpoint |
---|---|---|---|---|---|---|
Mean Pooling | Inception-v4 | 42.4 | 31.6 | 68.3 | 71.8 | link |
SA-LSTM | InceptionResNetV2 | 45.5 | 32.5 | 69.0 | 78.0 | link |
S2VT | Inception-v4 | - | - | - | - | - |
RecNet (global) | Inception-v4 | - | - | - | - | - |
RecNet (local) | Inception-v4 | - | - | - | - | - |
MARN | Inception-v4, ResNeXt-101 | 48.5 | 34.4 | 71.4 | 86.4 | link |
ORG-TRL | InceptionResNetV2, ResNeXt-101 | - | - | - | - | - |
*MSRVTT
Model | Feature extractor | BLEU-4 | METEOR | ROUGE-L | CIDEr | Checkpoint |
---|---|---|---|---|---|---|
Mean Pooling | Inception-v4 | 34.9 | 25.5 | 58.12 | 35.76 | link |
SA-LSTM | Inception-v4 | - | - | - | - | - |
S2VT | Inception-v4 | - | - | - | - | - |
RecNet (global) | Inception-v4 | - | - | - | - | - |
RecNet (local) | Inception-v4 | - | - | - | - | - |
MARN | Inception-v4 | - | - | - | - | - |
ORG-TRL | InceptionResNetV2, ResNeXt-101 | - | - | - | - | - |
[1] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating Videos to Natural Language Using Deep Recurrent Neural Networks. In Proceedings of NAACL-HLT, 2015.
[2] S. Venugopalan, M. Rohrbach, J. Donahue, R. J. Mooney, T. Darrell, and K. Saenko. Sequence to Sequence - Video to Text. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[3] L. Yao et al. Describing Videos by Exploiting Temporal Structure. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[4] B. Wang et al. Reconstruction Network for Video Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[5] W. Pei, J. Zhang, X. Wang, L. Ke, X. Shen, and Y.-W. Tai. Memory-Attended Recurrent Network for Video Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[6] Z. Zhang, Y. Shi, C. Yuan, B. Li, P. Wang, W. Hu, and Z.-J. Zha. Object Relational Graph with Teacher-Recommended Learning for Video Captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
Some of the coding ideas were taken from hobincar/pytorch-video-feature-extractor. For pretrained appearance feature extraction I followed this repo, and this repo for 3D motion feature extraction. Many thanks!