This repository contains the TensorFlow implementation of our model "Semantically Sensible Video Captioning (SSVC)".
[Code] [Paper] [ArXiv]
Md. Mushfiqur Rahman, Thasin Abedin, Khondokar S. S. Prottoy, Ayana Moshruba, Fazlul Hasan Siddiqui
Install the following dependencies before running the model:
- TensorFlow 2.0
- tqdm
pip install tqdm
- sklearn
pip install -U scikit-learn
- nltk
pip install nltk
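As a quick sanity check after installing, a short script along these lines can report which of the listed packages Python cannot find. It is only a sketch: the helper name `find_missing` is made up for illustration, and it checks that a module can be located, not that its version matches (for example TensorFlow 2.0).

```python
import importlib.util

# Import names, not pip names: scikit-learn is imported as `sklearn`.
REQUIRED_MODULES = ["tensorflow", "tqdm", "sklearn", "nltk"]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Because `find_spec` only locates a module without importing it, the check stays fast even for TensorFlow.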
-root
  -glove.6B.100d.txt
  -MSVD_captions.csv
  -models_and_utils
    -models.py
    -utils.py
  -data_picle
    -train
      -filename1.pkl
      -filename2.pkl
      ...
    -test
      -filename1.pkl
      -filename2.pkl
      ...
    -validation
      -filename1.pkl
      -filename2.pkl
      ...
    -train.csv
    -test.csv
    -validation.csv
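Before running the notebooks, it can help to confirm that the expected files and folders are in place. The sketch below assumes that `models.py` and `utils.py` sit inside `models_and_utils` and that the `train`, `test`, and `validation` folders sit inside `data_picle`. The helper name `missing_entries` is hypothetical, and the split CSV files are deliberately not checked.

```python
from pathlib import Path

# Paths relative to the project root. The nesting is an assumption read off
# the directory listing; the three split CSV files are intentionally omitted.
EXPECTED = [
    "glove.6B.100d.txt",
    "MSVD_captions.csv",
    "models_and_utils/models.py",
    "models_and_utils/utils.py",
    "data_picle/train",
    "data_picle/test",
    "data_picle/validation",
]


def missing_entries(root):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    for rel in missing_entries("."):
        print("missing:", rel)
```

Run it from the project root. No output means every listed entry was found.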
- Download and extract 'glove.6B.100d.txt'.
- Download the MSVD dataset and create the corresponding pickle files using vid2frames.ipynb. Split the data into train, test, and validation sets. Alternate step: download and extract 'data_pickle.zip'. This compressed file already contains the pickle files of the MSVD dataset.
- Run the train.ipynb file.
This file has a detailed list of options. Change the options to adjust the model to your requirements.
- The training and evaluation code is inside the Python notebook.
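For readers who want to inspect the GloVe file from the first step outside the notebook, the sketch below reads it into a dictionary. It relies on the plain-text GloVe layout (one token per line followed by its space-separated vector components). The function name `load_glove` is illustrative and is not taken from `utils.py`.

```python
def load_glove(path, dim=100):
    """Read a GloVe text file into a {word: list of floats} dictionary.

    Lines that do not contain exactly `dim` vector components are skipped.
    """
    embeddings = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split(" ")
            if len(parts) != dim + 1:
                continue  # malformed or truncated line
            embeddings[parts[0]] = [float(value) for value in parts[1:]]
    return embeddings


if __name__ == "__main__":
    vectors = load_glove("glove.6B.100d.txt", dim=100)
    print(len(vectors), "word vectors loaded")
```

Plain Python lists keep the sketch free of extra dependencies. They can be converted to arrays or tensors before being used as an embedding matrix.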
SSVC: "A woman is cutting a piece of meat"
GT: "a woman is cutting into the fatty areas of a pork chop"
SS score: 1.0, BLEU1: 1.0, BLEU2: 1.0, BLEU3: 1.0, BLEU4: 1.0
SSVC: "A person is slicing tomato"
GT: "Someone wearing blue rubber gloves is slicing a tomato with a large knife"
SS score: 0.825, BLEU1: 1.0, BLEU2: 1.0, BLEU3: 1.0, BLEU4: 1.0
SSVC: "A woman is cutting a piece of meat"
GT: "a woman is cutting into the fatty areas of a pork chop"
SS score: 0.94, BLEU1: 1.0, BLEU2: 0.84, BLEU3: 0.61, BLEU4: 0.0
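To make the BLEU1 to BLEU4 columns easier to interpret, here is a plain-Python sketch of the standard cumulative BLEU definition: clipped n-gram precision, geometric mean, and a brevity penalty. It is not the notebook's evaluation code and is not expected to reproduce the scores listed above, whose exact evaluation setup is not described here. For real use, `nltk.translate.bleu_score.sentence_bleu` provides a library implementation. The function names below are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(hypothesis, references, n):
    """Clipped n-gram precision of the hypothesis against the references."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    # For each n-gram, keep its highest count in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())


def bleu(hypothesis, references, max_n):
    """Cumulative BLEU-max_n with uniform weights and no smoothing."""
    precisions = [modified_precision(hypothesis, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    hyp_len = len(hypothesis)
    # Reference length closest to the hypothesis length (shorter one on ties).
    ref_len = min((abs(len(ref) - hyp_len), len(ref)) for ref in references)[1]
    penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


if __name__ == "__main__":
    hyp = "a person is slicing tomato".split()
    refs = ["a person is slicing a tomato".split()]
    for n in range(1, 5):
        print(f"BLEU{n}:", round(bleu(hyp, refs, n), 3))
```

Without smoothing, a caption that shares no 4-gram with any reference gets a BLEU4 of 0.0 even when its lower-order scores are high.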
Please cite the following:
@article{rahman2021video,
title={Video captioning with stacked attention and semantic hard pull},
author={Rahman, Md Mushfiqur and Abedin, Thasin and Prottoy, Khondokar SS and Moshruba, Ayana and Siddiqui, Fazlul Hasan},
journal={PeerJ Computer Science},
volume={7},
pages={e664},
year={2021},
publisher={PeerJ Inc.}
}