The goal of this project was to tackle the problem of automatic caption generation for images of real world scenes. The work consisted of reimplementing the Neural Image Captioning (NIC) model proposed by Vinyals et al. and running appropriate experiments to test its performance.
The project was carried out as part of the ID2223 "Scalable Machine Learning and Deep Learning" course at KTH Royal Institute of Technology.
```bash
pip install -r requirements.txt
git clone https://github.com/cocodataset/cocoapi
cd cocoapi/PythonAPI/; make install; cd ../..
pip3 install http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp35-cp35m-linux_x86_64.whl
python fetch_data.py  # to also download the test set, run: python fetch_data.py --test
python train.py
python train.py --sample --checkpoint_file <your-checkpoint-file>
```
The implemented architecture was based on the publication "Show and Tell: A Neural Image Caption Generator" by Vinyals et al. [3].
Experiments were conducted using the Common Objects in Context (MS COCO) dataset. The following subsets were used:
- Training: 2014 Contest Train images [83K images/13GB]
- Validation: 2014 Contest Val images [41K images/6GB]
- Test: 2014 Contest Test images [41K images/6GB]
The NIC architecture consists of two models: an Encoder and a Decoder. The Encoder, a convolutional neural network, creates a (semantic) summary of the image in the form of a fixed-size vector. The Decoder, a recurrent neural network, generates the caption in natural language based on the summary vector created by the Encoder.
The goal of the project was to implement and train the NIC architecture and evaluate its performance. A secondary goal was to check how the type of recurrent unit and the size of the hidden state in the Decoder (language generator) affect the overall performance of the NIC model.
The Encoder was a ResNet-34 architecture with weights pre-trained on the ImageNet dataset. The output layer of the network was replaced with a new layer whose size is definable by the user. All weights except those of the last layer were frozen during training.
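A minimal PyTorch sketch of this setup (the class name and exact wiring are ours, not taken from the original code):

```python
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """ResNet-34 encoder producing a fixed-size image summary vector."""
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet34(pretrained=True)   # ImageNet weights
        for param in resnet.parameters():           # freeze all pre-trained weights
            param.requires_grad = False
        # Replace the final classification layer with a trainable projection
        # to the user-defined summary-vector size.
        resnet.fc = nn.Linear(resnet.fc.in_features, embed_size)
        self.resnet = resnet

    def forward(self, images):
        return self.resnet(images)  # shape: (batch, embed_size)
```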
The Decoder was a single-layer recurrent neural network. Three different recurrent units were tested: Elman, GRU, and LSTM [1], where Elman refers to the basic (vanilla) RNN architecture.
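A sketch of the decoder in the same spirit; following the NIC paper, the image summary vector is fed to the RNN as the first input step. The class and helper names are ours:

```python
import torch
import torch.nn as nn

RNN_UNITS = {'elman': nn.RNN, 'gru': nn.GRU, 'lstm': nn.LSTM}

class DecoderRNN(nn.Module):
    """Single-layer RNN language model over a fixed vocabulary."""
    def __init__(self, unit, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.rnn = RNN_UNITS[unit](embed_size, hidden_size,
                                   num_layers=1, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the image summary as the first "word" of the sequence.
        inputs = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        outputs, _ = self.rnn(inputs)
        return self.fc(outputs)                  # word logits at every step

    def sample(self, features, end_id, max_len=20):
        # Greedy decoding for a single image: feed the summary vector,
        # then the argmax word at each step, until <end> or max_len.
        inputs, state, ids = features.unsqueeze(1), None, []
        for _ in range(max_len):
            out, state = self.rnn(inputs, state)
            word = self.fc(out.squeeze(1)).argmax(dim=1)
            ids.append(word.item())
            if word.item() == end_id:
                break
            inputs = self.embed(word).unsqueeze(1)
        return ids
```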
Training parameters:
- Number of epochs: 3
- Batch size: 128 (3236 batches per epoch)
- Vocabulary size: 15,000 most popular words
- Embedding size: 512 (image summary vector, word embeddings)
- RNN hidden state size: 512 and 1024
- Learning rate: 1e-3, with LR decay every 2000 batches
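A minimal training-loop sketch consistent with these hyperparameters, wiring together the encoder/decoder sketches above. The optimizer choice (Adam) and the decay factor are assumptions, as the list above specifies neither:

```python
import torch
import torch.nn.functional as F

encoder = EncoderCNN(embed_size=512)
decoder = DecoderRNN('gru', vocab_size=15000, embed_size=512, hidden_size=1024)

# Only the decoder and the encoder's new output layer are trainable.
params = list(decoder.parameters()) + list(encoder.resnet.fc.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)   # optimizer choice is an assumption
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, step_size=2000, gamma=0.5)       # decay factor is an assumption

for epoch in range(3):                       # 3 epochs
    for images, captions in train_loader:    # 3236 batches of 128 (loader not shown)
        logits = decoder(encoder(images), captions)
        # The output at step t predicts caption word t (image fed at step 0).
        loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                               captions.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()                     # decay the LR every 2000 batches
```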
Models were implemented in Python using the PyTorch library, and were trained either locally or on rented AWS instances (both using GPUs).
Experiments were evaluated in a qualitative and quantitative manner. The qualitative evaluation, which we performed manually, assessed the coherence of the generated sequences and their relevance given the input image. The quantitative evaluation enabled comparison of the trained models with the reference models from the authors. The following metrics were used: BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-L, METEOR, and CIDEr.
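The official scores were computed with the MS COCO evaluation tools; as a lightweight illustration of how BLEU-n works on tokenised captions, NLTK's corpus_bleu can approximate the four BLEU metrics (the example sentences below are made up):

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of reference captions per image, plus the model's caption.
references = [[['a', 'man', 'riding', 'a', 'horse'],
               ['a', 'person', 'rides', 'a', 'horse']]]
hypotheses = [['a', 'man', 'on', 'a', 'horse']]

smooth = SmoothingFunction().method1  # avoids zero scores on short examples
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))   # uniform weights -> BLEU-n
    score = corpus_bleu(references, hypotheses,
                        weights=weights, smoothing_function=smooth)
    print('BLEU-%d: %.3f' % (n, score))
```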
Quantitative results are presented on the Validation and Test sets. Results obtained with the reimplemented models are compared with the results reported by the authors of the article.
**Validation Data**

| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
|---|---|---|---|---|---|---|---|
| Vinyals et al. (4k subset) | N/A | N/A | N/A | 27.7 | 23.7 | N/A | 85.5 |
| elman_512 | 62.5 | 43.2 | 29.1 | 19.8 | 19.5 | 45.6 | 57.7 |
| elman_1024 | 61.9 | 42.9 | 28.8 | 19.6 | 19.9 | 45.9 | 58.7 |
| gru_512 | 63.9 | 44.9 | 30.5 | 20.8 | 20.4 | 46.6 | 62.9 |
| gru_1024 | 64.0 | 45.3 | 31.2 | 21.5 | 21.1 | 47.1 | 66.1 |
| lstm_512 | 62.9 | 44.3 | 29.8 | 20.3 | 19.9 | 46.1 | 60.2 |
| lstm_1024 | 63.4 | 45.0 | 31.0 | 21.4 | 20.8 | 47.1 | 64.4 |
**Test Data**

| Model | BLEU-1 c5 | BLEU-1 c40 | BLEU-2 c5 | BLEU-2 c40 | BLEU-3 c5 | BLEU-3 c40 | BLEU-4 c5 | BLEU-4 c40 | METEOR c5 | METEOR c40 | ROUGE-L c5 | ROUGE-L c40 | CIDEr c5 | CIDEr c40 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vinyals et al. | 71.3 | 89.5 | 54.2 | 80.2 | 40.7 | 69.4 | 30.9 | 58.7 | 25.4 | 34.6 | 53.0 | 68.2 | 94.3 | 94.6 |
| elman_1024 | 61.8 | 79.9 | 42.8 | 66.2 | 28.7 | 51.9 | 19.5 | 39.8 | 19.9 | 26.7 | 45.7 | 58.4 | 58.0 | 60.0 |
| gru_1024 | 63.8 | 81.2 | 45.0 | 68.1 | 30.1 | 54.4 | 21.3 | 42.5 | 21.0 | 27.8 | 47.0 | 59.5 | 65.4 | 66.4 |
| lstm_1024 | 63.3 | 81.0 | 44.8 | 67.9 | 30.7 | 54.0 | 21.1 | 42.0 | 20.7 | 27.4 | 46.9 | 59.2 | 63.7 | 64.8 |
Note: The "MSCOCO c5" dataset contains five reference captions for every image in the MS COCO training, validation and testing datasets. "MSCOCO c40" contains 40 reference sentences for a randomly chosen 5,000 images from the MS COCO testing dataset[2].
Note 2: We assume the "Vinyals et al." score is the top score of the first author of our main reference paper [3]. It comes from the MSCOCO leaderboard at www.codalab.org, where we evaluated our scores. We mapped the scores from the range [0, 1] displayed on the website to [0, 100] for consistency with the previous table and with Vinyals et al. Our GRU and LSTM results place 84th and 85th on the CodaLab leaderboard, respectively.
Captions without errors (left-to-right: Elman, GRU, LSTM)
Captions with minor errors (left-to-right: Elman, GRU, LSTM)
Captions somewhat related to images (left-to-right: Elman, GRU, LSTM)
Captions unrelated to image (left-to-right: Elman, GRU, LSTM)
Studying the results of our experiments, we noted that increasing the number of hidden units describing the RNN state improved performance across all models, which matched our expectations. However, it was interesting to see the GRU cell outperform the LSTM in both experiments. A possible explanation is that for generating relatively short sequences (most captions had up to 20 words), the architecture of the LSTM cell might be overly complex; the sequences might simply be too short for the LSTM to shine. Since the LSTM has more trainable parameters than the GRU, it would be interesting to see whether extending the training procedure for LSTM-based networks allows them to match or exceed the performance of GRU-based networks.
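To make the parameter gap concrete, here is a quick count with PyTorch's built-in units (illustrative, using the 512-dimensional input and 1024-dimensional hidden state of our largest models):

```python
import torch.nn as nn

def n_params(module):
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())

# The LSTM computes four gate blocks per step, the GRU three and the
# Elman RNN one, so their parameter counts scale roughly 4:3:1.
for unit in (nn.RNN, nn.GRU, nn.LSTM):
    print('%s: %d' % (unit.__name__, n_params(unit(512, 1024))))
# -> RNN: 1574912, GRU: 4724736, LSTM: 6299648
```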
[1] J. Chung et al. (2014) Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv:1412.3555.
[2] X. Chen et al. (2015) Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv:1504.00325.
[3] O. Vinyals et al. (2014) Show and Tell: A Neural Image Caption Generator. arXiv:1411.4555.