
Neural Image Captioning

The goal of this project was to tackle the problem of automatic caption generation for images of real-world scenes. The work consisted of reimplementing the Neural Image Captioning (NIC) model proposed by Vinyals et al. [3] and running appropriate experiments to test its performance.

The project was carried out as part of the ID2223 "Scalable Machine Learning and Deep Learning" course at KTH Royal Institute of Technology.

To run

Install pip packages and cocoapi:

pip install -r requirements.txt
git clone https://github.com/cocodataset/cocoapi
cd cocoapi/PythonAPI/; make install; cd ../..

Install PyTorch for Python 3.5 with CUDA 8.0 (check pytorch.org for other options):

pip3 install http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp35-cp35m-linux_x86_64.whl

Fetch the data (also builds a vocabulary):

python fetch_data.py  # to also download the test set, run: python fetch_data.py --test

Start training with default arguments (check train.py for the available arguments):

python train.py

Evaluate a trained model:

python train.py --sample --checkpoint_file <your-checkpoint-file>

Contributors

  • Martin Hwasser (github: hwaxxer)
  • Wojciech Kryściński (github: muggin)
  • Amund Vedal (github: amundv)

References

The implemented architecture is based on "Show and Tell: A Neural Image Caption Generator" by Vinyals et al. [3].

Datasets

Experiments were conducted using the Common Objects in Context (MS COCO) dataset. The following subsets were used:

  • Training: 2014 Contest Train images [83K images/13GB]
  • Validation: 2014 Contest Val images [41K images/6GB]
  • Test: 2014 Contest Test images [41K images/6GB]

Architecture

The NIC architecture consists of two models: an Encoder and a Decoder. The Encoder, a Convolutional Neural Network, creates a (semantic) summary of the image in the form of a fixed-size vector. The Decoder, a Recurrent Neural Network, generates the caption in natural language based on the summary vector created by the Encoder.
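
A minimal sketch of how the two parts interact at inference time (greedy decoding). The encoder, embedding, rnn, linear and vocab objects are hypothetical stand-ins, not the repository's actual API:

    import torch

    def generate_caption(image, encoder, embedding, rnn, linear, vocab, max_len=20):
        """Greedy decoding sketch: the image summary is the first input to the RNN;
        afterwards, the embedding of the previously predicted word is fed back in."""
        with torch.no_grad():
            features = encoder(image.unsqueeze(0))        # (1, embed_size) image summary
            inputs, states = features.unsqueeze(1), None  # (1, 1, embed_size)
            words = []
            for _ in range(max_len):
                output, states = rnn(inputs, states)      # (1, 1, hidden_size)
                logits = linear(output.squeeze(1))        # (1, vocab_size)
                predicted = logits.max(1)[1]              # id of the most probable word
                word = vocab.idx2word[predicted.item()]   # hypothetical vocabulary lookup
                if word == '<end>':
                    break
                words.append(word)
                inputs = embedding(predicted).unsqueeze(1)
            return ' '.join(words)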

Experiments

Goals

The goal of the project was to implement and train a NIC architecture and evaluate its performance. A secondary goal was to check how the type of recurrent unit and the size of the word embeddings in the Decoder (language generator) affect the overall performance of the NIC model.

Setup

The Encoder was a ResNet-34 network pre-trained on the ImageNet dataset. The output layer of the network was replaced with a new layer whose size is definable by the user. All weights, except for those of the new output layer, were frozen during training.
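
A sketch of this encoder in PyTorch/torchvision, assuming the standard resnet34 model; the repository's exact implementation may differ:

    import torch.nn as nn
    import torchvision.models as models

    class EncoderCNN(nn.Module):
        """ResNet-34 pre-trained on ImageNet; only the new output layer is trained."""
        def __init__(self, embed_size=512):
            super().__init__()
            resnet = models.resnet34(pretrained=True)
            for param in resnet.parameters():
                param.requires_grad = False  # freeze all pre-trained weights
            # Replace the output layer; the new layer is trainable by default.
            resnet.fc = nn.Linear(resnet.fc.in_features, embed_size)
            self.resnet = resnet

        def forward(self, images):
            return self.resnet(images)       # (batch, embed_size) image summary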

The Decoder was a single-layer recurrent neural network. Three different recurrent units were tested: Elman, GRU, and LSTM, where "Elman" refers to the basic (vanilla) RNN architecture.
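
A corresponding sketch of the decoder, with the recurrent unit selectable by name (a simplification of the idea, not the repository's exact code):

    import torch
    import torch.nn as nn

    class DecoderRNN(nn.Module):
        """Single-layer language generator; the recurrent unit is selectable."""
        def __init__(self, embed_size=512, hidden_size=512, vocab_size=15000, cell='lstm'):
            super().__init__()
            rnn_cls = {'elman': nn.RNN, 'gru': nn.GRU, 'lstm': nn.LSTM}[cell]
            self.embed = nn.Embedding(vocab_size, embed_size)
            self.rnn = rnn_cls(embed_size, hidden_size, num_layers=1, batch_first=True)
            self.linear = nn.Linear(hidden_size, vocab_size)

        def forward(self, features, captions):
            # Prepend the image summary as the first "word" of the sequence.
            embeddings = self.embed(captions)                          # (batch, T, embed)
            inputs = torch.cat([features.unsqueeze(1), embeddings], 1) # (batch, T+1, embed)
            hiddens, _ = self.rnn(inputs)                              # (batch, T+1, hidden)
            return self.linear(hiddens)                                # vocabulary logits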

Training parameters (a configuration sketch follows the list):

  • Number of epochs: 3
  • Batch size: 128 (3236 batches per epoch)
  • Vocabulary size: 15,000 most frequent words
  • Embedding size: 512 (image summary vector, word embeddings)
  • RNN hidden state size: 512 and 1024
  • Learning rate: 1e-3, with LR decay every 2000 batches
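
The values above could be wired up roughly as follows; the optimizer choice and the decay factor are assumptions, so check train.py for the actual defaults:

    import torch
    import torch.nn as nn

    # Assumed setup mirroring the list above (EncoderCNN/DecoderRNN are the sketches
    # shown earlier); Adam and the 0.5 decay factor are illustrative choices.
    encoder = EncoderCNN(embed_size=512)
    decoder = DecoderRNN(embed_size=512, hidden_size=512, vocab_size=15000, cell='gru')

    params = list(decoder.parameters()) + list(encoder.resnet.fc.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-3)
    # Calling scheduler.step() once per batch decays the learning rate every 2000 batches.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2000, gamma=0.5)
    criterion = nn.CrossEntropyLoss()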

The models were implemented in Python using the PyTorch library and trained either locally or on rented AWS instances, in both cases on GPUs.

Evaluation Methods

Experiments were evaluated in a qualitative and a quantitative manner. The qualitative evaluation assessed the coherence of the generated sequences and their relevance given the input image, and was performed manually by us. The quantitative evaluation enabled comparison of the trained models with the reference models from the authors. The following metrics were used: BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-L, METEOR, and CIDEr.
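
These metrics can also be computed offline with the coco-caption toolkit (pycocoevalcap), which is a separate repository not installed by the steps above; a sketch with placeholder file names, assuming the generated captions are stored in the COCO results JSON format:

    from pycocotools.coco import COCO
    from pycocoevalcap.eval import COCOEvalCap

    # Placeholder paths: ground-truth annotations and generated captions in the
    # COCO results format ([{"image_id": ..., "caption": ...}, ...]).
    coco = COCO('annotations/captions_val2014.json')
    coco_res = coco.loadRes('results/captions_val2014_results.json')

    coco_eval = COCOEvalCap(coco, coco_res)
    coco_eval.params['image_id'] = coco_res.getImgIds()  # score only the captioned images
    coco_eval.evaluate()

    for metric, score in coco_eval.eval.items():
        print(metric, score)  # BLEU-1..4, METEOR, ROUGE_L, CIDEr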

Results

Training Progress

Quantitative

Quantitative results are presented for the Validation and Test sets. Results obtained with the reimplemented model are compared with those reported by the authors of the original article.

Validation Data

Model                       BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  ROUGE-L  CIDEr
Vinyals et al. (4k subset)    N/A     N/A     N/A    27.7    23.7     N/A    85.5
elman_512                    62.5    43.2    29.1    19.8    19.5    45.6    57.7
elman_1024                   61.9    42.9    28.8    19.6    19.9    45.9    58.7
gru_512                      63.9    44.9    30.5    20.8    20.4    46.6    62.9
gru_1024                     64.0    45.3    31.2    21.5    21.1    47.1    66.1
lstm_512                     62.9    44.3    29.8    20.3    19.9    46.1    60.2
lstm_1024                    63.4    45.0    31.0    21.4    20.8    47.1    64.4
Test Data (each metric reported as c5 / c40)

Model           BLEU-1       BLEU-2       BLEU-3       BLEU-4       METEOR       ROUGE-L      CIDEr
Vinyals et al.  71.3 / 89.5  54.2 / 80.2  40.7 / 69.4  30.9 / 58.7  25.4 / 34.6  53.0 / 68.2  94.3 / 94.6
elman_1024      61.8 / 79.9  42.8 / 66.2  28.7 / 51.9  19.5 / 39.8  19.9 / 26.7  45.7 / 58.4  58.0 / 60.0
gru_1024        63.8 / 81.2  45.0 / 68.1  30.1 / 54.4  21.3 / 42.5  21.0 / 27.8  47.0 / 59.5  65.4 / 66.4
lstm_1024       63.3 / 81.0  44.8 / 67.9  30.7 / 54.0  21.1 / 42.0  20.7 / 27.4  46.9 / 59.2  63.7 / 64.8

Note: The "MSCOCO c5" dataset contains five reference captions for every image in the MS COCO training, validation and testing datasets. "MSCOCO c40" contains 40 reference sentences for 5,000 randomly chosen images from the MS COCO testing dataset [2].

Note 2: We assume the "Vinyals et al." score is the top score of the first author of our main reference paper [3]. It comes from the MS COCO leaderboard at www.codalab.org, where we evaluated our scores. We mapped the scores from the range [0, 1] displayed on the website to [0, 100] for consistency with the previous table and with Vinyals et al. Our GRU and LSTM results placed 84th and 85th on the CodaLab leaderboard, respectively.

Qualitative

Captions without errors (left-to-right: Elman, GRU, LSTM)

Captions with minor errors (left-to-right: Elman, GRU, LSTM)

Captions somewhat related to images (left-to-right: Elman, GRU, LSTM)

Captions unrelated to image (left-to-right: Elman, GRU, LSTM)

Discussion

Studying the results of our experiments, we noted that increasing the number of hidden units in the RNN state improved performance across all models, which matched our expectations. However, it was interesting to see the GRU cell outperform the LSTM in both experiments. A possible explanation is that for generating relatively short sequences (most captions had up to 20 words) the LSTM cell may be overly complex, and the sequences may simply be too short for the LSTM to shine. Since the LSTM has more trainable parameters than the GRU, it would be interesting to see whether extending the training procedure allows LSTM-based networks to match or exceed the performance of GRU-based networks.
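
The parameter gap mentioned above is easy to check; a quick sketch with the 512-unit configuration used here:

    import torch.nn as nn

    def count_params(module):
        return sum(p.numel() for p in module.parameters())

    gru = nn.GRU(input_size=512, hidden_size=512, num_layers=1)
    lstm = nn.LSTM(input_size=512, hidden_size=512, num_layers=1)

    print('GRU :', count_params(gru))   # 3 gates -> 3 * (512*512 + 512*512 + 2*512) parameters
    print('LSTM:', count_params(lstm))  # 4 gates -> 4 * (512*512 + 512*512 + 2*512) parameters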

References:

[1] J. Chung et al. (2014) Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv:1412.3555.
[2] X. Chen et al. (2015) Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv:1504.00325.
[3] O. Vinyals et al. (2014) Show and Tell: A Neural Image Caption Generator. arXiv:1411.4555.
