Generate captions from images
Image captions serve as a link between the information stored in an image and text. They help machine learning algorithms make better sense of what is going on in the image. Captioning is useful for Search Engine Optimization (SEO), and for indexing and archiving photos based on their contents, such as the actions, places, or objects in them (similar to what Google Photos does today). Image captioning can also help process video data by describing what is happening frame by frame.
Here we use an attention-based encoder-decoder model to generate captions. Training images, available in RGB format, are passed through a pre-trained encoder model to obtain spatial features, which are then passed through a decoder block to generate the caption word by word. The encoder is most often a pre-trained CNN; here we use InceptionV3. The decoder is an RNN.
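To make the encoder side concrete, here is a minimal sketch (not necessarily identical to the code in this repo) of how InceptionV3 can be turned into a feature extractor:

```python
import tensorflow as tf

# Load InceptionV3 pre-trained on ImageNet and drop the classification head,
# keeping the last convolutional feature map as the image representation.
base_model = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')

# For a 299x299 input the output shape is (batch, 8, 8, 2048): an 8x8 grid of
# 2048-dim feature vectors that the attention-based decoder can attend over.
image_encoder = tf.keras.Model(base_model.input, base_model.output)
```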
Attention in literal English means directing focus at something or taking greater notice. In deep learning, the attention mechanism builds on the same concept: the model places higher focus on certain parts of the input while processing the data. In an encoder-decoder model, attention manages and quantifies the dependence of the decoder on the encoder. See [6] for more details on Bahdanau attention.
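For reference, a Bahdanau (additive) attention layer in tf.keras looks roughly like the sketch below; the class and variable names are illustrative and may not match the exact implementation in this repo.

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.Model):
    """Additive attention over the encoder's spatial features (a sketch)."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects encoder features
        self.W2 = tf.keras.layers.Dense(units)  # projects decoder hidden state
        self.V = tf.keras.layers.Dense(1)       # scores each spatial location

    def call(self, features, hidden):
        # features: (batch, 64, feature_dim) spatial features from the CNN encoder
        # hidden:   (batch, units) previous decoder hidden state
        hidden_with_time_axis = tf.expand_dims(hidden, 1)                    # (batch, 1, units)
        score = self.V(tf.nn.tanh(self.W1(features) +
                                  self.W2(hidden_with_time_axis)))           # (batch, 64, 1)
        attention_weights = tf.nn.softmax(score, axis=1)                     # focus over 64 locations
        context_vector = tf.reduce_sum(attention_weights * features, axis=1) # (batch, feature_dim)
        return context_vector, attention_weights
```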
- Create a conda environment using the following command:

      conda create --name <_yourenvname_> python=3.7

  In case you are not familiar with conda environments, refer to [3] and [4] to get started.

- Install dependencies using requirements.txt:

      pip3 install -r requirements.txt
- Flickr8k images are used in this project. There are multiple sources available online to download this dataset; here is one of them - download flickr8k. Save these images in the data/Flicker8k_Images directory. One image has already been saved there for reference.
- Instead of creating the train-val-test split ourselves, we leverage the work done by Andrej Karpathy to split the images into train-val-test sets in the ratio 6000:1000:1000, along with segregating their captions. This is available as a json file; download it from here. Save the json file in data/annotations.
- Download the pretrained GloVe embeddings (glove.840B.300d) from here or here and save them to the data/annotations directory.
- Run img_features.py to save the encoded version of the images to disk. This usually takes 30-35 minutes for the images in the train and validation sets, so it is better to perform this step once before kicking off training. Saving to disk lets us reuse the encoded features at no extra time cost as we try different combinations of model parameters. A folder named 'img_features' will be created inside the data/ folder. A rough sketch of what this step does is shown below the list.
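The sketch below illustrates the feature-extraction step; paths and helper names are illustrative, see img_features.py for the actual implementation.

```python
import os
import numpy as np
import tensorflow as tf

# InceptionV3 without its classification head acts as the encoder.
base_model = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
encoder = tf.keras.Model(base_model.input, base_model.output)

def encode_and_save(image_path, out_dir='data/img_features'):
    os.makedirs(out_dir, exist_ok=True)

    # Load and preprocess the image the way InceptionV3 expects (299x299, scaled to [-1, 1]).
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)

    # Forward pass gives an (8, 8, 2048) feature map; flatten the grid to (64, 2048)
    # so the decoder can attend over 64 spatial locations.
    features = encoder(tf.expand_dims(img, 0))
    features = tf.reshape(features, (features.shape[0], -1, features.shape[3]))

    fname = image_path.split('/')[-1]
    np.save(f'{out_dir}/{fname}.npy', features.numpy())
```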
Specify the following settings in config.yaml. These control how the model is trained.
- store: folder name where the model checkpoints and results will be saved. A good practice is to name this folder according to the settings used, for example the number of epochs, vocab_size, RNN units, etc.
- epochs: number of cycles in which all training data is passed through the model exactly once. A forward pass and a backward pass together count as one pass.
- batch_size: an epoch is made up of one or more batches, each using a part of the dataset to train the neural network. The training data is split into batches of batch_size samples, and the model weights are updated once per batch.
- units: number of units in the decoder RNN.
- embedding_dim: size of the feature vectors of the word embedding being used. GloVe word vectors have 300 dimensions.
- vocab_size: number of unique words in the training set. A feature vector per word is sourced from the GloVe embeddings, and a zero vector is assigned to out-of-vocabulary words.
- In the preprocessing step, tensorflow.keras.preprocessing.sequence.pad_sequences is used to make all training captions the same length by setting its 'maxlen' parameter. We set maxlen to 21, which equals {average(length of training captions) + 2*standard deviation(length of training captions)}. Keeping maxlen = 39 (the length of the longest training caption) leads to spurious results where the algorithm wanders between random words in search of the stopping criterion, i.e. until it reaches maxlen. The point to note here is that only 2.2% of training captions are longer than 21 words, so it does not make sense to train for the full length. (A short preprocessing sketch follows this list.)
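To make the vocab_size, embedding_dim and maxlen settings concrete, here is a rough preprocessing sketch; variable names such as train_captions are illustrative and the repo's preprocessing code may differ.

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size = 7378        # from config.yaml; at most the number of unique training words
embedding_dim = 300      # GloVe 300-d vectors

# Fit a tokenizer on the raw training captions (train_captions: list of strings).
tokenizer = Tokenizer(num_words=vocab_size, oov_token='<unk>')
tokenizer.fit_on_texts(train_captions)
train_seqs = tokenizer.texts_to_sequences(train_captions)

# Pad every caption to mean + 2*std of the training caption lengths (~21 for Flickr8k).
lengths = [len(seq) for seq in train_seqs]
max_len = int(np.mean(lengths) + 2 * np.std(lengths))
cap_vector = pad_sequences(train_seqs, maxlen=max_len, padding='post')

# Build the embedding matrix: one GloVe vector per in-vocabulary word, zeros for OOV words.
embeddings_index = {}
with open('data/annotations/glove.840B.300d.txt', encoding='utf-8') as f:
    for line in f:
        # rsplit guards against the few GloVe tokens that themselves contain spaces
        parts = line.rstrip().rsplit(' ', embedding_dim)
        embeddings_index[parts[0]] = np.asarray(parts[1:], dtype='float32')

embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, idx in tokenizer.word_index.items():
    if idx < vocab_size and word in embeddings_index:
        embedding_matrix[idx] = embeddings_index[word]
```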
At this point, all settings are in place and we are ready to begin training. Kick off training by running main.py.
Here are the losses per epoch for the training and validation sets. While the training loss decreases as training progresses, there is a hint of overfitting, as the validation loss increases after an initial dip.
- At the end of training, two folders will be created within the 'store' folder that you specified in config.yaml:
  - checkpoints: contains a single checkpoint for the epoch with the least training loss
  - derived_data: contains the following .csv files:
    - loss_per_epoch.csv: train_loss and validation_loss per epoch; used to plot the figure above
    - pred_cap_per_epoch.csv: predicted caption for a sample image after every epoch
    - result_train_greedy.csv: predicted captions and BLEU-1,2,3,4 scores for all samples in the training data, along with the actual captions provided in the Flickr8k dataset; uses greedy search to generate captions
    - result_val_greedy.csv: same as above, but for validation samples using greedy search
    - result_val_beam.csv: same as above, but for validation samples using beam search
We compute BLEU scores for both the training and validation sets, using two methods to generate predicted captions: Greedy Search and Beam Search.
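For reference, BLEU can be computed with nltk, as in the minimal example below; the repo's scripts may compute and aggregate these scores differently.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Reference captions and a predicted caption, all tokenized into word lists.
references = [['a', 'dog', 'runs', 'through', 'the', 'grass'],
              ['a', 'brown', 'dog', 'is', 'running', 'in', 'a', 'field']]
candidate = ['a', 'dog', 'is', 'running', 'in', 'the', 'grass']

smooth = SmoothingFunction().method1  # avoids zero scores when higher n-grams are missing
bleu_1 = sentence_bleu(references, candidate, weights=(1, 0, 0, 0), smoothing_function=smooth)
bleu_4 = sentence_bleu(references, candidate, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
```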
Greedy Search: to predict the next word of the caption, the decoder outputs probability scores for all words in the vocabulary. In greedy search, we pick the most probable word at every step and feed it back into the model to predict the next word.
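A minimal sketch of greedy decoding is shown below; the decoder call signature, the reset_state helper and the <start>/<end> tokens are assumptions and may differ from the repo's code.

```python
import tensorflow as tf

def greedy_caption(img_features, decoder, tokenizer, max_len):
    """Generate a caption by always taking the single most probable next word."""
    hidden = decoder.reset_state(batch_size=1)                      # assumed helper
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    result = []

    for _ in range(max_len):
        # Assumed decoder interface: returns (scores over vocab, new hidden state, attention weights).
        predictions, hidden, _ = decoder(dec_input, img_features, hidden)
        predicted_id = int(tf.argmax(predictions[0]))               # pick the argmax word
        word = tokenizer.index_word[predicted_id]
        if word == '<end>':
            break
        result.append(word)
        dec_input = tf.expand_dims([predicted_id], 0)               # feed the prediction back in
    return ' '.join(result)
```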
Beam Search: beam search differs from greedy search in that it preserves the top-k predicted words at each step and feeds each of them forward to get the next word, thus maintaining at most k sequences at each step. This reduces the penalty when the most probable word at some step leads in a wrong direction. See Andrew Ng's video [5] for a better understanding of beam search. Note that beam search with k=1 is exactly greedy search.
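A corresponding sketch of beam search is shown below, under the same assumptions about the decoder interface as the greedy sketch above; it also assumes the decoder outputs logits.

```python
import numpy as np
import tensorflow as tf

def beam_search_caption(img_features, decoder, tokenizer, max_len, k=3):
    """Keep the k highest-scoring partial captions at each step (k=1 reduces to greedy search)."""
    start_id = tokenizer.word_index['<start>']
    end_id = tokenizer.word_index['<end>']
    # Each beam: (token ids so far, cumulative log-probability, decoder hidden state)
    beams = [([start_id], 0.0, decoder.reset_state(batch_size=1))]

    for _ in range(max_len):
        candidates = []
        for tokens, score, hidden in beams:
            if tokens[-1] == end_id:                  # finished captions are carried forward unchanged
                candidates.append((tokens, score, hidden))
                continue
            dec_input = tf.expand_dims([tokens[-1]], 0)
            predictions, new_hidden, _ = decoder(dec_input, img_features, hidden)
            log_probs = tf.math.log(tf.nn.softmax(predictions[0]))
            top_ids = np.argsort(log_probs.numpy())[-k:]            # k best next words for this beam
            for idx in top_ids:
                candidates.append((tokens + [int(idx)], score + float(log_probs[idx]), new_hidden))
        # Keep only the k best sequences overall.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]

    best_tokens = beams[0][0]
    words = [tokenizer.index_word[i] for i in best_tokens[1:] if i != end_id]
    return ' '.join(words)
```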
The table below summarizes the performance of various model settings. All BLEU scores are computed on the dev (validation) set.
# | epochs | vocab_size | max_len | rnn_units | Greedy Search | Beam Search (k=3)
---|---|---|---|---|---|---
1 | 40 | Full (7378) | 39 | 256 | BLEU-1: 0.4044<br>BLEU-2: 0.2131<br>BLEU-3: 0.0792<br>BLEU-4: 0.0234 | BLEU-1: 0.4259<br>BLEU-2: 0.2284<br>BLEU-3: 0.0962<br>BLEU-4: 0.0358
2 | 40 | 6000 | 39 | 512 | BLEU-1: 0.4280<br>BLEU-2: 0.2275<br>BLEU-3: 0.0823<br>BLEU-4: 0.0241 | BLEU-1: 0.4255<br>BLEU-2: 0.2261<br>BLEU-3: 0.0888<br>BLEU-4: 0.0290
3 | 20 | 6000 | 39 | 512 | BLEU-1: 0.4531<br>BLEU-2: 0.2581<br>BLEU-3: 0.1022<br>BLEU-4: 0.0326 | BLEU-1: 0.4620<br>BLEU-2: 0.2646<br>BLEU-3: 0.1157<br>BLEU-4: 0.0393
4 | 20 | Full (7378) | 39 | 512 | BLEU-1: 0.4548<br>BLEU-2: 0.2560<br>BLEU-3: 0.1183<br>BLEU-4: 0.0400 | BLEU-1: 0.4547<br>BLEU-2: 0.2711<br>BLEU-3: 0.1301<br>BLEU-4: 0.0466
5 | 20 | Full (7378) | 21 | 512 | BLEU-1: 0.4745<br>BLEU-2: 0.2776<br>BLEU-3: 0.1219<br>BLEU-4: 0.0398 | BLEU-1: 0.4855<br>BLEU-2: 0.2900<br>BLEU-3: 0.1359<br>BLEU-4: 0.0523
6 | 20 | Full (7378) | 17 | 512 | BLEU-1: 0.4706<br>BLEU-2: 0.2678<br>BLEU-3: 0.1114<br>BLEU-4: 0.0339 | BLEU-1: 0.4776<br>BLEU-2: 0.2739<br>BLEU-3: 0.1232<br>BLEU-4: 0.0412
The best performance was achieved in run #5, with a BLEU-4 score of 0.0523 (5.23%) using beam search.
- Here is how the model learnt captions for a sample image shown below:
- Training on a larger corpus of images could potentially improve performance. Since I used Google Colab for training and have exhausted the storage limit on Google Drive, I used the Flickr8k dataset, but one could upgrade to the Flickr30k or COCO dataset for access to more training samples.
- Training for 40 epochs led to overfitting on the training data, so I reduced the number of epochs to 20 from run #3 onwards. Regularization techniques such as an L2 regularizer and Dropout are incorporated, but they do not seem to help beyond a certain point.
- Training with the full vocabulary size, i.e. not dropping any words in the training data, led to better performance.
- As mentioned above, keeping a low max_len led to more stable predictions.
- BLEU scores are better with beam search than with greedy search. Increasing the beam width (currently 3) could further improve performance marginally; however, considering more beam candidates at every step comes at a higher computational cost.
- Good Predictions:
- Not so good predictions: