Image caption generation in PyTorch using an encoder-decoder architecture
This work implements a variant of the model described in the paper Show and Tell: A Neural Image Caption Generator. Given an image, the model describes its contents in natural language. It is composed of an encoder, a pretrained CNN that extracts high-level features from the image and feeds them to the decoder, an LSTM that generates the sequence of words.
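As a rough sketch of this kind of architecture (not the exact configuration used in this repository), the encoder-decoder pair below uses a ResNet-50 backbone and teacher forcing during training; the backbone choice and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Pretrained CNN that maps an image to a fixed-size feature vector."""
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet50(pretrained=True)
        # Drop the classification head, keep the convolutional backbone
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        with torch.no_grad():  # keep the pretrained backbone frozen
            features = self.backbone(images)
        return self.fc(features.flatten(1))

class DecoderRNN(nn.Module):
    """LSTM that generates a caption conditioned on the image features."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Teacher forcing: the image features act as the first "token"
        inputs = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hiddens, _ = self.lstm(inputs)
        return self.fc(hiddens)  # vocabulary scores at each time step
```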
- Conda or Virtualenv
- Flickr8k dataset for training (downloadable here)
- Extract the images from the Flickr8k dataset under ./data/images
$ git clone https://github.com/nhabbash/autocaption
$ cd autocaption
$ conda env create -p .\cenv -f .\environment.yml # using conda
$ jupyter nbextensions_configurator enable --user # optional
Uses:
- PyTorch for deep learning
- Ax for hyperparameter tuning
- Weights and Biases for experiment tracking
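As a rough illustration of how these pieces could fit together (the actual training code lives in the notebooks), the sketch below runs Ax's managed optimization loop over a toy search space and logs each trial to Weights and Biases; the project name, metric name, and search space are assumptions.

```python
import wandb
from ax.service.managed_loop import optimize

def train_and_evaluate(params):
    """Train with the given hyperparameters and return validation BLEU.
    The body is a placeholder: the real loop would train the encoder-decoder
    on Flickr8k and evaluate on the validation split."""
    run = wandb.init(project="autocaption", config=params, reinit=True)
    val_bleu = 0.0  # placeholder for the real training/evaluation result
    wandb.log({"val_bleu": val_bleu})
    run.finish()
    return val_bleu

best_params, values, experiment, model = optimize(
    parameters=[
        {"name": "lr", "type": "range", "bounds": [1e-4, 1e-2], "log_scale": True},
        {"name": "hidden_size", "type": "choice", "values": [256, 512, 1024]},
    ],
    evaluation_function=train_and_evaluate,
    objective_name="val_bleu",
    total_trials=20,
)
```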
For a detailed example, check the training notebook under ./notebooks/training
- The best model obtained after training and hyperparameter tuning achieves an average BLEU score of 11 on the test split, compared to the 27.2 reported in the original paper. (See the report or the slides for more details on the performance; a minimal evaluation sketch follows this list.)
- The model works best with pictures similar to those it was trained on, which for Flickr8k means pictures with one or two subjects performing simple activities. It works particularly well with dogs playing around and people engaged in a couple of sports (e.g. surfing, trekking in the mountains).
- The demo uses Vue.js for the frontend and FastAPI for the backend. The backend is deployed on Heroku, so if it has not been run in a while it takes a couple of minutes to start up and generate the first caption; after that, a caption usually takes about a dozen seconds. If you run the demo locally (possible with Docker Compose), caption generation takes about 5 seconds.
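The sketch below shows one way such an average BLEU score over the test split could be computed with NLTK's corpus_bleu; the whitespace tokenization and the generate_caption callable are illustrative assumptions, not the evaluation code behind the reported number.

```python
from nltk.translate.bleu_score import corpus_bleu

def evaluate_bleu(generate_caption, test_pairs):
    """Corpus-level BLEU over the test split.

    `generate_caption` is a hypothetical callable mapping an image to a caption
    string; `test_pairs` is assumed to be a list of (image, reference_captions)
    pairs, each image having several human-written reference captions.
    """
    references, hypotheses = [], []
    for image, ref_captions in test_pairs:
        # BLEU accepts multiple tokenized references per hypothesis
        references.append([ref.lower().split() for ref in ref_captions])
        hypotheses.append(generate_caption(image).lower().split())
    # Scale to the 0-100 convention used above
    return corpus_bleu(references, hypotheses) * 100
```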
- Nassim Habbash - nhabbash