Produce pragmatic, context-aware descriptions of images (captions that describe differences between images or visual concepts) using context-agnostic data (captions that describe a concept or an image in isolation). We address the following two tasks:
- Justification:
  - Given an image, a target (ground-truth) class, and a distractor class, describe the target image to explain why it belongs to the target class and not the distractor class.
- Discriminative image captioning:
  - Given two similar images, produce a sentence that identifies the target image as opposed to the distractor image.
We trained our model on generic, context-agnostic data (captions that describe a concept or an image in isolation) using an encoder-decoder architecture with attention, and used an inference technique called Emitter-Suppressor Beam Search to produce context-aware captions (a sketch of the scoring rule is given below). Our model builds on the Show, Attend and Tell architecture. For justification, the decoder is conditioned on the target class in addition to the image.
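At each beam-search step, the emitter-suppressor objective scores a candidate word by how likely the decoder finds it given the target (image or class), penalized by how likely the same decoder finds it given the distractor. The sketch below is a minimal, illustrative rendering of that scoring rule, not the repository's actual code; the function name and the default value of `lam` are assumptions.

```python
import torch.nn.functional as F

def emitter_suppressor_scores(logits_target, logits_distractor, lam=0.7):
    """One common form of the emitter-suppressor objective.

    logits_target / logits_distractor: (beam_size, vocab_size) decoder outputs
    for the same partial captions, conditioned on the target and the
    distractor respectively. `lam` trades off fluency (lam -> 1) against
    discriminativeness (lam -> 0).
    """
    log_p_t = F.log_softmax(logits_target, dim=-1)      # emitter: log p(w | target)
    log_p_d = F.log_softmax(logits_distractor, dim=-1)  # suppressor: log p(w | distractor)
    # Reward words the emitter likes; penalize words the suppressor also likes.
    return lam * log_p_t + (1.0 - lam) * (log_p_t - log_p_d)
```

These per-word scores are added to each beam's running score and the top-k extensions are kept, exactly as in ordinary beam search.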
We used the CUB-200-2011 dataset, which contains images of birds and their descriptions. The dataset has 200 bird classes (species); each class has 30 images and each image has 10 descriptions. The descriptions are mostly about the morphology of the birds, i.e., details about the various parts of their bodies.
- Encoder
  - We used a pretrained ResNet-34, available in PyTorch's torchvision module, and discarded its last two layers (the pooling and linear layers), since we only need to encode the image, not classify it (sketched below).
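A minimal sketch of the encoder described above, assuming the standard torchvision ResNet-34 (the class name is illustrative):

```python
import torch.nn as nn
import torchvision

class Encoder(nn.Module):
    """Pretrained ResNet-34 with its final pooling and linear layers removed."""

    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet34(pretrained=True)
        # Drop the last two layers (average pooling and the fully connected
        # classifier): we only need spatial image features, not class scores.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, images):
        # images: (batch, 3, H, W) -> features: (batch, 512, H/32, W/32)
        return self.backbone(images)
```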
- Decoder
  - We used an LSTM with an input embedding of size 512 and hidden states of size 1800. For justification, the class is embedded into a vector of size 512 (sketched below).
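A minimal sketch of the decoder configuration described above (word embeddings of size 512, LSTM hidden state of size 1800, and a 512-dimensional class embedding for justification). Feeding the attended image feature and the class embedding to the LSTM by concatenation is an assumption here; names are illustrative:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """LSTM decoder over word embeddings, optionally conditioned on a class."""

    def __init__(self, vocab_size, num_classes=200, embed_dim=512,
                 hidden_dim=1800, encoder_dim=512, use_class=False):
        super().__init__()
        self.use_class = use_class
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.class_embed = nn.Embedding(num_classes, embed_dim)  # justification only
        in_dim = embed_dim + encoder_dim + (embed_dim if use_class else 0)
        self.lstm = nn.LSTMCell(in_dim, hidden_dim)
        self.dropout = nn.Dropout(p=0.5)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_ids, context, state, class_ids=None):
        # word_ids: (batch,) previous words; context: (batch, encoder_dim)
        # attended image feature from the attention module; state: (h, c).
        inputs = [self.word_embed(word_ids), context]
        if self.use_class:
            inputs.append(self.class_embed(class_ids))  # condition on the target class
        h, c = self.lstm(torch.cat(inputs, dim=1), state)
        logits = self.fc(self.dropout(h))               # scores over the vocabulary
        return logits, (h, c)
```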
- Attention
  - We used adaptive pooling over the encoder output to get a 14×14×512 feature map, then applied a linear layer with ReLU activation to obtain the attention weights. Note that we used the soft version of attention (sketched below).
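A minimal sketch of the soft attention described above: adaptive pooling brings the encoder feature map to 14×14×512, and a small scoring network with a ReLU produces weights over the 196 spatial locations. The exact scoring network is an assumption in the spirit of Show, Attend and Tell; names are illustrative:

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Soft attention over a 14x14 grid of 512-d encoder features."""

    def __init__(self, encoder_dim=512, hidden_dim=1800, attn_dim=512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d((14, 14))  # (B, 512, h, w) -> (B, 512, 14, 14)
        self.enc_proj = nn.Linear(encoder_dim, attn_dim)
        self.dec_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)
        self.relu = nn.ReLU()

    def forward(self, feature_map, hidden):
        feats = self.pool(feature_map)              # (B, 512, 14, 14)
        feats = feats.flatten(2).permute(0, 2, 1)   # (B, 196, 512)
        scores = self.score(self.relu(
            self.enc_proj(feats) + self.dec_proj(hidden).unsqueeze(1)))  # (B, 196, 1)
        alpha = torch.softmax(scores, dim=1)        # soft attention weights over locations
        context = (alpha * feats).sum(dim=1)        # (B, 512) weighted image feature
        return context, alpha.squeeze(-1)
```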
- We used the Adam optimizer with a learning rate of 0.002, annealed every 5 epochs, and dropout with p = 0.5. The batch size was 64 and the number of epochs was 100. The model was trained on a GTX 1060 for 15 hours (optimizer setup sketched below).
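A minimal sketch of the optimizer setup described above; the StepLR scheduler and the decay factor below are assumptions, since only the 5-epoch annealing interval is specified:

```python
import torch

def build_optimizer(model, lr=2e-3, anneal_every=5, anneal_factor=0.8):
    # Adam over all trainable parameters; dropout (p = 0.5) lives inside the model.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # Anneal the learning rate every `anneal_every` epochs; call scheduler.step()
    # once per epoch in the training loop.
    scheduler = torch.optim.lr_scheduler.StepLR(
        optimizer, step_size=anneal_every, gamma=anneal_factor)
    return optimizer, scheduler
```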
Use requirements.txt to set up your environment for replicating this project. Some of the dependencies are:
h5py==2.9.0
matplotlib==3.0.3
nltk==3.4.1
numpy==1.16.2
pandas==0.24.2
pillow==5.3.0
python==3.7.3
pytorch==1.0.0
torchfile==0.1.0
torchvision==0.2.1
tqdm==4.31.1
You can install these dependencies using `pip install -r requirements.txt`
python datapreprocess.py \path\to\data\set \path\to\vocab\
python train.py
python train_justify.py
Download the pretrained models checkpoint_d and checkpoint_j
- Context-agnostic captioning:
python beamsearch.py c image_path
- Justification:
python beamsearch.py cj target_image_path target_class_path distractor_class_path
- Discrimination:
python beamsearch.py cd target_image_path distractor_image_path
- Paper: Context-aware Captions from Context-agnostic Supervision
- Dataset:
- Images : CUB-200-2011
- Captions : Reed et al.
- A beautiful tutorial on Show, Attend and Tell Implementation