PyTorch-Image-Captioning 📷 🖊️

This project uses a ResNet152 convolutional neural network (CNN) as an encoder and a long short-term memory (LSTM) recurrent neural network as a decoder to generate captions for images. The model is trained on the Microsoft Common Objects in Context (MS COCO) dataset.

Architecture

The encoder consists of a ResNet152 CNN followed by a linear layer and batch normalization. The CNN is pretrained on the ImageNet dataset, and its final fully connected layer is removed.
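
A minimal sketch of such an encoder, assuming an `embed_size` projection dimension and a frozen backbone (both are assumptions for illustration, not taken verbatim from this repository):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """ResNet152 backbone with its final fc layer removed,
    plus a linear projection and batch normalization."""

    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
        # Keep everything up to (and including) the global average pool;
        # drop the final fully connected classification layer.
        self.resnet = nn.Sequential(*list(resnet.children())[:-1])
        self.linear = nn.Linear(resnet.fc.in_features, embed_size)
        self.bn = nn.BatchNorm1d(embed_size, momentum=0.01)

    def forward(self, images):
        # Assumption: the pretrained backbone is used as a fixed feature extractor.
        with torch.no_grad():
            features = self.resnet(images)
        features = features.reshape(features.size(0), -1)
        return self.bn(self.linear(features))
```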

The decoder consists of an LSTM network followed by a linear layer and a softmax over the vocabulary. At each step, the LSTM takes the encoded image features from the encoder and the sequence of previously generated words as input to predict the next word in the caption.
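
A matching sketch of the decoder, with the same caveat: the parameter names (`hidden_size`, `vocab_size`) and the teacher-forced `forward` are illustrative, not the repository's exact code. Note that in a typical training setup the softmax is folded into the cross-entropy loss:

```python
class DecoderRNN(nn.Module):
    """LSTM decoder: the image features seed the sequence,
    then caption word embeddings follow."""

    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the encoded image as the first step of the input sequence.
        embeddings = self.embed(captions)
        inputs = torch.cat((features.unsqueeze(1), embeddings), dim=1)
        hiddens, _ = self.lstm(inputs)
        # Raw vocabulary scores; the softmax is applied by the loss
        # (e.g. nn.CrossEntropyLoss) during training.
        return self.linear(hiddens)
```

At training time the two modules compose as `outputs = decoder(encoder(images), captions)`, trained with cross-entropy against the caption shifted by one token.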

Here are the trained model's weights 🔗.

📋 To-Do List

  • Read papers on Image Captioning (ongoing)
  • Define a custom Dataloader for the MS COCO dataset (a rough sketch appears after this list)
  • Build the CNN encoder
  • Build the LSTM decoder
  • Build the Training loop
  • Evaluate the Model
  • Iterate to make the model better
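
As a rough stand-in for the custom Dataloader item above, torchvision's built-in `CocoCaptions` dataset shows the shape of the raw data; the paths below are hypothetical, and the repository's own loader may differ:

```python
import torchvision.transforms as T
from torchvision.datasets import CocoCaptions

# Standard ImageNet normalization, since the encoder backbone is ImageNet-pretrained.
transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

# Hypothetical local paths; point these at your MS COCO download.
dataset = CocoCaptions(root="data/train2014",
                       annFile="data/annotations/captions_train2014.json",
                       transform=transform)

image, captions = dataset[0]  # image: 3x224x224 tensor; captions: list of reference strings
# A custom collate_fn is still needed to tokenize, numericalize, and pad the
# captions before batching them with a DataLoader.
```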