In this project, I have implemented an end-to-end deep learning model for image captioning. The architecture consists of an encoder network and a decoder network: the encoder is one of several pre-trained CNN architectures and produces an image embedding, while the decoder is an LSTM network whose word embeddings are randomly initialized and learned from scratch.
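For concreteness, here is a minimal PyTorch sketch of that encoder-decoder pairing. The class names, the ResNet-18 backbone, and the default sizes are illustrative assumptions, not the project's actual code; the real implementation supports several encoder architectures via the -model flag described below.

```python
# Minimal sketch of the encoder-decoder pairing (illustrative names, not the
# project's actual classes); assumes a pre-trained ResNet-18 encoder and an
# LSTM decoder whose word embeddings are learned from scratch.
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    def __init__(self, embedding_dim=512):
        super().__init__()
        resnet = models.resnet18(pretrained=True)
        # Drop the final classification layer and map features to embedding_dim.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.fc = nn.Linear(resnet.fc.in_features, embedding_dim)

    def forward(self, images):
        with torch.no_grad():                   # keep the pre-trained CNN frozen
            features = self.backbone(images)    # (B, C, 1, 1)
        features = features.view(features.size(0), -1)
        return self.fc(features)                # (B, embedding_dim)

class DecoderRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim=512, hidden_dim=512):
        super().__init__()
        # Word embeddings are randomly initialized and trained with the model.
        self.embed = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_features, captions):
        # Prepend the image embedding as the first "word" of the sequence.
        inputs = torch.cat([image_features.unsqueeze(1), self.embed(captions)], dim=1)
        outputs, _ = self.lstm(inputs)
        return self.fc(outputs)                 # (B, T+1, vocab_size)
```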
- Python 3.6
- PyTorch 0.4.0
- torchvision
- Pillow
- NLTK
- pickle (ships with the Python standard library)
- CUDA 9.0/9.1
- cuDNN >= 7.0
pip install http://download.pytorch.org/whl/cu90/torch-0.4.0-cp36-cp36m-linux_x86_64.whl torchvision pillow nltk
Flickr8K
#train : 6000
#dev : 1000
#test : 1000
python3 Preprocess.py
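What Preprocess.py produces is project specific, but given the listed dependencies (NLTK for tokenization, pickle for serialization) it presumably builds the caption vocabulary from the Flickr8K annotations. A rough sketch of that kind of step, with the captions-file layout and output file name as assumptions:

```python
# Rough sketch of caption preprocessing (assumed layout: one
# "image_id<TAB>caption" pair per line in Flickr8k.token.txt).
import pickle
from collections import Counter

import nltk  # requires the "punkt" tokenizer data: nltk.download("punkt")

def build_vocab(caption_file, min_count=1):
    counter = Counter()
    captions = {}
    with open(caption_file) as f:
        for line in f:
            image_id, caption = line.rstrip("\n").split("\t")
            tokens = nltk.word_tokenize(caption.lower())
            counter.update(tokens)
            captions.setdefault(image_id, []).append(tokens)

    # Reserve indices for padding, start/end markers, and unknown words.
    vocab = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}
    for word, count in counter.items():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab, captions

if __name__ == "__main__":
    vocab, captions = build_vocab("Flickr8k.token.txt")
    with open("vocab.pkl", "wb") as f:
        pickle.dump(vocab, f)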
python3 train.py -model <encoder_architecture> -dir <train_dir_path> -save_iter <model_checkpoint> -learning_rate <learning_rate> -epoch <re-train_epoch> -gpu_device <gpu_device_number> -hidden_dim <lstm_hidden_state_dim> -embedding_dim <encoder_output>
-model : one of the CNN architectures - alexnet, resnet18, resnet152, vgg, inception, squeeze, dense
-dir : training directory path
-save_iter : save a model checkpoint every this many iterations, default = 10
-learning_rate : default = 1e-5
-epoch : re-train the network from the saved checkpoint at this epoch
-gpu_device : GPU device number, in case multiple GPUs are installed on the server
-hidden_dim : number of units in the LSTM's hidden state, default = 512
-embedding_dim : output dimension of the CNN encoder model, default = 512
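Under the hood, training pairs the encoder output with teacher-forced next-word prediction over the caption. A minimal sketch of one training step, reusing the illustrative classes from the earlier sketch; the vocabulary size and tensor shapes are assumptions, not values taken from train.py:

```python
# Sketch of a single training step with teacher forcing; EncoderCNN and
# DecoderRNN are the illustrative classes sketched earlier, not the project's
# actual modules. vocab_size and the tensor shapes are assumptions.
import torch
import torch.nn as nn

encoder = EncoderCNN(embedding_dim=512)
decoder = DecoderRNN(vocab_size=5000, embedding_dim=512, hidden_dim=512)

criterion = nn.CrossEntropyLoss(ignore_index=0)        # index 0 = <pad>
params = list(decoder.parameters()) + list(encoder.fc.parameters())
optimizer = torch.optim.Adam(params, lr=1e-5)          # matches the -learning_rate default

def train_step(images, captions):
    """images: (B, 3, 224, 224); captions: (B, T) word indices."""
    features = encoder(images)                         # (B, embedding_dim)
    outputs = decoder(features, captions[:, :-1])      # predict the next word at each step
    loss = criterion(outputs.reshape(-1, outputs.size(-1)), captions.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```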
python3 test.py -model <encoder_architecture> -i <image_path> -epoch <saved_model> -gpu_device <gpu_device_number>
-i : path of the image to generate a caption for
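At test time the caption is generated one word at a time from the image embedding. A minimal greedy-decoding sketch, again reusing the illustrative classes and the assumed vocabulary markers (the project's own sampling code may differ, e.g. by using beam search):

```python
# Greedy caption generation sketch; reuses the illustrative EncoderCNN /
# DecoderRNN classes and the assumed <start>/<end>/<pad> vocabulary markers.
import torch

def generate_caption(encoder, decoder, image, vocab, max_len=20):
    """image: (3, 224, 224) preprocessed tensor; vocab: word -> index dict."""
    index_to_word = {i: w for w, i in vocab.items()}
    features = encoder(image.unsqueeze(0))             # (1, embedding_dim)
    inputs, states = features.unsqueeze(1), None       # the image embedding is fed first
    words = []
    for _ in range(max_len):
        output, states = decoder.lstm(inputs, states)  # one LSTM step
        word_id = decoder.fc(output.squeeze(1)).argmax(dim=1)
        word = index_to_word[word_id.item()]
        if word == "<end>":
            break
        if word != "<start>":
            words.append(word)
        inputs = decoder.embed(word_id).unsqueeze(1)   # feed the predicted word back in
    return " ".join(words)
```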
Download trained model: trained for ~24 hours (230 iterations) on a single NVIDIA GTX 1080 (8 GB) GPU.
Since the training error is steadily decreasing, the model appears to be training as expected.