This repository is the PyTorch implementation of the paper:
Diverse Image Captioning with Grounded Style
Franz Klein, Shweta Mahajan, Stefan Roth.
In GCPR 2021.
This codebase was developed with Python 3.6 and CUDA 9.0.
Required Python packages are summarized in `requirements.txt`.
```
.
├── data               # Senticap/COCO Attributes word forms, corresponding synsets and SentiWordNet scores
├── eval               # Evaluation tools based on Senticap and COCO reference captions
├── frcnn              # Faster R-CNN implementation augmented by an attribute detection component and image feature extraction functionality
├── misc               # Various scripts for pre- and postprocessing
├── updown_baseline    # Implementation of BU-TD image captioning + CBS, augmented by readers for Senticap and COCO Attributes
├── var_updown         # Implementation of the Style-SeqCVAE model introduced in this work
├── requirements.txt
├── LICENSE
└── README.md
```
- Initially, download and store the following datasets:

| Dataset | Website |
|---|---|
| Senticap | Link |
| COCO | Link |
| COCO Attributes | Link |
| SentiWordNet (optional) | Link |
- Faster R-CNN preparation: Please follow the instructions of the original implementation to set up this modified Faster R-CNN codebase.
- Style-SeqCVAE preparation: Please follow the instructions of the original implementation for setup.
- Additional steps:
  - Create a COCO/Senticap vocabulary by running the following command:
    ```
    python scripts/build_vocabulary.py -c /path/to/coco/captions.json -o /path/to/vocabulary/target/ -s /path/to/senticap/captions.json
    ```
  - Preprocess the COCO Attributes dataset by running `misc/gen_coco_attribute_objs.py`.
  - Augment COCO captions with COCO Attributes using `misc/prep_coco_att_data.py` or with Senticap adjectives using `misc/prep_senti_data.py`.
  - The SentiGloVe latent space can be prepared by running `misc/prep_expl_lat_space.py`; the sketch after this list illustrates the general idea.
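The repository scripts implement the actual construction, but the general idea of a GloVe-based sentiment latent space can be sketched as follows: collect GloVe vectors of sentiment word forms (e.g., Senticap adjectives or COCO Attributes words), optionally filter them by their SentiWordNet scores, and fit a mixture model whose components act as style anchors. The snippet below is a rough, self-contained illustration of that idea, not the logic of `prep_expl_lat_space.py`; the GloVe file name, word list, and choice of scikit-learn's `GaussianMixture` are assumptions.

```python
# Rough sketch of a GloVe-based sentiment latent space (NOT the repository's
# prep_expl_lat_space.py); file names, word list and model choice are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def load_glove(path):
    """Load GloVe vectors from a plain-text file into a dict {word: vector}."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove("glove.6B.300d.txt")              # assumed GloVe file
senti_words = ["happy", "lovely", "ugly", "lonely"]  # e.g. Senticap adjectives

# Stack the embeddings of all sentiment words that have a GloVe vector.
emb = np.stack([glove[w] for w in senti_words if w in glove])

# Fit a mixture model; its components act as style anchors of the latent space.
gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
gmm.fit(emb)

# Soft assignment of a new adjective to the style components.
print(gmm.predict_proba(glove["cheerful"][None, :]))
```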
The original Faster R-CNN implementation is augmented with an attribute detection layer and can be trained on COCO + COCO Attributes. Add `--cocoatts` as a runtime parameter to activate attribute detection:
```
CUDA_VISIBLE_DEVICES=0 python trainval_net.py \
    --dataset coco --net res101 \
    --bs 16 --nw 2 \
    --lr 0.01 --lr_decay_step 4 \
    --cuda --cocoatts
```
Adding `--senticap` as a runtime parameter ensures that training ignores images that occur in the Senticap test split.
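The attribute detection layer itself lives in `frcnn/`; for intuition, such a layer is essentially a small classification head on top of the ROI-pooled region features, trained alongside the object classifier. The snippet below is a minimal sketch of that idea, not the actual code in this repository; the feature dimension, the number of attribute classes, and the multi-label BCE loss are assumptions.

```python
# Minimal sketch of an attribute detection head on ROI-pooled features
# (illustrative only; dimensions, class count and loss choice are assumptions).
import torch
import torch.nn as nn

class AttributeHead(nn.Module):
    def __init__(self, feat_dim=2048, num_attributes=196):
        super().__init__()
        # Linear layer mapping pooled region features to attribute logits.
        self.fc = nn.Linear(feat_dim, num_attributes)

    def forward(self, pooled_feats):
        # pooled_feats: (num_rois, feat_dim) region features from ROI pooling.
        return self.fc(pooled_feats)

head = AttributeHead()
pooled = torch.randn(32, 2048)       # 32 region proposals
attr_logits = head(pooled)           # (32, num_attributes)

# Attributes are not mutually exclusive, so a multi-label BCE loss is a natural fit.
targets = torch.randint(0, 2, attr_logits.shape).float()
loss = nn.functional.binary_cross_entropy_with_logits(attr_logits, targets)
```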
To extract image features and the corresponding attribute detections, run the modified test script with `--feat_extract` as a parameter:
```
python test_net.py --dataset coco --net res101 \
    --checksession 1 --checkepoch 10 --checkpoint 14657 \
    --cuda --feat_extract
```
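Conceptually, this yields bottom-up region features (roughly a `(num_boxes, feat_dim)` matrix per image) plus per-box attribute scores that the captioning model consumes. The exact on-disk format is determined by the script; the following is a purely hypothetical illustration with made-up file names and keys.

```python
# Hypothetical per-image feature file; the actual format written by
# test_net.py --feat_extract may differ.
import numpy as np

np.savez_compressed(
    "123456.npz",
    features=np.random.randn(36, 2048).astype(np.float32),   # (num_boxes, feat_dim)
    attr_scores=np.random.rand(36, 196).astype(np.float32),  # (num_boxes, num_attributes)
)

data = np.load("123456.npz")
print(data["features"].shape, data["attr_scores"].shape)
```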
To start training, run `var_updown/scripts/train.py`, for example:
```
python scripts/train.py --config configs/updown_plus_cbs_nocaps_val.yaml \
    --serialization-dir /path/to/checkpoint/destination/folder \
    --gpu-ids 0
```
For evaluation of a trained model, first run the following command to store its predictions in a JSON file:
```
python scripts/inference.py --config configs/updown_plus_cbs_nocaps_val.yaml \
    --checkpoint-path /path/to/checkpoint.pth \
    --output-path /path/to/output.json \
    --gpu-ids 0
```
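COCO-style caption evaluation typically operates on a list of image-id/caption pairs. Assuming the output file follows the standard COCO caption result format (an assumption; check `eval/eval.py` for the exact fields it expects), it looks roughly like the following, with made-up ids and captions:

```python
# Assumed prediction format (standard COCO caption results); verify against eval/eval.py.
import json

predictions = [
    {"image_id": 391895, "caption": "a happy dog playing in a beautiful park"},
    {"image_id": 522418, "caption": "a lonely man sitting on a broken bench"},
]

with open("/path/to/output.json", "w") as f:
    json.dump(predictions, f)
```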
To evaluate the generated captions against COCO or Senticap ground-truth captions, set the paths and config parameters in `eval/eval.py` and run it.
The following metrics are available:
- BLEU
- METEOR
- ROUGE
- CIDEr
- n-gram diversity
- sentiment accuracy, sentiment recall
- Top-* oracle score calculation for each metric if multiple candidate captions are given per `image_id` (illustrated by the sketch below)
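The oracle evaluation keeps, for each image, the score of its best candidate caption and averages these maxima over all images. Below is a minimal sketch of that computation, assuming per-candidate scores have already been obtained with one of the metrics above; it is illustrative rather than the code in `eval/`.

```python
# Minimal sketch of an oracle score: per image, keep the best-scoring candidate
# caption and average those maxima over all images.
from collections import defaultdict

def oracle_score(per_candidate_scores):
    """per_candidate_scores: iterable of (image_id, score), one entry per candidate caption."""
    by_image = defaultdict(list)
    for image_id, score in per_candidate_scores:
        by_image[image_id].append(score)
    maxima = [max(scores) for scores in by_image.values()]
    return sum(maxima) / len(maxima)

# Example: two images with three candidate captions each (scores are made up).
scores = [(1, 0.21), (1, 0.35), (1, 0.30), (2, 0.50), (2, 0.44), (2, 0.61)]
print(oracle_score(scores))  # (0.35 + 0.61) / 2 = 0.48
```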
This project is based on the following publicly available repositories. We would like to thank all who have contributed to them.
If you use our code, please cite our GCPR 2021 paper:
```
@inproceedings{Klein:2021:DICWGS,
  title     = {Diverse Image Captioning with Grounded Style},
  author    = {Franz Klein and Shweta Mahajan and Stefan Roth},
  booktitle = {Pattern Recognition, 43rd DAGM German Conference, DAGM GCPR 2021},
  year      = {2021}
}
```