This repository contains the implementation of Exploiting Cross-Modal Prediction and Relation Consistency for Semi-Supervised Image Captioning.
- Python 3.6
- Java 1.8.0
- PaddlePaddle 2.1.0
- cider (already added as a submodule)
- coco-caption (already added as a submodule)
See details in data/README.md.
(Note: set word_count_threshold in scripts/prepro_labels.py to 4 to generate a vocabulary of size 10,369.)
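The effect of a word-count threshold on vocabulary size can be illustrated with a minimal sketch. This is not the code in scripts/prepro_labels.py: the function name `build_vocab`, the toy captions, and the rule that a word is kept only when its count strictly exceeds the threshold are assumptions made for illustration.

```python
from collections import Counter


def build_vocab(captions, word_count_threshold):
    """Hypothetical helper: keep words whose count exceeds the threshold.

    Assumption: rare words (count <= threshold) are replaced by a single
    "UNK" token, which is appended to the vocabulary when any exist.
    """
    counts = Counter(word for caption in captions for word in caption)
    vocab = sorted(w for w, n in counts.items() if n > word_count_threshold)
    if any(n <= word_count_threshold for n in counts.values()):
        vocab.append("UNK")
    return vocab


# Toy tokenized captions (illustrative data, not from the dataset).
captions = [
    ["a", "dog", "runs"],
    ["a", "dog", "sits"],
    ["a", "cat", "sits"],
]
vocab = build_vocab(captions, word_count_threshold=1)
print(vocab)
```

Raising the threshold shrinks the vocabulary, because more words fall into the rare-word bucket.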
You should also preprocess the dataset to build the n-gram cache used to compute the CIDEr score for SCST:
$ python scripts/prepro_ngrams.py --input_json data/dataset_coco.json --dict_json data/cocotalk.json --output_pkl data/coco-train --split train
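As a rough illustration of what such an n-gram cache holds, the sketch below counts, for each n-gram, the number of images whose reference captions contain it (the document frequency that CIDEr uses for its TF-IDF weighting). The function names, the toy data, and the in-memory dictionary layout are assumptions; the actual output of scripts/prepro_ngrams.py may be organized differently.

```python
from collections import defaultdict


def ngrams(tokens, max_n=4):
    """Return the set of all 1..max_n-grams in a token list."""
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(tuple(tokens[i:i + n]))
    return grams


def document_frequency(refs_per_image, max_n=4):
    """Count, per n-gram, how many images have a reference containing it."""
    df = defaultdict(int)
    for refs in refs_per_image:
        seen = set()
        for ref in refs:
            seen |= ngrams(ref, max_n)
        # Each image contributes at most 1 to an n-gram's count.
        for gram in seen:
            df[gram] += 1
    return dict(df)


# Two toy images with tokenized reference captions (illustrative data).
refs_per_image = [
    [["a", "dog", "runs"], ["the", "dog", "runs"]],
    [["a", "cat", "sits"]],
]
df = document_frequency(refs_per_image, max_n=2)
print(df[("a",)], df[("dog", "runs")])
```

Precomputing these counts once over the training split avoids recomputing them for every sampled caption during SCST training.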
$ CUDA_VISIBLE_DEVICES=0 sh train.sh
See opts.py for the options.
$ CUDA_VISIBLE_DEVICES=0 python eval.py --model log/log_aoanet_rl/model.pth --infos_path log/log_aoanet_rl/infos_aoanet.pkl --dump_images 0 --dump_json 1 --num_images -1 --language_eval 1 --beam_size 2 --batch_size 100 --split test
If you find this repo helpful, please consider citing:
@article{yang2022,
title={Exploiting Cross-Modal Prediction and Relation Consistency for Semi-Supervised Image Captioning},
author={Yang Yang and Hong-Chen Wei and Heng-Shu Zhu and Dian-Hai Yu and Hui Xiong and Jian Yang},
journal={IEEE Transactions on Cybernetics},
year={2022}
}