This repository is the PyTorch implementation of the paper:
Joint Wasserstein Autoencoders for Aligning Multimodal Embeddings (ICCV Workshops 2019)
Shweta Mahajan, Teresa Botschen, Iryna Gurevych and Stefan Roth
This repository is built on top of SCAN and VSE++ in PyTorch.
The code is written in Python 2.7.0 and requires CUDA 9.0.
Requirements:
- torch 0.3
- torchvision 0.3.0
- nltk 3.5
- gensim
- Punkt Sentence Tokenizer, downloaded via the interactive NLTK downloader:
import nltk
nltk.download()
> d punkt
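Alternatively, the tokenizer can be fetched non-interactively with the standard NLTK download call (equivalent to the d punkt step above):
python -c "import nltk; nltk.download('punkt')"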
To install requirements:
conda config --add channels pytorch
conda config --add channels anaconda
conda config --add channels conda-forge
conda config --add channels conda-forge/label/cf202003
conda create -n <environment_name> --file requirements.txt
conda activate <environment_name>
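For example, with an environment named jwae (the name is arbitrary):
conda create -n jwae --file requirements.txt
conda activate jwae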
- The preprocessed COCO and Flickr30K datasets used in the experiments are based on SCAN and can be downloaded at COCO_Precomp and F30k_Precomp. The downloaded datasets should be placed in the data folder.
- Run vocab.py to generate the vocabulary for the datasets:
python vocab.py --data_path data --data_name f30k_precomp
python vocab.py --data_path data --data_name coco_precomp
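A minimal sketch for inspecting the resulting vocabulary, assuming vocab.py pickles a Vocabulary object to vocab/coco_precomp_vocab.pkl as in VSE++ (both the output path and the pickle format are assumptions; adjust to what vocab.py actually writes):
import pickle
from vocab import Vocabulary  # the class must be importable so the pickle can be deserialized

# Assumed output location of vocab.py; change the path if your setup differs.
with open('vocab/coco_precomp_vocab.pkl', 'rb') as f:
    vocab = pickle.load(f)
print('Vocabulary size: %d' % len(vocab))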
A new JWAE model can be trained using the following:
python train.py --data_path "$DATA_PATH" --data_name coco_precomp --vocab_path "$VOCAB_PATH"
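For example, with the downloaded data in data and the generated vocabulary in vocab (both directory names are assumptions about the local layout):
DATA_PATH=data
VOCAB_PATH=vocab
python train.py --data_path "$DATA_PATH" --data_name coco_precomp --vocab_path "$VOCAB_PATH"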
The trained model can then be evaluated with the following Python snippet:
from vocab import Vocabulary
import evaluation
evaluation.evalrank("$CHECKPOINT_PATH", data_path="$DATA_PATH", split="test")
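The same evaluation can be run from the shell in one line; the checkpoint path below is a hypothetical example (the actual location depends on where train.py saves its checkpoints):
python -c "from vocab import Vocabulary; import evaluation; evaluation.evalrank('runs/coco_jwae/model_best.pth.tar', data_path='data', split='test')"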
If you find this code useful, please cite the paper:
@inproceedings{Mahajan:2019:JWA,
author = {Shweta Mahajan and Teresa Botschen and Iryna Gurevych and Stefan Roth},
booktitle = {ICCV Workshop on Cross-Modal Learning in Real World},
title = {Joint {W}asserstein Autoencoders for Aligning Multi-modal Embeddings},
year = {2019}
}