A large-scale dataset of 67,728 actual eBay listing images and titles, along with human-translated German titles
The Image Guided Translation Dataset is built from 67,728 actual eBay listing (item) images and titles provided by sellers. The listings are drawn from a range of categories, such as Home & Garden, Consumer Electronics, Toys & Hobbies, Business & Industrial, Musical Instruments, and Crafts. They are also chosen from a wide variety of sellers and locations, forming a heterogeneous dataset suitable for training large multimodal models.
All listings (items) in this dataset were listed by sellers on the eBay United States website, with an English listing title provided by the seller at the time of listing. Each English title was then manually translated into German by human translators, either professional translators or crowd-sourced workers.
The Image Guided Translation Dataset is released to foster research collaboration with universities. The image-guided translation task aims to generate item titles in a target language. It can be addressed as a multisource multimodal translation task, which takes source-language listing titles in multiple languages and translates them into the target language, using the visual information as additional context. Multimodal translation draws information from more than one modality, on the assumption that the additional modalities contain useful alternative views of the input data.
This collaboration with universities will help eBay find better solutions for translating polysemous words in listing titles, which are short texts but carry critical information for eBay pages such as search results.
Here is a summary of the contents of the Image Guided Translation Dataset:
Set | Listing Images | Listing Titles (English & German) |
---|---|---|
Training | 54,378 | 54,378 |
Test | 6,696 | 6,696 |
Validation | 6,652 | 6,652 |
The TSV mapping file contains the listing titles and their corresponding image mappings, in the following format:
Label | Description |
---|---|
project_name | Name of the project |
set_name | train / test / val, specifying which set the row belongs to |
image_id | Listing image id |
image_file | Listing image file name in the image data files |
source | Seller-provided listing title in English |
target | Translated listing title in German |
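
As a rough illustration of how the mapping file might be consumed, the sketch below parses tab-separated rows into dictionaries keyed by the labels in the table above and groups them by `set_name`. It makes several assumptions that are not confirmed by this README: that the columns appear in the order listed in the table, that the file starts with a header row (pass `has_header=False` otherwise), and that it is UTF-8 encoded. The helper names and the sample rows are made up for the example.

```python
import csv
import io
from collections import defaultdict

# Column labels from the format table; their order in the file is an assumption.
COLUMNS = ["project_name", "set_name", "image_id", "image_file", "source", "target"]


def read_mapping(handle, has_header=True):
    """Parse tab-separated mapping rows from an open text handle into dicts."""
    # QUOTE_NONE keeps quote characters in titles (e.g. inch marks) as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for index, fields in enumerate(reader):
        if index == 0 and has_header:
            continue  # skip the (assumed) header row
        if not fields:
            continue  # skip blank lines
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


def load_mapping(path, has_header=True):
    """Read a mapping file from disk; the path is supplied by the caller."""
    with open(path, encoding="utf-8", newline="") as handle:
        return read_mapping(handle, has_header)


def group_by_set(rows):
    """Group parsed rows by their set_name value (train / test / val)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["set_name"]].append(row)
    return dict(groups)


# Made-up sample rows, used only to demonstrate the helpers.
SAMPLE = (
    "project_name\tset_name\timage_id\timage_file\tsource\ttarget\n"
    "demo\ttrain\t1\t1.jpg\tRed wooden chair\tRoter Holzstuhl\n"
    "demo\tval\t2\t2.jpg\t19\" monitor stand\t19\" Monitorständer\n"
)
rows = read_mapping(io.StringIO(SAMPLE))
groups = group_by_set(rows)
print({name: len(items) for name, items in groups.items()})
```

Comparing the per-set row counts against the summary above is a quick sanity check that the file was read with the right header and delimiter settings.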
Please first install Git Large File Storage (Git LFS) by following the instructions below. This step must be completed before the large image data files can be downloaded from the repo.
https://help.github.com/articles/installing-git-large-file-storage/
The repo can then be cloned using git clone:
git clone git@github.com:eBay/ImageGuidedTranslationDataset.git
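
If the repo is cloned without Git LFS installed, the large data files are typically checked out as small text pointer files rather than the real content. The sketch below is one way to detect that case: it checks whether a file begins with the version line that Git LFS pointer files start with. The helper names are hypothetical, and the path passed to `is_lfs_pointer` is whatever data file you want to check.

```python
# First line of every Git LFS pointer file (per the Git LFS pointer format).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def looks_like_lfs_pointer(head: bytes) -> bool:
    """Return True if the given leading bytes match a Git LFS pointer file."""
    return head.startswith(LFS_POINTER_PREFIX)


def is_lfs_pointer(path) -> bool:
    """Check a file on disk by reading only as many bytes as the prefix needs."""
    with open(path, "rb") as handle:
        return looks_like_lfs_pointer(handle.read(len(LFS_POINTER_PREFIX)))
```

A file that still looks like a pointer after cloning suggests that Git LFS was not installed at clone time; installing it and running `git lfs pull` inside the clone should fetch the actual content.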
Please use our issues page to ask questions, report issues or submit feature requests.
The data is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0).