Icon dataset (for training and evaluating icon classification):

We provide a dataset of 250K icon images, downloaded from Google Images, covering 391 different tag classes.

validation set (4.5 GB uncompressed)
training set, split into 4 parts (each around 4.5-5.5 GB uncompressed, 21 GB total)

Note: these are the original downloaded image sets and have not been curated (warning: data may be noisy)

Parsed text (OCR from within infographics):

Computed using Google's Cloud Vision API for OCR: https://cloud.google.com/vision/

raw_ocr_output.pickle (2.2 GB) contains all the extracted text along with the bounding boxes of individual words
- contains a dictionary that maps infographic filenames to the extracted text
- the extracted text for each infographic is a list whose first element is the full text extraction (with coordinates)
- subsequent elements are individual words with their bounding box coordinates, e.g., ('Road', ['(11,26)', '(55,26)', '(55,47)', '(11,47)'])
google_text_extraction_output.pckl (220 MB) contains just a list of the extracted words per infographic
- contains a dictionary that maps infographic filenames to a list of individual extracted words
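
The file layout described above can be read with a few lines of Python. This is a minimal sketch, not the authors' loader: the helper names (load_pickle, parse_point, word_boxes) are hypothetical, and it assumes each per-word element is a (word, corners) tuple with coordinate strings of the form '(11,26)', as in the example above. Reading the coordinates as (x, y) pixel positions is also an assumption. If the pickles were written under Python 2, passing encoding="latin1" may be necessary.

```python
import pickle


def load_pickle(path, encoding="ASCII"):
    """Load one of the OCR pickle files into memory.

    Try encoding="latin1" if the default fails (assumption: the files
    may have been written under Python 2).
    """
    with open(path, "rb") as f:
        return pickle.load(f, encoding=encoding)


def parse_point(s):
    """Convert a coordinate string such as '(11,26)' into a tuple of ints."""
    a, b = s.strip("()").split(",")
    return int(a), int(b)


def word_boxes(entry):
    """Return (word, corner_points) pairs for one infographic.

    The first element of the entry (the full text extraction) is skipped;
    the remaining elements are assumed to be (word, corners) tuples.
    """
    return [(word, [parse_point(p) for p in corners])
            for word, corners in entry[1:]]


# Usage (assumes the files are in the working directory):
#   raw = load_pickle("raw_ocr_output.pickle")
#   filename, entry = next(iter(raw.items()))
#   print(filename, word_boxes(entry)[:5])
#
#   words = load_pickle("google_text_extraction_output.pckl")
#   print(words[filename][:10])
```

Note that raw_ocr_output.pickle is 2.2 GB on disk, so loading it needs at least that much free memory; google_text_extraction_output.pckl is the lighter choice when only the words are required.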

See plot_text_detections.ipynb for examples of how to use these files.