We provide a dataset of 250K icon images, downloaded from Google Images, covering 391 different tag classes:
- validation set (4.5 GB uncompressed)
- training set, split into 4 parts (each around 4.5-5.5 GB uncompressed, 21 GB total)
Note: these are the original image sets as downloaded. They are uncurated, so the data may be noisy.
The text extraction files listed here were computed using Google's Cloud Vision API for OCR: https://cloud.google.com/vision/
- raw_ocr_output.pickle (2.2 GB) contains all the extracted text along with the bounding boxes of individual words
  - contains a dictionary that maps infographic filenames to the extracted text
  - the extracted text is a list, where the first element is the full text extraction (with coordinates)
  - subsequent elements are individual words and their bounding box coordinates, e.g., ('Road', ['(11,26)', '(55,26)', '(55,47)', '(11,47)'])
- google_text_extraction_output.pckl (220 MB) contains just a list of the extracted words per infographic
  - contains a dictionary that maps infographic filenames to a list of individual extracted words
See plot_text_detections.ipynb for examples of how to use these files.
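
As a rough illustration of the raw_ocr_output.pickle layout described above, the sketch below walks one entry and converts the corner strings into integer tuples. It is not taken from the notebook. The helper names (load_pickle, parse_box, word_boxes) and the sample entry are hypothetical, and it assumes that every list element, including the first full-text one, is a (text, corners) pair with corners stored as strings like '(11,26)'.

```python
import ast
import pickle


def load_pickle(path):
    """Load one of the OCR pickle files (path is supplied by the caller)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def parse_box(corners):
    """Turn ['(11,26)', '(55,26)', ...] into [(11, 26), (55, 26), ...]."""
    return [tuple(ast.literal_eval(c)) for c in corners]


def word_boxes(entry):
    """Return (word, corners) pairs for one infographic.

    Assumes entry[0] is the full-text extraction and is skipped;
    the remaining elements are individual words with their boxes.
    """
    return [(word, parse_box(corners)) for word, corners in entry[1:]]


# Made-up stand-in for the real dictionary (filename -> extracted text list).
# Only the 'Road' element mirrors the example given in this README.
sample = {
    "example.jpg": [
        ("Road Trip", ['(11,26)', '(120,26)', '(120,47)', '(11,47)']),
        ("Road", ['(11,26)', '(55,26)', '(55,47)', '(11,47)']),
        ("Trip", ['(60,26)', '(120,26)', '(120,47)', '(60,47)']),
    ]
}

boxes = word_boxes(sample["example.jpg"])
words_only = [word for word, _ in boxes]  # same shape as the per-infographic word lists
```

With the real data, `load_pickle("raw_ocr_output.pickle")` would take the place of `sample`. If the files were written under Python 2, loading them in Python 3 may require `pickle.load(f, encoding="latin1")`.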