links_to_data_files.md

File metadata and controls

17 lines (14 loc) · 1.67 KB

Icon dataset (for training and evaluating icon classification):

We provide a dataset of 250K icon images downloaded from Google Images, covering 391 different tag classes.

Note: these are the original downloaded image sets and have not been curated, so the data may be noisy.

Parsed text (OCR from within infographics):

Computed using Google's Cloud Vision API for OCR: https://cloud.google.com/vision/

  • raw_ocr_output.pickle (2.2 GB) contains all the extracted text along with the bounding boxes of individual words
    • a dictionary that maps each infographic filename to its extracted text
    • the extracted text for a file is a list whose first element is the full text extraction (with coordinates)
    • each subsequent element is an individual word paired with its bounding box coordinates, e.g., ('Road', ['(11,26)', '(55,26)', '(55,47)', '(11,47)'])
  • google_text_extraction_output.pckl (220 MB) contains only the list of extracted words for each infographic
    • a dictionary that maps each infographic filename to a list of its individual extracted words
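
As a rough illustration of the structure described above, the sketch below parses one word entry of the form shown in the example and reduces its four corner strings to a simple rectangle. The helper names (`load_pickle`, `parse_point`, `bounding_rect`) are hypothetical and not part of this repository, and the `encoding="latin1"` argument is an assumption that only matters if the files were pickled under Python 2.

```python
import pickle
import re


def load_pickle(path):
    """Load one of the pickle files from wherever it was saved locally."""
    with open(path, "rb") as f:
        # Assumption: latin1 decoding lets Python 3 read a pickle that may
        # have been written by Python 2; it is harmless otherwise.
        return pickle.load(f, encoding="latin1")


def parse_point(text):
    """Turn a coordinate string such as '(11,26)' into an (x, y) int tuple."""
    match = re.fullmatch(r"\(\s*(-?\d+)\s*,\s*(-?\d+)\s*\)", text)
    if match is None:
        raise ValueError(f"unexpected coordinate format: {text!r}")
    return int(match.group(1)), int(match.group(2))


def bounding_rect(corners):
    """Reduce a list of corner strings to (x_min, y_min, x_max, y_max)."""
    points = [parse_point(corner) for corner in corners]
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


# The word entry quoted in the list above.
word, corners = ('Road', ['(11,26)', '(55,26)', '(55,47)', '(11,47)'])
rect = bounding_rect(corners)
print(word, rect)  # prints: Road (11, 26, 55, 47)
```

With the full file downloaded, something like `ocr = load_pickle("raw_ocr_output.pickle")` followed by `ocr[filename][1:]` would, under the layout described above, give the per-word entries for one infographic; the first element is skipped there because it holds the full text extraction rather than a single word.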

See plot_text_detections.ipynb for examples of how to use these files.