milosacimovic/docxpand

Synthetic identity documents dataset

DocXPand tool


Functionalities

This repository exposes functions to generate documents using templates and generators, contained in docxpand/templates:

  • Templates are SVG files containing information about the appearance of the documents to generate: their backgrounds, the fields contained in the document, the positions of these fields, etc.
  • Generators are JSON files describing how to generate the content of each field.

This repository allows you to:

  • Generate documents for known templates (id_card_td1_a, id_card_td1_b, id_card_td2_a, id_card_td2_b, pp_td3_a, pp_td3_b, pp_td3_c, rp_card_td1 and rp_card_td2), by filling the templates with random fake information.
  • Integrate generated documents into scenes, replacing documents originally present in those scenes.
    • This requires a dataset of background scenes usable for this task, with the coordinates of the original documents to replace with generated fake documents.
    • To integrate documents, use the insert_generated_documents_in_scenes.py script. It takes as input the directory containing the generated document images, a JSON dataset containing information about those document images (generated by the script above), the directory containing the "scene" (background) images, a JSON dataset containing localization information, an output directory to store the final images, and a Chrome WebDriver matching the version of your installed Chrome browser. The background scene images must contain images that are present in the docxpand/specimens directory. See the SOURCES.md file for more information.
    • All JSON datasets must follow the DocFakerDataset format, defined in docxpand/dataset.py.
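As a minimal sketch of working with such a file (assuming only that it is plain JSON; the authoritative schema lives in docxpand/dataset.py, and the "documents" key below is a hypothetical example, not the confirmed layout):

```python
import json

def load_dataset(path: str) -> dict:
    """Read a DocFakerDataset JSON file into a plain dict.

    NOTE: the actual schema is defined in docxpand/dataset.py; any key
    names used by callers (e.g. "documents") are assumptions.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

For real use, prefer instantiating the repository's DocFakerDataset class, which validates the structure.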

Installation

Run

poetry install

To generate the OCR dataset we need Puppeteer, a Node.js package. Puppeteer renders the SVGs in a browser, which then allows us to extract text coordinates from the HTML.

Install Node.js and npm by following the instructions on the Node.js website.

Then install the Node packages:

npm i

Usage

To generate SVGs and render PNGs of synthetic documents, run:

poetry run python scripts/dataset/generate_fake_structured_documents.py -n <number_to_generate> -o <output_directory> -t <template.json_to_use> -s <path_to_stable_diffusion_web_api>

To insert documents into target images, run:

poetry run python scripts/insert_generated_documents_in_scenes.py -di <document_images_directory> -dd <documents_dataset> -si <scene_images_directory> -sd <scenes_dataset> -o <output_directory>

Delete field from other side

When the synthetic documents are generated, the JSON for each side also contains fields from the other side, so that, for example, extracted field locations are duplicated. To avoid this repetition of fields after extraction, run:

poetry run python scripts/dataset/delete_other_side_fields.py -dd <input_json> -o <output_directory>
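The idea behind this step can be sketched as follows. The "side" and "fields" keys used here are assumptions made for illustration; the actual DocFakerDataset schema is defined in docxpand/dataset.py and may differ:

```python
from typing import Any

def drop_other_side_fields(entry: dict[str, Any]) -> dict[str, Any]:
    """Keep only the fields annotated for the side this image shows.

    ASSUMPTION: each entry carries a "side" label and a "fields" mapping
    in which every field records which side it belongs to.
    """
    side = entry["side"]
    entry["fields"] = {
        name: field
        for name, field in entry["fields"].items()
        # Fields without a side annotation are kept as-is.
        if field.get("side", side) == side
    }
    return entry
```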

Extract field locations

poetry run python scripts/dataset/extract_field_locations_from_svgs.py -dd <input_json> -di <input_directory> -o <output_directory>
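As a rough illustration of what location extraction involves (not the script's actual logic), the declared x/y attributes of named <text> elements in an SVG can be read with the standard library. The real pipeline renders the SVGs in a browser instead, because raw attributes ignore transforms and the rendered extent of the text:

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

def text_positions(svg_source: str) -> dict[str, tuple[float, float]]:
    """Map each <text> element's id to its declared (x, y) position.

    This ignores transforms and glyph extents, which is why accurate
    coordinates require an actual browser renderer.
    """
    root = ET.fromstring(svg_source)
    positions = {}
    for text in root.iter(SVG_NS + "text"):
        element_id = text.get("id")
        if element_id is not None:
            positions[element_id] = (
                float(text.get("x", "0")),
                float(text.get("y", "0")),
            )
    return positions
```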

Extract OCR from SVGs

To generate an OCR dataset from the generated SVGs, run:

poetry run python scripts/dataset/extract_ocr_from_svgs.py -dd <input_json> -di <input_directory> -o <output_directory>

DocXPand-25k dataset

The synthetic ID document images dataset ("DocXPand-25k"), released alongside this tool, is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

You can download the dataset from this release. It's split into 12 parts (DocXPand-25k.tar.gz.xx, from 00 to 11). Once you've downloaded all 12 binary files, you can extract the content using the following command: cat DocXPand-25k.tar.gz.* | tar xzvf -. The labels are stored in a JSON format readable with the DocFakerDataset class. The document images are stored in the images/ folder, which contains one sub-folder per class. The original image fields (identity photos, ghost images, barcodes, data matrices) integrated in the documents are stored in the fields/ sub-folder.
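The cat | tar pipeline above can also be reproduced in Python, for example on a system without the usual shell tools. This is just a convenience sketch, not part of the repository:

```python
import glob
import io
import shutil
import tarfile

def extract_split_tarball(parts_glob: str, dest_dir: str) -> list[str]:
    """Concatenate split .tar.gz parts (in lexical order) and extract them.

    Equivalent to: cat DocXPand-25k.tar.gz.* | tar xzvf -
    Returns the list of member names that were extracted.
    """
    buf = io.BytesIO()
    for part in sorted(glob.glob(parts_glob)):
        with open(part, "rb") as f:
            shutil.copyfileobj(f, buf)
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return names
```

For a 25k-image archive, extracting from an in-memory buffer needs enough RAM; streaming to a temporary file first is a safer variant on small machines.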
