- Python 3.9 or 3.10
- Poetry
- Chrome and the corresponding webdriver
- Stable Diffusion for face generation, see stable_diffusion
This repository exposes functions to generate documents using templates and generators, contained in docxpand/templates:
- Templates are SVG files containing information about the appearance of the documents to generate, i.e. their backgrounds, the fields they contain, the positions of these fields, etc.
- Generators are JSON files containing information on how to generate the content of the fields.
This repository allows you to:
- Generate documents for known templates (id_card_td1_a, id_card_td1_b, id_card_td2_a, id_card_td2_b, pp_td3_a, pp_td3_b, pp_td3_c, rp_card_td1 and rp_card_td2) by filling the templates with random fake information.
- These templates are inspired by European ID cards, passports and residence permits. Their formats follow ISO/IEC 7810, and they contain a machine-readable zone (MRZ) that follows the Machine Readable Travel Documents specifications.
- To generate documents, use the generate_fake_structured_documents.py script, which takes as input the name of one of the templates, the number of fake documents to generate, an output directory, and the URL of a service that serves generated photos of human faces using Stable Diffusion.
- Integrate generated documents into scenes, replacing other documents originally present in those scenes.
- This requires a dataset of background scenes usable for this task, with the coordinates of the original documents to be replaced by generated fake documents.
- To integrate documents, use the insert_generated_documents_in_scenes.py script, which takes as input the directory containing the generated document images, a JSON dataset containing information about those document images (generated by the script above), the directory containing the "scene" (background) images, a JSON dataset containing localization information, an output directory to store the final images, and a Chrome webdriver matching the version of your installed Chrome browser. The background scene images must contain images that are present in the docxpand/specimens directory. See the SOURCES.md file for more information.
- All JSON datasets must follow the DocFakerDataset format, defined in docxpand/dataset.py.
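The MRZ mentioned in the list above protects several of its fields with check digits. As a minimal sketch of how such a digit is computed under the ICAO Doc 9303 rules (weights 7, 3, 1 repeating; digits keep their value, letters A to Z map to 10 to 35, and the filler `<` counts as 0), consider the helper below. The function names are hypothetical and are not taken from the docxpand code base, which may organize this differently.

```python
# Illustrative sketch only: names are hypothetical, not part of docxpand.
MRZ_WEIGHTS = (7, 3, 1)


def mrz_char_value(char: str) -> int:
    """Map one MRZ character to its numeric value (ICAO Doc 9303)."""
    if "0" <= char <= "9":
        return int(char)
    if "A" <= char <= "Z":
        return ord(char) - ord("A") + 10
    if char == "<":
        return 0
    raise ValueError(f"invalid MRZ character: {char!r}")


def mrz_check_digit(field: str) -> int:
    """Weighted sum of the character values, modulo 10."""
    total = sum(
        mrz_char_value(char) * MRZ_WEIGHTS[index % 3]
        for index, char in enumerate(field)
    )
    return total % 10


# Values from the ICAO Doc 9303 specimen passport.
print(mrz_check_digit("L898902C3"))  # document number -> 6
print(mrz_check_digit("740812"))     # date of birth   -> 2
print(mrz_check_digit("120415"))     # date of expiry  -> 9
```

A check like this can be handy for sanity-checking the MRZ lines of generated documents before using them as ground truth.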
Run
poetry install
To generate the OCR dataset we need puppeteer, which is a node package. puppeteer will render the SVGs in a browser, which then allows us to extract text coordinates from the HTML.
Install node and npm
Follow the instructions on the Node website.
Then install the node packages:
npm i
To generate SVGs and render PNGs of synthetic documents, run:
poetry run python scripts/dataset/generate_fake_structured_documents.py -n <number_to_generate> -o <output_directory> -t <template.json_to_use> -s <path_to_stable_diffusion_web_api>
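When several templates need to be generated in one go, the command above can be assembled programmatically. The following is a minimal sketch that only uses the flags shown above (`-n`, `-o`, `-t`, `-s`); the helper name, the output paths and the `http://localhost:7860` URL are assumptions made for illustration, and the value expected by `-t` (a template name or a path) should be checked against the script's own help.

```python
import shlex
from typing import List

SCRIPT = "scripts/dataset/generate_fake_structured_documents.py"


def build_generate_command(
    template: str, count: int, output_dir: str, stable_diffusion_url: str
) -> List[str]:
    """Assemble the generation command as an argument list (hypothetical helper)."""
    return [
        "poetry", "run", "python", SCRIPT,
        "-n", str(count),
        "-o", output_dir,
        "-t", template,
        "-s", stable_diffusion_url,
    ]


if __name__ == "__main__":
    # Placeholder values: adapt the templates, paths and URL to your setup.
    for template in ("id_card_td1_a", "pp_td3_a"):
        command = build_generate_command(
            template, 10, f"output/{template}", "http://localhost:7860"
        )
        # Print the shell-quoted command instead of running it.
        print(shlex.join(command))
```

Each argument list can be passed to `subprocess.run(command, check=True)` to actually launch the generation.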
To insert documents into target images, run:
poetry run python scripts/insert_generated_documents_in_scenes.py -di <document_images_directory> -dd <documents_dataset> -si <scene_images_directory> -sd <scenes_dataset> -o <output_directory>
Delete fields from the other side
When the synthetic documents are generated, the JSON for each side also contains the fields of the other side, so extracting field locations, for example, yields repeated fields. To avoid repeated fields after extraction, run:
poetry run python scripts/dataset/delete_other_side_fields.py -dd <input_json> -o <output_directory>
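To make the duplication problem concrete, here is a toy sketch of the idea. The record layout below (a list of dictionaries with `name` and `side` keys) is purely hypothetical and is not the DocFakerDataset schema; the actual script works on the real format defined in docxpand/dataset.py.

```python
from typing import Dict, List

Field = Dict[str, str]


def keep_fields_of_side(fields: List[Field], side: str) -> List[Field]:
    """Keep only the fields that belong to the given side (toy structure)."""
    return [field for field in fields if field["side"] == side]


# Hypothetical example: both sides carry the full list of fields.
all_fields = [
    {"name": "family_name", "side": "front"},
    {"name": "mrz", "side": "back"},
]
front_entry = list(all_fields)
back_entry = list(all_fields)

# Without filtering, every field is seen once per side.
print(len(front_entry) + len(back_entry))  # 4

# After filtering, each field appears exactly once.
front_only = keep_fields_of_side(front_entry, "front")
back_only = keep_fields_of_side(back_entry, "back")
print([field["name"] for field in front_only + back_only])
```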
Extract field locations
poetry run python scripts/dataset/extract_field_locations_from_svgs.py -dd <input_json> -di <input_directory> -o <output_directory>
Extract OCR from SVG
In order to generate an OCR dataset from the SVGs, run:
poetry run python scripts/dataset/extract_ocr_from_svgs.py -dd <input_json> -di <input_directory> -o <output_directory>
The synthetic ID document images dataset ("DocXPand-25k"), released alongside this tool, is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
You can download the dataset from this release. It is split into 12 parts (DocXPand-25k.tar.gz.xx, from 00 to 11). Once you have downloaded all 12 binary files, you can extract the content using the following command:
cat DocXPand-25k.tar.gz.* | tar xzvf -
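On systems without `cat` and `tar`, the same reassembly can be sketched in Python with the standard library. The function name and the directory arguments below are assumptions made for illustration. The sketch joins the parts in lexicographic order (which matches the 00 to 11 suffixes) into a temporary file before extracting, so it needs roughly as much free temporary space as the joined archive.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path
from typing import List


def extract_split_archive(
    parts_dir: str, output_dir: str, pattern: str = "DocXPand-25k.tar.gz.*"
) -> List[str]:
    """Join the archive parts in sorted order and extract the resulting tar.gz."""
    parts = sorted(Path(parts_dir).glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no archive parts matching {pattern!r} in {parts_dir}")
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryFile() as joined:
        # Equivalent of `cat DocXPand-25k.tar.gz.*`.
        for part in parts:
            with part.open("rb") as source:
                shutil.copyfileobj(source, joined)
        joined.seek(0)
        # Equivalent of `tar xzf -`; only use this on archives you trust.
        with tarfile.open(fileobj=joined, mode="r:gz") as archive:
            names = archive.getnames()
            archive.extractall(output_dir)
    return names
```

For example, `extract_split_archive("downloads", "DocXPand-25k")` would extract the parts stored in a `downloads` directory into `DocXPand-25k`; both directory names are placeholders.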
The labels are stored in a JSON format, which is readable using the DocFakerDataset class. The document images are stored in the images/ folder, which contains one sub-folder per class. The original image fields (identity photos, ghost images, barcodes, datamatrices) integrated in the documents are stored in the fields/ sub-folder.
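As a quick sanity check after extraction, the layout described above (an images/ folder with one sub-folder per class) can be inspected with a few lines of Python. This is a minimal sketch: the helper name is hypothetical, and it assumes the path passed in is the directory that directly contains images/.

```python
from pathlib import Path
from typing import Dict


def count_images_per_class(dataset_root: str) -> Dict[str, int]:
    """Count the files in each class sub-folder of <dataset_root>/images."""
    images_dir = Path(dataset_root) / "images"
    counts: Dict[str, int] = {}
    for class_dir in sorted(images_dir.iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for entry in class_dir.iterdir() if entry.is_file()
            )
    return counts
```

Calling `count_images_per_class("DocXPand-25k")` (with the placeholder replaced by your extraction directory) returns a mapping from class name to number of image files, which makes a missing or incomplete class easy to spot.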