milosacimovic/docxpand

Synthetic identity documents dataset

DocXPand tool


Functionalities

This repository exposes functions to generate documents using templates and generators, contained in docxpand/templates:

  • Templates are SVG files containing information about the appearance of the documents to generate: their backgrounds, the fields contained in the document, the positions of these fields, etc.
  • Generators are JSON files describing how to generate the content of each field.

This repository allows you to:

  • Generate documents for known templates (id_card_td1_a, id_card_td1_b, id_card_td2_a, id_card_td2_b, pp_td3_a, pp_td3_b, pp_td3_c, rp_card_td1 and rp_card_td2), by filling the templates with random fake information.
  • Integrate generated documents into scenes, replacing documents originally present in those scenes.
    • This requires a dataset of background scenes usable for this task, with the coordinates of the original documents to replace with generated fake documents.
    • To integrate documents, use the insert_generated_documents_in_scenes.py script. It takes as input the directory containing the generated document images, a JSON dataset containing information about those document images (generated by the script above), the directory containing the "scene" (background) images, a JSON dataset containing localization information, an output directory to store the final images, and a Chrome WebDriver matching the version of your installed Chrome browser. The background scene images must contain images that are present in the docxpand/specimens directory. See the SOURCES.md file for more information.
    • All JSON datasets must follow the DocFakerDataset format, defined in docxpand/dataset.py.
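As a minimal sketch of working with such a file (assuming only that it is plain JSON; the authoritative schema lives in docxpand/dataset.py, and the "documents" key below is a hypothetical example, not the confirmed layout):

```python
import json

def load_dataset(path: str) -> dict:
    """Read a DocFakerDataset JSON file into a plain dict.

    NOTE: the actual schema is defined in docxpand/dataset.py; any key
    names used by callers (e.g. "documents") are assumptions.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

For real use, prefer instantiating the repository's DocFakerDataset class, which validates the structure.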

Installation

Run

poetry install

To generate the OCR dataset we need Puppeteer, a Node.js package. Puppeteer renders the SVGs in a browser, which then allows us to extract text coordinates from the HTML.

Install Node.js and npm by following the instructions on the Node.js website.

Then install the Node packages:

npm i

Usage

To generate SVGs and render PNGs of synthetic documents, run:

poetry run python scripts/dataset/generate_fake_structured_documents.py -n <number_to_generate> -o <output_directory> -t <template.json_to_use> -s <path_to_stable_diffusion_web_api>

To insert documents into target images, run:

poetry run python scripts/insert_generated_documents_in_scenes.py -di <document_images_directory> -dd <documents_dataset> -si <scene_images_directory> -sd <scenes_dataset> -o <output_directory>

Delete field from other side

When the synthetic documents are generated, the JSON for each side also contains fields from the other side, so that, for example, extracted field locations are duplicated. To avoid this repetition of fields after extraction, run:

poetry run python scripts/dataset/delete_other_side_fields.py -dd <input_json> -o <output_directory>
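The idea behind this step can be sketched as follows. The "side" and "fields" keys used here are assumptions made for illustration; the actual DocFakerDataset schema is defined in docxpand/dataset.py and may differ:

```python
from typing import Any

def drop_other_side_fields(entry: dict[str, Any]) -> dict[str, Any]:
    """Keep only the fields annotated for the side this image shows.

    ASSUMPTION: each entry carries a "side" label and a "fields" mapping
    in which every field records which side it belongs to.
    """
    side = entry["side"]
    entry["fields"] = {
        name: field
        for name, field in entry["fields"].items()
        # Fields without a side annotation are kept as-is.
        if field.get("side", side) == side
    }
    return entry
```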

Extract field locations

poetry run python scripts/dataset/extract_field_locations_from_svgs.py -dd <input_json> -di <input_directory> -o <output_directory>
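As a rough illustration of what location extraction involves (not the script's actual logic), the declared x/y attributes of named <text> elements in an SVG can be read with the standard library. The real pipeline renders the SVGs in a browser instead, because raw attributes ignore transforms and the rendered extent of the text:

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

def text_positions(svg_source: str) -> dict[str, tuple[float, float]]:
    """Map each <text> element's id to its declared (x, y) position.

    This ignores transforms and glyph extents, which is why accurate
    coordinates require an actual browser renderer.
    """
    root = ET.fromstring(svg_source)
    positions = {}
    for text in root.iter(SVG_NS + "text"):
        element_id = text.get("id")
        if element_id is not None:
            positions[element_id] = (
                float(text.get("x", "0")),
                float(text.get("y", "0")),
            )
    return positions
```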

Extract OCR from SVGs

To generate an OCR dataset from the generated SVGs, run:

poetry run python scripts/dataset/extract_ocr_from_svgs.py -dd <input_json> -di <input_directory> -o <output_directory>

DocXPand-25k dataset

The synthetic ID document images dataset ("DocXPand-25k"), released alongside this tool, is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

You can download the dataset from this release. It's split into 12 parts (DocXPand-25k.tar.gz.xx, from 00 to 11). Once you've downloaded all 12 binary files, you can extract the content using the following command: cat DocXPand-25k.tar.gz.* | tar xzvf -. The labels are stored in a JSON format readable with the DocFakerDataset class. The document images are stored in the images/ folder, which contains one sub-folder per class. The original image fields (identity photos, ghost images, barcodes, data matrices) integrated in the documents are stored in the fields/ sub-folder.
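The cat | tar pipeline above can also be reproduced in Python, for example on a system without the usual shell tools. This is just a convenience sketch, not part of the repository:

```python
import glob
import io
import shutil
import tarfile

def extract_split_tarball(parts_glob: str, dest_dir: str) -> list[str]:
    """Concatenate split .tar.gz parts (in lexical order) and extract them.

    Equivalent to: cat DocXPand-25k.tar.gz.* | tar xzvf -
    Returns the list of member names that were extracted.
    """
    buf = io.BytesIO()
    for part in sorted(glob.glob(parts_glob)):
        with open(part, "rb") as f:
            shutil.copyfileobj(f, buf)
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return names
```

For a 25k-image archive, extracting from an in-memory buffer needs enough RAM; streaming to a temporary file first is a safer variant on small machines.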
