DoGe is designed to synthesize a dataset of realistic document scans. Each document contains meaningful text: headings, tables, and paragraphs with different formatting and fonts, all parsed from Wikipedia. The coordinates of the words are extracted using the No-OCR method we developed for faster generation on CPU.
Check the full-size (1024x1024) samples in the resources folder.
You can use a Docker image with a predefined environment to run DoGe:
```bash
git clone https://github.com/Travvy88/DoGe
cd DoGe
docker build -t doge . && docker run -it doge
```
For faster generation, it is recommended to install all dependencies without Docker. DoGe is tested on Ubuntu 22.04.
```bash
sudo apt-get update && sudo apt-get install libreoffice libjpeg-dev zlib1g-dev poppler-utils
/usr/bin/python3 -m pip install --user unoserver  # install unoserver into the system python
git clone https://github.com/Travvy88/DoGe
cd DoGe
pip3 install -r requirements.txt  # you can create a venv first if needed
```
Then run the generator:
```bash
python3 main.py --out_dir data --image_size 244 --max_urls 16 --num_processes 2 --ports 4000 4001 4002 4003 --languages en ru
```
The following arguments can be passed to the script:

- `--out_dir`: The output directory for saving results. This argument is required.
- `--remove_existing_dir`: If set to `True`, the output directory will be deleted before creating a new one. Default is `False`.
- `--image_size`: The size of the final images. Default is `244`.
- `--start_page`: The starting page URL. Default is the Wikipedia main page.
- `--languages`: Permitted languages; other languages will be ignored. Default is `['en']`.
- `--max_urls`: The maximum number of URLs to process. Default is `100`.
- `--num_processes`: The number of processes to use. Default is `1`. Each process starts its own DocumentGenerator and a separate Unoserver instance.
- `--max_threads`: The maximum number of threads inside a process. Default is `3`.
- `--ports`: The list of ports to use. Default is `[8145, 8146]`. The number of ports should be 2 times larger than `num_processes`.
| Parameter | Description |
|---|---|
| `max_words` | The maximum number of words allowed in the generated documents. |
| `p_2columns` | The probability that the document will be formatted into two columns. |
| `font_size_interval` | The font size range from which the size is randomly selected for each document. |
| `p_line_spacing` | A list of probabilities controlling the line spacing of the document (1.5 or double). |
| `p_text_alignment` | A list of probabilities controlling the text alignment of the document (center, left, right, justify). |
| `p_heading_bold` | The probability that headings will be displayed in bold font. |
| `heading_relative_size_interval` | The range of relative font sizes for headings; the relative size is chosen randomly. |
| `p_heading_alignment` | A list of probabilities controlling the alignment of headings (center, left, right, justify). |
| `table_max_rows` | The maximum number of rows allowed in a table. Tables with more rows are dropped. |
| `table_max_cols` | The maximum number of columns allowed in a table. Tables with more columns are dropped. |
Parameters with probabilities and intervals are resampled randomly for each document. In my experience, the generator produces about 14 images per URL with the above Docx settings.
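For illustration, here is a minimal sketch of how such per-document sampling could look. The key names come from the table above, but the exact structure of `docx_config.json` and the sampling code are assumptions, not DoGe's actual implementation:

```python
import json
import random

# Hypothetical sketch: interval parameters draw a value uniformly from
# [lo, hi]; probability parameters drive a coin flip or a categorical choice.
with open("docx_config.json") as f:
    config = json.load(f)

font_size = random.randint(*config["font_size_interval"])  # e.g. [10, 14]
two_columns = random.random() < config["p_2columns"]
alignment = random.choices(
    ["center", "left", "right", "justify"],
    weights=config["p_text_alignment"],
)[0]
```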
The augmentation pipeline is applied at the final stage. You can manage the augmentations in the `src/augmentations.py` file. Read the Augraphy docs for a detailed explanation.
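As a rough illustration of what such a pipeline looks like, here is a minimal sketch with arbitrarily chosen augmentations, not the exact phases defined in `src/augmentations.py`:

```python
import cv2
from augraphy import AugraphyPipeline, InkBleed, DirtyDrum

# A minimal two-phase pipeline; DoGe's actual phases live in src/augmentations.py.
pipeline = AugraphyPipeline(
    ink_phase=[InkBleed()],
    paper_phase=[],
    post_phase=[DirtyDrum()],
)

image = cv2.imread("page.png")
augmented = pipeline.augment(image)["output"]
cv2.imwrite("page_augmented.png", augmented)
```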
First, the `Manager` class creates the `DocumentGenerator` instances in separate processes. For each `DocumentGenerator`, a Unoserver instance is started.
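To make the two-ports-per-process requirement concrete, here is a hypothetical sketch of starting one Unoserver per worker. The `--port`/`--uno-port` flags are from the unoserver CLI, but the pairing logic is an assumption, not DoGe's exact code:

```python
import subprocess

# Hypothetical sketch: each worker process gets a pair of ports, one for the
# Unoserver RPC endpoint and one for the underlying LibreOffice instance.
ports = [4000, 4001, 4002, 4003]             # --ports from the example above
pairs = list(zip(ports[0::2], ports[1::2]))  # -> [(4000, 4001), (4002, 4003)]

servers = []
for server_port, uno_port in pairs:
    servers.append(subprocess.Popen([
        "unoserver",
        "--port", str(server_port),
        "--uno-port", str(uno_port),
    ]))
# Each DocumentGenerator process would then talk to its own server_port.
```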
Then, the `UrlParser` generates a list of URLs by crawling the web, starting from a given start page and following links on each page. It uses `BeautifulSoup` to parse HTML content and extract links, then checks each link's validity and language, adding it to the list if it meets certain conditions. The process continues until the maximum number of URLs is reached, and the method returns the list of generated URLs, excluding the starting URL.
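A minimal sketch of this crawling loop might look as follows; the function and variable names are illustrative, not the actual `UrlParser` API:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def collect_urls(start_page, languages=("en",), max_urls=100):
    """Illustrative breadth-first crawl; the real UrlParser applies its own
    validity and language checks before accepting a link."""
    urls, queue = [], [start_page]
    while queue and len(urls) < max_urls:
        page = queue.pop(0)
        soup = BeautifulSoup(requests.get(page, timeout=10).text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(page, a["href"])
            # Accept only article pages in one of the permitted languages.
            if any(link.startswith(f"https://{lang}.wikipedia.org/wiki/")
                   for lang in languages) and link not in urls and link != start_page:
                urls.append(link)
                queue.append(link)
                if len(urls) >= max_urls:
                    break
    return urls
```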
When data generation begins, the list of URLs is divided into chunks, one for each `DocumentGenerator`. Each `DocumentGenerator` instance retrieves a Wikipedia HTML page by URL from its chunk. Headings, paragraph formatting, and tables are extracted and placed into a Docx document via the `DocxDocument` class. At this stage, random parametrization is applied according to `docx_config.json`. For example, font size, text alignment, one or two columns, and other parameters are chosen randomly for each document.
After that, each word in the Docx is filled with a unique color, so a colored rectangle appears in place of each word. The image will be encoded with 24-bit color depth, so the maximum number of words per document is 16,777,216 (2^24). The text of each word is saved to a hashmap of type `color_code -> word`.
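The mechanics can be sketched with python-docx roughly as below. DoGe colors individual words, while this simplified sketch colors whole runs, and the helper names are assumptions:

```python
from docx import Document
from docx.shared import RGBColor
from docx.oxml import OxmlElement
from docx.oxml.ns import qn

def index_to_hex(i: int) -> str:
    # Encode a word index as a unique 24-bit color, e.g. 5 -> "000005".
    return f"{i:06X}"

doc = Document("page.docx")
color_to_word = {}

# Start at 1 so no word shares the default black (000000) of ordinary text.
for i, run in enumerate((r for p in doc.paragraphs for r in p.runs), start=1):
    hex_color = index_to_hex(i)
    # Paint both the glyphs and the run background with the same color,
    # so the word renders as a solid rectangle.
    run.font.color.rgb = RGBColor.from_string(hex_color)
    shd = OxmlElement("w:shd")
    shd.set(qn("w:val"), "clear")
    shd.set(qn("w:fill"), hex_color)
    run._element.get_or_add_rPr().append(shd)
    color_to_word[hex_color] = run.text

doc.save("colored.docx")
```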
The next step is Docx to image conversion. DoGe uses Unoserver to convert Docx to Pdf and pdf2image for image rendering.
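For a single document, this conversion chain can be reproduced as follows (a sketch assuming a Unoserver instance is already running; `unoconvert` is the CLI that ships with the unoserver package):

```python
import subprocess
from pdf2image import convert_from_path

# Ask the running Unoserver to convert Docx -> Pdf.
subprocess.run(["unoconvert", "colored.docx", "colored.pdf"], check=True)

# Render each Pdf page to a PIL image (requires poppler-utils).
pages = convert_from_path("colored.pdf", dpi=150)
for i, page in enumerate(pages):
    page.save(f"page_{i}.png")
```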
Then, all rectangle coordinates are detected via OpenCV on converted images. The word for each bounding box is retrieved from the hashmap. DoGe saves annotations to JSON files in the following format:
```json
{
  "words": [
    "Hello",
    "World"
  ],
  "bboxes": [
    [0.1, 0.1, 0.03, 0.02],
    [0.4, 0.3, 0.11, 0.02]
  ]
}
```
The bboxes are normalized and saved in XYWH format.
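A pixel-level sketch of this detection step might look as follows; it recovers boxes by masking each known color code, and all names are illustrative rather than DoGe's exact OpenCV routine:

```python
import cv2
import numpy as np

def detect_word_boxes(image_path, color_to_word):
    img = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    h, w = img.shape[:2]
    # Collapse each pixel's RGB triple into a single 24-bit code.
    codes = (img[..., 0].astype(np.uint32) << 16) \
          | (img[..., 1].astype(np.uint32) << 8) \
          | img[..., 2].astype(np.uint32)
    words, bboxes = [], []
    for code in np.unique(codes):
        key = f"{code:06X}"
        if key not in color_to_word:
            continue  # background or anti-aliasing artifacts
        ys, xs = np.nonzero(codes == code)
        x0, y0 = xs.min(), ys.min()
        words.append(color_to_word[key])
        # Normalized XYWH, matching the annotation format above.
        bboxes.append([x0 / w, y0 / h,
                       (xs.max() - x0) / w, (ys.max() - y0) / h])
    return {"words": words, "bboxes": bboxes}
```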
The final step is removing all color fills from the words in the Docx document, rendering the images again, applying Augraphy augmentations, and saving the augmented images to disk. That's it!
- Download and place images into documents
- Annotate headers, tables, paragraphs
- Different output formats
Here are some great open-source projects and teams this project benefits from:
- ISP RAS Dedoc Team for support and assistance.
- Augraphy for the augmentation of final images.
- Unoserver for converting Docx to Pdf.
- pdf2image for rendering images from Pdf.
- Pillow-SIMD for faster image processing.