We recommend using a virtual environment for the installation. We used virtualenv with the following commands.
python -m virtualenv ditenv
source ditenv/bin/activate
STEP 1 - OpenCV (pip)
pip install opencv-contrib-python
# or
pip install opencv-python
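To verify the OpenCV bindings installed correctly, you can run a quick check in a Python shell (a minimal sanity check, nothing more):
import cv2
# should print the installed OpenCV version, e.g. 4.x.x
print(cv2.__version__)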
STEP 2 - PyTorch (pip)
We used PyTorch v1.10.2 and torchvision v0.11.3 to build the pipeline, but PyTorch v1.9.0 and torchvision v0.10.0 were used in the unilm DiT repository by Microsoft. PyTorch versions higher than 1.9.0 should work.
The install command below was used on the author's machine. Depending on your CUDA version, OS, and package manager, yours will probably be different. You can get the install command based on your preferences from the PyTorch Get Started guide.
pip install torch==1.10.2+cu113 torchvision==0.11.3+cu113 torchaudio==0.10.2+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
Or, if you don't have a CUDA-enabled GPU, you can install the CPU-only build with PyTorch v1.11.0 and torchvision v0.12.0, which we also tested.
pip3 install torch==1.11.0+cpu torchvision==0.12.0+cpu torchaudio==0.11.0+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html
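Whichever build you choose, a short check in a Python shell confirms that PyTorch and torchvision import correctly and whether a CUDA device is visible (it will report False on the CPU-only build):
import torch
import torchvision
print(torch.__version__, torchvision.__version__)
# True only on a CUDA build with a working GPU driver
print(torch.cuda.is_available())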
STEP 3 - Detectron2 (pip)
The install command below should work to install Detectron2, the detection and segmentation toolkit by Facebook AI Research. If you experience any issues, you can use the official installation guide. Note that the library is hard to install on a Windows machine; for that reason we strongly recommend using Linux or macOS.
pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.5#egg=detectron2"
STEP 4 - Tesseract OCR (engine + pip)
A Tesseract OCR Python wrapper is used for text recognition. For more information about the installation, you can refer to the pytesseract GitHub repo.
First of all, you need to install the Tesseract engine itself. For Linux users the command below should suffice; others might want to look at the official documentation.
sudo apt install tesseract-ocr
# or install the engine together with a language pack (e.g., tesseract-ocr-eng or tesseract-ocr-deu; ISO 639-3 codes)
The Python wrapper can simply be installed with pip using the following command.
pip install pytesseract
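As a quick smoke test you can OCR any document image you have at hand; the file name below is just a placeholder, not something shipped with this repository:
import pytesseract
from PIL import Image
# replace 'sample_page.png' with a path to one of your own document images
text = pytesseract.image_to_string(Image.open("sample_page.png"))
print(text)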
Next, you need to download additional language packs. You can download them from either the tessdata or the tessdata_fast repository. Keep in mind that the fast packs trade some accuracy for speed.
You can either clone the whole repository or download individual packs. During development the English (eng, eng.traineddata), French (fra, fra.traineddata), and German (deu, deu.traineddata) language packs were used.
Put the language packs in a directory called tessdata and set the TESSDATA_PREFIX environment variable like we do below.
export TESSDATA_PREFIX=/home/user/tessdata
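With TESSDATA_PREFIX set, you can check that Tesseract actually picks up the packs and use them by their ISO 639-3 codes (get_languages is available in recent pytesseract versions; the image path is again a placeholder):
import pytesseract
from PIL import Image
# should list the packs you placed in the tessdata directory, e.g. ['deu', 'eng', 'fra']
print(pytesseract.get_languages(config=""))
# OCR a German page by passing the ISO 639-3 code explicitly
print(pytesseract.image_to_string(Image.open("german_page.png"), lang="deu"))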
STEP 5 - Layout Parser (pip)
For more information about this DIA toolkit, you can refer to the Layout Parser GitHub repository. The install command below should suffice if you just want to test the pipeline.
pip install layoutparser
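A simple import check is usually enough; if you also want to exercise the Detectron2 backend end to end, loading one of Layout Parser's own model-zoo configs works too (this downloads a small PubLayNet model on first use and is not the DiT model this pipeline uses):
import layoutparser as lp
print(lp.__version__)
# optional: end-to-end check of the layoutparser + detectron2 install
model = lp.Detectron2LayoutModel("lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config")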
STEP 6 - Other requirements (pip)
Install the required Python libraries from the root of the repository. Don't forget to activate your virtual environment if applicable. You can also inspect requirements.txt and install the packages manually.
pip install -r requirements.txt
STEP 7 - Model weights
After your environment is set up, you should download the pre-trained model weights from this link (1.4 GB) and place the file in the './resources/weights/' directory. You can either do this manually or use the commands below.
# from root directory
wget --directory-prefix=resources/weights https://layoutlm.blob.core.windows.net/dit/dit-fts/publaynet_dit-l_cascade.pth
A different model is used for table structure recognition. If you want table extraction, the weights of another pre-trained model have to be downloaded from here. Put this file in the './resources/weights' directory as well. For this you can also use the commands below.
# from root directory
wget --directory-prefix=resources/weights https://pubtables1m.blob.core.windows.net/model/pubtables1m_detection_detr_r18.pth
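Run from the repository root, a small script can confirm that both weight files ended up in the right place (file names taken from the download URLs above):
from pathlib import Path

weights_dir = Path("resources/weights")
for name in ["publaynet_dit-l_cascade.pth", "pubtables1m_detection_detr_r18.pth"]:
    path = weights_dir / name
    if path.exists():
        print(f"{path}: {path.stat().st_size / 2**20:.0f} MB")
    else:
        print(f"{path}: missing")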
Check out the examples for a guide on how to enable table extraction in the pipeline.
@Developers -> looking to add table extraction to your own document image analysis pipeline? You can find the original repository right here.