ammarali32/SimpleOCR

SimpleOCR

This repo is a simple example of how to use Tesseract OCR to extract text from image documents. It provides image preprocessing (denoising, line removal, and rotation correction)
and text postprocessing with some basic text-cleaning functions.
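As a rough illustration of what "basic text cleaning" can look like, here is a minimal sketch that drops empty lines and strips characters outside a whitelist. The function names and the `ALLOWED_CHARS` set are hypothetical stand-ins, not the repo's actual code; the repo keeps its own character set in `CFG.chars`.

```python
import string

# Hypothetical whitelist for illustration; the repo defines its own in CFG.chars.
ALLOWED_CHARS = set(string.ascii_letters + string.digits + string.punctuation + " \n")


def remove_empty_lines(text: str) -> str:
    """Drop lines that are empty or contain only whitespace."""
    return "\n".join(line for line in text.splitlines() if line.strip())


def clean_text(text: str, allowed: set = ALLOWED_CHARS) -> str:
    """Keep only characters that appear in the whitelist."""
    return "".join(ch for ch in text if ch in allowed)


def postprocess(text: str) -> str:
    """Apply character filtering first, then remove empty lines."""
    return remove_empty_lines(clean_text(text))


cleaned = postprocess("Hello\u00a7 world\n\n  \nSecond line\n")
```

Filtering characters before removing empty lines means a line that held only unwanted characters is dropped as well.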

Installation

Tesseract and leptonica

Make sure that you have installed Tesseract on your device.

Linux Installation

apt-get install libleptonica-dev

apt-get install tesseract-ocr libtesseract-dev

Poetry

pip install poetry


For other installation options, see

Project Installation

git clone https://github.com/ammarali32/SimpleOCR.git

cd SimpleOCR

poetry install

Testing

Run the package through its command-line interface:

poetry run python run.py --input="./coding_test/samples/oldpaper.jpg" --output="./output.txt" --verbose

It also works in interactive mode.
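For readers who want to see how flags like these are typically parsed, the following is a minimal `argparse` sketch. It is not the contents of `run.py`; the `parse_args` helper, the help strings, the default output path, and the choice to make `--input` required are all assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the three flags shown in the example command (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Extract text from an image document.")
    # Assumption: the input path is mandatory.
    parser.add_argument("--input", required=True, help="path to the input image")
    # Assumption: a default output path; the real default, if any, is not documented here.
    parser.add_argument("--output", default="./output.txt", help="text file to write")
    parser.add_argument("--verbose", action="store_true", help="print progress messages")
    return parser.parse_args(argv)


# Passing an explicit list mirrors the example command without touching sys.argv.
args = parse_args([
    "--input=./coding_test/samples/oldpaper.jpg",
    "--output=./output.txt",
    "--verbose",
])
```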

Installation Tutorial

A full tutorial covering installation and testing is available on Colab: Open In Colab

Documentation

| File | Class | Method | Input | Output | Comments |
| --- | --- | --- | --- | --- | --- |
| run.py | - | run | command line | txt file | Command-line interface |
| io_txt.py | - | read_file | file path | images as np.ndarray | Input reader |
| io_txt.py | - | write_file | file path | txt file | Output writer |
| denoise_photos_nn.py | denoisingModel | constructor | - | - | Model trained on data from Kaggle |
| denoise_photos_nn.py | denoisingModel | forward | image as np.ndarray | image as np.ndarray | - |
| denoise_photos_nn.py | denoisingModel | load_weights | weights path | - | - |
| config.py | CFG | - | - | - | Parameters such as the weights path |
| config.py | LOG | - | - | - | Logger parameters and settings |
| preprocess.py | PreProcessor | constructor | CFG | - | - |
| preprocess.py | PreProcessor | fix_rotation | image as np.ndarray | image as np.ndarray | For images that are slightly rotated |
| preprocess.py | PreProcessor | denoiseAndBinarize | image as np.ndarray | image as np.ndarray | Calls the denoising model |
| preprocess.py | PreProcessor | removeLines | image as np.ndarray | image as np.ndarray | For images that include lines |
| preprocess.py | PreProcessor | preprocess | image as np.ndarray | image as np.ndarray | Calls all preprocessor functions |
| text_recognition.py | textRecognition | constructor | language (str) | - | Default is English |
| text_recognition.py | textRecognition | get_text | image as np.ndarray | text (str) | Uses psm 1 (automatic page segmentation with OSD) |
| postprocess.py | PostProcessor | constructor | CFG | - | - |
| postprocess.py | PostProcessor | removeEmptyLines | text (str) | text (str) | - |
| postprocess.py | PostProcessor | cleanText | text (str) | text (str) | Removes undesirable characters (those not included in CFG.chars) |
| postprocess.py | PostProcessor | spellingCheck | text (str) | text (str) | Not used by default; uncomment to enable |
| postprocess.py | PostProcessor | postprocess | text (str) | text (str) | Calls all postprocessor functions |
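The recognition step listed above (language defaulting to English, page segmentation mode 1) could be sketched as follows. This assumes the `pytesseract` wrapper, which may not be the binding this repo uses; `build_tesseract_config` and this `get_text` function are hypothetical names that only mirror the documented behaviour.

```python
def build_tesseract_config(psm: int = 1) -> str:
    """Build the Tesseract option string; --psm 1 selects automatic page segmentation with OSD."""
    return f"--psm {psm}"


def get_text(image, language: str = "eng", psm: int = 1) -> str:
    """Run OCR on an image (np.ndarray or PIL image) and return the recognized text."""
    # Assumption: pytesseract is installed. It is imported lazily so that the
    # config helper above can be used without it.
    import pytesseract

    return pytesseract.image_to_string(
        image, lang=language, config=build_tesseract_config(psm)
    )
```

Keeping the option string in its own helper makes it easy to try another segmentation mode (for example `--psm 6` for a single uniform block of text) without touching the OCR call.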

Visualization and Having Fun

Open In Colab
