Issue with Tesseract 5 Training: Multiple Lines or Drop Caps Handling #404

Open
4F2E4A2E opened this issue Nov 12, 2024 · 0 comments
4F2E4A2E commented Nov 12, 2024

How can Tesseract recognize drop caps?

I am trying to train Tesseract to recognize drop caps in paragraphs. However, Tesseract v5 training does not support multiline input, and a drop cap spans several lines of body text. How can I train it to handle this?

Drop caps examples:
https://support.microsoft.com/en-us/office/insert-a-drop-cap-817fd19f-40fe-4b73-95e8-f3c0f5e01278
image

Drop caps dataset examples:
drop_caps_data_set_example.zip
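
One possible workaround, offered only as a sketch and not as something Tesseract provides: since training is limited to single text lines, the drop cap could be separated from the body text during preprocessing and treated as its own line image. The example below assumes glyph bounding boxes are already available from some earlier layout step. `Box`, `split_drop_cap`, and the `ratio` threshold are hypothetical names and values chosen for illustration.

```python
from statistics import median
from typing import List, NamedTuple, Optional, Tuple


class Box(NamedTuple):
    """Axis-aligned bounding box of one glyph (hypothetical input format)."""
    x: int
    y: int
    w: int
    h: int


def split_drop_cap(
    boxes: List[Box], ratio: float = 1.8
) -> Tuple[Optional[Box], List[Box]]:
    """Separate a likely drop cap from the ordinary glyph boxes.

    A box counts as a drop cap candidate when it is at least `ratio`
    times taller than the median glyph height. The default ratio is an
    illustrative guess and would need tuning on real pages.
    """
    if not boxes:
        return None, []
    typical_height = median(b.h for b in boxes)
    tall = [b for b in boxes if b.h >= ratio * typical_height]
    if not tall:
        return None, list(boxes)
    # Prefer the top-most, then left-most tall box: a drop cap opens a paragraph.
    cap = min(tall, key=lambda b: (b.y, b.x))
    rest = [b for b in boxes if b is not cap]
    return cap, rest


if __name__ == "__main__":
    page = [
        Box(0, 0, 60, 90),    # oversized initial letter
        Box(70, 0, 20, 30),   # body glyphs on the first line
        Box(95, 0, 20, 30),
        Box(70, 35, 20, 30),  # body glyph on the second line
        Box(70, 70, 20, 28),  # body glyph on the third line
    ]
    cap, rest = split_drop_cap(page)
    print("drop cap:", cap)
    print("remaining glyphs:", len(rest))
```

The cropped drop cap and the remaining body lines could then each be saved as separate single-line images with their own ground-truth text. Whether that gives good recognition on real scans is untested here.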

tesseract --version 
tesseract 5.5.0-1-g43b8d
 leptonica-1.85.1
  libgif 5.1.9 : libjpeg 6b (libjpeg-turbo 2.0.6) : libpng 1.6.37 : libtiff 4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.4.0
 Found NEON
 Found OpenMP 201511
 Found libcurl/7.74.0 OpenSSL/1.1.1w zlib/1.2.11 brotli/1.0.9 libidn2/2.3.0 libpsl/0.21.0 (+libidn2/2.3.0) libssh2/1.9.0 nghttp2/1.43.0 librtmp/2.3