Feature request: please implement Tesseract OCR's LSTM mode to dramatically reduce OCR error rates! #465

Open
HelpfulCarrot opened this issue Oct 3, 2024 · 8 comments

Comments

@HelpfulCarrot

Is your feature request related to a problem? Please describe.
NAPS2 is great - it's very useful to me and many others, and a big part of the utility it offers is the integrated OCR function, since that makes the content of all my documents easily searchable, as well as more accessible to people with vision-related disabilities. However, the OCR error rate is significant, which dramatically reduces its utility.

Describe the solution you'd like
I'd like NAPS2 to implement Tesseract's LSTM mode, which works using neural networks, since it reduces the error rate from about 5-10 words with errors per page to nearly zero. Since NAPS2 already uses Tesseract as its OCR provider, it makes sense to enable the more modern, superior mode available in Tesseract.

I have tested Tesseract 4, comparing its LSTM mode to its legacy mode, and the LSTM mode (which works using machine learning) is far more accurate than the legacy mode, which NAPS2 uses.
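
For anyone wanting to reproduce a comparison like this, here is a minimal sketch rather than the exact procedure used for the test above. It assumes the `tesseract` command-line tool is on the PATH and that the installed traineddata includes the legacy model (`--oem 0` fails without it). The helper names and file names are hypothetical; `--oem 0` (legacy engine) and `--oem 1` (LSTM engine) are standard Tesseract options. The word count is a rough diff-based measure against a hand-checked transcription, not a formal word error rate.

```python
import difflib
import subprocess


def build_tesseract_cmd(image, out_base, oem, lang="eng"):
    """Build a Tesseract CLI call; --oem 0 = legacy engine, --oem 1 = LSTM engine."""
    return ["tesseract", image, out_base, "--oem", str(oem), "-l", lang]


def run_ocr(image, out_base, oem, lang="eng"):
    """Run Tesseract and return the recognized text (written to <out_base>.txt)."""
    subprocess.run(build_tesseract_cmd(image, out_base, oem, lang), check=True)
    with open(out_base + ".txt", encoding="utf-8") as f:
        return f.read()


def word_errors(reference, hypothesis):
    """Roughly count words that differ between a reference text and OCR output."""
    ref, hyp = reference.split(), hypothesis.split()
    matcher = difflib.SequenceMatcher(None, ref, hyp, autojunk=False)
    # Each non-matching region contributes its longer side (substituted,
    # missing, or extra words).
    return sum(
        max(i2 - i1, j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )
```

With a hand-typed transcription of a page in `truth`, comparing `word_errors(truth, run_ocr("page.png", "lstm", 1))` against `word_errors(truth, run_ocr("page.png", "legacy", 0))` would give a per-page error count for each engine.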

Ideally, the dropdown menu which already allows users to choose between 'fast' and 'best' modes would be changed to provide three options: 'fastest (lowest accuracy)', 'fast (high accuracy)', and 'slow (very high accuracy)'. The specific wording is just a suggestion; what's most important is that users are granted access to the best available FOSS OCR solution.

Describe alternatives you've considered
The neural-network-based LSTM mode of Tesseract is the SOTA FOSS OCR model, and NAPS2 has already implemented Tesseract, so it doesn't make sense to use any alternative solution.

Additional context
Here is a screenshot showing the difference in accuracy between Tesseract's LSTM and legacy modes:
[Image: Tesseract LSTM vs Tesseract Legacy]

Lastly, I just want to say thank you to everyone who works so hard to make NAPS2 as great as it is! 😊

@cyanfish
Owner

cyanfish commented Oct 3, 2024

NAPS2 already uses LSTM.

@HelpfulCarrot
Author

That's great, but what's strange is that OCR doesn't work as well for me in NAPS2 as Tesseract LSTM does elsewhere, which is why I thought LSTM wasn't being used - is there any way to verify that it's enabled correctly and in use in my installation of NAPS2?

@cyanfish
Owner

cyanfish commented Oct 3, 2024

NAPS2 doesn't even include the legacy engine, so it's impossible for it to be used. When there are differences with command-line Tesseract, it's usually related to word alignment in the PDF (as opposed to word recognition, which is the same).

Can you attach a sample PDF with issues that aren't present in standalone Tesseract?

@HelpfulCarrot
Author

Thanks for explaining! Unfortunately the PDFs I work with are sensitive, so I can't share them.

Perhaps a bigger issue than word recognition is document layout analysis/segmentation. I'm not as familiar with the state of the art in FOSS DLA as I am with OCR, but do you think it might be helpful to implement an advanced document layout analysis model like LayoutLM v3? That's just one example of a high-performing DLA model with a relatively permissive license.

@cyanfish
Owner

cyanfish commented Oct 3, 2024

There's an existing issue (#258) for segmentation improvements. I'm not sure how that would work in comparison to base Tesseract functionality, but there are practical issues - for LayoutLM the license is incompatible with GPL, plus it would require bundling Python etc.

@HelpfulCarrot
Author

HelpfulCarrot commented Oct 3, 2024

I suspected there was probably a license conflict between any CC-NC license and GPL. Hypothetically, if I could find a document segmentation model which significantly outperforms the current segmentation model used by NAPS2 and is compatible with GPL, would you be interested in that?

A lot of ‘AI’ talk right now is impractical hype, but in recent years there have been some huge leaps forward in segmentation accuracy using machine learning, which is why I think it is worth pursuing this, if there’s any interest from your end.

@cyanfish
Owner

cyanfish commented Oct 3, 2024

In theory yes, but in practice I don't want to blow up the installer size for this. Most ML stuff isn't super lightweight.

@HelpfulCarrot
Author

I understand. I’ll look into it and get back to you with my findings. Thanks for taking the time to talk this over with me!
