Feature request: please implement Tesseract OCR's LSTM mode to dramatically reduce OCR error rates! #465
Comments
NAPS2 already uses LSTM.
That's great, but what's strange is that OCR doesn't work as well for me in NAPS2 as Tesseract LSTM does elsewhere, which is why I thought LSTM wasn't being used - is there any way to verify that it's enabled correctly and in use in my installation of NAPS2?
NAPS2 doesn't even include the legacy engine, so it's impossible for it to be used. When there are differences with command-line Tesseract, it's usually related to word alignment in the PDF (as opposed to word recognition, which is the same). Can you attach a sample PDF with issues that aren't present in standalone Tesseract?
Thanks for explaining! Unfortunately the PDFs I work with are sensitive, so I can't share them. Perhaps the bigger issue than word recognition is document layout analysis/segmentation. I'm not as familiar with the state-of-the-art in FOSS DLA as I am with OCR, but do you think it might be helpful to implement an advanced document layout analysis model like LayoutLM v3? Just to name one example of a high-performing DLA model with a relatively permissive license.
There's an existing issue (#258) for segmentation improvements. I'm not sure how that would work in comparison to base Tesseract functionality, but there are practical issues - for LayoutLM the license is incompatible with GPL, plus it would require bundling Python etc.
I suspected there was probably a license conflict between any CC-NC license and GPL. Hypothetically, if I could find a document segmentation model which significantly outperforms the current segmentation model used by NAPS2 and is compatible with GPL, would you be interested in that? A lot of ‘AI’ talk right now is impractical hype, but in recent years there have been some huge leaps forward in segmentation accuracy using machine learning, which is why I think it is worth pursuing this, if there’s any interest from your end.
In theory yes, but in practice I don't want to blow up the installer size for this. Most ML stuff isn't super lightweight.
I understand. I’ll look into it and get back to you with my findings. Thanks for taking the time to talk this over with me!
Is your feature request related to a problem? Please describe.
NAPS2 is great - it's very useful to me and many others, and a big part of the utility it offers is the integrated OCR function, since that makes the content of all my documents easily searchable, as well as more accessible to people with vision-related disabilities. However, the OCR error rate is significant, dramatically reducing its utility.
Describe the solution you'd like
I'd like NAPS2 to implement Tesseract's LSTM mode, which works using neural networks, since it reduces the error rate from about 5-10 words with errors per page to nearly zero. Since NAPS2 already uses Tesseract as its OCR provider, it makes sense to enable the more modern, superior mode available in Tesseract.
I have tested Tesseract 4, comparing its LSTM mode to its legacy mode, and the LSTM mode (which works using machine learning) is far more accurate than the legacy mode, which NAPS2 uses.
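For anyone who wants to reproduce this kind of comparison, here is a minimal sketch. It assumes the standalone `tesseract` command-line tool is on `PATH` and that the installed traineddata includes the legacy models (the `tessdata_fast` and `tessdata_best` sets are LSTM-only, so `--oem 0` fails with them). The helper names and the `ocr_legacy`/`ocr_lstm` output names are illustrative only, and this is not how NAPS2 invokes Tesseract.

```python
import shutil
import subprocess
from pathlib import Path

# Tesseract's documented --oem values: 0 = legacy engine only, 1 = LSTM only.
OEM_LEGACY = 0
OEM_LSTM = 1


def build_tesseract_cmd(image, out_base, oem, lang="eng"):
    """Return the argv list for one OCR run that writes <out_base>.txt."""
    return ["tesseract", str(image), str(out_base), "--oem", str(oem), "-l", lang]


def compare_engines(image, out_dir="."):
    """Run both engines on the same image and return {engine name: text}."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    results = {}
    for name, oem in (("legacy", OEM_LEGACY), ("lstm", OEM_LSTM)):
        out_base = Path(out_dir) / f"ocr_{name}"  # hypothetical output name
        subprocess.run(build_tesseract_cmd(image, out_base, oem), check=True)
        results[name] = out_base.with_suffix(".txt").read_text(encoding="utf-8")
    return results
```

Calling `compare_engines("page.png")` (with `page.png` standing in for any scanned page) returns the text from each engine so the two outputs can be diffed side by side.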
Ideally, the dropdown menu which already allows users to choose between 'fast' and 'best' modes would be changed to provide three options: 'fastest (lowest accuracy)', 'fast (high accuracy)', and 'slow (very high accuracy)'. The specific wording is just a suggestion; what's most important is that users are granted access to the best available FOSS OCR solution.
Describe alternatives you've considered
The neural-network-based LSTM mode of Tesseract is the SOTA FOSS OCR model, and NAPS2 has already implemented Tesseract, so it doesn't make sense to use any alternative solution.
Additional context
Here is a screenshot showing the difference in accuracy between Tesseract's LSTM and legacy modes:
Lastly, I just want to say thank you to everyone who works so hard to make NAPS2 as great as it is! 😊