Just want to share how I managed to run Tesseract training with tesstrain on version 5. It might help others, and I hope it can be used to improve the documentation.
This was my first attempt at Tesseract training; I had never done it before.
The tesstrain README recommended (https://github.com/tesseract-ocr/tesstrain#provide-ground-truth) trying the training with the ocrd-testset.zip files. I unzipped the contents into a folder named 'data/foo-ground-truth/'. I created the 'data' folder myself to hold the files when running make tesseract-langdata, as stated in the document.
So I ran make training, and the result was a lot of error messages:
Can't encode transcription: '<some random german phrase>' in language ''
Encoding of string failed! Failure bytes: <some hexa codes>
Side note: I needed to run it twice; it looks like the first run crashes when building the all-gt file.
The errors were clearly related to the unicharset, which did not describe the special characters that exist in the sample ground truth.
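A rough way to check for this kind of mismatch is to list the characters that occur in the ground-truth text but have no entry in the unicharset. The sketch below is not part of tesstrain; `missing_chars` is a hypothetical helper name, and it assumes that the unicharset file has a count on its first line followed by one entry per line with the character in the first space-separated column, that the ground truth is stored as `*.gt.txt` files, and that the shell runs in a UTF-8 locale.

```shell
# Hypothetical helper (not part of tesstrain): print every character used in
# the ground truth that has no entry in the given unicharset file.
# Assumes a UTF-8 locale so that `grep -o .` yields whole characters.
missing_chars() {
    gt_dir=$1
    unicharset=$2
    known=$(mktemp)
    # Skip the count on line 1; column 1 of each remaining line is the character.
    tail -n +2 "$unicharset" | cut -d' ' -f1 | sort -u > "$known"
    # Split the ground truth into one character per line, drop plain spaces,
    # and keep only the characters that are absent from the known set.
    cat "$gt_dir"/*.gt.txt | grep -o . | grep -v '^ $' | sort -u | comm -23 - "$known"
    rm -f "$known"
}
```

For example, `missing_chars data/foo-ground-truth data/foo/unicharset` would print one unsupported character per line, and print nothing if every character is covered.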
After studying this for a while, I decided on my own to replace the unicharset file in data/foo/ with the contents of data/langdata/Latin.unicharset (cp data/langdata/Latin.unicharset data/foo/unicharset).
That completely solved the error messages, and training finally started.
After some minutes, the BCER train, which had started at 89%, went up to 99.9%. Something was clearly wrong again.
Digging around the web, I had a hunch that the issue was that I hadn't specified the starter traineddata, so the training was running from "scratch", though I'm not sure.
I then specified START_MODEL and the result was much better. The BCER started below 20% and continued to improve.
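To see at a glance whether the error rate is falling or climbing, the BCER values can be pulled out of a saved training log. This is only a sketch: `bcer_values` and `training.log` are hypothetical names, and the `BCER train=<number>%` pattern is an assumption about how the log lines look, so adjust it to whatever your output actually contains.

```shell
# Hypothetical helper: print the "BCER train" percentages found in a saved
# training log, one number per line, in the order they were logged.
# The "BCER train=<number>%" pattern is an assumption about the log format.
bcer_values() {
    grep -o 'BCER train=[0-9.]*%' "$1" | sed 's/BCER train=//; s/%$//'
}
```

The log could be captured by appending `2>&1 | tee training.log` to the training command; `bcer_values training.log | tail -n 5` then shows the most recent values.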
When a starter model is specified, the training process extracts the unicharset from the model and puts it in the data/eng folder. I was expecting eng.traineddata to use Latin.unicharset, but that seems not to be the case (perhaps the ger.traineddata?), so copying the unicharset is still necessary. For my application I will be using eng.traineddata, so I decided to continue with the English traineddata instead of the German traineddata (which I haven't tried).
To get a cleaner run, I decided to run the training in steps. These were:
# let's start by cleaning the environment
make -r TESSDATA=/usr/local/share/tessdata/ MODEL_NAME=foo START_MODEL=eng clean
make -r TESSDATA=/usr/local/share/tessdata/ MODEL_NAME=foo START_MODEL=eng unicharset
# an error is expected on the first run (while creating the foo/all-gt file), so run it again
make -r TESSDATA=/usr/local/share/tessdata/ MODEL_NAME=foo START_MODEL=eng unicharset
cp data/langdata/Latin.unicharset data/eng/foo.lstm-unicharset
make -r TESSDATA=/usr/local/share/tessdata/ MODEL_NAME=foo START_MODEL=eng training
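The same steps can be wrapped in a small script so the variables are written only once. This is a sketch rather than anything shipped with tesstrain: `tesstrain_make`, `train_in_steps` and `DRY_RUN` are hypothetical names, the default values are the ones used above, and the retry simply mirrors the failed first `unicharset` run described earlier.

```shell
# Hypothetical wrapper around the steps above (not part of tesstrain).
# Defaults are the values used in this post; set DRY_RUN=1 to print the
# commands instead of running them.
TESSDATA=${TESSDATA:-/usr/local/share/tessdata/}
MODEL_NAME=${MODEL_NAME:-foo}
START_MODEL=${START_MODEL:-eng}

tesstrain_make() {
    # Run (or print) a single make target with the shared variables.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "make -r TESSDATA=$TESSDATA MODEL_NAME=$MODEL_NAME START_MODEL=$START_MODEL $1"
    else
        make -r TESSDATA="$TESSDATA" MODEL_NAME="$MODEL_NAME" START_MODEL="$START_MODEL" "$1"
    fi
}

train_in_steps() {
    tesstrain_make clean
    # The first unicharset run failed for me while creating all-gt, so retry once.
    tesstrain_make unicharset || tesstrain_make unicharset
    # Replace the unicharset extracted from the start model with Latin.unicharset.
    src=data/langdata/Latin.unicharset
    dst=data/$START_MODEL/$MODEL_NAME.lstm-unicharset
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "cp $src $dst"
    else
        cp "$src" "$dst"
    fi
    tesstrain_make training
}
```

Running `DRY_RUN=1 train_in_steps` from the tesstrain checkout prints the commands without touching anything, which is a cheap way to check the variables before starting a long training run.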
I hope this helps the Tesseract community; any contribution is welcome.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
stalebot added the stale label (Issues which require input by the reporter which is not provided) on May 22, 2023
Hi there.
I cloned tesseract from git at tag 5.3 and was able to build it exactly as documented here: https://github.com/tesseract-ocr/tessdoc/blob/main/Compiling-–-GitInstallation.md
I performed the installation on Ubuntu running on WSL.
I cloned the latest tesstrain from git and followed this page:
https://github.com/tesseract-ocr/tesstrain