How to do incremental-training on tesseract-ocr? #391

Open
zaryabRiasat opened this issue May 31, 2024 · 5 comments

@zaryabRiasat

I'm working with tesseract-4.1.1 and trying to fine-tune a model. I followed these steps:

  1. Downloaded eng.traineddata from tessdata_best and copied it into /usr/share/tesseract-ocr/4.00/tessdata.

  2. Created image crops with craft-text-detector in Python and wrote a ground-truth file (.gt.txt) for each crop.

  3. Cloned the repository with git clone https://github.com/tesseract-ocr/ocrd-train.git and changed into it with cd ocrd-train.

  4. Inside the ocrd-train/data folder, created a my-model-ground-truth folder and copied the .png and .gt.txt files into it.

  5. Ran make tesseract-langdata in the terminal.

  6. Finally, ran make training MODEL_NAME=my-model MAX_ITERATIONS=20000 PSM=7 FINETUNE_TYPE=Impact DEBUG_INTERVAL=-1 START_MODEL=eng TESSDATA=/usr/share/tesseract-ocr/4.00/tessdata/

The procedure above took some time and produced a my-model.traineddata file in ocrd-train/data/. I copied that file into /usr/share/tesseract-ocr/4.00/tessdata, and it gives better results than eng.traineddata.

I used 20 images for that training run. Now I want to do incremental training: train the existing my-model.traineddata on 30 more images. I'm confused because, after the previous training finished, ocrd-train/data/ contains the following:
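
The ground-truth layout from step 4 (one .gt.txt per .png, sharing a base name) can be checked before a training run. The following is a minimal sketch, not part of ocrd-train; the function name `find_unpaired` is hypothetical, and it assumes the image crops use the .png extension as in the steps above.

```python
from pathlib import Path


def find_unpaired(ground_truth_dir):
    """Return (images_without_text, texts_without_image) as sorted base names.

    Assumes the pairing described in the post: "x.png" goes with "x.gt.txt".
    """
    root = Path(ground_truth_dir)
    # "line_01.png" -> "line_01"
    images = {p.name[: -len(".png")] for p in root.glob("*.png")}
    # "line_01.gt.txt" -> "line_01"
    texts = {p.name[: -len(".gt.txt")] for p in root.glob("*.gt.txt")}
    return sorted(images - texts), sorted(texts - images)
```

Calling `find_unpaired("data/my-model-ground-truth")` from inside the ocrd-train checkout returns two lists; both are empty when every crop has a transcription and every transcription has a crop.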

  1. my-model (folder)

  2. my-model-ground-truth (folder)

  3. eng (folder)

  4. langdata (folder)

  5. my-model.traineddata (file)

What should I do now for incremental training?

Do I only need to remove the files in my-model-ground-truth, copy in the new .png and .gt.txt files for the 30 images, and use my-model as START_MODEL?

Or do I need to remove the other folders as well?
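
To make the first option concrete, here is a minimal sketch that only composes the command line it would imply, reusing the make variables from step 6. Nothing in this thread confirms that ocrd-train accepts a previously fine-tuned model as START_MODEL, so treat it as an untested illustration. The helper `build_training_command` and the new model name `my-model-v2` are hypothetical; the separate name is an assumption made so the existing data/my-model folder is not reused.

```python
import shlex


def build_training_command(model_name, start_model, tessdata,
                           max_iterations=20000,
                           extra=("PSM=7", "FINETUNE_TYPE=Impact", "DEBUG_INTERVAL=-1")):
    """Compose the `make training` call from step 6 with a different START_MODEL.

    The variable names are copied from the post; whether this run succeeds
    with a fine-tuned start model is an assumption, not a confirmed result.
    """
    return [
        "make", "training",
        f"MODEL_NAME={model_name}",
        f"MAX_ITERATIONS={max_iterations}",
        *extra,
        f"START_MODEL={start_model}",
        f"TESSDATA={tessdata}",
    ]


# Hypothetical second run: start from my-model.traineddata, which the post
# says was already copied into the tessdata directory.
cmd = build_training_command(
    model_name="my-model-v2",  # hypothetical name, not from the thread
    start_model="my-model",
    tessdata="/usr/share/tesseract-ocr/4.00/tessdata/",
)
print(shlex.join(cmd))
```

The sketch prints the command instead of running it, so it can be reviewed against the current README before anything is executed.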

@stweil
Collaborator

stweil commented May 31, 2024

Are you using very old instructions (old Tesseract release, old repository URL, ...)?

@zaryabRiasat
Author

zaryabRiasat commented May 31, 2024

@stweil Thank you for your response.

Yes, I'm using tesseract-4.1.1 and the old repository.

The first training run works fine with START_MODEL=eng, but I am unable to do incremental training as described above.

@zaryabRiasat
Author

zaryabRiasat commented May 31, 2024

@stweil I just want to know how I can do incremental training on my existing trained model.

What steps should I follow?

@zdenop
Contributor

zdenop commented May 31, 2024

What about reading the Tesseract documentation and the README of this repository?

@stweil
Collaborator

stweil commented May 31, 2024

@zaryabRiasat, the first step is using a recent software release instead of an old one and also reading the current documentation.
