[Python] [Pytesseract] [Urdu] [Segmentation fault] [Deserialize header failed] #354

Open
IrtazaIjaz opened this issue Oct 10, 2023 · 5 comments


@IrtazaIjaz

IrtazaIjaz commented Oct 10, 2023

Hi All,

I'm having trouble running fine-tuning with this repository. Below is the code I run in my Jupyter notebook:

**Step-1:**
!git clone https://github.com/tesseract-ocr/tesstrain.git

**Step-2:**
%cd tesstrain
!make tesseract-langdata

**Step-3:**
import zipfile
with zipfile.ZipFile('/content/tesstrain/irt-ground-truth.zip', 'r') as zip_ref:
    zip_ref.extractall('/content/tesstrain/data')

**Step-4:**
# Create the directory 'usr/share/tessdata'
!mkdir -p usr/share/tessdata

# Download the trained data file and save it to 'usr/share/tessdata'
!wget -P usr/share/tessdata https://github.com/tesseract-ocr/tessdata_best/raw/main/urd.traineddata

**Step-5:**
# Quote the version specifiers so the shell does not treat ">" as an output redirect
!pip install "Pillow>=6.2.1"
!pip install "python-bidi>=0.4"
!pip install matplotlib
!pip install pandas
!pip install pytesseract
!apt-get install tesseract-ocr-urd
!apt-get install tesseract-ocr
!make leptonica tesseract

**Step-6:**
I have replaced the file /content/tesstrain/data/irt/list.train with my own file, which contains the text below:

/content/tesstrain/data/irt-ground-truth/page_10_line_1.png نقش فریادی ہے کس کی شوخیٔ تحریر کا
/content/tesstrain/data/irt-ground-truth/page_10_line_2.png کاغذی ہے پیرہن ہر پیکر تصویر کا
/content/tesstrain/data/irt-ground-truth/page_10_line_3.png کاو کاو سخت جانی ہائے تنہائی نہ پوچھ
/content/tesstrain/data/irt-ground-truth/page_10_line_4.png صبح کرنا شام کا لانا ہے جوئے شیر کا
/content/tesstrain/data/irt-ground-truth/page_10_line_5.png جذبۂ بے اختیار شوق دیکھا چاہیے
/content/tesstrain/data/irt-ground-truth/page_10_line_6.png سینۂ شمشیر سے باہر ہے دم شمشیر کا
/content/tesstrain/data/irt-ground-truth/page_10_line_7.png آگہی دام شنیدن جس قدر چاہے بچھائے
/content/tesstrain/data/irt-ground-truth/page_10_line_8.png مدعا عنقا ہے اپنے عالم تقریر کا
/content/tesstrain/data/irt-ground-truth/page_10_line_9.png نبسکہ ہوں غالبؔ اسیری میں بھی آتش زیر پا
/content/tesstrain/data/irt-ground-truth/page_10_line_10.png موئے آتش دیدہ ہے حلقہ مری زنجیر کا

**Step-7:**
# Giving Read/Write rights on tesstrain folder

import os
import subprocess
folder_path = '/content/tesstrain'

# Define the chmod command as a list of arguments
chmod_command = ['chmod', '-R', '777', folder_path]

# Execute the chmod command
try:
    subprocess.run(chmod_command, check=True)
    print(f"Permissions changed for {folder_path}")
except subprocess.CalledProcessError as e:
    print(f"Error: {e}")

**Step-8:**
# Run the command below from /content/tesstrain
!make training MODEL_NAME=irt START_MODEL=urd FINETUNE_TYPE=Impact

**Step-8 output:**
You are using make version: 4.3
lstmtraining
--debug_interval 0
--traineddata data/irt/irt.traineddata
--old_traineddata /content/tesstrain/usr/share/tessdata/urd.traineddata
--continue_from data/urd/irt.lstm
--learning_rate 0.0001
--model_output data/irt/checkpoints/irt
--train_listfile data/irt/list.train
--eval_listfile data/irt/list.eval
--max_iterations 10000
--target_error_rate 0.01
Loaded file data/urd/irt.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 129 to 129!
Num (Extended) outputs,weights in Series:
1,48,0,1:1, 0
Num (Extended) outputs,weights in Series:
C3,3:9, 0
Ft16:16, 160
Total weights = 160
[C3,3Ft16]:16, 160
Mp3,3:16, 0
Lfys64:64, 20736
Lfx96:96, 61824
Lrx96:96, 74112
Lfx384:384, 738816
Fc129:129, 49665
Total weights = 945313
Previous null char=2 mapped to 128
Continuing from data/urd/irt.lstm
Deserialize header failed: /content/tesstrain/data/irt-ground-truth/page_10_line_1.png نقش فریادی ہے کس کی شوخیٔ تحریر کا
Deserialize header failed: /content/tesstrain/data/irt-ground-truth/page_10_line_2.png کاغذی ہے پیرہن ہر پیکر تصویر کا
Deserialize header failed: /content/tesstrain/data/irt-ground-truth/page_10_line_5.png جذبۂ بے اختیار شوق دیکھا چاہیے
Load of page 0 failed!
Load of images failed!!
make: *** [Makefile:327: data/irt/checkpoints/irt_checkpoint] Segmentation fault (core dumped)

Please help me figure out how to proceed. I'm stuck.

Thank you

@IrtazaIjaz IrtazaIjaz changed the title [Python] [Pytesseract] [Segmentation fault] [Deserialize header failed] [Python] [Pytesseract] [Urdu] [Segmentation fault] [Deserialize header failed] Oct 10, 2023
@stefan6419846
Contributor

How is this related to Python and pytesseract? By the way: GitHub allows formatting code sections as code to improve readability (just use the <> button after marking the corresponding lines).

@zdenop
Contributor

zdenop commented Oct 10, 2023

Also, it seems you are trying to run training on some hosted platform (Kaggle?). Run it on your local computer instead (Linux/WSL or Mac).
Next, do not report problems with your own data. First, make sure that training with the example data works (e.g. that you have installed and set up the training environment correctly).
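
One way to sanity-check the data before training: each `Deserialize header failed` line in the log above echoes a full line of `list.train`, which suggests `lstmtraining` tried to open every line as a training file. To my knowledge `--train_listfile` expects one path to an `.lstmf` file per line, and tesstrain normally generates that list itself. The sketch below flags entries that do not fit that assumption; `check_listfile` is a hypothetical helper name, not part of tesstrain.

```python
from pathlib import Path


def check_listfile(listfile):
    """Return entries of a training list file that do not end in '.lstmf'.

    Assumption: lstmtraining expects one .lstmf path per line.
    """
    suspicious = []
    for line in Path(listfile).read_text(encoding="utf-8").splitlines():
        entry = line.strip()
        if not entry:
            continue  # ignore blank lines
        if not entry.endswith(".lstmf"):
            suspicious.append(entry)
    return suspicious
```

Calling `check_listfile("data/irt/list.train")` on the hand-written file from Step-6 would return every line, since each one is an image path followed by a transcription rather than an `.lstmf` path.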

@IrtazaIjaz
Author

Hi @zdenop,

I'm running it in a Jupyter notebook. I started with a single page containing only 10 lines.

@IrtazaIjaz
Author

Hi @stefan6419846,

I'm working in a Jupyter notebook for Python and writing the code in it. I have also made the code more readable, as you suggested.

Thanks

@zdenop
Contributor

zdenop commented Oct 11, 2023

Follow the README instructions; that is the only supported training process. Jupyter notebooks are not covered there.
Otherwise you will not get support and the issue will be closed.
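
For reference, the README layout pairs each line image in `data/MODEL_NAME-ground-truth` with a single-line transcription file that has the same name but ends in `.gt.txt`. A minimal sketch of turning lines like those in Step-6 ("image path, space, transcription") into that layout is shown below. The helper name `write_gt_files` and the parsing rule (split at the first `.png` followed by a space) are assumptions for illustration, not part of tesstrain.

```python
from pathlib import Path


def write_gt_files(listing_lines, image_suffix=".png"):
    """Write one <name>.gt.txt next to each image named in listing_lines.

    Each line is assumed to be '<image path><space><transcription>'.
    Lines without the image suffix followed by a space are skipped.
    """
    written = []
    for line in listing_lines:
        line = line.strip()
        if not line:
            continue
        # Split at the first occurrence of the suffix followed by a space
        path_part, sep, text = line.partition(image_suffix + " ")
        if not sep:
            continue
        image_path = Path(path_part + image_suffix)
        stem = image_path.name[: -len(image_suffix)]
        gt_path = image_path.with_name(stem + ".gt.txt")
        gt_path.write_text(text.strip() + "\n", encoding="utf-8")
        written.append(gt_path)
    return written
```

With the `.gt.txt` files in place, `list.train` and `list.eval` would be left for `make training` to generate rather than edited by hand.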
