
This is work in progress, so please be patient.

Introduction

The British Library provides free transcriptions of Arabic handwritten text and has already run training with Transkribus on this material.

Those transcriptions can also be used to train Tesseract.

Data preparation

Create a new directory and run the following commands to prepare the data for the training process.

mkdir -p ~/ArabicHandwriting
cd ~/ArabicHandwriting

# Get the data.
curl -L https://bl.oar.bl.uk/fail_uploads/download_file?fileset_id=e03280ef-5a75-4193-a8b5-1265f295e5cf >RASM2019_part_1.zip
curl -L https://bl.oar.bl.uk/fail_uploads/download_file?fileset_id=907b2e2a-3f23-49b8-8eef-f073c8bb97ab >RASM2019_part_2.zip

# Extract the data. Use 7za instead of unzip because there is an error in RASM2019_part_2.zip.
7za x RASM2019_part_1.zip
7za x RASM2019_part_2.zip
mkdir -p IMG PAGE
mv *.tif IMG
mv *.xml PAGE

# Remove spaces from filenames (workaround because filenames with spaces are currently not fully supported by OCR-D).
# ${i// /} removes all spaces, so a single pass is sufficient.
for i in IMG/* PAGE/*; do mv -v "$i" "${i// /}"; done
# Remove the spaces from the image references inside the PAGE files as well.
# The "1 while" loop repeats the substitution until no space is left in the attribute value.
perl -pi -e '1 while s/(imageFilename="[^"]*) ([^"]*")/$1$2/' PAGE/*
perl -pi -e '1 while s/(filename="[^"]*) ([^"]*")/$1$2/' PAGE/*
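
# Optional check (sketch): both of the following commands should print nothing
# if all spaces are gone from the filenames and from the image references.
ls IMG PAGE | grep ' '
grep 'imageFilename="[^"]* ' PAGE/*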

# Fix path for images for further processing.
perl -pi -e 's/imageFilename="/imageFilename="IMG\//' PAGE/*

# Remove references to alternative images, which are not part of the download, from the PAGE files.
perl -pi -e 's/.*AlternativeImage.*//' PAGE/*
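
# Optional check (sketch, assumes libxml2's xmllint is installed): the Perl
# one-liners edit the XML textually, so verify that the PAGE files are still
# well-formed.
for f in PAGE/*; do xmllint --noout "$f" || echo "broken: $f"; done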

# Create OCR-D workspace and add images and PAGE files.
ocrd workspace init
for i in IMG/*; do base=$(basename "$i" .tif); ocrd workspace add "$i" -G IMG -i "${base}_img" -g "$base" -m image/tiff; done
for i in PAGE/*; do base=$(basename "$i" .xml); ocrd workspace add "$i" -G PAGE -i "${base}_page" -g "$base" -m application/vnd.prima.page+xml; done
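
# Optional check (sketch; the exact invocation may differ between OCR-D
# versions): validate the METS file and its file references.
ocrd workspace validate mets.xml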

# Binarize and denoise images.
ocrd-olena-binarize -I PAGE -O WOLF,WOLF-IMG -m mets.xml -p <(echo '{"impl":"wolf"}')
ocrd-cis-ocropy-denoise -I WOLF -O DENOISE,DENOISE-IMG -m mets.xml -p '{"level-of-operation": "line"}'
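
# Optional check (sketch): list all file groups to confirm that the
# processors added their output (IMG, PAGE, WOLF, DENOISE, ...).
ocrd workspace list-group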

# Extract the line images.
ocrd-segment-extract-lines -I DENOISE -O LINES -m mets.xml

# Remove empty texts (files which contain only a line feed); they cannot be used for training.
rm -v $(find LINES -size 1c)

# Remove lines with missing transcriptions.
rm -v $(fgrep -l ؟ LINES/*txt)
rm -v $(fgrep -l '[' LINES/*txt)

# Remove images which were written from top to bottom or from bottom to top.
# The heuristic here assumes that such images have a 3 digit width and a 4 digit height.
rm -v $(file LINES/*.png | grep ", ... x ....," | sed 's/:.*//')
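
# More robust variant of the heuristic above (sketch, assumes ImageMagick is
# installed): delete every line image which is taller than wide. Note that
# this may also catch very short horizontal lines.
identify -format "%w %h %i\n" LINES/*.png | awk '$1 < $2 {print $3}' | xargs -r rm -v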

Training

Here, training is started from the existing Tesseract model script/Arabic.traineddata using the tesstrain Makefile.

# Create box files needed for Tesseract training.
for t in ~/ArabicHandwriting/GT/LINES/*.txt; do test -f ${t/gt.txt/box} || (echo $t && ./generate_wordstr_box.py -i ${t/gt.txt/bin.png} -t $t -r >${t/gt.txt/box}); done 
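
# Optional check (sketch): every ground truth line should now have a box file,
# so both counts should be identical.
ls ~/ArabicHandwriting/GT/LINES/*.gt.txt | wc -l
ls ~/ArabicHandwriting/GT/LINES/*.box | wc -l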

nohup make LANG_TYPE=RTL MODEL_NAME=ArabicHandwritingOCRD \
    GROUND_TRUTH_DIR=/home/stweil/src/ArabicHandwriting/GT/LINES \
    PSM=13 START_MODEL=Arabic \
    TESSDATA=/home/stweil/src/github/OCR-D/venv-20200408/share/tessdata \
    EPOCHS=20 lists >>data/ArabicHandwritingOCRD.log

nohup make LANG_TYPE=RTL MODEL_NAME=ArabicHandwritingOCRD \
    GROUND_TRUTH_DIR=/home/stweil/src/ArabicHandwriting/GT/LINES \
    PSM=13 START_MODEL=Arabic \
    TESSDATA=/home/stweil/src/github/OCR-D/venv-20200408/share/tessdata \
    EPOCHS=20 training >>data/ArabicHandwritingOCRD.log
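
# The training runs for many hours, so it is useful to follow its progress
# (sketch): show new checkpoint lines with the current error rates.
tail -f data/ArabicHandwritingOCRD.log | grep 'At iteration'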

The ground truth lines are split into 2351 lines for training and 262 lines for validation.

After one epoch (2351 iterations), the CER is at about 46 %. With sufficient training (200 epochs, about 32 hours), the CER falls below 5 %.
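
The validation CER can also be measured directly with lstmeval. The paths below follow the usual tesstrain layout and are assumptions, not verified output:

# Evaluate the latest checkpoint against the validation list.
lstmeval --model data/ArabicHandwritingOCRD/checkpoints/ArabicHandwritingOCRD_checkpoint \
         --traineddata data/ArabicHandwritingOCRD/ArabicHandwritingOCRD.traineddata \
         --eval_listfile data/ArabicHandwritingOCRD/list.eval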

Results

Both "best" and "fast" variants of the Tesseract models which were trained using the steps above are available from https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/ArabicHandwritingOCRD/.
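
A downloaded model can be tried with the tesseract command line program; line.png is a placeholder for a line image, and --tessdata-dir must point to the directory which contains the downloaded traineddata file:

# Recognize a single line image with the new model (PSM 13 = raw line).
tesseract line.png stdout --tessdata-dir . --psm 13 -l ArabicHandwritingOCRD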

Open Questions

How good are the new Tesseract models?

The CER which the new models achieve on other Arabic handwriting still has to be measured.

Unusable ground truth

The ground truth texts seem to mark passages which the human transcriber could not read, using patterns like [؟], ؟؟؟ and possibly other marks in square brackets. Example: [؟؟؟]صْنَاف[؟]. About 35 ground truth lines with such marks are usable neither for training nor for validation. Therefore all current training has to be repeated.
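
Before the lines are removed (see the data preparation above), the affected files can be listed and counted with a simple pattern search (a sketch; the pattern may miss further markers):

# Count all ground truth lines with uncertainty marks.
grep -l -e '\[' -e '؟' LINES/*.txt | wc -l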

Some text (especially in the left and right side margins) is not written horizontally but vertically, either from top to bottom or from bottom to top. The PAGE data does not indicate this (it should contain the text orientation), so the line image is extracted as it is instead of being rotated by +90° or -90°. Such images are unusable for training. A first estimate is that this affects at least 75 line images.
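
Once the reading direction of such a line has been determined (manually, since the PAGE data lacks the orientation), the image could be fixed by rotation, for example with ImageMagick; the filenames are placeholders:

# Rotate a vertical line image by 90 degrees clockwise (use -90 for the other direction).
convert vertical_line.png -rotate 90 horizontal_line.png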

Encoding and other training errors

The Tesseract training shows many encoding and other problems with a rather large skip ratio of more than 8 %. Here is a typical example:

At iteration 139783/462420/506586, Mean rms=0.254%, delta=0.043%, char train=4.88%, word train=21.6%, skip ratio=8.7%,  wrote checkpoint.

Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Encoding of string failed! Failure bytes: ef bf bd
Can't encode transcription: 'امهرد نوسمخو ةيام نكي ىش ىلع ةموسقم امهرد رشع ةسمخ ىف مث نيمهردو ىش�' in language ''
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Image too small to scale!! (2x48 vs min width of 3)
Line cannot be recognized!!
Image not trainable
Encoding of string failed! Failure bytes: ef bf bd 20 d8 a8 d9 88 d8 aa d9 83 d9 85 d9 84 d8 a7 20 d8 a7 d8 b0 d9 87
Can't encode transcription: 'اٰمو اهعيابط ناٰيب يف ثلاثلا بابلا � بوتكملا اذه' in language ''
At iteration 139783/462520/506696, Mean rms=0.252%, delta=0.04%, char train=4.764%, word train=21.182%, skip ratio=9.5%,  wrote checkpoint.
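
The failure bytes ef bf bd are the UTF-8 encoding of U+FFFD, the Unicode replacement character, so the affected ground truth texts themselves contain replacement characters. The files can be found with a simple search (sketch):

# List ground truth files which contain the Unicode replacement character U+FFFD.
grep -l '�' LINES/*.txt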