Arabic Handwriting
This is work in progress, so please be patient.
The British Library provides free transcriptions of Arabic handwritten text. They have already run training with Transkribus. Those transcriptions can also be used to train Tesseract.
- Ground Truth transcriptions for training OCR of historical Arabic handwritten texts. Adi Keinan-Schoonbaert; British Library, 2019; https://doi.org/10.23636/1135
- Arabische Handschriften & Automatische Texterkennung [Arabic manuscripts & automatic text recognition]. Viola Voß; 2020; https://www.ulb.uni-muenster.de/fachblog/archiv/2155
Create a new directory and run the following commands to prepare the data for the training process.
mkdir -p ~/ArabicHandwriting
cd ~/ArabicHandwriting
# Get the data.
curl -L https://bl.oar.bl.uk/fail_uploads/download_file?fileset_id=e03280ef-5a75-4193-a8b5-1265f295e5cf >RASM2019_part_1.zip
curl -L https://bl.oar.bl.uk/fail_uploads/download_file?fileset_id=907b2e2a-3f23-49b8-8eef-f073c8bb97ab >RASM2019_part_2.zip
# Extract the data. Use 7za instead of unzip because there is an error in RASM2019_part_2.zip.
7za x RASM2019_part_1.zip
7za x RASM2019_part_2.zip
mkdir -p IMG PAGE
mv *.tif IMG
mv *.xml PAGE
# Remove spaces in filenames (workaround: filenames with spaces are currently
# not fully supported by OCR-D). Each command below is run twice because a
# single run removes only one space.
for i in IMG/* PAGE/*; do mv -v "$i" "${i/ /}"; done
for i in IMG/* PAGE/*; do mv -v "$i" "${i/ /}"; done
perl -pi -e 's/(imageFilename=.*) (.*tif)/$1$2/' PAGE/*
perl -pi -e 's/(imageFilename=.*) (.*tif)/$1$2/' PAGE/*
perl -pi -e 's/(filename=.*) (.*tif)/$1$2/' PAGE/*
perl -pi -e 's/(filename=.*) (.*tif)/$1$2/' PAGE/*
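A quick illustration of why each rename command is run twice: the bash expansion "${i/ /}" removes only the first space, so a second pass is needed for names containing two spaces (the filename here is made up for the demonstration):

```shell
# Hypothetical filename with two spaces.
i='RASM 2019 page.tif'
echo "${i/ /}"        # first pass removes one space: RASM2019 page.tif
i="${i/ /}"
echo "${i/ /}"        # second pass removes the remaining space: RASM2019page.tif
```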
# Fix path for images for further processing.
perl -pi -e 's/imageFilename="/imageFilename="IMG\//' PAGE/*
# Remove alternative image filenames which are not available from PAGE files.
perl -pi -e 's/.*AlternativeImage.*//' PAGE/*
# Create OCR-D workspace and add images and PAGE files.
ocrd workspace init
for i in IMG/*; do base=$(basename "$i" .tif); ocrd workspace add "$i" -G IMG -i "${base}_img" -g "$base" -m image/tiff; done
for i in PAGE/*; do base=$(basename "$i" .xml); ocrd workspace add "$i" -G PAGE -i "${base}_page" -g "$base" -m application/vnd.prima.page+xml; done
# Binarize and denoise images.
ocrd-olena-binarize -I PAGE -O WOLF,WOLF-IMG -m mets.xml -p <(echo '{"impl":"wolf"}')
ocrd-cis-ocropy-denoise -I WOLF -O DENOISE,DENOISE-IMG -m mets.xml -p '{"level-of-operation": "line"}'
# Extract the line images.
ocrd-segment-extract-lines -I DENOISE -O LINES -m mets.xml
# Remove empty texts (files contain only a line feed) which cannot be used for training.
rm -v $(find LINES -size 1c)
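The `find -size 1c` expression matches files of exactly one byte, i.e. text files that consist only of a line feed. A small self-contained sketch (the demo directory and file names are hypothetical; the real data sits in LINES/):

```shell
# Create one empty transcription (only a line feed) and one real one.
mkdir -p demo_empty
printf '\n' > demo_empty/empty.gt.txt
printf 'some text\n' > demo_empty/full.gt.txt
# -size 1c matches files with a size of exactly one byte.
find demo_empty -size 1c    # lists only demo_empty/empty.gt.txt
```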
# Remove lines with missing transcriptions.
rm -v $(fgrep -l ؟ LINES/*txt)
rm -v $(fgrep -l '[' LINES/*txt)
# Remove images which were written from top to bottom or from bottom to top.
# The heuristics here assumes that such images have a 3 digit width and a 4 digit height.
rm -v $(file LINES/*.png | grep ", ... x ....," | sed 's/:.*//')
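To illustrate the heuristic: `file` reports the image dimensions as "WIDTH x HEIGHT", and the pattern ", ... x ....," matches a 3-digit width combined with a 4-digit height. The sample lines below are made up to resemble typical `file` output:

```shell
# Portrait line image (3-digit width, 4-digit height): the pattern matches.
printf 'a.png: PNG image data, 123 x 4567, 8-bit\n' | grep -c ', ... x ....,'
# Landscape line image (4-digit width, 3-digit height): no match, count is 0.
printf 'b.png: PNG image data, 1234 x 567, 8-bit\n' | grep -c ', ... x ....,' || true
```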
Training is started from the existing Tesseract model script/Arabic.traineddata.
# Create box files needed for Tesseract training.
for t in ~/ArabicHandwriting/GT/LINES/*.txt; do test -f ${t/gt.txt/box} || (echo $t && ./generate_wordstr_box.py -i ${t/gt.txt/bin.png} -t $t -r >${t/gt.txt/box}); done
nohup make LANG_TYPE=RTL MODEL_NAME=ArabicHandwritingOCRD GROUND_TRUTH_DIR=/home/stweil/src/ArabicHandwriting/GT/LINES PSM=13 START_MODEL=Arabic TESSDATA=/home/stweil/src/github/OCR-D/venv-20200408/share/tessdata EPOCHS=20 lists >>data/ArabicHandwritingOCRD.log
nohup make LANG_TYPE=RTL MODEL_NAME=ArabicHandwritingOCRD GROUND_TRUTH_DIR=/home/stweil/src/ArabicHandwriting/GT/LINES PSM=13 START_MODEL=Arabic TESSDATA=/home/stweil/src/github/OCR-D/venv-20200408/share/tessdata EPOCHS=20 training >>data/ArabicHandwritingOCRD.log
The ground truth lines are split into 2351 lines for training and 262 lines for validation.
After one epoch (2351 iterations), the CER is about 46 %. With sufficient training (200 epochs, about 32 hours), the CER falls below 5 %.
Best and fast Tesseract models trained using the steps above are available from https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/ArabicHandwritingOCRD/.
The CER achieved on other Arabic handwriting still has to be measured.
The ground truth texts seem to mark passages which could not be read by the human transcriber, using patterns like [؟], ؟؟؟ and maybe others with []. Example: [؟؟؟]صْنَاف[؟]. About 35 ground truth lines with such marks are usable neither for training nor for validation. Therefore all current training has to be repeated.
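Such marked lines can be found with a simple grep, as in the removal commands above. A small self-contained sketch (the demo directory and its files are hypothetical; the real transcriptions are the .gt.txt files under LINES/):

```shell
# One transcription with "unreadable" marks, one clean transcription.
mkdir -p demo_marks
printf '[؟؟؟]صْنَاف[؟]\n' > demo_marks/marked.gt.txt
printf 'نص سليم\n' > demo_marks/clean.gt.txt
# Count the files containing either a ؟ mark or a bracket.
grep -l -e '؟' -e '\[' demo_marks/*.gt.txt | wc -l
```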
Some text (especially in the left and right side margins) is not written horizontally but vertically, either from top to bottom or from bottom to top. The PAGE data does not indicate this (it should include the text orientation), so the image is extracted as it is instead of being rotated by +90° or -90°. Such images are unusable for training. A first estimate is that at least 75 line images are affected.
The Tesseract training shows many encoding and other problems, with a rather large skip ratio of more than 8 %. Here is a typical example:
At iteration 139783/462420/506586, Mean rms=0.254%, delta=0.043%, char train=4.88%, word train=21.6%, skip ratio=8.7%, wrote checkpoint.
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Encoding of string failed! Failure bytes: ef bf bd
Can't encode transcription: 'امهرد نوسمخو ةيام نكي ىش ىلع ةموسقم امهرد رشع ةسمخ ىف مث نيمهردو ىش�' in language ''
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Image too small to scale!! (2x48 vs min width of 3)
Line cannot be recognized!!
Image not trainable
Encoding of string failed! Failure bytes: ef bf bd 20 d8 a8 d9 88 d8 aa d9 83 d9 85 d9 84 d8 a7 20 d8 a7 d8 b0 d9 87
Can't encode transcription: 'اٰمو اهعيابط ناٰيب يف ثلاثلا بابلا � بوتكملا اذه' in language ''
At iteration 139783/462520/506696, Mean rms=0.252%, delta=0.04%, char train=4.764%, word train=21.182%, skip ratio=9.5%, wrote checkpoint.