Feat/generate trainingsets #205
base: main
Conversation
I haven't tested it yet, but looks very promising. I particularly appreciate the unit tests 👍
generate_sets.py
Outdated
"-m",
"--minchars",
required=False,
help="Minimum chars for a line")
An explicit default value would be better. In sets/training_sets.py there is a constant DEFAULT_MIN_CHARS that is used in generate_sets.py, but the min_chars kwarg to TrainingSets.create is 8.
Also, why 8/16? It's very common to have valid shorter lines, like the last word of a sentence on a new line, lines in narrow columns, dramas, etc.
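A minimal sketch of the suggestion: wiring one shared constant into argparse makes the default value show up in `--help` automatically via `%(default)s`. The value 4 here is just the number floated later in this thread, not what the project actually uses, and `build_parser` is a hypothetical helper name.

```python
import argparse

# Assumed value from this discussion; the real project constant may differ.
DEFAULT_MIN_CHARS = 4

def build_parser():
    parser = argparse.ArgumentParser(description="generate training sets (sketch)")
    parser.add_argument(
        "-m", "--minchars",
        type=int,
        default=DEFAULT_MIN_CHARS,
        # %(default)s is expanded by argparse, so --help documents the value
        help="Minimum chars for a line (default: %(default)s)")
    return parser

args = build_parser().parse_args([])  # empty argv: falls back to the default
print(args.minchars)  # prints 4
```

With this pattern there is a single source of truth, so the CLI default and the kwarg default cannot drift apart.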
Historical note: this originates from a newspaper digitization project, where a common text line (no ads) usually has more than 20 chars. In fact, we only considered lines with at least 32 chars, since I thought these lines were more valuable for training than shorter ones because they offer more characters to learn from.
But personally I'm totally free about this, so what about, say, 4 chars?
It makes sense to skip short lines for training, but the fact that there is a minimum number of chars, and what that minimum is, should be clearly communicated to the user, so they aren't surprised when some lines are skipped.
And yes, probably something low like 4 (documented in ./generate_sets.py --help) would be best IMHO.
It probably makes sense not to skip short lines for training. Tesseract was initially trained only with artificial long lines, and the standard models have problems with short lines (typically page numbers, but also short lines ending paragraphs). We know that there are valid lines with only a single character, e.g. page numbers (1 … 9, a … z, A … Z). Why should we skip lines with one or two characters as long as they are valid?
This originates from the decision to prefer longer lines because they provide more characters. I thought more characters means more training material, and more material increases pattern recognition accuracy. But this doesn't pay much attention to a character's context. In newspaper advertisements I've seen many lines that are way shorter than 8 chars, containing only abbreviations and the like. Maybe focusing on "regular article lines" is another reason why Tesseract (4.1.1) usually performs rather poorly in this realm, compared to plain single-column text.
@stweil Do you suggest turning the minchars arg into something completely optional, or setting the default value to "1", so we only skip lines that contain nothing but non-printable characters?
Setting minchars to 1 sounds reasonable. I cannot imagine what a line containing only non-printable characters would look like.
I agree with Stefan: we should make minchars optional and try to make Tesseract learn short lines well. I'm not sure how the LSTM implementation here unrolls, but short lines should create fewer weight updates, so characters would still contribute "democratically"; there's just more incentive to get a better transition from the initial state.
sets/training_sets.py
Outdated
"""

if self.revert:
    return reduce(lambda c, p: p + ' ' + c, self.text_words)
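For what it's worth, the reduce call above concatenates the words back-to-front; a standalone check (with a hypothetical revert_words helper, not a name from the PR) that joining a reversed sequence produces the same result:

```python
from functools import reduce

def revert_words(words):
    # Same result as the reduce in the snippet above: joins the tokens
    # in reverse order, separated by single spaces.
    return ' '.join(reversed(words))

words = ["one", "two", "three"]
via_reduce = reduce(lambda c, p: p + ' ' + c, words)
print(via_reduce == revert_words(words))  # prints True
```

The join form is arguably easier to read and avoids building an intermediate string per word.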
Since we already require python-bidi, it would probably be more robust to use it for handling the inversion, cf. https://github.com/MeirKriheli/python-bidi/blob/master/bidi/algorithm.py / https://github.com/MeirKriheli/python-bidi#api
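For reference, python-bidi's documented entry point is get_display, which reorders a string from logical order into visual order following the Unicode bidi algorithm. A small sketch (guarded, since it requires the python-bidi package to be installed):

```python
try:
    from bidi.algorithm import get_display
except ImportError:  # python-bidi not installed in this environment
    get_display = None

if get_display is not None:
    # Pure-LTR input comes back unchanged; RTL runs are reordered
    # for visual display. base_dir='R' or 'L' can force a direction.
    logical = "abc \u05d0\u05d1\u05d2"  # Latin followed by Hebrew letters
    print(get_display(logical))
```

Whether this handles the mixed Arabic/Latin cases discussed below correctly would still need testing against real line data.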
Thanks for the hint, I'll take a look!
Hm,
I guess we need to go without bidi for now, since it looks like mixed Arabic + Latin lines turn from RTL to LTR in the output. Lines with only Arabic chars and Indic numbers seem to work pretty well with bidi, but mixed ones don't.
I'm no expert on that, though. Please take a look yourself. I've added the bidi import and adapted the line content generation (the commented section). Feel free to switch implementations. (In your preferred IDE, place a breakpoint in test_create_sets_from_page2013_and_jpg to inspect the temporary test files written.)
Do I get it right that bidi works on the char level? If so, I don't think it is useful in this scenario. I only know some (rather poor) Arabic output generated from Tesseract itself, which is word-based.
I've tested it now, unit tests pass and I managed to extract image-text pairs from the kant_aufklaerung_1784 sample in assets: $ python3 ./generate_sets.py -d ../assets/data/kant_aufklaerung_1784/data/OCR-D-GT-PAGE/PAGE_0017_PAGE.xml -i ../assets/data/kant_aufklaerung_1784/data/OCR-D-IMG/INPUT_0017.tif
[SUCCESS] created '20' training data sets, please review
It would be useful to make
Could the
We also need a section on at least the CLI usage in the README.md
@kba Do you know of any Devanagari or other Indic language datasets in PAGE XML format? I only have scanned page images and their groundtruth in text format. I don't think those will work with this PR.
Sorry, I do not. But maybe you have OCR results in Devanagari to test the mechanics of this PR? What problems do you foresee with Devanagari?
I don't foresee any, but wanted to test with complex scripts, just in case there is any difference in processing.
Good idea. I can test using ALTO output from tesseract.
I found a set of files at https://github.com/ramayanaocr/ocr-comparison/tree/master/Transkribus/Input, which has the png files as well as the xml files (generated by Transkribus, I guess). I tested with one of those files; while the console messages reported success, the files were not created. The summary option created a file, but the file had empty lines.
I tested with the Arabic image shared earlier in this thread with its xml file in resources, just to make sure that I had the PR installed correctly. That worked, i.e. it created the files. I haven't looked at the text within them.
Is there a compatibility issue with Transkribus-generated PAGE files?
I tested just now with ALTO output from Tesseract and got the following warnings:
EDIT: The earlier error with ALTO was because of a typo in the filename.
@Shreeshrii Thanks for pointing to PAGE files that are missing `Word` elements entirely!
@M3ssman I tried just now but am getting the same result as before.
However, only the summary file is created in 'training_data_ram110'. The file is attached. PS: I looked at the XML file and the Devanagari text in it has errors, so it is probably raw OCR output and not corrected groundtruth text.
I also tried with the ALTO 4.1 XML referenced in the issue I opened at OCR-D/ocrd_fileformat#23
@Shreeshrii Thanks for pointing towards ALTO V4. I'd missed this before, since we're using the latest official stable release, Tesseract 4.1, which doesn't create this kind of ALTO data. I've added the ALTO V4 namespace declaration and it worked fine. Somehow I found this surprising, since the ALTO V4 data from OpenITI you pointed out looks quite unfamiliar, having String CONTENT spanning a complete textline. I've never seen this before. Where does this data come from? Regarding the Devanagari issue: your git log looks fine, the version matches. Maybe
I do not know more than the info available online. Please see
@Shreeshrii Please note, test images are just created on the fly, with a library that out of the box can only render a very small subset of Unicode chars, I guess only ASCII: neither Arabic, Persian, Devanagari nor old German Fraktur letters. This was introduced to keep test data small and free from binary image payloads. It only gives you a hint whether the lines would match the "words".
@Shreeshrii Regarding the latest version: currently there's only a pre-beta version (0.0.1) annotated in the
@M3ssman Thanks for the explanations regarding test files.
You were right about this. I removed
The ALTO 4.1 Persian file is also generating line images and text (I haven't checked the RTL issue yet). This is great!! Thank you.
@Shreeshrii You're welcome! ... Sorry for the confusion regarding RTL ... finally, it turned out that the
Since this relies on individual coordinates for each token, I'm afraid it will have no effect on test resources like the ones gathered from OpenITI, which only have a single
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This should not be closed. It needs review by someone familiar with RTL languages.
This pull request introduces 4 alerts when merging f3e73e4 into fa57d61 (view on LGTM.com). New alerts:
I've been talking with https://github.com/galdring, a colleague, about this review, and he's going to find us somebody.
@M3ssman: can you please update your PR to the current git code? (The Python code is now in src; see Migrate Python code to a dedicated package.)
@zdenop Sorry for the late reply. What layout do you prefer? |
If I understood @zdenop correctly, the final goal is to make everything available through the tesstrain Python package in the end. As you provide a dedicated entry point,
Nevertheless, I am not sure about the external dependencies. They should perhaps be made optional (
@stefan6419846
@M3ssman If you are going to integrate the training set generator into the existing Python package, I would suggest yes. At least for me, they appear to be overkill for most users who just want to use the basic artificial training functionality.
`tesstrain-extract-sets` currently supports OCR data in ALTO V3, PAGE 2013 and PAGE 2019, as well as TIFF, JPEG and PNG images.
Output is written as UTF-8 encoded plain text files and TIFF images. The image frame is produced from the textline coordinates in the OCR data, so please make sure the geometrical information is properly annotated. Additionally, the tool can add a fixed synthetic padding around the textline or store it binarized (`--binarize`).
Isn't padding for raw images going to be a disaster? I'd recommend disallowing this combination in the CLI right away.
By default, several sanitize actions are performed at image line level, like deskewing or removement of top-bottom intruders. To disable this, add flag `--no-sanitze`. |
By default, several sanitize actions are performed at image line level, like deskewing or removement of top-bottom intruders. To disable this, add flag `--no-sanitze`.
By default, several optimization actions are performed at image line level, like deskewing or removal of top-bottom intruders. To disable this, add flag `--no-sanitize`.
extract_sets/training_sets.py
Outdated
import exifread
import lxml.etree as etree
import numpy as np
extract_sets/training_sets.py
Outdated
* drawing artificial border
* collect only contours that touch this
* get contours that are specific ratio to close to the edge
* fill those with specific grey tone
I don't think this operation will be helpful for raw images. For binarized ones it may help, but a grey untextured fill is certainly going to irritate the pixel pipeline (it introduces artificial edges etc.). It's also not realistic (it won't be seen at inference), so forcing the models to learn it is not a good idea.
Also, didn't you already write a textured fill (grey_canvas, IIRC) for that very purpose (but for synthetic training)?
extract_sets/training_sets.py
Outdated
only if so, enhance img to prevent rotation
black area artifacts with constant padding
* rotate
* slice rotation result due previous padding
Doing all this on the line-level image is asking for trouble:
- skew detection via Hough transform is much less reliable than on the region level
- derotation introduces white corners, which you then have to fill in, again detrimental to raw/RGB images
I disagree with that assessment. The package for synthetic training is as relevant as some way to import real GT for training from the widely used file formats (ALTO, PAGE), IMO. So if the trainingsets extension is adopted (at all), then its dependencies should not be moved to
fhdl.writelines(contents)


def calculate_grayscale(low=168, neighbourhood=32, in_data=None):
Check notice
Code scanning / CodeQL
Explicit returns mixed with implicit (fall through) returns Note
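The CodeQL note flags functions that return a value on some paths but fall off the end (implicitly returning None) on others. A generic illustration of the fix, with hypothetical names rather than the project's actual calculate_grayscale logic:

```python
def clamp_grayscale(value, low=0, high=255):
    # Before the fix, the last path would simply fall off the end of the
    # function (an implicit `return None`). Making every exit explicit
    # documents intent and silences the CodeQL note.
    if value > high:
        return high
    if value < low:
        return low
    return value  # explicit instead of an implicit fall-through

print(clamp_grayscale(300), clamp_grayscale(-5), clamp_grayscale(128))  # prints 255 0 128
```

If a path really is meant to yield nothing, writing `return None` there makes that intent explicit too.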
return tuple(map(lambda c: sum(c) / len(c), zip(*point_pairs)))


def to_center_coords(elem, namespace, vertical=False):
Check notice
Code scanning / CodeQL
Explicit returns mixed with implicit (fall through) returns Note
self.set_id()
self.set_text()
if self.valid:
    self.reorder = reorder
Check warning
Code scanning / CodeQL
Overwriting attribute in super-class or sub-class Warning
TextLine
@@ -5,6 +5,8 @@

ROOT_DIRECTORY = Path(__file__).parent.resolve()

installation_requirements = open('requirements.txt', encoding='utf-8').read().split('\n')
Check warning
Code scanning / CodeQL
File is not always closed Warning
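The warning is about the bare `open('requirements.txt', ...)` whose handle is never closed. A sketch of one common fix (Path.read_text shown here; a `with` block works equally well), with a hypothetical helper name:

```python
from pathlib import Path

def read_requirements(path):
    # Path.read_text opens, reads and closes the file in one call, so no
    # handle can leak; splitlines also avoids the trailing '' that
    # split('\n') produces for a file ending in a newline.
    return Path(path).read_text(encoding='utf-8').splitlines()
```

In a setup.py this keeps the behavior identical while satisfying the analyzer.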
do_opt = args.sanitize
intrusion_ratio = args.intrusion_ratio
if isinstance(intrusion_ratio, str) and ',' in intrusion_ratio:
    intrusion_ratio = [float(n) for n in intrusion_ratio.split(',')]
Check warning
Code scanning / CodeQL
Variable defined multiple times Warning
redefined
if isinstance(intrusion_ratio, str) and ',' in intrusion_ratio:
    intrusion_ratio = [float(n) for n in intrusion_ratio.split(',')]
else:
    intrusion_ratio = float(intrusion_ratio)
Check warning
Code scanning / CodeQL
Variable defined multiple times Warning
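One way to address the "variable defined multiple times" warnings is to compute the value once in a small helper, so the caller binds the name exactly once. The helper name parse_ratio is hypothetical; the parsing logic mirrors the snippets above:

```python
def parse_ratio(raw):
    # "0.1,0.2" -> [0.1, 0.2]; "0.3" -> 0.3; plain numbers pass through float().
    if isinstance(raw, str) and ',' in raw:
        return [float(n) for n in raw.split(',')]
    return float(raw)

print(parse_ratio("0.1,0.2"), parse_ratio("0.3"))  # prints [0.1, 0.2] 0.3
```

The caller then becomes a single assignment, e.g. `intrusion_ratio = parse_ratio(args.intrusion_ratio)`, which also makes the branch unit-testable on its own.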
Include generation of training data sets from OCR output (ALTO V3, PAGE 2013, PAGE 2019) and image files (TIFF, JPEG)