word segmentation in kant_aufklaerung_1784 GT PageXML #13
@bertsky Yes, such a strict rule would be better. Since we do not have a designated space element as in ALTO and no
@tboenig
Excellent! Maybe having an automatic validation step (see #16) for consistency between PAGE element levels would quickly reveal such non-abiding annotations. (Or even allowed exceptions to it in corner cases like extra whitespace at paragraph column crossings,
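A rough sketch of what such a validation step could look like, assuming Python 3.8+ (for the `{*}` namespace wildcard in `xml.etree.ElementTree`) and assuming the strict rule that Word texts joined by single spaces must equal the TextLine text. The function names, the sample document and its ids are hypothetical and only serve as an illustration; this is not the validator discussed in #16.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal PAGE document: line l1 splits the semicolon into its
# own Word, line l2 keeps it attached to the preceding word.
SAMPLE = (
    '<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">'
    '<Page><TextRegion id="r1">'
    '<TextLine id="l1">'
    '<Word id="w1"><TextEquiv><Unicode>worden</Unicode></TextEquiv></Word>'
    '<Word id="w2"><TextEquiv><Unicode>;</Unicode></TextEquiv></Word>'
    '<TextEquiv><Unicode>worden;</Unicode></TextEquiv>'
    '</TextLine>'
    '<TextLine id="l2">'
    '<Word id="w3"><TextEquiv><Unicode>worden;</Unicode></TextEquiv></Word>'
    '<TextEquiv><Unicode>worden;</Unicode></TextEquiv>'
    '</TextLine>'
    '</TextRegion></Page></PcGts>'
)


def text_of(element):
    """Return the text of the element's own TextEquiv/Unicode child, or ''."""
    node = element.find("{*}TextEquiv/{*}Unicode")
    return (node.text or "") if node is not None else ""


def inconsistent_lines(page_xml):
    """List (line id, line text, word texts) for lines violating the strict rule."""
    root = ET.fromstring(page_xml)
    problems = []
    for line in root.findall(".//{*}TextLine"):
        words = [text_of(word) for word in line.findall("{*}Word")]
        if not words:
            continue  # no Word level annotated, nothing to compare
        line_text = text_of(line)
        # Collapse whitespace runs in the line before comparing.
        if " ".join(words) != " ".join(line_text.split()):
            problems.append((line.get("id"), line_text, words))
    return problems


for line_id, line_text, words in inconsistent_lines(SAMPLE):
    print(f"{line_id}: line text {line_text!r} != joined words {words!r}")
```

Allowed exceptions under a "lax" rule would need an extra comparison mode; the sketch only covers the strict case.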
@tboenig Push.
@tboenig Push.
@tboenig Please reopen. I understand that this is allowed now by the spec under a "lax" consistency rule. In fact, all variants But can we have at least one representation here in the GT assets repo that does meet the strict rules for tokenisation? (We need this to be able to write automated module tests without having to download GT files each time ...)
We made a new example this month.
In the assets repo? Please publish!
I think @tboenig meant we WILL provide a new example in the coming weeks.
Is it correct for assets/data/kant_aufklaerung_1784/page/kant_aufklaerung_1784_0020.xml to have punctuation characters as separate Word elements, even if they are written adjacent to other words (i.e. without whitespace)?
For example, in the first line tl_2 of region r_2_1, the word word_1478541900479_904 depicts a single semicolon token. But the line reads
so it should be part of the second Word (i.e. worden;).
In principle, punctuation characters can occur
And of course, these might be combined, as in
or
or
So again (see #12) there is no way to reproduce the TextLine content from Word level annotation, except using coordinate heuristics.
Wouldn't it be better to have a strict rule to only segment at whitespace? (This is what segmentation using OCR-D/ocrd_tesserocr does now.)
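To make the point concrete, here is a minimal sketch of the strict rule on plain strings. The example line is made up (it is not the actual content of tl_2), and both function names are hypothetical. It only illustrates that whitespace-only segmentation is reversible by a simple join, whereas splitting off punctuation is not.

```python
def segment_at_whitespace(line_text):
    """Strict rule: tokens are the maximal runs of non-whitespace characters."""
    return line_text.split()


def words_reproduce_line(line_text, word_texts):
    """True if joining the Word texts with single spaces gives back the line."""
    return " ".join(word_texts) == " ".join(line_text.split())


# Hypothetical line text, for illustration only.
line = "es ist worden; und"

strict_words = segment_at_whitespace(line)
# Punctuation annotated as its own Word, as in the file in question.
split_words = ["es", "ist", "worden", ";", "und"]

print(words_reproduce_line(line, strict_words))
print(words_reproduce_line(line, split_words))
```

With the semicolon as a separate Word, the join yields "worden ; und", so the original spacing can only be recovered by looking at coordinates.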
@tboenig @kba @finkf