word segmentation in kant_aufklaerung_1784 GT PageXML #13

Open
bertsky opened this issue Aug 30, 2018 · 9 comments
Labels: bug, groundtruth

Contributor

bertsky commented Aug 30, 2018

Is it correct for assets/data/kant_aufklaerung_1784/page/kant_aufklaerung_1784_0020.xml to have punctuation characters as separate Word elements, even if they are written adjacent to other words (i.e. without whitespace)?

For example, in the first line tl_2 of region r_2_1, the word word_1478541900479_904 depicts a single semicolon token. But the line reads

gewiegelt worden; ſo ſchaͤdlich iſt es Vorurtheile zu

so it should be part of the second Word (i.e. worden;).

In principle, punctuation characters can occur

  • at the start of a token (e.g. opening parenthesis or quotation marks etc.)
  • at the end of a token (e.g. closing parenthesis or quotation marks, comma, colon, semicolon, sentence punctuation, hyphen etc.)
  • as an isolated token (e.g. dash)

And of course, these might be combined, as in

„die Urſach jener intereſſanten Erſcheinung ſeyn ſollte?“ —

or

sind ganz entzückt über diesen glänzenden (!) Sieg der gerechten Sache

or

„Wir alten Republikaner“, sagt Guinard, „die seit 30 Jahren über die Republik wachten, wir werden doch einem Leon Faucher zur Seite noch ferner darüber wachen dürfen.“
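
To make these three positions concrete, here is a minimal sketch that splits a line at whitespace only and reports where punctuation sits in each resulting token. It is not part of any OCR-D tool: the helper names `is_punct` and `punctuation_profile` are made up for illustration, and it assumes that Unicode general categories starting with `P` are a good enough proxy for "punctuation".

```python
import unicodedata


def is_punct(ch):
    # Assumption: any Unicode category starting with "P" counts as punctuation.
    return unicodedata.category(ch).startswith("P")


def punctuation_profile(token):
    """Classify a whitespace-delimited token by where its punctuation sits."""
    if all(is_punct(ch) for ch in token):
        return "isolated"  # e.g. a dash, or "(!)"
    kinds = []
    if is_punct(token[0]):
        kinds.append("start")  # e.g. opening quotation mark
    if is_punct(token[-1]):
        kinds.append("end")  # e.g. comma, semicolon, closing quote
    return "+".join(kinds) or "none"


line = "sind ganz entzückt über diesen glänzenden (!) Sieg der gerechten Sache"
for token in line.split():  # segment at whitespace only
    print(token, punctuation_profile(token))
```

Under the strict rule, `(!)` stays one token and `worden;` keeps its semicolon; nothing has to be re-attached afterwards.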

So again (see #12) there is no way to reproduce the TextLine content from Word level annotation, except using coordinate heuristics.

Wouldn't it be better to have a strict rule to only segment at whitespace? (This is what segmentation using OCR-D/ocrd_tesserocr does now.)
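
As a rough illustration of why whitespace-only segmentation makes the line reproducible, the sketch below joins Word texts with single spaces and compares the result against the TextLine text. The function names are hypothetical, and it assumes that words on a line are separated by exactly one space.

```python
def words_from_line(line_text):
    # Strict rule: segment at whitespace only, punctuation stays attached.
    return line_text.split()


def line_from_words(words):
    # Assumption: exactly one space between adjacent words.
    return " ".join(words)


line = "gewiegelt worden; ſo ſchaͤdlich iſt es Vorurtheile zu"

strict = words_from_line(line)
# The current GT style: the semicolon is a Word of its own.
lax = strict[:1] + ["worden", ";"] + strict[2:]

print(line_from_words(strict) == line)  # strict segmentation round-trips
print(line_from_words(lax) == line)     # separate ";" yields "worden ;"
```

With the separate semicolon, the only way to know that no space belongs before `;` is to look at the coordinates.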

@tboenig @kba @finkf

Collaborator

wrznr commented Sep 17, 2018

@bertsky Yes, such a strict rule would be better. Since we have neither a designated space element as in ALTO nor a wordStart attribute as in ABBYY XML, I would even say that this is the only way to represent whitespace unambiguously.

Collaborator

wrznr commented Sep 17, 2018

@tboenig assets and our own GT have to be thoroughly checked with respect to this issue.

wrznr added the bug label Sep 17, 2018

Contributor Author

bertsky commented Sep 17, 2018

Excellent! Maybe an automatic validation step (see #16) that checks consistency between PAGE element levels would quickly reveal such non-conforming annotations. (It would also surface candidates for allowed exceptions in corner cases, like extra whitespace at paragraph or column crossings, drop-capital regions, etc. So one might need a semi-automatic procedure to get there.)
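
A minimal sketch of what such a check could look like, using only the standard library. This is not the validator discussed in #16: the function names are hypothetical, it matches elements by local name so it does not depend on a particular PAGE namespace version, and it assumes the strict rule means "Word texts joined by single spaces equal the TextLine text".

```python
import xml.etree.ElementTree as ET


def local(tag):
    # Strip the "{namespace}" prefix so any PAGE version matches.
    return tag.rsplit("}", 1)[-1]


def own_text(elem):
    """Return the Unicode text of the element's direct TextEquiv child, if any."""
    for child in elem:
        if local(child.tag) == "TextEquiv":
            for sub in child:
                if local(sub.tag) == "Unicode":
                    return sub.text or ""
    return None


def inconsistent_lines(xml_string):
    """List (line id, line text, joined word text) where the two levels disagree."""
    problems = []
    for line in ET.fromstring(xml_string).iter():
        if local(line.tag) != "TextLine":
            continue
        line_text = own_text(line)
        words = [own_text(w) for w in line if local(w.tag) == "Word"]
        if line_text is None or not words or None in words:
            continue  # nothing to compare at this level
        joined = " ".join(words)
        if joined != line_text:
            problems.append((line.get("id"), line_text, joined))
    return problems


# Illustrative input, not taken from the GT file.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
<Page><TextRegion id="r">
<TextLine id="lax">
<Word id="w1"><TextEquiv><Unicode>worden</Unicode></TextEquiv></Word>
<Word id="w2"><TextEquiv><Unicode>;</Unicode></TextEquiv></Word>
<TextEquiv><Unicode>worden;</Unicode></TextEquiv>
</TextLine>
<TextLine id="strict">
<Word id="w3"><TextEquiv><Unicode>worden;</Unicode></TextEquiv></Word>
<Word id="w4"><TextEquiv><Unicode>so</Unicode></TextEquiv></Word>
<TextEquiv><Unicode>worden; so</Unicode></TextEquiv>
</TextLine>
</TextRegion></Page></PcGts>"""

print(inconsistent_lines(SAMPLE))
```

Lines flagged this way would still need a human look, since some of them may be legitimate exceptions rather than annotation errors.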

Collaborator

wrznr commented Oct 4, 2018

@tboenig Push.

Collaborator

wrznr commented Nov 6, 2018

@tboenig Push.

Contributor Author

bertsky commented Feb 12, 2019

@tboenig Please reopen. I understand that this is allowed now by the spec under a "lax" consistency rule. In fact, all variants kant_aufklaerung_1784{,-jp2,-binarized,-page-block-line-word_glyph} fall under this case.

But can we have at least one representation here in the GT assets repo that does meet the strict rules for tokenisation? (We need this to be able to write automated module tests without having to download GT files each time ...)

kba reopened this Feb 12, 2019
kba assigned tboenig and unassigned tboenig Jul 9, 2019

Contributor

tboenig commented Jul 11, 2019

We made a new example this month.

Contributor Author

bertsky commented Jul 11, 2019

In the assets repo? Please publish!

Member

kba commented Jul 11, 2019

I think @tboenig meant we WILL provide a new example in the coming weeks.

cneud added the groundtruth label Nov 5, 2019