Space between regular character and sub-script character/number #179
Comments
It is indeed a different problem: #160 is about detecting that a token is a subscript or superscript. Here, the extra space is a consequence of … I think we could either try to guess in GROBID whether there is a space between the two tokens, for instance by looking at the coordinates and the average character spacing, or tackle it in pdf2xml, perhaps by introducing an XML attribute that clarifies the spacing. It is something of a design issue in the XML format generated by …
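To make the coordinate-based idea concrete, here is a minimal sketch of how such a guess could work. It is not GROBID or pdf2xml code: the `Token` class, its fields, the helper names and the `ratio` threshold of 0.3 are all assumptions made up for illustration, and a real threshold would need tuning on actual PDFs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """Hypothetical token: text plus horizontal position on the line."""
    text: str
    x: float      # left edge of the token
    width: float  # horizontal extent of the token


def avg_char_width(tokens: List[Token]) -> float:
    """Average width of one character across all tokens on the line."""
    total_chars = sum(len(t.text) for t in tokens)
    total_width = sum(t.width for t in tokens)
    return total_width / total_chars if total_chars else 0.0


def has_space_between(left: Token, right: Token,
                      char_width: float, ratio: float = 0.3) -> bool:
    """Guess a space when the horizontal gap exceeds a fraction of the
    average character width (ratio is an assumed, untuned value)."""
    gap = right.x - (left.x + left.width)
    return gap > ratio * char_width


def join_tokens(tokens: List[Token], ratio: float = 0.3) -> str:
    """Rebuild the line, inserting a space only where the gap suggests one."""
    if not tokens:
        return ""
    char_width = avg_char_width(tokens)
    parts = [tokens[0].text]
    for prev, cur in zip(tokens, tokens[1:]):
        if has_space_between(prev, cur, char_width, ratio):
            parts.append(" ")
        parts.append(cur.text)
    return "".join(parts)
```

With made-up coordinates where the subscript "2" starts almost immediately after "α", `join_tokens` keeps the two together as "α2" while still separating the neighbouring words. The font or style change plays no role in the decision; only the horizontal gap does.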
Just had a look at the ALTO format. It looks like they have a separate … Cermine uses the Trueviz format for its training data annotations. It breaks 'Zones' down into 'Lines', 'Words' and 'Characters'. ALTO seems similar in a way but is more flexible, so I might use that for my annotated training data. The alternative I was considering was extended SVG. Peter's pdf2svg converts to SVG with character mapping (but leaves the text block detection for the next step). mupdf/tools also has an option to render as SVG (either as paths or as text). But neither would add a space element like in ALTO. Do you know of any tool that can already convert PDF to ALTO?
Thanks! ALTO does indeed look like a good choice. It is widely used by national libraries because ABBYY FineReader (which is used by most large-scale digitization projects) can produce it. I have already received requests to make GROBID support ALTO as an input format. The alternative, I think, would be hOCR, which is produced by some open source OCR engines such as Tesseract. The specification introduces similar areas to ALTO, including a space element … I didn't find any tools using the Trueviz format other than Trueviz itself and CERMINE, and no OCR engine uses it, which significantly limits its interest. I saw some commercial tools that can convert PDF to ALTO, but nothing open source (there is a project, pdf2alto, but it only outputs word elements). I think outputting ALTO format (and/or hOCR) with …
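As an illustration of why an explicit space element helps, here is a small sketch that rebuilds the text of one ALTO line from its `String` and `SP` children. The XML fragment is hand-written for this example (it is not the output of any tool mentioned in this thread), the coordinates are invented, and `line_text` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written ALTO-style fragment with invented coordinates.
# Note there is no <SP/> between "α" and the subscript "2".
ALTO_LINE = """<TextLine HPOS="10" VPOS="20" WIDTH="80" HEIGHT="10">
  <String CONTENT="integrin" HPOS="10" VPOS="20" WIDTH="40" HEIGHT="10"/>
  <SP HPOS="50" VPOS="20" WIDTH="5"/>
  <String CONTENT="α" HPOS="55" VPOS="20" WIDTH="5" HEIGHT="10"/>
  <String CONTENT="2" HPOS="60" VPOS="24" WIDTH="3" HEIGHT="6"/>
  <SP HPOS="63" VPOS="20" WIDTH="5"/>
  <String CONTENT="by" HPOS="68" VPOS="20" WIDTH="10" HEIGHT="10"/>
</TextLine>"""


def line_text(text_line: ET.Element) -> str:
    """Concatenate String contents, adding a space only for SP elements."""
    parts = []
    for child in text_line:
        # Drop any XML namespace prefix such as "{...}String".
        tag = child.tag.rsplit("}", 1)[-1]
        if tag == "String":
            parts.append(child.get("CONTENT", ""))
        elif tag == "SP":
            parts.append(" ")
    return "".join(parts)


if __name__ == "__main__":
    print(line_text(ET.fromstring(ALTO_LINE)))
```

Because spacing is carried by the markup itself, the consumer does not have to guess from a font or style change whether "α" and "2" belong together.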
This is also from the first PubMed manuscript (Introduction):
"...suppression of integrin α2 by E7820..."
The 2 after α is in subscript.
Currently an extra space is added:
"...suppression of integrin α 2 by E7820..."
This may be related to #160, but it seems to be a different problem. A different font or style does not by itself mean that there should be a space.