Sentence segmentation error case #1130

Open
lfoppiano opened this issue Jun 11, 2024 · 1 comment
Labels: bug, implemented


lfoppiano commented Jun 11, 2024

This error case causes trouble with sentence segmentation; it is recorded here so that it is not forgotten.
The document is not CC-BY; it is referenced here: https://dx.doi.org/10.1063/1.1874292

Here is the offending paragraph:

image

With version 0.8.0 and the current master, the process fails:

ERROR [2024-06-11 06:22:00,602] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs. 
! java.lang.StringIndexOutOfBoundsException: begin 592, end 595, length 594
! at java.base/java.lang.String.checkBoundsBeginEnd(String.java:4606)
! at java.base/java.lang.String.substring(String.java:2709)
! at org.grobid.core.document.TEIFormatter.segmentIntoSentences(TEIFormatter.java:1900)
! at org.grobid.core.document.TEIFormatter.toTEITextPiece(TEIFormatter.java:1468)
! at org.grobid.core.document.TEIFormatter.toTEIBody(TEIFormatter.java:1015)
! at org.grobid.core.engines.FullTextParser.toTEI(FullTextParser.java:2648)
! ... 83 common frames omitted
! Causing: org.grobid.core.exceptions.GrobidException: [GENERAL] An exception occurred while running Grobid.

There are two problems in the code around `if (pos+posInSentence <= theSentences.get(i).end) {`:

  1. `String local_text_chunk = text.substring(pos+posInSentence, theSentences.get(i).end);` may crash when the sentence's end offset goes beyond the text length.
  2. The `if` branch is skipped entirely in certain cases, so all the accumulated nodes are dropped. See below:
```xml
<div xmlns="http://www.tei-c.org/ns/1.0">
    <head>C. dc field dependence of R "T , B rf , B dc , f…</head>
    <p>
        <s>As mentioned in Ref.</s>
        <s>31, properly annealed, bulk Nb TM-TE-mode cavities show large additional rf losses by frozen-in flux with, e.g., at 4.2 K and 2 GHz, R H Ӎ 2 ⍀ H dc / mT for RRRӍ 30, which is described in Eq. ͑3.9͒ by ␤ Ӎ 1 and ␤ Ͻ 10 for RRRտ 200.</s>
        <s>Those large rf losses by the normal conducting cores of slow AF do not increase with rf field level.</s>
        <s>,
            <ref type="bibr" target="#b30">31</ref>
        </s>
    </p>
</div>
```
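
The first problem can be reproduced outside GROBID with the numbers from the stack trace (begin 592, end 595, length 594). The sketch below is only an illustration: `SafeChunk` and `safeChunk` are hypothetical names, not GROBID code, and clamping the offsets is one possible guard, not necessarily the fix that was applied.

```java
public class SafeChunk {

    // Hypothetical helper (not part of GROBID): clamp both offsets to the
    // bounds of the text before calling substring, so an end offset that
    // runs past the text no longer throws.
    public static String safeChunk(String text, int begin, int end) {
        int safeEnd = Math.min(Math.max(end, 0), text.length());
        int safeBegin = Math.min(Math.max(begin, 0), safeEnd);
        return text.substring(safeBegin, safeEnd);
    }

    public static void main(String[] args) {
        // Same numbers as in the stack trace: begin 592, end 595, length 594.
        // String.repeat requires Java 11 or later.
        String text = "x".repeat(594);

        try {
            // Unclamped call, as in problem 1: end (595) > length (594).
            text.substring(592, 595);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unclamped: " + e.getMessage());
        }

        // Clamped call: returns the two characters that actually exist.
        System.out.println("clamped length: " + safeChunk(text, 592, 595).length());
    }
}
```

Clamping only avoids the crash; it does not explain why a sentence offset ends up beyond the text, so the mismatch between the sentence offsets and the text may still need to be addressed at its source.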
lfoppiano added the bug label Jun 11, 2024

lfoppiano commented

This should be fixed in #1131.

lfoppiano added the implemented label Jun 12, 2024
lfoppiano added this to the 0.8.1 milestone Jun 12, 2024