Coordinates of caption elements #1008

Open
keto33 opened this issue May 1, 2023 · 6 comments
Labels
enhancement, implemented

Comments

@keto33

keto33 commented May 1, 2023

This may seem unnecessary, but it should be a feasible feature suggestion.

GROBID outputs all coordinates of structures except for text blocks. I am mostly interested in the coordinates of figure captions. When figures are embedded as EPS in vector format rather than raster/bitmap, GROBID does not correctly detect the bounding box of the figure, as drawings and texts are somehow blended into the PDF structure rather than being a distinguishable stream. In such cases, the bounding box of the figure caption can be helpful in estimating the actual bounding box of the EPS figure.
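As a rough illustration of how a caption's coordinates could be turned into a single bounding box, here is a minimal sketch. It assumes the `coords` attribute format described in the GROBID documentation, i.e. semicolon-separated boxes of the form `page,x,y,width,height` with the origin at the upper-left corner of the page. The names `Box`, `parse_coords` and `union_by_page` are hypothetical helpers, not part of GROBID.

```python
from typing import NamedTuple


class Box(NamedTuple):
    page: int
    x: float       # left edge (assumed: origin at the upper-left of the page)
    y: float       # top edge
    width: float
    height: float


def parse_coords(coords: str) -> list[Box]:
    """Split a coords string into boxes (assumed format: 'page,x,y,w,h;...')."""
    boxes = []
    for chunk in coords.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        page, x, y, w, h = chunk.split(",")
        boxes.append(Box(int(page), float(x), float(y), float(w), float(h)))
    return boxes


def union_by_page(boxes: list[Box]) -> dict[int, Box]:
    """Merge all boxes that lie on the same page into one enclosing box."""
    merged: dict[int, Box] = {}
    for b in boxes:
        if b.page not in merged:
            merged[b.page] = b
            continue
        m = merged[b.page]
        x0 = min(m.x, b.x)
        y0 = min(m.y, b.y)
        x1 = max(m.x + m.width, b.x + b.width)
        y1 = max(m.y + m.height, b.y + b.height)
        merged[b.page] = Box(b.page, x0, y0, x1 - x0, y1 - y0)
    return merged
```

With a caption box of this kind, the region directly above (or below) it on the same page would be the natural place to look for the vector figure. How far that region extends is a heuristic this sketch does not attempt.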

@kermitt2
Copy link
Owner

kermitt2 commented May 1, 2023

Hi @keto33!

Thanks for the issue.

> GROBID outputs all coordinates of structures except for text blocks.

Yes, text blocks are not part of the TEI XML output because they are presentation/layout elements, not something related to the logical structure of the document (like paragraphs, titles, etc.).

> I am mostly interested in the coordinates of figure captions. When figures are embedded as EPS in vector format rather than raster/bitmap, GROBID does not correctly detect the bounding box of the figure, as drawings and texts are somehow blended into the PDF structure rather than being a distinguishable stream.

Yes, the coordinates of the caption elements are indeed not output currently, and there is no reason not to add them.

Regarding the "graphic part" of a figure, this is more or less implemented in PR #963 (the PR as a whole is not usable at this stage; it is very much a work in progress). There, the vector graphics are further analyzed to detect their boundaries, deal with overlapping text, etc., so that we have reliable "figure graphic" aggregated elements similar to the embedded bitmaps. There are many other things in this PR, and it will take a lot of time to complete!

kermitt2 changed the title from "Coordinates of text blocks" to "Coordinates of caption elements" on May 1, 2023
@ClementFrvl

ClementFrvl commented Aug 11, 2024

Hello!

Is there an ongoing effort or a specific branch where coordinates of text blocks can be extracted as part of the TEI/XML output?

I checked the documentation and saw that p elements are supported by teiCoordinates, so I am running this command:

curl --form input=@./Papers/test.pdf --form teiCoordinates='head' --form teiCoordinates='p' host:8070/api/processFulltextDocument

However there are no coordinates for the p elements, which I'm interested in.

Please let me know if there is a solution or anything I can do to assist!
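To check the problem without reading the XML by eye, something like the following could be used. It is only a sketch: it assumes the response of the command above has been saved as a string, that elements live in the standard TEI namespace, and that GROBID writes positions into a `coords` attribute. `coords_report` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace (assumed to be the default namespace of the output)
TEI_NS = "http://www.tei-c.org/ns/1.0"


def coords_report(tei_xml: str, tags=("head", "p")) -> dict:
    """For each tag, return (elements with a coords attribute, total elements)."""
    root = ET.fromstring(tei_xml)
    report = {}
    for tag in tags:
        elements = root.findall(f".//{{{TEI_NS}}}{tag}")
        with_coords = sum(1 for el in elements if el.get("coords"))
        report[tag] = (with_coords, len(elements))
    return report
```

A result such as `{"head": (12, 12), "p": (0, 40)}` would match what is described here: coordinates on the head elements but none on the p elements.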

@lfoppiano
Collaborator

Hi @ClementFrvl, which version are you using? This seems to be a problem in GROBID version 0.8.0 that no longer occurs on the current master branch. 🤔

@ClementFrvl

Hey, I am using 0.8.0, so that may be the reason.

My server is ARM-based, though. I just tried version 0.7.3, but I'm having the same issue.

Is there a newer ARM version available?

@lfoppiano
Collaborator

We have been working on a new version for a few weeks; hopefully we will be able to release it soon.

@lfoppiano
Collaborator

This should be solved in version 0.8.1.

lfoppiano added the "implemented" label on Oct 26, 2024