Same document, different PDF files, same curl command, predictably different output. #1135

Open
haykharut opened this issue Jun 29, 2024 · 2 comments


haykharut commented Jun 29, 2024

I have 2 PDF versions of a paper, which look exactly the same when inspected visually. The only difference I can detect is file size (2.2MB vs 900KB) and the fact that my PDF viewer will show a contents bar for the big file but not the small file. I am no PDF expert.

I process both files with the command below.

curl -v --form input=@./paper.pdf --form teiCoordinates=ref --form teiCoordinates=biblStruct --form teiCoordinates=figure --form teiCoordinates=persName --form teiCoordinates=formula --form segmentSentences=1 --form teiCoordinates=s https://kermitt2-grobid.hf.space/api/processFulltextDocument > ./paper.xml

The XML outputs differ. Specifically, GROBID will correctly output <graphic coords=... type='bitmap'> for all figures in the small file while it outputs the graphic coords for only 1 figure in the large file, even though it still detects the figures correctly. I am attaching the files for reproducibility.
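As a starting point for comparing the two outputs, here is a minimal sketch that lists every `<figure>` element in a TEI result and reports whether it carries coordinates. It relies on assumptions not stated above: that the output uses the standard TEI namespace, and that tables appear as `<figure type="table">`. The function name `summarize_figures` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the output uses the standard TEI namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def summarize_figures(tei_xml):
    """Return one summary dict per <figure> element in a TEI document.

    `tei_xml` may be a str or bytes object holding the whole XML file.
    """
    root = ET.fromstring(tei_xml)
    rows = []
    for fig in root.iter(f"{{{TEI_NS}}}figure"):
        graphics = fig.findall(f".//{{{TEI_NS}}}graphic")
        rows.append({
            "id": fig.get(XML_ID),
            # Assumption: tables are marked with type="table".
            "type": fig.get("type", "figure"),
            # True when the <figure> element itself has a coords attribute.
            "figure_coords": fig.get("coords") is not None,
            # Number of nested <graphic> elements that have coords.
            "graphics_with_coords": sum(1 for g in graphics if g.get("coords")),
        })
    return rows
```

Running it on both files, for example `summarize_figures(open("paper.xml", "rb").read())`, and comparing the two lists should show which figures lose their `<graphic>` coordinates in the large file.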

I would appreciate it if someone could help me understand why this happens, or at least help me get started with an investigation.

paper_big.pdf
paper_small.pdf

@lfoppiano lfoppiano changed the title Same PDF, same curl command, predictably different output. Same document, different PDF files, same curl command, predictably different output. Jun 30, 2024
@lfoppiano
Collaborator

Hi @haykharut,
thanks for reporting this issue.

The PDF format allows any type of information to be embedded, including fonts and images. Images may be embedded as bitmaps or as vector graphics.

Now, although PDF documents look good, they often smell bad :-)
For your examples, I extracted the bitmaps using a different application, poppler, and got the same results: in the small PDF I could extract all 5 bitmaps, while in the big PDF I could only extract figure 1 and figure 2 (which is composed of three images).
This is why Grobid does not attach the graphic tag to the other figures in the big document: there is no bitmap associated with them.
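For anyone who wants to repeat this check, here is a rough sketch that calls poppler's `pdfimages -list` and counts the embedded raster images per page. It assumes `pdfimages` (from poppler-utils) is on the PATH and that its listing keeps the usual column layout (page number first, object type third). The helper names are invented for illustration, and this is not necessarily the exact procedure used above.

```python
import shutil
import subprocess
from collections import Counter


def parse_pdfimages_list(text):
    """Count objects of type "image" per page in `pdfimages -list` output."""
    per_page = Counter()
    for line in text.splitlines():
        cols = line.split()
        # Data rows start with a page number; header and separator rows do not.
        if len(cols) < 3 or not cols[0].isdigit():
            continue
        # Assumption: the third column is the object type; masks are skipped.
        if cols[2] == "image":
            per_page[int(cols[0])] += 1
    return dict(per_page)


def list_bitmaps(pdf_path):
    """Run `pdfimages -list` on a PDF and return {page: bitmap_count}."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages (poppler-utils) is not installed")
    result = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        capture_output=True, text=True, check=True,
    )
    return parse_pdfimages_list(result.stdout)
```

Calling `list_bitmaps("paper_big.pdf")` and `list_bitmaps("paper_small.pdf")` should show which pages hold extractable bitmaps in each file. A figure drawn with vector operators will not appear in either listing.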

There are other differences in these two documents, for example, paper_big has some hidden content:

image

which is not present in paper_small:

image


haykharut commented Jun 30, 2024

@lfoppiano thanks so much for getting back. If you don't mind, I would like to ask a couple of follow-up questions.

Just to make sure I understand -- is it correct to say that in all likelihood, the larger file represents some figures as vectors and others as bitmaps?

In that case, I wonder, how can I extract the coordinates for vectors when bitmaps are missing?

Somewhat bewilderingly, the Grobid HF space processes the larger PDF file correctly.
I navigated to the PDF section, selected "include figures and tables" and uploaded the larger file. I can see correctly drawn bounding boxes. However, when I inspect the Network tab of Chrome DevTools, I can see that the coordinates for some of the figures are listed under tables, not figures.

For example, the underlined tab_6 item in the attached picture corresponds to the graphic on the left-hand side. It's not a table.

At the same time, the XML file generated by the curl command I mentioned above, references no tab_6 but instead correctly recognizes that item as a figure, even though it misses coordinates.

Screenshot 2024-06-30 at 16 29 45
