GROBID Inconsistent Reference Detection in Custom PDFs: Format Guidelines Needed #1154

Open
JosVuHuynh opened this issue Aug 12, 2024 · 3 comments
Labels
question There's no such thing as a stupid question

Comments

@JosVuHuynh

What is the correct format for a PDF file that GROBID can detect references in? I create PDFs myself, and sometimes they work and sometimes they don’t. I’m not sure about the formatting rules. Can you please let me know?

@lfoppiano
Collaborator

By "detect references", do you mean detecting reference callouts (e.g. "In previous work [1] we showed that...") or the references section of the article?

For the first case, there is generally not much training data in GROBID (Fulltext model), but it would be easier if you showed me some examples of your generated documents.

@JosVuHuynh
Author

JosVuHuynh commented Aug 13, 2024

GwptVMUJQT.pdf
T5D17Q7WMj.pdf
besG09DFZb.pdf
CsoUOcdybT.pdf
Could you review all these files, @lfoppiano? GROBID does not detect the references when I run it on https://huggingface.co/spaces/kermitt2/grobid.
Related issue: #1152

I would like to know the formatting rules I need to follow when creating a new article PDF so that GROBID can accurately detect citations.

@lfoppiano
Collaborator

There are no "rules" for formatting a document so that GROBID recognises the references. It's more a matter of making the document look like a scientific article.
At first glance, the format of these documents is rather far from the layout of a scientific article. For example, there is no header (at least title and authors), and the page layout is horizontal (landscape).

Then, most importantly, the references don't match the text, so it is normal that GROBID does not extract them correctly.

I adjusted your document, and with some more consistency it now looks much better ;-) The body does still look like an abstract, though:
Untitled.pdf
Untitled.pdf.tei.xml.zip
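
To check a result like the attached TEI file, something along these lines could help. It is only a rough sketch, not part of GROBID: the function name `summarize_references` is made up for illustration, and it assumes the usual shape of GROBID's TEI output, where bibliography entries are `biblStruct` elements inside `listBibl` and callouts in the body are `ref` elements with `type="bibr"` whose `target` points at an entry's `xml:id`. It counts both and reports how many callouts resolve to an entry.

```python
import xml.etree.ElementTree as ET

# Namespaces used by TEI documents; the xml:id attribute lives in the XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def summarize_references(tei_xml: str) -> dict:
    """Count bibliography entries and callouts in a TEI string (hypothetical helper).

    Assumes entries are <biblStruct> inside <listBibl> and callouts are
    <ref type="bibr" target="#id">, which is an assumption about the output layout.
    """
    root = ET.fromstring(tei_xml)

    # Entries of the references section.
    entries = list(root.iterfind(".//tei:listBibl/tei:biblStruct", TEI_NS))
    entry_ids = {e.get(XML_ID) for e in entries if e.get(XML_ID)}

    # Callouts in the text, e.g. "[1]".
    callouts = list(root.iterfind(".//tei:ref[@type='bibr']", TEI_NS))

    # A callout is "resolved" when its target names an existing entry id.
    resolved = [
        c for c in callouts
        if c.get("target", "").startswith("#") and c.get("target")[1:] in entry_ids
    ]

    return {
        "bibliography_entries": len(entries),
        "callouts": len(callouts),
        "resolved_callouts": len(resolved),
    }
```

If `callouts` is well above `resolved_callouts`, or `bibliography_entries` is zero, that would point to the kind of mismatch between the text and the reference list described above.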

@lfoppiano added the bug and question labels and removed the bug label on Aug 13, 2024