Errors: [BAD_INPUT_DATA] PDF to XML conversion failed with error code: 134 and 139 #1120

Open
RANN9 opened this issue May 25, 2024 · 3 comments
@RANN9

RANN9 commented May 25, 2024

Hi mighty developers

I am using GROBID for research in which I need to extract text (processFulltextDocument) from company annual report PDFs. I know GROBID is designed for academic documents, but it processes most of my documents very well. The problem is that about 30% of my document set (around 1000 PDFs) fails with one of three errors: [BAD_INPUT_DATA] 134, [BAD_INPUT_DATA] 139, or [GENERAL] An exception occurred while running Grobid. Meanwhile, GROBID processes other documents that are very similar to the failing ones without any error. I have uploaded a few examples for each error code. Are there any workarounds or solutions for these errors? Thanks!

Examples with error code:

  • 500: [BAD_INPUT_DATA] PDF to XML conversion failed with error code: 134
    • Document 1.pdf: failed with error 500, [BAD_INPUT_DATA] PDF to XML conversion failed with error code: 134
    • Document 2.pdf: similar to Document 1 but with no error.
  • 500: [BAD_INPUT_DATA] PDF to XML conversion failed with error code: 139
    • Document 3.pdf: failed with error 500, [BAD_INPUT_DATA] PDF to XML conversion failed with error code: 139
    • Document 4.pdf: similar to Document 3 but with no error.
  • 500: [GENERAL] An exception occurred while running Grobid.
    • Document 5.pdf: failed with error 500, [GENERAL] An exception occurred while running Grobid.
    • Document 6.pdf: failed with error 500, [GENERAL] An exception occurred while running Grobid.
    • Document 7.pdf: similar to Documents 5 and 6 but with no error.
  • 408
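
A note on reading codes 134 and 139: if GROBID is passing through the raw exit status of its PDF-to-XML conversion subprocess (an assumption, not confirmed in this thread), the usual POSIX convention of 128 + signal number applies. Under that convention 134 would correspond to SIGABRT (6) and 139 to SIGSEGV (11), which would point to the converter crashing rather than rejecting the input. A minimal sketch of that decoding, where the helper name `describe_exit_code` is purely illustrative:

```python
# Linux signal numbers (the GROBID Docker image runs Linux, whatever the host OS).
LINUX_SIGNALS = {6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def describe_exit_code(code: int) -> str:
    """Interpret an exit status using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        name = LINUX_SIGNALS.get(signum, f"signal {signum}")
        return f"killed by {name}"
    return f"exited normally with status {code}"


for code in (134, 139):
    print(code, "->", describe_exit_code(code))
# 134 -> killed by SIGABRT
# 139 -> killed by SIGSEGV
```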

Environment:

  • Windows 11 with GPU
  • grobid/grobid:0.8.0 on Docker Container
  • python grobid_client (processFulltextDocument: consolidate_header + segment_sentences)

The same error codes also appear when using a local GROBID service and on HuggingFace.
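
For a batch of around 1000 PDFs, it may help to tally failures by error type before sharing examples. The sketch below assumes you have already collected `(filename, http_status, response_body)` tuples from your own client loop; the function names `classify` and `tally` and the sample data are hypothetical, and only the error strings are taken from this report.

```python
import re
from collections import Counter

# Matches messages such as "[BAD_INPUT_DATA] PDF to XML conversion failed ..."
ERROR_PATTERN = re.compile(r"\[(?P<kind>[A-Z_]+)\]\s*(?P<message>.*)")
CODE_PATTERN = re.compile(r"error code:\s*(\d+)")


def classify(status: int, body: str) -> str:
    """Map one response to a short label like 'BAD_INPUT_DATA_134'."""
    if status == 200:
        return "ok"
    match = ERROR_PATTERN.search(body)
    if not match:
        return f"http_{status}"  # e.g. a 408 with no GROBID message
    code = CODE_PATTERN.search(match.group("message"))
    suffix = f"_{code.group(1)}" if code else ""
    return f"{match.group('kind')}{suffix}"


def tally(results):
    """Count labels over (filename, status, body) tuples."""
    return Counter(classify(status, body) for _, status, body in results)


# Hypothetical sample mirroring the error strings listed above.
sample = [
    ("Document 1.pdf", 500, "[BAD_INPUT_DATA] PDF to XML conversion failed with error code: 134"),
    ("Document 2.pdf", 200, "<TEI/>"),
    ("Document 3.pdf", 500, "[BAD_INPUT_DATA] PDF to XML conversion failed with error code: 139"),
    ("Document 5.pdf", 500, "[GENERAL] An exception occurred while running Grobid."),
]
print(tally(sample))
```

Grouping by label makes it easier to see whether one error dominates and to pick a representative file for each category.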

@lfoppiano
Collaborator

@RANN9 thanks a lot for the report. I will look into it in the coming weeks.

@lfoppiano
Collaborator

@RANN9 How much memory are you allocating to Docker and to the JVM?

@RANN9
Author

RANN9 commented May 27, 2024

Hi @lfoppiano, thanks for getting back to me.

  • System memory: 64G
  • Docker: 31.25G
  • Docker container: no limit
  • JVM: 4G

lfoppiano self-assigned this on Jun 12, 2024