Approaches for checking a PDF for validity or completeness #2530

Evan0000000000 · 2024-03-19T14:08:19Z

Hi all, I'm dealing with PDFs generated by a third-party tool, so I'd like to implement a quick check in my Python code to ensure each file is a valid PDF, for some very basic value of "valid".

I know I can open the file as a stream and check the first few bytes for '%PDF-' and the last few for '%%EOF\n', but I'm curious whether pypdf has a way to do this as well.
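For reference, a minimal sketch of that byte-level check using only the standard library. The header marker is `%PDF-` (a single percent sign) and the end marker is `%%EOF`. The function name `looks_like_pdf` and the 1024-byte search windows are assumptions made for this illustration: many readers tolerate a little junk before the header or after the end marker, so the sketch searches a window rather than demanding an exact match at the first and last byte.

```python
from pathlib import Path


def looks_like_pdf(path, window=1024):
    """Cheap structural check: header marker near the start, %%EOF near the end.

    This says nothing about whether the objects in between are intact.
    """
    size = Path(path).stat().st_size
    with open(path, "rb") as handle:
        head = handle.read(window)
        # Jump to the last `window` bytes (or the start, for small files).
        handle.seek(max(0, size - window))
        tail = handle.read()
    return b"%PDF-" in head and b"%%EOF" in tail
```

An empty file or a file cut off before its trailer returns `False`; a file that passes can still be damaged in the middle, so this is only a first filter.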

I've seen it mentioned in a few places that reading the page count with a PDF library usually does the trick, but I'm unclear whether that works in pypdf.

stefan6419846 · 2024-03-19T14:21:47Z

Maintainer

In general, it depends on what you mean by validity, i.e. to what extent you want to ensure compliance with the specification. In my experience, lots of the PDF files you will see have at least some small defect, so libraries tend to implement workarounds for common violations. To ensure proper compliance, a PDF/A validator and PDF/A-compliant files would be best.

Trying to iterate over the pages of a PDF file is at least a basic measure you could use, as is trying to extract all XObjects. Some might also consider a PDF file invalid if none of its pages has any content. Nevertheless, a file that pypdf or any other library cannot process is not necessarily invalid: there might be recoverable edge cases or features which just have not been implemented yet. Sometimes Ghostscript, MuPDF, PDFtk or similar tools are able to recover documents which appear to be broken at first sight.
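A rough sketch of the page-iteration idea, assuming a reader object shaped like `pypdf.PdfReader`: a `pages` sequence whose items offer `extract_text()` and an `images` collection whose entries expose `data`. The helper name `walk_pages` and the wording of the problem messages are made up for this example, and image XObjects stand in here for "all XObjects".

```python
def walk_pages(reader):
    """Touch every page and its images; return a list of problems found.

    `reader` is expected to behave like pypdf.PdfReader (assumption).
    A broad `except Exception` is used on purpose: damaged files can
    surface as many different exception types.
    """
    try:
        page_count = len(reader.pages)
    except Exception as exc:
        return [f"cannot read page tree: {exc!r}"]

    problems = []
    if page_count == 0:
        problems.append("document has no pages")

    for index in range(page_count):
        try:
            page = reader.pages[index]
            page.extract_text()        # forces the content stream to be parsed
            for image in page.images:  # forces image XObjects to be decoded
                _ = image.data
        except Exception as exc:
            problems.append(f"page {index + 1}: {exc!r}")
    return problems
```

With pypdf installed this would be called as `walk_pages(PdfReader("some.pdf"))`. An empty list only means pypdf could process the file; it is not proof of spec compliance.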

To sum this up: pypdf might give you a hint about whether a PDF is valid, but it might also fail for some PDF files that you would consider valid yourself. So it always depends on your requirements and on what you interpret as a "valid PDF file", and, if you are looking for a "quick check", on finding a good tradeoff between validation and speed.

Evan0000000000 · 2024-03-19T15:47:30Z

Author

Thanks Stefan, that's very thoughtful, and it touches on something I only half understood: adhering to the PDF /spec/ and being suitable for the use one has in mind are different things, since lots of PDFs that break the spec will still open in the reader of your choice, and so forth.

I think for my use case, then, "does pypdf.PdfReader() throw an exception" is a good starting point: it catches the situations that are most likely to occur (e.g., "the file is empty", "the file is truncated"), and I can build from there.
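Roughly what I have in mind, as a sketch rather than anything definitive. The helper name `try_open` and the `reader_factory` parameter are my own inventions; the factory defaults to `pypdf.PdfReader` and exists only so the helper can be exercised with a stub. Reading the page count is included because constructing the reader alone may not touch the page tree. Note that in its default non-strict mode pypdf may repair some damaged files instead of raising, so a pass here is a hint, not a guarantee.

```python
def try_open(path, reader_factory=None):
    """Return (ok, reason) for a quick 'does it open' check.

    reader_factory: hypothetical hook for testing; defaults to pypdf.PdfReader.
    """
    if reader_factory is None:
        # Imported lazily so the helper itself does not require pypdf to load.
        from pypdf import PdfReader
        reader_factory = PdfReader
    try:
        reader = reader_factory(path)
        page_count = len(reader.pages)
    except Exception as exc:
        # Report the exception type so empty and truncated files can be told apart.
        return False, f"{type(exc).__name__}: {exc}"
    return True, f"{page_count} page(s)"
```

Returning the reason alongside the boolean makes it easy to log why a third-party file was rejected.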
