Approaches for checking a PDF for validity or completeness #2530
Replies: 2 comments
-
In general, it depends on what you mean by validity, that is, how strictly you want to enforce compliance with the specification. In my experience, many PDF files have at least some small defect, so libraries tend to implement workarounds for common violations. If you need strict compliance, a PDF/A validator combined with PDF/A-compliant files would be the best option.

Iterating over the pages of a PDF file is a basic check you could run, as is trying to extract all XObjects. Some might also consider a PDF file invalid if none of its pages has any content.

Conversely, a file that pypdf or any other library cannot process is not necessarily invalid: there might be recoverable edge cases, or features which have not been implemented yet. Sometimes Ghostscript, MuPDF, PDFtk or similar tools can recover documents which appear broken at first sight.

To sum up: pypdf can give you a hint about whether a PDF is valid, but it might also fail on files you would consider valid yourself. It always depends on your requirements and on what you interpret as a "valid PDF file", and, if you are looking for a "quick check", on finding a good tradeoff between validation depth and speed.
-
Thanks Stefan, that's very thoughtful, and it touches on something I only half understood: adhering to the PDF spec and being fit for a particular use are different things, since lots of PDFs that break the spec will still open in the reader of your choice. For my use case, then, "does pypdf.PdfReader() throw an exception" is a good starting point: it catches the situations most likely to occur (e.g. "the file is empty", "the file is truncated"), and I can build from there.
-
Hi all, I'm dealing with PDFs generated by a third-party tool and would like to implement a quick check in my Python code to ensure the file is a valid PDF, for some very basic value of valid.
I know I can open the file as a stream and check the first few bytes for '%PDF-' and the last few for '%%EOF\n', but I'm curious whether pypdf has a way to do this as well.
I've seen it mentioned in a few places that reading the page count with a PDF library usually does the trick, but I'm unclear whether that works in pypdf.
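For reference, the byte-level check described in this question can be sketched with the standard library alone. The function name `has_pdf_markers` and the `tail_size` parameter are made up for illustration. The sketch expects the `%PDF-` header at the very start of the file and looks for `%%EOF` anywhere in the last `tail_size` bytes, so trailing newlines or whitespace after the marker do not matter.

```python
import os


def has_pdf_markers(path, tail_size=1024):
    """Return True if the file starts with b'%PDF-' and has b'%%EOF' near its end.

    Hypothetical helper for illustration; it does not parse the PDF at all.
    """
    with open(path, "rb") as fh:
        head = fh.read(5)
        # Jump to the end to learn the size, then read only the last chunk.
        fh.seek(0, os.SEEK_END)
        size = fh.tell()
        fh.seek(max(0, size - tail_size))
        tail = fh.read()
    return head == b"%PDF-" and b"%%EOF" in tail
```

This is cheap and catches empty or obviously truncated files, but it says nothing about whether the content in between can actually be parsed.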