No, we don't support that feature at the moment. I think that would also be a really difficult feature.
Good question ... the things that come to mind are not completely error-proof. For example:

    def is_blank(page) -> bool:
        # Treat a page as blank if it has neither extractable text nor image XObjects.
        has_text = bool(page.extract_text().strip())
        has_image = False
        resources = page["/Resources"]
        # Pages without images may have no "/XObject" entry at all.
        if "/XObject" in resources:
            x_object = resources["/XObject"].get_object()
            for obj in x_object:
                if x_object[obj]["/Subtype"] == "/Image":
                    has_image = True
        return not (has_text or has_image)

This is not error-proof:
Also some things are not clear:
I can check if those happen / how often that happens on a bigger dataset, but I will likely not have the time to do so today.
-
Thanks for the detailed answer! Indeed, I strongly suspect the "blank" pages are in fact just large, uninteresting, not-very-dark images. That's why I talked about using heuristics and mentioned that they were scans of blank sheets. (At least in my specific use case, with scanned documents, things like links and attachments are not a problem.) I'll look into using something like pdf2image to do the blank page detection, and then use PyPDF2 to do the chopping.
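For reference, the "no more than x% of pixels darker than a certain value" heuristic could be sketched roughly as below. This is only an illustration, not something PyPDF2 provides: the function names `dark_fraction` and `looks_blank`, the threshold of 128, and the 1% cutoff are all assumptions and would need tuning against real scans.

```python
def dark_fraction(pixels, dark_threshold=128):
    """Fraction of grayscale pixels (0 = black, 255 = white) below the threshold.

    `pixels` is any iterable of integers in 0..255; the default threshold
    is an illustrative assumption, not a recommended value.
    """
    total = 0
    dark = 0
    for value in pixels:
        total += 1
        if value < dark_threshold:
            dark += 1
    # An empty input has no dark pixels by definition.
    return dark / total if total else 0.0


def looks_blank(pixels, dark_threshold=128, max_dark_fraction=0.01):
    """Guess that a page is blank if at most `max_dark_fraction` of it is dark."""
    return dark_fraction(pixels, dark_threshold) <= max_dark_fraction
```

With pdf2image, `convert_from_path(path)` returns one PIL image per page, and `image.convert("L").tobytes()` should give the grayscale values that `looks_blank` expects. The pages that are not flagged could then be copied into a new file with PyPDF2.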
-
I have a number of PDFs from scanned documents. I would like to identify and remove blank pages (2-sided scans of 1-sided documents). Is there a way to do this with PyPDF2?
If I could get a bitmap of a page, I could work out whether it was blank with a bit of heuristics (no more than x% of pixels darker than a certain value etc.). Is it possible to get a bitmap of a page? Is there another way?
Thanks!