How to identify blank pages #1065

mas94uk · 2022-07-06T09:00:16Z

I have a number of PDFs from scanned documents. I would like to identify and remove blank pages (2-sided scans of 1-sided documents). Is there a way to do this with PyPDF2?

If I could get a bitmap of a page, I could work out whether it was blank with a bit of heuristics (no more than x% of pixels darker than a certain value etc.). Is it possible to get a bitmap of a page? Is there another way?

Thanks!

MartinThoma · 2022-07-06T10:55:16Z

Maintainer

> Is it possible to get a bitmap of a page?

No, we don't support that feature at the moment. I think that would also be a really difficult feature.

> Is there another way?

Good question ... things that pop to my mind are not completely error-proof. For example:

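def is_blank(page) -> bool:
    """Return True if the page has no extractable text and no image XObject."""
    has_text = bool(page.extract_text().strip())

    has_image = False
    # Not every page has an "/XObject" entry, so check before indexing.
    if "/Resources" in page and "/XObject" in page["/Resources"]:
        x_object = page["/Resources"]["/XObject"].get_object()
        for obj in x_object:
            if x_object[obj]["/Subtype"] == "/Image":
                has_image = True
                break

    return not (has_text or has_image)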

This is not error-proof:

False-negatives (returning False when the page IS blank): For example, if you scanned both sides of a document and only one side is printed, you still have content on the "empty" page: a (mostly) white image. Knowing that this image has no real content is REALLY hard. You need computer vision to do that, which is definitely out of scope for PyPDF2. You can approach it with OpenCV.
False-positives (returning True when the page is NOT blank): I'm not sure if that can happen... maybe if there is a PDF object I didn't think about.

Also some things are not clear:

If there is an attachment on that page but nothing else, would you consider it blank?
If there is a clickable area on the page without any content, just a redirect to another page: Is it blank?
When you have a scanned completely white image you would probably consider it blank. Would you still say it's blank when there is some dust? If there is a coffee stain? If you can see a little bit of the content from the other side of the page? What if the page is completely black?

I can check if those happen / how often that happens on a bigger dataset, but I will likely not have the time to do so today.

0 replies

mas94uk · 2022-07-06T13:25:23Z

Author

Thanks for the detailed answer!

Indeed, I strongly suspect the "blank" pages are in fact just large not-very-interesting not-very-dark images -- that's why I talked about using heuristics and mentioned that they were scans of blank sheets. (At least in my specific use-case, with scanned documents, things like links and attachments are not a problem.)

I'll look into using something like pdf2image to do the blank page detection, and then use PyPDF2 to do the chopping.
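
For the "chopping" step, a minimal sketch might look like the following. It assumes the blank pages have already been identified by some other means and are passed in as 1-based page numbers. The helper names `pages_to_keep` and `remove_pages` are made up for illustration, and the sketch uses pypdf (`PdfReader`, `PdfWriter.add_page`, `PdfWriter.write`) rather than PyPDF2.

```python
from typing import Iterable, List


def pages_to_keep(page_count: int, blank_pages: Iterable[int]) -> List[int]:
    """Return the 0-based indices of all pages not listed as blank.

    `blank_pages` holds 1-based page numbers, matching how pages are
    usually counted in a viewer.
    """
    blank = set(blank_pages)
    return [i for i in range(page_count) if i + 1 not in blank]


def remove_pages(src_path: str, dst_path: str, blank_pages: Iterable[int]) -> None:
    """Write a copy of `src_path` to `dst_path` without the given pages."""
    # Imported here so the index helper above works without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for index in pages_to_keep(len(reader.pages), blank_pages):
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as handle:
        writer.write(handle)
```

With hypothetical file names, `remove_pages("scan.pdf", "scan-clean.pdf", blank_pages=[2, 4])` would write a copy of the scan without pages 2 and 4.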

2 replies

briangkatz Aug 23, 2024

@mas94uk Did you have any success with the pdf2image + PyPDF2 approach?

stefan6419846 Aug 23, 2024
Maintainer

There should be no need to actually use pypdf for this (PyPDF2 is not maintained anymore). pdf2image alone is enough, rendering either the mediabox (default) or the cropbox (optional) at an appropriate resolution. This generates a list of Pillow images, and you can call image.histogram() on each of them.

A simple example:

from pdf2image import convert_from_path


for index, image in enumerate(convert_from_path(pdf_path='test.pdf', dpi=100, use_pdftocairo=True), start=1):
    # It should be sufficient to work in black-and-white mode. Generate the histogram.
    # The histogram will have 256 values, where for black-and-white images only the first
    # and last value should be set (corresponding to black and white).
    histogram = image.convert('1').histogram()
    # Check if there are no black pixels.
    if histogram[0] == 0:
        print('Page', index, 'appears to be empty.')

You can of course modify the above code to not use black-and-white images, but grayscale (mode L, then checking everything except the last histogram value: all(x == 0 for x in image.convert('L').histogram()[:-1])) - or use another library like numpy for generating histograms. But as all of this is out of scope for pypdf and depends on your use case anyway, I will not be listing more options/details here.
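
A grayscale variant with some tolerance, closer to the heuristic from the original question (no more than x% of pixels darker than a certain value), could be sketched roughly as follows. The function names and both thresholds (`dark_below=200`, `max_dark_fraction=0.005`) are arbitrary assumptions that would need tuning on real scans. Working in mode L also avoids the dithering that `convert('1')` applies by default, which can turn light gray areas into scattered black pixels.

```python
from typing import List, Sequence


def dark_fraction(histogram: Sequence[int], dark_below: int = 200) -> float:
    """Fraction of pixels whose grayscale value is below `dark_below`.

    `histogram` is the 256-entry list returned by Pillow's
    `image.convert('L').histogram()` (index 0 = black, 255 = white).
    """
    total = sum(histogram)
    if total == 0:
        return 0.0
    return sum(histogram[:dark_below]) / total


def looks_blank(
    histogram: Sequence[int],
    dark_below: int = 200,
    max_dark_fraction: float = 0.005,
) -> bool:
    """Treat a page as blank if at most `max_dark_fraction` of it is dark."""
    return dark_fraction(histogram, dark_below) <= max_dark_fraction


def find_blank_pages(pdf_path: str, dpi: int = 100) -> List[int]:
    """Return the 1-based numbers of pages that look blank."""
    # Imported here so the histogram helpers above work without pdf2image.
    from pdf2image import convert_from_path

    blank = []
    for number, image in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        if looks_blank(image.convert("L").histogram()):
            blank.append(number)
    return blank
```

Under these thresholds a page with a little dust still counts as blank, while a completely black page does not. Whether that matches your definition of "blank" depends on the use case.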
