How to remove watermark with pypdf2 #2917

Estelle-gqy · 2024-10-22T13:57:44Z

I can use the following code to remove watermarks with PyPDF2 3.0.1, but it only works in a few situations.

import PyPDF2
from PyPDF2.generic import ContentStream, NameObject, TextStringObject
from PyPDF2._utils import b_

def pypdf2_remove_watermark(input_file, output_file):
    """
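    Blank out all text drawn with the Tj operator on every page.

    :param input_file: path of the PDF to read
    :param output_file: path of the PDF to write
    :return: None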
    """
    reader = PyPDF2.PdfReader(input_file)
    output = PyPDF2.PdfWriter()

    for page in reader.pages:
        content_object = page.get_contents()
        # content_object = page["/Contents"].getObject()
        content = ContentStream(content_object, reader)
        for operands, operator in content.operations:
            if operator == b_("Tj"):
                operands[0] = TextStringObject('')
                # _text = operands[0]
                # if isinstance(_text, str) and _text in WATERMARK_TEXT:
                #     print(_text)
                #     operands[0] = TextStringObject('')
        page[NameObject('/Contents')] = content
        output.add_page(page)

    # Write the new PDF file
    with open(output_file, "wb") as outputStream:
        output.write(outputStream)
        print("watermark removed!", output_file)

Most of the time, the extracted text contains both the regular text and the watermark text, and the extracted images also include watermark images. Does anyone know how to remove watermark text/images using pypdf 5.0.1?

stefan6419846 · 2024-10-22T14:32:49Z

Maintainer

I have converted your issue into a discussion which fits better.

First: PyPDF2 has long been deprecated and you should probably not use it anymore.

Removing watermarks from PDFs is probably not really legal, as there are usually reasons they have watermarks. I am going to assume that you are only doing this on PDF files you created yourself.

For text: Your approach looks correct when the filtering is enabled. Nevertheless, there are tons of different ways to typeset text in PDF files - texts are basically just a collection of characters with a specific position. You might be lucky to have text operators which are organized in groups, but this does not necessarily have to be the case. Additionally taking into account the text color or transformation matrix might help to avoid false positives, although this further complicates the analysis.
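As a rough illustration of what "filtering enabled" could look like with pypdf instead of PyPDF2, here is a minimal sketch. The names `find_watermark_ops`, `remove_text_watermark` and `watermark_texts` are made up for this example, and the exact-match rule (after stripping whitespace) is an assumption: it will miss watermarks that are split across several operators or use a custom font encoding, and it ignores color and the transformation matrix entirely.

```python
def find_watermark_ops(operations, watermark_texts):
    """Return indices of Tj/TJ operations whose text equals a known watermark string."""
    hits = []
    for index, (operands, operator) in enumerate(operations):
        if not operands:
            continue
        if operator == b"Tj":
            # Tj takes a single string operand
            text = operands[0] if isinstance(operands[0], str) else ""
        elif operator == b"TJ":
            # TJ takes an array mixing strings and kerning numbers
            text = "".join(part for part in operands[0] if isinstance(part, str))
        else:
            continue
        if text.strip() in watermark_texts:
            hits.append(index)
    return hits


def remove_text_watermark(input_file, output_file, watermark_texts):
    """Blank matching text operators on every page and write a new PDF."""
    # pypdf is imported here so the matching helper above can be used without it
    from pypdf import PdfReader, PdfWriter
    from pypdf.generic import ArrayObject, ContentStream, NameObject, TextStringObject

    reader = PdfReader(input_file)
    writer = PdfWriter()
    for page in reader.pages:
        content = ContentStream(page.get_contents(), reader)
        operations = content.operations
        for index in find_watermark_ops(operations, watermark_texts):
            operands, operator = operations[index]
            # Replace the shown text with an empty string (Tj) or empty array (TJ)
            operands[0] = TextStringObject("") if operator == b"Tj" else ArrayObject()
        page[NameObject("/Contents")] = content
        writer.add_page(page)
    with open(output_file, "wb") as stream:
        writer.write(stream)
```

A call might look like `remove_text_watermark("in.pdf", "out.pdf", {"CONFIDENTIAL"})`, where the file names and the watermark string are placeholders.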

For images: You are not specifying further details about this. If the watermark is just an image, you should be able to use its .replace method. If you are extracting the individual images and they contain watermarks, there is nothing pypdf can do, as it does not do any image manipulation.

2 replies

Estelle-gqy Oct 24, 2024
Author

If the watermark is just an image, how can I tell whether an image is a watermark or a normal image? Can you give me an example of using the .replace method?

stefan6419846 Oct 24, 2024
Maintainer

This highly depends on the watermark and is out of scope for pypdf - you might be able to do this with pattern matching.
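Purely as an illustration of one such pattern, the sketch below flags images whose bytes are identical on every page, on the assumption that a watermark is stamped on all pages while normal images are not. That heuristic and the names `repeated_on_every_page` and `candidate_watermark_images` are assumptions for this example, not something pypdf provides. In a single-page file every image would be flagged, so the result needs a manual check.

```python
import hashlib
from collections import Counter


def repeated_on_every_page(digests_per_page):
    """Return the digests that occur on every page (candidate watermarks)."""
    if not digests_per_page:
        return set()
    counts = Counter()
    for digests in digests_per_page:
        # Count each digest at most once per page
        counts.update(set(digests))
    return {digest for digest, n in counts.items() if n == len(digests_per_page)}


def candidate_watermark_images(pdf_path):
    """Hash the image data on each page and return the hashes seen on all pages."""
    # Imported here so the helper above works without pypdf;
    # accessing page.images additionally requires Pillow.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    per_page = [
        [hashlib.sha256(img.data).hexdigest() for img in page.images]
        for page in reader.pages
    ]
    return repeated_on_every_page(per_page)
```

The returned hashes could then be compared against `hashlib.sha256(img.data).hexdigest()` inside a loop over `page.images` to decide which images to replace.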

A simple example for replacing images might look like this (adapted from the docs at https://pypdf.readthedocs.io/en/stable/user/file-size.html#reducing-image-quality):

from pypdf import PdfWriter
from PIL import Image

writer = PdfWriter(clone_from="example.pdf")

for page in writer.pages:
    for img in page.images:
        with Image.open("target.jpg") as target:
            img.replace(target)

with open("out.pdf", "wb") as f:
    writer.write(f)

For the API docs, see https://pypdf.readthedocs.io/en/stable/modules/PageObject.html#pypdf._page.ImageFile.replace
