Missing images in document with duplicated images #2472

jnsebgosselin · 2024-02-26T14:32:12Z

jnsebgosselin
Feb 26, 2024

I want to create a Microsoft Word document containing a placeholder image, and then use pypdf to replace that placeholder with the proper images after the document has been converted to PDF.

I created the attached document with Microsoft Word, using the same image three times. However, when I extract the images with pypdf, only one image is listed. I think Word now checks for duplicate images to save space (when I look inside the .docx file with 7-Zip, only one image is stored there).

Is there something that can be done to differentiate these images with pypdf, other than using distinct images for the placeholders?

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-10-10.0.19045-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.0.2, crypt_provider=('cryptography', '42.0.1'), PIL=8.2.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

pdf_fpath = 'placeholder.pdf'
reader = PdfReader(pdf_fpath)
page = reader.pages[0]
images = page.images
# Only one image is listed, although the image is used three times.
print(len(images))

placeholder.pdf

Answered by pubpub-zz

Feb 27, 2024

pubpub-zz · 2024-02-27T04:49:40Z

pubpub-zz
Feb 27, 2024
Maintainer

The images property extracts the images stored as resources attached to the page. As you noticed, an image that is called multiple times appears only once.
To see the individual calls you have to parse the page content. The easiest way is to get the content as a list of operations and look for Do operations, or BI for inline images.
Two caveats:
Do operations 'call' images but also sub-drawings (form XObjects), so you have to check that the called object is an image.
Some images are included inside sub-drawings.
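
As a rough illustration of the approach described above, here is a minimal sketch that counts how often each image resource is drawn. The helper names (count_image_calls, count_image_calls_on_page, _resolve) are made up for this example. It assumes the page's /Resources dictionary sits directly on the page rather than being inherited from a parent node, it does not descend into sub-drawings, and it ignores inline images, so treat it as a starting point rather than a complete solution.

```python
from collections import Counter


def _resolve(obj):
    """Follow a pypdf indirect reference if obj is one (hypothetical helper)."""
    return obj.get_object() if hasattr(obj, "get_object") else obj


def count_image_calls(operations, xobjects):
    """Count how many times each image XObject is drawn by a Do operator.

    operations: iterable of (operands, operator) pairs, the shape used by
        pypdf's ContentStream.operations (operator is bytes, e.g. b"Do").
    xobjects: mapping of resource names (e.g. "/Image1") to XObject
        dictionaries, as found under /Resources -> /XObject.
    """
    counts = Counter()
    for operands, operator in operations:
        if operator != b"Do" or not operands:
            continue
        name = operands[0]
        if name not in xobjects:
            continue
        xobj = _resolve(xobjects[name])
        # Do also invokes form XObjects (sub-drawings), so check the subtype.
        if xobj.get("/Subtype") == "/Image":
            counts[str(name)] += 1
    return counts


def count_image_calls_on_page(pdf_path, page_index=0):
    """Apply count_image_calls to one page of a PDF file."""
    from pypdf import PdfReader  # imported here so the pure helper has no dependency

    reader = PdfReader(pdf_path)
    page = reader.pages[page_index]
    contents = page.get_contents()
    if contents is None:
        return Counter()
    # Assumption: resources are stored on the page itself, not inherited.
    resources = _resolve(page.get("/Resources", {}))
    xobjects = _resolve(resources.get("/XObject", {}))
    return count_image_calls(contents.operations, xobjects)
```

Calling count_image_calls_on_page('placeholder.pdf') should then report one resource name with its number of calls, provided the images are drawn directly from the page content and not from inside a sub-drawing.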

0 replies

jnsebgosselin · 2024-02-27T15:54:21Z

jnsebgosselin
Feb 27, 2024
Author

Thanks for the help, @pubpub-zz!

If I locate the calls to the image, would it be feasible to remove those calls and insert another image in their place?

2 replies

pubpub-zz Feb 27, 2024
Maintainer

If you mean using the .replace() function, this replaces the internal object, so the new image will be displayed in all positions.
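
For reference, a minimal sketch of that .replace() route. The function name and its parameters are made up for this example, and it assumes Pillow is installed and that the placeholder is the first image resource on the page. Because the underlying object is shared, every position that draws the placeholder ends up showing the new picture.

```python
def replace_placeholder_everywhere(src_path, dst_path, new_image_path, page_index=0):
    """Swap the first image resource on a page for another picture.

    All calls to that resource share one object, so the new picture
    shows up at every position where the placeholder was drawn.
    """
    # Imported lazily so this sketch can be loaded without pypdf/Pillow.
    from PIL import Image
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)

    # The image must belong to a PdfWriter page to be replaceable.
    page = writer.pages[page_index]
    placeholder = page.images[0]  # assumption: the placeholder is listed first
    placeholder.replace(Image.open(new_image_path))

    with open(dst_path, "wb") as fh:
        writer.write(fh)
```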

jnsebgosselin Mar 5, 2024
Author

No, not exactly. What I would like to do is replace each call to the internal object (the placeholder image) with a different image.

Anyway, even though this is not what I was hoping for, I now understand that what I want to do will require a lot more work than expected.
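
To give an idea of what the first step of that extra work might look like, here is a rough sketch that gives every Do call to the placeholder its own resource name. The function rename_placeholder_calls and its make_name parameter are hypothetical. It works on a list of (operands, operator) pairs like the one pypdf exposes as ContentStream.operations, and it assumes the generated names do not clash with existing resources. Registering a separate image XObject under each new name in the page resources and writing the modified operations back to the page are not shown.

```python
def rename_placeholder_calls(operations, placeholder, make_name=str):
    """Give every Do call to `placeholder` its own resource name.

    Returns (new_operations, new_names). All other operations are passed
    through unchanged, and the input list is not modified.
    With pypdf, make_name would presumably be pypdf.generic.NameObject.
    """
    new_operations = []
    new_names = []
    for operands, operator in operations:
        if operator == b"Do" and operands and operands[0] == placeholder:
            # "/Image1" becomes "/Image1_1", "/Image1_2", ... in call order.
            new_name = make_name(f"{placeholder}_{len(new_names) + 1}")
            new_names.append(new_name)
            operands = [new_name]
        new_operations.append((operands, operator))
    return new_operations, new_names
```

Each name in new_names would then need its own image entry under /Resources -> /XObject before the page renders correctly.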

Thank you again for your help. I will mark this as resolved.
