Can't extract images for this PDF #3948

bbfrog · 2024-10-10T18:23:31Z

bbfrog
Oct 10, 2024

Description of the bug

Monaleesa_full.pdf
PyMuPDF can't extract the images on pages 2 and 4 of this PDF.

How to reproduce the bug

import pymupdf

doc = pymupdf.open('Monaleesa_full.pdf')

# Count the embedded raster images reported for each page (1-based page numbers).
for page_num, page in enumerate(doc, start=1):
    images = page.get_images(full=True)
    print(f'page {page_num}: {len(images)} images')

PyMuPDF version

1.24.11

Operating system

macOS

Python version

3.12

Answered by JorjMcKie

Oct 12, 2024


JorjMcKie · 2024-10-10T20:16:16Z

JorjMcKie
Oct 10, 2024
Maintainer

Except for page 7 (0-based), none of the pages contains an image.
What you see are vector graphics, not images.
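To check this for any given file, a rough sketch like the following compares, per page, the number of embedded images with the number of vector drawing paths. The helper names `summarize_page` and `report` are made up for illustration; `page.get_images()` and `page.get_drawings()` are the PyMuPDF calls doing the work.

```python
def summarize_page(n_images: int, n_drawings: int) -> str:
    """Describe what kind of graphics a page holds, given the two counts."""
    if n_images and n_drawings:
        return "images and vector graphics"
    if n_images:
        return "images only"
    if n_drawings:
        return "vector graphics only"
    return "no graphics"


def report(path: str) -> None:
    """Print a one-line summary per page (hypothetical helper, 0-based page numbers)."""
    import pymupdf  # imported here so summarize_page() is usable on its own

    doc = pymupdf.open(path)
    for page in doc:
        n_images = len(page.get_images(full=True))  # embedded raster images
        n_drawings = len(page.get_drawings())       # vector drawing paths
        print(f"page {page.number}: {summarize_page(n_images, n_drawings)}")
```

Calling `report("Monaleesa_full.pdf")` prints one line per page. A page that shows a figure but is reported as "vector graphics only" is the case described here.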


JorjMcKie · 2024-10-10T20:18:01Z

JorjMcKie
Oct 10, 2024
Maintainer

Vector graphics cannot be extracted. All you can do is make a "photo" of the respective page area ...


bbfrog · 2024-10-12T05:38:37Z

bbfrog
Oct 12, 2024
Author

The Acrobat API can extract vector graphics and save them as PNG or SVG. How does it do this? Would it be hard to support in PyMuPDF? Thanks!


JorjMcKie · 2024-10-12T06:36:38Z

JorjMcKie
Oct 12, 2024
Maintainer

You can try this script. Or do this:

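import pymupdf

doc = pymupdf.open("input.pdf")
for page in doc:
    # cluster_drawings() returns bounding boxes of grouped vector graphics
    for i, bbox in enumerate(page.cluster_drawings()):
        # Render ("photograph") just that page area as a raster image
        pix = page.get_pixmap(clip=bbox, dpi=150)
        pix.save(f"{doc.name}-{page.number}-{i}.png")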


bbfrog · 2024-10-15T19:11:35Z

bbfrog
Oct 15, 2024
Author

Thanks very much, @JorjMcKie. It works and extracts the image I want. But it also extracts the tables in this PDF as drawings. Is there any field that can differentiate tables from other drawings? Thanks!

2 replies

JorjMcKie Oct 15, 2024
Maintainer

No, the recognition of vector graphics is based on syntax algorithms only, not on a semantic understanding of what it is that is wrapped by lines, etc.

If, however, you are lucky enough that PyMuPDF can identify the tables as tables ... then you could check whether those table bounding boxes coincide with the identified vector graphic boxes, and thus exclude them from picture conversion.
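One way this check might look is sketched below, under a few assumptions: "coincide" is interpreted as "at least 80% of the drawing cluster's area lies inside a detected table", and the helper names `overlap_ratio`, `non_table_boxes`, and `save_non_table_drawings` as well as the `threshold` value are invented for illustration. The PyMuPDF calls used are `page.find_tables()` and `page.cluster_drawings()`.

```python
Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)


def overlap_ratio(a: Box, b: Box) -> float:
    """Return the fraction of box a's area that is covered by box b."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    if ix1 <= ix0 or iy1 <= iy0 or area_a <= 0:
        return 0.0
    return (ix1 - ix0) * (iy1 - iy0) / area_a


def non_table_boxes(drawing_boxes, table_boxes, threshold=0.8):
    """Keep drawing boxes that are not mostly covered by any table box.

    threshold is an assumed cut-off, not a PyMuPDF setting.
    """
    return [
        d for d in drawing_boxes
        if all(overlap_ratio(d, t) < threshold for t in table_boxes)
    ]


def save_non_table_drawings(path: str, dpi: int = 150) -> None:
    """Render every drawing cluster that does not coincide with a detected table."""
    import pymupdf  # imported here so the pure helpers above work without it

    doc = pymupdf.open(path)
    for page in doc:
        tables = [tuple(t.bbox) for t in page.find_tables().tables]
        clusters = [tuple(r) for r in page.cluster_drawings()]
        for i, box in enumerate(non_table_boxes(clusters, tables)):
            pix = page.get_pixmap(clip=pymupdf.Rect(box), dpi=dpi)
            pix.save(f"{doc.name}-{page.number}-{i}.png")
```

This only helps on pages where table detection succeeds. Tables that PyMuPDF does not recognize will still be rendered as pictures, and the threshold may need tuning per document.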

bbfrog Oct 15, 2024
Author

Thanks. ChatGPT works for identifying whether an image is a table or a plot :)
