Extracting images in context #677

petermr · 2022-07-02T07:56:21Z

petermr
Jul 2, 2022

I want to extract images using pdfplumber while retaining knowledge of their context (page_number and coordinates on page). (Some tools only emit image files with non-semantic names.)

My current (arbitrary) scheme is to create filenames of the form:

image_<page>_<serial_in_page>_<x1>_<x2>__<y1>_<y2>.png

I'm hoping that there is a single way of getting this in pdfplumber. Currently I have 2 approaches:

pdf2txt

I can run:

pdf2txt.py fulltext.pdf --output-dir images

which gives:

(base) pm286macbook:Chapter04 pm286$ ls images/
Im0.0.bmp	Im0.1.bmp	Im0.2.bmp	Im0.4.bmp	Im0.bmp
Im0.0.jpg	Im0.1.jpg	Im0.3.bmp	Im0.5.bmp	Im0.jpg

This gets the images I want but the naming is impenetrable. The Im<d> is occasionally incremented to Im1, Im2, etc., sometimes with and sometimes without a minor index. I don't even know how to map these onto the order in the document.

pdfplumber

I have a "debugger" for pdfplumber in https://github.com/petermr/pyami/blob/main/py4ami/ami_pdf.py (messy as I'm still digging!) with method print_images

    @classmethod
    def print_images(cls, page, maximage=10, outdir=None):
        write_image = True
        resolution = 400  # may be better
        from pathlib import Path
        from pdfminer.image import ImageWriter
        from pdfminer.layout import LTImage
        if not outdir:
            print("no output dir given")
            return
        # parenthesise the walrus so n_image holds the count, not the result of the comparison
        if (n_image := len(page.images)) > 0:
            print(f"images {n_image}", end=" | ")
            # create the output directory once, before looping over the images
            path = Path(outdir, "images")
            path.mkdir(parents=True, exist_ok=True)
            for i, image in enumerate(page.images[:maximage]):
                print(f"image: {type(image)}: {image.values()}")

If I knew how to get an LTImage I could probably export it here:

                if isinstance(image, LTImage):
                    # ImageWriter takes an output directory and chooses the file name itself
                    imagewriter = ImageWriter(str(path))
                    imagewriter.export_image(image)
                page_height = page.height

I can at least extract the coordinates:

                image_bbox = (image[X0], page_height - image[Y1], image[X1], page_height - image[Y0])
                print(f"image: {image_bbox}")

I save this string, see below

I can get the images by screen capture, but this can lose information and the image is also overwritten by a watermark:


                cropped_page = page.crop(image_bbox)  # crop screen display (may have overwriting text)
                image_obj = cropped_page.to_image(resolution=resolution)
                path1 = Path(path, f"image_{page.page_number}_{i}_{cls.format_bbox(image_bbox)}.png")
                if write_image:
                    image_obj.save(path1)
                    print(f" wrote image {path1}")
                continue

These are the coordinates I extracted for filenames

coord image_8_0_72_523_177_428
coord image_8_1_72_523_436_638
coord image_9_0_80_514_72_298
coord image_11_0_101_493_225_448
coord image_12_0_77_305_198_352
coord image_12_1_77_305_423_589
coord image_12_2_313_513_215_352
coord image_12_3_311_512_423_566
coord image_13_0_72_523_71_523
coord image_14_0_77_293_396_543
coord image_14_1_303_519_396_546
coord image_16_0_72_523_175_356
coord image_18_0_72_523_119_420
coord image_20_0_133_461_278_494
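
For anyone wanting to reproduce these names, here is a minimal sketch of the naming step. The `format_bbox` method used in `print_images` is not shown above, so the version below is a guess reverse-engineered from the listed names (values truncated to whole points, ordered x0, x1, top, bottom). `top_left_bbox` and `image_filename` are hypothetical helper names, and the page height of 841.92 is inferred from the page 8 values rather than read from the PDF.

```python
from typing import Mapping, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom), origin at top-left


def top_left_bbox(image: Mapping[str, float], page_height: float) -> BBox:
    """Flip PDF coordinates (origin bottom-left) into the order page.crop() expects."""
    return (image["x0"], page_height - image["y1"],
            image["x1"], page_height - image["y0"])


def format_bbox(bbox: BBox) -> str:
    """Assumed format: x0_x1_top_bottom, each truncated to an integer."""
    x0, top, x1, bottom = bbox
    return "_".join(str(int(v)) for v in (x0, x1, top, bottom))


def image_filename(page_number: int, serial: int, bbox: BBox) -> str:
    """Build image_<page>_<serial>_<bbox>.png."""
    return f"image_{page_number}_{serial}_{format_bbox(bbox)}.png"


# Coordinates of the first image on page 8; the page height is an assumption.
image = {"x0": 72.0, "y0": 412.99, "x1": 523.3, "y1": 664.64}
bbox = top_left_bbox(image, page_height=841.92)
print(image_filename(8, 0, bbox))
```

With these inputs the result should match the first `coord` entry above. The pdfplumber image dicts also carry `top` and `bottom` keys, so the flip can probably be skipped by reading those directly.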

Thanks!

petermr · 2022-07-11T09:40:35Z

petermr
Jul 11, 2022
Author

OK,
This is obviously a hard problem - I'll have a go at it. (Happy if anyone wants to help)

pdf2txt gets the images but without page and page coordinates.
pdfplumber gets the page coordinates but without the bitmap. (a clipped screenshot is not good enough).

approaches

1. Hack through the PDFStream stuff (again I'd love help here). It probably means reading the pdfminer code and finding out what's going on. Getting the color maps and the FlateDecode handling correct may be tricky.
2. Really hacky. Convert geometric scale of pdfminer output to pdfplumber images and correlate them bitwise to try to get correspondence so we can map pdfminer filenames to pdfplumber.
3. Hope to find some other way of ordering the pdfminer output (maybe precise datetimes?). Unlikely.

1 reply

jsvine Jul 11, 2022
Maintainer

Really interesting challenge, @petermr! If you're only after those images and their coordinates, you may actually be better off just with pdfminer.six, sans pdfplumber. My instinct — admittedly not having tested this out — would be to do something like the following:

1. Grab all LTImage objects via pdfminer.high_level.extract_pages(...), taking this opportunity to set a .page_number attribute on each object.
2. Monkeypatch pdfminer.ImageWriter's _create_unique_image_name(...) method so that it grabs the x/y coordinates (and the .page_number attribute from the previous step) from the LTImage object passed to it and generates the filename based on those.
3. Run imagewriter.export_image(image_obj) on each of the objects gathered in the first step.
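
A rough sketch of these steps, untested against a real PDF. It swaps the monkeypatch for something simpler: it renames each LTImage before export, on the assumption that `ImageWriter.export_image` derives the file name from `image.name` (it appears to in the pdfminer.six versions I have read, but the naming helper is private and could change). `iter_images`, `coordinate_name` and `export_images` are hypothetical names. The coordinates are pdfminer's, with the origin at the bottom-left of the page.

```python
from typing import Iterator, List, Tuple


def iter_images(layout_obj) -> Iterator["LTImage"]:
    """Recursively yield the LTImage objects nested inside a layout object."""
    # Imported here so the pure naming helper works without pdfminer.six installed.
    from pdfminer.layout import LTContainer, LTImage

    if isinstance(layout_obj, LTImage):
        yield layout_obj
    elif isinstance(layout_obj, LTContainer):  # LTPage and LTFigure are containers
        for child in layout_obj:
            yield from iter_images(child)


def coordinate_name(page_number: int, serial: int,
                    bbox: Tuple[float, float, float, float]) -> str:
    """Encode page, serial and (x0, y0, x1, y1) into a file stem."""
    x0, y0, x1, y1 = bbox
    return f"image_{page_number}_{serial}_{int(x0)}_{int(x1)}_{int(y0)}_{int(y1)}"


def export_images(pdf_path: str, outdir: str) -> List[str]:
    """Export every image, named after its page and position; return the file names."""
    from pdfminer.high_level import extract_pages
    from pdfminer.image import ImageWriter

    writer = ImageWriter(outdir)
    written = []
    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for serial, image in enumerate(iter_images(page_layout)):
            # Assumption: export_image() builds the output name from image.name.
            image.name = coordinate_name(page_number, serial, image.bbox)
            written.append(writer.export_image(image))
    return written
```

Calling `export_images("fulltext.pdf", "images")` would then be the whole job, with the file extension still chosen by pdfminer.six.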

petermr · 2022-07-12T07:12:13Z

petermr
Jul 12, 2022
Author

Thanks @jsvine , makes sense! I'm not familiar with pdfminer.six architecture and will welcome any guidance.

I am also happy to run a separate program, write to file, and pick up the results in pdfplumber. I asked about this strategy on StackOverflow (https://stackoverflow.com/questions/72936759/extracting-images-from-pdf-with-page-and-screen-coordinate-information). The discussion so far (it's not an answer) suggests it's very complex, with references rather than objects and multiple alternate approaches.

I'll do a bit of exploring and record progress here. It won't be immediate.

FWIW we are not only extracting the images, but also extracting text from them using a variety of OCR tools (pytesseract, easyocr) and converting to structured HTML. That's why we need the original, not a clipped screenshot.

0 replies

petermr · 2022-07-13T09:08:31Z

petermr
Jul 13, 2022
Author

I think I have a Horrible Hack that solves my problem 99%.

1. Use another tool (e.g. pdfminer.six) to extract the images. They may have meaningless names:

Im0.0.bmp	Im0.13.bmp	Im0.2.bmp	Im0.4.bmp	Im0.7.jpg	Im1.0.bmp	Im1.jpg
Im0.0.jpg	Im0.14.bmp	Im0.2.jpg	Im0.4.jpg	Im0.8.bmp	Im1.1.bmp	Im2.bmp
Im0.1.bmp	Im0.15.bmp	Im0.20.bmp	Im0.5.bmp	Im0.8.jpg	Im1.2.bmp	Im3.bmp
...

but they will contain:

xrange, yrange, bytecount

2. Use pdfplumber to extract the screen coords and image size (this is all extractable in PDFStream).
3. Use the image size and bytecount to map the pdfminer.six image to the pdfplumber screen coords.

There may be collisions but if we do it on a per-page basis in pdfminer.six it will work for one image per page and has a good chance of not colliding for multiple images.

e.g.

        with pdfplumber.open(IPCC_CHAP6_PDF) as pdf:
            pages = list(pdf.pages)
            for page in pages[:maxpage]:
                pdf_debug.debug_page_properties(page, debug=[WORDS, IMAGES], outdir=outdir)
        pdf_debug.write_summary(outdir=outdir)
        print(f"pdf_debug {pdf_debug.image_dict}")
        assert pdf_debug.image_dict == {
            ((1397, 779), 143448): (8, (72.0, 523.3), (412.99, 664.64)), # ((width,height),bytes) : (page,(x0,x1), (y0, y1))
            ((1466, 655), 122016): (8, (72.0, 523.3), (203.73, 405.38)),
            ((1634, 854), 204349): (9, (80.9, 514.25), (543.43, 769.92))
        }
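
A dictionary like the one asserted above could be built and queried roughly as follows. This is a sketch, not the code in `pdf_debug`: `build_image_index` and `locate` are hypothetical names, and `nbytes` is a hypothetical key that would have to be filled in from the image's PDFStream (for example from its `Length` entry). Keeping a list per key makes collisions visible instead of silently overwriting an entry.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Optional, Tuple

Key = Tuple[Tuple[int, int], int]  # ((width, height), bytecount)
Location = Tuple[int, Tuple[float, float], Tuple[float, float]]  # (page, (x0, x1), (y0, y1))


def build_image_index(images: Iterable[Mapping]) -> Dict[Key, List[Location]]:
    """Group page locations by ((width, height), bytecount)."""
    index: Dict[Key, List[Location]] = defaultdict(list)
    for im in images:
        key = (tuple(im["srcsize"]), im["nbytes"])
        index[key].append(
            (im["page_number"], (im["x0"], im["x1"]), (im["y0"], im["y1"]))
        )
    return dict(index)


def locate(index: Dict[Key, List[Location]],
           size: Tuple[int, int], nbytes: int) -> Optional[Location]:
    """Return the single matching location, or None if missing or ambiguous."""
    hits = index.get((tuple(size), nbytes), [])
    return hits[0] if len(hits) == 1 else None


# The three records from the assertion above, written as plain dicts.
records = [
    {"srcsize": (1397, 779), "nbytes": 143448, "page_number": 8,
     "x0": 72.0, "x1": 523.3, "y0": 412.99, "y1": 664.64},
    {"srcsize": (1466, 655), "nbytes": 122016, "page_number": 8,
     "x0": 72.0, "x1": 523.3, "y0": 203.73, "y1": 405.38},
    {"srcsize": (1634, 854), "nbytes": 204349, "page_number": 9,
     "x0": 80.9, "x1": 514.25, "y0": 543.43, "y1": 769.92},
]
index = build_image_index(records)
print(locate(index, (1397, 779), 143448))
```

For an extracted file, the size would come from the image header and the byte count from the file size, which presumably only matches the stream length when the extractor writes the stream bytes unchanged.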

NOTE. I have been looking for other image extractors and they may be better. I do not like JPGs as they lose info and I don't think they are in the original PDF. But it's all messy.

0 replies

petermr · 2022-07-13T14:40:15Z

petermr
Jul 13, 2022
Author

Hmm.
pdfminer.six (pdf2txt.py) extracts *.bmp and *.jpg files without giving me any control over the format: I have to accept whatever the program emits. I'd prefer a lossless format to JPG (assuming that the bit stream is not already JPG). The *.bmp files are extracted, but with a completely wrong color map.

The good news is that I can extract per-page using

pdf2txt.py fulltext.pdf --pagenos 8 --output-dir page8/

which means many of the images can be automatically identified and there is only ambiguity for images which have exactly the same dimensions and the same compressed bytecount.
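
Looping over pages could be scripted along these lines. This is only a sketch: it uses the same two pdf2txt.py flags as the command above, assumes pdf2txt.py is on the PATH, and `pdf2txt_command` and `extract_per_page` are hypothetical names.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List


def pdf2txt_command(pdf: str, page: int, outroot: str) -> List[str]:
    """Build the pdf2txt.py call that dumps one page's images into <outroot>/page<N>."""
    outdir = Path(outroot, f"page{page}")
    return ["pdf2txt.py", pdf, "--pagenos", str(page), "--output-dir", str(outdir)]


def extract_per_page(pdf: str, pages: Iterable[int], outroot: str = ".") -> None:
    """Run pdf2txt.py once per page so each image directory belongs to a known page."""
    for page in pages:
        # pdf2txt.py also prints the page text to stdout; discard it here.
        subprocess.run(pdf2txt_command(pdf, page, outroot),
                       check=True, stdout=subprocess.DEVNULL)
```

For example, `extract_per_page("fulltext.pdf", range(8, 13), "pages")` would leave one directory per page from 8 to 12.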

0 replies

petermr · 2022-07-13T15:54:59Z

petermr
Jul 13, 2022
Author

Have used PyMuPDF, aka fitz

pm286macbook:Chapter06 pm286$ python -m fitz extract -h
usage: fitz extract [-h] [-images] [-fonts] [-output OUTPUT] [-password PASSWORD] [-pages PAGES] input

--------------------- extract images and fonts to disk --------------------

positional arguments:
  input               PDF filename

optional arguments:
  -h, --help          show this help message and exit
  -images             extract images
  -fonts              extract fonts
  -output OUTPUT      folder to receive output, defaults to current
  -password PASSWORD  password
  -pages PAGES        consider these pages only, format: 1,5-7,50-N
(base) pm286macbook:Chapter06 pm286$ 
(base) pm286macbook:Chapter06 pm286$ 
(base) pm286macbook:Chapter06 pm286$ python -m fitz extract fulltext.pdf -images -pages 8-12 -output fitz/
output directory fitz/ does not exist
(base) pm286macbook:Chapter06 pm286$ mkdir fitz
(base) pm286macbook:Chapter06 pm286$ python -m fitz extract fulltext.pdf -images -pages 8-12 -output fitz/
Warning: unsupported /SMask 24 for 25:
Pixmap(DeviceGray, IRect(0, 0, 1197, 682), 0)
Warning: unsupported /SMask 28 for 29:
Pixmap(DeviceGray, IRect(0, 0, 696, 471), 0)
Warning: unsupported /SMask 30 for 31:
Pixmap(DeviceGray, IRect(0, 0, 698, 507), 0)
Warning: unsupported /SMask 32 for 33:
Pixmap(DeviceGray, IRect(0, 0, 611, 420), 0)
Warning: unsupported /SMask 34 for 35:
Pixmap(DeviceGray, IRect(0, 0, 613, 436), 0)
saved 8 images to 'fitz/'
(base) pm286macbook:Chapter06 pm286$ tree fitz
fitz
├── img-15.jpeg
├── img-16.jpeg
├── img-19.png
├── img-25.png
├── img-29.png
├── img-31.png
├── img-33.png
└── img-35.png

The JPEGs seem fine. The PNGs are also fine EXCEPT they have a black background (the original images are white). Maybe this is an alpha problem; the "unsupported /SMask" warnings above point that way, since an SMask is the PDF soft mask that carries transparency. But it completely swamps any black text so it's not useful. Hmm. Maybe I have to read the PDFStream in pdfplumber? At present I output:

======page: 8 ===========
words 137 | images 2 |

keys:
image: <class 'dict'>: dict_keys(['x0', 'y0', 'x1', 'y1', 'width', 'height', 'name', 'stream', 'srcsize', 'imagemask', 'bits', 'colorspace', 'object_type', 'page_number', 'top', 'bottom', 'doctop']) 

values:
dict_values([72.0, 412.99, 523.3, 664.64, 451.29999999999995, 251.64999999999998, 'Im0', <PDFStream(15): raw=143448, {'BitsPerComponent': 8, 'ColorSpace': /'DeviceRGB', 'Filter': /'DCTDecode', 'Height': 779, 'Interpolate': True, 'Length': 143448, 'Subtype': /'Image', 'Type': /'XObject', 'Width': 1397}>, (1397, 779), None, 8, [/'DeviceRGB'], 'image', 8, 177.27999999999997, 428.92999999999995, 6070.719999999999])

stream <PDFStream(15): raw=143448, {'BitsPerComponent': 8, 'ColorSpace': /'DeviceRGB', 'Filter': /'DCTDecode', 'Height': 779, 'Interpolate': True, 'Length': 143448, 'Subtype': /'Image', 'Type': /'XObject', 'Width': 1397}>
xxyy ((72.0, 523.3), (412.99, 664.64), (1397, 779), 'Im0', 8)
image:  ((1397, 779), 143448) => (8, (72.0, 523.3), (412.99, 664.64))

If I could turn the PDFStream of 143448 bytes into a bitmap (?LTImage) that would be fine. But I can't easily find how to hack PDFStream
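
One observation on the dump above: the stream's Filter is DCTDecode, which means this particular image is stored JPEG-compressed inside the PDF, so its 143448 raw bytes should already be a usable JPEG file. A minimal sketch under that assumption follows. `filter_names` and `save_if_jpeg` are hypothetical names, the code relies only on the `attrs` and `rawdata` attributes of the stream object, and it does not handle other filters (e.g. FlateDecode, which needs decoding plus colour space handling) or CMYK JPEGs, which may need extra treatment.

```python
from pathlib import Path
from typing import List


def filter_names(stream) -> List[str]:
    """Return the stream's filter names as plain strings, e.g. ['DCTDecode']."""
    filt = stream.attrs.get("Filter")
    if filt is None:
        return []
    filters = filt if isinstance(filt, list) else [filt]
    # pdfminer represents PDF names as literal objects with a .name attribute
    return [str(getattr(f, "name", f)) for f in filters]


def save_if_jpeg(image: dict, out_path: Path) -> bool:
    """Write the raw stream bytes to out_path if they are already JPEG data."""
    stream = image["stream"]
    raw = stream.rawdata  # may be None once the stream has been decoded
    if raw is None or filter_names(stream) != ["DCTDecode"]:
        return False
    out_path.write_bytes(raw)
    return True
```

Called inside the loop over `page.images`, for example as `save_if_jpeg(image, Path(path, "image.jpg"))`, this would write the original bytes rather than a re-rendered crop, which is what the OCR step needs.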

0 replies
