OK,
approaches
Thanks @jsvine, makes sense! I'm not familiar with I am also happy to run a separate program, write to file, and pick up the results in I'll do a bit of exploring and record progress here. It won't be immediate. FWIW we are not only extracting the images, but also extracting text from them using a variety of OCR tools (pytesseract, easyocr) and converting to structured HTML. That's why we need the original, not a clipped screenshot.
I think I have a Horrible Hack that solves my problem 99%.
There may be collisions but if we do it on a per-page basis in e.g.
NOTE: I have been looking for other image extractors and they may be better. I do not like JPGs as they lose info, and I don't think they are in the original PDF. But it's all messy.
Hmm. The good news is that I can extract per-page using
which means many of the images can be automatically identified, and there is only ambiguity for images which have exactly the same dimensions and the same compressed byte count.
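The matching idea described here can be sketched in plain Python. This is only an illustration under assumptions: the record dicts and their keys (`width`, `height`, `nbytes`, `bbox`, `path`) are hypothetical, not the output of any particular tool, and `match_by_signature` is a made-up helper name. Records are paired when exactly one image on each side shares the same dimensions and compressed byte count; everything else is reported as ambiguous.

```python
from collections import defaultdict


def match_by_signature(page_records, extracted_files):
    """Pair records that share (width, height, nbytes).

    Both arguments are lists of dicts with hypothetical keys
    'width', 'height' and 'nbytes'. Returns (matched, ambiguous).
    """
    def group(items):
        groups = defaultdict(list)
        for item in items:
            key = (item["width"], item["height"], item["nbytes"])
            groups[key].append(item)
        return groups

    records = group(page_records)
    files = group(extracted_files)

    matched, ambiguous = [], []
    for key, recs in records.items():
        candidates = files.get(key, [])
        if len(recs) == 1 and len(candidates) == 1:
            # Unique signature on both sides: safe to pair.
            matched.append((recs[0], candidates[0]))
        elif candidates:
            # Same dimensions and byte count more than once: cannot decide.
            ambiguous.append((key, recs, candidates))
    return matched, ambiguous
```

Running this per page, as suggested above, keeps the groups small and so reduces the chance of collisions.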
Have used
The JPEGs seem fine. The PNGs are also fine EXCEPT they have a black background (the original images are white). Maybe this is an alpha problem. But it completely swamps any black text so it's not useful. Hmm. Maybe I have to read the
If I could turn the
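If the black background really is an alpha problem (this is only a guess, as noted above), one possible workaround is to composite the PNG onto white before running OCR. The sketch below assumes Pillow is installed; `over_white` and `flatten_png` are hypothetical helper names. The per-pixel function shows why a fully transparent pixel whose stored colour is black turns white once alpha is honoured.

```python
def over_white(r, g, b, a):
    """Blend one 8-bit RGBA pixel onto a white background."""
    def blend(channel):
        # Standard "over" compositing with integer rounding.
        return (channel * a + 255 * (255 - a) + 127) // 255
    return blend(r), blend(g), blend(b)


def flatten_png(src_path, dst_path):
    """Write a copy of the image with transparency replaced by white."""
    from PIL import Image  # third-party (Pillow), imported lazily

    with Image.open(src_path) as im:
        rgba = im.convert("RGBA")
        white = Image.new("RGBA", rgba.size, (255, 255, 255, 255))
        Image.alpha_composite(white, rgba).convert("RGB").save(dst_path)
```

If the extracted PNG has no alpha channel at all, this changes nothing, which would suggest the transparency is stored elsewhere in the PDF.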
I want to extract images using `pdfplumber` retaining a knowledge of their content (page_number and coordinates on page). (Some tools only emit image files with non-semantic names.)
My current (arbitrary) scheme is to create filenames of the form:
I'm hoping that there is a single way of getting this in `pdfplumber`. Currently I have 2 approaches:
pdf2txt
I can run:
which gives:
This gets the images I want but is impenetrable. The `Im<d>` is occasionally incremented to `Im1`, `Im2`, etc., sometimes with and without a minor index. I don't even know how to map these onto the order in the document.
pdfplumber
I have a "debugger" for `pdfplumber` in https://github.com/petermr/pyami/blob/main/py4ami/ami_pdf.py (messy as I'm still digging!) with method `print_images`.
If I knew how to get an `LTImage` I could probably export it here:
I can at least extract the coordinates:
I save this string, see below
I can get the images by screen capture but this can lose info and also is overwritten by a watermark
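As an alternative to screen capture, the original image bytes can in principle be read from the `stream` entry that `pdfplumber` attaches to each item in `page.images`. The following is a rough sketch, not a tested recipe: `save_raw_image_streams` and `guess_extension` are hypothetical helper names, and only JPEG (DCT-encoded) streams come out as a ready-to-open file. Other streams are raw samples that would still need colour space, bit depth and size information to become a viewable image, so they are saved as `.bin`.

```python
from pathlib import Path


def guess_extension(data):
    """Return '.jpg' for JPEG data, '.bin' for anything else."""
    if data[:3] == b"\xff\xd8\xff":  # JPEG start-of-image marker
        return ".jpg"
    return ".bin"  # raw samples, not a standalone image file


def save_raw_image_streams(pdf_path, out_dir):
    """Dump the stream of every image on every page into out_dir."""
    import pdfplumber  # third-party, imported lazily

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for index, img in enumerate(page.images):
                # img["stream"] is a pdfminer PDFStream object.
                data = img["stream"].get_data()
                name = f"page{page.page_number}_img{index}{guess_extension(data)}"
                target = out / name
                target.write_bytes(data)
                written.append(target)
    return written
```

Because nothing is re-rendered, a watermark drawn over the image on the page does not end up in the saved bytes.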
These are the coordinates I extracted for filenames
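For illustration, a filename carrying the page number and bounding box could be built from a `pdfplumber` image dict roughly as follows. The pattern and the helper name `image_filename` are assumptions made for this sketch, not the scheme used in the post. The keys `page_number`, `x0`, `x1`, `top` and `bottom` are the ones `pdfplumber` reports for each entry of `page.images`.

```python
def image_filename(img, ext="png"):
    """Build a name such as image_p3_72_300_101_250.png.

    `img` is a dict with 'page_number', 'x0', 'x1', 'top', 'bottom';
    coordinates are rounded to whole points to keep names short.
    """
    coords = "_".join(str(round(img[key])) for key in ("x0", "x1", "top", "bottom"))
    return f"image_p{img['page_number']}_{coords}.{ext}"
```

Rounding means two images whose boxes differ by less than a point on the same page would collide, so a serial index may be worth adding if that matters.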
Thanks!