Replies: 3 comments
-
This is currently not possible with pypdf. You can get the text positions by using a visitor (https://pypdf.readthedocs.io/en/stable/user/extract-text.html#using-a-visitor), but extracting the bounding boxes of images has not been implemented. It has been discussed previously in #2763; it tends to be more complex, and in that specific case the page values could have been used as well.
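For the text side, a minimal sketch of the visitor approach could look like the following. The `visitor_text` parameter and the `(text, cm, tm, font_dict, font_size)` callback signature come from the pypdf documentation linked above; the helper names `make_text_collector` and `extract_positioned_text` are made up for illustration, and combining `tm` with `cm` to get a page position is my assumption about how the two matrices should be applied, so check it against your own files.

```python
def make_text_collector():
    """Return (visitor, fragments); pypdf fills `fragments` by calling `visitor`."""
    fragments = []

    def visitor(text, cm, tm, font_dict, font_size):
        # Skip whitespace-only fragments.
        if not text.strip():
            return
        # Assumption: map the text-matrix origin (tm[4], tm[5]) through the
        # current transformation matrix cm = [a, b, c, d, e, f].
        x = cm[0] * tm[4] + cm[2] * tm[5] + cm[4]
        y = cm[1] * tm[4] + cm[3] * tm[5] + cm[5]
        fragments.append((x, y, text))

    return visitor, fragments


def extract_positioned_text(pdf_path, page_index=0):
    """Hypothetical helper: list (x, y, text) fragments, top of the page first."""
    # Imported here so the collector above can be tried without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    visitor, fragments = make_text_collector()
    reader.pages[page_index].extract_text(visitor_text=visitor)
    # PDF y coordinates grow upward, so sort by descending y, then by x.
    return sorted(fragments, key=lambda f: (-f[1], f[0]))
```

This only gives positions for text. Large vertical gaps between consecutive fragments are a hint, not proof, that an image sits in between.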
-
Thanks for your prompt reply! My PDF is fairly simple, with only text and images, so maybe I can infer the image locations from the text locations. I'll give it a try.
-
I put together a workaround. The idea is that you can get all images on the page in top-down order, then use runs of consecutive newlines in the extracted text to find the image slots. Align the images to their slots and you have the text and images in order. This works for use cases where images and text are not on the same line.

    import re

    def embed_image_placeholders(text, images):
        # Replace each run of 4+ newlines, in order, with a placeholder for the next image.
        for img in images:
            placeholder = f"\n<IMAGE at {img.name}>\n"
            # A callable replacement keeps any backslashes in img.name literal.
            text = re.sub(r"\n{4,}", lambda _: placeholder, text, count=1)
        return text

I am using 4 or more consecutive newlines as an image slot; you can easily adjust that to your use case. I hope this helps folks who just need a quick and dirty solution.
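If an ordered list is more convenient than placeholders in a string, the same slot idea can be sketched as below. This is only an illustration under the same assumptions (images arrive in top-down order, and a run of 4 or more newlines marks one image slot); `interleave_text_and_images` and its parameters are hypothetical names, and it takes plain image names rather than pypdf image objects.

```python
import re


def interleave_text_and_images(text, image_names, min_newlines=4):
    """Return an ordered list of ("text", str) and ("image", name) items."""
    # Each run of at least `min_newlines` newlines is treated as one image slot.
    chunks = re.split(r"\n{%d,}" % min_newlines, text)
    names = list(image_names)
    ordered = []
    for i, chunk in enumerate(chunks):
        if chunk.strip():
            ordered.append(("text", chunk.strip()))
        # A slot follows every chunk except the last one.
        if i < len(chunks) - 1 and names:
            ordered.append(("image", names.pop(0)))
    # Images without a matching slot go at the end instead of being dropped.
    ordered.extend(("image", name) for name in names)
    return ordered


page_text = (
    "Below is an image:\n\n\n\n\n"
    "The location of the image above is important."
)
items = interleave_text_and_images(page_text, ["Im0.png"])
```

If the number of slots and the number of images differ, the guess is probably wrong for that page, so it is worth comparing the two counts before trusting the order.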
-
Thank you so much for maintaining such a wonderful package. I am trying to parse a PDF into text and images. I was able to get the text and the images separately by following the documentation, but I also need to make sure the paragraphs of text and the images are in the correct order. For example, a PDF page could be:
PDF page
Below is an image:
an image
The location of the image above is important.
/PDF page
Currently, the only workaround I can think of is to use the empty lines in the extracted text to locate where the image should be. But I think there must be a better way to order the text and images from the same page.
Thank you for reviewing my question!