Get the location of image and text paragraphs #2828

alanhyue · 2024-09-03T03:54:10Z

Thank you so much for maintaining such a wonderful package. I am trying to parse a PDF into text and images. Following the documentation, I was able to extract the text and the images separately, but I also need to keep the text paragraphs and the images in their original order. For example, a PDF page could look like this:

PDF page

Below is an image:
an image
The location of the image above is important.

/PDF page

Currently, the only workaround I can think of is to use the empty lines in the extracted text to locate where each image should be. But I suspect there is a better way to order the text and images from the same page.

Thank you for reviewing my question!

stefan6419846 (Maintainer) · 2024-09-03T05:45:57Z

This is currently not possible with pypdf. You can get the text positions by using a visitor (https://pypdf.readthedocs.io/en/stable/user/extract-text.html#using-a-visitor). Extracting the bounding boxes of images has been discussed previously but was not implemented, as it tends to be more complex, and in that specific case the page values could have been used as well: #2763.
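For readers who want to try the visitor approach, here is a minimal sketch. It relies on the `visitor_text` argument of `extract_text()`, which calls the visitor with `(text, cm, tm, font_dict, font_size)`. The helper name `collect_text_positions` is made up for illustration, and combining the two matrices to get a page position is an assumption that may not hold for every PDF.

```python
def collect_text_positions(page):
    """Return (x, y, text) tuples for the non-empty text fragments on a page.

    `page` is expected to be a pypdf page object; anything with a compatible
    extract_text(visitor_text=...) method works as well.
    """
    fragments = []

    def visitor(text, cm, tm, font_dict, font_size):
        if not text.strip():
            return  # skip whitespace-only fragments
        # Assumption: map the text-space origin through the text matrix (tm)
        # and then through the current transformation matrix (cm).
        # Both are 6-element lists [a, b, c, d, e, f].
        x = tm[4] * cm[0] + tm[5] * cm[2] + cm[4]
        y = tm[4] * cm[1] + tm[5] * cm[3] + cm[5]
        fragments.append((x, y, text))

    page.extract_text(visitor_text=visitor)
    return fragments
```

With pypdf installed, this could be called as `collect_text_positions(PdfReader("example.pdf").pages[0])`, where `example.pdf` is a placeholder file name.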

alanhyue (Author) · 2024-09-03T12:43:12Z

Thanks for your prompt reply! My PDF is fairly simple, with only text and images, so maybe I can infer the image locations from the text locations. I'll give it a try.
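One way to picture "inferring image locations from text locations" is to look for unusually large vertical gaps between text lines. The sketch below is only an illustration of that idea: the function name `find_vertical_gaps` and the `min_gap` threshold are hypothetical, and it assumes the usual PDF coordinate system with the origin at the bottom-left of the page.

```python
def find_vertical_gaps(line_ys, min_gap=60.0):
    """Return (upper_y, lower_y) pairs where consecutive text lines are far apart.

    line_ys: y coordinates of text lines in PDF units (larger y is assumed
    to mean higher on the page).
    min_gap: hypothetical threshold; tune it per document.
    """
    ys = sorted(set(line_ys), reverse=True)  # top of the page first
    return [(upper, lower) for upper, lower in zip(ys, ys[1:]) if upper - lower >= min_gap]


# Three lines near the top, then a large gap, then two more lines.
print(find_vertical_gaps([700, 686, 672, 480, 466]))  # [(672, 480)]
```

Each returned pair marks a band of the page with no text, which is a candidate slot for an image.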

alanhyue (Author) · 2024-09-05T03:04:36Z

I put together a workaround. The idea is that you can get all images on a page in top-down order, then use runs of consecutive newlines in the extracted text to identify image slots. Align the images with their slots and you have the text and images in order. This works for use cases where images and text are not on the same line.

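import re

def embed_image_placeholders(text, images):
    """Replace each run of 4+ newlines with a placeholder for the next image."""
    txt = text
    for img in images:  # images are assumed to be in top-down page order
        placeholder = f"\n<IMAGE at {img.name}>\n"
        # A callable replacement keeps any backslashes in img.name literal.
        txt = re.sub(r"\n{4,}", lambda _match: placeholder, txt, count=1)
    return txt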

I am using four or more consecutive newlines as an image slot; you can easily adjust this to your use case. I hope this helps folks who just need a quick-and-dirty solution.
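As a variation on the workaround above, the same slot idea can return an ordered list of text and image segments instead of a single string, with the newline threshold as a parameter. This is a sketch, not the author's code: the function name `interleave_text_and_images`, the `min_newlines` parameter, and the choice to append unmatched images at the end are all assumptions made for illustration.

```python
import re


def interleave_text_and_images(text, image_names, min_newlines=4):
    """Return a list of ("text", str) and ("image", name) tuples in reading order.

    Assumes image_names is already in top-down page order and that each run
    of at least min_newlines newlines marks one image slot.
    """
    chunks = re.split(r"\n{%d,}" % min_newlines, text)
    names = list(image_names)
    ordered = []
    for i, chunk in enumerate(chunks):
        if chunk.strip():
            ordered.append(("text", chunk.strip()))
        # A slot follows every chunk except the last one.
        if i < len(chunks) - 1 and names:
            ordered.append(("image", names.pop(0)))
    # Images without a matching slot are appended at the end.
    ordered.extend(("image", name) for name in names)
    return ordered


page_text = "Below is an image:\n\n\n\n\nThe location of the image above is important."
print(interleave_text_and_images(page_text, ["Im0.png"]))
```

For the sample page from the original question, this yields the first sentence, then the image, then the second sentence. The image name `Im0.png` is a placeholder.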
