PDF and MuPDF coordinates #3889

anhalu · 2024-09-25T09:34:50Z


Description of the bug

I am trying to get the coordinates of each piece of text in a PDF file. MuPDF uses a coordinate system whose origin is the top-left corner of the page, whereas PDF places the origin at the bottom left (or possibly somewhere else). I only want coordinates with a top-left origin. Is there a way to always get them that way?

How to reproduce the bug

None.

PyMuPDF version

1.24.10

Operating system

Linux

Python version

3.10


anhalu · 2024-09-25T09:35:30Z


@JorjMcKie please help me.

7 replies

anhalu Sep 26, 2024
Author

@JorjMcKie Thanks for your reply.
Let me clarify a bit. I am building a tool that converts PDF files to PPTX files. It would be fine if every PDF file had its coordinate origin at the top left, but some PDF files have it at the bottom left. The questions to be solved are:

How can we make sure there is only one kind of coordinate origin, namely the top left?
Or, what is the identifying mark (some attribute) that tells us when PyMuPDF has read a file in PDF coordinates (bottom left), so that a transformation matrix can be used to convert from PDF coordinate space to MuPDF coordinate space?
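For reference, the conversion asked about here can be sketched in plain Python. This is a minimal illustration, assuming an unrotated page whose PDF origin is the bottom-left corner; the helper names `flip_matrix` and `apply_matrix` are made up for this example and are not PyMuPDF functions. In PyMuPDF itself, `page.transformation_matrix` and `page.rotation` are the attributes to inspect for the page's own values.

```python
def flip_matrix(page_height):
    """Matrix (a, b, c, d, e, f) that mirrors the y-axis of an unrotated page.

    Hypothetical helper: it maps a bottom-left-origin point to the
    equivalent top-left-origin point on a page of the given height.
    """
    return (1.0, 0.0, 0.0, -1.0, 0.0, float(page_height))


def apply_matrix(point, matrix):
    """Apply the affine map x' = a*x + c*y + e, y' = b*x + d*y + f."""
    x, y = point
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


page_height = 792.0        # example value: US Letter height in points
pdf_point = (72.0, 720.0)  # bottom-left origin: 72 pt from the left, 720 pt from the bottom
top_left_point = apply_matrix(pdf_point, flip_matrix(page_height))
print(top_left_point)
```

Applying the same matrix a second time returns the original point, so one helper covers both directions for unrotated pages.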

K8S666 Sep 26, 2024

Get the font size and then convert the estimate

anhalu Sep 26, 2024
Author

@K8S666 I don't think it will solve the problem, because the coordinates are wrong.

anhalu Sep 26, 2024
Author

Update on the problem

I think I misunderstood something: PyMuPDF reads the PDF's coordinates and reports them as MuPDF coordinates. The original PDF page is rotated, so the MuPDF coordinates are rotated as well. Calling page.remove_rotation() solved that problem, but it also produces a very large number of shape points (x, y coordinate pairs in the items of each shape), which increases the computation considerably. I planned to use a convex hull to handle this, but it is not efficient enough.

Answer selected by anhalu
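To illustrate why a rotated page yields rotated coordinates, here is a small sketch of how a page rotation by a multiple of 90 degrees moves a point, and how to undo it. It assumes top-left-origin coordinates, a clockwise rotation, and `width` and `height` taken from the unrotated page; `rotate_point` and `derotate_point` are hypothetical names, not PyMuPDF calls. PyMuPDF provides `page.rotation_matrix` and `page.derotation_matrix` for this purpose, which may be an alternative to rewriting the page with `remove_rotation()`.

```python
def rotate_point(point, width, height, rotation):
    """Map a point on the unrotated page (width x height, top-left origin)
    to its position after a clockwise rotation of 0, 90, 180 or 270 degrees."""
    x, y = point
    rotation %= 360
    if rotation == 0:
        return (x, y)
    if rotation == 90:
        return (height - y, x)
    if rotation == 180:
        return (width - x, height - y)
    if rotation == 270:
        return (y, width - x)
    raise ValueError("rotation must be a multiple of 90")


def derotate_point(point, width, height, rotation):
    """Inverse of rotate_point: recover the unrotated coordinates."""
    x, y = point
    rotation %= 360
    if rotation == 0:
        return (x, y)
    if rotation == 90:
        return (y, height - x)
    if rotation == 180:
        return (width - x, height - y)
    if rotation == 270:
        return (width - y, x)
    raise ValueError("rotation must be a multiple of 90")


width, height = 612, 792  # example page size in points
original = (100, 200)
for angle in (0, 90, 180, 270):
    rotated = rotate_point(original, width, height, angle)
    restored = derotate_point(rotated, width, height, angle)
    print(angle, rotated, restored)
```

Because this only remaps existing points, the number of points per path stays the same.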

JorjMcKie Sep 26, 2024
Maintainer

How did you compute the convex hull?
Every path (item in list page.get_drawings()) has a "rect" property which is the convex hull of all points occurring in the path's atomic draw commands.
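As a point of comparison, the sketch below recomputes an enclosing rectangle from a list of points. It assumes that "rect" is the smallest axis-aligned rectangle containing all of a path's points; `bounding_rect` is a hypothetical helper written for this example, and the sample points are invented.

```python
def bounding_rect(points):
    """Return (x0, y0, x1, y1) of the smallest axis-aligned rectangle
    that encloses every (x, y) pair in points."""
    points = list(points)
    if not points:
        raise ValueError("at least one point is required")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


# Invented sample points standing in for the points of one path.
path_points = [(10, 40), (30, 5), (55, 25), (20, 60)]
print(bounding_rect(path_points))
```

Such a rectangle is cheap to compute and always has four corners, but it describes only the extent of a path, not its outline.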

anhalu Sep 27, 2024
Author

from pptx.dml.color import RGBColor
from pptx.util import Inches

# Defined elsewhere in my code: page (PyMuPDF page), slide (python-pptx slide),
# page_width, page_height, slide_width, slide_height, logger
# and the helper _set_shape_transparency().
shapes = page.get_drawings()
for shape in shapes:
    if shape['type'] in ['fs', 'f', 's']:
        items = shape['items']
        if items[0][0] == 're':
            # Handling rectangles: scale the path's rect to slide coordinates
            rect = shape['rect']
            x0 = slide_width * (rect.x0 / page_width)
            y0 = slide_height * (rect.y0 / page_height)
            x1 = slide_width * (rect.x1 / page_width)
            y1 = slide_height * (rect.y1 / page_height)

            path = slide.shapes.build_freeform(x0, y0)
            path.add_line_segments(
                [
                    (x1, y0),
                    (x1, y1),
                    (x0, y1),
                    (x0, y0),
                ], close=False,
            )
        else:
            # Handling other shapes: start at the first point of the first item
            start_x = slide_width * (items[0][1].x / page_width)
            start_y = slide_height * (items[0][1].y / page_height)
            path = slide.shapes.build_freeform(start_x, start_y)
            num = 0
            for item in items:
                # 'l' is a line, 'c' is a curve; every point of the item
                # (including curve control points) is added as a line vertex
                if item[0] in ['l', 'c']:
                    for l_i in range(1, len(item)):
                        x = slide_width * (item[l_i].x / page_width)
                        y = slide_height * (item[l_i].y / page_height)
                        print(x, y)
                        num += 1
                        path.add_line_segments(
                            [(x, y)], close=False,
                        )

                else:
                    logger.debug(f'Other item type: {item[0]}')
            print("Number of line: ", num)
            if shape['closePath']:
                # Close the outline by returning to the start point
                path.add_line_segments([(start_x, start_y)])

        shape_obj = path.convert_to_shape()

        if 'fill' in shape:
            if shape['fill_opacity'] is not None \
                    and shape['fill_opacity'] > -1:
                # Fill colour components are floats in 0..1
                r, g, b = (int(c * 255) for c in shape['fill'])
                shape_obj.fill.solid()
                shape_obj.fill.fore_color.rgb = RGBColor(r, g, b)
                _set_shape_transparency(
                    shape_obj, int(
                        (shape['fill_opacity']) * 100000,
                    ),
                )
            else:
                shape_obj.fill.solid()
                shape_obj.fill.fore_color.rgb = RGBColor(0, 0, 0)
                _set_shape_transparency(shape_obj, int(0))

        if 'color' in shape:
            if shape['color'] is None:
                # No stroke colour: hide the outline
                shape_obj.line.fill.background()
            else:
                r, g, b = (int(c * 255) for c in shape['color'])
                shape_obj.line.color.rgb = RGBColor(r, g, b)
                # assuming width is in points
                shape_obj.line.width = Inches(shape['width'] / 72)

@JorjMcKie
I think rect can be used for shapes of type 're', but other shape types need to be handled separately. So I take each (x, y) point from the items and convert the points into a shape in the PPTX. The problem is that after removing the rotation there are too many points (more than 4000 pairs), which makes the conversion quite difficult. I intend to run a convex hull over these points to keep only the points located on the edge of the shape.
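The convex hull idea mentioned here can be sketched as follows. The thread does not show how the hull was computed, so this uses Andrew's monotone chain algorithm as an assumed stand-in, with points given as plain (x, y) tuples; `cross` and `convex_hull` are names chosen for this example. A convex hull discards every concave detail, so it only approximates shapes that are not convex.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices of a set of (x, y) tuples, without
    duplicates and without points lying on a hull edge."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    # Build the first half of the hull while scanning left to right.
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    # Build the second half while scanning right to left.
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each half is the first point of the other half.
    return lower[:-1] + upper[:-1]


# Invented sample: a square outline plus interior and edge points.
sample = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3), (2, 0)]
hull = convex_hull(sample)
print(len(sample), "points reduced to", len(hull))
```

The sort makes the running time O(n log n), which should be manageable for a few thousand points per path.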
