Issue with y coordinate of text matrix (tm) using pypdf 3.7.0 #1751

phamanhuy · 2023-03-27T03:09:56Z

I'm using pypdf 3.7.0 to extract text from a PDF file. I need each text fragment's location for subsequent processing, so I read its x and y coordinates from the text matrix (tm) along with the text. The x coordinates are fine, but the y coordinates are wrong.

I checked the page size to make sure the file isn't scaled, and it is the expected 612x792.

I think one way to solve this is to combine the transformation matrix (cm) with the text matrix (tm), but I haven't figured out how to do that.

Note: I suspect the transformation matrix (cm) because its value is [1,0,0,1,0,0] for other PDF files. For this PDF file, however, the values of cm keep changing (especially the last two elements of the matrix).

Link to the pdf file: https://drive.google.com/file/d/10KMQVAJPB2hQSOOT6OrnF0RGg82k6i31/view?usp=sharing

Below is a code example for the first page. (The issue occurs on all pages.)

from pypdf import PdfReader

def visitor_body(text, cm, tm, fontDict, fontSize):
    # tm[4] and tm[5] are the translation components of the text matrix
    x, y = tm[4], tm[5]
    print('This is text', text)
    print('This is tm', tm)
    print('This is cm', cm)

py_reader = PdfReader("Typhoon Merbok PVRR.pdf")
py_page = py_reader.pages[0]

print('This is page size',py_page.mediabox)
py_page.extract_text(visitor_text=visitor_body)

Here is part of the printed output, showing the text matrices (tm) and transformation matrices (cm):

This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 699.75]
This is text Typhoon Merbok
This is tm [1.0, 0.0, 0.0, -1.0, 138.5182341, 17.92000058]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 699.75]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 138.5182341, 17.92000058]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 683.25]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 0.0, 14.079999970000001]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 683.25]
This is text 17 September, 2022
This is tm [1.0, 0.0, 0.0, -1.0, 127.91241470000003, 14.079999970000001]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 683.25]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 127.91241470000003, 14.079999970000001]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 669.75]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 0.0, 14.079999970000001]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 669.75]
This is text Released:
This is tm [1.0, 0.0, 0.0, -1.0, 64.36441049999999, 14.079999970000001]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 669.75]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 669.75]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 73.317032, 14.079999970000001]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 669.75]
This is text 31 October, 2022
This is tm [1.0, 0.0, 0.0, -1.0, 177.6134188, 14.079999970000001]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 669.75]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 177.6134188, 14.079999970000001]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 656.25]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 0.0, 14.079999970000001]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 656.25]
This is text NHERI DesignSafe Project ID:
This is tm [1.0, 0.0, 0.0, -1.0, 201.9814454, 14.079999970000001]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 656.25]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 201.9814454, 14.079999970000001]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 642.75]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 0.0, 15.3599997]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 642.75]
This is text PRJ-
This is tm [1.0, 0.0, 0.0, -1.0, 27.6880035, 15.3599997]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 642.75]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 642.75]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 36.640625, 15.35999966]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 642.75]
This is text 3737
This is tm [1.0, 0.0, 0.0, -1.0, 36.640625, 15.35999966]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 642.75]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 36.640625, 15.35999966]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 76.5, 615.75]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 31.672241, 1.02156258]
This is cm [0.75, 0.0, 0.0, -0.75, 76.5, 615.75]
This is text PRELIMINARY VIRTUAL RECONNAISSANCE REPORT (PVRR)
This is tm [1.0, 0.0, 0.0, -1.0, 573.4407955, 17.92000058]
This is cm [0.75, 0.0, 0.0, -0.75, 76.5, 615.75]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 573.4407955, 17.92000058]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 522.0]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 95.317818, 1.02156258]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 522.0]
This is text Virtual Assessment Structural Team (VAST) Lead
This is tm [1.0, 0.0, 0.0, -1.0, 516.5935141, 17.92000058]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 522.0]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 516.5935141, 17.92000058]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 505.5]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 155.507813, 15.89333344]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 505.5]
This is text Mohammad Alam, University of Notre Dame
This is tm [1.0, 0.0, 0.0, -1.0, 243.546876, 15.89333344]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 505.5]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 243.546876, 15.89333344]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 480.0]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 81.33474, 1.02156258]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 480.0]
This is text Virtual Assessment Structural Team (VAST) Authors
This is tm [1.0, 0.0, 0.0, -1.0, 530.9040763, 17.92000058]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 480.0]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 530.9040763, 17.92000058]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 463.5]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 234.625, 15.35999966]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 463.5]
This is text (in alphabetical order)
This is tm [1.0, 0.0, 0.0, -1.0, 234.625, 15.35999966]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 463.5]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 234.625, 15.35999966]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 449.25]
This is text  

This is tm [1.0, 0.0, 0.0, -1.0, 160.804688, 15.35999966]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 449.25]
This is text Janise Rodgers, GeoHazards International
This is tm [1.0, 0.0, 0.0, -1.0, 160.804688, 15.35999966]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 449.25]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 160.804688, 15.35999966]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 435.0]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 186.91797, 15.35999966]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 435.0]
This is text Prateek Arora, New York University
This is tm [1.0, 0.0, 0.0, -1.0, 339.003908, 15.35999966]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 435.0]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 339.003908, 15.35999966]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 420.75]
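
One possible way to combine the two matrices is sketched below. It assumes that the `cm` passed to the visitor is the transformation matrix in effect for that text, and that both matrices use the usual PDF layout `[a, b, c, d, e, f]`. The helper names `compose` and `page_position` are made up for this illustration and are not part of pypdf. The sample values are the `tm` and `cm` printed for "Typhoon Merbok" above.

```python
def compose(tm, cm):
    """Return the product tm x cm of two PDF matrices [a, b, c, d, e, f]."""
    a1, b1, c1, d1, e1, f1 = tm
    a2, b2, c2, d2, e2, f2 = cm
    return [
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    ]


def page_position(tm, cm):
    """Translation part of tm x cm, i.e. the text origin mapped through cm."""
    combined = compose(tm, cm)
    return combined[4], combined[5]


# Values copied from the output above (the "Typhoon Merbok" entry).
tm = [1.0, 0.0, 0.0, -1.0, 138.5182341, 17.92000058]
cm = [0.75, 0.0, 0.0, -0.75, 363.75, 699.75]

x, y = page_position(tm, cm)
print(round(x, 2), round(y, 2))  # prints 467.64 686.31
```

With these values the two vertical flips (`-1.0` in tm and `-0.75` in cm) cancel out, and the resulting y lies near the top of the 792-point page. Whether this matches the real position of the text in this file still needs to be checked against the rendered page.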

pbrus · 2023-05-08T20:20:44Z

I've noticed the same issue. The code snippet from the documentation for removing the header and footer seems to work properly:

from PyPDF2 import PdfReader  # the same for pypdf

reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")
page = reader.pages[3]

parts = []


def visitor_body(text, cm, tm, fontDict, fontSize):
    y = tm[5]
    if y > 50 and y < 720:  # page.artbox = RectangleObject([0, 0, 612, 792])
        parts.append(text)


page.extract_text(visitor_text=visitor_body)
text_body = "".join(parts)

print(text_body)

Now let's focus on the last page and try to reproduce the layout of the vector graphics on that page:

import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

from PyPDF2 import PdfReader


reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")
page = reader.pages[18]

coo = []

def visitor_before(op, args, cm, tm):
    if op == b"re":
        coo.append([args[i].as_numeric() for i in range(4)])


def visitor_text(text, cm, tm, fontDict, fontSize):
    pass


page.extract_text(visitor_operand_before=visitor_before, visitor_text=visitor_text)

fig, ax = plt.subplots()
fig.set_dpi(300)
ax.set_aspect(1)
ax.set_xlim(0, 612)
ax.set_ylim(0, 792)

for c in coo:
    ax.add_patch(Rectangle((c[0], c[1]), c[2], c[3], fill=False, lw=0.2))

plt.show()

Here is what we expect and what we get:

Note that the rectangles belonging to the header and footer are positioned much more accurately than the others.
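
A possible explanation is that the `re` operands are collected without applying the `cm` that is active when each rectangle is drawn. The sketch below maps a rectangle through `cm` before plotting. It assumes the `cm` passed to `visitor_operand_before` is the matrix in effect for that operator. The helpers `apply_cm` and `rect_in_page_space` are hypothetical names, not pypdf functions, and the rectangle values are made up.

```python
def apply_cm(cm, x, y):
    """Map a point through a PDF matrix [a, b, c, d, e, f]."""
    a, b, c, d, e, f = cm
    return a * x + c * y + e, b * x + d * y + f


def rect_in_page_space(rect, cm):
    """Bounding box (x, y, width, height) of an `re` rectangle after applying cm."""
    x, y, w, h = rect
    corners = [apply_cm(cm, px, py) for px in (x, x + w) for py in (y, y + h)]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)


# Made-up rectangle combined with a cm that flips the y axis on a 792-point page.
flipped = rect_in_page_space([10.0, 20.0, 100.0, 50.0],
                             [1.0, 0.0, 0.0, -1.0, 0.0, 792.0])
print(flipped)  # prints (10.0, 722.0, 100.0, 50.0)
```

In `visitor_before`, one could then append `rect_in_page_space([args[i].as_numeric() for i in range(4)], cm)` instead of the raw operands. This is untested against the PDF above.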
