-
I've noticed the same issue. When we use the code snippet from the documentation to strip the header and footer, it seems to work properly:

    from PyPDF2 import PdfReader  # the same for pypdf

    reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")
    page = reader.pages[3]
    parts = []

    def visitor_body(text, cm, tm, fontDict, fontSize):
        # tm[5] is the vertical translation component of the text matrix
        y = tm[5]
        if 50 < y < 720:  # page.artbox = RectangleObject([0, 0, 612, 792])
            parts.append(text)

    page.extract_text(visitor_text=visitor_body)
    text_body = "".join(parts)
    print(text_body)

Now, let's focus on the last page and try to reproduce the layout of the vector graphics on that page:

    import matplotlib.pyplot as plt
    from matplotlib.patches import Rectangle
    from PyPDF2 import PdfReader

    reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")
    page = reader.pages[18]
    coo = []

    def visitor_before(op, args, cm, tm):
        # "re" appends a rectangle to the path; its operands are x, y, width, height
        if op == b"re":
            coo.append([args[i].as_numeric() for i in range(4)])

    def visitor_text(text, cm, tm, fontDict, fontSize):
        pass

    page.extract_text(visitor_operand_before=visitor_before, visitor_text=visitor_text)

    fig, ax = plt.subplots()
    fig.set_dpi(300)
    ax.set_aspect(1)
    ax.set_xlim(0, 612)
    ax.set_ylim(0, 792)
    for c in coo:
        ax.add_patch(Rectangle((c[0], c[1]), c[2], c[3], fill=False, lw=0.2))
    plt.show()

What we expect and what we get: [two images, not preserved]

Please note that the rectangles belonging to the header and footer are positioned much better.
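If the misplaced rectangles come from the current transformation matrix not being applied (an assumption, not something confirmed here), one thing to try is mapping each `re` rectangle through the `cm` value the visitor receives before plotting it. The sketch below is plain Python; `apply_matrix` and `rect_to_page_space` are hypothetical helper names, not part of PyPDF2 or pypdf. It assumes the usual PDF convention that a matrix `[a, b, c, d, e, f]` maps `(x, y)` to `(a*x + c*y + e, b*x + d*y + f)`.

```python
def apply_matrix(m, x, y):
    """Map a point through a PDF affine matrix [a, b, c, d, e, f]."""
    a, b, c, d, e, f = m
    return a * x + c * y + e, b * x + d * y + f


def rect_to_page_space(rect, cm):
    """Transform an (x, y, width, height) rectangle by cm.

    Returns the axis-aligned bounding box of the transformed corners,
    so a flipped axis (negative scale) still yields positive sizes.
    A rotated rectangle is only approximated by its bounding box.
    """
    x, y, w, h = rect
    corners = [
        apply_matrix(cm, px, py)
        for px, py in ((x, y), (x + w, y), (x, y + h), (x + w, y + h))
    ]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)
```

In `visitor_before`, this would replace the plain append with something like `coo.append(rect_to_page_space([args[i].as_numeric() for i in range(4)], cm))`. Whether that fully restores the expected layout for this file is untested.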
-
I'm using pypdf 3.7.0 to extract text from a PDF file. I need each text fragment's location for subsequent processing, so I extract the text along with its x and y coordinates from the text matrix. The x coordinates look fine, but the y coordinates are wrong.
I checked the page size to make sure the file wasn't scaled, and it is as expected (612x792).
I think one way to solve this is to combine the transformation matrix (cm) with the text matrix (tm), but I haven't figured out how to do that.
Note: The reason I suspect the transformation matrix (cm) is that for other PDF files its value is always [1,0,0,1,0,0]. For this file, however, the values of cm keep changing (especially the last two elements of the matrix).
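For reference, combining the two matrices might look like the sketch below. It assumes the standard PDF convention that text space is mapped to page space by multiplying the text matrix by the current transformation matrix (tm first, then cm), with each matrix given as six numbers `[a, b, c, d, e, f]`. The helper names `compose` and `text_origin` are hypothetical and not part of pypdf, and the sketch ignores font size, text rise, and horizontal scaling.

```python
def compose(m, n):
    """Compose two PDF affine matrices: apply m first, then n.

    Each matrix is [a, b, c, d, e, f], standing for the 3x3 matrix
    [[a, b, 0], [c, d, 0], [e, f, 1]] that acts on row vectors [x, y, 1].
    """
    return [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]


def text_origin(tm, cm):
    """Return the page-space (x, y) of the text origin for the given tm and cm."""
    combined = compose(tm, cm)
    return combined[4], combined[5]
```

Inside a `visitor_text` callback, one would then call `x, y = text_origin(tm, cm)` instead of reading `tm[4]` and `tm[5]` directly. When cm is `[1,0,0,1,0,0]` the result is unchanged, which would be consistent with other files behaving correctly.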
Link to the pdf file: https://drive.google.com/file/d/10KMQVAJPB2hQSOOT6OrnF0RGg82k6i31/view?usp=sharing
Below is a code example for the first page (the issue occurs on every page).
I printed the result of some of the transformation and text matrices: