Text extraction #377

Answered by samkit-jain
erkin98 asked this question in Q&A
Mar 17, 2021 · 3 comments · 1 reply

Thanks for sharing the PDF, @erkin98. Since the top-left text is not wrapped in a rect object, there is no straightforward way to extract it. One workaround is to get the coordinates of the horizontal lines directly above and below the text, crop the page to the region between them, and then extract text from the cropped page.

You can do so by running the following code:

import pdfplumber

pdf = pdfplumber.open("file.pdf")
page = pdf.pages[0]
top_line = page.horizontal_edges[2]  # The top line is actually the 3rd horizontal edge.
bottom_line = page.horizontal_edges[4]  # The bottom line is actually the 5th horizontal edge.

page = page.crop(
    (top_line["x0"], top_line["top"], top_line["x1"], bottom_line["bottom"])
)
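
The crop-then-extract idea can also be wrapped in small helpers. This is only a sketch under assumptions: the names `bbox_between`, `extract_text_between`, and `extract_from_file` are hypothetical and not part of pdfplumber, and the default edge indices 2 and 4 come from this particular PDF and will likely differ for other files.

```python
def bbox_between(top_line, bottom_line):
    """Build an (x0, top, x1, bottom) box from two horizontal edge dicts.

    The horizontal extent is taken from the top line, and the vertical
    extent runs from the top line down to the bottom line.
    """
    return (
        top_line["x0"],
        top_line["top"],
        top_line["x1"],
        bottom_line["bottom"],
    )


def extract_text_between(page, top_index=2, bottom_index=4):
    """Crop `page` to the area between two horizontal edges and extract text.

    `page` is expected to behave like a pdfplumber page: it needs
    `horizontal_edges`, `crop()`, and `extract_text()`.
    The default indices are an assumption carried over from the PDF
    discussed in this thread.
    """
    edges = page.horizontal_edges
    if max(top_index, bottom_index) >= len(edges):
        raise IndexError(
            f"page has only {len(edges)} horizontal edges; "
            f"requested indices {top_index} and {bottom_index}"
        )
    bbox = bbox_between(edges[top_index], edges[bottom_index])
    return page.crop(bbox).extract_text()


def extract_from_file(path, page_number=0, top_index=2, bottom_index=4):
    """Open a PDF with pdfplumber and run the extraction on one page."""
    import pdfplumber  # imported lazily so the helpers above work without it

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]
        return extract_text_between(page, top_index, bottom_index)
```

Calling `extract_from_file("file.pdf")` should then return the text between the two lines. If the indices are not known in advance, printing `page.horizontal_edges` and comparing the `top` values against the page is one way to find the right pair.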

Answer selected by samkit-jain