Text extraction #377

erkin98 · 2021-03-17T08:17:22Z

erkin98
Mar 17, 2021

How can I extract text from a particular area that is not surrounded by lines?


samkit-jain · 2021-03-17T09:22:32Z

samkit-jain
Mar 17, 2021
Collaborator

Hi @erkin98, could you please share the PDF and give more details on where that area is in it?


erkin98 · 2021-03-17T10:17:58Z

erkin98
Mar 17, 2021
Author

https://drive.google.com/file/d/18smAX6VTvqbfyEQ6e20Tg6iOl3mDBMZF/view?usp=sharing
In this PDF, I want to crop the area at the top left as a table. Is that possible, or should I just parse it as text?


samkit-jain · 2021-03-17T12:15:41Z

samkit-jain
Mar 17, 2021
Collaborator

Thanks for sharing the PDF, @erkin98. Since that top-left text is not wrapped in a rect object, there is no straightforward way to extract it. One workaround is to get the coordinates of the horizontal lines at the top and bottom of the text, crop the page to that region, and then extract the text from it.

You can do so by running the following code:

import pdfplumber

with pdfplumber.open("file.pdf") as pdf:
    page = pdf.pages[0]
    top_line = page.horizontal_edges[2]  # The top line is the 3rd horizontal edge.
    bottom_line = page.horizontal_edges[4]  # The bottom line is the 5th horizontal edge.

    # Build a bounding box (x0, top, x1, bottom) from the two horizontal edges.
    cropped = page.crop(
        (top_line["x0"], top_line["top"], top_line["x1"], bottom_line["bottom"])
    )

    print(cropped.extract_text())

The cropped page looks like

[Image: the cropped page]

and the output is

BAYBURT GRUP İNŞAAT NAKLİYAT MAD.İT.İHR.SAN.VE TİC.A.Ş.
IŞIK SOKAK  No:20 
06570 Çankaya/ Ankara 
Tel: 3122290808 Fax: 3122290808 
Web Sitesi: .
E-Posta: muhasebe@bayburtgroup.com
Vergi Dairesi: MALTEPE VD. 
TICARETSICILNO: 309029
MERSISNO: 0151043382200021
VKN: 1510433822

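The indices 2 and 4 above are specific to this PDF. For other files, it may help to list the horizontal edges in reading order first and then pick the two that bracket the text. The following is only a rough sketch of that idea: the helper names (`sorted_horizontal_edges`, `bbox_between`, `extract_text_between`) are made up for illustration and are not part of pdfplumber, and because the edges are sorted here, the indices will not necessarily match those of `page.horizontal_edges`.

```python
from typing import Dict, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]


def sorted_horizontal_edges(edges: Sequence[Dict]) -> List[Dict]:
    """Return edge dicts ordered top to bottom, then left to right."""
    return sorted(edges, key=lambda e: (e["top"], e["x0"]))


def bbox_between(top_edge: Dict, bottom_edge: Dict) -> BBox:
    """Build an (x0, top, x1, bottom) box spanning two horizontal edges."""
    return (top_edge["x0"], top_edge["top"], top_edge["x1"], bottom_edge["bottom"])


def extract_text_between(path: str, top_index: int, bottom_index: int,
                         page_number: int = 0) -> str:
    """Hypothetical helper: crop a page between two sorted edges and extract text."""
    import pdfplumber  # imported here so the helpers above work without it

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]
        edges = sorted_horizontal_edges(page.horizontal_edges)
        # Print the candidates so the right pair of indices can be chosen by eye.
        for i, edge in enumerate(edges):
            print(i, edge["x0"], edge["top"], edge["x1"], edge["bottom"])
        cropped = page.crop(bbox_between(edges[top_index], edges[bottom_index]))
        return cropped.extract_text() or ""
```

Calling `extract_text_between("file.pdf", top_index, bottom_index)` once with any valid indices prints the edge list; the call can then be repeated with the pair that encloses the wanted text.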
1 reply

samkit-jain Mar 17, 2021
Collaborator

@erkin98 I have selected the message as the answer based on the rocket emoji reaction from you.
