search keyword's row & breaks line problem #648
-
About problem 1:
I can extract every word on the page that contains "esd". I would like to ask whether there is a better way to do the keyword search part and the part that gets the keyword's row. Many thanks!
-
Hi @lulu313kobe, and thanks for the kind words. There is not currently a direct method for extracting keyword bounding boxes, but it's a feature I'd like to add someday: #201

In the meantime, this code might be a slightly more efficient and flexible way to get what you want, taking advantage of regular expression capture groups:

    import re
    from pprint import pprint
    import pdfplumber

    pdf = pdfplumber.open("test-2.pdf")

    def extract_pattern_rows(pdf, pattern):
        # Yield one dict per text line that matches the pattern
        for page in pdf.pages:
            # extract_text() may return None for a page without text
            text = page.extract_text() or ""
            for line in text.split("\n"):
                match = re.search(pattern, line)
                if match is not None:
                    yield dict(
                        page=page.page_number,
                        line=line,
                        match=match.group(0),
                        groups=match.groups()
                    )

    # Group 1: the "ESD(...)" label; group 2: the trailing value such as "2KV"
    pattern = re.compile(r"(ESD\(.+\))[ .]{10,}(\d+KV?)", re.IGNORECASE)
    result = list(extract_pattern_rows(pdf, pattern))
    pprint(result)

... which produces:
The code in issue #242 may be helpful for this.
Unfortunately, I don't think I understand this part of your question. Can you share code that demonstrates the problem you're having, as well as the output you were hoping to achieve?
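For the question about searching only the text that is not inside tables, one possible approach is sketched below. It is an illustration under assumptions, not necessarily what issue #242 does: it finds table bounding boxes with `page.find_tables()`, then uses `page.filter()` to drop every object whose center falls inside one of them before extracting text. The helper names `outside_all` and `extract_text_outside_tables` are hypothetical.

```python
def outside_all(obj, bboxes):
    """Return True if the object's center lies outside every bbox.

    Each bbox is (x0, top, x1, bottom), the layout pdfplumber uses.
    """
    h_mid = (obj["x0"] + obj["x1"]) / 2
    v_mid = (obj["top"] + obj["bottom"]) / 2
    return not any(
        x0 <= h_mid < x1 and top <= v_mid < bottom
        for (x0, top, x1, bottom) in bboxes
    )


def extract_text_outside_tables(path):
    """Hypothetical helper: per-page text with table contents filtered out."""
    import pdfplumber  # imported here so outside_all() works without it

    texts = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # Bounding boxes of the tables detected with default settings
            bboxes = [table.bbox for table in page.find_tables()]
            filtered = page.filter(lambda obj, b=bboxes: outside_all(obj, b))
            texts.append(filtered.extract_text() or "")
    return texts
```

Note that detecting the tables has its own cost, so this narrows what gets searched but may not make the overall run faster.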
-
Hi! Thanks for the previous assistance and for maintaining such a useful library.
I have some problems parsing a PDF and would appreciate your help if you have time.
And sorry for my poor English if my wording offends you!
Problem 1:
I want to extract the value corresponding to a specified word in a PDF file, as shown in pic 1. (The word and value are NOT in a table.)
[pic 1.]
The keyword is "ESD" and the value always ends with "KV" or "V".
So, can pdfplumber return the whole row containing a search keyword, or that row's bounding box?
My idea is: search for "ESD" >>> get the row where "ESD" is located >>> check whether the row ends with "KV" or "V" >>> get the last number in the row.
Is this idea workable, or do you have a better idea or hint?
I'd truly appreciate it.
Problem 2:
To reduce search time, is it possible to search only the regular text, without searching the text inside tables?
Problem 3:
To solve the issue of a line break causing one row to be parsed as two rows, I referred to #488.
It works, but when I apply it to the PDF above (which has no line breaks), it filters out the whole table. I have tried other PDFs without line breaks, and they were not affected. (Maybe this PDF has special attributes?)
Could you give me a hint, a debugging tip, or a solution for this issue?
Many thanks!
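The idea in problem 1 (find the row containing the keyword, check that it ends in "KV" or "V", take the last number) can be sketched on plain text lines, such as those obtained by splitting the output of `page.extract_text()` on newlines. This is a minimal illustration: the function name `find_keyword_values` is hypothetical and the sample lines are invented, not taken from the PDF in question.

```python
import re

# A number followed by "KV" or "V" at the very end of the line
VALUE_AT_END = re.compile(r"(\d+(?:\.\d+)?)\s*(KV|V)\s*$", re.IGNORECASE)


def find_keyword_values(lines, keyword="ESD"):
    """Return (line, number, unit) for each line that contains the
    keyword and ends with a value in KV or V."""
    results = []
    for line in lines:
        # Case-insensitive keyword check
        if keyword.lower() not in line.lower():
            continue
        match = VALUE_AT_END.search(line)
        if match:
            results.append((line, match.group(1), match.group(2).upper()))
    return results


# Invented sample rows, for illustration only
sample = [
    "ESD(HBM) .............. 2KV",
    "Operating temperature .. 85C",
    "ESD(MM) ............... 200V",
]

for row in find_keyword_values(sample):
    print(row)
```

This only returns the matching rows and values; it does not give a bounding box for the row.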