search keyword's row & breaks line problem #648

lulu313kobe · 2022-05-05T02:19:13Z


Hi! Thanks for your earlier help and for maintaining such a useful library.
I have a few problems parsing a PDF and would appreciate your help when you have time.
Sorry for my poor English, and apologies if any of my wording comes across as rude!

Problem 1:
I want to extract the value that corresponds to a specific keyword in a PDF file, as shown in pic 1. (The keyword and value are NOT in a table.)
[pic 1.]

The keyword is "ESD", and the value always ends with "KV" or "V".
Can pdfplumber return the whole row that contains a search keyword, or that row's bounding box?

My idea: search for "ESD" >>> get the row where it is located >>> check whether the row ends with "KV" or "V" >>> take the last number in the row.

Is this idea feasible, or do you have a better idea or a hint?
I'd truly appreciate it.
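
The idea above can be sketched on plain text, independent of how the text was extracted. This is only an illustration: `keyword_values` and `VALUE_AT_END` are hypothetical names, the sample string is invented, and the pattern assumes the value is the last thing on the line.

```python
import re

# Hypothetical pattern: a number followed by KV or V at the end of the line.
VALUE_AT_END = re.compile(r"(\d+(?:\.\d+)?)\s*(KV|V)\s*$", re.IGNORECASE)

def keyword_values(text, keyword="ESD"):
    """Yield (line, number, unit) for lines that contain keyword and end in KV or V."""
    for line in text.splitlines():
        # Step 1: keep only the rows that mention the keyword.
        if keyword.lower() not in line.lower():
            continue
        # Steps 2 and 3: check the row's ending and pull out the last number.
        match = VALUE_AT_END.search(line)
        if match:
            yield line.strip(), float(match.group(1)), match.group(2).upper()

# Invented sample text standing in for page.extract_text().
sample = (
    "ESD(HBM)(1) . . . . . . . .2KV \n"
    "ESD(CDM) . . . . . 1KV \n"
    "Storage Temperature . . . 150°C"
)
rows = list(keyword_values(sample))
```

Each element of `rows` carries the matching line, the numeric value, and its unit, so the caller can decide how to treat "KV" versus "V".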

Problem 2:
To reduce search time, is it possible to search only the regular text and skip the text inside tables?

Problem 3:
To solve the issue of a line break causing one row to be parsed as two rows, I followed #488.

It works, but when I apply it to the PDF above (which has no line breaks), it filters out the whole table. I tried other PDFs without line breaks and they were not affected. (Maybe this PDF has special attributes?)

Could you give me a hint, a debugging tip, or a solution for this issue?

Many Thanks!

lulu313kobe (Author) · 2022-05-05T08:39:10Z

About problem 1:

import re
import pdfplumber

def SearchPdf(pdf_filename: str, target: str):
    """Return the indices and text of every page whose text matches target."""
    text_list = []
    match_page_list = []
    with pdfplumber.open(pdf_filename) as pdf:
        for i, page in enumerate(pdf.pages):
            # extract_text() may return nothing for an empty page
            text = page.extract_text() or ""
            if re.search(target, text, re.IGNORECASE) is not None:
                match_page_list.append(i)
                text_list.append(text)
    # One entry per matching page in both lists
    return match_page_list, text_list

pages, texts = SearchPdf(pdffile, 'esd')

This extracts the text of every page that contains "esd".
With some split() and strip() calls I can then get the row where "esd" is located,
but it seems a bit complex and inefficient, lol.

Is there a better way to do the keyword search and to get the keyword's row?

Many Thanks!


jsvine (Maintainer) · 2022-05-06T15:59:54Z

Hi @lulu313kobe, and thanks for the kind words. There is not currently a direct method for extracting keyword bounding boxes, but it's a feature I'd like to add someday: #201

In the meantime, this code might be a slightly more efficient and flexible way to get what you want, taking advantage of regular expression capture groups:

import re
from pprint import pprint
import pdfplumber

pdf = pdfplumber.open("test-2.pdf")

def extract_pattern_rows(pdf, pattern):
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split("\n"):
            match = re.search(pattern, line)
            if match is not None:
                yield dict(
                    page=page.page_number,
                    line=line,
                    match=match.group(0),
                    groups=match.groups()
                )

# Values end in "KV" or "V", so the "K" is the optional part.
pattern = re.compile(r"(ESD\(.+\))[ .]{10,}(\d+K?V)", re.IGNORECASE)
result = list(extract_pattern_rows(pdf, pattern))
pprint(result)

... which produces:

[{'groups': ('ESD(HBM)(1)', '2KV'),
  'line': 'Reflow Temperature (soldering, 10sec) . . . . . . .260°C  '
          'ESD(HBM)(1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . '
          '.2KV ',
  'match': 'ESD(HBM)(1) . . . . . . . . . . . . . . . . . . . . . . . . . . . '
           '. .2KV',
  'page': 1},
 {'groups': ('ESD(CDM)', '1KV'),
  'line': 'ESD(CDM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . '
          '. 1KV ',
  'match': 'ESD(CDM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . '
           '. 1KV',
  'page': 1}]
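
If a row's bounding box is needed rather than its text, one possible workaround is to group the dicts returned by `page.extract_words()` by their vertical position. The sketch below is an assumption-laden illustration, not a pdfplumber feature: `keyword_row_bboxes` and `y_tolerance` are hypothetical names, and the word coordinates are invented so the example runs without a PDF.

```python
def keyword_row_bboxes(words, keyword, y_tolerance=3):
    """Return one (x0, top, x1, bottom) box per text row containing keyword.

    words is a list of dicts shaped like the output of page.extract_words(),
    i.e. each has "text", "x0", "x1", "top" and "bottom" keys.
    """
    boxes = []
    for hit in words:
        if keyword.lower() not in hit["text"].lower():
            continue
        # Treat every word whose top edge is close to the hit's as the same row.
        row = [w for w in words if abs(w["top"] - hit["top"]) <= y_tolerance]
        box = (
            min(w["x0"] for w in row),
            min(w["top"] for w in row),
            max(w["x1"] for w in row),
            max(w["bottom"] for w in row),
        )
        if box not in boxes:
            boxes.append(box)
    return boxes

# Hand-made stand-in for page.extract_words(); the coordinates are invented.
words = [
    {"text": "ESD(CDM)", "x0": 50, "x1": 95, "top": 100, "bottom": 110},
    {"text": "1KV", "x0": 280, "x1": 300, "top": 100.5, "bottom": 110.5},
    {"text": "Storage", "x0": 50, "x1": 90, "top": 120, "bottom": 130},
]
boxes = keyword_row_bboxes(words, "ESD")
```

With a real page, `words` would come from `page.extract_words()`, and `page.crop(box).extract_text()` could then read back the row. Note that grouping by vertical position alone spans every column on that row, so a two-column layout like the first result above would need an extra horizontal limit.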

Regarding problem 2 (searching only the non-table text to reduce search time):

The code in issue #242 may be helpful for this.
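
As a rough sketch of one way to skip table text (not necessarily identical to what #242 proposes), the page can be filtered so that objects falling inside a detected table are dropped before text extraction. `inside_any_bbox` and `text_outside_tables` are hypothetical helper names; `page.find_tables()`, `table.bbox`, and `page.filter()` are the pdfplumber calls this relies on, and default table-detection settings are assumed.

```python
def inside_any_bbox(obj, bboxes):
    """True if the midpoint of a pdfplumber object lies inside any bounding box."""
    h_mid = (obj["x0"] + obj["x1"]) / 2
    v_mid = (obj["top"] + obj["bottom"]) / 2
    return any(
        x0 <= h_mid < x1 and top <= v_mid < bottom
        for (x0, top, x1, bottom) in bboxes
    )

def text_outside_tables(page):
    """Extract a page's text after dropping every object inside a detected table."""
    # Each detected table exposes its bounding box as (x0, top, x1, bottom).
    bboxes = [table.bbox for table in page.find_tables()]
    # page.filter keeps only the objects for which the test function returns True.
    return page.filter(lambda obj: not inside_any_bbox(obj, bboxes)).extract_text()
```

Calling `text_outside_tables(page)` in place of `page.extract_text()` in the loop above would restrict the search to non-table text. Table detection has its own cost, so whether this actually shortens the total run time would need to be measured on the real files.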

Regarding problem 3 (the approach from #488 for rows split by line breaks, which filters out the whole table in the PDF above):

Unfortunately, I don't think I understand this part of your question. Can you share code that demonstrates the problem you're having, as well as the output you were hoping to achieve?
