multiple lines in pdf extract as a sentence #915
Replies: 3 comments 8 replies
-
Are you using It looks like the data may be a "table" - you could try using the
If the data is public and you can share an example PDF - that would make it easier to help.
-
page4_mod.pdf
I tried to convert the above statement into text using pdfplumber. The second line of each entry in the Narration column is extracted as a separate new line instead of staying with its row. Please help.
-
Thank you very much for your prompt reply.
Your suggestion worked perfectly for bank statements that do not have a security column as the first column. That security column contains a QR code printed vertically (sideways). For those files, your suggested itertools method fails with the error "out of range at lines[-1] += '\n'.join(group)". I worked around it by converting the PDF to Word using the pypdf library, deleting the security (first) column in Word, and then using Word's export option to convert the file back to PDF.
This way I could extract 99% of the data with the extended lines (full narration). Thank you once again.
On Saturday, August 5, 2023 at 06:53:37 PM GMT+5:30, cmdlineluser ***@***.***> wrote:
You could probably open a new post for this if you're still stuck.
If you use keep_blank_chars=True, for example, each of the "new lines" starts with spaces.
One common approach is to rejoin the lines afterwards using itertools.groupby:
>>> import itertools
>>> import pdfplumber
>>> with pdfplumber.open("Downloads/page4_mod.pdf") as pdf:
...     first_page = pdf.pages[0]
...     text = first_page.extract_text(keep_blank_chars=True)
...     # Lines starting with a space begin a new row; the rest are continuations.
...     groups = itertools.groupby(text.splitlines(), lambda line: line.startswith(' '))
...     lines = []
...     for newline, group in groups:
...         if newline:
...             lines.extend(group)
...         else:
...             # Append continuation lines to the most recent row.
...             lines[-1] += '\n'.join(group)
...
>>> lines
[' 02 Aug Transfer to xx1805 CommBank app brain furniture 325.00 $ $12,286.68 CR ',
' 05 Aug Transfer to xx1805 CommBank app 10.00 $ $12,276.68 CR ',
' 06 Aug Transfer to xx1805 CommBank app 40.00 $ $12,236.68 CR ',
' 07 Aug Transfer to xx1805 CommBank app 90.00 $ $12,146.68 CR ',
' 08 Aug Transfer to xx1805 CommBank app tickets 1,250.00 $ $10,896.68 CR ',
' 08 Aug Transfer to xx1805 CommBank app 20.00 $ $10,876.68 CR ',
' 09 Aug Transfer to xx1805 CommBank app 40.00 $ $10,836.68 CR ',
' 09 Aug Transfer from xx1805 CommBank app $10.00 $10,846.68 CR ',
' 11 Aug Transfer from xx1805 CommBank app $650.00 $11,496.68 CR ']
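
The "out of range" error reported in this thread is what `lines[-1]` raises when the very first group is a continuation, so `lines` is still empty (for example, when the page starts with text that does not begin with a space). The following is a minimal sketch of a guarded variant, not the original poster's exact code. The function name `rejoin_wrapped_lines` and the `sep` parameter are made up for illustration, and unlike the snippet above it joins continuations with a single space rather than `'\n'`.

```python
import itertools


def rejoin_wrapped_lines(raw_lines, sep=" "):
    """Merge continuation lines into the row that precedes them.

    Assumption: a line that starts with a space begins a new row
    (as seen with keep_blank_chars=True in this thread); any other
    line is treated as a continuation of the previous row.
    """
    lines = []
    groups = itertools.groupby(raw_lines, lambda line: line.startswith(" "))
    for starts_row, group in groups:
        if starts_row or not lines:
            # Either real rows, or leading continuations with nothing
            # to attach to: keep them as their own entries.
            lines.extend(group)
        else:
            joined = sep.join(part.strip() for part in group)
            lines[-1] = lines[-1].rstrip() + sep + joined
    return lines


# Hypothetical sample: the first line does not start with a space.
sample = ["HEADER", " 02 Aug Transfer", "brain furniture", " 05 Aug Transfer"]
result = rejoin_wrapped_lines(sample)
```

With real data this could be called as `rejoin_wrapped_lines(text.splitlines())`, where `text` comes from `extract_text(keep_blank_chars=True)`.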
-
The text extracted by pdfplumber is:
066-22C104978 - : CC SEQUIN TRIM ZIP CARDI
Semi-Worsted - 2/48NM_52% Cotton, 26% Rayon, 17% Nylon, 3% Spandex, 2%
Primary Material:
Cashmere: YAR-4522 - Cobalt_FOB
Vendor (Supplier): Cobalt_FOB
Season\Floor Set: Chicos Frontline 2022 Holiday, Holiday
How can I extract the text in the following order instead? Is there a parameter that controls this?
066-22C104978 - : CC SEQUIN TRIM ZIP CARDI
Primary Material:
Semi-Worsted - 2/48NM_52% Cotton, 26% Rayon, 17% Nylon, 3% Spandex, 2%
Cashmere: YAR-4522 - Cobalt_FOB
Vendor (Supplier): Cobalt_FOB
Season\Floor Set: Chicos Frontline 2022 Holiday, Holiday
Semi-Worsted - 2/48NM_52% Cotton, 26% Rayon, 17% Nylon, 3% Spandex, 2%
Cashmere: YAR-4522 - Cobalt_FOB
These last two lines should be extracted together as a single sentence.
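
One possible post-processing workaround, sketched under explicit assumptions: the label ("Primary Material:") lands between the two halves of its wrapped value, presumably because it is vertically centered next to them, and the caller knows in advance which labels are affected. The helper name `rejoin_centered_label` and its `labels` argument are hypothetical and are not part of pdfplumber.

```python
def rejoin_centered_label(lines, labels):
    """Move a sandwiched label above its value and join the value halves.

    Assumption: each label in `labels` appears on its own line, with the
    first half of its wrapped value on the line before it and the second
    half on the line after it.
    """
    out = []
    i = 0
    while i < len(lines):
        is_label = lines[i].strip() in labels
        if is_label and out and i + 1 < len(lines):
            first_half = out.pop()        # line emitted just before the label
            second_half = lines[i + 1]    # line that follows the label
            out.append(lines[i])
            out.append(first_half.rstrip() + " " + second_half.lstrip())
            i += 2
        else:
            out.append(lines[i])
            i += 1
    return out


# Lines copied from the extraction result quoted in this post.
extracted = [
    "066-22C104978 - : CC SEQUIN TRIM ZIP CARDI",
    "Semi-Worsted - 2/48NM_52% Cotton, 26% Rayon, 17% Nylon, 3% Spandex, 2%",
    "Primary Material:",
    "Cashmere: YAR-4522 - Cobalt_FOB",
    "Vendor (Supplier): Cobalt_FOB",
]
fixed = rejoin_centered_label(extracted, {"Primary Material:"})
```

This only reorders text after extraction. It does not change how pdfplumber reads the page, and it will misfire if a listed label is not actually sandwiched between the two halves of its value.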