Grouping of lines based on grayscale #806
Faustilus started this conversation in Ask for help with specific PDFs
Replies: 1 comment
-
Hi @Faustilus, and thanks for your interest in this library. For me, the first thing I do when I see a confusing table parse is to use the visual debugging tools. For instance:
import pdfplumber
pdf = pdfplumber.open("/Users/jeremy/Downloads/vaha.pdf")
page = pdf.pages[2]
im = page.to_image()
im.debug_tablefinder()
A few things you'll note:
To get a better extraction, I'd suggest:
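As a general illustration rather than a tested fix for this PDF: pdfplumber lets you pass a `table_settings` dict to `extract_table()`, which controls how the table finder picks row and column boundaries. Below is a minimal sketch. The helper name `extract_tables`, the `first_page` parameter, and the specific values (which, as far as I know, match the library defaults) are assumptions for illustration, so expect to adjust them while watching the debug image.

```python
# Illustrative settings only: these values are assumptions, not tuned for vaha.pdf.
TABLE_SETTINGS = {
    "vertical_strategy": "lines",    # derive column boundaries from drawn lines/rect edges
    "horizontal_strategy": "lines",  # derive row boundaries from drawn lines/rect edges
    "snap_tolerance": 3,             # snap nearly aligned edges together
    "join_tolerance": 3,             # join edge segments that almost touch
    "intersection_tolerance": 3,     # how close edges must be to count as crossing
}


def extract_tables(path, first_page=2, settings=TABLE_SETTINGS):
    """Hypothetical helper: one list of rows per page, starting at index `first_page`."""
    # Imported lazily so the settings dict can be inspected without pdfplumber installed.
    import pdfplumber

    tables = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages[first_page:]:
            table = page.extract_table(table_settings=settings)
            if table is not None:  # extract_table() returns None when no table is found
                tables.append(table)
    return tables
```

The same dict can be handed to `im.debug_tablefinder(TABLE_SETTINGS)`, so you can check visually what a change does before running the extraction, e.g. `tables = extract_tables("vaha.pdf")`.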
-
Hey there!
Thanks for the library! Looks very promising!
My first test, without any parameters, was ok-ish. However, some valuable info was lost in the process.
I am trying to extract the tables from the following PDF. The tables start at page 3.
vaha.pdf
My code for now:
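import pdfplumber
import pandas as pd
pdf = pdfplumber.open("vaha.pdf")
tables = []
for page in pdf.pages:
    # extract_table() returns the largest table on the page, or None if none is found
    table = page.extract_table()
    tables.append(table)
pdf.close()
# Stack the per-page tables into a single DataFrame
df = pd.concat([pd.DataFrame(table) for table in tables])
df.to_csv("vaha.csv", index=False, sep=";")
df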
The problems:
Number 1:
To the human eye it is clear that the info inside one row, marked by a distinctive color or grayscale, belongs to one row.
The result, however, is returned as separate lines, like so:
That is rather remarkable, because the headers get grouped while the following rows don't.
Number 2:
It seems like the first column is missing some values.
Are there options to fix the output?