Grouping of lines based on grayscale #806

Faustilus · 2023-02-03T09:54:31Z


Hey there!
Thanks for the library! Looks very promising!

My first test, with default parameters, was ok-ish, but it lost valuable info in the process.

I'm trying to extract the tables from the following PDF. The tables start on page 3.

My code for now:

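import pdfplumber
import pandas as pd

tables = []
# The context manager closes the PDF even if extraction raises
with pdfplumber.open("vaha.pdf") as pdf:
    for page in pdf.pages:
        table = page.extract_table()
        # extract_table() returns None when no table is found on a page
        if table is not None:
            tables.append(table)

df = pd.concat([pd.DataFrame(table) for table in tables])
df.to_csv("vaha.csv", index=False, sep=";")
df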

The problems:
Number 1:
To the human eye it is clear that the info inside one row, marked by a distinctive color or grayscale, belongs to one row.
The result, however, is returned in lines like so:

That is rather remarkable, because the headers get grouped while the following rows don't.

Number 2:
It seems the first column is missing some values.

Are there options to fix the output?

jsvine · 2023-02-03T19:36:44Z

Maintainer

Hi @Faustilus, and thanks for your interest in this library. For me, the first thing I do when I see a confusing table parse is to use the visual debugging tools. For instance:

import pdfplumber
pdf = pdfplumber.open("/Users/jeremy/Downloads/vaha.pdf")
page = pdf.pages[2]
im = page.to_image()
im.debug_tablefinder()

... results in this:

A few things you'll note:

- The main objects dividing the rows are in fact those gray rectangles
- But because the non-gray rows don't have any enclosing rectangles, the extraction is missing at the edges of those rows
- There is a dividing line in the Pneumatico column's rows, which is causing the splitting you see
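
One way to confirm these observations numerically is to tally the widths of the objects in page.rects: the full-width gray bands and the short dividers should show up as separate clusters. The following is only a sketch; the helper names `rect_width_counts` and `print_rect_widths` are made up for illustration and are not part of pdfplumber.

```python
from collections import Counter


def rect_width_counts(rects, ndigits=0):
    """Count rectangles by rounded width, widest first.

    `rects` is a list of dicts with a "width" key, as in page.rects.
    """
    counts = Counter(round(r["width"], ndigits) for r in rects)
    return sorted(counts.items(), reverse=True)


def print_rect_widths(path, page_index=2):
    """Print the width tally for one page (hypothetical convenience wrapper)."""
    import pdfplumber  # imported here so the helper above works without it

    with pdfplumber.open(path) as pdf:
        for width, count in rect_width_counts(pdf.pages[page_index].rects):
            print(f"width ~{width}: {count} rect(s)")
```

Calling `print_rect_widths("vaha.pdf")` would list each width cluster, which can help when choosing a cut-off for what counts as a short divider.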

To get a better extraction, I'd suggest:

- Adding "explicit_vertical_lines": [X1, X2] to the table-extraction settings, where X1 is the x-position of the left side of the table, and X2 is the x-position of the right side. Depending on your task, you could specify those manually or dynamically (e.g., by deriving them from what you see in page.rects).
- Filtering out those short dividing lines (which are actually encoded as rectangles) via filtered = page.filter(lambda obj: not (obj["object_type"] == "rect" and 10 < obj["width"] < 100)).
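
Put together, the two suggestions might look roughly like the sketch below. It assumes the widest rectangles on the page are the gray row bands and uses their edges as X1 and X2; that heuristic, the 0.9 tolerance, and the helper names (`is_short_divider`, `table_x_bounds`, `extract_page_table`) are assumptions for illustration, not something the library provides. It also assumes the page contains at least one rectangle.

```python
def is_short_divider(obj):
    """True for the short rectangles that act as dividing lines inside a column."""
    return obj["object_type"] == "rect" and 10 < obj["width"] < 100


def table_x_bounds(rects):
    """Left and right x-positions of the table.

    Assumption: the widest rectangles are the gray row bands, so their
    outer edges mark the sides of the table.
    """
    widest = max(r["width"] for r in rects)
    bands = [r for r in rects if r["width"] >= 0.9 * widest]
    return min(r["x0"] for r in bands), max(r["x1"] for r in bands)


def extract_page_table(path, page_index=2):
    """Extract one page's table with explicit side lines and dividers removed."""
    import pdfplumber  # imported here so the helpers above work without it

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index]
        x_left, x_right = table_x_bounds(page.rects)
        # Drop the short divider rectangles before running the table finder
        filtered = page.filter(lambda obj: not is_short_divider(obj))
        return filtered.extract_table(
            table_settings={"explicit_vertical_lines": [x_left, x_right]}
        )
```

For example, `extract_page_table("vaha.pdf", page_index=2)` would process the third page. The width thresholds come from the filter above and may need adjusting for other documents.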
