Detecting paragraphs or blank lines inside a table #736

Cristishor201 · 2022-09-26T22:43:18Z

Cristishor201
Sep 26, 2022

So there are these questions on Stack Overflow:
pdfplumber - How to extract table with no horizontal lines? - this one is mine
Use pdfplumber to extract paragraphs - this one is similar

But I will repost it here.
My PDF looks like this:

As you can see, there are no horizontal lines inside the table. I need some sort of parameter to split the data in the second column, like this:
['PRODUCT 1\ndescription line 1\ndescription line 2', 'PRODUCT 2\ndescription line 1', 'PRODUCT 3\ndescription line 1\ndescription line 2'] - on vertical extraction (I skipped the other columns)

or
[['1', 'PRODUCT 1\ndescription line 1\ndescription line 2', 'BUC', '1', '35.00', '35.00', '6.65'], ['2', 'PRODUCT 2\ndescription line 1', 'buc', '1', '7.00', '7.00', '1.33'], ['3', 'PRODUCT 3\ndescription line 1\ndescription line 2', 'buc', '1', '31.00', '31.00', '5.89']] - on horizontal extraction

In the image, I drew red rectangles to show where the rows should be split.

Cristishor201 · 2022-09-26T22:58:12Z

Cristishor201
Sep 26, 2022
Author

Possible duplicate of #122

0 replies

jsvine · 2022-09-27T21:04:40Z

jsvine
Sep 27, 2022
Maintainer

Hi @Cristishor201, and thanks for your interest in pdfplumber. Have you tried using the "horizontal_strategy": "text" setting documented here? If so, what results do you get?

Depending on the specifics of the PDF (sharing it in this thread will help), that setting may not be fully sufficient, but it's a start.

5 replies

samkit-jain Sep 28, 2022
Collaborator

Extending @jsvine's suggestion to use the text strategy, you can add a post-processing layer that starts a new row whenever the distance between two consecutive horizontal lines exceeds a threshold. In your case, there is clearly a big enough gap wherever you want to split, and that gap can serve as the new row identifier in the post-processing layer.
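A minimal sketch of that post-processing idea, assuming each text line is already available as a (top, bottom, text) tuple (for example, built from the output of page.extract_words()). The helper name group_lines_by_gap and the sample coordinates are hypothetical, made up for illustration:

```python
def group_lines_by_gap(lines, threshold):
    """Group (top, bottom, text) lines into rows.

    A new row starts whenever the vertical gap between the previous
    line's bottom and the current line's top is at least `threshold`.
    """
    groups = []
    prev_bottom = None
    for top, bottom, text in sorted(lines):
        if prev_bottom is None or top - prev_bottom >= threshold:
            groups.append([text])      # large gap: start a new row
        else:
            groups[-1].append(text)    # small gap: same row as before
        prev_bottom = bottom
    return ["\n".join(group) for group in groups]


# Made-up coordinates, for illustration only
lines = [
    (100, 110, "PRODUCT 1"),
    (112, 122, "description line 1"),
    (124, 134, "description line 2"),
    (150, 160, "PRODUCT 2"),
    (162, 172, "description line 1"),
]
rows = group_lines_by_gap(lines, threshold=10)
```

With these sample values, the 16-point gap before "PRODUCT 2" starts a new row, while the 2-point gaps between description lines do not.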

Cristishor201 Sep 28, 2022
Author

With find_tables() there is no difference.

For
print(page.extract_table(table_settings={ "horizontal_strategy": "text", "snap_y_tolerance": 4, }))

I get something like this

[['Nr. crt', 'Products name', None, 'U.M.', 'QTY', '', '-$-', 'TVA'],
['', '', None, '', '', '-$-', '', '-$-'],
['', '', None, '', '', '', '', ''],
['0', '1', None, '2', '3', '4', '5(3x4)', '6'],
['', '', None, '', '', '', '', ''],
['', 'PRODUCT 1', None, '', '', '', '', ''],
['1', 'description line 1', None, 'BUC', '1', '35.00', '35.00', '6.65'],
['', 'description line 2', None, '', '', '', '', ''],
['', '', None, '', '', '', '', ''],
['', '', None, '', '', '', '', ''],
['2', 'PRODUCT 2', None, 'buc', '1', '7.00', '7.00', '1.33'],
['', 'description line 1', None, '', '', '', '', ''],
['', 'PRODUCT 3', None, '', '', '', '', ''],
['3', 'description line 1', None, 'buc', '1', '31.00', '31.00', '5.89'],
['', 'description line 2', None, '', '', '', '', ''],
['', '', None, '', '', '', '', ''],
['', '', None, '', '', '', '', '']]

Pay attention to the second item in each row, which is either "" (an empty value) or a value from the table.
As you can see, I can't tell which lines belong together, because each line of text comes back as its own row. The empty values also appear at random, so I can't guess where the actual split is.

If I play around with "snap_y_tolerance": 4 and set it to, e.g., 1, I get even more random empty values.

Something is not working. That's why I filed this as a feature request.

Cristishor201 Sep 28, 2022
Author

I can't provide the PDF, as it contains sensitive information.
But I can tell you it's a smartBill invoice template like this one:

which also incorporates a description for each product.

samkit-jain Sep 30, 2022
Collaborator

You can use code similar to the following to get the coordinates of all the horizontal lines in the table.

tables = page.find_tables(table_settings={"horizontal_strategy": "text", "vertical_strategy": "lines"})

# get the largest table
table = max(tables, key=lambda x: len(x.cells))
h_lines = []

# add lines
for cell in table.cells:
    h_lines.append(cell[1])

# add the last line
h_lines.append(table.cells[-1][3])

Then, as a post-processing step, you can keep only those lines whose distance from the previous line is at least your threshold (which could be the average height of characters on the page):

# my_threshold is a value you choose, e.g. the average character height
h_lines = [
    l2
    for l1, l2 in zip(h_lines, h_lines[1:])
    if abs(l1 - l2) >= my_threshold
]

And once that is done, extract the table again but this time use the strategy as

{"horizontal_strategy": "explicit", "explicit_horizontal_lines": h_lines, "vertical_strategy": "lines"}

Since you can't share the PDF, I can't guarantee that this will work, but I believe it will give you a start. You can customise it as per your needs.
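The steps above can be combined into one sketch. This is a variation and not a tested solution: it assumes `page` is a pdfplumber page, it sorts and de-duplicates the line positions, and (unlike the snippet above) it always keeps the first and last line so the top and bottom rows keep their boundaries. The function names filter_lines_by_gap and extract_rows_split_on_gaps are hypothetical.

```python
def filter_lines_by_gap(h_lines, threshold):
    """Keep the first line, every line that lies at least `threshold`
    below the line before it, and the last line."""
    ordered = sorted(set(h_lines))
    if not ordered:
        return []
    kept = [ordered[0]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev >= threshold:
            kept.append(cur)
    if kept[-1] != ordered[-1]:
        kept.append(ordered[-1])  # keep the bottom edge of the table
    return kept


def extract_rows_split_on_gaps(page, threshold):
    """Two-pass extraction: detect text-based lines, then re-extract
    using only the lines that follow a large vertical gap."""
    tables = page.find_tables(
        table_settings={
            "horizontal_strategy": "text",
            "vertical_strategy": "lines",
        }
    )
    if not tables:
        return None
    # Use the largest table, as in the snippet above
    table = max(tables, key=lambda t: len(t.cells))
    # Each cell is (x0, top, x1, bottom); collect tops and bottoms
    h_lines = [c[1] for c in table.cells] + [c[3] for c in table.cells]
    kept = filter_lines_by_gap(h_lines, threshold)
    return page.extract_table(
        table_settings={
            "horizontal_strategy": "explicit",
            "explicit_horizontal_lines": kept,
            "vertical_strategy": "lines",
        }
    )
```

For example, with made-up line positions [100, 111, 123, 134, 150, 160] and a threshold of 14, only the line at 150 follows a large enough gap, so the kept lines are 100, 150, and 160.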

Cristishor201 Oct 5, 2022
Author

Interesting idea.
But now I need to find the threshold, as it's much higher than the average.
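One possible way to pick the threshold, offered only as a heuristic and not as anything pdfplumber provides: sort the gaps between consecutive horizontal lines and place the threshold in the middle of the largest jump between neighbouring gap sizes. It assumes the gaps between products are clearly larger than the gaps between lines within a product. The function name suggest_threshold and the sample positions are hypothetical.

```python
def suggest_threshold(h_lines):
    """Suggest a gap threshold from a list of horizontal line positions.

    Returns the midpoint of the largest jump in the sorted gap sizes,
    or None if there are too few lines to compare.
    """
    ordered = sorted(set(h_lines))
    gaps = sorted(b - a for a, b in zip(ordered, ordered[1:]))
    if len(gaps) < 2:
        return None
    # Find where the sorted gap sizes jump the most
    _, i = max((gaps[k + 1] - gaps[k], k) for k in range(len(gaps) - 1))
    return (gaps[i] + gaps[i + 1]) / 2


# Made-up line positions, for illustration only
threshold = suggest_threshold([100, 111, 123, 134, 150, 160, 171, 190])
```

Printing the sorted gaps for a real page is a quick way to check whether such a clear jump exists before relying on the suggested value.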
