Visual Setting in PdfPlumber #934

Pranavagrl1 · 2023-07-12T10:02:12Z

Hi, is there any setting in this library so that I can extract only those tables that are actually drawn (visible) on the page?
Dummy.pdf

cmdlineluser · 2023-07-12T17:36:34Z

There may be a simpler way to do this using the table settings, but this is what I did:

There are several nested rects, which are the reason for the empty strings and Nones in the extracted tables. They show up if we draw the page's rects with page1.to_image(300).draw_rects(page1.rects).save('page1.png')

If we remove all of these inner rects, we are mostly there, except for the last row in table 1, where there is now a gap where the "empty" rects were.

On page 2 there is a similar issue: the right side of the table is not a "full line".

To remove the inner rects, we can check if we are inside any larger rect.
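As a rough illustration of that containment check, here is a minimal sketch on hand-made dicts that use the same x0/top/x1/bottom keys pdfplumber puts on rect objects. The coordinates and the helper name is_nested are made up for the example; they are not part of pdfplumber.

```python
def inside(inner, outer):
    """True if inner lies within outer (shared edges count as inside)."""
    return (
        inner["x0"] >= outer["x0"]
        and inner["top"] >= outer["top"]
        and inner["x1"] <= outer["x1"]
        and inner["bottom"] <= outer["bottom"]
    )


def is_nested(rect, rects):
    """True if some other rect in rects contains rect (hypothetical helper)."""
    return any(inside(rect, other) for other in rects if other is not rect)


# Hypothetical coordinates: a table cell, a smaller rect drawn inside it,
# and an unrelated cell further down the page.
cell = {"x0": 50, "top": 100, "x1": 300, "bottom": 120}
inner = {"x0": 55, "top": 102, "x1": 200, "bottom": 118}
separate = {"x0": 50, "top": 200, "x1": 300, "bottom": 220}

rects = [cell, inner, separate]
nested = [r for r in rects if is_nested(r, rects)]
```

With these values only the small inner rect is flagged. Two rects with identical coordinates would each count as nested in the other, so real data may need a tie-break.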

We can fix the bottom and right sides of the table by adding explicit lines taken from the table's bbox values.
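To make the bbox indexing concrete: a pdfplumber bbox is ordered (x0, top, x1, bottom), so the bottom edge is bbox[3] and the right edge is bbox[2]. The sketch below builds the corresponding table settings from a made-up bbox; the function name closing_line_settings is hypothetical, while the two setting keys are pdfplumber's own.

```python
def closing_line_settings(bbox):
    """Build table settings that add lines along the bottom and right edges of bbox."""
    x0, top, x1, bottom = bbox
    return {
        # a number here means a horizontal line at that y coordinate
        "explicit_horizontal_lines": [bottom],
        # a number here means a vertical line at that x coordinate
        "explicit_vertical_lines": [x1],
    }


# Hypothetical bbox, not taken from Dummy.pdf
settings = closing_line_settings((50, 100, 300, 220))
```

A dict like this can then be passed to extract_table() on the cropped page.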

from operator import itemgetter

import pdfplumber

def inside(self, other):
    return all((
        self['x0'] >= other['x0'],
        self['top'] >= other['top'],
        self['x1'] <= other['x1'],
        self['bottom'] <= other['bottom']
    ))

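def largest_parent_rect(page, self):
    # every rect is inside itself, so parent_rects always contains `self`
    parent_rects = [other for other in page.rects if inside(self, other)]
    if parent_rects:
        # widest rect wins; height breaks ties
        parent_rect = max(parent_rects, key=itemgetter('width', 'height'))
        # implicitly return None when the rect is its own largest container
        if self != parent_rect:
            return parent_rect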

def remove_nested_rects(page, keep_largest=False):
    def filter_condition(other):
        if other['object_type'] == 'rect':
            return tuple(other['pts']) not in rects_to_remove
        return True

    rects_to_remove = set()

    for rect in page.rects:
        parent = largest_parent_rect(page, rect)
        if parent is not None:
            # rect sits inside a larger rect: drop it
            rects_to_remove.add(tuple(rect['pts']))
            # by default the enclosing rect is dropped as well
            if keep_largest is False:
                rects_to_remove.add(tuple(parent['pts']))

    return page.filter(filter_condition)

# the context manager closes the file when we are done
with pdfplumber.open('Downloads/Dummy-1.pdf') as pdf:
    for page in pdf.pages:
        filtered_page = remove_nested_rects(page)

        for table in filtered_page.find_tables():
            # fill in bottom and right lines; bbox is (x0, top, x1, bottom)
            rows = filtered_page.crop(table.bbox).extract_table(dict(
                explicit_horizontal_lines=[table.bbox[3]],
                explicit_vertical_lines=[table.bbox[2]]
            ))
            print(rows)

[['Item', 'Description'],
 ['Product Name:', 'Iphone'],
 ['Sector Name:', 'Technology'],
 ['Department /Function Name:', 'Mobile'],
 ['Version Number:', '5.0'],
 ['Process Owner:', 'Apple'],
 ['Reviewed by (Business Process Management):', 'Me'],
 ['Version Date:', '24/5/2013'],
 ['Next Revision Date:', '25/5/2014']]
[['Version', 'Date', 'Prepared by', 'Reviewed by', 'Brief Explanation'],
 ['5.0', '24/5/2013', 'Pranav01', 'Karan', 'New document.']]
[['Name', 'Title', 'Signature'],
 ['Rahul', 'Solftware developer', '28 February 2013'],
 ['pranjal', 'Architecture', '22 March 2013'],
 ['Abhimannu', 'QA Testing', '24 May 2013']]
[['Name', 'Title', 'Signature'],
 ['Shrivastav',
  'Machine Learning\nEngineer',
  '9 March 2013 (Shared by email)'],
 ['Chandak', 'IT Manager', '9 March 2013 (Shared by email)'],
 ['Abhinav', 'Director', '9 March 2013 (Shared by email)']]
[['Name', 'Title', 'Comments for Approver'],
 ['Ahmed', 'Access Delivery Expert', '']]

Pranavagrl1 Jul 13, 2023
Author

Thanks for the help.
