Visual Setting in PdfPlumber #934
There may be a simpler way to do this using the table settings, but this is what I did.

There are several nested rects, which are the reason for the resulting empty strings and `None`s.

If we remove all of these inner rects we are mostly there, except for the last row in table 1: there is now a gap where the "empty" rects were. On page 2 there is a similar issue, with the right side of the table not being a "full line".

To remove the inner rects, we can check whether each rect is inside any larger rect. We can fix the bottom and right sides of the table by adding extra explicit lines taken from the table's bbox values.

```python
import pdfplumber
from operator import itemgetter


def inside(self, other):
    # True if rect `self` lies within (or exactly on) the bounds of `other`
    return all((
        self['x0'] >= other['x0'],
        self['top'] >= other['top'],
        self['x1'] <= other['x1'],
        self['bottom'] <= other['bottom']
    ))


def largest_parent_rect(page, self):
    # Widest rect (ties broken by height) that contains `self`;
    # returns None when that rect is `self` itself, i.e. `self` is not nested
    parent_rects = [other for other in page.rects if inside(self, other)]
    if parent_rects:
        parent_rect = max(parent_rects, key=itemgetter('width', 'height'))
        if self != parent_rect:
            return parent_rect


def remove_nested_rects(page, keep_largest=False):
    def filter_condition(other):
        # Drop only the rects marked for removal; keep every other object
        if other['object_type'] == 'rect':
            return tuple(other['pts']) not in rects_to_remove
        return True

    rects_to_remove = set()
    for rect in page.rects:
        parent = largest_parent_rect(page, rect)
        if parent is not None:
            rects_to_remove.add(tuple(rect['pts']))
            if not keep_largest:
                rects_to_remove.add(tuple(parent['pts']))
    return page.filter(filter_condition)


with pdfplumber.open('Downloads/Dummy-1.pdf') as pdf:
    for page in pdf.pages:
        filtered_page = remove_nested_rects(page)
        for table in filtered_page.find_tables():
            # fill in bottom and right lines
            rows = filtered_page.crop(table.bbox).extract_table(dict(
                explicit_horizontal_lines=[table.bbox[3]],
                explicit_vertical_lines=[table.bbox[2]]
            ))
            print(rows)
```

Output:

```
[['Item', 'Description'],
 ['Product Name:', 'Iphone'],
 ['Sector Name:', 'Technology'],
 ['Department /Function Name:', 'Mobile'],
 ['Version Number:', '5.0'],
 ['Process Owner:', 'Apple'],
 ['Reviewed by (Business Process Management):', 'Me'],
 ['Version Date:', '24/5/2013'],
 ['Next Revision Date:', '25/5/2014']]
[['Version', 'Date', 'Prepared by', 'Reviewed by', 'Brief Explanation'],
 ['5.0', '24/5/2013', 'Pranav01', 'Karan', 'New document.']]
[['Name', 'Title', 'Signature'],
 ['Rahul', 'Solftware developer', '28 February 2013'],
 ['pranjal', 'Architecture', '22 March 2013'],
 ['Abhimannu', 'QA Testing', '24 May 2013']]
[['Name', 'Title', 'Signature'],
 ['Shrivastav',
  'Machine Learning\nEngineer',
  '9 March 2013 (Shared by email)'],
 ['Chandak', 'IT Manager', '9 March 2013 (Shared by email)'],
 ['Abhinav', 'Director', '9 March 2013 (Shared by email)']]
[['Name', 'Title', 'Comments for Approver'],
 ['Ahmed', 'Access Delivery Expert', '']]
```
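
To see the containment check in isolation, here is a minimal sketch that runs on plain dicts instead of a real PDF. The coordinates are made up for illustration (they are not taken from Dummy.pdf), and `nested_rects` is a hypothetical helper name; it only identifies the nested rects and does not filter a page or remove the parent.

```python
def inside(inner, outer):
    """Return True if `inner` lies within (or exactly on) the bounds of `outer`."""
    return (
        inner["x0"] >= outer["x0"]
        and inner["top"] >= outer["top"]
        and inner["x1"] <= outer["x1"]
        and inner["bottom"] <= outer["bottom"]
    )


def nested_rects(rects):
    """Return the rects that lie inside some other, different rect."""
    return [
        r for r in rects
        if any(inside(r, other) and r != other for other in rects)
    ]


# Hypothetical rects using the same keys as pdfplumber's rect objects
cell = {"x0": 0, "top": 0, "x1": 100, "bottom": 20}
inner_box = {"x0": 10, "top": 5, "x1": 50, "bottom": 15}
separate = {"x0": 200, "top": 0, "x1": 300, "bottom": 20}

nested = nested_rects([cell, inner_box, separate])
print(nested)  # only inner_box is reported as nested
```

Because the comparison uses `>=` and `<=`, a rect that shares an edge with its parent still counts as nested, which seems to be the case the inner "empty" rects fall into.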
Hi, is there any setting in this library so that I can extract only those tables which are visualized?
Dummy.pdf