Extracting tables with horizontal lines only #947
Replies: 2 comments 6 replies
Perhaps you could:
import itertools

# bottom edge of every bold section number (e.g. "4.1.1.1")
sections = [
    word['bottom'] for word in page.search(r'(\d[.])+')
    if word['chars'][0]['fontname'].endswith('Bold')
]
sections.append(page.height)

explicit_vertical_lines = []
for top, bottom in itertools.pairwise(sections):  # pairwise needs Python 3.10+
    for edge in page.horizontal_edges:
        if top <= edge['top'] <= bottom:
            # left vertical line
            line = dict(
                x0 = edge['x0'],
                x1 = edge['x0'],
                top = top,
                bottom = bottom,
                height = bottom - top,
                orientation = 'v',
                object_type = 'line'
            )
            explicit_vertical_lines.append(line.copy())
            # right vertical line
            line['x0'] = edge['x1']
            line['x1'] = edge['x1']
            explicit_vertical_lines.append(line)

>>> tables = page.extract_tables(dict(explicit_vertical_lines=explicit_vertical_lines))
>>> len(tables)
3
>>> tables[0]
[['', '', '31 December 2021\nEUR', '31 December 2020\nEUR'],
['fees receivable', '', '2,519,013', '1,190,688'],
['Total', '', '7,519,013', '6,190,688']]
>>> tables[1]
[['', '', '31 December 2021\nEUR', '', '31 December 2020\nEUR'],
['Taxes paid in advance\nOther receivables',
'',
'2,407,655\n12,349',
'',
'1,475,421\n-'],
['Total', '', '1,440,004', '', '1,475,421']]
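An alternative sketch, not part of the reply above: instead of relying on the added vertical lines alone, each band between two consecutive section headings could be cropped and extracted on its own, so that tables from different sections cannot be merged. The helper name `section_bboxes` is hypothetical. It assumes the `sections` list of heading bottoms built above and a page whose origin is at (0, 0).

```python
def section_bboxes(sections, page_width):
    """Turn sorted section boundaries (y coordinates) into crop boxes.

    Each box is (x0, top, x1, bottom), the tuple order that pdfplumber's
    page.crop() takes. Zero-height bands are skipped.
    """
    return [
        (0, top, page_width, bottom)
        for top, bottom in zip(sections, sections[1:])
        if bottom > top
    ]

# Usage sketch (assumes `page` and `sections` from the snippet above):
# for bbox in section_bboxes(sections, page.width):
#     table = page.crop(bbox).extract_table(table_settings)
```

Cropping only isolates each band. Because these tables have horizontal lines only, some table settings would probably still be needed, for example the explicit vertical lines above or `vertical_strategy="text"`.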
Hello @cmdlineluser, thank you again for your assistance earlier this month. I was wondering if you could help me with the following. On pages 1 to 3, each table is extracted to a separate dataframe, whereas on page 4 all tables are extracted into one dataframe. The same happens on page 5: all tables end up in one dataframe. Unfortunately, I need each table in a separate dataframe, and I can't find the reason why they are merged into one on pages 4 and 5.
I would really appreciate your help! The code is:
import pdfplumber
pdf = pdfplumber.open("Test_ano.pdf")
start_page = 1
all_tables_dict = {}
for page_number in range(start_page - 1, end_page):
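The loop above is cut off, so the following is only a guess at its intent. One way to keep every table separate is to store each one under its own (page number, table index) key. `collect_tables` is a hypothetical helper. It relies only on `page.extract_tables()` returning a list with one entry per detected table.

```python
def collect_tables(pages, table_settings=None, start_page=1):
    """Map (page_number, table_index) -> table rows, one entry per table."""
    tables_by_key = {}
    for page_number, page in enumerate(pages, start=start_page):
        # extract_tables() returns a list of tables; each table is a list of rows.
        tables = page.extract_tables(table_settings)
        for table_index, rows in enumerate(tables):
            tables_by_key[(page_number, table_index)] = rows
    return tables_by_key

# Usage sketch with pdfplumber (file name taken from the post):
# with pdfplumber.open("Test_ano.pdf") as pdf:
#     separate = collect_tables(pdf.pages)
# Each value could then be turned into its own dataframe, e.g.
# pandas.DataFrame(rows[1:], columns=rows[0]).
```

If a page still produces a single merged table with this approach, the table detection settings rather than the way the results are stored are the more likely cause.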
Hello.
Could you please advise whether it's possible to extract the tables located under captions 4.1.1.1 (table 1) and 4.1.2.1 (table 2)? I need to extract both tables separately, not as one table.
I'm able to extract the third table (under caption 5.1) but can't find a way to extract the first two tables.
Here is the PDF for reference:
FS_onepage_borders_an.pdf
I would appreciate any suggestions.
Thank you!