Can a table be extracted without lines? #926
Replies: 2 comments 4 replies
-
You might try cropping to the part of the PDF with the table, and then using the |
-
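Update: As the data seems well formed, perhaps it's simpler to crop and just extract the text lines?

import pdfplumber

pdf = pdfplumber.open('tencent.pdf')
page = pdf.pages[3]  # page 4 of the PDF (pages are zero-indexed)

# Collect the horizontal line edges, keyed by their (left, right) extent.
hlines = {}
for edge in page.horizontal_edges:
    if edge['object_type'] == 'line':
        left, right = edge['x0'], edge['x1']
        hlines[left, right] = edge

# Crop to the full horizontal extent covered by the lines.
left = min(x0 for x0, _ in hlines)
right = max(x1 for _, x1 in hlines)
crop = page.crop((left, 0, right, page.height))
[line['text'].split() for line in crop.extract_text_lines()]

[['財務概要'],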
['截至十二月三十一日止年度'],
['二零一八年', '二零一九年', '二零二零年', '二零二一年', '二零二二年'],
['人民幣百萬元', '人民幣百萬元', '人民幣百萬元', '人民幣百萬元', '人民幣百萬元'],
['312,694', '377,289', '482,064', '560,118', '554,552'],
['142,120', '167,533', '221,532', '245,944', '238,746'],
['94,466', '109,400', '180,022', '248,062', '210,225'],
['79,984', '95,888', '160,125', '227,810', '188,709'],
['78,719', '93,310', '159,847', '224,822', '188,243'],
['67,760', '119,901', '281,173', '200,390', '59,564'],
['66,339', '116,670', '277,834', '200,323', '60,699'],
['92,481', '114,601', '149,404', '159,539', '153,538'],
['77,469', '94,351', '122,742', '123,788', '115,649'],
['於十二月三十一日'],
['二零一八年', '二零一九年', '二零二零年', '二零二一年', '二零二二年'],
['人民幣百萬元', '人民幣百萬元', '人民幣百萬元', '人民幣百萬元', '人民幣百萬元'],
['506,441', '700,018', '1,015,778', '1,127,552', '1,012,142'],
['217,080', '253,968', '317,647', '484,812', '565,989'],
['723,521', '953,986', '1,333,425', '1,612,364', '1,578,131'],
['323,510', '432,706', '703,984', '806,299', '721,391'],
['32,697', '56,118', '74,059', '70,394', '61,469'],
['356,207', '488,824', '778,043', '876,693', '782,860'],
['164,879', '225,006', '286,303', '332,573', '361,067'],
['202,435', '240,156', '269,079', '403,098', '434,204'],
['367,314', '465,162', '555,382', '735,671', '795,271'],
['723,521', '953,986', '1,333,425', '1,612,364', '1,578,131'],
['二零二二年年報', '3']]

Maybe this is helpful: You could target the horizontal lines as @jsvine suggested:

# `page` is a pdfplumber Page, e.g. pdfplumber.open('tencent.pdf').pages[3]
im = page.to_image(resolution=300)
hlines = {}
for edge in page.horizontal_edges:
    if edge['object_type'] == 'line':
        left, right = edge['x0'], edge['x1']
        hlines[left, right] = edge

# im.reset().draw_lines(hlines.values(), stroke_width=10).save('lines.png')

You could then extract the words that fall within the width of those lines:

words = {}
for word in page.extract_words():
    text, x0, x1, top, bottom = (
        word['text'], word['x0'], word['x1'], word['top'], word['bottom']
    )
    for (left, right) in hlines:
        # Keep the word if it sits within the line's width,
        # allowing 0.5pt of slack on each side.
        if ((x0 >= left or abs(left - x0) < 0.5)
                and (x1 <= right or abs(right - x1) < 0.5)):
            words[text, x0, x1, top, bottom] = word
            break

# im.reset().draw_rects(words.values()).save('words.png')

You could then group the words into "rows" based on their "top" position:

rows = []
row = []
for word in words.values():
    # Start a new row when this word's "top" differs from the previous word's.
    if row and abs(word['top'] - row[-1]['top']) >= 1:
        rows.append(row)
        row = []
    row.append(word)
if row:
    rows.append(row)

data = [[col['text'] for col in row] for row in rows]
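
The filtering and grouping steps above can also be tried without a PDF at hand. The following is only a minimal sketch on hand-made word dicts that mimic the keys returned by `page.extract_words()`; the helper names `within_any_span` and `group_into_rows`, the tolerance defaults, and the sample coordinates are all made up for illustration and are not part of pdfplumber. Unlike the loop above, it sorts the words first instead of relying on extraction order.

```python
def within_any_span(word, spans, tol=0.5):
    """Return True if the word's x-range fits inside any (left, right) span,
    allowing `tol` points of slack on each side. (Hypothetical helper.)"""
    return any(
        word["x0"] >= left - tol and word["x1"] <= right + tol
        for left, right in spans
    )


def group_into_rows(words, tol=1.0):
    """Group words into rows: a word joins the current row when its `top`
    is within `tol` of the previous word's `top`. (Hypothetical helper.)"""
    rows = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if rows and abs(word["top"] - rows[-1][-1]["top"]) < tol:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Words in one row may have slightly different tops, so order each
    # row left to right explicitly.
    return [sorted(row, key=lambda w: w["x0"]) for row in rows]


# Made-up sample: one horizontal span and five words, one of them outside it.
spans = [(50.0, 300.0)]
words = [
    {"text": "377,289", "x0": 120.0, "x1": 160.0, "top": 100.2},
    {"text": "312,694", "x0": 49.8, "x1": 90.0, "top": 100.0},
    {"text": "Label", "x0": 10.0, "x1": 40.0, "top": 100.0},
    {"text": "142,120", "x0": 50.0, "x1": 90.0, "top": 115.0},
    {"text": "167,533", "x0": 120.0, "x1": 160.0, "top": 115.1},
]

kept = [w for w in words if within_any_span(w, spans)]
data = [[w["text"] for w in row] for row in group_into_rows(kept)]
print(data)  # → [['312,694', '377,289'], ['142,120', '167,533']]
```

With a real page, `spans` would be the keys of `hlines` and `words` the result of `page.extract_words()`; the tolerances probably need tuning per document.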
-
The table (without lines) on page 4 of the PDF can't be extracted.
Is pdfplumber able to handle this?
tencent.pdf