Can a table without lines be extracted? #926

tujinshu · 2023-07-04T06:27:14Z


The table (without lines) on page 4 of this PDF can't be extracted.

Is pdfplumber able to handle this?
tencent.pdf

jsvine · 2023-07-04T13:35:28Z

Maintainer

You might try cropping to the part of the PDF with the table, and then using the "horizontal_strategy": "text", "vertical_strategy": "text" table-extraction settings. If that doesn't work, you could try programmatically identifying the vertical and horizontal divisions (making use of, for instance, the positions of those short horizontal lines/rects), and then using "explicit_horizontal_lines": [...], "explicit_vertical_lines": [...].
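A minimal sketch of both suggestions, not a definitive implementation. It assumes pdfplumber's `Page.crop` and `Page.extract_table`; the helper names (`extract_with_text_strategy`, `column_dividers`, `explicit_settings`) are hypothetical, and the bounding box and row positions would have to be worked out for the actual PDF.

```python
# First suggestion: infer row and column boundaries from text alignment.
TEXT_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}


def extract_with_text_strategy(page, bbox):
    """Crop a pdfplumber page to bbox = (x0, top, x1, bottom), then extract."""
    return page.crop(bbox).extract_table(TEXT_SETTINGS)


def column_dividers(rule_spans):
    """Turn the (x0, x1) spans of the short horizontal rules into x positions.

    Assumption: one rule per column. Dividers are placed at the outer edges
    and halfway between neighbouring rules.
    """
    spans = sorted(rule_spans)
    dividers = [spans[0][0]]
    for (_, prev_right), (next_left, _) in zip(spans, spans[1:]):
        dividers.append((prev_right + next_left) / 2)
    dividers.append(spans[-1][1])
    return dividers


def explicit_settings(rule_spans, row_boundaries):
    """Second suggestion: pass the computed divisions as explicit lines."""
    return {
        "vertical_strategy": "explicit",
        "horizontal_strategy": "explicit",
        "explicit_vertical_lines": column_dividers(rule_spans),
        "explicit_horizontal_lines": sorted(row_boundaries),
    }
```

The resulting dictionary can be passed to `extract_table` in the same way as `TEXT_SETTINGS`.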

2 replies

tujinshu Jul 5, 2023
Author

Thank you for your reply, I will give it a try!

tujinshu Jul 5, 2023
Author

I used the text strategy to handle tencent.pdf:

import pdfplumber

pdf = pdfplumber.open("tencent.pdf")
p0 = pdf.pages[3]  # page 4 (pages are zero-indexed)
# keep_visible_lines: filter function, definition not shown in this post
p0 = p0.filter(keep_visible_lines)
ts = {
    # Tables without lines and borders
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "snap_x_tolerance": 3,
    "snap_y_tolerance": 3,
    "join_tolerance": 3,
    "join_x_tolerance": 3,
    "join_y_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    # I get an error with this line active, because I'm using find_tables
    "text_tolerance": 3,
    "text_x_tolerance": 3,
    "text_y_tolerance": 3,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": 3,
    "intersection_y_tolerance": 3,
}

# Draw the detected table cells on top of the page image
im = p0.to_image()
im.debug_tablefinder(ts)

But it generates two rows where the original has one row, for example (circled in green):

[Screenshot: 企业微信截图_767368f4-684a-4629-86e3-f2aee73c423d]

Is there any way to fix this?

cmdlineluser · 2023-07-05T05:04:32Z


Update: As the data seems well formed, perhaps it's simpler to crop and just extract the text lines?

# page is the pdfplumber page containing the table, e.g. pdf.pages[3]
hlines = {}
for edge in page.horizontal_edges:
    if edge['object_type'] == 'line':
        left, right = edge['x0'], edge['x1']
        hlines[left, right] = edge

# Crop to the horizontal span covered by the lines
left = min(l for l, _ in hlines)
right = max(r for _, r in hlines)
crop = page.crop((left, 0, right, page.height))

[ line['text'].split() for line in crop.extract_text_lines() ]

[['財務概要'],
 ['截至十二月三十一日止年度'],
 ['二零一八年', '二零一九年', '二零二零年', '二零二一年', '二零二二年'],
 ['人民幣百萬元', '人民幣百萬元', '人民幣百萬元', '人民幣百萬元', '人民幣百萬元'],
 ['312,694', '377,289', '482,064', '560,118', '554,552'],
 ['142,120', '167,533', '221,532', '245,944', '238,746'],
 ['94,466', '109,400', '180,022', '248,062', '210,225'],
 ['79,984', '95,888', '160,125', '227,810', '188,709'],
 ['78,719', '93,310', '159,847', '224,822', '188,243'],
 ['67,760', '119,901', '281,173', '200,390', '59,564'],
 ['66,339', '116,670', '277,834', '200,323', '60,699'],
 ['92,481', '114,601', '149,404', '159,539', '153,538'],
 ['77,469', '94,351', '122,742', '123,788', '115,649'],
 ['於十二月三十一日'],
 ['二零一八年', '二零一九年', '二零二零年', '二零二一年', '二零二二年'],
 ['人民幣百萬元', '人民幣百萬元', '人民幣百萬元', '人民幣百萬元', '人民幣百萬元'],
 ['506,441', '700,018', '1,015,778', '1,127,552', '1,012,142'],
 ['217,080', '253,968', '317,647', '484,812', '565,989'],
 ['723,521', '953,986', '1,333,425', '1,612,364', '1,578,131'],
 ['323,510', '432,706', '703,984', '806,299', '721,391'],
 ['32,697', '56,118', '74,059', '70,394', '61,469'],
 ['356,207', '488,824', '778,043', '876,693', '782,860'],
 ['164,879', '225,006', '286,303', '332,573', '361,067'],
 ['202,435', '240,156', '269,079', '403,098', '434,204'],
 ['367,314', '465,162', '555,382', '735,671', '795,271'],
 ['723,521', '953,986', '1,333,425', '1,612,364', '1,578,131'],
 ['二零二二年年報', '3']]
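The output above still contains the page title, the section headings, and the page footer. A small sketch of one way to drop them, assuming every real table row has exactly five columns; `keep_table_rows` is a hypothetical helper, not part of pdfplumber.

```python
def keep_table_rows(lines, n_columns=5):
    """Keep only the token lists whose length matches the expected column count."""
    return [tokens for tokens in lines if len(tokens) == n_columns]


# A few lines copied from the output above
lines = [
    ['財務概要'],
    ['二零一八年', '二零一九年', '二零二零年', '二零二一年', '二零二二年'],
    ['312,694', '377,289', '482,064', '560,118', '554,552'],
    ['二零二二年年報', '3'],
]
table = keep_table_rows(lines)
```

This only works while no heading or footer happens to split into five tokens, so it is worth checking the result against the page.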

Maybe this is helpful:

You could target the horizontal lines as @jsvine suggested:

im = page.to_image(300)

hlines = {}
for edge in page.horizontal_edges:
   if edge['object_type'] == 'line':
      left, right = edge['x0'], edge['x1']
      hlines[left, right] = edge
      
# im.reset().draw_lines(hlines.values(), stroke_width=10).save('lines.png')

You could then extract the words that fall within the width of those lines:

words = {}
for word in page.extract_words():
    text, x0, x1, top, bottom = (
        word['text'], word['x0'], word['x1'], word['top'], word['bottom']
    )
    for (left, right) in hlines:
        if  (((x0 >= left)  or (abs(left  - x0) < 0.5))
        and  ((x1 <= right) or (abs(right - x1) < 0.5))):
            words[text, x0, x1, top, bottom] = word
            break
            
# im.reset().draw_rects(words.values()).save('words.png')

You could then group the words into "rows" based on their "top" position:

rows = []
row  = []
for word in words.values():
    # Start a new row when this word's "top" differs from the previous word's
    if row and abs(word['top'] - row[-1]['top']) >= 1:
        rows.append(row)
        row = []
    row.append(word)
if row:
    rows.append(row)
        
data = [[col['text'] for col in row] for row in rows]

[['二零一八年', '二零一九年', '二零二零年', '二零二一年', '二零二二年'],
 ['人民幣百萬元', '人民幣百萬元', '人民幣百萬元', '人民幣百萬元', '人民幣百萬元'],
 ['312,694', '377,289', '482,064', '560,118', '554,552'],
 ['142,120', '167,533', '221,532', '245,944', '238,746'],
 ['94,466', '109,400', '180,022', '248,062', '210,225'],
 ['79,984', '95,888', '160,125', '227,810', '188,709'],
 ['78,719', '93,310', '159,847', '224,822', '188,243'],
 ['67,760', '119,901', '281,173', '200,390', '59,564'],
 ['66,339', '116,670', '277,834', '200,323', '60,699'],
 ['92,481', '114,601', '149,404', '159,539', '153,538'],
 ['77,469', '94,351', '122,742', '123,788', '115,649'],
 ['二零一八年', '二零一九年', '二零二零年', '二零二一年', '二零二二年'],
 ['人民幣百萬元', '人民幣百萬元', '人民幣百萬元', '人民幣百萬元', '人民幣百萬元'],
 ['506,441', '700,018', '1,015,778', '1,127,552', '1,012,142'],
 ['217,080', '253,968', '317,647', '484,812', '565,989'],
 ['723,521', '953,986', '1,333,425', '1,612,364', '1,578,131'],
 ['323,510', '432,706', '703,984', '806,299', '721,391'],
 ['32,697', '56,118', '74,059', '70,394', '61,469'],
 ['356,207', '488,824', '778,043', '876,693', '782,860'],
 ['164,879', '225,006', '286,303', '332,573', '361,067'],
 ['202,435', '240,156', '269,079', '403,098', '434,204'],
 ['367,314', '465,162', '555,382', '735,671', '795,271'],
 ['723,521', '953,986', '1,333,425', '1,612,364', '1,578,131']]
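The grouping loop above depends on the words arriving in reading order. As a hedged alternative, here is a self-contained sketch that sorts the words first; it only assumes dictionaries with `text`, `x0`, and `top` keys, as in pdfplumber's word dicts. `group_rows` and its `tolerance` parameter are hypothetical names.

```python
def group_rows(words, tolerance=1):
    """Group word dicts into rows of text by their "top" position."""
    rows = []
    # Sort top to bottom so that words on the same line end up adjacent
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if rows and abs(word["top"] - rows[-1][-1]["top"]) < tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Order each row left to right before taking the text
    return [[w["text"] for w in sorted(row, key=lambda w: w["x0"])] for row in rows]


sample = [
    {"text": "377,289", "x0": 50, "top": 100.2},
    {"text": "312,694", "x0": 10, "top": 100.0},
    {"text": "142,120", "x0": 10, "top": 120.0},
]
grouped = group_rows(sample)
```

A larger `tolerance` would merge words whose vertical positions differ slightly more, at the risk of joining rows that are close together.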

2 replies

tujinshu Jul 5, 2023
Author

Thank you for your reply.

The row is divided into two rows, 312,694 and ______. Is there any parameter that can combine them?

cmdlineluser Jul 5, 2023

I'm not sure I understand correctly, these are the rows I get:

│ 312,694      ┆ 377,289      ┆ 482,064      ┆ 560,118      ┆ 554,552      │
│ 142,120      ┆ 167,533      ┆ 221,532      ┆ 245,944      ┆ 238,746      │
│ 94,466       ┆ 109,400      ┆ 180,022      ┆ 248,062      ┆ 210,225      │

Where does _____ come into play?
