Line breaks in table header cells cause the header to be recognized as two rows #1190
Donny2030
started this conversation in
Ask for help with specific PDFs
Replies: 1 comment
-
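Hi @Donny2030, the visual debugging tools are helpful in this case. Here you can see that the issue is extra graphical elements in the header:

import pdfplumber

pdf = pdfplumber.open("test22.pdf")  # path to the attached PDF
page = pdf.pages[0]
im = page.to_image()
im.debug_tablefinder()

You can use page.filter() to remove them, keeping only the objects whose fill color is not (0.852,):

def remove_header_cruft(obj):
    # Return False (drop the object) when it has the fill color of the extra header elements
    return obj.get("non_stroking_color") != (0.852,)

filtered = page.filter(remove_header_cruft)
filtered.to_image().debug_tablefinder()

Extracting the table from the filtered page then gives:

filtered.extract_table()

[['序\n号', '资产类别', '穿透前', None, '穿透后', None],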
[None, None, '资产余额(元)', '占穿透前总\n资产的比例\n(%)', '资产余额(元)', '占穿透后总\n资产的比例\n(%)'],
['1', '现金及银\n行存款', '2,876,646.27', '0.11', '31,038,270.26', '1.17'],
['2', '同业存单', '-', '-', '-', '-'],
['3', '拆放同业\n及买入返\n售', '-', '-', '2,958,188.11', '0.11'],
['4', '债券', '1,114,006,169.89', '41.88', '2,423,551,112.36', '91.10'],
['5', '非标准化\n债权类资\n产', '202,812,602.74', '7.62', '202,812,602.74', '7.62'],
['6', '权益类投\n资', '-', '-', '-', '-'],
['7', '金融衍生\n品', '-', '-', '-', '-'],
['8', '代客境外\n理财投资\nQDII', '-', '-', '-', '-'],
['9', '商品类资\n产', '-', '-', '-', '-']]
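
The extracted table still has a two-level header (the `None` cells mark merged cells) and `\n` inside wrapped cells. If a single flat header is wanted, a small post-processing step can handle this. The following is only a sketch, not part of pdfplumber: `clean` and `merge_header_rows` are hypothetical helper names, the sample rows are copied from the output above, and it assumes that a `None` in the top header row means "same group as the cell to its left" and that line breaks can be dropped without a space (reasonable for Chinese text, not for English).

```python
def clean(cell):
    """Drop in-cell line breaks left over from wrapped text (None stays None)."""
    return cell.replace("\n", "") if cell is not None else None


def merge_header_rows(table, sep=" / "):
    """Flatten a two-row header into one row and return (header, body)."""
    top, sub, *body = [[clean(c) for c in row] for row in table]
    header = []
    last_top = None
    for t, s in zip(top, sub):
        if t is not None:
            # A new top-level label starts here; None cells inherit it.
            last_top = t
        parts = [p for p in (last_top, s) if p]
        header.append(sep.join(parts))
    return header, body


# First rows of the output shown above, used as sample input.
sample = [
    ['序\n号', '资产类别', '穿透前', None, '穿透后', None],
    [None, None, '资产余额(元)', '占穿透前总\n资产的比例\n(%)',
     '资产余额(元)', '占穿透后总\n资产的比例\n(%)'],
    ['1', '现金及银\n行存款', '2,876,646.27', '0.11', '31,038,270.26', '1.17'],
]

header, body = merge_header_rows(sample)
print(header)
print(body[0])
```

With a real page, the list returned by `filtered.extract_table()` would be passed in place of `sample`.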
-
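In this PDF, the table header contains line breaks, so the header is incorrectly recognized as multiple rows. Adjusting the snap_y_tolerance parameter does not fix this either (setting it too high causes rows to be merged incorrectly). These are the settings I used:

table_settings = {
    "vertical_strategy": "lines",  # available strategies: lines or text
    "horizontal_strategy": "lines",  # same as above
    "snap_y_tolerance": 7,  # increased to merge the wrapped header
    "min_words_horizontal": 2,
    "snap_tolerance": 5,  # larger tolerance, to help merge lines
    "join_tolerance": 5,  # larger tolerance, to help join lines
    "text_y_tolerance": 5,
    "intersection_y_tolerance": 5
}

Could someone please take a look and suggest a way to get this table recognized correctly?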
test22.pdf
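
When tolerance settings do not help, it can be worth checking whether the page contains extra filled shapes that the line-based strategies pick up as table edges. One way to look for them is to count objects by fill color. The helper below is a hedged sketch: `fill_color_counts` is a hypothetical name, and the sample dicts only imitate the shape of pdfplumber object dicts (which expose a `non_stroking_color` key); with a real page, `page.rects` would be passed instead.

```python
from collections import Counter


def fill_color_counts(objects):
    """Count objects by their non_stroking_color (fill color) value."""
    counts = Counter()
    for obj in objects:
        color = obj.get("non_stroking_color")
        # Lists are unhashable, so normalize them to tuples before counting.
        if isinstance(color, list):
            color = tuple(color)
        counts[color] += 1
    return counts


# Stand-in data shaped like pdfplumber objects (illustrative values only).
sample_objects = [
    {"object_type": "rect", "non_stroking_color": (0.852,)},
    {"object_type": "rect", "non_stroking_color": (0.852,)},
    {"object_type": "rect", "non_stroking_color": (0,)},
]

counts = fill_color_counts(sample_objects)
for color, n in counts.most_common():
    print(color, n)
```

A color that appears only on shapes in the header area is a candidate for a `page.filter(...)` predicate.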