Why are 2 cropped areas producing the same .extract_text() output when one is "empty"? #930
I was experimenting with the example PDF from #926.

Cropping 2 areas produces the same output from .extract_text():

import pdfplumber
pdf = pdfplumber.open("page4.pdf")
page = pdf.pages[0]
a = page.crop((489.0136, 510.25570000000005, 533.1974, 521.686))
b = page.crop((489.0136, 521.686, 533.1974, 529.686))
a.extract_text()
# '人民幣百萬元'
b.extract_text()
# '人民幣百萬元'

If we look at the cropped areas as images: A.png, B.png

From what I can tell:

c1 = a.chars[0]
c2 = b.chars[0]
c3 = next(c for c in page.chars if c['matrix'] == c1['matrix'])

>>> c1
{'matrix': (8.075, 0.0, 0.0, 9.5, 482.4896, 287.3948),
'fontname': 'NKVMXY+MHeiHK-Bold',
'adv': 1.0,
'upright': True,
'x0': 489.0136,
'y0': 285.6183,
'x1': 490.5646,
'y1': 295.1183,
'width': 1.5509999999999877,
'height': 8.930299999999988,
'size': 9.5,
'object_type': 'char',
'page_number': 1,
'text': '人',
'stroking_color': (0, 0, 0, 1),
'non_stroking_color': (0, 0, 0, 1),
'top': 512.7557,
'bottom': 521.686,
'doctop': 512.7557}

>>> c2
{'matrix': (8.075, 0.0, 0.0, 9.5, 482.4896, 287.3948),
'fontname': 'NKVMXY+MHeiHK-Bold',
'adv': 1.0,
'upright': True,
'x0': 489.0136,
'y0': 285.6183,
'x1': 490.5646,
'y1': 295.1183,
'width': 1.5509999999999877,
'height': 0.5697000000000116,
'size': 9.5,
'object_type': 'char',
'page_number': 1,
'text': '人',
'stroking_color': (0, 0, 0, 1),
'non_stroking_color': (0, 0, 0, 1),
'top': 521.686,
'bottom': 522.2557,
'doctop': 521.686}

>>> c3
{'matrix': (8.075, 0.0, 0.0, 9.5, 482.4896, 287.3948),
'fontname': 'NKVMXY+MHeiHK-Bold',
'adv': 1.0,
'upright': True,
'x0': 482.4896,
'y0': 285.6183,
'x1': 490.5646,
'y1': 295.1183,
'width': 8.074999999999989,
'height': 9.5,
'size': 9.5,
'object_type': 'char',
'page_number': 1,
'text': '人',
'stroking_color': (0, 0, 0, 1),
'non_stroking_color': (0, 0, 0, 1),
'top': 512.7557,
'bottom': 522.2557,
'doctop': 512.7557}

The main thing that seems to differ is the top/bottom/doctop values (x0, width and height also change):

c1['bottom'] - c1['top']
# 8.930299999999988
c2['bottom'] - c2['top']
# 0.5697000000000116
c3['bottom'] - c3['top']
# 9.5

I'm just wondering if someone can explain to me what is happening here and how this works.

I guess the char is considered inside if any point falls within the cropped area?

I want to discard both of the crops as "false matches" (crop …). It seems we can find the original matrix in …

Thanks all.
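The guess can be checked against the numbers above. This is a minimal sketch in plain Python, assuming that crop keeps any char that overlaps the box and slices its bounding box to fit. The `clip_char` helper is hypothetical and is not pdfplumber's own code; only the coordinates are taken from the output above.

```python
def clip_char(char, bbox):
    """Return a copy of `char` sliced to `bbox`, or None if they do not overlap.

    Hypothetical helper that mimics the assumed crop behaviour; it is not
    part of pdfplumber. bbox is (x0, top, x1, bottom).
    """
    bx0, btop, bx1, bbottom = bbox
    x0, top = max(char["x0"], bx0), max(char["top"], btop)
    x1, bottom = min(char["x1"], bx1), min(char["bottom"], bbottom)
    if x0 >= x1 or top >= bottom:
        return None  # no overlap, so the char would be dropped
    return {**char, "x0": x0, "top": top, "x1": x1, "bottom": bottom,
            "width": x1 - x0, "height": bottom - top}


# Position of the full-page char c3 from the output above.
c3 = {"text": "人", "x0": 482.4896, "x1": 490.5646,
      "top": 512.7557, "bottom": 522.2557}

clipped_a = clip_char(c3, (489.0136, 510.25570000000005, 533.1974, 521.686))
clipped_b = clip_char(c3, (489.0136, 521.686, 533.1974, 529.686))

print(clipped_a["top"], clipped_a["bottom"])  # same top/bottom as c1
print(clipped_b["top"], clipped_b["bottom"])  # same top/bottom as c2
```

Both boxes overlap the same glyph, so each crop ends up with its own sliced copy of '人'. That would explain why the two crops return the same text even though crop b only catches a thin strip at the bottom of the character box.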
Replies: 1 comment 1 reply
Yep, that's exactly it. From the readme: …

I acknowledge that this can be confusing at first, especially with char objects whose bounding boxes extend beyond their visual markings. On the other hand, I wanted crop to adhere to what "crop" means in other contexts/software. It also has some more "expected"/preferred results with things like line objects.

The "good" news: You can use …
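For the "false matches" in the question, one option is to compare each cropped char with its full-page original and keep it only if most of its box survived the crop. The sketch below is plain Python over char dicts. `coverage` and the 0.5 threshold are hypothetical choices for illustration, not pdfplumber API. The pdfplumber README also documents `Page.within_bbox`, which, as I read it, retains only objects that fall entirely within the box, and that may be the simpler route.

```python
def coverage(cropped, original):
    """Fraction of the original char box that survives in the cropped char."""
    kept = ((cropped["x1"] - cropped["x0"])
            * (cropped["bottom"] - cropped["top"]))
    full = ((original["x1"] - original["x0"])
            * (original["bottom"] - original["top"]))
    return kept / full


# Positions copied from c1, c2 (cropped) and c3 (full page) in the question.
c1 = {"x0": 489.0136, "x1": 490.5646, "top": 512.7557, "bottom": 521.686}
c2 = {"x0": 489.0136, "x1": 490.5646, "top": 521.686, "bottom": 522.2557}
c3 = {"x0": 482.4896, "x1": 490.5646, "top": 512.7557, "bottom": 522.2557}

MIN_COVERAGE = 0.5  # assumed threshold; tune for your documents

for name, char in (("a", c1), ("b", c2)):
    ratio = coverage(char, c3)
    verdict = "keep" if ratio >= MIN_COVERAGE else "discard"
    print(f"crop {name}: {ratio:.1%} of the char is inside -> {verdict}")
```

With these numbers both crops fall well below the threshold (roughly 18% and 1% of the char box), so both would be discarded. On a real page the original could be looked up by matching `matrix`, as the question already does for c3.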