Handling overlapped text #594

henrylzy · 2022-02-02T10:38:11Z

Hi, I'm having trouble extracting tables and data from a PDF in which several lines of text overlap each other, so the module fails to parse that text correctly.

The PDF contains data that looks like this (| marks a column line):

aaaaa | 000000 | bbbbbb | 11111111 | cccccccc | 22222222

and the objective is to parse each row of data column by column into a list like this:

['aaaaa', '000000', 'bbbbbb', '11111111', 'cccccccc', '22222222']

I cannot upload the pdf due to the sensitivity of its data, so I've recreated the layout of the pdf in the picture below:

This is the output of Camelot's visual debugging with matplotlib; the red vertical lines mark where the boundaries of each column should be.

As you can see, two lines of text overlap each other in the third column from the left.

Suppose the line below is the data from this PDF. I intentionally made the values in the first, third and fifth columns alphabetical letters and the rest numbers, so that the problem is easy to see.

aaaaa | 000000 | bbbbbb | 11111111 | cccccccc | 22222222

If I run this code:

pdf.pages[0].extract_table()

The result of this line of data would be something like this:

aaaaa | 0000 | b0b0bb | bb111111 | 11c1cccccc | c22222222

Then I came across this issue, which mentions using cluster_objects() to extract all the text from the PDF with this code:

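from pdfplumber.utils import cluster_objects

def extract_ocr_text(page):
    # Cluster characters into lines by their "top" coordinate (tolerance of 5).
    line_chars = cluster_objects(page.chars, "top", tolerance=5)
    lines = ["".join(c["text"] for c in chars) for chars in line_chars]
    return "\n".join(lines)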

Running this code did solve the problem of letters and numbers being mixed together. However, it's not quite there yet, since the result looks like this:

aaaaabbbbbbcccccccc0000001111111122222222

It definitely looks way better than before, but it's really hard to separate the data.

I've been searching for a few days for documentation on anything in pdfplumber\utils.py so that I can tweak the function above, but no luck.

I also tried setting the columns with the explicit_vertical_lines and explicit_horizontal_lines settings in extract_tables(), but that didn't seem to help.

So I'm wondering if there is a way to handle PDFs like this? Or maybe a way to separate data from different text boxes while using the cluster_objects() function?

jsvine · 2022-02-02T13:24:59Z

Maintainer

Yes, this is a tricky situation, and one I've seen in a few other PDFs. One approach, which may or may not work depending on the specifics of your PDF's internal representations:

1. Use ~~page.extract_table(...).cells~~ page.find_tables(...)[0].cells [<- edited to fix] to identify the bounding boxes of the cells in your table.
2. Use page.extract_words(..., use_text_flow=True), which might, depending on your PDF, keep the words separate.
3. Use the output of Steps 1 and 2 to assign each word to a cell.
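
The assignment step can be sketched roughly as follows. This is only an illustration, not pdfplumber functionality: the helper names `overlap_area` and `assign_words_to_cells` are hypothetical, and it assumes that cells are `(x0, top, x1, bottom)` tuples and that words are dicts with `text`, `x0`, `x1`, `top` and `bottom` keys. Each word goes to the cell it overlaps most, which is one possible heuristic for text that spills across a column line.

```python
def overlap_area(word, cell):
    """Intersection area between a word dict and a cell box (x0, top, x1, bottom)."""
    cx0, ctop, cx1, cbottom = cell
    width = min(word["x1"], cx1) - max(word["x0"], cx0)
    height = min(word["bottom"], cbottom) - max(word["top"], ctop)
    return max(width, 0) * max(height, 0)


def assign_words_to_cells(words, cells):
    """Map each cell index to the texts of the words that overlap it most."""
    if not cells:
        return {}
    assigned = {i: [] for i in range(len(cells))}
    for word in words:
        areas = [overlap_area(word, cell) for cell in cells]
        best = max(range(len(cells)), key=areas.__getitem__)
        # Words that touch no cell at all (e.g. text outside the table) are dropped.
        if areas[best] > 0:
            assigned[best].append(word["text"])
    return assigned


# Hypothetical usage with pdfplumber ("example.pdf" is a placeholder):
# import pdfplumber
# with pdfplumber.open("example.pdf") as pdf:
#     page = pdf.pages[0]
#     cells = page.find_tables()[0].cells
#     words = page.extract_words(use_text_flow=True)
#     print(assign_words_to_cells(words, cells))
```

Whether largest overlap is the right rule depends on how far the text overflows in a given PDF; if words mostly start inside their own cell, matching on the left edge may work better.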


henrylzy Feb 2, 2022
Author

Thanks for the help!

I've tried this code, and Step 2 alone works great.

Here's some feedback, in case you want to know.

Running the code from Step 2:

pdf_str = pdf.pages[0].extract_words(use_text_flow=True)

The result looks perfect. It just needs another for loop to fine-tune the list: get rid of the unnecessary data it gathered (mostly from outside the table) and match the data from adjacent cells back together. No big deal.

The first step, however, raises an AttributeError:
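
For readers wondering what such a clean-up loop might look like, here is a minimal sketch. It is not the code used in this thread: the helper name `words_to_row` is hypothetical, the column edges are assumed to be known x-coordinates (such as the red lines in the debug picture), and each word is assumed to start inside its own column even when it runs past the right-hand line.

```python
from bisect import bisect_right


def words_to_row(words, column_edges, top, bottom):
    """Bucket words into columns by their left edge, keeping only one row band.

    column_edges is a sorted list of x-coordinates; n edges define n - 1 columns.
    """
    row = [[] for _ in range(len(column_edges) - 1)]
    for word in words:
        # Skip anything above or below the row of interest.
        if word["top"] < top or word["bottom"] > bottom:
            continue
        col = bisect_right(column_edges, word["x0"]) - 1
        # Skip words that start left or right of the table.
        if 0 <= col < len(row):
            row[col].append(word["text"])
    return [" ".join(parts) for parts in row]
```

Given the words from extract_words(use_text_flow=True) and the edges of a six-column table, this would return a six-item list for one row, with an empty string for any column that has no text.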

Traceback (most recent call last):
  File "**************************", line 44, in <module>
    pdf_str = pdf.pages[0].extract_table().cells
AttributeError: 'list' object has no attribute 'cells'

I've tried running this line with and without table_settings, but it doesn't seem to matter.

I went through the documentation and found that the entries for .extract_table() and .extract_tables() don't mention a .cells attribute.

The entries for .find_tables() and .debug_tablefinder() do mention it, but I got the same AttributeError with .find_tables().

.debug_tablefinder().cells works fine: it returns a list of tuples that look like the position of each cell.

However, I have no idea how I should use this data, and I didn't spend much time digging into it since Step 2 already did the job.
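
As a side note on using that list of tuples, here is one possible sketch. It assumes each tuple is a bounding box in `(x0, top, x1, bottom)` order and that cells in the same table row share (almost) the same `top` value; the helper name `cells_to_rows` is hypothetical.

```python
from itertools import groupby


def cells_to_rows(cells):
    """Group (x0, top, x1, bottom) boxes into rows, top to bottom, left to right."""
    def row_key(cell):
        # Round so that tiny floating-point differences do not split a row.
        return round(cell[1], 1)

    ordered = sorted(cells, key=lambda cell: (row_key(cell), cell[0]))
    return [list(group) for _, group in groupby(ordered, key=row_key)]
```

Each inner list is then one table row, and the position of a box within it is its column index.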

jsvine Feb 3, 2022
Maintainer

> I went through the documentation and found .extract_table() and .extract_tables() didn't mention anything about the access to .cells method.

WHOOPS, typo on my end. That was supposed to say .find_tables(...)[0].cells. I'll fix my note above so that it doesn't mislead anyone else. Thanks for flagging, and apologies for the detour.
