Handling overlapped text #594
henrylzy
started this conversation in
Ask for help with specific PDFs
Yes, this is a tricky situation, and one I've seen in a few other PDFs. One approach, which may or may not work depending on the specifics of your PDF's internal representations:
Hi, I am having trouble extracting tables and data from a PDF in which several lines of text overlap each other, so the module fails to parse that text correctly.
The PDF contains data that looks like this ("|" is the column line), and the objective is to parse each row of data column by column into a list like this:
I cannot upload the pdf due to the sensitivity of its data, so I've recreated the layout of the pdf in the picture below:
This is the output of Camelot's visual debugging with matplotlib; the red vertical lines mark where the boundaries of each column should be.
As you can see, two lines of text overlap each other in the third column from the left.
Below is the data from this PDF. I intentionally made the values in the first, third, and fifth columns letters, and the rest numbers, so that the problem is easy to see.
If I run this code:
The result for this line of data would be something like this:
Then I came across this issue, which mentioned using cluster_objects() to extract all the text from the PDF with this code:
Running this code did solve the problem of letters and numbers being mixed together. However, it's still not quite there, since the result looks like this:
It definitely looks much better than before, but it's really hard to separate the data.
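To illustrate why clustering helps with the overlap but not with the columns, here is a minimal pure-Python sketch of the idea: group characters whose top coordinates lie within a tolerance of each other, then read each group from left to right. The character dicts are made up and only mimic the text, x0, and top keys found in pdfplumber's page.chars. group_by_top and line_text are hypothetical helpers written for this example, not pdfplumber's cluster_objects() implementation.

```python
def group_by_top(chars, tolerance=1.0):
    """Group chars into lines: a char joins the current line when its
    `top` is within `tolerance` of the previous char's `top`."""
    lines = []
    last_top = None
    for ch in sorted(chars, key=lambda c: c["top"]):
        if last_top is None or ch["top"] - last_top > tolerance:
            lines.append([])  # start a new line
        lines[-1].append(ch)
        last_top = ch["top"]
    return lines


def line_text(line):
    """Join the characters of one line in left-to-right order."""
    return "".join(c["text"] for c in sorted(line, key=lambda c: c["x0"]))


# Made-up sample: two text lines, 3 points apart vertically,
# whose characters interleave horizontally.
chars = [
    {"text": "A", "x0": 10, "top": 100.0},
    {"text": "1", "x0": 12, "top": 103.0},
    {"text": "B", "x0": 16, "top": 100.0},
    {"text": "2", "x0": 18, "top": 103.0},
]

mixed = line_text(chars)  # ordering by x0 alone interleaves the two lines
separated = [line_text(line) for line in group_by_top(chars)]
print(mixed)      # A1B2
print(separated)  # ['AB', '12']
```

Grouping on the vertical position untangles the two overlapped lines, but each resulting line still spans every column, which is why the output remains hard to split into cells.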
I've been searching for a few days for documentation on anything in pdfplumber\utils.py so that I can tweak the function above, but no luck.
I also tried to set the columns using the explicit_vertical_lines and explicit_horizontal_lines settings with extract_tables(), but that didn't help.
So I'm wondering if there is a way to handle PDFs like this? Or maybe a way to separate data from different text boxes while using the cluster_objects() function?
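One possible way to keep the columns apart, sketched under assumptions: if the x positions of the column boundaries are known (for example, the red lines in the debug plot), each character can be assigned to a column by its horizontal midpoint before any line grouping happens. The boundaries, the sample characters, and the helper names below are all hypothetical; the dicts only mimic the text, x0, x1, and top keys of pdfplumber's page.chars, and whether this works on a real file depends on how that PDF stores its text.

```python
from bisect import bisect_right
from collections import defaultdict


def column_index(ch, boundaries):
    """Index of the column containing the char's horizontal midpoint.
    `boundaries` holds the sorted x positions of the interior column lines."""
    mid = (ch["x0"] + ch["x1"]) / 2
    return bisect_right(boundaries, mid)


def split_by_column(chars, boundaries, tolerance=1.0):
    """Return {column index: [line strings, top to bottom]}."""
    columns = defaultdict(list)
    for ch in chars:
        columns[column_index(ch, boundaries)].append(ch)

    result = {}
    for idx in sorted(columns):
        lines, last_top = [], None
        # Within one column, start a new line whenever `top` jumps
        # by more than `tolerance`.
        for ch in sorted(columns[idx], key=lambda c: c["top"]):
            if last_top is None or ch["top"] - last_top > tolerance:
                lines.append([])
            lines[-1].append(ch)
            last_top = ch["top"]
        result[idx] = [
            "".join(c["text"] for c in sorted(line, key=lambda c: c["x0"]))
            for line in lines
        ]
    return result


# Hypothetical layout: three columns split at x=50 and x=100, with two
# overlapping lines in the third column.
boundaries = [50, 100]
chars = [
    {"text": "A", "x0": 10, "x1": 16, "top": 100.0},
    {"text": "7", "x0": 60, "x1": 66, "top": 100.0},
    {"text": "B", "x0": 110, "x1": 116, "top": 100.0},
    {"text": "C", "x0": 112, "x1": 118, "top": 103.0},
]
table = split_by_column(chars, boundaries)
print(table)  # {0: ['A'], 1: ['7'], 2: ['B', 'C']}
```

Assigning columns first means the overlapped lines only have to be untangled inside a single cell, rather than across the whole row.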