Don't extract tables without borders #883
Replies: 5 comments 4 replies
-
Hi @brendamdib, and thanks for your interest in this library. Generally speaking, there's no perfect way to identify tables in PDFs; PDFs have no internal concept of a "table," and much of what we perceive to be a "table" is based on human perception. This is particularly true for borderless tables like in your first and second examples. If you know, ahead of time, what the possible types of table layouts could be in your set of PDFs, then you might be able to distinguish between them by looking for particular attributes of the objects in |
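One way to act on that idea, as a rough sketch only: look at the ruling objects on a page and choose extraction strategies from what is there. The helper name `choose_table_settings` and its thresholds are made up for illustration. The input dicts are assumed to look like pdfplumber's `page.lines` and `page.rects` entries (only `width` and `height` are used).

```python
def choose_table_settings(lines, rects, min_rules=2):
    """Pick table-extraction strategies from the ruling objects on a page.

    `lines` and `rects` are lists of dicts with "width" and "height" keys
    (assumed to match the shape of pdfplumber's page.lines / page.rects).
    """
    objs = list(lines) + list(rects)
    # Objects much wider than tall are treated as horizontal rules,
    # objects much taller than wide as vertical rules.
    horizontal = sum(1 for o in objs if o["width"] > 10 * max(o["height"], 0.1))
    vertical = sum(1 for o in objs if o["height"] > 10 * max(o["width"], 0.1))
    return {
        "horizontal_strategy": "lines" if horizontal >= min_rules else "text",
        "vertical_strategy": "lines" if vertical >= min_rules else "text",
    }
```

The returned dict uses the `"lines"` and `"text"` strategy names and could be passed as `table_settings` to `page.extract_table(...)`. Whether the heuristic suits a given set of PDFs would need checking against real pages.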
-
I am also interested in these types of table. Both are very common. The H-line separators (first two examples) are fairly standard and the banded/background table (3rd) is also very common. It's messy. The H-lines may not be lines at all (they may be thin rectangles), or they may be polyline segments (one for each column). The main challenge is that these are not always simple cartesian tables (rows*cols): the third table has mini-trees in the headers. This sort of logic can get messy. So there are two phases, I think:
I'm interested in working gently in this area on a continuing basis. Generally I hack my own code downstream of |
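To illustrate the point about horizontal separators arriving as thin rectangles or per-column segments, here is a minimal sketch that normalizes both into a list of rule positions. The function name and tolerances are hypothetical, and the dict keys (`x0`, `x1`, `top`, `bottom`) are assumed to match pdfplumber's line and rect objects.

```python
def horizontal_rules(lines, rects, max_thickness=2.0, y_tol=1.0):
    """Return sorted y positions of horizontal rules, merging fragments.

    Thin rectangles count as rules; thick ones (e.g. background bands)
    are ignored. Segments at nearly the same y collapse into one rule.
    """
    ys = []
    for obj in list(lines) + list(rects):
        thickness = obj["bottom"] - obj["top"]
        length = obj["x1"] - obj["x0"]
        if thickness <= max_thickness and length > thickness:
            # Use the vertical midpoint as the rule position.
            ys.append((obj["top"] + obj["bottom"]) / 2)
    merged = []
    for y in sorted(ys):
        if merged and y - merged[-1] <= y_tol:
            continue  # same rule as the previous one, drawn in pieces
        merged.append(y)
    return merged
```

If I remember the table settings correctly, positions like these can be supplied through `explicit_horizontal_lines` together with `"horizontal_strategy": "explicit"`, but treat that as something to verify against the pdfplumber documentation.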
-
Two separate topics:
On Thu, Jun 1, 2023 at 2:41 PM Jeremy Singer-Vine ***@***.***> wrote:
Hi @petermr <https://github.com/petermr>, and thanks for chiming in. I
agree: It'd be great to provide more support for tricky-but-frequent table
structures like these.
Besides tables we need to support lists. I have some well flagged lists
(bulleted) and I'll try them over the next day or two. The harder thing is
to auto-recognise tables and lists, especially where it's only indentation
(no lines or symbols)
P.
--
Peter Murray-Rust
Founder ContentMine.org
and
Reader Emeritus in Molecular Informatics
Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
|
-
On Thu, Jun 1, 2023 at 2:41 PM Jeremy Singer-Vine ***@***.***> wrote:
Does pdfplumber have reusable downstream contrib code?
Could you provide a bit more detail about what you have in mind?
Thanks to pdfplumber I've made a lot of progress in reading IPCC reports (7
reports, 30+ documents, 15000 (sic) pages). I've written a lot of fairly
generic code (not IPCC specific) and people have recently said they want to
do other reports. Much of this has logic which is downstream of
pdfplumber (I capture the character stream and weld it into lines, (nested)
paragraphs, etc.) An example - not finished - is
https://github.com/petermr/semanticClimate/blob/main/ipcc/ar6/syr/lr/html/fulltext/groups_groups.html
where a PDF file is autotransformed to structured HTML (retaining styles,
annealing page-breaks, collating footnotes and much other document stuff).
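A rough illustration of the "weld characters into lines" step (this is not the code from the repositories linked here, just a minimal sketch): group characters whose `top` values are close, then order each group left to right. The function name and the `y_tol` default are assumptions. The dict keys `text`, `x0` and `top` are assumed to follow pdfplumber's `page.chars`.

```python
def chars_to_lines(chars, y_tol=2.0):
    """Group character dicts into text lines by vertical position."""
    lines = []
    for ch in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        if lines and abs(ch["top"] - lines[-1]["top"]) <= y_tol:
            # Close enough to the current line's baseline: same line.
            lines[-1]["chars"].append(ch)
        else:
            lines.append({"top": ch["top"], "chars": [ch]})
    # Within each line, read characters left to right.
    return [
        "".join(c["text"] for c in sorted(line["chars"], key=lambda c: c["x0"]))
        for line in lines
    ]
```

Paragraph nesting, style retention and page-break handling would sit on top of something like this and are not attempted here.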
So there are roughly three parts:
1. PDF -> word-stream (fonts, sizes, styles) and images, vectors, etc.
2. word-stream -> structured document (ideally in JATS-XML/HTML) - what the
author wrote!
3. images/vectors -> SVG
(I have done this with PDFBox/Java so it's possible)
2 and 3 are largely independent of PDFPlumber but it's often what people
want. Tables, lists and paragraphs are in 2.
PDFPlumber does most of the work for 1, but misses
A. control points in curves (I think this is pdfminer.six's problem?)
and in 3,
B. interpretation of bitstreams for images
https://github.com/petermr/pyami/blob/pmr16/test/test_integrate.py
At present my PDF code is in a larger system (which includes GUIs) and I'm
thinking of separating it. A natural place would be a pdfplumber/contrib
where others can use it if they like and it could hopefully be improved.
No rush... and this won't happen quickly from me, but it may be useful to
find fellow minds.
P.
-
Note that tables created with (recent versions of) MS Word can reliably be extracted using the structure tree (see #937 (comment)) - however this does not solve the problem of not having the borders, because we can only get the bounding box of the table from the marked content sections, i.e. the characters and images. I ended up just adding a margin around the table's bounding box (I am extracting them as images for display, since even with the structure tree the internal layout is not accurately represented)... |
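For anyone wanting to try the same margin trick, a minimal sketch: take the union of the bounding boxes of the table's characters and images, then pad it and clamp to the page. The function name and the default margin are made up. The dict keys (`x0`, `top`, `x1`, `bottom`) are assumed to match pdfplumber objects.

```python
def padded_bbox(objects, page_width, page_height, margin=5.0):
    """Union the objects' bounding boxes, pad by `margin`, clamp to the page."""
    x0 = min(o["x0"] for o in objects)
    top = min(o["top"] for o in objects)
    x1 = max(o["x1"] for o in objects)
    bottom = max(o["bottom"] for o in objects)
    return (
        max(0.0, x0 - margin),
        max(0.0, top - margin),
        min(page_width, x1 + margin),
        min(page_height, bottom + margin),
    )
```

The resulting tuple is in the `(x0, top, x1, bottom)` order that `page.crop(...)` expects, so something like `page.crop(bbox).to_image(resolution=150)` should render the padded region. The right margin depends on how far the invisible borders sit from the text.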
-
Hi!
I'm using the code below to save only the text around tables.
But some tables aren't detected...
I have this problem with tables that have a simpler layout, like a table made with a text editor such as MS Word, as in the examples below (they are in the attached PDF).
Another problem is that the words come out without spaces between them, and some words are cropped...
What could I do to identify these two types of table?
This type of table is detected without problems... with this type of table, I can use the explicit strategy for both vertical and horizontal lines without problems.
But unfortunately I have a lot of files with these tables, and I haven't had success yet. I'm using the code below:
111642_020036_03052023111023.pdf
This is the output file
111642_020036_03052023111023.txt
import aiofiles
from pathlib import Path
# from scripts.SplitSentences import SpacyExec
import pdfplumber

async def ConvertToTXT(pathPdfFiles, inputMenu):
    # List PDF files from dir
    for f in Path(pathPdfFiles).glob("*.pdf"):
        # Import the PDF.
        pdf = pdfplumber.open(f)