Don't extract tables without borders #883

brendamdib · 2023-05-08T13:34:21Z

Hi!
I'm using the code below to save only the text outside the tables.

But some tables aren't detected.

The problem occurs with tables that have a simpler layout, like tables made in a word processor such as MS Word, like the ones below (they are in the attached PDF).
Another problem is that the extracted words have no spaces between them, and some words are cropped.

What could I do to identify these two types of table?

Tables of this type are detected without problems: with them, I can use the explicit vertical and horizontal strategies.

But unfortunately I have a lot of files with these tables, and I haven't had success yet. I'm using the code below.
111642_020036_03052023111023.pdf

This is the output file:
111642_020036_03052023111023.txt

import aiofiles
from pathlib import Path
#from scripts.SplitSentences import SpacyExec
import pdfplumber


def curves_to_edges(cs):
    """See https://github.com/jsvine/pdfplumber/issues/127"""
    edges = []
    for c in cs:
        edges += pdfplumber.utils.rect_to_edges(c)
    return edges


async def ConvertToTXT(pathPdfFiles, inputMenu):
    # Set target path (fail early instead of hitting an undefined trgPath later)
    if inputMenu == "1":
        trgPath = 'TXT_Files/ITR/'
    elif inputMenu == "2":
        trgPath = 'TXT_Files/DFP/'
    else:
        raise ValueError(f"Unknown menu option: {inputMenu!r}")

    # List PDF files from dir
    for f in Path(pathPdfFiles).glob("*.pdf"):
        # Setting txt file name
        filename = f.name.replace(".pdf", ".txt")

        # Open the PDF; the context manager closes the file when done.
        with pdfplumber.open(f) as pdf:
            # Reading pages
            for page in pdf.pages:
                # Table settings.
                ts = {
                    # Works with tables with lines and borders
                    # "vertical_strategy": "text",
                    # "horizontal_strategy": "text",
                    # "explicit_vertical_lines": curves_to_edges(page.curves) + page.edges,
                    # "explicit_horizontal_lines": curves_to_edges(page.curves) + page.edges,

                    # Tables without lines and borders
                    "vertical_strategy": "text",
                    "horizontal_strategy": "text",
                    "explicit_vertical_lines": [],
                    "explicit_horizontal_lines": [],
                    "snap_tolerance": 3,
                    "snap_x_tolerance": 3,
                    "snap_y_tolerance": 3,
                    "join_tolerance": 3,
                    "join_x_tolerance": 3,
                    "join_y_tolerance": 3,
                    "edge_min_length": 3,
                    "min_words_vertical": 3,
                    "min_words_horizontal": 1,
                    # I get error with this line active, because I'm using find_tables
                    # "keep_blank_chars": False,
                    "text_tolerance": 3,
                    "text_x_tolerance": 3,
                    "text_y_tolerance": 3,
                    "intersection_tolerance": 3,
                    "intersection_x_tolerance": 3,
                    "intersection_y_tolerance": 3,
                }

                # Get the bounding boxes of the tables on the page.
                bboxes = [table.bbox for table in page.find_tables(table_settings=ts)]

                def not_within_bboxes(obj):
                    """Return True if the object lies outside every table bbox."""
                    def obj_in_bbox(_bbox):
                        """See https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/table.py#L404"""
                        v_mid = (obj["top"] + obj["bottom"]) / 2
                        h_mid = (obj["x0"] + obj["x1"]) / 2
                        x0, top, x1, bottom = _bbox
                        return (h_mid >= x0) and (h_mid < x1) and (v_mid >= top) and (v_mid < bottom)
                    return not any(obj_in_bbox(__bbox) for __bbox in bboxes)

                # Extract the text once and reuse it for printing and saving.
                text = page.filter(not_within_bboxes).extract_text(layout=True)

                # Print page to see if table was removed
                print("Text outside the tables:")
                print(text)

                # Saving pages to txt file
                async with aiofiles.open(trgPath + filename, "a+") as file:
                    await file.write(text)
                    #file.write(SpacyExec(page.filter(not_within_bboxes).extract_text()))

jsvine · 2023-05-09T00:52:21Z

Maintainer

Hi @brendamdib, and thanks for your interest in this library. Generally speaking, there's no perfect way to identify tables in PDFs; PDFs have no internal concept of a "table," and much of what we perceive to be a "table" is based on human perception. This is particularly true for borderless tables like in your first and second examples.

If you know, ahead of time, what the possible types of table layouts could be in your set of PDFs, then you might be able to distinguish between them by looking for particular attributes of the objects in page.lines, page.chars, et cetera.
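As a rough illustration of that idea, here is a minimal sketch that guesses a coarse layout label from line and rectangle objects. It assumes each object is a dict with `x0`, `x1`, `top` and `bottom` keys, as pdfplumber's `page.lines` and `page.rects` entries are; the function name, the labels and every threshold are hypothetical and would need tuning against your own PDFs.

```python
def classify_table_layout(lines, rects, page_width,
                          min_rule_frac=0.3, max_rule_thickness=2.0):
    """Guess a coarse layout label from line/rect objects (heuristic sketch).

    Returns one of 'bordered', 'ruled', 'banded' or 'borderless'.
    The thresholds are illustrative assumptions, not pdfplumber defaults.
    """
    min_len = page_width * min_rule_frac
    h_rules = v_rules = bands = 0

    for obj in list(lines) + list(rects):
        width = obj["x1"] - obj["x0"]
        height = obj["bottom"] - obj["top"]
        if height <= max_rule_thickness and width >= min_len:
            h_rules += 1          # long, thin, horizontal: a rule
        elif width <= max_rule_thickness and height > max_rule_thickness:
            v_rules += 1          # thin and tall: a vertical border

    for rect in rects:
        width = rect["x1"] - rect["x0"]
        height = rect["bottom"] - rect["top"]
        if width >= min_len and height > max_rule_thickness:
            bands += 1            # wide, filled area: a background band

    if h_rules >= 2 and v_rules >= 2:
        return "bordered"
    if h_rules >= 2:
        return "ruled"
    if bands >= 2:
        return "banded"
    return "borderless"
```

With a pdfplumber page, this could be called as `classify_table_layout(page.lines, page.rects, page.width)`, and the label used to pick between different `table_settings` dictionaries.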

petermr · 2023-05-28T10:11:02Z

I am also interested in these types of table. Both are very common. The H-line separators (first two examples) are fairly standard and the banded/background table (3rd) is also very common.
I wrote tools in Java to extract these and it worked well for a lot of cases. I think the strategy would transfer to Python/pdfplumber as long as all the primitives were accessible.

It's messy. The H-lines may not be lines at all (they may be thin rectangles), or they may be polyline segments (one for each column). The main challenge is that these are not always simple cartesian tables (rows*cols): the third table has mini-trees in the headers. This sort of logic can get messy. So there are two phases, I think:

1. identify the cells
2. analyse the table structure
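A minimal sketch of one piece of the first phase: collecting horizontal rules whether they were drawn as lines, thin rectangles, or per-column segments, and welding collinear segments together. This is not the Java code mentioned above; it assumes pdfplumber-style dicts with `x0`, `x1`, `top` and `bottom` keys, and the function names and tolerances are hypothetical.

```python
def horizontal_rules(lines, rects, max_thickness=2.0, min_length=5.0,
                     y_tol=1.0, gap_tol=3.0):
    """Return merged horizontal rules as (y, x0, x1) tuples, top to bottom."""
    # Keep anything thin and wide enough, whatever object type drew it.
    segs = sorted(
        ((o["top"] + o["bottom"]) / 2, o["x0"], o["x1"])
        for o in list(lines) + list(rects)
        if o["bottom"] - o["top"] <= max_thickness
        and o["x1"] - o["x0"] >= min_length
    )

    # Group segments that sit at (nearly) the same height.
    groups = []
    for seg in segs:
        if groups and seg[0] - groups[-1][0][0] <= y_tol:
            groups[-1].append(seg)
        else:
            groups.append([seg])

    # Within each group, merge spans separated by small horizontal gaps.
    rules = []
    for group in groups:
        y = group[0][0]
        spans = sorted((x0, x1) for _, x0, x1 in group)
        cur0, cur1 = spans[0]
        for x0, x1 in spans[1:]:
            if x0 <= cur1 + gap_tol:
                cur1 = max(cur1, x1)
            else:
                rules.append((y, cur0, cur1))
                cur0, cur1 = x0, x1
        rules.append((y, cur0, cur1))
    return rules


def row_bands(rules):
    """Pair consecutive rule heights into (top, bottom) candidate row bands."""
    ys = sorted({y for y, _, _ in rules})
    return list(zip(ys, ys[1:]))
```

The merged y-positions could then be tried as `explicit_horizontal_lines` in pdfplumber's table settings, with `"horizontal_strategy": "explicit"`. Header mini-trees would still need the separate structure-analysis phase.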

I'm interested in working gently in this area on a continuing basis. Generally I hack my own code downstream of pdfplumber output (an adapter) but it could make sense to link the logic more closely to the distribution. Does pdfplumber have reusable downstream contrib code?

1 reply

jsvine Jun 1, 2023
Maintainer

Hi @petermr, and thanks for chiming in. I agree: It'd be great to provide more support for tricky-but-frequent table structures like these.

I'm interested to hear more about this proposal:

Does pdfplumber have reusable downstream contrib code?

Could you provide a bit more detail about what you have in mind?

I also like the idea of making table-extraction more extensible — something I've wanted to improve for a while, but haven't quite found the right solution for. This gives me more motivation to pursue that.

petermr · 2023-06-01T13:48:23Z

Two separate topics:

On Thu, Jun 1, 2023 at 2:41 PM Jeremy Singer-Vine wrote:
Hi @petermr, and thanks for chiming in. I agree: It'd be great to provide more support for tricky-but-frequent table structures like these.

Besides tables we need to support lists. I have some well flagged lists (bulleted) and I'll try them over the next day or two. The harder thing is to auto-recognise tables and lists, especially where it's only indentation (no lines or symbols).

P.

-- Peter Murray-Rust, Founder ContentMine.org and Reader Emeritus in Molecular Informatics, Dept. of Chemistry, University of Cambridge, CB2 1EW, UK

petermr · 2023-06-01T14:13:14Z

On Thu, Jun 1, 2023 at 2:41 PM Jeremy Singer-Vine wrote:
Does pdfplumber have reusable downstream contrib code? Could you provide a bit more detail about what you have in mind?

Thanks to pdfplumber I've made a lot of progress in reading IPCC reports (7 reports, 30+ documents, 15000 (sic) pages). I've written a lot of fairly generic code (not IPCC specific) and people have recently said they want to do other reports. Much of this has logic which is downstream of pdfplumber (I capture the character stream and weld it into lines, (nested) paragraphs, etc.). An example - not finished - is https://github.com/petermr/semanticClimate/blob/main/ipcc/ar6/syr/lr/html/fulltext/groups_groups.html where a PDF file is autotransformed to structured HTML (retaining styles, annealing page-breaks, collating footnotes and much other document stuff).

So there are roughly these parts:

1. PDF -> word-stream (fonts, sizes, styles) and images, vectors, etc.
2. word-stream -> structured document (ideally in JATS-XML/HTML) - what the author wrote!
3. images/vectors -> SVG (I have done this with PDFBox/Java so it's possible)

2 and 3 are largely independent of PDFPlumber but it's often what people want. Tables, lists and paragraphs are in 2.

PDFPlumber does most of the work for 1, but misses A. control points in curves (I think this is pdfminer.six's problem?) and, in 3, B. interpretation of bitstreams for images.
https://github.com/petermr/pyami/blob/pmr16/test/test_integrate.py

At present my PDF code is in a larger system (which includes GUIs) and I'm thinking of separating it. A natural place would be a pdfplumber/contrib where others can use it if they like and it could hopefully be improved. No rush... and this won't happen quickly from me, but it may be useful to find fellow minds.

P.
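To make the "weld the character stream into lines" step concrete, here is a minimal sketch of it. It is not the code from the repositories linked above; it assumes pdfplumber-style char dicts with `text`, `x0`, `x1`, `top` and `size` keys, and the function name and both tolerances are hypothetical.

```python
def weld_chars_into_lines(chars, y_tol=3.0, space_frac=0.25):
    """Group characters into text lines and insert spaces at visible gaps."""
    # Cluster characters whose tops are within y_tol of the line's first char.
    lines = []
    for ch in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        if lines and abs(ch["top"] - lines[-1][0]["top"]) <= y_tol:
            lines[-1].append(ch)
        else:
            lines.append([ch])

    texts = []
    for line in lines:
        line.sort(key=lambda c: c["x0"])  # left-to-right reading order
        text = line[0]["text"]
        for prev, cur in zip(line, line[1:]):
            # A gap wider than a fraction of the font size counts as a space.
            if cur["x0"] - prev["x1"] > space_frac * cur["size"]:
                text += " "
            text += cur["text"]
        texts.append(text)
    return texts
```

Paragraphs, lists and tables would be built on top of these lines, for example by comparing indentation and vertical spacing between consecutive lines.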

dhdaines · 2023-07-21T19:28:31Z

Note that tables created with (recent versions of) MS Word can reliably be extracted using the structure tree (see #937 (comment)) - however this does not solve the problem of not having the borders, because we can only get the bounding box of the table from the marked content sections, i.e. the characters and images.

I ended up just adding a margin around the table's bounding box (I am extracting them as images for display, since even with the structure tree the internal layout is not accurately represented)...
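A minimal sketch of that margin step, assuming bounding boxes in pdfplumber's `(x0, top, x1, bottom)` order. The helper name `pad_bbox` is hypothetical, and clamping to the page is an added assumption so the padded box stays within the page's bounds.

```python
def pad_bbox(bbox, margin, page_bbox):
    """Grow bbox by margin on every side, clamped to the page's bbox."""
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = page_bbox
    return (
        max(x0 - margin, px0),
        max(top - margin, ptop),
        min(x1 + margin, px1),
        min(bottom + margin, pbottom),
    )
```

With pdfplumber, the padded box could then be rendered with something like `page.crop(pad_bbox(table_bbox, 10, page.bbox)).to_image(resolution=150)`, where `table_bbox` comes from whichever detection method found the table.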

3 replies
