extract jointly body paragraphs and the table in the pdf #1005

Closed Answered by cmdlineluser
zzflybird asked this question in Q&A
It usually helps if you can provide an example file.

I made one here: table.pdf (using fpdf2)

If you use .find_tables() you get the actual table objects, which lets you access their coordinates (for example, each table's bounding box).

You could then use .outside_bbox() to filter out the tables from the page with this information.

.extract_text_lines() gives you the text line objects with coords/positional values.

With separate table and line objects, you could sort them by their position on the page.

Something like:

import pdfplumber
from operator import itemgetter

page = pdfplumber.open("table.pdf").pages[0]
tables = page.find_tables()

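# Start from the full page and crop out each table's bounding box in turn
page_without_tables = page
for table in tables:
    page_without_tables = page_without_tables.outside_bbox(table.bbox)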
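
The sorting step described above (ordering text lines and tables by their position on the page) could be sketched roughly as follows. This is only an illustration in plain Python: the helper name merge_in_reading_order and the sample data are hypothetical, and plain dicts and tuples stand in for pdfplumber's line dicts and table objects so the snippet runs without a PDF.

```python
from operator import itemgetter


def merge_in_reading_order(lines, tables):
    """Interleave text lines and tables by their top coordinate.

    lines:  dicts with at least "top" and "text" keys
    tables: (bbox, rows) pairs, where bbox is (x0, top, x1, bottom)
    (Hypothetical helper, not part of pdfplumber.)
    """
    items = [
        {"top": line["top"], "kind": "text", "content": line["text"]}
        for line in lines
    ]
    items += [
        {"top": bbox[1], "kind": "table", "content": rows}
        for bbox, rows in tables
    ]
    # Smaller "top" means closer to the top of the page
    return sorted(items, key=itemgetter("top"))


# Made-up stand-in data for illustration only
lines = [
    {"top": 50.0, "text": "Intro paragraph"},
    {"top": 300.0, "text": "Closing paragraph"},
]
tables = [((72.0, 120.0, 540.0, 260.0), [["a", "b"], ["1", "2"]])]

merged = merge_in_reading_order(lines, tables)
for item in merged:
    print(item["kind"], item["content"])
```

With pdfplumber, the lines would presumably come from page_without_tables.extract_text_lines() and the table pairs from something like [(t.bbox, t.extract()) for t in tables]. Sorting on top alone assumes a single-column layout.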

Answer selected by zzflybird