extract table headings along with table contents #1011

poojitharamachandra · 2023-10-11T07:15:40Z

poojitharamachandra
Oct 11, 2023

Hi,
How can I extract the table headings in addition to the table contents?

Answered by jsvine

Oct 12, 2023


jsvine · 2023-10-12T16:08:13Z

jsvine
Oct 12, 2023
Maintainer

Hi @poojitharamachandra; thanks for your interest in this library. There's no built-in automated functionality for what you describe, because the logic ends up being quite custom to the particular layout and structure of any given PDF. But the general idea would be to use page.find_tables() to find the tables on a page. Each table's .bbox property gives you its coordinates, which you can use with page.crop((x0, top, x1, bottom)).extract_text() to select an area above the table (perhaps with page.filter(...) as well, depending on what else is in that area). You determine those crop coordinates yourself, based on the spacing between the table and its heading.
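A minimal sketch of that idea, under some assumptions: the heading sits within a fixed distance above the table and is no wider than the table. The helper names (region_above, tables_with_headings) and the 40-point default for max_height are illustrative, not part of pdfplumber, and the right distance will depend on your PDF.

```python
def region_above(table_bbox, page_bbox, max_height=40.0):
    """Return the (x0, top, x1, bottom) box directly above a table, or None.

    max_height is an assumed gap size in PDF points; tune it per document.
    """
    x0, top, x1, _bottom = table_bbox
    page_top = page_bbox[1]
    # Clamp to the page so that page.crop() does not reject the box.
    heading_top = max(page_top, top - max_height)
    if heading_top >= top:
        return None  # the table touches the top of the page
    return (x0, heading_top, x1, top)


def tables_with_headings(pdf_path, max_height=40.0):
    """Pair each table's rows with the text found just above the table."""
    import pdfplumber  # imported lazily so region_above() works without it

    results = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.find_tables():
                box = region_above(table.bbox, page.bbox, max_height)
                heading = (page.crop(box).extract_text() or "") if box else ""
                results.append(
                    {
                        "page": page.page_number,
                        "heading": heading.strip(),
                        "rows": table.extract(),
                    }
                )
    return results
```

If the heading is wider than the table, one option is to take x0 and x1 from page.bbox instead of the table's bbox.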

1 reply

poojitharamachandra Oct 13, 2023
Author

Thanks, that helped.

poojitharamachandra · 2023-10-13T05:46:58Z

poojitharamachandra
Oct 13, 2023
Author

Is there a way to extract the section heading under which a particular table appears in the PDF?

1 reply

dhdaines Oct 13, 2023

There isn't any general way to do this, because section headings could be represented visually in a multitude of ways.

If it's a tagged PDF, then the section headings themselves may be tagged with a heading-level tag, e.g. H1, H2, H3, etc., which you can access in the tag entry of the dictionaries returned by .chars, or by .extract_words() if you pass extra_attrs=["tag"]. But even in tagged PDFs there isn't any notion of a "section", so you would just have to assume that the most recent section heading before a given table is the one that contains it.
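A rough sketch of the "most recent heading" assumption, for illustration only. It assumes a pdfplumber version whose char objects carry a tag attribute, that the PDF uses the standard heading tags (H, H1 to H6) rather than custom role names, and that the heading is on the same page as the table. The function names are hypothetical.

```python
import re

# H, H1 ... H6 are the standard heading structure types in tagged PDFs.
HEADING_TAG = re.compile(r"^H[1-6]?$")


def nearest_heading_above(words, table_top):
    """Return the text of the last heading-tagged run of words above table_top.

    words is a list of dicts with "text", "bottom" and "tag" keys, in
    reading order. Returns None if no heading precedes the table.
    """
    runs = []
    prev_tag = None
    for word in words:
        if word["bottom"] > table_top:
            continue  # skip anything that is not fully above the table
        tag = word.get("tag") or ""
        if not HEADING_TAG.match(tag):
            prev_tag = None
            continue
        if tag != prev_tag:
            runs.append([])  # a new heading starts here
        runs[-1].append(word["text"])
        prev_tag = tag
    return " ".join(runs[-1]) if runs else None


def headings_for_tables(pdf_path):
    """Yield (page_number, heading_or_None) for every table in the PDF."""
    import pdfplumber  # imported lazily; only needed for real PDFs

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extra_attrs copies each char's tag onto the word dicts.
            words = page.extract_words(extra_attrs=["tag"])
            for table in page.find_tables():
                yield page.page_number, nearest_heading_above(words, table.bbox[1])
```

When a table starts a page, its heading may be on an earlier page, so a fuller version would carry the last heading seen forward from page to page.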
