Text extraction excluding Image and Tables #614

ytiam · 2022-02-24T06:54:00Z

Hi guys,
Is there a way to extract only the text in a page's paragraph lines, excluding any text that appears inside images and tables? If anyone knows an approach or the logic for this, please share it.

jsvine · 2022-03-03T02:12:56Z

Maintainer

Hi @ytiam, and thanks for your interest in this library. I'm not quite sure what you mean about text inside an "Image", but here's one way you could exclude text within a table:

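from pdfplumber.utils import intersects_bbox

def get_nontable_text(page):
    # Locate every table pdfplumber detects on the page
    tables = page.find_tables()

    def outside_tables(obj):
        # Keep an object only if it intersects none of the table bounding boxes
        return not any(intersects_bbox([obj], t.bbox) for t in tables)

    # Filter the page's objects, then extract text from what remains
    return page.filter(outside_tables).extract_text()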

2 replies

ytiam Mar 4, 2022
Author

Thanks for your reply and solution, @jsvine. I will try this and let you know how it goes. Because I was up against a project deadline, I had already found an alternative solution on Stack Overflow, which is based on a similar bounding-box approach.

Link: https://stackoverflow.com/questions/69407850/pdfplumber-extract-text-function-also-extracts-text-from-the-table-only-want-to

I developed my custom code based on this and it is working fine for both tables and images.
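
For readers who want a starting point covering both cases, here is a minimal sketch of how the bounding-box idea might be extended to images as well as tables. It is not the custom code mentioned above, which was not posted. The helper names `overlaps`, `make_outside_filter`, and `get_body_text` are hypothetical; the sketch assumes `page` is a pdfplumber page, that `page.images` entries carry `x0`, `top`, `x1`, and `bottom` keys, and that any overlap with a table or image box is enough to drop a character.

```python
def overlaps(obj, bbox):
    """Return True if obj's box overlaps bbox, given as (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    return (
        obj["x0"] < x1 and obj["x1"] > x0
        and obj["top"] < bottom and obj["bottom"] > top
    )


def make_outside_filter(bboxes):
    """Build a predicate suitable for page.filter() (hypothetical helper)."""
    def outside(obj):
        # Only characters affect extract_text(); leave other objects alone.
        if obj.get("object_type") != "char":
            return True
        return not any(overlaps(obj, bbox) for bbox in bboxes)
    return outside


def get_body_text(page):
    """Extract text from characters lying outside all tables and images."""
    table_bboxes = [t.bbox for t in page.find_tables()]
    image_bboxes = [
        (im["x0"], im["top"], im["x1"], im["bottom"]) for im in page.images
    ]
    keep = make_outside_filter(table_bboxes + image_bboxes)
    return page.filter(keep).extract_text()
```

With a PDF opened through `pdfplumber.open(...)`, calling `get_body_text(pdf.pages[0])` would return the filtered text for the first page. Note that this only affects characters stored as text objects in the PDF; text that exists solely as pixels inside an image is not extracted by pdfplumber in the first place.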

jsvine Mar 5, 2022
Maintainer

Great, and thanks for the update!
