Generalize Parser to handle all types of PDFs (2-cols, 3-cols, or Combination) #974

DevanshuBrahmbhatt · 2023-08-26T19:50:05Z


Hello Folks,
I want to write code so that when a user uploads a PDF, the parser detects its layout (2-column or 3-column) and extracts the text in the proper reading order. With the code below, I cannot extract 2-column data properly. Please help me. Thank you.

 page_text = page.extract_text()

Answered by samkit-jain


samkit-jain · 2023-08-29T10:49:46Z

Collaborator

Hi @DevanshuBrahmbhatt, I appreciate your interest in the library. Could you please provide more information on what you want to achieve here? Assuming you want to read an N-column PDF column by column, you can refer to the code I have shared at #975 (comment), which will give you all the vertical lines that divide the PDF into columns. Once you have those, you can crop the page between consecutive lines and extract the text from each crop. Something like

import math

import pdfplumber

pdf = pdfplumber.open("tests/pdfs/federal-register-2020-17221.pdf")  # https://github.com/jsvine/pdfplumber/blob/stable/tests/pdfs/federal-register-2020-17221.pdf
page = pdf.pages[0]

# Crop the top and bottom 5% of the page.
page = page.crop((0.0, 0.05 * float(page.height), float(page.width), 0.95 * float(page.height)))

# Treat the page as a grid that is one point wide per cell.
# is_pos_blank[x] records whether horizontal position x is free of text.
# If the list ends up looking like
#   T F F F F F T T F F F T
# then we have a 2-column PDF: there are 3 runs of blank positions
# (left margin, gutter, right margin), i.e. 3 vertical lines that
# don't cut through any text.
# math.ceil keeps the index in range when page.width is fractional.
is_pos_blank = [True] * math.ceil(page.width)
for word in page.extract_words():
    for pos in range(int(word["x0"]), math.ceil(word["x1"])):
        is_pos_blank[pos] = False

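# Record the x position where each run of blank positions starts.
empty_columns = []

match = False
for pos, is_blank in enumerate(is_pos_blank):
    if is_blank and not match:
        empty_columns.append(pos)
        match = True
    elif not is_blank:
        match = False

# Number of text columns, assuming both page margins are blank.
print(len(empty_columns) - 1)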

im = page.to_image(resolution=200)
im.draw_rects(page.extract_words(), stroke_width=2)
im.draw_vlines(empty_columns, stroke_width=5)
im.save("image.png", format="PNG")

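# Crop between consecutive blank-run starts and print each column's text,
# followed by a few blank lines as a separator.
prev_index = 0
for pos in empty_columns:
    if prev_index != pos:
        print(page.crop((prev_index, 0, pos, page.height), relative=True).extract_text())
        print("\n\n\n")
    prev_index = pos

# Whatever lies to the right of the last blank-run start.
print(page.crop((prev_index, 0, page.width, page.height), relative=True).extract_text())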
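One limitation of the snippet above is that every blank run, however narrow, counts as a column boundary. The following is a minimal sketch, not part of the original answer, that wraps the same blank-position scan in a function and ignores gaps narrower than a threshold. The function name `find_column_bounds` and the `min_gap` parameter are hypothetical; the only assumption about the input is that each word is a dict with `x0` and `x1` keys, as returned by `page.extract_words()`.

```python
import math


def find_column_bounds(words, page_width, min_gap=5):
    """Return (left, right) x-ranges of text columns, right end exclusive.

    Hypothetical helper: a blank run only separates two columns if it is
    at least `min_gap` points wide.
    """
    width = math.ceil(page_width)
    is_blank = [True] * width
    for word in words:
        # Clamp to the grid so stray coordinates cannot raise IndexError.
        start = max(0, int(word["x0"]))
        stop = min(width, math.ceil(word["x1"]))
        for pos in range(start, stop):
            is_blank[pos] = False

    columns = []
    col_start = None  # x position where the current column began
    blank_run = 0     # length of the blank run seen since the last text
    for pos, blank in enumerate(is_blank):
        if not blank:
            if col_start is None:
                col_start = pos
            blank_run = 0
        elif col_start is not None:
            blank_run += 1
            if blank_run >= min_gap:
                # The column ended where this blank run started.
                columns.append((col_start, pos - blank_run + 1))
                col_start = None
    if col_start is not None:
        # Text ran to (or near) the right edge of the page.
        columns.append((col_start, width - blank_run))
    return columns
```

Each returned range could then be passed to `page.crop((left, 0, right, page.height), relative=True).extract_text()`, as in the loop above. A suitable `min_gap` depends on the document and would need tuning.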

1 reply

samkit-jain Aug 31, 2023
Collaborator

- Use .extract_words() to get all the words' bounding boxes and then perform clustering to get all the paragraphs. Then read the paragraphs based on the ordering you want.
- Use .extract_words(use_text_flow=True) and it will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.
- Use a different tool like https://artifex.com/blog/extract-text-from-a-multi-column-document-using-pymupdf-inpython
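As an illustration of the first suggestion (cluster the word boxes, then read the clusters in order), here is a rough sketch using a recursive XY-cut. This is one possible clustering approach chosen for the example, not something the reply prescribes. The function names and both gap thresholds are hypothetical. The input is assumed to be a list of dicts with `text`, `x0`, `x1`, `top`, and `bottom` keys, as returned by `page.extract_words()`, and text lines are assumed not to overlap vertically.

```python
def split_on_gaps(words, lo_key, hi_key, min_gap):
    """Group words along one axis, starting a new group at each gap >= min_gap."""
    groups = []
    reach = None  # furthest extent covered by the current group
    for word in sorted(words, key=lambda w: w[lo_key]):
        if reach is None or word[lo_key] - reach >= min_gap:
            groups.append([])
            reach = word[hi_key]
        else:
            reach = max(reach, word[hi_key])
        groups[-1].append(word)
    return groups


def xy_cut(words, min_row_gap=15.0, min_col_gap=10.0):
    """Split words into blocks, returned top-to-bottom, then left-to-right."""
    if not words:
        return []
    # First try to cut the region into horizontal bands (e.g. title vs. body).
    rows = split_on_gaps(words, "top", "bottom", min_row_gap)
    if len(rows) > 1:
        return [b for row in rows for b in xy_cut(row, min_row_gap, min_col_gap)]
    # Otherwise try to cut it into side-by-side columns.
    cols = split_on_gaps(words, "x0", "x1", min_col_gap)
    if len(cols) > 1:
        return [b for col in cols for b in xy_cut(col, min_row_gap, min_col_gap)]
    # No cut possible: this region is a single block.
    return [words]


def block_text(block):
    """Join a block's words into lines, assuming lines do not overlap vertically."""
    lines = split_on_gaps(block, "top", "bottom", 0.0)
    return "\n".join(
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    )
```

With pdfplumber, `[block_text(b) for b in xy_cut(page.extract_words())]` would give one string per block. Because the cut is applied recursively, a full-width heading followed by a two- or three-column body is handled band by band, which is the "combination" case from the question. The thresholds would need tuning per document.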

samkit-jain · 2023-08-31T16:47:43Z

Collaborator

Closing since a similar discussion, #975, is more active.
