Generalize Parser to handle all types of PDFs (2-cols, 3-cols, or Combination) #974
Hello Folks,
Hi @DevanshuBrahmbhatt, I appreciate your interest in the library. Could you please provide more information on what you want to achieve here? Assuming you want to read an N-column PDF column by column, you can refer to the code I shared at #975 (comment), which finds all the vertical lines that divide the PDF into columns. Once you have those, you can crop the page between consecutive lines and extract the text from each crop. Something like:

import math
import pdfplumber
pdf = pdfplumber.open("tests/pdfs/federal-register-2020-17221.pdf") # https://github.com/jsvine/pdfplumber/blob/stable/tests/pdfs/federal-register-2020-17221.pdf
page = pdf.pages[0]
# Crop the top and bottom 5% of the page.
page = page.crop((0.0, 0.05 * float(page.height), float(page.width), 0.95 * float(page.height)))
# Treat the page as a row of "width" one-unit-wide vertical strips.
# is_pos_blank[x] records whether strip x is free of text.
# If the list looks like
# T F F F F F T T F F F T
# then we have a 2-column PDF: there are 3 blank runs (left margin,
# gutter, right margin), i.e. 3 vertical lines that don't cut through
# any text.
# math.ceil keeps the last partial unit of a fractional page width indexable.
is_pos_blank = [True] * math.ceil(page.width)
for word in page.extract_words():
    for pos in range(int(word["x0"]), math.ceil(word["x1"])):
        is_pos_blank[pos] = False
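# Record the x position where each run of blank strips begins.
empty_columns = []
match = False  # True while we are inside a blank run
for pos, is_blank in enumerate(is_pos_blank):
    if is_blank and not match:
        empty_columns.append(pos)
        match = True
    elif not is_blank:
        match = False
# N columns are bounded by N + 1 blank runs, so this prints the column count.
print(len(empty_columns) - 1)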
im = page.to_image(resolution=200)
im.draw_rects(page.extract_words(), stroke_width=2)
im.draw_vlines(empty_columns, stroke_width=5)
im.save("image.png", format="PNG")
prev_index = 0
for pos in empty_columns:
    if prev_index != pos:
        print(page.crop((prev_index, 0, pos, page.height), relative=True).extract_text())
        prev_index = pos
        print()
        print()
        print()
        print()
print(page.crop((prev_index, 0, page.width, page.height), relative=True).extract_text())
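
If it helps to reuse this on pages with any number of columns, the blank-run logic above can be pulled into two small functions that work on plain word dicts, so they can be tried without opening a PDF. This is only a rough sketch: `blank_run_starts` and `column_bounds` are hypothetical helper names (not part of pdfplumber), and it assumes each word is a dict with `"x0"` and `"x1"` keys, as in the `page.extract_words()` output used above.

```python
import math


def blank_run_starts(words, width):
    """Hypothetical helper: x position where each run of text-free strips begins.

    `words` is an iterable of dicts with "x0" and "x1" keys (assumed shape).
    `width` is the page width in the same units.
    """
    n = math.ceil(width)
    is_blank = [True] * n
    for word in words:
        # Clamp to the page so stray coordinates cannot index out of range.
        start = max(0, int(word["x0"]))
        stop = min(n, math.ceil(word["x1"]))
        for x in range(start, stop):
            is_blank[x] = False

    starts = []
    in_run = False
    for x, blank in enumerate(is_blank):
        if blank and not in_run:
            starts.append(x)
        in_run = blank
    return starts


def column_bounds(starts, width):
    """Hypothetical helper: (left, right) crop ranges between blank-run starts.

    Mirrors the loop above: the first range includes the left margin and
    the last range is whatever lies to the right of the final blank run.
    """
    edges = [s for s in starts if s > 0] + [math.ceil(width)]
    bounds = []
    left = 0
    for right in edges:
        if right > left:
            bounds.append((left, right))
        left = right
    return bounds


# Two columns of made-up words on a 100-unit-wide page.
words = [
    {"x0": 10.2, "x1": 39.5},
    {"x0": 12, "x1": 35},
    {"x0": 50, "x1": 89.9},
]
starts = blank_run_starts(words, 100)
bounds = column_bounds(starts, 100)
print(len(starts) - 1, bounds)
```

With a real page you would pass `page.extract_words()` and `page.width`, then call `page.crop((left, 0, right, page.height), relative=True).extract_text()` for each `(left, right)` pair, as in the loop above. Like the original snippet, this counts every blank strip as a separator, so narrow gaps inside a column (for example in a short page with few words) can show up as extra columns.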
Closing since a similar discussion, #975, is more active.