How to check 2-col pdfs? (Code for checking center of pdf for white space or text) #975
DevanshuBrahmbhatt
started this conversation in
Ask for help with specific PDFs
Hi @DevanshuBrahmbhatt, thanks for your interest in the library. Are you running into a problem with the code you provided? Please share more details about the issue or what you need help with; attaching the PDF would help too. Assuming you want to know how many columns a PDF page has, you can try the following code:

import math
import pdfplumber
pdf = pdfplumber.open("tests/pdfs/federal-register-2020-17221.pdf") # https://github.com/jsvine/pdfplumber/blob/stable/tests/pdfs/federal-register-2020-17221.pdf
page = pdf.pages[0]
# Crop the top and bottom 5% of the page.
page = page.crop((0.0, 0.05 * float(page.height), float(page.width), 0.95 * float(page.height)))
# Treat the page as a table with one column per point of page width.
# This list records whether each x position is free of text.
# If the list looks like
# T F F F F F T T F F F T
# then the page has 2 text columns, because there are 3 vertical blank
# strips that don't cut through any text.
is_pos_blank = [True] * int(page.width)
for word in page.extract_words():
    for pos in range(int(word["x0"]), math.ceil(word["x1"])):
        is_pos_blank[pos] = False
empty_columns = []
match = False
for pos, is_blank in enumerate(is_pos_blank):
    if is_blank and not match:
        empty_columns.append(pos)
        match = True
    elif not is_blank:
        match = False
print(len(empty_columns) - 1) # Number of columns in the PDF.
# For debugging.
im = page.to_image(resolution=200)
im.draw_rects(page.extract_words(), stroke_width=2)
im.draw_vlines(empty_columns, stroke_width=5)
im.save("image.png", format="PNG")

Feel free to optimise the code based on your needs.
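If it helps, here is a rough sketch of the same blank-strip idea packaged as plain functions that can be tried without a PDF. The names `blank_runs`, `count_columns`, `center_is_blank`, and the `min_gap` and `strip_width` parameters are hypothetical and not part of pdfplumber; the functions only assume a list of dicts with `x0` and `x1` keys, which is the shape `page.extract_words()` returns. Unlike the snippet above, this version counts text bands directly, so it does not rely on the page having blank left and right margins, and it clamps word positions to the page width.

```python
import math


def blank_runs(words, page_width, min_gap=1):
    """Return (start, end) x-ranges, in whole points, that no word overlaps.

    `min_gap` is a hypothetical tuning knob: blank runs narrower than this
    many points are ignored.
    """
    size = math.ceil(page_width)
    is_blank = [True] * size
    for word in words:
        # Clamp to the page so a word touching an edge cannot overflow the list.
        start = max(int(word["x0"]), 0)
        end = min(math.ceil(word["x1"]), size)
        for pos in range(start, end):
            is_blank[pos] = False

    runs = []
    run_start = None
    for pos, blank in enumerate(is_blank):
        if blank and run_start is None:
            run_start = pos
        elif not blank and run_start is not None:
            if pos - run_start >= min_gap:
                runs.append((run_start, pos))
            run_start = None
    # Close a blank run that extends to the right edge of the page.
    if run_start is not None and size - run_start >= min_gap:
        runs.append((run_start, size))
    return runs


def count_columns(words, page_width, min_gap=1):
    """Count the bands of text that are separated by blank runs."""
    size = math.ceil(page_width)
    bands = 0
    prev_end = 0
    for start, end in blank_runs(words, page_width, min_gap):
        if start > prev_end:
            bands += 1  # text lies between the previous blank run and this one
        prev_end = end
    if prev_end < size:
        bands += 1  # text runs up to the right edge of the page
    return bands


def center_is_blank(words, page_width, strip_width=10):
    """Return True if no word overlaps a strip centred on the page midline."""
    mid = page_width / 2
    left, right = mid - strip_width / 2, mid + strip_width / 2
    return not any(w["x0"] < right and w["x1"] > left for w in words)


# Made-up word boxes for a 612-point-wide page with two text columns.
sample_words = [
    {"x0": 50, "x1": 280},
    {"x0": 60, "x1": 290},
    {"x0": 320, "x1": 560},
]
print(count_columns(sample_words, 612))
print(center_is_blank(sample_words, 612))
```

With a real page this would be called as `count_columns(page.extract_words(), page.width)`. Raising `min_gap` to a few points may help when narrow accidental gaps would otherwise be counted as column gutters, but the right value depends on the document.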