Reading order of the page #712
gokiberk
started this conversation in
Ask for help with specific PDFs
Hi @gokiberk, you've encountered one of the interesting things about PDFs: generally speaking, they don't strictly specify a reading order. What I would suggest for this particular PDF is to process the two halves of the page separately. E.g., based on your current code:

```python
import pdfplumber

with pdfplumber.open("ssb.pdf") as pdf:
    pdf2string = ""
    for page in pdf.pages:
        # Crop boxes are (x0, top, x1, bottom): left half, then right half.
        left = page.crop((0, 0, page.width / 2, page.height))
        right = page.crop((page.width / 2, 0, page.width, page.height))
        # Emit the whole left column before the right column.
        pdf2string += left.extract_text() + "\n" + right.extract_text()
```

This solution isn't perfect; for instance, it splits the …
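The half-page split assumes the gutter between the columns sits exactly at `page.width / 2`. If it doesn't, one possible refinement is to look for the widest word-free vertical strip near the middle of the page and crop there instead. The following is only a rough sketch: `find_gutter` and its `band` parameter are hypothetical names, not part of pdfplumber, and the function assumes it is given word dictionaries with `x0` and `x1` keys, such as those returned by `page.extract_words()`.

```python
def find_gutter(words, page_width, band=(0.25, 0.75)):
    """Return the x-coordinate at the centre of the widest word-free
    vertical strip inside the central band of the page.

    Falls back to page_width / 2 when no such strip is found.
    `words` is a list of dicts with "x0" and "x1" keys (hypothetical
    helper; the dict shape mirrors pdfplumber's extract_words() output).
    """
    lo, hi = band[0] * page_width, band[1] * page_width
    best_gap, best_x = 0.0, page_width / 2
    covered_until = 0.0  # right edge of the horizontal span seen so far

    # Sweep left to right over the horizontal extents of all words.
    for x0, x1 in sorted((w["x0"], w["x1"]) for w in words):
        if x0 > covered_until:
            gap = x0 - covered_until
            gap_mid = (covered_until + x0) / 2
            # Only accept gaps centred in the middle band, so that the
            # page margins are not mistaken for the gutter.
            if lo <= gap_mid <= hi and gap > best_gap:
                best_gap, best_x = gap, gap_mid
        covered_until = max(covered_until, x1)
    return best_x
```

The result could then replace `page.width / 2` in the two `page.crop(...)` calls, e.g. `x = find_gutter(page.extract_words(), page.width)`. Note that a heading spanning both columns would cover the gutter, in which case this sketch simply falls back to the half-page split.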
Hi,
Thank you for this powerful tool; it works amazingly well for the most part. However, this part of the PDF behaves a little strangely.
The output is as follows:
The output I would like to receive:
STATİK YAKMA FIRINI
NATO Support Agency (NSPA) Patlayıcı Malzeme
İmhası Projesi kapsamında Kırıkkale’de
mühimmat ayırma ayıklama tesisleri kurulmuştur.
TAHTA YAKMA FIRINI
Sistem, koruyucu olarak Penta Kloro Fenol
içeren mühimmat sandıklarını çevreye uyumlu
bir şekilde imha etmekte ve ısı geri kazanımıyla
buhar üretmektedir. İmha sisteminde saatte
500 kg sandık iki aşamalı olarak kırılmakta,
metalik parçalar otomatik olarak ayrılmakta
ve akışkan yatak yöntemiyle yakılmaktadır.
Çıkan gaz yıkayıcılardan geçirilerek içerdiği
toz ve kimyasallardan arıtılmaktadır. NSPA için
geliştirilen bu sistem, 2006 yılından beri Türk
Silahlı Kuvvetleri tarafından kullanılmaktadır.
I have noticed that in similar cases there is a double space between the left and right columns, so I thought I could fix this with some substring operations that detect double spaces. However, when a line in the left column ends with a full stop, there is no double space, so this approach cannot separate the columns on that line.
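For illustration, the double-space idea might look roughly like the sketch below. `split_on_double_space` is a hypothetical helper, not part of pdfplumber, and it assumes the extracted text really does separate the columns with two or more spaces on each line.

```python
import re


def split_on_double_space(text):
    """Reorder two-column text so the left column comes before the right.

    Each line is split at its first run of two or more spaces. Lines
    without such a run are kept whole in the left column, which is
    exactly where this heuristic breaks down.
    """
    left, right = [], []
    for line in text.splitlines():
        parts = re.split(r" {2,}", line.strip(), maxsplit=1)
        left.append(parts[0])
        if len(parts) == 2:
            right.append(parts[1])
    return "\n".join(left + right)
```

A line such as `"kurulmuştur. Sistem"` has only a single space after the full stop, so both column fragments stay together in the left column, which is the failure described above.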
How can I get the left side first and then the right side? The order of the products on the page is not important: if there is one block of text at the top right and two blocks of text at the bottom left and bottom right, the top block can come second. The main thing for me is to get complete paragraphs.
My code is as follows:
P.S. I run the code on the complete PDF file, which is why I iterate over all the pages in the code.
The page I am having trouble with is here:
page0503.pdf