get_text() method uses incorrect reading order when transitioning to next page #2396

Answered by JorjMcKie
kodymoodley asked this question in Q&A

This is not a bug. "Naive" text extraction, as you are using it, always returns text in the order in which it was written to the page.
No attempt is made to interpret the page's layout or to infer a reading order.
The PDF creator decides when to write which portions of the text. For a multi-column page layout, the creator may decide to first write all the text of the left column, then that of the right.
It is equally possible that each character is written separately, in an arbitrary sequence out of the n! possible orderings if the page has n characters.
... And the page would still look exactly the same in your PDF viewer!
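
To make the point concrete, here is a minimal sketch of re-ordering extracted text by position instead of by storage order. It assumes blocks shaped like the leading fields of the tuples PyMuPDF returns from `page.get_text("blocks")`, i.e. `(x0, y0, x1, y1, text)`. The helper names and the sample coordinates are hypothetical and exist only for illustration; this is not necessarily the approach recommended in the rest of this answer.

```python
from typing import List, Tuple

# Assumed shape: (x0, y0, x1, y1, text), the leading fields of the tuples
# returned by PyMuPDF's page.get_text("blocks").
Block = Tuple[float, float, float, float, str]


def sort_blocks_top_left(blocks: List[Block]) -> List[Block]:
    """Order blocks top-to-bottom, then left-to-right (hypothetical helper).

    y0 is rounded so that blocks sitting on almost the same baseline
    are compared by their x0 instead.
    """
    return sorted(blocks, key=lambda b: (round(b[1]), b[0]))


def join_text(blocks: List[Block]) -> str:
    """Concatenate block texts in the positional order computed above."""
    return "\n".join(b[4].strip() for b in sort_blocks_top_left(blocks))


# Hypothetical blocks, listed in the order a PDF creator happened to write them.
sample: List[Block] = [
    (50.0, 300.0, 250.0, 320.0, "third"),
    (50.0, 100.0, 250.0, 120.0, "first"),
    (50.0, 200.0, 250.0, 220.0, "second"),
]

result = join_text(sample)
print(result)

# With a real document this would be fed from PyMuPDF, for example:
#   blocks = [b[:5] for b in page.get_text("blocks")]
#   print(join_text(blocks))
```

A plain top-left sort like this interleaves the lines of a multi-column page, so it only suits single-column layouts. Recent PyMuPDF versions also accept `page.get_text(sort=True)`, which applies a similar positional ordering; whether that is good enough depends on the document.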

Your document consists of scanned pages with an OCR layer. Therefore, your …

Category: Q&A
Labels: not a bug (not a bug / user error / unable to reproduce)
4 participants

This discussion was converted from issue #2393 on May 09, 2023 14:29.