get_text() method uses incorrect reading order when transitioning to next page #2396
Describe the bug (mandatory)
get_text() does not follow the correct reading order when extracting text sequentially from a double-column PDF.
To Reproduce (mandatory)
See the attached file for the unexpected output of the code: 31976R2339_unexpected.txt. When you open the text file in an editor that shows line numbers, line 143 is the problematic one. It starts with
Expected behavior (optional)
Line 143 of 31976R2339_unexpected.txt should start with
Your configuration (mandatory)
Output of
Replies: 2 comments 6 replies
This is not a bug. "Naive" text extraction, as you are using it, will always extract text in the order specified on the page. Your document consists of scanned pages with an OCR layer. Therefore, your OCR software has decided the sequence in which the recognized text is stored. You could use
Thanks to this whole thread for asking this question as well as the detailed answers about text order. I've been struggling with this also and I really appreciate everybody's insights!
This is not a bug. "Naive" text extraction, as you are using it, will always extract text in the order specified on the page.
No attempt is made to make sense of the page's layout or of any reading order.
The PDF creator decides when to write which portions of the text. For a multi-column page layout, it may decide to write all the text of the left column first, then that of the right column.
But it is also possible that each character is written separately, in an arbitrary sequence out of the n! possibilities if there are n characters on the page. And the page would still look exactly the same in your PDF viewer!
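Because the stored order can be arbitrary, one way to recover a two-column reading order is to sort text blocks by their position on the page. The following is only a minimal sketch under stated assumptions: each block is a tuple starting with (x0, y0, x1, y1, text), which matches the leading fields returned by page.get_text("blocks") in PyMuPDF, the page has exactly two columns split at the horizontal midpoint, and no heading spans both columns. The function name two_column_order and the sample coordinates are made up for illustration.

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text): the leading fields of a PyMuPDF "blocks" tuple.
Block = Tuple[float, float, float, float, str]


def two_column_order(blocks: List[Block], page_width: float) -> List[Block]:
    """Return blocks as left column top-to-bottom, then right column.

    Hypothetical helper: assumes exactly two columns split at the page's
    horizontal midpoint and no block that spans both columns.
    """
    mid = page_width / 2

    def sort_key(block: Block):
        x0, y0, x1, _y1, _text = block
        center = (x0 + x1) / 2
        column = 0 if center < mid else 1  # 0 = left column, 1 = right column
        return (column, y0, x0)

    return sorted(blocks, key=sort_key)


# Sample blocks in an arbitrary "stored" order (coordinates are invented).
stored = [
    (310, 50, 560, 100, "right top"),
    (40, 120, 290, 170, "left bottom"),
    (40, 50, 290, 100, "left top"),
    (310, 120, 560, 170, "right bottom"),
]

ordered_text = [b[4] for b in two_column_order(stored, page_width=595)]
print(ordered_text)
```

With real data, the list passed in would come from the page's extracted blocks and page_width from the page rectangle. Pages with full-width headings, footnotes, or more than two columns need a more careful column detection than this midpoint split.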
Your document consists of scanned pages with an OCR layer. Therefore, your …