get_text() method uses incorrect reading order when transitioning to next page #2396

kodymoodley · 2023-05-09T12:50:23Z

Describe the bug (mandatory)

get_text() does not follow the correct reading order when compiling or extracting the text in a double-column PDF in sequential order.

To Reproduce (mandatory)

import fitz  # PyMuPDF

with fitz.open('31976R2339.pdf') as doc:
    text = ""
    page_count = 0
    for page in doc:
        current_page_text = page.get_text()
        current_page_text = 'PAGE ' + str(page_count) + '\n\n' + current_page_text
        page_count += 1
        text += current_page_text + '\n\n\n'
    print(text)

See the attached file 31976R2339_unexpected.txt for the unexpected output of this code. If you open it in a text editor that shows line numbers, line 143 is the problematic one. It starts with "consignment is split...", which is in the right-hand column of the second page.

Expected behavior (optional)

Line 143 of 31976R2339_unexpected.txt should start with "cate, on production of movement certificate A.G.I", which is in the left-hand column of the second page. I attach the input source file 31976R2339.pdf as well.

Your configuration (mandatory)

MacOSX Ventura (13.1)
Python 3.8.3
PyMuPDF 1.22.2

Output of print(sys.version, "\n", sys.platform, "\n", fitz.__doc__):

3.8.3 (v3.8.3:6f8c8320e9, May 13 2020, 16:29:34) 
[Clang 6.0 (clang-600.0.57)] 
 darwin 
 
PyMuPDF 1.22.2: Python bindings for the MuPDF 1.22.0 library.
Version date: 2023-04-26 00:00:01.
Built for Python 3.8 on darwin (64-bit).

Answered by JorjMcKie

JorjMcKie · 2023-05-09T14:28:29Z

Maintainer

This is not a bug. "Naive" text extraction, as you are using it, always returns the text in the order in which it is stored on the page.
No attempt is made to make sense of the page's layout or its reading order.
The PDF creator decides when to write which portions of the text. For a multi-column page layout, it may decide to first write all the text of the left column, then that of the right column.
But it is equally possible that each character is written separately, in an arbitrary sequence out of the n! possibilities if there are n characters on the page.
... And the page would still look exactly the same in your PDF viewer!

Your document consists of scanned pages with an OCR layer. Therefore, your OCR software has made the decisions about the sequence of storing the recognized text.

You could use page.get_text(sort=True), which sorts the text by paragraphs (as identified by MuPDF), first by vertical, then by horizontal coordinates.
This may be an improvement.
Are you aware of the other options (words, dict, rawdict, ...) which deliver position info? They may help to build a suitable text sequence.
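As an illustration of the sort=True suggestion, here is a minimal sketch of the extraction loop from the original report with sorting switched on. The helper name extract_text is made up for this example; the only PyMuPDF call it relies on is page.get_text(sort=...). Whether the result matches the desired reading order depends on the document.

```python
def extract_text(pages, sort=True):
    """Concatenate the text of all pages, one labelled section per page.

    `pages` is any iterable of objects with a PyMuPDF-style
    get_text(sort=...) method, for example an open fitz.Document.
    """
    parts = []
    for page_no, page in enumerate(pages):
        # sort=True requests output ordered by vertical, then horizontal position
        page_text = page.get_text(sort=sort)
        parts.append(f"PAGE {page_no}\n\n{page_text}")
    return "\n\n\n".join(parts)


# Usage with PyMuPDF (not executed here):
#   import fitz
#   with fitz.open("31976R2339.pdf") as doc:
#       print(extract_text(doc))
```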

5 replies

kodymoodley May 9, 2023
Author

Thanks for the clarification, @JorjMcKie. I understand your point now: it is indeed not a bug in PyMuPDF but rather an issue with how I expect the text to be sorted. I do think, though, that this could be a good feature to add, because the sequence I would like to extract is not a niche or corner case. It is the standard reading order of Western / English double-column documents.

JorjMcKie May 13, 2023
Maintainer

I understand. However:
If you look at two- or multi-column page layouts, you will see that the page is usually not divided into columns as a whole: there are parts above and below those columns (or even in between) which are not affected by that separation.
So how do we know we have two columns at all? Where are the top and the bottom of this part of the page? Look at this example page from a science magazine: 3 columns ... or not?

In addition, if we think we see three columns of text (to be read column by column), it may as well be a table with three columns, which obviously should be read row by row! See the image above: what are the criteria that let us be so sure we are not in fact looking at two 3-column tables on that page? Here is another example, this time from the Adobe PDF specifications: 2-column table? 2-column page?

So really understanding what we are looking at on a page ultimately requires a human brain. I am aware of AI solutions trying to do this: they need a lot of training before being capable of anything roughly acceptable.
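To make the ambiguity concrete, here is a rough sketch of one naive heuristic (not something PyMuPDF does, and not proposed in this thread): look for a vertical strip that no word box crosses and treat it as a column gutter. The function name find_gutter and the min_gap threshold are assumptions for illustration. The input tuples are shaped like the items of page.get_text("words").

```python
def find_gutter(words, min_gap=15.0):
    """Return (x_start, x_end) of the widest horizontal gap that no word
    box covers, or None if no gap is at least `min_gap` wide.

    `words` holds tuples shaped like page.get_text("words") items:
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    """
    intervals = sorted((w[0], w[2]) for w in words)
    if not intervals:
        return None
    best = None
    covered_to = intervals[0][1]  # right edge of the area covered so far
    for x0, x1 in intervals[1:]:
        gap = x0 - covered_to
        if gap >= min_gap and (best is None or gap > best[1] - best[0]):
            best = (covered_to, x0)
        covered_to = max(covered_to, x1)
    return best
```

A single full-width line, such as a title or a footer, covers the gutter and makes this return None even though the rest of the page is in two columns. A table with a wide gap between its columns is reported as a gutter just the same. Both cases are exactly the kind of misreading described above.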

kodymoodley May 13, 2023
Author

Thanks @JorjMcKie, I think you make some good points with these examples. What I take away from this is that parts of a two-column document can still have more or fewer than two columns (first example), and that some documents contain text in very specific formats, e.g. programming code or tables (second example). I agree that there is no canonical way to parse such information.

However, the core issue with the example document I attached above was the odd behaviour at the transition from one page to the next in a two-column document. I would have expected the parser to start again from the left-hand side of the next page (and not the right). Starting on the right-hand side seems odd for any document. I understand that this was due to the OCR software used and is not a problem with PyMuPDF, but it is odd behaviour nevertheless.

I think the best solution for me is probably to re-OCR. I leave it up to you and others to decide whether it makes sense for PyMuPDF to be modified to deal with this. I guess it comes down to how often PyMuPDF users encounter such issues and what proportion of PDFs in the wild have been OCR'd this way.

Thanks for acknowledging and trying to address my issue, really appreciate your time.

JorjMcKie May 13, 2023
Maintainer

Here is a tracing-type visualization of that page, which wraps every word in a red rectangle and prints the sequence number in which the word occurs in the page's appearance source. You can clearly see that after the top line, the next word is "consignment", word number 14.

Generated by this code:

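import fitz  # PyMuPDF
doc = fitz.open("31976R2339.pdf")
page = doc[1]  # second page (0-based index)
# get_text("words") yields (x0, y0, x1, y1, text, ...) in stored order
for i, w in enumerate(page.get_text("words")):
    bbox = fitz.Rect(w[:4])  # word bounding box
    page.draw_rect(bbox, color=(1, 0, 0), width=0.3)  # red outline
    page.insert_text(bbox.tl, str(i), fontsize=6, color=(1, 0, 0))  # sequence number
doc.save("traced.pdf")  # output file name chosen here for illustration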

I am guessing that this page was put on the scanner in a different orientation than page one, and that the OCR is built into that scanner and therefore started recognizing at its usual place on the page ...

BTW you don't need to re-OCR: simply sort in the way I did it.
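One way such a re-sort could look for a strictly two-column page is sketched below. It works on tuples shaped like the items of page.get_text("words"). The function name two_column_text, the split at the horizontal midpoint of the page, and the y_tol line tolerance are all assumptions made for this example, so full-width headings or tables would need extra handling.

```python
def two_column_text(words, page_width, y_tol=3.0):
    """Join words into lines, left column first, then right column.

    `words` holds tuples shaped like page.get_text("words") items:
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    """
    mid = page_width / 2  # assumed column boundary
    columns = ([], [])
    for w in words:
        x_center = (w[0] + w[2]) / 2
        columns[0 if x_center < mid else 1].append(w)

    lines = []
    for column in columns:
        column.sort(key=lambda w: (w[3], w[0]))  # bottom edge, then left edge
        current, line_y = [], None
        for w in column:
            # a bottom edge further than y_tol from the line's first word starts a new line
            if current and abs(w[3] - line_y) > y_tol:
                lines.append(" ".join(t[4] for t in sorted(current, key=lambda t: t[0])))
                current = []
            if not current:
                line_y = w[3]
            current.append(w)
        if current:
            lines.append(" ".join(t[4] for t in sorted(current, key=lambda t: t[0])))
    return "\n".join(lines)


# Usage with PyMuPDF (not executed here):
#   text = two_column_text(page.get_text("words"), page.rect.width)
```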

Ripe88 Jul 25, 2023

Very detailed and clear explanation.

LurieHR · 2024-10-29T21:58:41Z

Thanks to this whole thread for asking this question as well as the detailed answers about text order. I've been struggling with this also and I really appreciate everybody's insights!

1 reply

JorjMcKie Oct 29, 2024
Maintainer

Introduced with the latest versions, page.get_text(sort=True) will establish "natural" reading sequence in most situations.
There have been great improvements which even lead to establishing (an approximation of) layout fidelity.
Give it a try!
