Blank pages in pdf lead to the wrong number of pages #72

Open
niuzaisheng opened this issue Feb 26, 2024 · 2 comments

Comments

@niuzaisheng

I was processing a document that triggered this error in papermage/rasterizers/rasterizer.py:

raise ValueError(f"Failed to attach. {len(images)} images != {len(pages)} pages in doc.")

After some debugging, I found the cause: my PDF contains a blank page. The code below, in papermage/parsers/pdfplumber_parser.py, builds the page list by grouping the tokens by page ID. A blank page has no tokens, so it never appears as a group and is skipped. As a result, the page_annos list ends up with fewer page objects than the PDF actually has pages.

        for page_id, tups in itertools.groupby(iterable=tokens_with_group_ids, key=lambda tup: tup[2]):
            page_tokens = [token for token, _, _ in tups]
            page_w, page_h, page_unit = dims[page_id]
            page = Entity(
                spans=[
                    Span(
                        start=page_tokens[0].spans[0].start,
                        end=page_tokens[-1].spans[0].end,
                    )
                ],
                boxes=[Box.create_enclosing_box(boxes=[box for t in page_tokens for box in t.boxes])],
                metadata=Metadata(width=page_w, height=page_h, user_unit=page_unit),
            )
            page_annos.append(page)

Some further modifications may be needed here to deal with this rare case. Thank you.
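
One possible direction, as a minimal sketch rather than papermage's actual code: iterate over every page known from the page dimensions instead of only the pages that have tokens, and emit an empty page entry when a page has none. Everything here is a hypothetical stand-in: `PageSketch` replaces the real `Entity`, tokens are plain `(start, end, page_id)` tuples, and `dims` is assumed to hold one `(width, height)` entry per PDF page, including blank ones.

```python
import itertools
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PageSketch:
    """Hypothetical stand-in for papermage's page Entity."""
    page_id: int
    span: Optional[Tuple[int, int]]  # (start, end) offsets; None for a blank page
    width: float
    height: float


def build_pages(
    tokens_with_page_ids: List[Tuple[int, int, int]],
    dims: List[Tuple[float, float]],
) -> List[PageSketch]:
    """Build one page entry per PDF page, including pages without tokens.

    tokens_with_page_ids: (start, end, page_id) tuples, sorted by page_id.
    dims: assumed to contain one (width, height) per page in the PDF.
    """
    # Group tokens by page ID, as the original loop does.
    tokens_by_page = {
        page_id: list(group)
        for page_id, group in itertools.groupby(
            tokens_with_page_ids, key=lambda tup: tup[2]
        )
    }

    pages = []
    # Walk over every page in dims, not only the pages that have tokens.
    for page_id, (width, height) in enumerate(dims):
        page_tokens = tokens_by_page.get(page_id, [])
        if page_tokens:
            span = (page_tokens[0][0], page_tokens[-1][1])
        else:
            span = None  # blank page: no tokens, so no text span
        pages.append(PageSketch(page_id, span, width, height))
    return pages


# Three-page document where the middle page (page_id 1) is blank.
tokens = [(0, 5, 0), (6, 11, 0), (12, 20, 2)]
dims = [(612.0, 792.0)] * 3
pages = build_pages(tokens, dims)
```

With this approach the number of page entries always equals `len(dims)`, so a length check like the one in the rasterizer would no longer fail on a blank page. How an empty page should be represented in the real `Entity` (empty spans, a full-page box, and so on) is a design question this sketch does not settle.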

@kyleclo
Collaborator

kyleclo commented Mar 18, 2024

interesting, can you email me a sample file to test this out on @niuzaisheng?

@freyam13

freyam13 commented Sep 4, 2024

+1 I have the same issue.
