Width in run elements is too small/col_count too high #536

Open
mmatotan opened this issue Mar 25, 2024 · 0 comments

I am not sure if this is a problem with the PDF itself, but when mean_character_width is mapped from @runs in initialize of lib/pdf/reader/page_layout.rb, the width of some runs is extremely low (less than 1e-15). Taking the median of those values then gives an abnormally high number of columns.
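
To illustrate why a few near-zero widths matter, here is a minimal sketch. It assumes the column count comes from dividing the page width by the median glyph width. The helper names (median, col_count_for), the 612 pt page width and the sample widths are made up for illustration and are not copied from page_layout.rb.

```ruby
# Hypothetical helper: median of a list of numbers
# (takes the upper middle element for even-sized lists).
def median(values)
  sorted = values.sort
  sorted[sorted.size / 2]
end

# Hypothetical helper: number of columns if each column is one median glyph wide.
def col_count_for(page_width, median_glyph_width)
  (page_width / median_glyph_width).floor
end

page_width = 612.0 # US Letter width in points, assumed for this example

# Made-up sample data: a typical page, and a page where most runs
# report a width close to zero.
normal_widths = [4.8, 5.1, 5.3, 5.6, 6.0]
tiny_widths   = [1.0e-16, 2.0e-16, 3.0e-16, 5.2, 5.5]

normal_cols = col_count_for(page_width, median(normal_widths))
huge_cols   = col_count_for(page_width, median(tiny_widths))

puts normal_cols # => 115
puts huge_cols   # an integer on the order of 10**18
```

With only three of five widths near zero, the median itself is near zero, so the quotient explodes even though some runs have sensible widths.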

I've made a workaround in this fork and it now works for those PDFs:
kodius@6b232e9

The PDF that is causing these issues for me is this one:
dorset.pdf

Specifically pages 31 and 39-50, i.e. the ones that are mostly blank or contain images.

This is how to reproduce it:

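require "pdf/reader"
require "stringio"

# Read the file in binary mode so the PDF bytes are not transcoded.
data = File.binread("dorset.pdf")

PDF::Reader.open(StringIO.new(data)) do |reader|
  reader.pages.each_with_index do |page, index|
    pp "page #{index + 1}"
    pp page.text
  end
end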

It should break at page 31 without any error message. Debugging further shows that it actually fails to allocate memory, because to_s in page_layout.rb (the same as the .text method) does the following and col_count is simply too high:

      page = row_count.times.map { |i| " " * col_count }
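
For a sense of scale, and as one possible guard, the sketch below estimates how large that blank canvas becomes and clamps the glyph width to a lower bound before deriving the column count. The helper names, the min_width default of 0.5 pt and the 612 pt page width are assumptions for illustration. This is not necessarily what the fork linked above does, and it is not the library's code.

```ruby
# Hypothetical helper: derive a column count while ignoring implausibly
# small glyph widths. min_width is an assumed lower bound in points.
def safe_col_count(page_width, median_glyph_width, min_width: 0.5)
  width = [median_glyph_width, min_width].max
  (page_width / width).floor
end

# Rough size of the blank canvas: one byte per space character,
# col_count spaces in each of row_count rows.
def canvas_bytes(row_count, col_count)
  row_count * col_count
end

page_width = 612.0   # assumed page width in points
bad_width  = 1.0e-15 # the kind of width reported in this issue

unguarded = (page_width / bad_width).floor
guarded   = safe_col_count(page_width, bad_width)

puts canvas_bytes(50, unguarded) # on the order of 10**19 bytes for 50 rows
puts canvas_bytes(50, guarded)   # => 61200
```

Clamping is only one option. Dropping runs whose width is effectively zero before taking the median would be another, and which one is appropriate depends on what those runs represent in the PDF.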