PDF with multiple column doesn't extract text properly #508

Open
nus-kingsley opened this issue Mar 21, 2023 · 2 comments

nus-kingsley commented Mar 21, 2023

PDF with multiple columns doesn’t extract text properly
When I try to extract text from a PDF with a two-column layout, the text is read row by row across both columns instead of column by column.
For example, I have a PDF like this:

image

The reader extracts the text like this:
image

How can I configure it to extract text column by column?

Code snippet

require "pdf-reader"

reader = PDF::Reader.new("somefile.pdf")
reader.pages.last.text

danielfriis commented Apr 25, 2024

You can have GPT-3.5 clean it up and format it properly.


GMolini commented May 29, 2024

I had the same problem. This function extracts the text of each column and then concatenates the columns into a single string. It probably won't work in all cases (especially if there are runs of multiple spaces between words inside the same column), but you can try it and see if it works for you. In my case it was a two-column PDF, and the text seems to be extracted mostly fine.

require "open-uri"
require "pdf-reader"

def parse_pdf_columns(pdf_url, numcols = 2)
  io = URI.open(pdf_url)

  reader = PDF::Reader.new(io)

  parsed_text = ""

  reader.pages.each_with_index do |page, pIndex|
    columns_text = Array.new(numcols) { "" }
    lines_in_columns = []

    p "Processing page #{pIndex}"
    page.text.split("\n").each do |line|
      # First remove up to 50 leading spaces, otherwise we might think a padded heading belongs to another column,
      # then split by 5 or more spaces
      lines_in_columns << line.sub(/^\s{0,50}/, '').split(/\s{5,}/)
    end
    lines_in_columns.each do |line|
      (0..numcols-1).each do |colIndex|
        # Plain-Ruby check instead of Rails' present?, so this also runs outside Rails
        cell = line[colIndex]
        next if cell.nil? || cell.strip.empty?

        columns_text[colIndex] += cell + "\n"
      end
    end

    parsed_text += columns_text.join("\n")
  end
  return parsed_text
end
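
To try the heuristic without a PDF at hand, here is a minimal sketch that applies the same two steps (strip up to 50 leading spaces, then split on runs of 5 or more spaces) to a plain string. The helper name `split_text_columns` and the sample text are made up for illustration and are not part of pdf-reader; the thresholds are the same guesses as above and may need tuning for a given document.

```ruby
# Hypothetical helper (not part of pdf-reader): the same heuristic as
# parse_pdf_columns, applied to text that has already been extracted.
def split_text_columns(page_text, numcols = 2)
  columns = Array.new(numcols) { [] }

  page_text.each_line(chomp: true) do |line|
    # Drop up to 50 leading spaces, then treat runs of 5+ spaces as a column gap.
    cells = line.sub(/^\s{0,50}/, "").split(/\s{5,}/)

    cells.first(numcols).each_with_index do |cell, col_index|
      # Skip empty cells, e.g. when only the right-hand column has text on this line.
      columns[col_index] << cell unless cell.strip.empty?
    end
  end

  # All lines of column 1 first, then all lines of column 2, and so on.
  columns.map { |lines| lines.join("\n") }.join("\n")
end

# Made-up sample that mimics what page.text returns for a two-column page.
sample = <<~TEXT
  Left line one          Right line one
  Left line two          Right line two
TEXT

puts split_text_columns(sample)
# Left line one
# Left line two
# Right line one
# Right line two
```

A line that only has right-hand text keeps its place in the second column as long as it is indented by at least 55 spaces (50 stripped plus a 5-space gap); with less indentation it is treated as first-column text, which is one of the cases where this approach can go wrong.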
