PDF with multiple columns doesn't extract text properly #508
Comments
You can have GPT-3.5 clean it up and format it properly.
I had the same problem, and this function extracts the text from the columns and merges it into one. It probably won't work in all cases (especially if there are multiple spaces between words inside the same column), but you can try it and see if it works for you. In my case it was a two-column PDF, and it seems to extract the text mostly fine.
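The function itself did not survive in this thread, so here is only a minimal sketch of the idea described above, not the commenter's code. It assumes the extracted text keeps the page layout, so that columns on the same row are separated by a run of several spaces. The name `merge_columns` and the `min_gap` parameter are hypothetical.

```python
import re


def merge_columns(text, min_gap=3):
    """Reorder row-by-row text into column-by-column text.

    Assumption: each line is one row of the page, and columns on that row
    are separated by at least `min_gap` consecutive spaces.
    """
    gap = re.compile(r" {%d,}" % min_gap)
    columns = []  # columns[i] collects the cells found in the i-th column
    for line in text.splitlines():
        # A row that starts with a wide gap yields an empty first cell,
        # which keeps right-column-only rows in the right column.
        cells = gap.split(line.rstrip())
        for index, cell in enumerate(cells):
            while len(columns) <= index:
                columns.append([])
            if cell.strip():
                columns[index].append(cell.strip())
    # Emit the whole first column, then the whole second column, and so on
    return "\n".join("\n".join(column) for column in columns if column)


rows = "Left one      Right one\nLeft two      Right two"
print(merge_columns(rows))
```

As noted above, this breaks when a single column contains wide gaps of its own, because those gaps are mistaken for column boundaries. Raising `min_gap` can help in some documents.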
PDF with multiple columns doesn’t extract text properly
When I try to extract text from a PDF with a two-column layout, the text is read row by row.
For example, I have a PDF like so:
The reader will extract the text like so:
How can I configure it to extract text column by column?
Code snippet