Replies: 1 comment
-
Hi @fylls, thanks for your interest and for the kind words. PDFs come in such a range of layouts that it's hard to say what the best settings would be. The default settings are meant to work for tables that have clear, visible cell borders, but not all tables have those. As for text extraction, the same principle applies, but I think you have more flexibility in dynamically determining the best settings by (for example) examining character size, spacing, et cetera. Open to other thoughts / suggestions on this, though!
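To make the "determine settings dynamically" idea concrete, here is a minimal sketch. It assumes the library under discussion is pdfplumber. The helper names, the `MIN_RULINGS` threshold, and the font-size ratio are all hypothetical choices for illustration, not recommended values. The idea is to use the line-based table strategy only when a page actually contains drawn lines or rectangles, and to scale the text tolerance with the median character size.

```python
from statistics import median

# Hypothetical threshold: pages with fewer drawn lines/rects than this
# are treated as having no visible cell borders.
MIN_RULINGS = 4


def choose_table_settings(page, min_rulings=MIN_RULINGS):
    """Pick a table-finding strategy from the page's drawn lines and rectangles.

    `page` is any object exposing `.lines` and `.rects` lists, as a
    pdfplumber page does.
    """
    rulings = len(page.lines) + len(page.rects)
    strategy = "lines" if rulings >= min_rulings else "text"
    return {"vertical_strategy": strategy, "horizontal_strategy": strategy}


def choose_x_tolerance(chars, ratio=0.25, default=3.0):
    """Scale the horizontal tolerance with the median font size (assumed ratio)."""
    sizes = [c["size"] for c in chars if c.get("size")]
    return median(sizes) * ratio if sizes else default


def extract(path):
    """Apply the heuristics above to every page of a PDF."""
    import pdfplumber  # imported lazily so the helpers work without it

    results = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            tables = page.extract_tables(table_settings=choose_table_settings(page))
            text = page.extract_text(x_tolerance=choose_x_tolerance(page.chars))
            results.append((tables, text))
    return results
```

Because the decision is made per page, a single code path can handle both bordered and borderless tables without manual tuning. The thresholds would still need checking against a sample of real documents.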
-
If I had to use only one table_settings configuration, what should I use?
I'm working on a SaaS project and can't manually change values depending on the context.
I'm dealing with the following types of documents:
Some ideas?
Right now, for these types of documents, it seems that plain pdfminer.six works best for text extraction, and that another library called tabula-py works best for extracting tables. Maybe I am missing something; this library seems so well written.