are the default config for extracting text/tables the best ones? #802

sergenti · 2023-02-01T23:19:44Z

If I had to use only one `table_settings` configuration, what should I use?
I'm working on a SaaS project and can't manually change values depending on the context.

I'm dealing with the following types of documents:

scientific papers
legal documents
standard financial documents (like 10-K, 10-Q, S-1, etc.)
other confidential financial documents (PE internal reports, transcripts, memorandums)
etc

Any ideas?

Right now, for these types of documents, it seems that plain pdfminer.six works best for text extraction, and that another library, tabula-py, works best for extracting tables.

Maybe I am missing something, since this library seems so well written.

jsvine (Maintainer) · 2023-02-01T23:31:49Z

Hi @fylls, thanks for your interest and for the kind words. PDFs come in such a range of layouts, it's hard to say what the best settings would be. The default settings are supposed to be useful for tables that have clear, visible cell borders. But not all tables do have that.
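One way to act on this without per-document tuning is to check whether a page actually contains drawn rulings before choosing a strategy. The following is only a minimal sketch: `pick_table_settings`, `extract_tables_adaptive`, and the `min_rulings` threshold are hypothetical names and values chosen for illustration, not part of pdfplumber. The `"lines"` and `"text"` strategies and the `page.lines` / `page.rects` properties are pdfplumber's own.

```python
def pick_table_settings(n_lines, n_rects, min_rulings=4):
    """Choose table_settings from the number of drawn lines and rectangles.

    min_rulings is an assumed cutoff, not a pdfplumber default; tune it on
    your own documents.
    """
    if n_lines + n_rects >= min_rulings:
        # Visible borders are likely present: use the line-based strategy.
        return {"vertical_strategy": "lines", "horizontal_strategy": "lines"}
    # Few or no rulings: infer cell boundaries from text alignment instead.
    return {"vertical_strategy": "text", "horizontal_strategy": "text"}


def extract_tables_adaptive(pdf_path):
    """Extract tables from every page, picking settings page by page."""
    import pdfplumber  # imported here so the heuristic above has no dependency

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            settings = pick_table_settings(len(page.lines), len(page.rects))
            tables.extend(page.extract_tables(table_settings=settings))
    return tables
```

The text-based strategy can produce false positives on pages that are mostly prose, so it is probably worth validating the output (for example, rejecting tables with a single column) before trusting it.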

As for text extraction, the same principle applies, but I think you have more flexibility in dynamically determining the best settings by (for example) examining character size, spacing, et cetera.
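As a rough illustration of that idea, the tolerances passed to `extract_text` could be scaled to the typical character size on a page. This is a sketch under assumptions: `suggest_text_tolerances` and the `ratio` of 0.25 are hypothetical and not something pdfplumber provides. It relies only on the `size` key of the dictionaries in `page.chars` and on the `x_tolerance` / `y_tolerance` parameters of `extract_text`.

```python
from statistics import median

DEFAULT_TOLERANCE = 3  # pdfplumber's default for x_tolerance and y_tolerance


def suggest_text_tolerances(chars, ratio=0.25):
    """Scale text-extraction tolerances to the median character size.

    chars is a list of dicts shaped like pdfplumber's page.chars entries.
    ratio is an assumed scaling factor; adjust it for your documents.
    """
    sizes = [c["size"] for c in chars if c.get("size")]
    if not sizes:
        # No usable characters on the page: fall back to the defaults.
        return {"x_tolerance": DEFAULT_TOLERANCE, "y_tolerance": DEFAULT_TOLERANCE}
    tolerance = median(sizes) * ratio
    return {"x_tolerance": tolerance, "y_tolerance": tolerance}
```

With a pdfplumber page object this would be used as `page.extract_text(**suggest_text_tolerances(page.chars))`.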

Open to other thoughts / suggestions on this, though!
