-
The score is the Levenshtein ratio. I picked the first "low" score:
(PosixPath('pdfplumber/1602.06541.txt'), 0.5866773388981397)
https://arxiv.org/pdf/1602.06541.pdf
The truth text
I ran
So it seems like for the input
The truth text is
i.e. they want each column extracted separately and stacked vertically.
Update: Looking at the result of pdfium for that file:
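A toy illustration of the reading-order effect, as a minimal sketch only. The two-column page content is invented, and `levenshtein_ratio` is a hypothetical helper that normalises plain edit distance by the longer string's length. The benchmark may well compute its ratio differently (some libraries weight substitutions more heavily), so the numbers here are not comparable to the score quoted above.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic edit distance: insert, delete and substitute all cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """Assumed definition: 1 - distance / length of the longer string."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / max(len(a), len(b))


# Invented two-column page: each tuple is (left column line, right column line).
rows = [
    ("alpha beta", "one two"),
    ("gamma delta", "three four"),
    ("epsilon zeta", "five six"),
]

# Ground-truth style: each column read top to bottom, then stacked vertically.
column_wise = "\n".join([left for left, _ in rows] + [right for _, right in rows])
# Layout-preserving style: each line read straight across the page.
row_wise = "\n".join(f"{left} {right}" for left, right in rows)

score_same = levenshtein_ratio(column_wise, column_wise)
score_reordered = levenshtein_ratio(column_wise, row_wise)
print(score_same, score_reordered)
```

Both strings contain exactly the same words; only the order differs, yet the second score drops below 1. That is the sense in which a low ratio can reflect a different expectation about reading order rather than wrongly extracted characters.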
-
Thank you @dhdaines for opening this discussion. As you point out, Re. the quality metrics, see my note here. I'm certainly open to finding ways to improve the extraction, but some of this seems to be more a matter of expectations than of accuracy.
-
The main "accuracy" problem here (of
Updated results can be seen at https://github.com/dhdaines/benchmarks
-
In the PyPDFium2 documentation on text extraction we find this comment:
See this benchmark for a performance and quality comparison with other tools.
I went and looked, and `pdfplumber` doesn't look so great. I find this sad because I really like `pdfplumber` and its friendly license and its friendly API and the fact that it doesn't just give me a lump of text and make me guess how it got it, and the fact that it doesn't depend on Java, and so on.

For speed, well, we know that already, it's because of `pdfminer.six`. So no big deal, I'm not in a hurry. But what of the "Text Extraction Quality" numbers here?

Has anyone done some error analysis to figure out where `pdfplumber` is going wrong here? The ground truth texts (of unknown origin) are here: https://github.com/py-pdf/benchmarks/tree/main/read/extraction-ground-truth
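One possible starting point for that kind of error analysis, sketched with the standard library only. The helper name `largest_mismatches` and the two sample strings are invented for illustration and do not come from the benchmark repository; in practice the inputs would be a ground-truth file and the matching extracted text.

```python
from difflib import SequenceMatcher


def largest_mismatches(truth: str, extracted: str, top: int = 3):
    """Return the `top` largest non-matching spans between two texts.

    Each item is (opcode tag, text expected from truth, text actually extracted).
    """
    # autojunk=False stops difflib from treating frequent characters as junk.
    matcher = SequenceMatcher(None, truth, extracted, autojunk=False)
    diffs = [
        (tag, truth[i1:i2], extracted[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    # Biggest differences first, measured by the longer side of each span.
    diffs.sort(key=lambda d: max(len(d[1]), len(d[2])), reverse=True)
    return diffs[:top]


# Invented sample texts that differ only in how a line break was rendered.
truth = "Deep learning for text.\nSecond paragraph here."
extracted = "Deep learning for text. Second paragraph here."

for tag, expected, got in largest_mismatches(truth, extracted):
    print(f"{tag}: expected {expected!r}, got {got!r}")
```

Sorting by span size tends to surface structural differences (reordered columns, dropped headers) ahead of single-character noise such as whitespace, which helps separate layout expectations from real extraction errors.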