Benchmarking #955

dhdaines · 2023-08-02T17:17:25Z

In the PyPDFium2 documentation on text extraction we find this comment:

"See this benchmark for a performance and quality comparison with other tools."

I went and looked, and pdfplumber doesn't look so great. I find this sad because I really like pdfplumber and its friendly license and its friendly API and the fact that it doesn't just give me a lump of text and make me guess how it got it, and the fact that it doesn't depend on Java, and so on.

As for speed, well, we already know the cause: it's pdfminer.six. So no big deal, I'm not in a hurry. But what about the "Text Extraction Quality" numbers here?

Has anyone done some error analysis to figure out where pdfplumber is going wrong here? The ground truth texts (of unknown origin) are here: https://github.com/py-pdf/benchmarks/tree/main/read/extraction-ground-truth

cmdlineluser · 2023-08-02T21:44:39Z

The score is the Levenshtein ratio:

https://github.com/py-pdf/benchmarks/blob/ce340e84af84a755be12f824fa49c0240e30425f/pdf_benchmark/score.py#L1
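For readers unfamiliar with the metric, here is a minimal pure-Python sketch of a Levenshtein-based similarity ratio. It assumes one common normalization (1 minus the edit distance divided by the longer length); the benchmark's own score.py may use a library and a different normalization, so absolute numbers need not match. The function names are made up for illustration.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the row we store as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity_ratio(a: str, b: str) -> float:
    """Assumed normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


print(similarity_ratio("kitten", "sitting"))
```

Because the metric compares character sequences position by position, text that contains the right words in a different order can still score poorly.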

I picked the first "low" score:

(PosixPath('pdfplumber/1602.06541.txt'), 0.5866773388981397)

https://arxiv.org/pdf/1602.06541.pdf

The truth text

A Survey of Semantic Segmentation

Martin Thoma
info@martin-thoma.de

Abstract—This survey gives an overview over different
techniques used for pixel-level semantic segmentation.
Metrics and datasets for the evaluation of segmentation
algorithms and traditional approaches for segmentation
such as unsupervised methods, Decision Forests
and SVMs are described and pointers to the relevant
papers are given. Recently published approaches with
convolutional neural networks are mentioned and typical
problematic situations for segmentation algorithms are
examined. A taxonomy of segmentation algorithms is
given.

I. INTRODUCTION

Semantic segmentation is the task of clustering
parts of images together which belong to the same
object class. This type of algorithm has several use-cases
such as detecting road signs [MBLAGJ+07],

pdfplumber.extract_text()

1
A Survey of Semantic Segmentation II. TAXONOMYOFSEGMENTATIONALGORITHMS
Martin Thoma The computer vision community has published a
info@martin-thoma.de wide range of segmentation algorithms so far. Those
algorithms can be grouped by the kind of data they
operate on and the kind of segmentation they are able
Abstract—Thissurveygivesanoverviewoverdifferent
to produce.
techniques used for pixel-level semantic segmentation.
Metrics and datasets for the evaluation of segmenta- The following subsections will give four different
tion algorithms and traditional approaches for segmen- criteria by which segmentation algorithms can be
tation such as unsupervised methods, Decision Forests
classified.
and SVMs are described and pointers to the relevant
papers are given. Recently published approaches with This survey describes fixed-class (see Section II-A),
convolutionalneuralnetworksarementionedandtypical single-class affiliation (see Section II-B) algorithms
6102 problematic situations for segmentation algorithms are whichworkongrayscaleorcoloredsinglepixelimages
examined. A taxonomy of segmentation algorithms is
(see Section II-C) in a completely automated, passive
given.
fashion (see Section II-D).
yaM
I. INTRODUCTION
Semantic segmentation is the task of clustering A. Allowed classes
parts of images together which belong to the same
11 object class. This type of algorithm has several use- Semantic segmentation is a classification task. As
cases such as detecting road signs [MBLAGJ+07], such, the classes on which the algorithm is trained is a
central design decision.

I ran .extract_text(layout=True) manually to check what it does:

                                                                        1         
                                                                                  
            A Survey of Semantic Segmentation II. TAXONOMYOFSEGMENTATIONALGORITHMS
                                                                                  
                     Martin Thoma          The computer vision community has published a
                   info@martin-thoma.de   wide range of segmentation algorithms so far. Those
                                          algorithms can be grouped by the kind of data they
                                          operate on and the kind of segmentation they are able
           Abstract—Thissurveygivesanoverviewoverdifferent                        
                                          to produce.                             
          techniques used for pixel-level semantic segmentation.                  
          Metrics and datasets for the evaluation of segmenta- The following subsections will give four different
          tion algorithms and traditional approaches for segmen- criteria by which segmentation algorithms can be
          tation such as unsupervised methods, Decision Forests                   
                                          classified.                             
          and SVMs are described and pointers to the relevant                     
          papers are given. Recently published approaches with This survey describes fixed-class (see Section II-A),
          convolutionalneuralnetworksarementionedandtypical single-class affiliation (see Section II-B) algorithms
          problematic situations for segmentation algorithms are whichworkongrayscaleorcoloredsinglepixelimages 6102
          examined. A taxonomy of segmentation algorithms is                      
                                          (see Section II-C) in a completely automated, passive
          given.                                                                  
                                          fashion (see Section II-D).             
  yaM                                                                             
                    I. INTRODUCTION                                               
           Semantic segmentation is the task of clustering A. Allowed classes     
          parts of images together which belong to the same                       
                                           Semantic segmentation is a classification task. As object class. This type of algorithm has several use- 11
          cases such as detecting road signs [MBLAGJ+07]

So it seems like for the input

AAAAA BBBBB CCCCC
AAAAA BBBBB CCCCC
AAAAA BBBBB CCCCC

The truth text is

AAAAA
AAAAA
AAAAA

BBBBB
BBBBB
BBBBB

CCCCC
CCCCC
CCCCC

i.e. they want each column extracted separately and the columns stacked vertically.
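To make the expected ordering concrete, here is a rough sketch that stacks columns from a list of word dicts. It assumes dicts with "text", "x0" and "top" keys (the shape I'd expect from pdfplumber's extract_words()) and column boundaries that are already known; stack_columns, boundaries and line_tolerance are hypothetical names, not an existing pdfplumber option.

```python
def stack_columns(words, boundaries, line_tolerance=3.0):
    """Group words into columns by x0, then emit each column top to bottom.

    words: dicts with "text", "x0" and "top" keys.
    boundaries: sorted x positions that separate the columns (assumed known).
    """
    columns = [[] for _ in range(len(boundaries) + 1)]
    for word in words:
        # The column index is the number of boundaries at or left of the word.
        index = sum(word["x0"] >= b for b in boundaries)
        columns[index].append(word)

    blocks = []
    for column in columns:
        lines = []
        for word in sorted(column, key=lambda w: w["top"]):
            # Same line if the word starts close to the line's first word vertically.
            if lines and abs(word["top"] - lines[-1][0]["top"]) <= line_tolerance:
                lines[-1].append(word)
            else:
                lines.append([word])
        text_lines = [
            " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
            for line in lines
        ]
        if text_lines:
            blocks.append("\n".join(text_lines))
    return "\n\n".join(blocks)


# Synthetic 3x3 grid matching the AAAAA BBBBB CCCCC example above.
grid = [
    {"text": text, "x0": x0, "top": top}
    for top in (0, 10, 20)
    for text, x0 in (("AAAAA", 0), ("BBBBB", 100), ("CCCCC", 200))
]
print(stack_columns(grid, boundaries=[50, 150]))
```

Finding the boundaries automatically is the hard part on real pages, and this sketch does not attempt it.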

Update: Looking at the result of pdfium for that file:

1 1^M 
2 A Survey of Semantic Segmentation^M
3 Martin Thoma^M
4 info@martin-thoma.de^M
5 Abstract—This survey gives an overview over different^M
6 techniques used for pixel-level semantic segmentation.^M

~~Perhaps a more accurate description is that pdfplumber doesn't move to a new line when it sees \r\n?~~
Silly me, pdfium is just using \r\n line-endings for its output.

5 replies

jsvine Aug 3, 2023
Maintainer

Thank you for digging into that, @cmdlineluser. I'm open to disagreement on this matter, but for me this points to a difference in goals between pdfplumber and the other tools. The goal of .extract_text(...) is to represent the text as it is on the page, rather than how a human expects to read it. That said, I think adding something like .extract_text(reading_order=True) could be a neat feature.

dhdaines Aug 3, 2023
Author

Actually I wonder if simply calling extract_text(use_text_flow=True) will improve the benchmark results.
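As I understand it, use_text_flow keeps characters in the order they appear in the PDF's content stream instead of presorting them by position. The toy below illustrates only that difference, with made-up coordinates and hypothetical helper names; it does not call pdfplumber, and whether stream order matches reading order depends on how a given PDF was produced.

```python
# Toy words for a two-column page, listed in content-stream order:
# the whole left column first, then the right column.
# Each tuple is (text, x0, top); the values are invented for illustration.
STREAM_ORDER = [
    ("left-1", 0, 0), ("left-2", 0, 10),
    ("right-1", 100, 0), ("right-2", 100, 10),
]


def positional_order(words):
    """Presort by vertical, then horizontal position (row by row)."""
    return [text for text, _x0, _top in sorted(words, key=lambda w: (w[2], w[1]))]


def text_flow_order(words):
    """Keep the order in which the words appear in the stream."""
    return [text for text, _x0, _top in words]


print(positional_order(STREAM_ORDER))  # rows interleave the two columns
print(text_flow_order(STREAM_ORDER))   # left column, then right column
```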

dhdaines Aug 3, 2023
Author

But also - I agree, this is the reason I use pdfplumber, because I don't just want an undifferentiated blob of text in reading order. That is what people who are ahem training LLMs want, so it's the use case of other libraries. For doing more sophisticated information extraction it is not particularly useful except for pre-training or fine-tuning word embeddings.

dhdaines Aug 3, 2023
Author

Note also that for the other big use case of PDF text extraction, namely search engine indexing, it doesn't particularly matter what order the words are in (actually this is also somewhat true for BERT type language models, amazingly enough), though it seems that pdfplumber's word breaking could be improved a bit from the results above.

cmdlineluser Aug 3, 2023

It's somewhat of an "unfair comparison" as pdfplumber is more about retaining the original layout as you say.

"I think adding something like .extract_text(reading_order=True) could be a neat feature."

Yeah, my first thought was perhaps an .extract_text(strategy="columns") option similar to the table functions.

Not sure if there are any other types of "strategy" for this type of thing though, so perhaps an explicit option is a better choice.

jsvine (Maintainer) · 2023-08-03T14:06:11Z

Thank you @dhdaines for opening this discussion. As you point out, pdfplumber's relative slowness is, indeed, largely a product of depending on Python (i.e., pdfminer.six) for parsing. And, likewise, this slowness frustrates me. It's one of the main reasons (along with spec-completeness / attribute-accessibility) that I've been considering moving the parsing layer from pdfminer.six to pypdfium2 (or similar).

Re. the quality metrics, see my note here. I'm certainly open to finding ways to improve the extraction, but some of it seems to be more a matter of expectations than of accuracy.

2 replies

dhdaines Aug 3, 2023
Author

My perception of the pdfminer.six code is that it probably has a fair amount of room for optimization. Pure-Python libraries are obviously always slower than (well-written) C/C++/Rust (don't say J*va), but they don't have to be that much slower; see Whoosh, for instance.

dhdaines Aug 5, 2023
Author

Yes, pdfminer.six is not really designed with speed in mind. The profiler output shows some interesting things like:

 93891385   12.828    0.000   12.828    0.000 {built-in method builtins.isinstance}
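The columns in that line (ncalls, tottime, percall, cumtime, percall, function) are the standard cProfile/pstats report format. For anyone who wants to reproduce this kind of measurement, here is a small self-contained sketch; count_ints is a made-up toy workload, not pdfminer.six code, and profile_report is a hypothetical helper name.

```python
import cProfile
import io
import pstats


def count_ints(values):
    """Toy workload dominated by isinstance calls."""
    return sum(1 for v in values if isinstance(v, int))


def profile_report(func, *args, limit=10):
    """Run func under cProfile and return the report sorted by own time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("tottime").print_stats(limit)
    return buffer.getvalue()


report = profile_report(count_ints, [1, "a", 2.0] * 100_000)
print(report)
```

Sorting by tottime surfaces cheap functions that are called very often, which is the pattern the isinstance line shows.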

In general the extensive use of object as data (rather than dataclass or just dict) is probably slowing things down. The pdfminer.six parser, which actually isn't a parser but just a lexical analyzer, also seems to be quite inefficient.

Actually there seems to be quite a bit of overhead in pdfplumber itself, particularly in process_object which, to be fair, is a victim of pdfminer.six and its JavaScript-y way of storing data. But also, perhaps, resolve_all could be rewritten without recursion, which should make it more efficient. I'll take a quick shot at a few things next time I look at this stuff.
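To show what "without recursion" could look like, here is a generic sketch that resolves nested containers with an explicit stack. It is not pdfplumber's resolve_all: Ref is a hypothetical stand-in for an indirect object reference, resolve_nested is a made-up name, and the sketch does not guard against reference cycles. Whether it is actually faster would need measuring; the certain benefit is that depth is no longer bounded by Python's recursion limit.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Ref:
    """Hypothetical stand-in for an indirect object reference."""
    target: Any


def unwrap(value):
    """Follow a chain of references until a concrete value is reached."""
    while isinstance(value, Ref):
        value = value.target
    return value


def resolve_nested(root):
    """Return a copy of root with every Ref replaced by its target.

    Uses an explicit stack instead of recursion. Cycles are NOT detected.
    """
    root = unwrap(root)
    if not isinstance(root, (dict, list)):
        return root
    resolved_root = {} if isinstance(root, dict) else []
    stack = [(root, resolved_root)]
    while stack:
        source, dest = stack.pop()
        items = source.items() if isinstance(source, dict) else enumerate(source)
        for key, value in items:
            value = unwrap(value)
            if isinstance(value, dict):
                child = {}
                stack.append((value, child))  # filled in on a later iteration
            elif isinstance(value, list):
                child = []
                stack.append((value, child))
            else:
                child = value
            if isinstance(dest, dict):
                dest[key] = child
            else:
                dest.append(child)  # list order is preserved
    return resolved_root


nested = {"a": Ref([1, Ref({"b": Ref(2)})]), "c": [Ref(Ref(3))]}
print(resolve_nested(nested))
```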

dhdaines (Author) · 2023-09-06T20:21:49Z

The main "accuracy" problem here (of use_text_flow not being respected by extract_text) is fixed in #983

Updated results can be seen at https://github.com/dhdaines/benchmarks

0 replies
