Running tests with many PDF documents #1042

MartinThoma (Maintainer) · 2022-06-29T17:03:05Z

If you want to run tests with many (hundreds or thousands of) PDF documents, you can use https://github.com/py-pdf/pdf-crawler to build such a test dataset.
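To illustrate what running tests over such a dataset could look like, here is a minimal sketch. It is not part of pdf-crawler; the function name `run_on_corpus` and the `check` callable are hypothetical, and the sketch assumes the downloaded files sit in a local directory and end in `.pdf`.

```python
from pathlib import Path
from typing import Callable, Dict


def run_on_corpus(
    corpus_dir: Path, check: Callable[[Path], None]
) -> Dict[Path, Exception]:
    """Run `check` on every PDF below `corpus_dir` and collect the failures.

    `check` is a hypothetical callable supplied by the caller; it should
    raise an exception when a document cannot be processed.
    """
    failures: Dict[Path, Exception] = {}
    for pdf_path in sorted(corpus_dir.rglob("*.pdf")):
        try:
            check(pdf_path)
        except Exception as exc:
            # Keep going so that one broken file does not stop the whole run.
            failures[pdf_path] = exc
    return failures
```

Here `check` could, for example, open the file with the library under test and extract the text of every page. The returned dictionary then tells you which documents need a closer look.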

I'm currently wondering whether we should add a list of URLs where the PDFs can be downloaded. The locations might change, but at least we would have a starting point for building a private test dataset.

Please be reminded: Just because those are public, you are not allowed to share them in any way. Please also keep in mind that running the crawler puts load on the crawled website, so I would not parallelize it too much. Additionally, we might want to check the robots.txt files of the crawled sites.
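As a rough sketch of what "low load" and "look at robots.txt" could mean in practice: the snippet below is not how pdf-crawler works. It assumes all URLs belong to one host whose robots.txt text has already been downloaded. The names `allowed_by_robots` and `fetch_politely`, the user agent string, and the injected `fetch` callable are all hypothetical.

```python
import time
from typing import Callable, Dict, Iterable
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against the text of an already downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def fetch_politely(
    urls: Iterable[str],
    robots_txt: str,
    fetch: Callable[[str], bytes],
    user_agent: str = "pdf-test-crawler",  # hypothetical user agent name
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bytes]:
    """Download sequentially, skip disallowed URLs, pause between requests."""
    results: Dict[str, bytes] = {}
    for url in urls:
        if not allowed_by_robots(robots_txt, user_agent, url):
            continue  # the site asked crawlers not to fetch this path
        if results:
            sleep(delay_seconds)  # no parallelism, one request at a time
        results[url] = fetch(url)
    return results
```

Passing `fetch` and `sleep` as arguments keeps the sketch testable without touching the network. A real run would plug in an actual download function.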

MartinThoma (Maintainer, Author) · 2022-06-29T17:03:21Z

@pubpub-zz / @MasterOdin this one might be interesting to you :-)

1 reply

MasterOdin Jun 29, 2022
Collaborator

> Please be reminded: Just because those are public, you are not allowed to share them in any way.

So the apache corpus README that the crawler goes through links to https://digitalcorpora.org/corpora/files which states that:

> For these reasons, we have created and released a corpus of 1 million documents that are freely available for research and may be (to the best of our knowledge) freely redistributed.

It sounds like, so long as you have a README somewhere reasonable alongside the downloaded files with a link to where the files came from and the citation ("Garfinkel, Farrell, Roussev and Dinolt, Bringing Science to Digital Forensics with Standardized Forensic Corpora, DFRWS 2009, Montreal, Canada"), it would be fine to have these files live within this repo (or the crawler repo), as well as to share them more generally.
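A minimal sketch of how such a README could be generated next to the downloaded files. The helper name `write_attribution_readme`, the file name `README.txt`, and the wording of the generated text are assumptions for illustration. Whether this is sufficient attribution would still need to be confirmed against the corpus terms.

```python
from pathlib import Path

SOURCE_URL = "https://digitalcorpora.org/corpora/files"
CITATION = (
    "Garfinkel, Farrell, Roussev and Dinolt, Bringing Science to Digital "
    "Forensics with Standardized Forensic Corpora, DFRWS 2009, Montreal, Canada"
)


def write_attribution_readme(corpus_dir: Path) -> Path:
    """Write a README with the source link and citation into `corpus_dir`."""
    readme = corpus_dir / "README.txt"  # hypothetical file name
    readme.write_text(
        "Source of the files in this directory:\n"
        f"{SOURCE_URL}\n\n"
        "Citation:\n"
        f"{CITATION}\n",
        encoding="utf-8",
    )
    return readme
```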
