Running tests with many PDF documents #1042
MartinThoma announced in Announcements
-
@pubpub-zz / @MasterOdin that one might be interesting to you :-)
-
If you want to run tests with many (hundreds or thousands of) PDF documents, you can use https://github.com/py-pdf/pdf-crawler to build such a test dataset.
I'm currently considering whether we should add a list of URLs from which the PDFs can be downloaded. The locations might change, but at least we would have a starting point for building a private test dataset.
Please keep in mind: the fact that those PDFs are publicly accessible does not mean you are allowed to share them in any way. Running the crawler also puts load on the crawled website, so I would not parallelize it too much. Additionally, we might want to look at `robots.txt` files.
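To make the points about `robots.txt` and low load concrete, here is a minimal sketch using only the Python standard library. It is not part of pdf-crawler. The function names, the `pdf-test-dataset` user agent, and the one-second delay are all assumptions made for illustration. The `robots.txt` text is passed in rather than fetched, so the example runs without network access.

```python
import time
from urllib import robotparser
from urllib.parse import urlsplit


def allowed_urls(urls, robots_by_host, user_agent="pdf-test-dataset"):
    """Keep only URLs that the host's robots.txt permits for user_agent.

    robots_by_host maps a host name to the text of its robots.txt
    (fetched beforehand). Hosts without an entry are skipped, which is
    a conservative choice made for this sketch.
    """
    parsers = {}
    kept = []
    for url in urls:
        host = urlsplit(url).netloc
        if host not in robots_by_host:
            continue
        if host not in parsers:
            parser = robotparser.RobotFileParser()
            parser.parse(robots_by_host[host].splitlines())
            parsers[host] = parser
        if parsers[host].can_fetch(user_agent, url):
            kept.append(url)
    return kept


def download_sequentially(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for one URL at a time, pausing between requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # no parallelism; one request per interval
        results[url] = fetch(url)
    return results
```

Here `fetch` is any callable that takes a URL and returns the downloaded bytes, for example a thin wrapper around `urllib.request.urlopen`. Keeping it as a parameter lets the throttling logic be tested without touching a real website.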