FAQs
- Search repository or publisher sites for scholarly articles (a minimal run is sketched after this list).
- Iteratively improve queries using dictionaries and previous searches.
- Provide a unified system that covers many different sites.
- Integrate with downstream content-mining and analysis.
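As a minimal sketch of a single search-and-download run, driven from Python: the flag names (`-q` query, `-k` hit limit, `-o` output directory, `-x` save XML) follow the examples in the pygetpapers README, but treat them as assumptions and confirm with `pygetpapers --help` for your installed version.

```python
import subprocess

# Minimal single run of pygetpapers driven from Python.
# Flag names (-q, -k, -o, -x) follow the README examples; confirm
# with `pygetpapers --help` for your installed version.
subprocess.run(
    [
        "pygetpapers",
        "-q", "terpene",      # free-text query
        "-k", "10",           # maximum number of hits to download
        "-o", "terpene_10",   # output directory on your machine
        "-x",                 # also save fulltext XML where available
    ],
    check=True,
)
```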
- pygetpapers is modular and designed for RESTful APIs. It has modules for EuropePMC (EPMC) (fulltext), the preprint servers arXiv, biorxiv, medrxiv and rxivist, and the metadata server crossref (a sketch of selecting a repository follows below).
- If you are familiar with the content and manual search, it is relatively easy to add code for a new RESTful repository. Note that the socio-legal aspects are often critical (copyright, server load, etc.).
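A hedged sketch of the unified interface pointed at several repositories in turn. The `--api` flag and the repository names shown are assumptions based on the pygetpapers documentation; check `pygetpapers --help` for the exact spellings in your version, and note that query syntax can differ per API.

```python
import subprocess

# One unified interface, several repositories. The --api flag and the
# repository names are assumptions from the pygetpapers docs; verify
# with `pygetpapers --help` (query syntax may also differ per API).
for api in ["europe_pmc", "arxiv", "crossref"]:
    subprocess.run(
        ["pygetpapers",
         "--api", api,
         "-q", "machine learning",
         "-k", "5",
         "-o", f"ml_{api}"],  # one output directory per repository
        check=True,
    )
```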
- pygetpapers stores all data (fulltexts, metadata, analyses, etc.) on your machine, wherever you choose.
No.
Currently you have to install Python, but there are simple tested commands for this. Later we may package everything as Docker containers or Jupyter Notebooks.
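As a sketch of those simple commands: once Python itself is installed, pygetpapers is installed from PyPI. The package name is `pygetpapers`; consult the README for the exact commands the authors test.

```python
import subprocess
import sys

# Install pygetpapers from PyPI into the current Python environment,
# then confirm the command-line entry point is available.
# (Sketch only; see the README for the authors' tested commands.)
subprocess.run([sys.executable, "-m", "pip", "install", "pygetpapers"], check=True)
subprocess.run(["pygetpapers", "--help"], check=True)
```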
Not by default.
There is an optional logfile which stores the query and records downloads (see the sketch below).
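A hedged sketch of switching that logging on. The `--logfile` flag name is an assumption based on the pygetpapers documentation (check `pygetpapers --help`); the file is written locally, so nothing is sent anywhere.

```python
import subprocess

# Optional local logging: the query and the downloads are recorded
# in a file on your machine; nothing is sent anywhere. The --logfile
# flag name is an assumption from the pygetpapers docs; check --help.
subprocess.run(
    ["pygetpapers",
     "-q", "essential oil",
     "-k", "20",
     "-o", "oil_20",
     "--logfile", "download.log"],
    check=True,
)
```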
We are working on integrating pygetpapers into Jupyter Notebooks so that complex workflows can be re-run.
It is not currently packaged as a server (although this shouldn't be difficult), but we are exploring cloud solutions such as Binder or Google Colab.
- pygetpapers is generally embarrassingly parallel. The main resources are bandwidth and remote-server capacity. Several jobs can be run simultaneously, e.g. by dividing a query into publication-date slices (see the sketch after this list). The main concern is not to overload the remote server and create a denial of service, so be careful.
- Downloaded files can be quite large (e.g. 20+ MB PDFs), so 10,000 files might take 50 GB.
- Malformed queries in getpapers could cause problems; we are not sure whether this is true for pygetpapers.
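A hedged sketch of parallel downloading by publication-date slices. The `--startdate` and `--enddate` flags are assumptions based on the pygetpapers documentation (check `pygetpapers --help`); the deliberately low worker count is the point, to avoid overloading the remote server.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# One pygetpapers job per publication-date slice. The --startdate and
# --enddate flags are assumptions from the pygetpapers docs; check --help.
# Keep max_workers low so you do not overload the remote server.
SLICES = [
    ("2019-01-01", "2019-12-31"),
    ("2020-01-01", "2020-12-31"),
    ("2021-01-01", "2021-06-30"),
]

def fetch(start: str, end: str) -> None:
    subprocess.run(
        ["pygetpapers",
         "-q", "terpene",
         "-k", "100",
         "-o", f"terpene_{start}_{end}",  # one output directory per slice
         "--startdate", start,
         "--enddate", end],
        check=True,
    )

with ThreadPoolExecutor(max_workers=2) as pool:  # deliberately conservative
    for start, end in SLICES:
        pool.submit(fetch, start, end)
```

Threads are sufficient here because each worker simply waits on network I/O; raising `max_workers` buys little and risks the denial-of-service problem noted above.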