AlphaTest (petermr)
Note: pygetpapers is well presented. This report is written with the future possibility of publishing the software in a scientific software journal/repository.
- operator: Peter Murray-Rust, founder of ContentMine which sponsored `getpapers`
- background: heavy user of `getpapers` and starting to integrate `pygetpapers` into the `pyami` system
- familiarity with `pygetpapers`: very substantial
- OS: MacOS 11.4
- Python 3.8 (as 3.9 is incompatible with `PyCharm`)
- philosophy: acceptance test combined with academic review
- to comment on the documentation
- to install the system
- to test a given query+download on all supported repos
- to deliberately challenge the system with extreme requests (to mimic possible misunderstandings).
Add:
"pygetpapers aims to provide a consistent query interface over several repositories, abstracting the syntax of dates, hits, quoting. This isn't always possible. Some repositories have currently unique commands (e.g. synonyms in EPMC)."
[Ayush, please annotate unique commands in the docs.]
"The main medium of its interaction with users is through a command-line interface.", [insert: "but it is also available in prototype in `pyami` through a GUI"].
Add: "pyami can be used iteratively, e.g. by building term-lists which refine the query from initial downloads"
Add: "Note that some commands are repository-specific. They may reference specific actions, or use a range of repository commands, often by using a dictionary-like structure with repository key-value pairs".
config file: [Please give details of this: purpose, syntax, keywords, location. Is it only for saved queries?]
General: detailed descriptions of some options should be created in separate doc files [a NOTES.md, or even one for specific issues], so that they do not overburden the keyword description (see some UNIX tool descriptions). Thus "query" could be:
query string transmitted to repository API. E.g. "Artificial Intelligence" or "Plant Parts". To escape special characters within the quotes, use backslash. In case of nested/Boolean quotes, ensure that the initial quotes are double and the quotes inside are single. See [NOTES.md#query...] for examples. Dates should be queried using `startdate` and `enddate`.
[Ayush, are there other abstractions here, e.g. fulltext-availability, license, sections, etc.?]
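The quoting advice could be illustrated in NOTES.md with a small check of how a POSIX shell would tokenise the recommended form. The sketch below uses Python's standard `shlex` module on the example query from the help text; it only shows what argument list the program would receive and makes no claim about how pygetpapers processes the query internally.

```python
import shlex

# Recommended form from the help text: outer double quotes, inner single quotes.
command = (
    "pygetpapers -q "
    "\"(LICENSE:'cc by' OR LICENSE:'cc-by') AND METHODS:'transcriptome assembly'\" "
    "-n"
)

# shlex.split follows POSIX shell quoting rules, so the result is the
# argument list the program would be started with.
argv = shlex.split(command)

for arg in argv:
    print(repr(arg))
```

The whole Boolean expression arrives as a single argument to `-q`, with the inner single quotes preserved.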
directory name is missing.
[I think these are EPMC-specific]. Add "specialist search option". ["See DOCS", which would point to a URL in EPMC]
Also EPMC-only?
Don't use "the json". Give the precise name.
"Reads eupmc_results.json metadata from previous run/s and downloads [WHAT PRECISELY?]". Does it only download XML?
name "metadata json file"
Give filenames.
"Stores" => "outputs"
[EPMC only]. Adds inbuilt synonyms to the query string [NOTE]
"Gives" => "selects"
Reads a comma-separated TERMS file with a list of additional terms to be OR'ed into the query. [See SNOWBALL.md]
I think EPMC don't like the abbreviation EUPMC.
[API-dependent] A list of key-value pairs listed in the search site API (currently unchecked) to be added to the query.
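For the TERMS option, the docs would benefit from showing what the final query looks like. The following is a minimal sketch, assuming the behaviour stated in the help text (terms OR'ed among themselves and AND'ed with the query). The function name `build_query` and the choice to wrap multi-word terms in double quotes are illustrative assumptions, not the pygetpapers implementation.

```python
def build_query(query, terms_text):
    """Combine a base query with comma-separated terms.

    Assumed result shape: (query) AND (term1 OR term2 ...).
    """
    # Split on commas and drop empty entries and surrounding whitespace.
    terms = [t.strip() for t in terms_text.split(",") if t.strip()]
    if not terms:
        return query
    # Assumption: multi-word terms are quoted so they are searched as phrases.
    ored = " OR ".join(f'"{t}"' if " " in t else t for t in terms)
    return f"({query}) AND ({ored})"


print(build_query("terpene synthase", "TPS20, TPS21,limonene synthase"))
```

Printing the "Final query" in this form (as pygetpapers already does for plain queries) would let users confirm how their terms file was interpreted.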
***I will now test these ***
Decided to uninstall and then reinstall
(base) pm286macbook:pyami pm286$ pip uninstall pygetpapers
Found existing installation: pygetpapers 0.0.6.3
Uninstalling pygetpapers-0.0.6.3:
Would remove:
/opt/anaconda3/bin/pygetpapers
/opt/anaconda3/lib/python3.8/site-packages/pygetpapers-0.0.6.3.dist-info/*
/opt/anaconda3/lib/python3.8/site-packages/pygetpapers/*
Proceed (y/n)? y
Successfully uninstalled pygetpapers-0.0.6.3
(base) pm286macbook:pyami pm286$ pip install git+git://github.com/petermr/pygetpapers
Collecting git+git://github.com/petermr/pygetpapers
Cloning git://github.com/petermr/pygetpapers to /private/var/folders/ft/7j605bsd10l0ftqygyxjjflh0000gq/T/pip-req-build-qqe1d94g
Running command git clone -q git://github.com/petermr/pygetpapers /private/var/folders/ft/7j605bsd10l0ftqygyxjjflh0000gq/T/pip-req-build-qqe1d94g
Requirement already satisfied: requests in /opt/anaconda3/lib/python3.8/site-packages (from pygetpapers==0.0.7.1) (2.20.0)
Requirement already satisfied: pandas in /opt/anaconda3/lib/python3.8/site-packages (from pygetpapers==0.0.7.1) (1.2.0)
Requirement already satisfied: lxml in /opt/anaconda3/lib/python3.8/site-packages (from pygetpapers==0.0.7.1) (4.6.2)
Requirement already satisfied: xmltodict in /opt/anaconda3/lib/python3.8/site-packages (from pygetpapers==0.0.7.1) (0.12.0)
Requirement already satisfied: configargparse in /opt/anaconda3/lib/python3.8/site-packages (from pygetpapers==0.0.7.1) (1.4)
Requirement already satisfied: habanero in /opt/anaconda3/lib/python3.8/site-packages (from pygetpapers==0.0.7.1) (0.7.4)
Requirement already satisfied: arxiv in /opt/anaconda3/lib/python3.8/site-packages (from pygetpapers==0.0.7.1) (1.2.0)
Requirement already satisfied: dict2xml in /opt/anaconda3/lib/python3.8/site-packages (from pygetpapers==0.0.7.1) (1.7.0)
Requirement already satisfied: tqdm in /opt/anaconda3/lib/python3.8/site-packages (from pygetpapers==0.0.7.1) (4.49.0)
Collecting coloredlogs
Downloading coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)
|████████████████████████████████| 46 kB 1.5 MB/s
Requirement already satisfied: feedparser in /opt/anaconda3/lib/python3.8/site-packages (from arxiv->pygetpapers==0.0.7.1) (6.0.8)
Collecting humanfriendly>=9.1
Downloading humanfriendly-9.2-py2.py3-none-any.whl (86 kB)
|████████████████████████████████| 86 kB 3.7 MB/s
Requirement already satisfied: sgmllib3k in /opt/anaconda3/lib/python3.8/site-packages (from feedparser->arxiv->pygetpapers==0.0.7.1) (1.0.0)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in /opt/anaconda3/lib/python3.8/site-packages (from requests->pygetpapers==0.0.7.1) (1.24.3)
Requirement already satisfied: certifi>=2017.4.17 in /opt/anaconda3/lib/python3.8/site-packages (from requests->pygetpapers==0.0.7.1) (2020.6.20)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /opt/anaconda3/lib/python3.8/site-packages (from requests->pygetpapers==0.0.7.1) (3.0.4)
Requirement already satisfied: idna<2.8,>=2.5 in /opt/anaconda3/lib/python3.8/site-packages (from requests->pygetpapers==0.0.7.1) (2.7)
Requirement already satisfied: pytz>=2017.3 in /opt/anaconda3/lib/python3.8/site-packages (from pandas->pygetpapers==0.0.7.1) (2020.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/anaconda3/lib/python3.8/site-packages (from pandas->pygetpapers==0.0.7.1) (2.8.1)
Requirement already satisfied: numpy>=1.16.5 in /opt/anaconda3/lib/python3.8/site-packages (from pandas->pygetpapers==0.0.7.1) (1.19.1)
Requirement already satisfied: six>=1.5 in /opt/anaconda3/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas->pygetpapers==0.0.7.1) (1.15.0)
Building wheels for collected packages: pygetpapers
Building wheel for pygetpapers (setup.py) ... done
Created wheel for pygetpapers: filename=pygetpapers-0.0.7.1-py3-none-any.whl size=36161 sha256=62d3d9f28bc659ab9fc1bcb48153578b9ff620694bdb442a3ab3fb5079972af3
Stored in directory: /private/var/folders/ft/7j605bsd10l0ftqygyxjjflh0000gq/T/pip-ephem-wheel-cache-rkd5_4kk/wheels/ba/0b/2b/8b106dfd2ba44ce9bf97af89862b616347b83d87e3bf2f6ed5
Successfully built pygetpapers
Installing collected packages: humanfriendly, coloredlogs, pygetpapers
Successfully installed coloredlogs-15.0.1 humanfriendly-9.2 pygetpapers-0.0.7.1
(base) pm286macbook:pyami pm286$ pygetpapers
usage: pygetpapers [-h] [--config CONFIG] [-v] [-q QUERY] [-o OUTPUT] [--save_query] [-x] [-p] [-s]
[-z] [--references REFERENCES] [-n] [--citations CITATIONS] [-l LOGLEVEL]
[-f LOGFILE] [-k LIMIT] [-r RESTART] [-u UPDATE] [--onlyquery] [-c] [--makehtml]
[--synonym] [--startdate STARTDATE] [--enddate ENDDATE] [--terms TERMS]
[--api API] [--filter FILTER]
Welcome to Pygetpapers version 0.0.7.1. -h or --help for help
optional arguments:
-h, --help show this help message and exit
--config CONFIG config file path to read query for pygetpapers
-v, --version output the version number
-q QUERY, --query QUERY
query string transmitted to repository API. Eg. "Artificial Intelligence" or
"Plant Parts". To escape special characters within the quotes, use
backslash. Incase of nested quotes, ensure that the initial quotes are
double and the qutoes inside are single. For eg: `'(LICENSE:"cc by" OR
LICENSE:"cc-by") AND METHODS:"transcriptome assembly"' ` is wrong. We should
instead use `"(LICENSE:'cc by' OR LICENSE:'cc-by') AND
METHODS:'transcriptome assembly'"`
-o OUTPUT, --output OUTPUT
output directory (Default: Folder inside current working directory named )
--save_query saved the passed query in a config file
-x, --xml download fulltext XMLs if available or save metadata as XML
-p, --pdf download fulltext PDFs if available (only eupmc and arxiv supported)
-s, --supp download supplementary files if available (only eupmc supported)
-z, --zip download files from ftp endpoint if available (only eupmc supported)
--references REFERENCES
Download references if available. (only eupmc supported)Requires source for
references (AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).
-n, --noexecute report how many results match the query, but don't actually download
anything
--citations CITATIONS
Download citations if available (only eupmc supported). Requires source for
citations (AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).
-l LOGLEVEL, --loglevel LOGLEVEL
Provide logging level. Example --log warning
<<info,warning,debug,error,critical>>, default='info'
-f LOGFILE, --logfile LOGFILE
save log to specified file in output directory as well as printing to
terminal
-k LIMIT, --limit LIMIT
maximum number of hits (default: 100)
-r RESTART, --restart RESTART
Reads the json and makes the xml files. Takes the path to the json as the
input (only eupmc supported)
-u UPDATE, --update UPDATE
Updates the corpus by downloading new papers. Takes the path of metadata
json file of the orignal corpus as the input. Requires -k or --limit (If not
provided, default will be used) and -q or --query (must be provided) to be
given. Takes the path to the json as the input.
--onlyquery Saves json file containing the result of the query in storage. (only eupmc
supported)The json file can be given to --restart to download the papers
later.
-c, --makecsv Stores the per-document metadata as csv.
--makehtml Stores the per-document metadata as html.
--synonym Results contain synonyms as well.
--startdate STARTDATE
Gives papers starting from given date. Format: YYYY-MM-DD
--enddate ENDDATE Gives papers till given date. Format: YYYY-MM-DD
--terms TERMS Location of the txt file which contains terms serperated by a comma which
will beOR'ed among themselves and AND'ed with the query
--api API API to search [eupmc, crossref,arxiv,biorxiv,medrxiv,rxivist] (default:
eupmc)
--filter FILTER filter by key value pair (only crossref supported)
- created test directory in `pygetpapers/usertests/petermr` (https://github.com/petermr/pygetpapers/usertests/petermr)
- example queries based on terpene synthases (TPSdd) which give 10-50 hits
- comparison with manual query on EPMC site, and `getpapers 0.4.17` and `pygetpapers`
It's critical that `getpapers` and `pygetpapers` behave identically - currently they don't. This will confuse people badly.
- TPS20 (or "TPS20" or 'TPS20') gave 28 hits
- checking "free to read" reduced this to 25
- checking "free to read and use" reduced this to 20
- `-q TPS20 -a` (or `-q "TPS20" -a` or `-q 'TPS20' -a`) gave 28 hits
- `-q TPS20` (or `-q "TPS20"` or `-q 'TPS20'`) gave 20 hits
pygetpapers -q TPS20 -n -a
usage: pygetpapers [-h] [--config CONFIG] [-v] [-q QUERY] [-o OUTPUT] [--save_query] [-x] [-p] [-s] [-z] [--references REFERENCES] [-n]
[--citations CITATIONS] [-l LOGLEVEL] [-f LOGFILE] [-k LIMIT] [-r RESTART] [-u UPDATE] [--onlyquery] [-c] [--makehtml]
[--synonym] [--startdate STARTDATE] [--enddate ENDDATE] [--terms TERMS] [--api API] [--filter FILTER]
pygetpapers: error: unrecognized arguments: -a
(base) pm286macbook:petermr pm286$ pygetpapers -q TPS20 -n
INFO: Final query is TPS20
INFO: Total number of hits for the query are 28
It appears that the `pygetpapers` default is all papers (i.e. no checked boxes on EPMC site).
How to get just the Open Access papers?
(base) pm286macbook:petermr pm286$ pygetpapers -q "TPS20" -n
INFO: Final query is TPS20
INFO: Total number of hits for the query are 28
(base) pm286macbook:petermr pm286$ pygetpapers -q 'TPS20' -n
INFO: Final query is TPS20
INFO: Total number of hits for the query are 28
(base) pm286macbook:petermr pm286$ pygetpapers -q 'TPS20' -n
INFO: Final query is TPS20
INFO: Total number of hits for the query are 28
(base) pm286macbook:petermr pm286$ pygetpapers -q '(TPS20) and (TPS21)' -n
INFO: Final query is (TPS20) and (TPS21)
INFO: Total number of hits for the query are 9
(base) pm286macbook:petermr pm286$ pygetpapers -q '(TPS20) NOT (TPS21)' -n
INFO: Final query is (TPS20) NOT (TPS21)
INFO: Total number of hits for the query are 19
(base) pm286macbook:petermr pm286$ pygetpapers -q '(TPS20) OR (TPS21)' -n
INFO: Final query is (TPS20) OR (TPS21)
INFO: Total number of hits for the query are 106
(base) pm286macbook:petermr pm286$ pygetpapers -q '(TPS21)' -n
INFO: Final query is (TPS21)
INFO: Total number of hits for the query are 87
(base) pm286macbook:petermr pm286$ pygetpapers -q '(TPS21) NOT (TPS20)' -n
INFO: Final query is (TPS21) NOT (TPS20)
INFO: Total number of hits for the query are 78
These are all consistent
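The consistency claim can be checked with simple set arithmetic on the reported hit counts: for two result sets A (TPS20) and B (TPS21), A NOT B should equal A minus the overlap, and A OR B should equal A plus B minus the overlap. A small sketch using only the numbers from the session above:

```python
# Hit counts reported by `pygetpapers -n` in the session above.
counts = {
    "TPS20": 28,
    "TPS21": 87,
    "TPS20 and TPS21": 9,
    "TPS20 NOT TPS21": 19,
    "TPS21 NOT TPS20": 78,
    "TPS20 OR TPS21": 106,
}


def consistency_checks(c):
    """Return a dict of named checks, each True if the counts agree."""
    both = c["TPS20 and TPS21"]
    return {
        "A NOT B == A - (A AND B)": c["TPS20 NOT TPS21"] == c["TPS20"] - both,
        "B NOT A == B - (A AND B)": c["TPS21 NOT TPS20"] == c["TPS21"] - both,
        "A OR B == A + B - (A AND B)": c["TPS20 OR TPS21"]
        == c["TPS20"] + c["TPS21"] - both,
    }


for name, ok in consistency_checks(counts).items():
    print(name, "->", ok)
```

All three identities hold (28 - 9 = 19, 87 - 9 = 78, 28 + 87 - 9 = 106).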
(base) pm286macbook:petermr pm286$ pygetpapers -q TPS30
This should download metadata only, into a computed subdirectory.
INFO: Final query is TPS30
INFO: Total Hits are 18
WARNING: Could not find more papers
0it [00:00, ?it/s]WARNING: Keywords not found for paper 4
WARNING: Keywords not found for paper 5
WARNING: Keywords not found for paper 10
WARNING: html url not found for paper 11
WARNING: Keywords not found for paper 11
WARNING: pdf url not found for paper 11
WARNING: html url not found for paper 12
WARNING: Keywords not found for paper 12
WARNING: pdf url not found for paper 12
WARNING: html url not found for paper 13
WARNING: Abstract not found for paper 13
WARNING: Keywords not found for paper 13
WARNING: pdf url not found for paper 13
WARNING: Author list not found for paper 13
WARNING: html url not found for paper 14
WARNING: Abstract not found for paper 14
WARNING: Keywords not found for paper 14
WARNING: pdf url not found for paper 14
WARNING: Author list not found for paper 14
WARNING: html url not found for paper 15
WARNING: Keywords not found for paper 15
WARNING: pdf url not found for paper 15
1it [00:00, 140.17it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:06<00:00, 2.37it/s]
(base) pm286macbook:petermr pm286$
(base) pm286macbook:petermr pm286$ tree 2021_07_24_19_18_09
2021_07_24_19_18_09
├── PMC1211283
│ └── eupmc_result.json
├── PMC1211284
│ └── eupmc_result.json
├── PMC212153
│ └── eupmc_result.json
├── PMC214377
│ └── eupmc_result.json
├── PMC3268506
│ └── eupmc_result.json
├── PMC4457800
│ └── eupmc_result.json
├── PMC5122590
│ └── eupmc_result.json
├── PMC5161391
│ └── eupmc_result.json
├── PMC5655044
│ └── eupmc_result.json
├── PMC6266747
│ └── eupmc_result.json
├── PMC6742361
│ └── eupmc_result.json
├── PMC7305226
│ └── eupmc_result.json
├── PMC7600171
│ └── eupmc_result.json
├── PMC8036305
│ └── eupmc_result.json
├── PMC8201348
│ └── eupmc_result.json
└── eupmc_results.json
15 directories, 16 files
- one overall `eupmc_results.json` and 15 directories, each with an individual `eupmc_result.json`
- typical metadata `eupmc_result.json` (split after commas):
(base) pm286macbook:petermr pm286$ more 2021_07_24_19_18_09/PMC1211283/eupmc_result.json
{"downloaded": false,
"htmlmade": false,
"full": {"id": "17248343",
"source": "MED",
"pmid": "17248343",
"pmcid": "PMC1211283",
"fullTextIdList": {"fullTextId": "PMC1211283"},
"title": "Nonrandom Location of Temperature-Sensitive Mutants on the Linkage Map of STREPTOMYCES COELICOLOR.",
"authorString": "Hopwood DA.",
"authorList": {"author": {"fullName": "Hopwood DA",
"firstName": "D A",
"lastName": "Hopwood",
"initials": "DA",
"authorAffiliationDetailsList": {"authorAffiliation": {"affiliation": "Department of Genetics,
University of Glasgow,
Glasgow,
Scotland."}}}},
"journalInfo": {"issue": "5",
"volume": "54",
"journalIssueId": "1374536",
"dateOfPublication": "1966 Nov",
"monthOfPublication": "11",
"yearOfPublication": "1966",
"printPublicationDate": "1966-11-01",
"journal": {"title": "Genetics",
"ISOAbbreviation": "Genetics",
"medlineAbbreviation": "Genetics",
"NLMid": "0374636",
"ISSN": "0016-6731",
"ESSN": "1943-2631"}},
"pubYear": "1966",
"pageInfo": "1169-1176",
"affiliation": "Department of Genetics,
University of Glasgow,
Glasgow,
Scotland.",
"publicationStatus": "ppublish",
"language": "eng",
"pubModel": "Print",
"pubTypeList": {"pubType": ["research-article",
"Journal Article"]},
"fullTextUrlList": {"fullTextUrl": [{"availability": "Free",
"availabilityCode": "F",
"documentStyle": "html",
"site": "PubMedCentral",
"url": "https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/17248343/?tool=EBI"},
{"availability": "Free",
"availabilityCode": "F",
"documentStyle": "pdf",
"site": "PubMedCentral",
"url": "https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/17248343/pdf/?tool=EBI"},
{"availability": "Free",
"availabilityCode": "F",
"documentStyle": "html",
"site": "Europe_PMC",
"url": "https://europepmc.org/articles/PMC1211283"},
{"availability": "Free",
"availabilityCode": "F",
"documentStyle": "pdf",
"site": "Europe_PMC",
"url": "https://europepmc.org/articles/PMC1211283?pdf=render"}]},
"isOpenAccess": "N",
"inEPMC": "Y",
"inPMC": "Y",
"hasPDF": "Y",
"hasBook": "N",
"hasSuppl": "N",
"citedByCount": "7",
"hasData": "N",
"hasReferences": "Y",
"hasTextMinedTerms": "Y",
"hasDbCrossReferences": "N",
"hasLabsLinks": "N",
"authMan": "N",
"epmcAuthMan": "N",
"nihAuthMan": "N",
"hasTMAccessionNumbers": "N",
"dateOfCompletion": "2010-06-28",
"dateOfCreation": "1966-11-01",
"firstIndexDate": "2010-09-16",
"fullTextReceivedDate": "2020-07-09",
"dateOfRevision": "2018-11-13",
"firstPublicationDate": "1966-11-01"},
"journaltitle": "Genetics",
"title": "Nonrandom Location of Temperature-Sensitive Mutants on the Linkage Map of STREPTOMYCES COELICOLOR."}
21 directories, 55 files
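Until the Open Access question is answered, one possible workaround is to filter after download, since each per-paper `eupmc_result.json` shown above carries an `isOpenAccess` field under `full`. This is a sketch based only on the file layout and field names seen in this test; the function name `open_access_ids` is hypothetical and it is not a pygetpapers feature.

```python
import json
from pathlib import Path


def open_access_ids(corpus_dir):
    """Return directory names (PMCIDs) whose metadata has full.isOpenAccess == "Y"."""
    ids = []
    # Assumed layout, as in the tree output: <corpus>/<PMCID>/eupmc_result.json
    for result_file in sorted(Path(corpus_dir).glob("*/eupmc_result.json")):
        with result_file.open(encoding="utf-8") as handle:
            record = json.load(handle)
        # Missing keys are treated as "not open access" rather than raising.
        if record.get("full", {}).get("isOpenAccess") == "Y":
            ids.append(result_file.parent.name)
    return ids
```

For example, `open_access_ids("2021_07_24_19_18_09")` would list the open-access subset of the corpus downloaded above; PMC1211283 would be excluded because its record has `"isOpenAccess": "N"`.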
(base) pm286macbook:petermr pm286$ pygetpapers -q TPS30 -o 2021_07_24_19_18_09/ -x -p -s --save_query
Are all the messages correct? PDF is supported for EPMC, yet the output includes "WARNING: Pdf is not supported for this api".
-bash: syntax error near unexpected token `pm286macbook:petermr'
(base) pm286macbook:petermr pm286$ WARNING: Pdf is not supported for this api
-bash: WARNING:: command not found
(base) pm286macbook:petermr pm286$ INFO: Final query is TPS30
-bash: INFO:: command not found
(base) pm286macbook:petermr pm286$ INFO: Total Hits are 18
-bash: INFO:: command not found
(base) pm286macbook:petermr pm286$ WARNING: Could not find more papers
-bash: WARNING:: command not found
(base) pm286macbook:petermr pm286$ 0it [00:00, ?it/s]WARNING: Keywords not found for paper 4
-bash: 0it: command not found
(base) pm286macbook:petermr pm286$ WARNING: Keywords not found for paper 5
-bash: WARNING:: command not found
(base) pm286macbook:petermr pm286$ WARNING: Keywords not found for paper 10
-bash: WARNING:: command not found
(base) pm286macbook:petermr pm286$ WARNING: html url not found for paper 11
-bash: WARNING:: command not found
(base) pm286macbook:petermr pm286$ WARNING: Keywords not found for paper 11
-bash: WARNING:: command not found
(base) pm286macbook:petermr pm286$ WARNING: pdf url not found for paper 11
-bash: WARNING:: command not found
(base) pm286macbook:petermr pm286$ WARNING: html url not found for paper 12
-bash: WARNING:: command not found
(base) pm286macbook:petermr pm286$ WARNING: Keywords not found for paper 12
-bash: WARNING:: command not found
(base) pm286macbook:petermr pm286$ WARNING: pdf url not found for paper 12
-bash: WARNING:: command not found
(base) pm286macbook:petermr pm286$ WARNING: html url not found for paper 13
-bash: WARNING:: command not found
(base) pm286macbook:petermr pm286$ WARNING: Abstract not found for paper 13
-bash: WARNING:: command not found
(base) pm286macbook:petermr pm286$ WARNING: Keywords not found for paper 13
-bash: WARNING:: command not found
(base) pm286macbook:petermr pm286$ WARNING: pdf url not found for paper 13
-bash: WARNING:: command not found
(base) pm286macbook:petermr pm286$ WARNING: Author list not found for paper 13
-bash: WARNING:: command not found
(base) pm286macbook:petermr pm286$ WARNING: html url not found for paper 14
-bash: WARNING:: command not found
(base) pm286macbook:petermr pm286$ WARNING: Abstract not found for paper 14
-bash: WARNING:: command not found
(base) pm286macbook:petermr pm286$ WARNING: Keywords not found for paper 14
-bash: WARNING:: command not found
(base) pm286macbook:petermr pm286$ WARNING: pdf url not found for paper 14
-bash: WARNING:: command not found
(base) pm286macbook:petermr pm286$ WARNING: Author list not found for paper 14
-bash: WARNING:: command not found
(base) pm286macbook:petermr pm286$ WARNING: html url not found for paper 15
-bash: WARNING:: command not found
(base) pm286macbook:petermr pm286$ WARNING: Keywords not found for paper 15
-bash: WARNING:: command not found
(base) pm286macbook:petermr pm286$ WARNING: pdf url not found for paper 15
-bash: WARNING:: command not found
(base) pm286macbook:petermr pm286$ 1it [00:00, 189.82it/s]
-bash: 1it: command not found
(base) pm286macbook:petermr pm286$ INFO: Saving XML files to /Users/pm286/workspace/pygetpapers/usertests/petermr/2021_07_24_19_18_09/*/fulltext.xml
-bash: INFO:: command not found
(base) pm286macbook:petermr pm286$ 0%| | 0/15 [00:00<?, ?it/s]WARNING: supplementary files not found for PMC7600171
-bash: syntax error near unexpected token `|'
(base) pm286macbook:petermr pm286$ 7%|███████▎ | 1/15 [00:02<00:41, 2.96s/it]INFO: Wrote supplementary files for supplementary
-bash: ███████▎: command not found
-bash: 00:41,: No such file or directory
-bash: 7%: command not found
(base) pm286macbook:petermr pm286$ 13%|█████████
pm286macbook:petermr pm286$ tree | more
.
├── 2021_07_24_19_18_09
│ ├── PMC1211283
│ │ ├── eupmc_result.json
│ │ └── fulltext.xml
│ ├── PMC1211284
│ │ ├── eupmc_result.json
│ │ └── fulltext.xml
│ ├── PMC212153
│ │ ├── eupmc_result.json
│ │ └── fulltext.xml
│ ├── PMC214377
│ │ ├── eupmc_result.json
│ │ └── fulltext.xml
│ ├── PMC3268506
│ │ ├── eupmc_result.json
│ │ └── fulltext.xml
│ ├── PMC4457800
│ │ ├── eupmc_result.json
│ │ ├── fulltext.pdf
│ │ ├── fulltext.xml
│ │ └── supplementaryfiles
│ │ ├── pone.0128808.s001.xls
│ │ ├── pone.0128808.s002.xls
│ │ ├── pone.0128808.s003.xls
│ │ ├── pone.0128808.s004.xls
│ │ └── pone.0128808.s005.xls
│ ├── PMC5122590
│ │ ├── eupmc_result.json
│ │ ├── fulltext.pdf
│ │ ├── fulltext.xml
│ │ └── supplementaryfiles
│ │ └── Data_Sheet_1.pdf
│ ├── PMC5161391
│ │ ├── eupmc_result.json
│ │ ├── fulltext.pdf
│ │ └── fulltext.xml
│ ├── PMC5655044
│ │ ├── eupmc_result.json
│ │ ├── fulltext.pdf
│ │ ├── fulltext.xml
│ │ └── supplementaryfiles
│ │ ├── AAC.00959-17_zac011176606s1.pdf
│ │ └── supp_61_11_e00959-17__index.html
│ ├── PMC6266747
│ │ ├── eupmc_result.json
│ │ ├── fulltext.pdf
│ │ └── fulltext.xml
│ ├── PMC6742361
│ │ ├── eupmc_result.json
│ │ ├── fulltext.pdf
│ │ ├── fulltext.xml
│ │ └── supplementaryfiles
│ │ ├── pone.0222363.s001.tif
│ │ ├── pone.0222363.s002.txt
│ │ └── pone.0222363.s003.tsv
│ ├── PMC7305226
│ │ ├── eupmc_result.json
│ │ ├── fulltext.pdf
│ │ ├── fulltext.xml
│ │ └── supplementaryfiles
│ │ └── 41598_2020_66866_MOESM1_ESM.pdf
│ ├── PMC7600171
│ │ ├── eupmc_result.json
│ │ ├── fulltext.pdf
│ │ └── fulltext.xml
│ ├── PMC8036305
│ │ ├── eupmc_result.json
│ │ ├── fulltext.pdf
│ │ └── fulltext.xml
│ ├── PMC8201348
│ │ ├── eupmc_result.json
│ │ ├── fulltext.pdf
│ │ └── fulltext.xml
│ ├── eupmc_results.json
│ └── saved_config.ini
└── README.md
- some files do not have PDFs, or supplementary files
- the naming/numbering of supplementary files is publisher-dependent.
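Because PDFs and supplementary files are present for only some papers, a per-paper summary of what was actually downloaded would be a useful addition to the output. A minimal sketch, assuming the directory layout shown in the tree above (`fulltext.xml`, `fulltext.pdf`, `supplementaryfiles/`); the function name `summarise_corpus` is hypothetical.

```python
from pathlib import Path


def summarise_corpus(corpus_dir):
    """Map each paper directory to which fulltext files it contains."""
    summary = {}
    # Only sub-directories are papers; top-level files such as
    # eupmc_results.json and saved_config.ini are skipped.
    paper_dirs = sorted(p for p in Path(corpus_dir).iterdir() if p.is_dir())
    for paper_dir in paper_dirs:
        supp_dir = paper_dir / "supplementaryfiles"
        summary[paper_dir.name] = {
            "xml": (paper_dir / "fulltext.xml").is_file(),
            "pdf": (paper_dir / "fulltext.pdf").is_file(),
            # Number of supplementary files, 0 if the directory is absent.
            "supp": len(list(supp_dir.iterdir())) if supp_dir.is_dir() else 0,
        }
    return summary
```

Run on the corpus above, this would report, for instance, that PMC1211283 has XML only, while PMC4457800 has XML, PDF and five supplementary files.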
The `--save_query` option created a `saved_config.ini` file, containing:
more 2021_07_24_19_18_09/saved_config.ini
[SAVED]
config = None
version = False
query = TPS30
output = 2021_07_24_19_18_09/
save_query = True
xml = True
pdf = True
supp = True
zip = False
references = False
noexecute = False
citations = False
loglevel = info
logfile = False
limit = 100
restart = False
update = False
onlyquery = False
makecsv = False
makehtml = False
synonym = False
startdate = False
enddate = False
terms = False
api = eupmc
filter = None
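The saved file is plain INI with a single `[SAVED]` section, so it can be inspected with Python's standard `configparser`. The sketch below is for illustration of the file's syntax only (the function name `load_saved_query` is hypothetical, and this is not necessarily how pygetpapers reads it). Note that unset options are stored as the literal strings `False` or `None`, which the docs should explain.

```python
import configparser


def load_saved_query(ini_text):
    """Parse the [SAVED] section of a saved_config.ini given as text."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    saved = parser["SAVED"]
    return {
        "query": saved.get("query"),
        "api": saved.get("api"),
        "limit": saved.getint("limit"),      # "100" -> 100
        "xml": saved.getboolean("xml"),      # "True" -> True
        "pdf": saved.getboolean("pdf"),
        "supp": saved.getboolean("supp"),
        # Kept as a raw string: the file stores "False" when no date was given.
        "startdate": saved.get("startdate"),
    }


example = """[SAVED]
query = TPS30
xml = True
pdf = True
supp = True
limit = 100
startdate = False
api = eupmc
"""
print(load_saved_query(example))
```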
I am not familiar with most of these sources so only used MED
pm286macbook:petermr pm286$ pygetpapers -q TPS31 -o TPS31 --references MED --citations MED
INFO: Final query is TPS31
INFO: Total Hits are 11
WARNING: Could not find more papers
0it [00:00, ?it/s]WARNING: html url not found for paper 1
WARNING: pdf url not found for paper 1
WARNING: Keywords not found for paper 5
WARNING: html url not found for paper 6
WARNING: pdf url not found for paper 6
WARNING: Keywords not found for paper 10
WARNING: Keywords not found for paper 11
1it [00:00, 205.22it/s]
tree TPS31
TPS31
├── PMC3193516
│ ├── citation.xml
│ ├── eupmc_result.json
│ └── references.xml
├── PMC3997964
│ ├── citation.xml
│ ├── eupmc_result.json
│ └── references.xml
├── PMC4457800
│ ├── citation.xml
│ ├── eupmc_result.json
│ └── references.xml
├── PMC5122590
│ ├── citation.xml
│ ├── eupmc_result.json
│ └── references.xml
├── PMC5378189
│ ├── citation.xml
│ ├── eupmc_result.json
│ └── references.xml
├── PMC6360234
│ ├── citation.xml
│ ├── eupmc_result.json
│ └── references.xml
├── PMC7049213
│ ├── citation.xml
│ ├── eupmc_result.json
│ └── references.xml
├── PMC7214349
│ ├── citation.xml
│ ├── eupmc_result.json
│ └── references.xml
├── PMC7304153
│ ├── citation.xml
│ ├── eupmc_result.json
│ └── references.xml
├── PMC7422722
│ ├── citation.xml
│ ├── eupmc_result.json
│ └── references.xml
├── PMC8002989
│ ├── citation.xml
│ ├── eupmc_result.json
│ └── references.xml
└── eupmc_results.json
11 directories, 34 files
-z, --zip download files from ftp endpoint if available (only eupmc supported)
-l LOGLEVEL, --loglevel LOGLEVEL
-f LOGFILE, --logfile LOGFILE
-k LIMIT, --limit LIMIT
-r RESTART, --restart RESTART
-u UPDATE, --update UPDATE
--onlyquery Saves json file containing the result of the query in storage. (only eupmc
-c, --makecsv Stores the per-document metadata as csv.
--makehtml Stores the per-document metadata as html.
--synonym Results contain synonyms as well.
--startdate STARTDATE
--enddate ENDDATE Gives papers till given date. Format: YYYY-MM-DD
--terms TERMS Location of the txt file which contains terms serperated by a comma which
--api API API to search [eupmc, crossref,arxiv,biorxiv,medrxiv,rxivist] (default:
--filter FILTER filter by key value pair (only crossref supported)