AlphaTest (petermr)

petermr edited this page Jul 24, 2021 · 17 revisions

Alpha test of pygetpapers (starting 2021-07-24)

Note: pygetpapers is well presented. This report is written with the future possibility of publishing the software in a scientific software journal/repository.

metadata

  • operator: Peter Murray-Rust, founder of ContentMine which sponsored getpapers
  • background: heavy user of getpapers and starting to integrate pygetpapers into the `pyami` system
  • familiarity with pygetpapers: very substantial
  • OS: MacOS 11.4
  • Python 3.8 (as 3.9 is incompatible with PyCharm)

strategy

  • philosophy: acceptance test combined with academic review
  • to comment on the documentation
  • to install the system
  • to test a given query+download on all supported repos
  • to deliberately challenge the system with extreme requests (to mimic possible misunderstandings).

documentation

comments on existing documentation

1. What is pygetpapers:

Add: "pygetpapers aims to provide a consistent query interface over several repositories, abstracting the syntax of dates, hits and quoting. This isn't always possible. Some repositories currently have unique commands (e.g. synonyms in EPMC)." [Ayush, please annotate unique commands in the docs.]

"The main medium of its interaction with users is through a command-line interface." [insert: "but it is also available in prototype in pyami through a GUI"].

Add: "pyami can be used iteratively, e.g. by building term-lists which refine the query from initial downloads"

7. Usage

Add: "Note that some commands are repository-specific. They may reference specific actions, or use a range of repository commands, often by using a dictionary-like structure with repository key-value pairs".

config file: [Please give details of this: purpose, syntax, keywords, location. Is it only for saved queries?]

General: detailed descriptions of some options should go in separate doc files [a NOTES.md, or even one per specific issue] rather than overburdening the keyword description (see some UNIX tool descriptions). Thus "query" could be:

query string

query string transmitted to repository API, e.g. "Artificial Intelligence" or "Plant Parts". To escape special characters within the quotes, use backslash. In case of nested/Boolean quotes, ensure that the initial quotes are double and the quotes inside are single. See [NOTES.md#query...] for examples. Dates should be queried using startdate and enddate.

[Ayush, are there other abstractions here, e.g. fulltext-availability, license, sections, etc.?]
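The quoting rule matters mainly when the command passes through a shell. A minimal sketch, assuming pygetpapers is driven from Python rather than typed at a prompt: passing the arguments as a list avoids shell quoting altogether, so only the inner single quotes of the query remain. The helper name `build_command` is hypothetical; `-q` and `-n` are the flags listed in the help output.

```python
def build_command(query, noexecute=True):
    """Return an argument list for pygetpapers (hypothetical helper)."""
    argv = ["pygetpapers", "-q", query]
    if noexecute:
        argv.append("-n")  # -n: report the hit count without downloading
    return argv

# Inner quotes are single; no outer quotes are needed in a list-style call.
query = "(LICENSE:'cc by' OR LICENSE:'cc-by') AND METHODS:'transcriptome assembly'"
argv = build_command(query)
print(argv)
# To execute: import subprocess; subprocess.run(argv, check=True)
```

Because no shell is involved, the query reaches pygetpapers exactly as written in the Python string.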

output

The default directory name is missing from the help text ("Folder inside current working directory named )").

references

[I think these are EPMC-specific]. Add "specialist search option". ["See DOCS", which would point to a URL in EPMC]

citations

Also EPMC-only?

restart

Don't use "the json". Give the precise name: "Reads eupmc_results.json metadata from previous run/s and downloads [WHAT PRECISELY?]". Does it only download XML?

update

Give the precise name of the "metadata json file".

onlyquery

Give filenames.

makehtml and makecsv

"Stores" => "outputs"

synonym

[EPMC only]. Adds inbuilt synonyms to the query string [NOTE]

start/end date

"Gives" => "selects"

terms

Reads a comma-separated TERMS file with a list of additional terms to be OR'ed into the query. [See SNOWBALL.md]
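The help text describes the terms as OR'ed among themselves and AND'ed with the query. A minimal sketch of that combination, assuming a plain comma-separated file and single-word terms; the function names and the exact bracketing are illustrative assumptions, not pygetpapers' actual implementation.

```python
def read_terms(text):
    """Split comma-separated terms, dropping blanks and surrounding whitespace."""
    return [t.strip() for t in text.split(",") if t.strip()]

def combine(query, terms):
    """OR the terms together and AND the group with the query (assumed format)."""
    if not terms:
        return query
    return f"({query}) AND ({' OR '.join(terms)})"

# Contents of a hypothetical terms file, with a trailing comma and newline
terms = read_terms("TPS20, TPS21,\n")
final_query = combine("TPS30", terms)
print(final_query)  # (TPS30) AND (TPS20 OR TPS21)
```

Multi-word terms would also need quoting, which this sketch leaves out.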

api

I think EPMC don't like the abbreviation EUPMC.

filter

[API-dependent] A list of key-value pairs listed in the search site API (currently unchecked) to be added to the query.

I will now test these.

Installation

Decided to uninstall and then reinstall

(base) pm286macbook:pyami pm286$ pip uninstall pygetpapers
Found existing installation: pygetpapers 0.0.6.3
Uninstalling pygetpapers-0.0.6.3:
  Would remove:
    /opt/anaconda3/bin/pygetpapers
    /opt/anaconda3/lib/python3.8/site-packages/pygetpapers-0.0.6.3.dist-info/*
    /opt/anaconda3/lib/python3.8/site-packages/pygetpapers/*
Proceed (y/n)? y
  Successfully uninstalled pygetpapers-0.0.6.3
(base) pm286macbook:pyami pm286$ pip install git+git://github.com/petermr/pygetpapers
Collecting git+git://github.com/petermr/pygetpapers
  Cloning git://github.com/petermr/pygetpapers to /private/var/folders/ft/7j605bsd10l0ftqygyxjjflh0000gq/T/pip-req-build-qqe1d94g
  Running command git clone -q git://github.com/petermr/pygetpapers /private/var/folders/ft/7j605bsd10l0ftqygyxjjflh0000gq/T/pip-req-build-qqe1d94g
Requirement already satisfied: requests in /opt/anaconda3/lib/python3.8/site-packages (from pygetpapers==0.0.7.1) (2.20.0)
Requirement already satisfied: pandas in /opt/anaconda3/lib/python3.8/site-packages (from pygetpapers==0.0.7.1) (1.2.0)
Requirement already satisfied: lxml in /opt/anaconda3/lib/python3.8/site-packages (from pygetpapers==0.0.7.1) (4.6.2)
Requirement already satisfied: xmltodict in /opt/anaconda3/lib/python3.8/site-packages (from pygetpapers==0.0.7.1) (0.12.0)
Requirement already satisfied: configargparse in /opt/anaconda3/lib/python3.8/site-packages (from pygetpapers==0.0.7.1) (1.4)
Requirement already satisfied: habanero in /opt/anaconda3/lib/python3.8/site-packages (from pygetpapers==0.0.7.1) (0.7.4)
Requirement already satisfied: arxiv in /opt/anaconda3/lib/python3.8/site-packages (from pygetpapers==0.0.7.1) (1.2.0)
Requirement already satisfied: dict2xml in /opt/anaconda3/lib/python3.8/site-packages (from pygetpapers==0.0.7.1) (1.7.0)
Requirement already satisfied: tqdm in /opt/anaconda3/lib/python3.8/site-packages (from pygetpapers==0.0.7.1) (4.49.0)
Collecting coloredlogs
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)
     |████████████████████████████████| 46 kB 1.5 MB/s 
Requirement already satisfied: feedparser in /opt/anaconda3/lib/python3.8/site-packages (from arxiv->pygetpapers==0.0.7.1) (6.0.8)
Collecting humanfriendly>=9.1
  Downloading humanfriendly-9.2-py2.py3-none-any.whl (86 kB)
     |████████████████████████████████| 86 kB 3.7 MB/s 
Requirement already satisfied: sgmllib3k in /opt/anaconda3/lib/python3.8/site-packages (from feedparser->arxiv->pygetpapers==0.0.7.1) (1.0.0)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in /opt/anaconda3/lib/python3.8/site-packages (from requests->pygetpapers==0.0.7.1) (1.24.3)
Requirement already satisfied: certifi>=2017.4.17 in /opt/anaconda3/lib/python3.8/site-packages (from requests->pygetpapers==0.0.7.1) (2020.6.20)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /opt/anaconda3/lib/python3.8/site-packages (from requests->pygetpapers==0.0.7.1) (3.0.4)
Requirement already satisfied: idna<2.8,>=2.5 in /opt/anaconda3/lib/python3.8/site-packages (from requests->pygetpapers==0.0.7.1) (2.7)
Requirement already satisfied: pytz>=2017.3 in /opt/anaconda3/lib/python3.8/site-packages (from pandas->pygetpapers==0.0.7.1) (2020.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/anaconda3/lib/python3.8/site-packages (from pandas->pygetpapers==0.0.7.1) (2.8.1)
Requirement already satisfied: numpy>=1.16.5 in /opt/anaconda3/lib/python3.8/site-packages (from pandas->pygetpapers==0.0.7.1) (1.19.1)
Requirement already satisfied: six>=1.5 in /opt/anaconda3/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas->pygetpapers==0.0.7.1) (1.15.0)
Building wheels for collected packages: pygetpapers
  Building wheel for pygetpapers (setup.py) ... done
  Created wheel for pygetpapers: filename=pygetpapers-0.0.7.1-py3-none-any.whl size=36161 sha256=62d3d9f28bc659ab9fc1bcb48153578b9ff620694bdb442a3ab3fb5079972af3
  Stored in directory: /private/var/folders/ft/7j605bsd10l0ftqygyxjjflh0000gq/T/pip-ephem-wheel-cache-rkd5_4kk/wheels/ba/0b/2b/8b106dfd2ba44ce9bf97af89862b616347b83d87e3bf2f6ed5
Successfully built pygetpapers
Installing collected packages: humanfriendly, coloredlogs, pygetpapers
Successfully installed coloredlogs-15.0.1 humanfriendly-9.2 pygetpapers-0.0.7.1

verify installation

(base) pm286macbook:pyami pm286$ pygetpapers
usage: pygetpapers [-h] [--config CONFIG] [-v] [-q QUERY] [-o OUTPUT] [--save_query] [-x] [-p] [-s]
                   [-z] [--references REFERENCES] [-n] [--citations CITATIONS] [-l LOGLEVEL]
                   [-f LOGFILE] [-k LIMIT] [-r RESTART] [-u UPDATE] [--onlyquery] [-c] [--makehtml]
                   [--synonym] [--startdate STARTDATE] [--enddate ENDDATE] [--terms TERMS]
                   [--api API] [--filter FILTER]

Welcome to Pygetpapers version 0.0.7.1. -h or --help for help

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       config file path to read query for pygetpapers
  -v, --version         output the version number
  -q QUERY, --query QUERY
                        query string transmitted to repository API. Eg. "Artificial Intelligence" or
                        "Plant Parts". To escape special characters within the quotes, use
                        backslash. Incase of nested quotes, ensure that the initial quotes are
                        double and the qutoes inside are single. For eg: `'(LICENSE:"cc by" OR
                        LICENSE:"cc-by") AND METHODS:"transcriptome assembly"' ` is wrong. We should
                        instead use `"(LICENSE:'cc by' OR LICENSE:'cc-by') AND
                        METHODS:'transcriptome assembly'"`
  -o OUTPUT, --output OUTPUT
                        output directory (Default: Folder inside current working directory named )
  --save_query          saved the passed query in a config file
  -x, --xml             download fulltext XMLs if available or save metadata as XML
  -p, --pdf             download fulltext PDFs if available (only eupmc and arxiv supported)
  -s, --supp            download supplementary files if available (only eupmc supported)
  -z, --zip             download files from ftp endpoint if available (only eupmc supported)
  --references REFERENCES
                        Download references if available. (only eupmc supported)Requires source for
                        references (AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).
  -n, --noexecute       report how many results match the query, but don't actually download
                        anything
  --citations CITATIONS
                        Download citations if available (only eupmc supported). Requires source for
                        citations (AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).
  -l LOGLEVEL, --loglevel LOGLEVEL
                        Provide logging level. Example --log warning
                        <<info,warning,debug,error,critical>>, default='info'
  -f LOGFILE, --logfile LOGFILE
                        save log to specified file in output directory as well as printing to
                        terminal
  -k LIMIT, --limit LIMIT
                        maximum number of hits (default: 100)
  -r RESTART, --restart RESTART
                        Reads the json and makes the xml files. Takes the path to the json as the
                        input (only eupmc supported)
  -u UPDATE, --update UPDATE
                        Updates the corpus by downloading new papers. Takes the path of metadata
                        json file of the orignal corpus as the input. Requires -k or --limit (If not
                        provided, default will be used) and -q or --query (must be provided) to be
                        given. Takes the path to the json as the input.
  --onlyquery           Saves json file containing the result of the query in storage. (only eupmc
                        supported)The json file can be given to --restart to download the papers
                        later.
  -c, --makecsv         Stores the per-document metadata as csv.
  --makehtml            Stores the per-document metadata as html.
  --synonym             Results contain synonyms as well.
  --startdate STARTDATE
                        Gives papers starting from given date. Format: YYYY-MM-DD
  --enddate ENDDATE     Gives papers till given date. Format: YYYY-MM-DD
  --terms TERMS         Location of the txt file which contains terms serperated by a comma which
                        will beOR'ed among themselves and AND'ed with the query
  --api API             API to search [eupmc, crossref,arxiv,biorxiv,medrxiv,rxivist] (default:
                        eupmc)
  --filter FILTER       filter by key value pair (only crossref supported)
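The installed version can also be checked from Python, which is convenient in a scripted acceptance test. A minimal sketch using the standard library (Python 3.8+); the helper name `installed_version` is hypothetical.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Prints the version string if pygetpapers is installed, otherwise None
print(installed_version("pygetpapers"))
```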

Query test

It's critical that getpapers and pygetpapers behave identically - currently they don't. This will confuse people badly.

manual query on EPMC site

  • TPS20 (or "TPS20" or 'TPS20') gave 28 hits
  • checking "free to read" reduced this to 25
  • checking "free to read and use" reduced this to 20

getpapers

  • -q TPS20 -a (or -q "TPS20" -a or -q 'TPS20' -a) gave 28 hits
  • -q TPS20 (or -q "TPS20" or -q 'TPS20') gave 20 hits

pygetpapers

all available papers
pygetpapers -q TPS20 -n -a
usage: pygetpapers [-h] [--config CONFIG] [-v] [-q QUERY] [-o OUTPUT] [--save_query] [-x] [-p] [-s] [-z] [--references REFERENCES] [-n]
                   [--citations CITATIONS] [-l LOGLEVEL] [-f LOGFILE] [-k LIMIT] [-r RESTART] [-u UPDATE] [--onlyquery] [-c] [--makehtml]
                   [--synonym] [--startdate STARTDATE] [--enddate ENDDATE] [--terms TERMS] [--api API] [--filter FILTER]
pygetpapers: error: unrecognized arguments: -a
default papers
(base) pm286macbook:petermr pm286$ pygetpapers -q TPS20 -n
INFO: Final query is TPS20
INFO: Total number of hits for the query are 28

It appears that the pygetpapers default is all papers (i.e. no boxes checked on the EPMC site): it reports the same 28 hits as getpapers with -a, whereas the getpapers default gave 20.

How can one get just the Open Access papers?
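Until a query-level option is documented, one possible workaround is to post-filter the downloaded metadata. A minimal sketch, assuming the per-paper layout produced by pygetpapers (one directory per paper containing eupmc_result.json, with an "isOpenAccess" field under "full"); the function name is hypothetical, and this filters after download rather than restricting the search itself.

```python
import json
from pathlib import Path

def open_access_ids(corpus_dir):
    """Return sorted directory names whose metadata has full.isOpenAccess == "Y"."""
    ids = []
    for path in sorted(Path(corpus_dir).glob("*/eupmc_result.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        if record.get("full", {}).get("isOpenAccess") == "Y":
            ids.append(path.parent.name)
    return ids

# Returns an empty list if the directory does not exist or holds no metadata
print(open_access_ids("2021_07_24_19_18_09"))
```

Note that a record can list free full-text URLs and still carry "isOpenAccess": "N", so this field may be stricter than the "free to read" box on the EPMC site.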

quoting appears to be optional
(base) pm286macbook:petermr pm286$ pygetpapers -q "TPS20" -n
INFO: Final query is TPS20
INFO: Total number of hits for the query are 28
(base) pm286macbook:petermr pm286$ pygetpapers -q 'TPS20' -n
INFO: Final query is TPS20
INFO: Total number of hits for the query are 28
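For an automated comparison of hit counts, the number can be read from the `-n` log line. A minimal sketch, assuming the message wording shown in these transcripts stays stable; `parse_hits` is a hypothetical helper and the sample text is copied from the run above.

```python
import re

# Matches the -n message seen in the transcripts above
HITS = re.compile(r"Total number of hits for the query are (\d+)")

def parse_hits(log_text):
    """Return the hit count from pygetpapers -n output, or None if absent."""
    match = HITS.search(log_text)
    return int(match.group(1)) if match else None

sample = (
    "INFO: Final query is TPS20\n"
    "INFO: Total number of hits for the query are 28\n"
)
print(parse_hits(sample))  # 28
```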

booleans

(base) pm286macbook:petermr pm286$ pygetpapers -q 'TPS20' -n
INFO: Final query is TPS20
INFO: Total number of hits for the query are 28
(base) pm286macbook:petermr pm286$ pygetpapers -q '(TPS20) and (TPS21)' -n
INFO: Final query is (TPS20) and (TPS21)
INFO: Total number of hits for the query are 9
(base) pm286macbook:petermr pm286$ pygetpapers -q '(TPS20) NOT (TPS21)' -n
INFO: Final query is (TPS20) NOT (TPS21)
INFO: Total number of hits for the query are 19
(base) pm286macbook:petermr pm286$ pygetpapers -q '(TPS20) OR (TPS21)' -n
INFO: Final query is (TPS20) OR (TPS21)
INFO: Total number of hits for the query are 106
(base) pm286macbook:petermr pm286$ pygetpapers -q '(TPS21)' -n
INFO: Final query is (TPS21)
INFO: Total number of hits for the query are 87
(base) pm286macbook:petermr pm286$ pygetpapers -q '(TPS21) NOT (TPS20)' -n
INFO: Final query is (TPS21) NOT (TPS20)
INFO: Total number of hits for the query are 78

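These are all consistent: 28 (TPS20) = 9 (both) + 19 (TPS20 only), 87 (TPS21) = 9 (both) + 78 (TPS21 only), and 106 (either) = 19 + 9 + 78.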
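The consistency of the Boolean counts can be checked mechanically with set arithmetic. A minimal sketch using the hit counts reported above; the dictionary keys are just labels chosen for this example.

```python
# Hit counts copied from the transcript above
hits = {
    "A": 28,         # TPS20
    "B": 87,         # TPS21
    "A and B": 9,
    "A or B": 106,
    "A not B": 19,
    "B not A": 78,
}

checks = {
    "A = (A and B) + (A not B)": hits["A"] == hits["A and B"] + hits["A not B"],
    "B = (A and B) + (B not A)": hits["B"] == hits["A and B"] + hits["B not A"],
    "A or B = A + B - (A and B)": hits["A or B"] == hits["A"] + hits["B"] - hits["A and B"],
}

for rule, ok in checks.items():
    print(f"{rule}: {'ok' if ok else 'INCONSISTENT'}")
```

All three identities hold for these counts, so AND, OR and NOT behave as set operations here.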

saving files

(base) pm286macbook:petermr pm286$ pygetpapers -q TPS30 

This should download metadata only, into a computed subdirectory.

INFO: Final query is TPS30
INFO: Total Hits are 18
WARNING: Could not find more papers
0it [00:00, ?it/s]WARNING: Keywords not found for paper 4
WARNING: Keywords not found for paper 5
WARNING: Keywords not found for paper 10
WARNING: html url not found for paper 11
WARNING: Keywords not found for paper 11
WARNING: pdf url not found for paper 11
WARNING: html url not found for paper 12
WARNING: Keywords not found for paper 12
WARNING: pdf url not found for paper 12
WARNING: html url not found for paper 13
WARNING: Abstract not found for paper 13
WARNING: Keywords not found for paper 13
WARNING: pdf url not found for paper 13
WARNING: Author list not found for paper 13
WARNING: html url not found for paper 14
WARNING: Abstract not found for paper 14
WARNING: Keywords not found for paper 14
WARNING: pdf url not found for paper 14
WARNING: Author list not found for paper 14
WARNING: html url not found for paper 15
WARNING: Keywords not found for paper 15
WARNING: pdf url not found for paper 15
1it [00:00, 140.17it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:06<00:00,  2.37it/s]
(base) pm286macbook:petermr pm286$ 

Files created:

(base) pm286macbook:petermr pm286$ tree 2021_07_24_19_18_09
2021_07_24_19_18_09
├── PMC1211283
│   └── eupmc_result.json
├── PMC1211284
│   └── eupmc_result.json
├── PMC212153
│   └── eupmc_result.json
├── PMC214377
│   └── eupmc_result.json
├── PMC3268506
│   └── eupmc_result.json
├── PMC4457800
│   └── eupmc_result.json
├── PMC5122590
│   └── eupmc_result.json
├── PMC5161391
│   └── eupmc_result.json
├── PMC5655044
│   └── eupmc_result.json
├── PMC6266747
│   └── eupmc_result.json
├── PMC6742361
│   └── eupmc_result.json
├── PMC7305226
│   └── eupmc_result.json
├── PMC7600171
│   └── eupmc_result.json
├── PMC8036305
│   └── eupmc_result.json
├── PMC8201348
│   └── eupmc_result.json
└── eupmc_results.json

15 directories, 16 files

Note
  • one overall eupmc_results.json and 15 directories, each with an individual eupmc_result.json
  • typical metadata eupmc_result.json (split after commas):
(base) pm286macbook:petermr pm286$ more 2021_07_24_19_18_09/PMC1211283/eupmc_result.json 
{"downloaded": false,
 "htmlmade": false,
 "full": {"id": "17248343",
 "source": "MED",
 "pmid": "17248343",
 "pmcid": "PMC1211283",
 "fullTextIdList": {"fullTextId": "PMC1211283"},
 "title": "Nonrandom Location of Temperature-Sensitive Mutants on the Linkage Map of STREPTOMYCES COELICOLOR.",
 "authorString": "Hopwood DA.",
 "authorList": {"author": {"fullName": "Hopwood DA",
 "firstName": "D A",
 "lastName": "Hopwood",
 "initials": "DA",
 "authorAffiliationDetailsList": {"authorAffiliation": {"affiliation": "Department of Genetics,
 University of Glasgow,
 Glasgow,
 Scotland."}}}},
 "journalInfo": {"issue": "5",
 "volume": "54",
 "journalIssueId": "1374536",
 "dateOfPublication": "1966 Nov",
 "monthOfPublication": "11",
 "yearOfPublication": "1966",
 "printPublicationDate": "1966-11-01",
 "journal": {"title": "Genetics",
 "ISOAbbreviation": "Genetics",
 "medlineAbbreviation": "Genetics",
 "NLMid": "0374636",
 "ISSN": "0016-6731",
 "ESSN": "1943-2631"}},
 "pubYear": "1966",
 "pageInfo": "1169-1176",
 "affiliation": "Department of Genetics,
 University of Glasgow,
 Glasgow,
 Scotland.",
 "publicationStatus": "ppublish",
 "language": "eng",
 "pubModel": "Print",
 "pubTypeList": {"pubType": ["research-article",
 "Journal Article"]},
 "fullTextUrlList": {"fullTextUrl": [{"availability": "Free",
 "availabilityCode": "F",
 "documentStyle": "html",
 "site": "PubMedCentral",
 "url": "https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/17248343/?tool=EBI"},
 {"availability": "Free",
 "availabilityCode": "F",
 "documentStyle": "pdf",
 "site": "PubMedCentral",
 "url": "https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/17248343/pdf/?tool=EBI"},
 {"availability": "Free",
 "availabilityCode": "F",
 "documentStyle": "html",
 "site": "Europe_PMC",
 "url": "https://europepmc.org/articles/PMC1211283"},
 {"availability": "Free",
 "availabilityCode": "F",
 "documentStyle": "pdf",
 "site": "Europe_PMC",
 "url": "https://europepmc.org/articles/PMC1211283?pdf=render"}]},
 "isOpenAccess": "N",
 "inEPMC": "Y",
 "inPMC": "Y",
 "hasPDF": "Y",
 "hasBook": "N",
 "hasSuppl": "N",
 "citedByCount": "7",
 "hasData": "N",
 "hasReferences": "Y",
 "hasTextMinedTerms": "Y",
 "hasDbCrossReferences": "N",
 "hasLabsLinks": "N",
 "authMan": "N",
 "epmcAuthMan": "N",
 "nihAuthMan": "N",
 "hasTMAccessionNumbers": "N",
 "dateOfCompletion": "2010-06-28",
 "dateOfCreation": "1966-11-01",
 "firstIndexDate": "2010-09-16",
 "fullTextReceivedDate": "2020-07-09",
 "dateOfRevision": "2018-11-13",
 "firstPublicationDate": "1966-11-01"},
 "journaltitle": "Genetics",
 "title": "Nonrandom Location of Temperature-Sensitive Mutants on the Linkage Map of STREPTOMYCES COELICOLOR."}

Downloading content

21 directories, 55 files
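(base) pm286macbook:petermr pm286$ pygetpapers -q TPS30 -o 2021_07_24_19_18_09/ -x -p -s --save_query

Are all the messages correct? The run warns "Pdf is not supported for this api", but PDF download is supported for eupmc and PDFs were in fact downloaded.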

WARNING: Pdf is not supported for this api
INFO: Final query is TPS30
INFO: Total Hits are 18
WARNING: Could not find more papers
0it [00:00, ?it/s]WARNING: Keywords not found for paper 4
WARNING: Keywords not found for paper 5
WARNING: Keywords not found for paper 10
WARNING: html url not found for paper 11
WARNING: Keywords not found for paper 11
WARNING: pdf url not found for paper 11
WARNING: html url not found for paper 12
WARNING: Keywords not found for paper 12
WARNING: pdf url not found for paper 12
WARNING: html url not found for paper 13
WARNING: Abstract not found for paper 13
WARNING: Keywords not found for paper 13
WARNING: pdf url not found for paper 13
WARNING: Author list not found for paper 13
WARNING: html url not found for paper 14
WARNING: Abstract not found for paper 14
WARNING: Keywords not found for paper 14
WARNING: pdf url not found for paper 14
WARNING: Author list not found for paper 14
WARNING: html url not found for paper 15
WARNING: Keywords not found for paper 15
WARNING: pdf url not found for paper 15
1it [00:00, 189.82it/s]
INFO: Saving XML files to /Users/pm286/workspace/pygetpapers/usertests/petermr/2021_07_24_19_18_09/*/fulltext.xml
  0%|                                                                                                                      | 0/15 [00:00<?, ?it/s]WARNING: supplementary files not found for PMC7600171
  7%|███████▎                                                                                                      | 1/15 [00:02<00:41,  2.96s/it]INFO: Wrote supplementary files for supplementary
 13%|█████████

files created/downloaded

pm286macbook:petermr pm286$ tree | more
.
├── 2021_07_24_19_18_09
│   ├── PMC1211283
│   │   ├── eupmc_result.json
│   │   └── fulltext.xml
│   ├── PMC1211284
│   │   ├── eupmc_result.json
│   │   └── fulltext.xml
│   ├── PMC212153
│   │   ├── eupmc_result.json
│   │   └── fulltext.xml
│   ├── PMC214377
│   │   ├── eupmc_result.json
│   │   └── fulltext.xml
│   ├── PMC3268506
│   │   ├── eupmc_result.json
│   │   └── fulltext.xml
│   ├── PMC4457800
│   │   ├── eupmc_result.json
│   │   ├── fulltext.pdf
│   │   ├── fulltext.xml
│   │   └── supplementaryfiles
│   │       ├── pone.0128808.s001.xls
│   │       ├── pone.0128808.s002.xls
│   │       ├── pone.0128808.s003.xls
│   │       ├── pone.0128808.s004.xls
│   │       └── pone.0128808.s005.xls
│   ├── PMC5122590
│   │   ├── eupmc_result.json
│   │   ├── fulltext.pdf
│   │   ├── fulltext.xml
│   │   └── supplementaryfiles
│   │       └── Data_Sheet_1.pdf
│   ├── PMC5161391
│   │   ├── eupmc_result.json
│   │   ├── fulltext.pdf
│   │   └── fulltext.xml
│   ├── PMC5655044
│   │   ├── eupmc_result.json
│   │   ├── fulltext.pdf
│   │   ├── fulltext.xml
│   │   └── supplementaryfiles
│   │       ├── AAC.00959-17_zac011176606s1.pdf
│   │       └── supp_61_11_e00959-17__index.html
│   ├── PMC6266747
│   │   ├── eupmc_result.json
│   │   ├── fulltext.pdf
│   │   └── fulltext.xml
│   ├── PMC6742361
│   │   ├── eupmc_result.json
│   │   ├── fulltext.pdf
│   │   ├── fulltext.xml
│   │   └── supplementaryfiles
│   │       ├── pone.0222363.s001.tif
│   │       ├── pone.0222363.s002.txt
│   │       └── pone.0222363.s003.tsv
│   ├── PMC7305226
│   │   ├── eupmc_result.json
│   │   ├── fulltext.pdf
│   │   ├── fulltext.xml
│   │   └── supplementaryfiles
│   │       └── 41598_2020_66866_MOESM1_ESM.pdf
│   ├── PMC7600171
│   │   ├── eupmc_result.json
│   │   ├── fulltext.pdf
│   │   └── fulltext.xml
│   ├── PMC8036305
│   │   ├── eupmc_result.json
│   │   ├── fulltext.pdf
│   │   └── fulltext.xml
│   ├── PMC8201348
│   │   ├── eupmc_result.json
│   │   ├── fulltext.pdf
│   │   └── fulltext.xml
│   ├── eupmc_results.json
│   └── saved_config.ini
└── README.md

Notes

  • some papers have no PDF or no supplementary files
  • the naming/numbering of supplementary files is publisher-dependent.
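A per-paper summary makes these gaps easier to spot than reading the tree by eye. A minimal sketch, assuming the file names seen in the tree above (fulltext.xml, fulltext.pdf and a supplementaryfiles directory); `summarize` is a hypothetical helper.

```python
from pathlib import Path

def summarize(corpus_dir):
    """Map each paper directory to flags for XML/PDF and a supplementary file count."""
    rows = {}
    for paper in sorted(p for p in Path(corpus_dir).iterdir() if p.is_dir()):
        supp = paper / "supplementaryfiles"
        rows[paper.name] = {
            "xml": (paper / "fulltext.xml").is_file(),
            "pdf": (paper / "fulltext.pdf").is_file(),
            "supp": sum(1 for f in supp.iterdir() if f.is_file()) if supp.is_dir() else 0,
        }
    return rows

corpus = Path("2021_07_24_19_18_09")
if corpus.is_dir():  # only report when the corpus directory is present
    for name, row in summarize(corpus).items():
        print(name, row)
```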

config/query files

The --save_query option created a saved_config.ini file, containing:

more 2021_07_24_19_18_09/saved_config.ini 
[SAVED]
config = None
version = False
query = TPS30
output = 2021_07_24_19_18_09/
save_query = True
xml = True
pdf = True
supp = True
zip = False
references = False
noexecute = False
citations = False
loglevel = info
logfile = False
limit = 100
restart = False
update = False
onlyquery = False
makecsv = False
makehtml = False
synonym = False
startdate = False
enddate = False
terms = False
api = eupmc
filter = None
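The saved file is plain INI, so it can be inspected with Python's standard configparser. A minimal sketch, using an abridged copy of the file above as an inline string; it only reads the values and makes no claim about how pygetpapers itself loads the file.

```python
import configparser

# Abridged copy of the saved_config.ini shown above
SAMPLE = """[SAVED]
query = TPS30
output = 2021_07_24_19_18_09/
xml = True
pdf = True
limit = 100
api = eupmc
filter = None
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)
saved = parser["SAVED"]

print(saved["query"], saved.getboolean("xml"), saved.getint("limit"))  # TPS30 True 100
# Unset options are stored as the literal strings "None" or "False"
print(saved["filter"] == "None")  # True
```

The unset values being written as the strings "None" and "False" is worth documenting, since a reader of the file has to convert them back explicitly.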

references and citations

I am not familiar with most of these sources, so I only used MED.

pm286macbook:petermr pm286$ pygetpapers -q TPS31 -o TPS31 --references MED --citations MED 
INFO: Final query is TPS31
INFO: Total Hits are 11
WARNING: Could not find more papers
0it [00:00, ?it/s]WARNING: html url not found for paper 1
WARNING: pdf url not found for paper 1
WARNING: Keywords not found for paper 5
WARNING: html url not found for paper 6
WARNING: pdf url not found for paper 6
WARNING: Keywords not found for paper 10
WARNING: Keywords not found for paper 11
1it [00:00, 205.22it/s]

outputs

tree TPS31
TPS31
├── PMC3193516
│   ├── citation.xml
│   ├── eupmc_result.json
│   └── references.xml
├── PMC3997964
│   ├── citation.xml
│   ├── eupmc_result.json
│   └── references.xml
├── PMC4457800
│   ├── citation.xml
│   ├── eupmc_result.json
│   └── references.xml
├── PMC5122590
│   ├── citation.xml
│   ├── eupmc_result.json
│   └── references.xml
├── PMC5378189
│   ├── citation.xml
│   ├── eupmc_result.json
│   └── references.xml
├── PMC6360234
│   ├── citation.xml
│   ├── eupmc_result.json
│   └── references.xml
├── PMC7049213
│   ├── citation.xml
│   ├── eupmc_result.json
│   └── references.xml
├── PMC7214349
│   ├── citation.xml
│   ├── eupmc_result.json
│   └── references.xml
├── PMC7304153
│   ├── citation.xml
│   ├── eupmc_result.json
│   └── references.xml
├── PMC7422722
│   ├── citation.xml
│   ├── eupmc_result.json
│   └── references.xml
├── PMC8002989
│   ├── citation.xml
│   ├── eupmc_result.json
│   └── references.xml
└── eupmc_results.json

11 directories, 34 files

still to test

  -z, --zip             download files from ftp endpoint if available (only eupmc supported)
  -l LOGLEVEL, --loglevel LOGLEVEL
  -f LOGFILE, --logfile LOGFILE
  -k LIMIT, --limit LIMIT
  -r RESTART, --restart RESTART
  -u UPDATE, --update UPDATE
  --onlyquery           Saves json file containing the result of the query in storage. (only eupmc
                        supported)
  -c, --makecsv         Stores the per-document metadata as csv.
  --makehtml            Stores the per-document metadata as html.
  --synonym             Results contain synonyms as well.
  --startdate STARTDATE
  --enddate ENDDATE     Gives papers till given date. Format: YYYY-MM-DD
  --terms TERMS         Location of the txt file which contains terms serperated by a comma which
                        will beOR'ed among themselves and AND'ed with the query
  --api API             API to search [eupmc, crossref,arxiv,biorxiv,medrxiv,rxivist] (default:
                        eupmc)
  --filter FILTER       filter by key value pair (only crossref supported)

Suggestions for additional documentation
