Micro-API to query a COVID-19 preprint's publication status
The micro-API service, hosted on PythonAnywhere, offers a free, programmatic way for a client to query the publication status of a COVID-19 preprint (or a batch of COVID-19 preprints). The API returns a JSON dictionary with CORD-19's enhanced metadata. If the preprint's publication status is positive, the returned JSON also contains metadata pertaining to the published article. The UoP API, at least in its current implementation, requires no keys or authentication, but please be mindful of the fact that this is an unfunded initiative run by a single person.
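For example, a minimal Python query for a single preprint (using the example DOI and route documented below; the User-Agent value is a placeholder, see the etiquette note further down) could look like this:

import requests

# Ask UoP for the publication status of one COVID-19 preprint, identified by its DOI
url = 'http://heibufan.pythonanywhere.com/json/pp_meta/10.1101/2020.03.19.998179'
response = requests.get(url, headers={'User-Agent': 'your_name (your_email@example.org)'})
response.raise_for_status()

metadata = response.json()['result']
print(metadata['match_status'])  # True if the preprint has a peer-reviewed counterpart
print(metadata['doi_pr'])        # DOI of the published version, when one exists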
This tool is currently in beta and in active development.
The current version (vbeta) covers three preprint servers: arXiv, bioRxiv, and medRxiv. I use the COVID-19 Open Research Dataset (CORD-19) to match preprints from those three repositories with their final published counterparts. As we know, there are dozens upon dozens of online preprint repositories, so one of my first goals is to extend UoP's repository coverage. The NIH iSearch COVID-19 publication database is a great first step in that direction since it includes three preprint servers not covered by the CORD-19 dataset, namely ChemRxiv, SSRN, and ResearchSquare.
Adding your preprint server to UoP
If you are the manager/admin/developer in charge of a preprint repository and you would like to see 'your' COVID-19 preprint manuscripts' metadata added to the UoP API, please email me. In a nutshell, what I would need is a CSV containing all of your COVID-19 preprints' metadata (title, authors, DOI, etc.) using CORD-19's metadata formatting. I am aware that some preprint repositories have API capabilities, but at this point I am NOT planning to extend UoP's coverage by scraping websites or building API pipelines: I have neither the resources nor the time to build that kind of data architecture.
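As a rough sketch of the kind of file I have in mind (the column names below are an assumption based on CORD-19's 2020 metadata.csv releases and may differ from the exact release UoP is matched against; the filename is hypothetical), a quick format check could look like:

import csv

# Columns assumed from CORD-19's metadata.csv (2020 releases) -- adjust to the actual release
EXPECTED_COLUMNS = [
    'cord_uid', 'sha', 'source_x', 'title', 'doi', 'pmcid', 'pubmed_id',
    'license', 'abstract', 'publish_time', 'authors', 'journal', 'url',
]

# 'my_repo_covid19_preprints.csv' is a hypothetical export from your repository
with open('my_repo_covid19_preprints.csv', newline='', encoding='utf-8') as fh:
    header = next(csv.reader(fh))

print('missing columns:', [col for col in EXPECTED_COLUMNS if col not in header])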
Adding functionalities to UoP
The current querying functions of the API are rather limited (see documentation below). If you have suggestions for functions you would like to see added, please feel free to contact me.
- For more details on the methodology used to determine preprints' publication status, please see my medRxiv preprint paper here.
As the COVID-19 pandemic persists around the world, the scientific community continues to produce and circulate knowledge on the deadly disease at an unprecedented rate. During the early stage of the pandemic, preprints represented nearly 40% of the English-language COVID-19 scientific corpus (6,000+ preprints | 16,000+ articles). As of mid-August 2020, that proportion had dropped to around 28% (13,000+ preprints | 49,000+ articles). Nevertheless, preprint servers remain a key engine in the efficient dissemination of scientific work on this infectious disease. But, given the ‘uncertified’ nature of the scientific manuscripts curated on preprint repositories, their integration into the global ecosystem of scientific communication creates serious tensions. This is especially the case for biomedical knowledge, since the dissemination of bad science can have widespread societal consequences.
In the spirit of open science, and especially in the context of the COVID-19 pandemic, I developed this free API. I am running it out of my own pocket. My current plan with PythonAnywhere allows for 100,000 API queries per day. I strongly encourage intelligent and mindful use: don't query the same data point over and over, and don't use overkill parallel processing that will overload the server. If you notice that your requests have stopped working, just stop your program. Finally, please use a User-Agent header that identifies you as a user, including your email. I reserve the right to restrict or block clients that do not follow this etiquette.
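As a sketch of what a mindful client could look like (the User-Agent value, the in-memory cache, and the one-second pause are illustrative choices, not requirements of the API):

import time
import requests

UOP_BASE = 'http://heibufan.pythonanywhere.com/json/pp_meta/'

# Identify yourself so I can reach you if your traffic becomes a problem
HEADERS = {'User-Agent': 'jane-doe-covid-study (jane.doe@example.org)'}

_cache = {}  # avoid querying the same DOI more than once

def get_publication_status(doi):
    if doi not in _cache:
        response = requests.get(UOP_BASE + doi, headers=HEADERS)
        response.raise_for_status()
        _cache[doi] = response.json()['result']
        time.sleep(1)  # space out requests: this runs on a small, unfunded server
    return _cache[doi]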
I relied on "rxivist.org/docs" to write this section. Please note that Rxivist draws on the Crossref API documentation to craft its own.
If you use UoP data in your research, please cite:
Upload-or-Publish: Lachapelle, F. (2020). COVID-19 Preprints and Their Publishing Rate: An Improved Method. medRxiv. 1-34. doi: https://doi.org/10.1101/2020.09.04.20188771.
CORD-19 Project: Wang, L. L., Lo, K., Chandrasekhar, Y., Reas, R., Yang, J., Eide, D., ... & Mooney, P. (2020). CORD-19: The COVID-19 Open Research Dataset. arXiv.
The current beta version allows only one route, "http://heibufan.pythonanywhere.com/json/pp_meta/doi", where doi stands for the preprint's DOI.
For example, if a client wants to determine the publication status of a specific COVID-19 preprint using its DOI, the URL would be: http://heibufan.pythonanywhere.com/json/pp_meta/10.1101/2020.03.19.998179
The returned JSON will look like this:
The most important returned metadata is match_status: True = the preprint has a peer-reviewed published counterpart; False = it doesn't have one (see documentation below).
{"result": {"indx_pp": 11811,
"indx_pr": 26,
"ti_pp": "molecular characterization of sars-cov-2 in the first covid-19 cluster in france reveals an amino-acid deletion in nsp2 (asp268del)",
"ti_pr": "molecular characterization of sars-cov-2 in the first covid-19 cluster in france reveals an amino acid deletion in nsp2 (asp268del)",
"fuzz_score": 99,
"no_fuzz_test": 1,
"no_fuzz_test_above": 1,
"prop_au_match": 1.0,
"z_fuzzy_test_history": [],
"au_pp": "Bal, Antonin; Destras, Gru00e9gory; Gaymard, Alexandre; Bouscambert-Duchamp, Maude; Valette, Martine; Escuret, Vanessa; Frobert, Emilie; Billaud, Geneviu00e8ve; Trouillet-Assant, Sophie; Cheynet, Valu00e9rie; Brengel-Pesce, Karen; Morfin, Florence; Lina, Bruno; Josset, Laurence",
"au_pr": "Bal, A.; Destras, G.; Gaymard, A.; Bouscambert-Duchamp, M.; Valette, M.; Escuret, V.; Frobert, E.; Billaud, G.; Trouillet-Assant, S.; Cheynet, V.; Brengel-Pesce, K.; Morfin, F.; Lina, B.; Josset, L.",
"source_x_pp": "biorxiv",
"source_x_pr": "pmc",
"journal_pp": "bioRxiv",
"journal_pr": "Clin Microbiol Infect",
"pub_time_pp": "3/21/2020",
"pub_time_pr": "3/28/2020",
"cord_uid_pp": "wnh6h9f0",
"cord_uid_pr": "4c0zwhdh",
"sha_pp": NaN,
"sha_pr": NaN,
"pmcid_pp": NaN,
"pmcid_pr": "PMC7142683",
"pubmedid_pp": NaN,
"pubmedid_pr": 32234449.0,
"doi_pp": "10.1101/2020.03.19.998179",
"doi_pr": "10.1016/j.cmi.2020.03.020",
"diff_day": 7,
"internal_method": "fuzzy",
"match_status": true,
"cord_19_version": "2020_08_12",
"fuzzy_matching_date": "2020_08_12"}}
import json
import requests

UoP_url_base = 'http://heibufan.pythonanywhere.com/json/pp_meta/'

# Preprint DOIs to query (add further DOIs to this list)
l_pp_to_query = ['10.1101/2020.03.19.998179']

for pp_doi in l_pp_to_query:
    url_query = f'{UoP_url_base}{pp_doi}'
    raw_data = requests.get(url_query)
    # The API answers with HTTP 200 on success; anything else is treated as an error
    if raw_data.status_code != 200:
        raise Exception(f'HTTP code {raw_data.status_code}')
    json_data = json.loads(raw_data.text)
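Note that the payload is nested under the "result" key, so individual fields are read as, for example, json_data['result']['match_status'].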
The most important returned metadata is match_status: True = the preprint has a peer-reviewed published counterpart; False = it doesn't have one.
'pp' stands for preprint
'pr' stands for peer-review
'(c)' indicates that the metadata comes from the CORD-19 dataset
"indx_pp": internal working id for admin
"indx_pr": internal working id for admin
"ti_pp": title of preprint (c)
"ti_pr": title of published/peer-review counterpart (c)
"fuzz_score": fuzzy logic score yields from comparing both titles
"no_fuzz_test": total number (raw) of fuzzy mathing score produced (see methods)
"no_fuzz_test_above": total number of fuzzy matching produced that were above the cut-off point of 0.60
"prop_au_match": proportion of preprint's authors' last names that was found in the list of authors of the pr article
"z_fuzzy_test_history": list/array of results of all fuzzy matching tests performed if >1
"au_pp": list of authors (preprint) (c)
"au_pr": list of authors (published version) (c)
"source_x_pp": bibliometric source where CORD-19 got metadata from (c)
"source_x_pr": ibid.
"journal_pp": journal venue (c)
"journal_pr": ibid.
"pub_time_pp": date of preprint upload (c); note: a preprint can have multiple uploaded versions. Still need to validate that CORD-19 always use v1 date
"pub_time_pr": date of peer-reviewed article's publication (c)
"cord_uid_pp": preprint's id (c) - note: not a unique id
"cord_uid_pr": peer-review's id (c) - note: not a unique id
"sha_pp": id and name of pdf json of preprint (c)
"sha_pr": id and name of pdf json of article (c)
"pmcid_pp": pub med central id (c)
"pmcid_pr": ibid.
"pubmedid_pp": pub med id (c)
"pubmedid_pr": pub med id (c)
"doi_pp": digital unique identifier note: arXiv doesnt automatically generate doi for the preprint manuscripts its curated. (see methods)
"doi_pr": ibid.
"diff_day": difference in day between preprint upload and final publication
"internal_method": (see methods)
"match_status": True: pp has a pr, False: pp has no pr
"cord_19_version": version of CORD-19 dataset used for matching algo.
"fuzzy_matching_date": date when the fuzzy matching code was performed
Please feel free to email me if you have any questions or if you are interested in contributing.
- Francois Lachapelle - subFIELD.lab | PhD Cand. University of British Columbia, Vancouver, Canada
This project is licensed under the Apache License 2.0 - see the LICENSE.md file for details
At its foundation, this project is a Record Linkage initiative. Therefore, it would not be possible without the great work of researchers at:
CORD-19 Project: Wang, L. L., Lo, K., Chandrasekhar, Y., Reas, R., Yang, J., Eide, D., ... & Mooney, P. (2020). CORD-19: The COVID-19 Open Research Dataset. arXiv.
For advisory support:
- Adam Howe - Statistics Canada | UBC
For inspirational support (as in, ah ok, building an API is cool and doable):
- Elian Carsenat - NAMSOR
- Abdill RJ, Blekhman R. - Rxivist
For moral support:
- Heather Thom - Simon Fraser University
- Philippe Lachapelle - Centre Hospitalier Universitaire de Quebec