-
Script = STEP 1
-
Summary: Contains the query for retrieving records and the code for getting the abstracts; some fussing was needed to get abstracts for records from other sources (there is definitely a better way to do this) - the API and package themselves work great. A sketch of this kind of retrieval is below.
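For illustration only, a minimal sketch of this kind of retrieval, assuming the NCBI E-utilities API; the actual query, database, and package are the ones in the script.

    # Hypothetical sketch of a STEP 1-style retrieval, assuming the NCBI
    # E-utilities API (esearch to find IDs, efetch to pull abstracts).
    import requests

    EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

    def search_records(query, retmax=100):
        """Return a list of PubMed IDs matching the query."""
        r = requests.get(f"{EUTILS}/esearch.fcgi",
                         params={"db": "pubmed", "term": query,
                                 "retmax": retmax, "retmode": "json"})
        r.raise_for_status()
        return r.json()["esearchresult"]["idlist"]

    def fetch_abstract(pmid):
        """Fetch the plain-text abstract for one PubMed ID."""
        r = requests.get(f"{EUTILS}/efetch.fcgi",
                         params={"db": "pubmed", "id": pmid,
                                 "rettype": "abstract", "retmode": "text"})
        r.raise_for_status()
        return r.text.strip()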
-
Script = STEP 2
-
Summary: Contains expected keywords that can be used as classifiers, based on observations from manual curation (e.g. 'database' = positive, 'package' = negative). Titles were tested via a crude cascade against a set of 638 manually curated records (variable hji_recheck) - looks promising. Scripts for abstracts are not complete. Note that the classifiers for titles vs. abstracts differ (e.g. the abstract classifiers are more numerous and more granular). The title classifiers tested are directly in the code and the abstract classifiers are in a file. All preliminary. A toy sketch of the cascade idea is below.
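As a hedged illustration of the cascade idea only - the keyword lists below are made up, not the curated classifiers from the code or file:

    # Toy keyword cascade over titles; the lists are placeholders.
    POSITIVE = ["database", "repository", "web server"]  # assumed examples
    NEGATIVE = ["package", "software", "algorithm"]      # assumed examples

    def classify_title(title):
        """Return 'positive', 'negative', or 'unknown' on first keyword hit."""
        t = title.lower()
        for kw in NEGATIVE:  # check order is arbitrary in this sketch
            if kw in t:
                return "negative"
        for kw in POSITIVE:
            if kw in t:
                return "positive"
        return "unknown"

    print(classify_title("FlyBase: a database of Drosophila genes"))  # positive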
-
Note: file "manual_checks_hji_2021-08-20.csv" is the full set of 638 manually curated records, and file "manual_checks_hji_cec_2021-07-28.xlsx" is a subset of 20 that were especially tricky - reviewed by both Chuck Cook (cec_manual_review) and Heidi Imker (hji_manual_review); see those columns for notes. A loading example is below.
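For example, the files can be loaded and the two reviewers' calls compared like this (a sketch; any structure beyond the two named columns is an assumption):

    # Sketch of loading the manual-check files; needs pandas (and
    # openpyxl for the .xlsx). Only the two reviewer columns are known.
    import pandas as pd

    full = pd.read_csv("manual_checks_hji_2021-08-20.csv")           # 638 records
    tricky = pd.read_excel("manual_checks_hji_cec_2021-07-28.xlsx")  # 20 records

    # Rows where the two reviewers disagreed on the tricky subset.
    disagree = tricky[tricky["cec_manual_review"] != tricky["hji_manual_review"]]
    print(len(full), len(tricky), len(disagree))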
-
Script = STEP N
-
Summary: A challenge will be getting the resource name - sometimes names are contained within the URL. This was early cleaning/processing and testing of URLs to see if they'll be useful - looks promising. Scripts partially work. This will sit somewhere downstream in the workflow, hence "STEP N". See the URL sketch below.
- Also tested against data from Wren et al. 2017 (doi:10.1093/nar/gkx182), did some testing of stopwords, and used the re3data API for registered data resources (some may count for this inventory and some may not); a re3data sketch follows the URL sketch below.
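A minimal sketch of pulling a candidate resource name out of a URL, assuming the name often shows up as the first label of the hostname; this is illustrative, not the actual cleaning logic in the script:

    # Toy example: guess a resource name from a URL's hostname.
    from urllib.parse import urlparse

    def candidate_name(url):
        """Guess a name, e.g. 'flybase' from 'http://www.flybase.org/'."""
        parsed = urlparse(url if "//" in url else "//" + url)
        host = parsed.netloc.lower()
        if host.startswith("www."):  # drop a prefix that carries no name
            host = host[len("www."):]
        return host.split(".")[0] or None

    print(candidate_name("http://www.flybase.org/reports/FBgn0000490"))  # flybase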
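And a hedged sketch of listing registered repositories via the public re3data v1 API (https://www.re3data.org/api/doc); the XML element names here are assumptions:

    # Sketch: fetch the re3data repository list and pull out names.
    import requests
    import xml.etree.ElementTree as ET

    resp = requests.get("https://www.re3data.org/api/v1/repositories")
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    names = [repo.findtext("name") for repo in root.iter("repository")]
    print(len(names), "registered repositories; first few:", names[:5])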