
Processing all data, not just a subset, in step 2 (full-texts --> JSON data structure) #27

Open
npscience opened this issue Jun 11, 2018 · 1 comment


npscience commented Jun 11, 2018

I'm running through the pipeline to see if it all works locally (see issue #23) and I think there is a problem with step 2 (cc @npch), as follows:

When using the EuPMCCodeReferences notebook to process the full-texts from the getpapers text-mining step and extract URLs into a JSON data structure (with paper DOIs, etc.), the notebook gets stuck at In[5]:

KeyError                                  Traceback (most recent call last)
<ipython-input-21-d81bdb32fc4e> in <module>()
      1 # Process the papers and extract all the references to GitHub and Zenodo urls
----> 2 papers_info = process_eupmc.process_papers(paper_ids, data_dir)

[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in process_papers(list_of_pmcids, data_dir)
     97
     98     for pmcid in list_of_pmcids:
---> 99         papers.append(process_paper(pmcid, data_dir))
    100
    101     return papers

[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in process_paper(pmcid, data_dir)
     66             paper_json = json.load(f)
     67             # Get the DOI
---> 68             doi = get_doi(paper_json)
     69             pub_date = get_pub_date(paper_json)
     70     except IOError:

[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in get_doi(paper_json)
     29
     30 def get_doi(paper_json):
---> 31     paper_doi = paper_json['doi'][0]
     32     return paper_doi
     33

KeyError: 'doi'

I think this is because the process_eupmc.py script cannot complete if it encounters a folder without XML or JSON files, or a JSON file without a DOI. I haven't fully tested the second case; this is what I can glean from simple checks. The notebook does work when I run it on a small subset of the data after removing the problematic directories.

I propose amending the process_eupmc.py script to handle cases where the getpapers result does not contain the expected info, so that instead of halting at these points the script continues and that data is ignored.

I don't yet know how to do this; I'll give it a crack, but feel free to jump in if anyone feels up to it. Possible approaches:

  • set a default?
  • try except?
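A minimal sketch of both ideas, assuming the JSON structure shown in the traceback (`paper_json['doi']` is a list). The `safe_process` helper is a hypothetical illustration of the loop-level try/except, not the actual code in process_eupmc.py:

```python
def get_doi(paper_json):
    """Return the first DOI, or None if the 'doi' key is absent or empty."""
    doi_values = paper_json.get('doi')  # .get() returns None instead of raising KeyError
    return doi_values[0] if doi_values else None

def safe_process(paper_jsons):
    """try/except alternative: skip papers with missing metadata
    instead of halting the whole run."""
    dois = []
    for paper_json in paper_jsons:
        try:
            dois.append(paper_json['doi'][0])
        except (KeyError, IndexError):
            continue  # ignore papers without a DOI
    return dois
```

Either way, the loop in `process_papers` keeps going and the incomplete records are simply dropped (or recorded as `None` for later filtering).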
npscience (Author) commented:
Nudging @yochannah as homework... I'll email you a subset of data :)
