
Processing all data, not just a subset, in step 2 (full-texts --> JSON data structure) #27

Open
npscience opened this issue Jun 11, 2018 · 1 comment


npscience commented Jun 11, 2018

I'm running through the pipeline to see if it all works locally (see issue #23) and I think there is a problem with step 2 (cc @npch), as follows:

When using the EuPMCCodeReferences notebook to process the full-texts from the getpapers text-mining step and extract URLs into a JSON data structure (with paper DOIs, etc.), the notebook gets stuck at In[5]:

KeyError                                  Traceback (most recent call last)
<ipython-input-21-d81bdb32fc4e> in <module>()
      1 # Process the papers and extract all the references to GitHub and Zenodo urls
----> 2 papers_info = process_eupmc.process_papers(paper_ids, data_dir)

[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in process_papers(list_of_pmcids, data_dir)
     97
     98     for pmcid in list_of_pmcids:
---> 99         papers.append(process_paper(pmcid, data_dir))
    100
    101     return papers

[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in process_paper(pmcid, data_dir)
     66             paper_json = json.load(f)
     67             # Get the DOI
---> 68             doi = get_doi(paper_json)
     69             pub_date = get_pub_date(paper_json)
     70     except IOError:

[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in get_doi(paper_json)
     29
     30 def get_doi(paper_json):
---> 31     paper_doi = paper_json['doi'][0]
     32     return paper_doi
     33

KeyError: 'doi'

I think this is because the process_eupmc.py script cannot complete if it encounters a folder without XML or JSON files, or a JSON file without a DOI. I haven't fully tested the second case; this is what I can glean from simple checks. The notebook does work when I run it on a small subset of the data after removing the problematic directories.

I propose amending the process_eupmc.py script to handle cases where the getpapers result does not contain the expected info, so that instead of halting at these points the script continues and that data is ignored.

I don't yet know how to do this; I'll give it a crack, but feel free to jump in if anyone feels up to it. Possible approaches:

  • set a default?
  • try except?
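A minimal sketch of both ideas, assuming the JSON structure shown in the traceback (`paper_json['doi']` is a list). The `safe_process` helper is a hypothetical illustration of the loop-level try/except, not the actual code in process_eupmc.py:

```python
def get_doi(paper_json):
    """Return the first DOI, or None if the 'doi' key is absent or empty."""
    doi_values = paper_json.get('doi')  # .get() returns None instead of raising KeyError
    return doi_values[0] if doi_values else None

def safe_process(paper_jsons):
    """try/except alternative: skip papers with missing metadata
    instead of halting the whole run."""
    dois = []
    for paper_json in paper_jsons:
        try:
            dois.append(paper_json['doi'][0])
        except (KeyError, IndexError):
            continue  # ignore papers without a DOI
    return dois
```

Either way, the loop in `process_papers` keeps going and the incomplete records are simply dropped (or recorded as `None` for later filtering).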
npscience (Author) commented:
Nudging @yochannah as homework... I'll email you a subset of data :)
