I'm running through the pipeline to see if it is all possible locally (see issue #23) and I think there is a problem with step 2 (cc @npch), as follows:
When using the EuPMCCodeReferences notebook to process the full texts from getpapers text mining and extract URLs into a JSON data structure (with paper DOIs, etc.), the notebook gets stuck at In[5]:
```
KeyError                                  Traceback (most recent call last)
<ipython-input-21-d81bdb32fc4e> in <module>()
      1 # Process the papers and extract all the references to GitHub and Zenodo urls
----> 2 papers_info = process_eupmc.process_papers(paper_ids, data_dir)

[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in process_papers(list_of_pmcids, data_dir)
     97
     98     for pmcid in list_of_pmcids:
---> 99         papers.append(process_paper(pmcid, data_dir))
    100
    101     return papers

[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in process_paper(pmcid, data_dir)
     66         paper_json = json.load(f)
     67         # Get the DOI
---> 68         doi = get_doi(paper_json)
     69         pub_date = get_pub_date(paper_json)
     70     except IOError:

[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in get_doi(paper_json)
     29
     30 def get_doi(paper_json):
---> 31     paper_doi = paper_json['doi'][0]
     32     return paper_doi
     33

KeyError: 'doi'
```
I think this happens when there is a folder without XML or JSON, or a JSON file without a DOI; in either case the process_eupmc.py script cannot complete. I haven't fully tested the second case, but that's what I can glean from simple checks. The notebook does work when I run it on a small subset of data and remove the problematic directories.

I propose amending the process_eupmc.py script to handle cases where the getpapers result does not contain the expected info, so that instead of stopping at these points, the script continues and that data is skipped.

I don't yet know exactly how to do this. I'll try to give it a crack; feel free to jump in if anyone feels up to it. Possible approaches:
- set a default?
- `try`/`except`?
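As a rough sketch of both ideas, assuming the function names from the traceback above (the `process_paper` callable is passed in here as a parameter purely to keep the example self-contained; the real script calls its own `process_paper` directly):

```python
def get_doi(paper_json):
    # Option 1 ("set a default?"): dict.get returns None when the
    # 'doi' key is absent, instead of raising KeyError.
    doi_list = paper_json.get('doi')
    return doi_list[0] if doi_list else None

def process_papers(list_of_pmcids, data_dir, process_paper):
    # Option 2 ("try/except?"): catch per-paper failures so one bad
    # folder does not abort the whole run; that entry is skipped.
    papers = []
    for pmcid in list_of_pmcids:
        try:
            papers.append(process_paper(pmcid, data_dir))
        except (IOError, KeyError):
            print("Skipping {}: missing or malformed data".format(pmcid))
    return papers
```

The two options compose: `get_doi` stops raising for the missing-DOI case, while the `try`/`except` in the loop still guards against folders with no XML/JSON at all.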