Understand if current pipeline is operational #23

Open
npscience opened this issue Apr 3, 2018 · 3 comments

@npscience

Run through entire pipeline as it stands locally to find bugs / see if it works.

@npscience npscience self-assigned this Apr 3, 2018
@andreww

andreww commented Apr 8, 2018

First step fails for me:

$ getpapers --query query -o output -x
info: Searching using eupmc API
info: Found 42467 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 5.3.2 vs. 5.3.5 reported by api
Retrieving results [===================-----------] 64% (eta 99.7s)
<--- Last few GCs --->

[56542:0x102804600]   200534 ms: Mark-sweep 1400.0 (1434.2) -> 1400.0 (1434.2) MB, 773.2 / 0.0 ms  allocation failure GC in old space requested
[56542:0x102804600]   201541 ms: Mark-sweep 1400.0 (1434.2) -> 1400.0 (1427.2) MB, 1004.1 / 0.0 ms  last resort 
[56542:0x102804600]   202307 ms: Mark-sweep 1400.0 (1427.2) -> 1400.0 (1427.2) MB, 766.3 / 0.0 ms  last resort 


<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x13b598913471 <JS Object>
    1: assignOrPush [/Users/earawa/.nvm/versions/node/v7.10.1/lib/node_modules/getpapers/node_modules/xml2js/lib/parser.js:~93] [pc=0x7a71e266626](this=0x2a2aa3049029 <a Parser with map 0x143935ebfa9>,obj=0x2df8d5623e21 <an Object with map 0xabbf50fb411>,key=0x2df8d5623e89 <String[16]: availabilityCode>,newValue=0x2df8d5623f71 <String[2]: OA>)
    2: onclosetag [/Users/earawa/.nvm/versions/node...
 
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::Abort() [/Users/earawa/.nvm/versions/node/v7.10.1/bin/node]
 2: node::FatalException(v8::Isolate*, v8::Local<v8::Value>, v8::Local<v8::Message>) [/Users/earawa/.nvm/versions/node/v7.10.1/bin/node]
 3: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [/Users/earawa/.nvm/versions/node/v7.10.1/bin/node]
 4: v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationSpace) [/Users/earawa/.nvm/versions/node/v7.10.1/bin/node]
 5: v8::internal::Runtime_AllocateInTargetSpace(int, v8::internal::Object**, v8::internal::Isolate*) [/Users/earawa/.nvm/versions/node/v7.10.1/bin/node]
 6: 0x7a71de0ea5f
Abort trap: 6

This is on a Mac Pro with a local installation of getpapers and its dependencies. I suspect I need to raise the limit on the memory node is permitted to use.
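
One way to test that suspicion is to start getpapers through node directly with V8's `--max-old-space-size` flag (value in MB). The following is a minimal sketch, assuming getpapers is installed as a Node script on the PATH. The helper names and the 4096 MB default are illustrative assumptions and have not been tried against this query.

```python
import shutil
import subprocess


def build_getpapers_command(script_path, query, outdir, heap_mb=4096):
    """Build a node command line that runs getpapers with a larger V8 heap.

    heap_mb is an illustrative value, not a tested recommendation.
    """
    return [
        "node",
        f"--max-old-space-size={heap_mb}",  # V8 old-space limit, in MB
        script_path,
        "--query", query,
        "-o", outdir,
        "-x",  # same flags as the failing command above
    ]


def run_getpapers(query, outdir, heap_mb=4096):
    """Locate the getpapers script on the PATH and run it (hypothetical helper)."""
    script = shutil.which("getpapers")
    if script is None:
        raise RuntimeError("getpapers not found on PATH")
    subprocess.run(build_getpapers_command(script, query, outdir, heap_mb), check=True)
```

If the heap still fills up at a larger limit, the cause is more likely the size of the result set (42467 hits) than the node default.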

@npscience

Hmmm, sorry @andreww I'm not sure what to suggest for that.

@npscience

npscience commented Jun 11, 2018

Notes from May 21:

Step 1: find full-texts containing 'search-phrase'

Use getpapers to mine for ‘github.com’ and find the full-texts - Naomi's notebook at https://github.com/softwaresaved/code-cite/blob/master/notebooks/getpapers.md

I did this at the collabw18 hackday (March 28) and did not repeat it for this test, but note Andrew's report above.

Output data is in /data folder in github repo and locally (~/Github/collabw18/code-cite/code-cite/data).

Step 2: process full-texts --> JSON data with 'search-term' URLs

Extract URLs from full-texts and create JSON data structure (article DOI, URLs, pubdate) - Neil's notebook at https://github.com/softwaresaved/code-cite/blob/master/notebooks/EuPMCCodeReferences.ipynb.

  • Set the local file directory in the notebook (instead of ../data in In [2]) and ran it --> error in In [4]: KeyError. The script cannot find a doi.
  • Tried using a subset of the data files --> error: file not found.
  • Changed the error messages in the process_eupmc.py script so that each one is unique (i.e. it specifies whether the XML or the JSON file cannot be found) --> pull request. Found that one folder in my subset of data folders had no XML file. Removed this paper directory --> script works.
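
Rather than removing incomplete paper directories by hand, the data folder could be checked before the notebook is run. A minimal sketch, assuming each paper directory should contain `fulltext.xml` and `eupmc_result.json` (these file names are my assumption about the getpapers output layout; adjust them to whatever process_eupmc.py actually opens). `find_incomplete_papers` is a hypothetical helper, not part of the repository.

```python
import os


def find_incomplete_papers(data_dir, required=("fulltext.xml", "eupmc_result.json")):
    """Return {paper_dir_name: [missing file names]} for incomplete directories.

    The required file names are an assumption about the getpapers output layout.
    """
    incomplete = {}
    for name in sorted(os.listdir(data_dir)):
        paper_dir = os.path.join(data_dir, name)
        if not os.path.isdir(paper_dir):
            continue  # ignore stray files at the top level of the data folder
        missing = [f for f in required
                   if not os.path.isfile(os.path.join(paper_dir, f))]
        if missing:
            incomplete[name] = missing
    return incomplete
```

The returned dictionary can be printed as a report, or its keys filtered out of the list of paper IDs before processing.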

Bug (reported June 11, issue #27): if there is a folder without an XML or JSON file, or (I assume) a JSON file without a doi, the script errors and cannot run past In [21]:

KeyError                                  Traceback (most recent call last)
<ipython-input-21-d81bdb32fc4e> in <module>()
      1 # Process the papers and extract all the references to GitHub and Zenodo urls
----> 2 papers_info = process_eupmc.process_papers(paper_ids, data_dir)

[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in process_papers(list_of_pmcids, data_dir)
     97
     98     for pmcid in list_of_pmcids:
---> 99         papers.append(process_paper(pmcid, data_dir))
    100
    101     return papers

[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in process_paper(pmcid, data_dir)
     66             paper_json = json.load(f)
     67             # Get the DOI
---> 68             doi = get_doi(paper_json)
     69             pub_date = get_pub_date(paper_json)
     70     except IOError:

[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in get_doi(paper_json)
     29
     30 def get_doi(paper_json):
---> 31     paper_doi = paper_json['doi'][0]
     32     return paper_doi
     33

KeyError: 'doi'
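
One possible way around this is to let `get_doi` return `None` when the key is absent and to collect failing papers instead of stopping at the first one. This is a sketch only, not the fix adopted in the repository: `process_papers_tolerantly` is a hypothetical name, and it takes `process_paper` as an argument so that it does not depend on the internals of process_eupmc.py.

```python
def get_doi(paper_json):
    """Return the first DOI in the record, or None if there is none."""
    dois = paper_json.get("doi") or []
    return dois[0] if dois else None


def process_papers_tolerantly(list_of_pmcids, data_dir, process_paper):
    """Process each paper, collecting failures instead of stopping at the first.

    process_paper is called as process_paper(pmcid, data_dir), matching the
    call shown in the traceback above.
    """
    papers, skipped = [], []
    for pmcid in list_of_pmcids:
        try:
            papers.append(process_paper(pmcid, data_dir))
        except (KeyError, IndexError, IOError) as err:
            # Keep the reason so incomplete papers can be reported afterwards
            skipped.append((pmcid, repr(err)))
    return papers, skipped
```

Returning the skipped IDs alongside the results keeps the run going while still showing which papers need attention.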

For the following steps I used the JSON output from the subset of data.
This file is located in ~/local-resources/a_subset/

Step 3: analyse results data

Use the JSON data from step 2 to check whether the URLs resolve and whether the repositories have documentation, license files, etc., using Andrew's notebook at https://github.com/softwaresaved/code-cite/blob/master/notebooks/resolvre_and_check_resources.ipynb

[npscience notes: This requires a GitHub token, generated on GitHub. Which scope is needed? I guessed only repo/public_repo. Notebook amended for file location; the GitHub token is entered manually.]
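
Two small helpers might make this step less manual. This is a sketch under assumptions, not code from Andrew's notebook: the environment variable name `GITHUB_TOKEN` and both function names are hypothetical. The first reads the token from the environment so it does not have to be typed into the notebook; the second splits a github.com URL into owner and repository, since some of the extracted URLs point at a user page rather than a repository.

```python
import os
from urllib.parse import urlparse


def github_token_from_env(var_name="GITHUB_TOKEN"):
    """Read the token from an environment variable (the name is an assumption)."""
    token = os.environ.get(var_name)
    if not token:
        raise RuntimeError(f"Set {var_name} before running the notebook")
    return token


def split_github_url(url):
    """Return (owner, repo) from a github.com URL; repo is None for user pages."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in ("github.com", "www.github.com"):
        raise ValueError(f"Not a github.com URL: {url}")
    parts = [p for p in parsed.path.split("/") if p]
    owner = parts[0] if parts else None
    repo = parts[1] if len(parts) > 1 else None
    if repo and repo.endswith(".git"):
        repo = repo[:-4]  # drop the clone-URL suffix
    return owner, repo
```

URLs where `repo` is `None` cannot have a licence file of their own, so they could be reported separately rather than scored as repositories.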

Results (head):

	doi	licence_exists	pub_date	resolves	resourcetype	score	timestamp	url
0	10.1186/1471-2105-11-93	False	2010-01-01	True	github	1	2018-05-21T20:39:36.259006	http://github.com/danmaclean/NiBLS
1	10.1371/journal.pcbi.1000724	False	2010-04-01	True	github	1	2018-05-21T20:39:37.605631	http://github.com/malaria-atlas-project/mbg-wo...
2	10.1093/bioinformatics/btq099	False	2010-04-01	True	github	1	2018-05-21T20:39:38.946657	http://github.com/arrayexpress/ae-interface/tr...
3	10.1371/journal.pone.0010071	True	2010-01-01	True	github	2	2018-05-21T20:39:40.161856	http://github.com/lg/murder
4	10.1093/nar/gkq143	False	2010-05-01	True	github	1	2018-05-21T20:39:41.604144	http://github.com/GeneDesign
5	10.1186/1471-2164-11-222	False	2010-01-01	True	github	1	2018-05-21T20:39:43.265000	http://github.com/wwood/essentiality
6	10.1186/1471-2164-11-222	True	2010-01-01	True	github	2	2018-05-21T20:39:44.578704	http://github.com/wwood/goruby
7	10.1093/nar/gkq476	False	2010-07-01	True	github	1	2018-05-21T20:39:45.758499	http://github.com/mikisvaz

So this works. How to output this nicely?
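
One simple option would be to write the results to CSV so they can be opened in a spreadsheet or loaded by the web app. A minimal sketch using only the standard library, assuming each result is a dict keyed by the column names shown above; `results_to_csv` is a hypothetical helper. If the results are already held in a pandas DataFrame, `DataFrame.to_csv` would do the same job.

```python
import csv
import io

# Column names taken from the results table above
COLUMNS = ["doi", "licence_exists", "pub_date", "resolves",
           "resourcetype", "score", "timestamp", "url"]


def results_to_csv(rows):
    """Serialise result records (dicts keyed by COLUMNS) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The returned text can be written to a file in the data folder alongside the JSON output.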

/ end May 21 notes

Next:

Step 4: Visualise!

Note the web app is at https://github.com/softwaresaved/code-cite-app
