non-JSON formatted simpler summary data desired #142

Open
rossmounce opened this issue Dec 6, 2016 · 2 comments

@rossmounce
Member

rossmounce commented Dec 6, 2016

The nested structure of the eupmc_results.json output makes it tricky to get human-readable summaries of the results. The journal title per result is a particular problem: it is nested within journalInfo, and then within journal -> title (title is also a non-unique key, since the same key is used for the article title). I'm not suggesting a change to the structure of the results JSON, just that getpapers could offer, as a non-default option, a simpler overview CSV for people who find JSON hard or intimidating.
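To make the nesting concrete, here is a rough sketch of one record written out as an R list. The shape is an assumption inferred from how my script indexes the data (each value wrapped in a one-element array), and all the values are invented for illustration.

```r
# Hypothetical record mirroring the nesting described above.
# The shape is assumed, not taken from a real eupmc_results.json; values are made up.
record <- list(
  title = list("An example article title"),
  journalInfo = list(list(
    journal = list(list(title = list("An Example Journal")))
  ))
)

# The article title sits at the top level of the record...
article_title <- record$title[[1]]

# ...while the journal title is several levels down, under the same key name.
journal_title <- record$journalInfo[[1]]$journal[[1]]$title[[1]]

cat(article_title, "|", journal_title, "\n")
```

Two different things called title, at two different depths, is what makes a flat summary awkward to produce by hand.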

As a workaround I have written a short R script that creates this summary non-interactively from the JSON. It's far from ideal, as it requires an R package (jsonlite) that users probably won't have installed.

Script below:

#!/usr/bin/env Rscript
args = commandArgs(trailingOnly=TRUE)
if (length(args)==0) {
  stop("At least one argument must be supplied (input file).\n", call.=FALSE)
} else if (length(args)==1) {
  # default output file
  args[2] = "summary.csv"
}
#install.packages('jsonlite')
library(jsonlite)
mymatrix <- fromJSON(args[1])
journals <- data.frame(rep(NA,dim(mymatrix)[1]))
# Look up the journal title for each result; results without one (e.g. patents) get a placeholder
for (i in seq_len(nrow(mymatrix))) {
  if (is.null(mymatrix$journalInfo[[i]]$journal[[1]]$title)) {
    journals[i,1] <- "not published in a journal"
  } else {
    journals[i,1] <- mymatrix$journalInfo[[i]]$journal[[1]]$title
  }
}
zzz <- cbind(as.character(mymatrix$pmcid),as.character(mymatrix$title),journals[1],as.character(mymatrix$pubYear),as.character(mymatrix$authorString),as.character(mymatrix$doi),as.character(mymatrix$hasPDF),as.character(mymatrix$hasSuppl),as.character(mymatrix$isOpenAccess),as.character(mymatrix$citedByCount),as.character(mymatrix$electronicPublicationDate))
colnames(zzz) <- c("pmcid","article.title","journal","pubYear","authorString","doi","hasPDF","hasSuppl","isOpenAccess","citedByCount","electronicPublicationDate")
write.csv(zzz,file=args[2])

Example command-line usage:

Rscript json-to-csv.R eupmc_results.json output.csv

This creates an overview CSV file with a much reduced set of fields, covering the basics that most users are likely to want to know, e.g. journal, article title and year of publication.

csvcut -n output.csv 
  1: 
  2: pmcid
  3: article.title
  4: journal
  5: pubYear
  6: authorString
  7: doi
  8: hasPDF
  9: hasSuppl
 10: isOpenAccess
 11: citedByCount
 12: electronicPublicationDate
@blahah
Member

blahah commented Dec 6, 2016

Right now we just take the eupmc API response object and serialise it to JSON. My personal opinion is that it might be out of scope for getpapers to do more with it, and that there's space for more tools in the ecosystem that do more. We could link to other tools that handle the output, including linking to this issue.

@rossmounce
Member Author

I've made a minor update to my script, adding an is.null check inside the for loop. Patents do not have a journal title, and the resulting NULLs were breaking my simple for loop.
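For anyone adapting this, the same guard can be written as a small reusable helper. This is only a sketch: get_nested, the path vector and the example records are hypothetical names and data, and it assumes the JSON has been read into plain nested lists rather than a data frame.

```r
# Hypothetical helper: walk a nested list along `path` (names or positions),
# returning `default` as soon as any level is missing.
get_nested <- function(x, path, default = NA_character_) {
  for (key in path) {
    missing_key <- if (is.character(key)) {
      !(key %in% names(x))   # named level absent
    } else {
      key > length(x)        # positional level absent
    }
    if (is.null(x) || missing_key) return(default)
    x <- x[[key]]
  }
  if (is.null(x)) default else x
}

# Invented example records: an article with a journal, and a patent without one.
article <- list(journalInfo = list(list(
  journal = list(list(title = list("An Example Journal")))
)))
patent <- list(title = list("An example patent"))

path <- list("journalInfo", 1, "journal", 1, "title", 1)
article_journal <- get_nested(article, path)
patent_journal  <- get_nested(patent, path, default = "not published in a journal")
```

The patent record falls back to the placeholder instead of producing a NULL, which is the same behaviour the is.null check gives in the loop.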
