Feedback on "Learn ContentMining" #6

Open
Daniel-Mietchen opened this issue Mar 22, 2017 · 1 comment

Daniel-Mietchen commented Mar 22, 2017

The section

Learn ContentMining

in https://github.com/ContentMine/FutureTDM/blob/master/README.md probably sounds a bit daunting for beginners (I'm not one, so I can't be sure). Some more context on what it entails, how long it should take, and how it all fits together would be very helpful.

The first link is highly irritating, as it does not point to a getpapers tutorial. This is hopefully fixed by #4, which has the "getpapers" link point to https://github.com/ContentMine/workshop-resources/tree/master/software-tutorials/getpapers , an actual tutorial that I am following from now on, commenting only where there are surprises.

Instead of the ursus maritimus example (which I had done in the past), I went for

getpapers -q 'thank you' -n -o thank-you
info: Searching using eupmc API
info: Running in no-execute mode, so nothing will be downloaded
info: Found 36011 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 4.5.3.2 vs. 5.0.1 reported by api

This version incompatibility is irritating, but I'll ignore it for the moment.

Those 36k results are a bit too many for a quick download, so I'm adding "donation" as an additional keyword:

getpapers -q 'thank you' donation -n -o thank-you
info: Searching using eupmc API
info: Running in no-execute mode, so nothing will be downloaded
info: Found 36011 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 4.5.3.2 vs. 5.0.1 reported by api

Same number of results, and I'm not sure why. Perhaps add some pointers on whether and how quoted and unquoted search strings can be combined?
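
The shell side of this can be checked without getpapers. The sketch below uses a hypothetical helper, `show_args`, to print each word exactly as the shell hands it to a program. It shows that `donation` arrives as a separate argument rather than as part of the `-q` value; whether getpapers then ignores that extra word is an assumption, though it would explain the identical count. The `AND` query in the second call is likewise an assumption about the search syntax and is not confirmed anywhere in this thread.

```shell
# Hypothetical helper: print each argument on its own line, in brackets.
show_args() {
  printf '[%s]\n' "$@"
}

# The command from above: 'donation' sits outside the quotes,
# so the shell passes it as its own word after the -q value.
show_args -q 'thank you' donation -n -o thank-you

# Wrapping the whole query in one pair of single quotes passes it as one word.
# The "AND" and inner double quotes are assumed query syntax, not verified here.
show_args -q '"thank you" AND donation' -n -o thank-you
```

In the first call the `-q` value is only `thank you`; in the second, the complete query string reaches the program as a single argument.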

I then went for a search term with far fewer results: trigonopterus.

Back in https://github.com/ContentMine/workshop-resources/tree/master/software-tutorials/getpapers , it says

To have a look at folder file structure, use the tree command.

That gave me

$ tree trigonopterus/
-bash: tree: command not found

I then tried

$ brew install tree
Error: The following formula:
  tree
cannot be installed as a a binary package and must be built from source.
To continue, you must install Xcode from the App Store,
or the CLT by running:
  xcode-select --install

which got me googling and landing at https://superuser.com/questions/359723/mac-os-x-equivalent-of-the-ubuntu-tree-command . I tried none of the options for tree listed there, though one of the answers suggests that my brew attempt should have worked. Instead, I went for

find trigonopterus/.
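
A tutorial could offer this fallback directly. The following is a minimal sketch, assuming a POSIX shell; `show_tree` is a hypothetical helper name, not something the tutorial provides. It uses tree when it is installed and otherwise falls back to a sorted find listing.

```shell
# Hypothetical helper: list a directory with tree if available, else with find.
show_tree() {
  dir=$1
  if command -v tree >/dev/null 2>&1; then
    tree "$dir"
  else
    # find prints every path below $dir; sorting keeps the listing stable.
    find "$dir" -print | sort
  fi
}

# Only list the project folder if it exists in the current directory.
if [ -d trigonopterus ]; then
  show_tree trigonopterus
fi
```

The find output is flat rather than indented, but it shows the same files and folders.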

After the getpapers tutorial, I am switching to the one on norma:

norma --project trigonopterus -i fulltext.xml -o scholarly.html --transform nlm2html

This resulted in multiple lines of the kind

UNKNOWN: prefix: Dr
UNKNOWN: prefix: Prof
UNKNOWN: prefix: Mr
UNKNOWN: prefix: Ms

or

UNKNOWN: sec-meta: Taxon classificationAnimaliaColeopteraCurculionidae
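
To get an overview of such messages rather than scrolling through them, they can be counted. This is a minimal sketch using standard tools; `summarise_unknown` is a hypothetical helper, and the input below is a made-up sample built from the lines above (the repeated `Dr` line is illustrative only). Whether norma writes these messages to standard output or standard error is not something I checked, so real use may need `2>&1` before the pipe.

```shell
# Hypothetical helper: count how often each distinct UNKNOWN message occurs.
summarise_unknown() {
  grep '^UNKNOWN:' | sort | uniq -c | sort -rn
}

# Made-up sample input; in practice this would be norma's output.
summarise_unknown <<'EOF'
UNKNOWN: prefix: Dr
UNKNOWN: prefix: Prof
UNKNOWN: prefix: Dr
UNKNOWN: sec-meta: Taxon classificationAnimaliaColeopteraCurculionidae
EOF
```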

I did not see the need to do the PDF part, so I skipped it and went on to ami:

ami2-species --project trigonopterus/ -i scholarly.html --sp.species --sp.type binomial
ami2-word --project trigonopterus/ -i scholarly.html --w.words wordFrequencies
ami2-sequence --project trigonopterus --filter file\(\*\*/results.xml\) -o sequencesfiles.xml

All of these worked fine. I think I am now prepared enough for the Zika tutorial, which I will tackle next.
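
To show how the norma and ami steps fit together, here is a minimal sketch that chains the exact commands from above for one project folder. The `run` and `pipeline` helpers and the `DRY_RUN` variable are hypothetical names introduced for illustration. By default the script only prints the commands; setting `DRY_RUN=0` would execute them, which assumes norma and the ami2 tools are installed and the folder already contains the getpapers download.

```shell
# Hypothetical helper: print a command, and execute it only when DRY_RUN=0.
run() {
  printf '+ %s\n' "$*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Run the norma and ami steps in order, stopping at the first failure.
pipeline() {
  project=$1
  run norma --project "$project" -i fulltext.xml -o scholarly.html --transform nlm2html || return 1
  run ami2-species --project "$project" -i scholarly.html --sp.species --sp.type binomial || return 1
  run ami2-word --project "$project" -i scholarly.html --w.words wordFrequencies || return 1
  run ami2-sequence --project "$project" --filter 'file(**/results.xml)' -o sequencesfiles.xml || return 1
}

pipeline trigonopterus
```

The order matters: norma produces the scholarly.html files that the ami2 commands read as input.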


tarrow commented Apr 4, 2017

I hope you don't mind, but just so I don't miss anything, I'm going to split all of these problems into separate issues:

The link for the getpapers tutorial was in the wrong place: already fixed by #4 (thanks!)

The version incompatibility: This is sort of deliberate; the idea is that the warning tells us when EuPMC has updated their API without us noticing. Such updates often come with a host of little problems that need fixing.

Quoting is confusing and needs to be documented better: see #13

Mac and some Linux distributions don't come with tree: see #14

Norma is noisy: these are just tags / attributes that we haven't come across before to translate correctly into ScholarlyHTML
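
For illustration, the idea behind the version warning can be sketched in a few lines of shell. This is not how getpapers implements it; `check_api_version` is a hypothetical function that compares the version a tool was built against with the version the API reports, warns on a mismatch and carries on.

```shell
# Hypothetical sketch: warn (without failing) when the versions differ.
check_api_version() {
  expected=$1
  reported=$2
  if [ "$expected" != "$reported" ]; then
    printf 'warn: expected API version %s, but the API reports %s\n' \
      "$expected" "$reported" >&2
  fi
}

# Version numbers taken from the log output in the report above.
check_api_version 4.5.3.2 5.0.1
```

The check is deliberately strict: any difference triggers the warning, so an API update cannot go unnoticed.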
