Giulia edited this page Jan 28, 2021 · 17 revisions

Background

getpapers (https://github.com/petermr/openVirus/wiki/getpapers), the primary scraper that we've been using so far, is written in JavaScript and requires Node.js to run. Because the Node-based getpapers has been hard to maintain and extend, we've decided to rewrite it in Python and call it pygetpapers.

getpapers has been used for EPMC and occasionally for other repos. At present these repos have different APIs and functions. It's unclear whether we need a general abstraction or whether each API must have its own architecture.

Priority

This is medium priority as we already have a working getpapers.

People

  1. PMR
  2. Ayush
  3. Dheeraj
  4. Shweata

And input from the community which uses getpapers.

Initial Plans

PMR: This project is well suited to a modular approach, both in content and functionality. For example, each target repo is a subproject, and as long as the framework is well designed it should be possible to add repos independently. An important aspect that is missing at the moment is documentation of "how to add a new repo".
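
One way to picture the modular approach is a small registry where each repo registers itself under its `--api` name. This is only a sketch: the names `Repository`, `register`, `REPOSITORIES` and `get_repository` are hypothetical, and `ExampleRepository` returns canned records rather than calling any real API.

```python
from abc import ABC, abstractmethod

# Hypothetical registry mapping an --api name to its repository class.
REPOSITORIES = {}


def register(name):
    """Class decorator that records a repository under the given name."""
    def decorator(cls):
        REPOSITORIES[name] = cls
        return cls
    return decorator


class Repository(ABC):
    """Minimal interface every target repo would implement (assumed)."""

    @abstractmethod
    def search(self, query, limit):
        """Return a list of metadata records (dicts) for the query."""


@register("example")
class ExampleRepository(Repository):
    """Stand-in repo that returns canned records instead of calling an API."""

    def search(self, query, limit):
        records = [{"id": f"EX{i}", "query": query} for i in range(1, 6)]
        return records[:limit]


def get_repository(name):
    """Look up and instantiate a registered repository by name."""
    try:
        return REPOSITORIES[name]()
    except KeyError:
        raise ValueError(f"unknown repository: {name!r}") from None
```

With this shape, "how to add a new repo" would reduce to writing one subclass and decorating it; whether a single interface can really cover EPMC, crossref and arXiv is still the open question above.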

architecture

Any revision of getpapers should start with the current architecture.

action

Need an architecture diagram of getpapers with explanations.

Repos

EPMC

This is essential.

comment

The EPMC API is fairly typical. Does it correspond to a known standard?

action

Identify which API functionality:

  • MUST be included
  • MAY be useful
  • should NOT be included (there are many bibliographic fields we don't need.)

crossref

This is metadata from publishers. It's very variable. It may include abstracts but often does not.

action

What fields do we wish to retrieve?

Analyze the query. How much (a) semantic and (b) syntactic overlap is there with EPMC?

arXiv

Physics, maths, compsci preprints. Non-semantic fulltext as PDF, Word, TeX. No XML.

comments

Low priority for biosciences

getpapers features

Current options

    -h, --help                output usage information
    -V, --version             output the version number
    -q, --query <query>       search query (required)
    -o, --outdir <path>       output directory (required - will be created if not found)
    --api <name>              API to search [eupmc, crossref, ieee, arxiv] (default: eupmc)
    -x, --xml                 download fulltext XMLs if available
    -p, --pdf                 download fulltext PDFs if available
    -s, --supp                download supplementary files if available
    -t, --minedterms          download text-mined terms if available
    -l, --loglevel <level>    amount of information to log (silent, verbose, info*, data, warn, error, or debug)
    -a, --all                 search all papers, not just open access
    -n, --noexecute           report how many results match the query, but don't actually download anything
    -f, --logfile <filename>  save log to specified file in output directory as well as printing to terminal
    -k, --limit <int>         limit the number of hits and downloads
    --filter <filter object>  filter by key value pair, passed straight to the crossref api only
    -r, --restart             restart file downloads after failure

api

Do bioRxiv or medRxiv have an API?

log

It would be useful if the logfile could default to a child of the CProject.

supp

Does this work for EPMC?

Is it documented?

minedterms

Does this work for EPMC? Is it useful?

Is it documented?

restart

Has anyone used this? What does it do?

Is it documented?

Requirements:

Add an option to get raw files as well as files in formats such as XML and PDF.

Which raw files? Does EPMC have an interface? Do we want these files? Why? What are they used for?

Convert XML papers into a user-readable format.

Why? This is not part of getpapers. It is already done by ami.

Specify a wordlist and then get the count of those words for each paper.

Out of scope. This is ami-search.

Requirements from PMR (not exhaustive)

default number of hits

motivation

getpapers had no default for the number of hits (-k option). This often resulted in downloading the whole database. High priority.
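
A minimal sketch of a default for `-k`, using Python's standard `argparse`. The value 100 and the names `DEFAULT_LIMIT` and `build_parser` are assumptions for illustration; the actual default still needs to be agreed.

```python
import argparse

DEFAULT_LIMIT = 100  # assumed value, for illustration only


def build_parser():
    """Build a parser where -k/--limit falls back to DEFAULT_LIMIT."""
    parser = argparse.ArgumentParser(prog="pygetpapers")
    parser.add_argument("-q", "--query", required=True, help="search query")
    parser.add_argument(
        "-k", "--limit", type=int, default=DEFAULT_LIMIT,
        help="maximum number of hits to download (default: %(default)s)",
    )
    return parser


# Without -k, the limit is the default rather than "everything".
args = build_parser().parse_args(["-q", "essential oil"])
print(args.limit)  # 100
```

A user who really wants the whole result set would then have to ask for it explicitly, which seems safer than the current behaviour.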

choice of cursor size

motivation

User should be able to set number of hits per page. This wasn't explicit in getpapers. May also be able to restart failed searches. Low priority.
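
The paging logic might look roughly like the sketch below. `fetch_page` is a hypothetical callable standing in for the real API request, and the `"*"` start marker is an assumption; EPMC's REST search appears to expose page size and cursor as `pageSize` and `cursorMark`, but that should be checked against its documentation.

```python
def iter_hits(fetch_page, page_size=25, limit=100):
    """Yield up to `limit` records, requesting `page_size` per page.

    `fetch_page(cursor, page_size)` is a hypothetical callable that returns
    (records, next_cursor); next_cursor is None when there are no more pages.
    """
    cursor = "*"  # assumed marker for "first page"
    yielded = 0
    while cursor is not None and yielded < limit:
        records, cursor = fetch_page(cursor, page_size)
        if not records:
            break
        for record in records:
            if yielded >= limit:
                return
            yield record
            yielded += 1


def fake_fetch_page(cursor, page_size):
    """Stand-in for a real API call: serves 60 numbered records."""
    start = 0 if cursor == "*" else int(cursor)
    records = list(range(start, min(start + page_size, 60)))
    next_cursor = str(start + page_size) if start + page_size < 60 else None
    return records, next_cursor


hits = list(iter_hits(fake_fetch_page, page_size=25, limit=30))
print(len(hits))  # 30
```

Saving the last cursor to disk after each page is one possible route to restarting a failed search, though that is not shown here.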

query builder

motivation

The use of brackets and quotes can be confusing and lead to errors. A query builder would also be useful when querying using a list of terms. Medium priority.
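
As a rough sketch of what a query builder could do, the functions below take plain lists of terms and add the quotes and brackets themselves. `quote_term` and `build_query` are hypothetical names, and the `AND`/`OR`/`NOT` syntax is assumed to match what the target repo accepts.

```python
def quote_term(term):
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    term = term.strip()
    if " " in term and not (term.startswith('"') and term.endswith('"')):
        return f'"{term}"'
    return term


def build_query(any_of=(), all_of=(), none_of=()):
    """Combine term lists into one bracketed boolean query string."""
    parts = []
    if any_of:
        parts.append("(" + " OR ".join(quote_term(t) for t in any_of) + ")")
    parts.extend(quote_term(t) for t in all_of)
    query = " AND ".join(parts)
    for term in none_of:
        query += f" NOT {quote_term(term)}"
    return query.strip()


print(build_query(any_of=["essential oil", "terpene"],
                  all_of=["antimicrobial"],
                  none_of=["review"]))
# ("essential oil" OR terpene) AND antimicrobial NOT review
```

The term lists could equally be read from a file, one term per line, which covers the "list of terms" case.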

Requirements from Ambreen H

Segregation of papers based on whether they are full text or not

motivation

Many of our tools require fulltext and it may be useful to exclude others.

comment

This might be done by simply sorting the papers based on their size (there might be a better way). This would ensure the user knows which folder to open and what to expect.

It may be possible to exclude non-fulltext in the search.
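
If the segregation is done after download, checking for the fulltext file may be more reliable than sorting by size. The sketch below assumes the getpapers-style layout of one directory per paper with the fulltext saved as `fulltext.xml`; `split_by_fulltext` is a hypothetical helper.

```python
from pathlib import Path


def split_by_fulltext(cproject_dir, fulltext_name="fulltext.xml"):
    """Return (with_fulltext, without_fulltext) lists of paper directory names.

    Assumes one sub-directory per paper; plain files in the project
    directory (logs, result summaries) are ignored.
    """
    with_ft, without_ft = [], []
    for paper_dir in sorted(Path(cproject_dir).iterdir()):
        if not paper_dir.is_dir():
            continue
        target = with_ft if (paper_dir / fulltext_name).is_file() else without_ft
        target.append(paper_dir.name)
    return with_ft, without_ft
```

The two lists could then be written to a summary file or used to move papers into separate folders, depending on what users find easiest.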

Download supplemental datasets if available

motivation

Some papers have data mounted on the publisher's server ("supplemental data", "supporting information").

comment

getpapers has a --supp option. Does this do what we want?

Many papers reference data through links in the fulltext. Downloading these would require additional HTTP requests. The files could vary a lot in size and number.

Should this be automatic, or an interactive facility after the text downloads (e.g. in a dashboard)?

What will the user interface be like?

It could work in the local browser.

=====

Requirements and Bugs to Fix (@ayush and @ShweataNHegde)

These are much too general. Who contributed them? Please expand:

General API

What does this mean?

SH and AG: Sort the Articles by Date

PMR: Which date?
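
Whichever date is chosen, the sorting itself could be done on the downloaded metadata, with the field name left as a parameter. In this sketch `firstPublicationDate` is an assumed field name (it appears to be what EPMC returns, but this needs checking), and dates are assumed to be in YYYY-MM-DD form.

```python
from datetime import date


def sort_by_date(records, date_field="firstPublicationDate", newest_first=True):
    """Sort metadata records by an ISO-format (YYYY-MM-DD) date field.

    Records missing the field are placed last regardless of direction.
    """
    dated = [r for r in records if r.get(date_field)]
    undated = [r for r in records if not r.get(date_field)]
    dated.sort(key=lambda r: date.fromisoformat(r[date_field]),
               reverse=newest_first)
    return dated + undated


records = [
    {"id": "a", "firstPublicationDate": "2020-03-01"},
    {"id": "b"},
    {"id": "c", "firstPublicationDate": "2021-01-15"},
]
print([r["id"] for r in sort_by_date(records)])  # ['c', 'a', 'b']
```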

SH: Specifically, download only the Review, Research etc.

Motivation

Having an option to choose the type of article would be a useful feature. An opinion or commentary may not be as useful as Research and Review articles. PMR: This is an EPMC option (I think). What is the current query format? We will need to customise this for the user.

Add attributes for repository specific functions

PMR: This is too general

=====

Manny's Requirements

1. Respect quotes while searching terms such as "Essential Oil"

Respecting quotes around multi-word terms (i.e. “Essential Oil” not “essential AND oil”) would increase accuracy thus eliminating thousands of irrelevant hits/downloads and reducing processing time and resources.

2. If possible, allow for wildcards within quoted terms so as to allow for capturing terms with and without suffixes. Example "Essential Oil*" to capture Essential Oil or Oils

Assuming this or some other work-around is programmatically possible, allowing for wildcards to be used within the quotes surrounding multi-word terms as in this example…

(“essential oil” OR “essential oils”) (37 characters) vs (“essential oil*”) (18 characters) … would create the possibility for:

  • Shorter, denser queries, which would let users fit more important search terms (including "NOT" exclusions) within EUPMC’s current 1500 character limit for queries, resulting in more precisely refined downloads for ami to process; and
  • Clearer, “less fragile” queries, with fewer opportunities for user entry errors, such as getting lost in a sea of brackets and inadvertently missing, duplicating, or misplacing some.
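
If the repo itself does not accept wildcards inside quotes, one possible work-around is for the client to expand them before sending the query. The sketch below is only that: `expand_wildcard` is a hypothetical helper and the suffix list is an assumption. Note that it helps with the second point (less typing, fewer bracket errors) but not the first, because the expanded query that reaches the server is as long as the hand-written one.

```python
def expand_wildcard(term, suffixes=("", "s")):
    """Expand a trailing '*' into an OR-group of quoted variants.

    The suffix list is an assumption; a real tool would need a better
    source of word endings.
    """
    term = term.strip().strip('"')
    if not term.endswith("*"):
        return f'"{term}"'
    stem = term[:-1]
    variants = [f'"{stem}{suffix}"' for suffix in suffixes]
    return "(" + " OR ".join(variants) + ")"


print(expand_wildcard('"essential oil*"'))
# ("essential oil" OR "essential oils")
```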

3. Increase query precision with Proximity Operators

Using proximity (near words) operators would allow me to further constrain searches/downloads to only papers that contain say, "Essential Oils" within 5 words (before and/or after) of "Activity" (or better yet, "Activit*" — See requirement #2)

Adding Proximity Query Operators to getpapers, and possibly AMI, would drastically reduce wasted downloads and processing, as it would allow users to fine-tune their queries precisely and weed out extraneous articles. As a result, users’ (i.e. my own) confidence in the results gleaned from the corpus of articles, later analyzed by AMI, could be MUCH greater, knowing that the extracted corpus itself is more accurate, precise, and therefore more relevant to my work, and hopefully more reliable too.

The “Gold Standard” — Why proximity matters

Generally speaking, my “gold standard” for defining articles of interest (ones worthy of being downloaded and analyzed) is that they contain one or more sentences with three or more key Terms in close proximity to one another, such as the ones in the following example: “… our objective was to test/measure the (A) anti-microbial [action/effect/effectiveness/capacity] of (B) essential oil (EO) extracted via (C) hydro-distillation from the (D) leaves of (E) North American (F) Thuja occidentalis (White cedar) against (G) P. acnes, the bacteria associated with (H) Acne Vulgaris appearing on the (I) cutaneous regions of the (J) face, neck and back, of teens who have found other treatments such as (K) [drugs] ineffective.” Obviously, I have overloaded this sample, but from it one can plainly see that Proximity Operators using just any three of these key Terms would identify more relevant papers, and thereby generate more relevant results from AMI (I propose AMI should have these capabilities too).

The Proximity Operator “Gold Standard” (aka. wishful thinking)

Most basic example:

2-Term proximity query “near((“anti-microbial”, “p. acnes”), 6)”

Ideal, most flexible/useful example:

(n)-Term Proximity query (in this example, n=3) “near((“anti-microbial”, “p. acnes”, “essential oil*”), 6)”

A user-defined (n)-Term Proximity query, used in conjunction with a user-defined "nearness number" (such as the "6" used in the examples above), would be ideal.
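
To make the wish concrete, here is a sketch of how an n-term `near` check could work as a local filter on text that has already been downloaded. It is not a repo-side query operator. The function names, the tokenisation rule, and the definition of "nearness" (start positions of all terms at most `distance` words apart) are all assumptions.

```python
import re
from itertools import product

# Lowercase words; hyphenated words such as "anti-microbial" stay whole.
TOKEN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize(text):
    return TOKEN.findall(text.lower())


def positions(term, tokens):
    """Start indices where the phrase occurs; a trailing '*' means prefix match."""
    prefix = term.strip().endswith("*")
    words = tokenize(term)
    if not words:
        return []
    hits = []
    for i in range(len(tokens) - len(words) + 1):
        window = tokens[i:i + len(words)]
        if window[:-1] != words[:-1]:
            continue
        last, wanted = window[-1], words[-1]
        if last.startswith(wanted) if prefix else last == wanted:
            hits.append(i)
    return hits


def near(terms, distance, text):
    """True if every term occurs, with start positions <= `distance` words apart."""
    tokens = tokenize(text)
    hit_lists = [positions(term, tokens) for term in terms]
    if not all(hit_lists):
        return False
    return any(max(combo) - min(combo) <= distance
               for combo in product(*hit_lists))


sentence = ("The anti-microbial action of essential oils "
            "against P. acnes was measured.")
print(near(["anti-microbial", "p. acnes", "essential oil*"], 6, sentence))  # True
```

Trying every combination of positions is fine for a sentence or paragraph but would be slow on whole papers with very common terms; a real implementation would probably use a sliding window instead.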

4. Allow users to allocate more RAM to both Getpapers and AMI as available and necessary.

I've got 32 GB of RAM I'd like to put to good use. If Python doesn't automatically adjust to use RAM intelligently, let's have an option to do that manually please.

Giulia's requirements

Only download papers that contain a keyword of interest in a specific section

Even better, being able to have multiple keywords each with its own section specification.

motivation

This quickly filters out papers that are not really relevant (e.g. those that only mention a species of interest once in the introduction but actually focus on a different species in the results). Sometimes such papers are wanted, but often they only bloat the results with noise.

comment

This is already an option in EPMC's Advanced Search and it looks like ami can do it, but being able to filter by section at this step would mean that the downloaded papers are highly relevant.
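
If section filtering is passed through to the repo's own query syntax, pygetpapers would mainly need to build the clauses. The sketch below assumes a `SECTION:"keyword"` form and section names such as `RESULTS` and `METHODS`; both need checking against EPMC's search syntax reference, and `section_clause` and `section_query` are hypothetical helpers.

```python
def section_clause(section, keyword):
    """Build one SECTION:"keyword" clause; the section names are assumed."""
    keyword = keyword.strip().strip('"')
    return f'{section.upper()}:"{keyword}"'


def section_query(pairs, operator="AND"):
    """Join (section, keyword) pairs into a single query string."""
    clauses = [section_clause(section, keyword) for section, keyword in pairs]
    return f" {operator} ".join(clauses)


print(section_query([("results", "Thuja occidentalis"),
                     ("methods", "hydro-distillation")]))
# RESULTS:"Thuja occidentalis" AND METHODS:"hydro-distillation"
```

Taking a list of pairs covers the "multiple keywords each with its own section specification" case directly.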

Download supplemental datasets if available

motivation

Agree with Ambreen's points above. Some papers have data mounted on the publisher's server ("supplemental data", "supporting information"). This is increasingly the case and very often the data in the supplementary information is useful. It's also often the case that Materials and Methods are published in the supplementary info, which in some cases is now almost 100 pages in length.

comment

Again, echoing Ambreen's comment: getpapers has a --supp option. Does this do what we want?
