pygetpapers
getpapers (https://github.com/petermr/openVirus/wiki/getpapers), the primary scraper that we've been using so far, is written in JavaScript and requires Node.js to run. Driven by the problems of maintaining and extending the Node-based getpapers, we've decided to rewrite it in Python and call it pygetpapers.
getpapers has been used for EPMC and occasionally for other repos. At present these have different APIs and functions. It's unclear whether we need a general abstraction or whether each API must have its own architecture.
This is medium priority as we already have a working getpapers.
- PMR
- Ayush
- Dheeraj
- Shweata
And input from the community which uses getpapers.
PMR: This project is well suited to a modular approach, both in content and functionality. For example, each target repo is a subproject and as long as the framework is well designed it should be possible to add repos independently. An important aspect (missing at the moment) is "how to add a new repo" for example.
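One way to picture the modular approach is a small registry in which each target repo is a self-contained adapter. This is only a sketch under stated assumptions: the names REPOSITORIES, register, Repository, DemoRepository and get_repository are hypothetical, and the demo adapter returns canned records rather than calling any real API.

```python
from abc import ABC, abstractmethod

# Hypothetical registry mapping an API name (e.g. "eupmc") to its adapter class.
REPOSITORIES = {}


def register(name):
    """Class decorator that adds an adapter to the registry under `name`."""
    def decorator(cls):
        REPOSITORIES[name] = cls
        return cls
    return decorator


class Repository(ABC):
    """Interface every repo adapter is assumed to implement."""

    @abstractmethod
    def search(self, query, limit):
        """Return a list of metadata dicts for `query`, at most `limit` long."""


@register("demo")
class DemoRepository(Repository):
    """Stand-in adapter: returns canned records instead of calling a real API."""

    def search(self, query, limit):
        records = [{"id": f"DEMO{i}", "query": query} for i in range(1, 6)]
        return records[:limit]


def get_repository(name):
    """Instantiate the adapter registered under `name`."""
    try:
        return REPOSITORIES[name]()
    except KeyError:
        raise ValueError(f"unknown repository: {name!r}") from None


hits = get_repository("demo").search("essential oil", limit=2)
print([h["id"] for h in hits])
```

Under this design, "how to add a new repo" would reduce to writing one class with a search method and registering it; whether that is sufficient for the real APIs is an open question.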
Any revision of getpapers should start with the current architecture.
Need an architecture diagram of getpapers with explanations.
This is essential.
The EPMC API is fairly typical. Does it correspond to a known standard?
Identify which API functionality:
- MUST be included
- MAY be useful
- should NOT be included (there are many bibliographic fields we don't need.)
This is metadata from publishers. It's very variable. It may include abstracts but often does not.
What fields do we wish to retrieve?
Analyze the query. How much (a) semantic (b) syntactic overlap is there with EPMC?
Physics, maths, compsci preprints. Non-semantic fulltext as PDF, Word, TeX. No XML.
Low priority for biosciences
Current options
-h, --help                output usage information
-V, --version             output the version number
-q, --query <query>       search query (required)
-o, --outdir <path>       output directory (required - will be created if not found)
--api <name>              API to search [eupmc, crossref, ieee, arxiv] (default: eupmc)
-x, --xml                 download fulltext XMLs if available
-p, --pdf                 download fulltext PDFs if available
-s, --supp                download supplementary files if available
-t, --minedterms          download text-mined terms if available
-l, --loglevel <level>    amount of information to log (silent, verbose, info*, data, warn, error, or debug)
-a, --all                 search all papers, not just open access
-n, --noexecute           report how many results match the query, but don't actually download anything
-f, --logfile <filename>  save log to specified file in output directory as well as printing to terminal
-k, --limit <int>         limit the number of hits and downloads
--filter <filter object>  filter by key value pair, passed straight to the crossref api only
-r, --restart             restart file downloads after failure
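For the Python rewrite, a subset of these options maps fairly directly onto the standard argparse module. The following is a minimal sketch, assuming pygetpapers keeps the same flag names (this has not been decided) and gives --limit a default; build_parser and DEFAULT_LIMIT, including its value of 100, are hypothetical.

```python
import argparse

DEFAULT_LIMIT = 100  # assumed value; chosen only for illustration


def build_parser():
    """Build a parser covering a subset of the getpapers options."""
    parser = argparse.ArgumentParser(prog="pygetpapers")
    parser.add_argument("-q", "--query", required=True, help="search query")
    parser.add_argument("-o", "--outdir", required=True,
                        help="output directory (created if not found)")
    parser.add_argument("--api", default="eupmc",
                        choices=["eupmc", "crossref", "ieee", "arxiv"],
                        help="API to search")
    parser.add_argument("-x", "--xml", action="store_true",
                        help="download fulltext XMLs if available")
    parser.add_argument("-p", "--pdf", action="store_true",
                        help="download fulltext PDFs if available")
    parser.add_argument("-s", "--supp", action="store_true",
                        help="download supplementary files if available")
    parser.add_argument("-n", "--noexecute", action="store_true",
                        help="report the number of hits without downloading")
    parser.add_argument("-k", "--limit", type=int, default=DEFAULT_LIMIT,
                        help="limit the number of hits and downloads")
    return parser


args = build_parser().parse_args(["-q", "essential oil", "-o", "out", "-x"])
print(args.api, args.limit, args.xml)
```

argparse supplies -h/--help automatically, so that option needs no explicit definition.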
Do bioRxiv or medRxiv have an API?
Useful if the logfile can default to a child of CProject.
Does this work for EPMC?
Is it documented?
Does this work for EPMC? Is it useful?
Is it documented?
Has anyone used this? What does it do?
Is it documented?
Which raw files? Does EPMC have an interface? Do we want these files? Why? What are they used for?
Why? This is not part of getpapers. It is already done by ami.
Out of scope. This is ami-search.
getpapers had no default for number of hits (-k option). This often resulted in downloading the whole database. High priority.
User should be able to set number of hits per page. This wasn't explicit in getpapers. May also be able to restart failed searches. Low priority.
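A default hit limit and a user-set page size could interact roughly as follows. This is a sketch, not a description of any real API: fetch_all, fake_fetch_page and FAKE_DB are hypothetical, the default values are arbitrary, and the page fetcher simply slices a list of fake IDs.

```python
def fetch_all(fetch_page, limit=100, page_size=25):
    """Collect up to `limit` hits by requesting pages of `page_size` hits."""
    hits = []
    page = 0
    while len(hits) < limit:
        batch = fetch_page(page, page_size)
        if not batch:
            break  # the repository has no more results
        hits.extend(batch)
        page += 1
    return hits[:limit]  # trim any overshoot from the last page


# Stand-in for a real API call: serves slices of 60 fake record IDs.
FAKE_DB = [f"ID{i}" for i in range(60)]


def fake_fetch_page(page, page_size):
    start = page * page_size
    return FAKE_DB[start:start + page_size]


print(len(fetch_all(fake_fetch_page, limit=40, page_size=25)))
```

Because the limit always has a value, a query can no longer download the whole database by accident; restarting a failed search would additionally need the last completed page to be recorded, which this sketch does not do.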
The use of brackets and quotes can be confusing and lead to errors. It will also be useful when querying using a list of terms. Medium priority
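Building the query from lists of terms, rather than asking the user to type brackets and quotes by hand, might look like the sketch below. The function names are hypothetical, and the output format (multi-word terms quoted, OR within a group, AND between groups) is an assumption about what the target API accepts.

```python
def build_or_query(terms):
    """Join terms with OR, quoting multi-word terms, and wrap in brackets."""
    cleaned = [t.strip().replace('"', "") for t in terms if t.strip()]
    if not cleaned:
        raise ValueError("at least one search term is required")
    quoted = [f'"{t}"' if " " in t else t for t in cleaned]
    return "(" + " OR ".join(quoted) + ")"


def build_and_query(groups):
    """AND together several OR-groups of terms."""
    return " AND ".join(build_or_query(group) for group in groups)


print(build_and_query([["essential oil", "essential oils"], ["antimicrobial"]]))
```

The brackets and quotes are always balanced because the user never types them.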
Many of our tools require fulltext and it may be useful to exclude others.
This might be done by simply sorting the papers by size (there might be a better way). This would ensure the user knows which folder to open and what to expect.
It may be possible to exclude non-fulltext in the search.
Some papers have data mounted on the publisher's server ("supplemental data", "supporting information").
getpapers has a --supp option. Does this do what we want?
Many papers reference data through links in the fulltext. Downloading these would require additional HTTP requests, and the files could vary a lot in size or number.
Should this be automatic or an interactive facility after the text downloads (e.g. in a dashboard)?
It could work in the local browser.
=====
These are much too general. Who contributed them? Please expand:
What does this mean?
PMR: Which date?
Having an option to choose the type of article would be a useful feature. An opinion or commentary may not be as useful as Research and Review articles. PMR: This is an EPMC option (I think). What is the current query format? We will need to customise this for the user.
PMR: This is too general
=====
Respecting quotes around multi-word terms (i.e. “Essential Oil” not “essential AND oil”) would increase accuracy thus eliminating thousands of irrelevant hits/downloads and reducing processing time and resources.
2. If possible, allow for wildcards within quoted terms so as to allow for capturing terms with and without suffixes. Example "Essential Oil*" to capture Essential Oil or Oils
Assuming this or some other work-around is programmatically possible, allowing for wildcards to be used within the quotes surrounding multi-word terms as in this example…
(“essential oil” OR “essential oils”) (37 characters) vs (“essential oil*”) (18 characters) … would create the possibility for:
- Shorter/denser queries, which would allow users to maximize the number of important search terms that could be included (or "NOT") within EUPMC’s current 1500 character limit for queries — resulting in more precisely refined resulting downloads for ami to process; and also,
- Clearer, "less fragile" queries, by eliminating or reducing opportunities for user entry errors, such as getting lost in a sea of brackets and inadvertently missing, duplicating, or misplacing some.
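The character saving can be checked directly. The sketch below uses straight quotes and takes the 1500-character limit from the text above; EUPMC_QUERY_LIMIT and remaining_budget are hypothetical names, and whether the API actually accepts a wildcard inside quotes is exactly the open question raised here.

```python
EUPMC_QUERY_LIMIT = 1500  # character limit quoted in the text above


def remaining_budget(query, limit=EUPMC_QUERY_LIMIT):
    """Return how many characters are left before `query` reaches `limit`."""
    return limit - len(query)


long_form = '("essential oil" OR "essential oils")'
short_form = '("essential oil*")'

print(len(long_form), len(short_form))  # 37 18
print(remaining_budget(long_form), remaining_budget(short_form))  # 1463 1482
```

A check like remaining_budget could also be used to warn the user before an over-long query is sent.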
Using proximity (near words) operators would allow me to further constrain searches/downloads to only papers that contain say, "Essential Oils" within 5 words (before and/or after) of "Activity" (or better yet, "Activit*" — See requirement #2)
The addition of Proximity Query Operators to getpapers — and possibly AMI — would drastically reduce such waste as it would allow users to precisely ultra-fine-tune their queries and weed-out extraneous articles. And as a result, Users’ (i.e. my own) confidence in the results gleaned from the corpus of articles — later analyzed by AMI — would/could be MUCH greater knowing the extracted corpus itself is more accurate, precise, and therefore more relevant to my work, and hopefully more reliable too.
Generally speaking, my “gold standard” for defining articles of interest — ones worthy of being downloaded and analyzed — are ones that contain one or more sentences that contain three or more key Terms in close proximity to one another, such as the ones in the following example: “… our objective was to test/measure the (A) anti-microbial [action/effect/effectiveness/capacity] of (B) essential oil (EO) extracted via (B) hydro-distillation from the (C) leaves of (D) North American (E) Thuja Occidentalis (White cedar) against (F) p. acnes, the bacteria associated with (G) Acne Vulgaris appearing on the (H) cutaneous regions of the (I) face, neck and back, of teens who have found other treatments such as (J) [drugs] ineffective.” Obviously, I have overloaded this sample, but from it, one can plainly see that Proximity Operators using just any three of these key Terms would identify more relevant papers, and thereby generate more relevant results from AMI (I propose AMI should have these capabilities too).
2-Term proximity query
near(("anti-microbial", "p. acnes"), 6)
(n)-Term proximity query (in this example, n=3)
near(("anti-microbial", "p. acnes", "essential oil*"), 6)
A user-defined (n)-Term Proximity query, used in conjunction with a user-defined "nearness number" (as "6" used in examples above) would be ideal.
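If the search API itself does not offer proximity operators, a similar effect could be approximated as a filter over text that has already been downloaded. The following is a sketch under that assumption only: near, tokenize and term_positions are hypothetical helpers, "nearness" is measured between the first words of each term, and a trailing * is treated as a prefix wildcard.

```python
import itertools
import re

# Words may contain internal hyphens; a trailing * is kept for wildcard terms.
TOKEN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*\*?")


def tokenize(text):
    """Lower-case `text` and split it into word tokens."""
    return TOKEN.findall(text.lower())


def term_positions(tokens, term):
    """Return the token indices at which `term` (one or more words) starts."""
    words = tokenize(term)

    def matches(token, word):
        if word.endswith("*"):
            return token.startswith(word[:-1])  # prefix wildcard
        return token == word

    return [i for i in range(len(tokens) - len(words) + 1)
            if all(matches(tokens[i + j], w) for j, w in enumerate(words))]


def near(text, terms, window):
    """True if every term occurs and some set of occurrences spans <= `window` words."""
    tokens = tokenize(text)
    position_lists = [term_positions(tokens, t) for t in terms]
    if not all(position_lists):
        return False  # at least one term is missing altogether
    return any(max(combo) - min(combo) <= window
               for combo in itertools.product(*position_lists))


sentence = ("The anti-microbial activity of essential oils "
            "against P. acnes was measured.")
print(near(sentence, ["anti-microbial", "p. acnes", "essential oil*"], 6))
```

This works on any number of terms with a user-defined window, but it can only discard papers after download; it does not reduce the number of hits returned by the search.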
I've got 32Gigs of Ram I'd like to put to good use. If py doesn't automatically adjust to use RAM intelligently, let's have an option to do that manually please.
Even better, being able to have multiple keywords each with its own section specification.
This filters out papers that are not relevant really quickly (e.g. those that only mention a species of interest once in the introduction but actually focus on a different species in the results). Sometimes this is desirable, but often it only bloats the number of results with noise.
This is already an option in EPMC's Advanced Search and it looks like ami can do it, but being able to filter by section at this step would mean that the downloaded papers are highly relevant.
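Pairing each keyword with its own section could be expressed as a list of (section, keyword) pairs that is turned into a query string. In this sketch the SECTION:"keyword" form and the section names RESULTS and METHODS are assumptions for illustration; they would need to be checked against the EPMC search syntax documentation, and section_query is a hypothetical helper.

```python
def section_query(pairs):
    """Build an AND query from (section, keyword) pairs."""
    clauses = []
    for section, keyword in pairs:
        # Assumed syntax: upper-case field name, colon, quoted keyword.
        clauses.append(f'{section.upper()}:"{keyword}"')
    return " AND ".join(clauses)


print(section_query([("results", "Thuja occidentalis"),
                     ("methods", "hydro-distillation")]))
```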
Agree with Ambreen's points above. Some papers have data mounted on the publisher's server ("supplemental data", "supporting information"). This is increasingly the case and very often the data in the supplementary information is useful. It's also often the case that Materials and Methods are published in the supplementary info, which in some cases is now almost 100 pages in length.
Again, echoing Ambreen's comment: getpapers has a --supp option. Does this do what we want?