Please leave your questions and answers here.... We can only answer questions about software and the scientific literature on viruses, epidemics, etc. If you want general knowledge go to Wikipedia (I do!).
It's a collection of tools and resources to help generate knowledge from the public scientific literature. The tools are generic but we concentrate on viral epidemics (not just COVID-19) and tools to manage them.
- 5-min slide show: https://www.slideshare.net/petermurrayrust/openvirus-tools-for-discovering-literature-on-viruses
- 2-min video: https://www.youtube.com/watch?v=gBFpiOs7wZI
- Call for hack: https://genr.eu/wp/volunteer-for-openvirus-euvsvirus-hack/
- Andy Jackson BL blog: https://blogs.bl.uk/digital-scholarship/2020/05/searching-etheses-for-the-openvirus-project.html
- Andy Jackson: https://blogs.bl.uk/digital-scholarship/2020/05/bringing-metadata-full-text-together.html
- getpapers queries scientific repositories
- quickscrape scrapes publisher and other sites
- ami is a novel toolkit for collecting, transforming, indexing, sectioning, searching, and re-using scientific documents.
- https://github.com/contentmine/getpapers
- https://github.com/contentmine/quickscrape
- https://github.com/petermr/ami3
You need your own machine with permission to install software, and you should understand command lines. `getpapers` and `quickscrape` need Node.js and have installation instructions. `ami3` is a Java toolkit; at present you download a JAR file (https://github.com/petermr/tigr2ess gives instructions on installation and running). Later JAR files will follow.
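As a first test (a sketch; the query, limit, and dictionary here are arbitrary examples), you can fetch some papers and search them:

```
# download up to 100 EuropePMC papers on "coronavirus", with XML fulltext
getpapers -q "coronavirus" -k 100 -o corona -x
# search the downloaded corpus with an ami dictionary
ami -p corona search --dictionary disease
```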
- scraping, web stuff (Javascript, Node, REST)
- Academic and scholarly publishing, publisher sites, repositories.
- document transformation: HTML, XML/XSLT, JATS, PDF, Pandoc, etc.
- text searching.
- documentation and tutorials.
- liaison.
- Spanish (we are starting to index Redalyc, Latin America)
- Wikimedia, SPARQL
- workflows, packaging and distribution
- community engagement and management.
- and lots more...
We only use openly visible sites.
- https://europepmc.org is the flagship and the default in `getpapers`. There are millions of papers, mainly biomedical, but a surprising number in maths, physics, materials science, and chemistry.
- https://biorxiv.org Daily preprints in bio. @petermr has developed a scraper but it needs refining.
- https://medrxiv.org As for bio, but medical. Uses same tech as bio.
- https://www.redalyc.org/ Latin America. @petermr in close contact with site and we may get a volunteer.
- https://royalsociety.org who have made all their content openly accessible

Also:
- https://crossref.org metadata and search for all publications including closed. Will normally give URLs and abstracts (if available). No fulltext.
- https://doaj.org Directory of OA journals. Large dumps of metadata and text. Should use this more!
We can't answer medical and personal questions! But we can help search the literature for peer-reviewed Open Access papers to help organisations make policies and protocols.
We try to index everything to Wikidata.org (the data extracted from Wikipedia with a lot more added). Here's "coronavirus" in 96 languages (bottom of page https://www.wikidata.org/wiki/Q82069695) and here's "cough" with 99 languages. So when we annotate pages with Wikidata there's a good chance that yours will be linked.
We are experimenting with adding Hindi equivalents to our dictionaries, using the links in Wikidata.
We also expect to download and extract ES and PT (Spanish and Portuguese) shortly, when we start indexing Redalyc. For that we'll need native-language speakers. Tasks include: processing of diacritics, creation of stopwords and vocabularies (probably openly available), knowledge of phrases, sentence structure, punctuation, synonymy, etc.
It's a major friction in the system. We take a liberal view: that science is facts (not copyrightable), and that copying has widespread fair-use permissions, especially for non-commercial or educational purposes. However, copyright is the law in most countries, and we don't knowingly break it. We're happy to find collaborators whose legal systems are permissive. In the UK it's legal to text-mine documents you have legal access to, for non-commercial research purposes (which is what we do).
The directory that is created when running `getpapers` to download papers can be used as the CProject directory for `ami`. For example:

getpapers -q "masks" -o masks -f n95/log.txt -x -p

Here the output directory `masks` will be used by `ami`:

ami -p masks/ search --dictionary country disease funders
A CProject is just a directory whose immediate child directories (`CTree`s) are individual documents. Many of the subdirectories have reserved names (e.g. `__cooccurrence` holds the co-occurrence results).
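For orientation, here is a minimal sketch of the layout (the project and tree names are hypothetical; `fulltext.*` and `__cooccurrence` are the conventional names mentioned elsewhere on this page):

```
masks/                    <-- CProject
  PMC1234567/             <-- CTree: one document
    fulltext.xml
    fulltext.pdf
  PMC7654321/
    fulltext.xml
  __cooccurrence/         <-- reserved: co-occurrence results
```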
If you use `getpapers` then the output is already in CProject form, so no action is required. If you start with a bundle of PDFs, put them in a single directory (e.g. `myproject`, `virusmasks`, etc. Do NOT include spaces or uppercase). Then run `ami -p virus makeproject --rawfiletypes pdf`. This will rename the PDFs as:

foo&Bar.pdf => foo_bar/fulltext.pdf
Most `ami` commands take a `-p` or `-t` argument, running on the whole project or on individual tree(s) respectively.
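A sketch of the difference (the tree name is hypothetical; `search` and the dictionary are from the example above):

```
# run on the whole project
ami -p masks search --dictionary country
# run on a single tree
ami -t masks/PMC1234567 search --dictionary country
```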
In 2020-05 we changed the style of `ami` dictionary commands. We probably forgot to announce this clearly. Sorry.

The old command `ami-dictionary` has been moved to `amidict`. It has its own top-level options (e.g. there is no `--cproject` option).
amidict --help
Usage: amidict [OPTIONS] COMMAND
`amidict` is a command suite for managing dictionary:
Parameters:
===========
[@<filename>...] One or more argument files containing options.
Options:
========
-d, --dictionary=<dictionaryList>...
input or output dictionary name/s. for 'create' must be singular; when 'display' or
'translate', any number. Names should be lowercase, unique. [a-z][a-z0-9._]. Dots can be
used to structure dictionaries into directories. Dictionary names are relative to
'directory'. If <directory> is absent then dictionary names are absolute.
--directory=<directory>
top directory containing dictionary/s. Subdirectories will use structured names (NYI). Thus
dictionary 'animals' is found in '<directory>/animals.xml', while 'plants.parts' is found in
<directory>/plants/parts.xml. Required for relative dictionary names.
-h, --help Show this help message and exit.
-V, --version Print version information and exit.
General Options:
-i, --input=FILE Input filename (no defaults)
-n, --inputname=PATH User's basename for inputfiles (e.g. foo/bar/<basename>.png) or directories. By default this
is often computed by AMI. However some files will have variable names (e.g. output of
AMIImage) or from foreign sources or applications
-L, --inputnamelist=PATH...
List of inputnames; will iterate over them, essentially compressing multiple commands into
one. Experimental.
-f, --forcemake Force 'make' regardless of file existence and dates.
-N, --maxTrees=COUNT Quit after given number of trees; null means infinite.
Logging Options:
-v, --verbose Specify multiple -v options to increase verbosity. For example, `-v -v -v` or `-vvv`. We map
ERROR or WARN -> 0 (i.e. always print), INFO -> 1 (-v), DEBUG -> 2 (-vv)
--log4j=(CLASS LEVEL)...
Customize logging configuration. Format: <classname> <level>; sets logging level of class, e.
g.
org.contentmine.ami.lookups.WikipediaDictionary INFO
Commands:
=========
create creates dictionaries from text, Wikimedia, etc..
display Displays AMI dictionaries. (Under Development)
search searches within dictionaries
translate translates dictionaries between formats
A common subcommand is `create`:

amidict create --help
Usage: amidict create [-hV] [--query[=query]] [--informat=input format]
[--linkcol=<linkCol>] [--termcol=<termCol>]
[--termfile=<termfile>] [--testString=<testString>]
[--wptype=<wptype>] [--wikilinks[=<wikiLinks>[,
<wikiLinks>...]...]]... [--namecol=<nameCol>...]
[--datacols=datacol[,datacol...]...]...
[--hrefcols=hrefcol[,hrefcol...]...]...
[--outformats=output format[,output format...]...]...
[--template=<templateNames>...]... [--terms=<terms>[,
<terms>...]...]...
creates dictionaries from text, Wikimedia, etc..
TBD
--datacols=datacol[,datacol...]...
use these columns (by name) as additional data
fields in dictionary. datacols='foo,bar' creates
foo='fooval1' bar='barval1' if present. No
controlled use or vocabulary and no hyperlinks.
-h, --help Show this help message and exit.
--hrefcols=hrefcol[,hrefcol...]...
external hyperlink column from table; might be
Wikidata or remote site(s)
--informat=input format
input format (csv, list, mediawikitemplate,
wikicategory, wikipage, wikitable, wikitemplate)
--linkcol=<linkCol> column to extract link to internal pages. main use
Wikipedia. Defaults to the 'name' column
--namecol=<nameCol>...
column(s) to extract name; use exact case (e.g.
Common name)
--outformats=output format[,output format...]...
output format (xml, html, json)
--query[=query] generate query for cut and paste into EPMC or
similar. value sets size of chunks (too large
crashes EPMC). If missing, no query generated.
--template=<templateNames>...
names of Wikipedia Templates, e.g.
Viral_systemic_diseases (note underscores not
spaces). Dictionaries will be created with
lowercase names and all punctuation removed.
--termcol=<termCol> column(s) to extract term; use exact case (e.g.
Term). Could be same as namecol
--termfile=<termfile> list of terms in file, line-separated
--terms=<terms>[,<terms>...]...
list of terms (entries), comma-separated
--testString=<testString>
String input for debugging; semantics depend on task
-V, --version Print version information and exit.
--wikilinks[=<wikiLinks>[,<wikiLinks>...]...]
try to add link to Wikidata and/or Wikipedia page
of same name.
--wptype=<wptype>      type of input (HTML, mediawiki)
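For instance, here is a hedged sketch of creating a dictionary from a Wikipedia template (the dictionary name and directory are hypothetical; the template name is the one from the help text above):

```
amidict --dictionary viraldiseases --directory mydictionaries \
    create --informat wikitemplate --template Viral_systemic_diseases \
    --outformats xml,html
```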
A dictionary can be created, read, updated, or deleted using the standard CRUD operations. There are many ways to perform these operations; one of the simplest is to use Excel spreadsheets.
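For example, you can maintain terms in Excel, export the sheet as CSV, and feed it to `amidict` (a sketch; `terms.csv` and the column names `Term` and `Common name` are hypothetical):

```
amidict --dictionary animals --directory mydictionaries --input terms.csv \
    create --informat csv --termcol Term --namecol "Common name" \
    --outformats xml,html
```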
Pre-requisites for joining:
- Basic coding skills
- Access to the Web
- Logical thinking
- Good communication
- Consistency
WHOM TO CONTACT
Dr. Peter Murray-Rust pm286@cam.ac.uk
Dr. Gitanjali Yadav gy@nipgr.ac.in
While building a large multi-module project, each file requires a certain amount of memory; the more files, the more memory is required, until the JVM runs out of Java heap space. The Java heap is the memory container of a Java program, managed by the JVM.

This error arose when I was using `ami search` on a CProject directory of 950 articles.
To fix this error, assign more memory to the JVM. On Windows, give this command in the Command Prompt:

set MAVEN_OPTS=-Xmx512m -XX:MaxPermSize=128m
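On Unix-like systems the equivalent (a sketch, assuming a bash-style shell) is:

```
export MAVEN_OPTS="-Xmx512m"
```

(Note that `-XX:MaxPermSize` was removed in Java 8, so recent JVMs will warn about or ignore it.)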
For other OSes, see https://cwiki.apache.org/confluence/display/MAVEN/OutOfMemoryError

This error can also arise because some files in the CProject are too bulky and consume most of the space. It can be tackled by deleting such files. How do you find those bulky files? The error log points at them:
....
unTransform] in --transform (OutOfMemoryError: Java heap space)
PMC7286271 java.lang.reflect.InvocationTargetException
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
....
Now, if I delete this file `PMC7286271` from the CProject directory and then run the `ami search` command, it shows no error. When creating a CProject directory of, say, 1000 files, it is common to have 4 or 5 such bulky files causing the 'OutOfMemory' error. Deleting these files solves the problem.
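If the log doesn't make it obvious, a quick way to spot unusually large files (a sketch, assuming a Unix-like shell with GNU `find`; `masks/` is a hypothetical CProject):

```
# list the 10 largest files under the CProject
find masks/ -type f -printf '%s\t%p\n' | sort -nr | head -10
```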
When `ami search` was used to search a test dictionary on the CProject, it showed:

large document (1507) for PMC6824115 truncated to 500 sections

This means that the document `PMC6824115` has 1507 sections (most documents have << 100). It's probably a review or a catalogue, so large that my browser had difficulty displaying it. This bulky file was causing the Java 'OutOfMemory' error; deleting it from the CProject directory solved the problem.
In my experience (PMR) this often means that the volume of material is too large and the repository server hung up. It may be possible to commit the material in small amounts, e.g. chunks of 300 rather than all 950. In the worst case just work with 300.
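One way to do this (a sketch; the directory and glob patterns are hypothetical) is to add, commit, and push the CTrees in batches:

```
# push ~300 trees at a time instead of all 950 at once
git add masks/PMC69*
git commit -m "add first batch of CTrees"
git push
git add masks/PMC70*
git commit -m "add second batch of CTrees"
git push
```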
There may be multiple reasons for this error while you are making a pull request on GitHub.
- You can try clearing your credentials (Windows): https://stackoverflow.com/questions/15381198/remove-credentials-from-git and then try signing in again.
- In case some cached credentials in that repo became invalid, you can also clone the entire repository in a separate location and try again.
- This error occurs while trying to update `ami3`, which was installed using Maven. When running the command `mvn clean install -DskipTests`, the BUILD FAILURE error occurs.
- This happens sometimes on Windows. One thing to try is closing the Command Prompt window where you're running the command and trying again in a new window.
- If the build fails again with the same error in the new Command Prompt, try closing all windows and trying again. If that doesn't work you may need to reboot your computer.
- The likely cause is that another Command Prompt had a running process using the JAR file, so the old version couldn't be deleted.
Cannot run AMI: ami is not recognised as an internal or external command, operable program or batch file.
This probably means you haven't got a JAR file for `ami`, or haven't set the PATH to point to it.
See https://github.com/petermr/openVirus/wiki/INSTALLING-ami3 for help (there are indications of how to set your path).
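Setting the PATH looks roughly like this (a sketch; the install location is hypothetical, use wherever you put the `ami3` distribution):

```
REM Windows (Command Prompt)
set PATH=%PATH%;C:\ami3\bin
```

```
# Linux/macOS (bash)
export PATH="$PATH:$HOME/ami3/bin"
```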
- This is usually due to low network connectivity while cloning a large repository such as `ami3`.
- See https://github.com/petermr/openVirus/wiki/Tools:-ami3#where-do-i-get-ami to rectify the error.
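On a poor connection, a shallow clone (a standard git option) transfers far less history and is more likely to succeed:

```
git clone --depth 1 https://github.com/petermr/ami3.git
```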
Remko's package manager adds dates so that it should be clear when the package was released. When reporting bugs, always use the latest version unless instructed otherwise.