Mini Project: “Semantic analysis of the literature on Plant Invasive Species”

Kanishka Parashar

ABSTRACT

The research work “Semantic analysis of the literature on Plant Invasive Species” was conducted to extract knowledge from machine-readable material. Content mining of this kind involves copying large quantities of material, extracting the data and then recombining it to identify patterns. The work uses open access literature on plant invasive species, downloaded from Europe PubMed Central, with the following objectives:

• To establish a dictionary which will be used to search for annotated scientific literature.

• To create a corpus on invasive plants by using the getpapers toolkit.

• To get possible information about geographical location of invasive plants.

Chapter 1: INTRODUCTION

The project develops and applies open tools and resources such as Wikipedia, Wikidata, Python, Java and data mining, in combination with conceptual tools, for discovering, combining, cleaning and semantically categorizing scholarly documents that mention a significant number of plant invasive species. The project aims to make this information accessible to the community in a coherent manner. We used GitHub as an Open Notebook tool to record each activity. The project has grown significantly and branched out into different areas of research. The various parts of the project are as follows:

1.1 CEVOpen

The project aims to develop a semantic atlas of invasive plants and their properties, such as geographical location or country.

The link to the project page: https://github.com/petermr/CEVOpen#readme (1)

It includes:

Content mining of Open Phytochemical literature for plant medicinal activities.
It is an open project, started to analyse the activity and composition of plant essential oils described in open access literature, using EuropePMC and ContentMine’s pygetpapers/ami.
It includes information like:

-- Plant name and its identity

-- Plant phytochemical composition (chemicals)

-- Literature search

-- Experimental determination of activity.

Figure 1: Showing Introduction of the CEVOpen project.

1.2 INVASIVE SPECIES

The word "invasive" originated from the French word "invasif", meaning "tending to invade" or "aggressive". Invasive species are species that are not naturally found in an environment; they are introduced into it by people, deliberately or accidentally. The newly introduced species has a negative impact on its new environment: it invades the habitat or region, causing ecological, environmental, and economic damage. Invasive species may include plants, animals, algae, fungi, microbes, etc. They are also called ‘alien species’.

Some characteristics shown by invasive species:

Fast growth: they have shorter life cycles, invest heavily in reproduction, produce more seeds, disperse better, and even germinate faster.
Phenotypic plasticity: the ability to grow in the newly introduced environment.
Ecological competence: a high tolerance for a wide range of environmental conditions.
Ability to feed on a range of foods.

Importance of Invasive Species:

Despite being harmful to the native habitat, invasive plants are introduced by people into their countries, knowingly or unknowingly, as “ornamental plants” or because of the short-term benefits they provide.

Invasive species can also help the environment by providing food, nesting sites, and shelter.
They help to prevent soil erosion by securing sediment and soil.

Why it is important to study Invasive Species:

It is important to study invasive species because they have devastating impacts on native biota and natural ecosystems by destroying or replacing native food sources.
They can disturb the food chain.
They can also alter the abundance or diversity of species that are important habitats for native life.
In their new habitat, invasive species may become predators, competitors, parasites and diseases of native or domesticated plants and animals.

One of the famous quotes by ‘Sean Henna’

“Once an invasive species arrives it’s about impossible to get rid of it.” (2)

Invasive plant species have a drastic impact on native populations, species, communities and ecosystems. According to biologists, complete eradication of invasive species is difficult because of their varying seasons, but a control strategy that limits their impact can be feasible.

Some of the methods used are:

a) Mechanical removal of invasive plants from an area.

b) Construction of barriers to prevent their spread.

c) Reduction of their population size by biochemical methods (biocides).

World’s most affected countries:

Researchers across the world teamed up to compile information about the worldwide spread of invasive species and their threats, providing a global-scale outlook on how the introduction and spread of invasive species could shift in the coming decades as a result of increasing globalization and climate change.

According to “The Times” data, researchers found that the United States and China act as the top sources of invasive species because of their high level of trade across the world (3). It has been reported that countries such as Peru, Thailand, Afghanistan and Angola are at high risk of invasion by invasive species and have a low capacity to respond because of a lack of resources.

World’s most aggressive invasive Plant species:

Some of the most aggressive invasive plant species are Lantana camara, native to Central and South America, and Prosopis juliflora, native to Mexico and Central and northern South America. These are followed by Kudzu, which has become a very serious problem in countries such as the United States, Fiji, Australia, New Zealand, Italy, and Canada. Other plant species, namely Caulerpa seaweed, Cheatgrass and Giant Hogweed, are also among the most aggressive invasive plant species found around the world. Cheatgrass is mostly found in the United States, while Giant Hogweed, possibly the newest of the most aggressive invasive plant species, has become a huge concern as it spreads rapidly across Canada (4).

Information sources to study about Invasive species:

To find information about invasive species refer to the following link-

GISD DATABASE - Global Invasive Species Database (http://www.iucngisd.org/gisd/)

Figure 2: Showing the GISD database webpage.

1.3 MY AIM IN THE PROJECT

Collect information from all freely visible scientific publications on plant invasive species and transform it into a uniform format.
Use text mining tools to extract meaning from articles.
Build a dictionary containing around 400 plant invasive species with terms, synonyms, common names, taxon name, Wikidata ID, Wikidata URL, Wikipedia URL, image, and map view for search, classification and understanding.

Chapter 2: LITERATURE REVIEW

Invasive, alien, or non-native species are species that pose a major threat to local or global biodiversity and have negative socio-economic and human health impacts. Invasive species disrupt the ecology of natural ecosystems by displacing native plants and the animal species that depend upon them, thereby reducing biodiversity (5). Positive socio-economic effects have mediated the rapid spread of invasive alien species in forests in rural and urban areas (6). Therefore, various policies and legislation are in place to regulate the spread of invasive alien species in forest ecosystems. Hybridization, species competition and the spread of diseases are the major threats caused by invasive species in forest ecosystems.

The key factor behind the spread of invasive alien species is human activity, which grows with population size. A large number of human activities have been found responsible for the spread of invasive plant species; for example, the construction of a road within a forest may disperse the seeds of alien plant species through soil or construction material (7). Along with human activities, natural agents such as wind and rain help invasive plant species to spread and invade. The spread of invasive alien species impairs the performance of individual native species and reduces species diversity, which leads to irreversible change in species composition; for example, Impatiens parviflora in Europe has a huge impact on the growth and proliferation of native species. Nutrient availability also intensifies inter- and intraspecific competition for limited resources.

Invasive plant species are one of the main threats to biodiversity. In forest ecosystems, the threats caused by invasive plant species include transmission of diseases, hybridization, and species competition. Langmaier M and Lapin K systematically studied the impact of invasive plant species on afforestation of European temperate forests. The study identified 53 vascular plant species having a negative influence on forest regeneration in Central European forests and found that 21 plant species are reported to be impacted by invasive species. The review synthesis showed that the following impact mechanisms affect the regeneration success of native plant species: competition for resources; chemical, physical and structural impacts on regeneration; and indirect impacts through interaction with other species (8). Invasive species destroy biodiversity in many ways. When an invasive species enters an ecosystem, it may lack natural enemies or controls. It can reproduce and spread quickly, taking over an area. Native wildlife may not have evolved defences against this species, or may not be able to compete with a species that has no predators (9).

In an assessment of the impacts of invasive plants on resident species, communities and ecosystems, a group of scientists showed that invasive plant species are far more likely to cause significant impacts on native plant and animal richness on islands than on the mainland. The study showed that there is no universal measure of impact and that the pattern observed depends on the ecological measure examined. Although impact is highly context dependent, some species traits, especially life form, stature, and pollination syndrome, may provide a means to predict impact regardless of the specific habitat and geographic area where the invasion occurs (10).

Invasive species have negative environmental and social impacts, including biodiversity loss, so understanding the impact of invasive plant species has important applications for invasive species management in general (11). The environmental impact of management interventions can be studied by targeting an invasive species, studying the habitat or environment it invades, and then designing strategies to prevent or control its spread.

Control of Invasive species- Invasive plant species can negatively impact the establishment and growth of native plants and several ecosystem properties, such as soil cover, nutrient cycling, fire regimes and hydrology. Controlling invasive plants is therefore a necessary, yet usually expensive, step towards the restoration of an ecosystem. Decisions about which control method to use depend on the invasive plant species' growth forms, the economic situation of the country where the sites are located and the available resources for control of those species. Developed countries tend to use more chemical control, while less developed countries are more likely to use non-chemical methods.

There are three main methods used for control of invasive species - biological, mechanical, and chemical.

Biological control is the intentional manipulation of natural enemies by humans for the purpose of controlling pests.
Mechanical control includes mowing, hoeing, cultivation, and hand pulling.
Chemical control is the use of herbicides. (12)

Invasive species are a big threat to biodiversity and affect the environment around the world, so it is important to carry out research in this area to find better preventive techniques that slow down their spread across the globe.

In this study we assess the potential of a large-scale data mining system applied to Europe PubMed Central (PMC) full texts. We present an extensive evaluation of the textual data available for plant invasive species, and we combine text mining information with countries from other experimental dictionaries. All large-scale data sets, as well as manually curated data, are made publicly available on GitHub to stimulate the application of text mining data in future plant science studies.

Chapter 3: MATERIALS AND METHODS

Tools and Databases

GISD: The Global Invasive Species Database is a free, searchable online source of information about invasive species and their negative impacts on biodiversity, managed by the Invasive Species Specialist Group (ISSG) of the IUCN Species Survival Commission. GISD aims to increase public knowledge and awareness about invasive species, and to facilitate effective management and prevention activities by disseminating expert knowledge and experience to a global audience. GISD focuses on invasive species that harm biodiversity and natural areas, and it covers all taxonomic groups from micro-organisms to animals and plants.

Essential software and system run.

getpapers

It is a command-line tool developed for searching research articles through the EuropePMC, IEEE, ArXiv and Crossref APIs.
Further reading: https://github.com/ContentMine/getpapers
Installation: https://github.com/petermr/tigr2ess/blob/master/installation/windows/INSTALLATION.md

Go to nvm-windows and download the latest version of nvm-setup.zip
To install Node.js, run nvm install latest on the command line
To install getpapers, run npm install --global getpapers

Run the command getpapers on the command line to check the installation; the cmd window should look like this:


C:\Users\HP PC>getpapers

  Usage: getpapers [options]

  Options:

    -h, --help                output usage information
    -V, --version             output the version number
    -q, --query <query>       search query (required)
    -o, --outdir <path>       output directory (required - will be created if not found)
    --api <name>              API to search [eupmc, crossref, ieee, arxiv] (default: eupmc)
    -x, --xml                 download fulltext XMLs if available
    -p, --pdf                 download fulltext PDFs if available
    -s, --supp                download supplementary files if available
    -t, --minedterms          download text-mined terms if available
    -l, --loglevel <level>    amount of information to log (silent, verbose, info*, data, warn, error, or debug)
    -a, --all                 search all papers, not just open access
    -n, --noexecute           report how many results match the query, but don't actually download anything
    -f, --logfile <filename>  save log to specified file in output directory as well as printing to terminal
    -k, --limit <int>         limit the number of hits and downloads
    --filter <filter object>  filter by key value pair, passed straight to the crossref api only
    -r, --restart             restart file downloads after failure

Figure 3: Showing cmd window for getpapers installation.

General syntax: getpapers -q <“search query”> -o <output directory> -x -p -k <limit of papers>
Here -x and -p are switches that request fulltext XML and PDF downloads; they take no value.
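
The general syntax can also be assembled programmatically, for example when the same search is repeated with different limits. The following is a minimal Python sketch and not part of getpapers itself: the helper names `build_getpapers_command` and `run_getpapers`, the example query and the output directory name are assumptions made for illustration. Only the flags `-q`, `-o`, `-x`, `-p`, `-k` and `-n` are taken from the help output shown above.

```python
import shutil
import subprocess


def build_getpapers_command(query, outdir, limit=None, xml=True, pdf=False, count_only=False):
    """Assemble a getpapers argument list (hypothetical helper, for illustration only)."""
    command = ["getpapers", "-q", query, "-o", outdir]
    if xml:
        command.append("-x")  # download fulltext XML if available
    if pdf:
        command.append("-p")  # download fulltext PDF if available
    if limit is not None:
        command += ["-k", str(limit)]  # limit the number of hits and downloads
    if count_only:
        command.append("-n")  # only report how many results match the query
    return command


def run_getpapers(command):
    """Run the command; this needs getpapers to be installed and on the PATH."""
    executable = shutil.which(command[0])  # also resolves getpapers.cmd on Windows
    if executable is None:
        raise FileNotFoundError("getpapers was not found on the PATH")
    return subprocess.run([executable] + command[1:], check=True)


# The query and directory below are placeholders, not the project's actual values.
command = build_getpapers_command("invasive plant species", "invasive_corpus", limit=100)
print(" ".join(command))
```

Building the command as a list rather than a single string avoids quoting problems when the query contains spaces. Calling `run_getpapers(command)` would start the actual download, so it is left out of the example.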

pygetpapers

It is a python version of getpapers. This software is developed to access open scientific repositories, gather hits and download articles in a systematic and non-interactive manner.
Further read: https://github.com/petermr/pygetpapers
Installation: https://github.com/petermr/pygetpapers/blob/main/README.md#6-installation

Download Python along with pip from: https://www.python.org/downloads/
Clone the repository to the local computer using the git clone command: git clone https://github.com/petermr/pygetpapers
Run the command: pip install git+https://github.com/petermr/pygetpapers

Run the command pygetpapers on the command line to check the installation; the cmd window should look like this:


C:\Users\HP PC>pygetpapers
usage: pygetpapers [-h] [-v] [-q QUERY] [-o OUTPUT] [-x] [-p] [-s] [--references REFERENCES] [-n]
                   [--citations CITATIONS] [-l LOGLEVEL] [-f LOGFILE] [-k LIMIT] [-r RESTART] [-u UPDATE]
                   [--onlyquery] [-c] [--makehtml] [--synonym]

Welcome to Pygetpapers version 0.0.3.3. -h or --help for help

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         output the version number
  -q QUERY, --query QUERY
                        query string transmitted to repository API. Eg. "Artificial Intelligence" or "Plant Parts". To
                        escape special characters within the quotes, use backslash. Incase of nested quotes, ensure
                        that the initial quotes are double and the qutoes inside are single. For eg: `'(LICENSE:"cc
                        by" OR LICENSE:"cc-by") AND METHODS:"transcriptome assembly"' ` is wrong. We should instead
                        use `"(LICENSE:'cc by' OR LICENSE:'cc-by') AND METHODS:'transcriptome assembly'"`
  -o OUTPUT, --output OUTPUT
                        output directory (Default: Folder inside current working directory named )
  -x, --xml             download fulltext XMLs if available
  -p, --pdf             download fulltext PDFs if available
  -s, --supp            download supplementary files if available
  --references REFERENCES
                        Download references if available. Requires source for references
                        (AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).
  -n, --noexecute       report how many results match the query, but don't actually download anything
  --citations CITATIONS
                        Download citations if available. Requires source for citations
                        (AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).
  -l LOGLEVEL, --loglevel LOGLEVEL
                        Provide logging level. Example --log warning <<info,warning,debug,error,critical>>,
                        default='info'
  -f LOGFILE, --logfile LOGFILE
                        save log to specified file in output directory as well as printing to terminal
  -k LIMIT, --limit LIMIT
                        maximum number of hits (default: 100)
  -r RESTART, --restart RESTART
                        Reads the json and makes the xml files. Takes the path to the json as the input
  -u UPDATE, --update UPDATE
                        Updates the corpus by downloading new papers. Takes the path of metadata json file of the
                        orignal corpus as the input. Requires -k or --limit (If not provided, default will be used)
                        and -q or --query (must be provided) to be given. Takes the path to the json as the input.
  --onlyquery           Saves json file containing the result of the query in storage. The json file can be given to
                        --restart to download the papers later.
  -c, --makecsv         Stores the per-document metadata as csv.
  --makehtml            Stores the per-document metadata as html.
  --synonym             Results contain synonyms as well.

Figure 4: showing cmd window for successful installation of pygetpapers.

General syntax: pygetpapers -q <“search query”> -o <output directory> -x -p -k <paper limit> -c
Here -x, -p and -c are switches (fulltext XML, fulltext PDF and CSV metadata); they take no value.
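
The help text above advises that, when a query contains nested quotes, the outer quotes should be double and the inner ones single. The sketch below shows one way to compose such a query string in Python. It is an illustration only: the helper names, the two example phrases and the output directory are assumptions, and the flags used (`-q`, `-o`, `-x`, `-c`, `-k`) are those listed in the help output.

```python
def quote_phrase(phrase):
    """Wrap a phrase in single quotes, as advised for quotes nested inside the query."""
    if "'" in phrase or '"' in phrase:
        raise ValueError("phrase must not contain quote characters")
    return f"'{phrase}'"


def build_query(phrases, operator="OR"):
    """Join quoted phrases with a boolean operator (hypothetical helper)."""
    return "(" + f" {operator} ".join(quote_phrase(p) for p in phrases) + ")"


def build_pygetpapers_command(query, outdir, limit=100):
    """Return the command line as text; the whole query sits inside double quotes."""
    return f'pygetpapers -q "{query}" -o {outdir} -x -c -k {limit}'


# Example phrases and directory name are placeholders.
query = build_query(["invasive plant species", "alien plant species"])
print(build_pygetpapers_command(query, "invasive_corpus"))
```

The printed line can be pasted into the command prompt. Adding `-n` to it first reports how many results match without downloading anything.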

ami

It is a toolkit for analysing collected documents on local storage. The software is written in Java. It turns documents into knowledge and includes tools for downloading scientific papers, creating sections and XML, analysing components (text, tables, diagrams), creating dictionaries and searching.

-Further read: https://github.com/petermr/ami3

-Installation: https://github.com/petermr/openVirus/wiki/INSTALLING-ami3

Download the backend software (Java JDK, Maven and git) and set the path for them
Open the command line and clone the ami3 repository: git clone https://github.com/petermr/ami3
In the ami3 directory, run the command: mvn install -Dmaven.test.skip=true

Run `ami --help` on the command line to check the installation:


C:\Users\HP PC>ami --help
[picocli WARN] Could not format '@|bold ami 2020.08.09_09.54-NEXT-SNAPSHOT|@
(jar:file:/C:/Users/HP%20PC/ami3/target/appassembler/repo/ami3-2020.08.09_09.54-NEXT-SNAPSHOT.jar)' (Underlying error: Conversion = 'P'). Using raw String: '%n' format strings have not been replaced with newlines. Please ensure to escape '%' characters with another '%'.
Usage: ami [OPTIONS] COMMAND

`ami` is a command suite for managing (scholarly) documents: download, aggregate, transform, search, filter, index,
annotate, re-use and republish.
It caters for a wide range of inputs (including some awful ones), and creates de facto semantics and an ontology (based
on Wikidata).
`ami` is the basis for high-level science/tech applications including chemistry (molecules, spectra, reaction), Forest
plots (metaanalyses of trials), phylogenetic trees (useful for virus mutations), geographic maps, and basic plots (x/y,
scatter, etc.).

Parameters:
===========
      [@<filename>...]       One or more argument files containing options.
Options:
========
  -h, --help                 Show this help message and exit.
  -V, --version              Print version information and exit.
CProject Options:
  -p, --cproject=DIR         The CProject (directory) to process. This can be (a) a child directory of cwd (current
                               working directory) (b) cwd itself (use `-p .`) or (c) an absolute filename. The cProject
                               name is the basename of the file.
                              The default is: `C:\Users\HP PC/amiprojects/myproject`.
                              You can control the default by setting the `AMIPROJECT` environment variable.
  -r, --includetree=DIR[,DIR...]...
                             Include only the CTrees in the list. (only works with --cproject). Currently must be
                               explicit but we'll add globbing later.
  -R, --excludetree=DIR...   Exclude the CTrees in the list. (only works with --cproject). Currently must be explicit
                               but we'll add globbing later.
CTree Options:
  -t, --ctree=DIR            The CTree (directory) to process. This can be (a) a child directory of cwd (current
                               working directory, usually cProject) (b) cwd itself, usually cTree (use -t .) or (c) an
                               absolute filename. No defaults. The cTree name is the basename of the file.
  -b, --includebase=PATH...  Include child files of cTree (only works with --ctree). Currently must be explicit or with
                               trailing percent for truncated glob.
  -B, --excludebase=PATH...  Exclude child files of cTree (only works with --ctree). Currently must be explicit or with
                               trailing percent for truncated glob.
General Options:
  -i, --input=FILE           Input filename (no defaults)
  -n, --inputname=PATH       User's basename for inputfiles (e.g. foo/bar/<basename>.png) or directories. By default
                               this is often computed by AMI. However some files will have variable names (e.g. output
                               of AMIImage) or from foreign sources or applications
  -L, --inputnamelist=PATH...
                             List of inputnames; will iterate over them, essentially compressing multiple commands into
                               one. Experimental.
  -f, --forcemake            Force 'make' regardless of file existence and dates.
  -N, --maxTrees=COUNT       Quit after given number of trees; null means infinite.
  -o, --output=output        Output filename (no defaults)
Logging Options:
  -v, --verbose              Specify multiple -v options to increase verbosity. For example, `-v -v -v` or `-vvv`. We
                               map ERROR or WARN -> 0 (i.e. always print), INFO -> 1 (-v), DEBUG -> 2 (-vv)
      --log4j=CLASS=LEVEL[,CLASS=LEVEL...]
                             Customize logging configuration. Format: <classname>=<level>; sets logging level of class;
                               e.g. org.contentmine.ami.lookups.WikipediaDictionary=INFO
                             This option may be specified multiple times and accepts multiple values.
Commands:
=========
  assert               Makes assertions about objects created by AMI. Currently requires a type (null),and maybe a
                         SubdirectoryType.
  clean                Cleans specific files or directories in project.
  display              Displays files in CTree.
  download             Downloads content from remote site.
  dummy                Minimal AMI Tool for editing into more powerful classes.
  figure               creates Figures from primitives (e.g. adds XML captions to figures).experimental.
  files                Carries out file operations (copy, delete, etc.) on CProject and CTrees.
  filter               FILTERs images (initally from PDFimages), but does not transform the contents.
  forest               Analyzes ForestPlot images.
  graphics             Transforms graphics contents (often from PDF/SVG).
  grobid               Runs grobid.
  image                Transforms image contents but only provides basic filtering (see ami-filter).
  lucene               Runs Lucene (words and search) Experimental
  makeproject          Processes a directory (CProject) containing files (e.g.*.pdf, *.html, *.xml) to be made into
                         CTrees.
  metadata             Manages metadata for both CProject and CTrees.
  ocr                  Extracts text from OCR and (NYI) postprocesses HOCR output to create HTML.
  pdfbox               Convert PDFs to SVG-Text, SVG-graphics and Images.
  pixel                Analyzes bitmaps - both binary (black/white), but may be oligochrome.
  regex                Searches with regex.
  search               Searches text (and maybe SVG).
  section              Splits all divisions in XML files into sections <using XPath.
  summary              Summarizes the CTree files into a single toplevel CProject directory tree.Used to be hardcoded ,
                         but now can be controlled by glob
  svg                  Takes raw SVG from PDF2SVG and converts into structured HTML and higher graphics primitives.
  table                Writes cProject or cTree to summary table.
  transform            Runs XSLT transformation on XML (NYFI).
  words                Analyzes word frequencies.
  help                 Displays help information about the specified command
  generate-completion  Generate bash/zsh completion script for ami.

ami section:

General syntax: ami -p <cproject> section

The command is used to divide downloaded papers into sections- front, body, back, floats and groups.
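
To make the idea of sectioning concrete, the sketch below splits a tiny article into its top-level parts with Python's standard XML parser. It only illustrates the concept and is not how `ami section` is implemented. The sample document is invented, and the element names (`front`, `body`, `back`, `floats-group`) are assumed to follow the JATS layout used by EuropePMC fulltext XML.

```python
import xml.etree.ElementTree as ET

# Invented miniature article; real fulltext XML files are far larger.
SAMPLE_ARTICLE = """<article>
  <front><article-meta><title-group><article-title>Example title</article-title></title-group></article-meta></front>
  <body><sec><title>Introduction</title><p>Lantana camara is invasive.</p></sec></body>
  <back><ref-list><ref id="r1"/></ref-list></back>
  <floats-group><fig id="f1"/></floats-group>
</article>"""


def split_top_level_sections(xml_text):
    """Map each top-level child tag of <article> to its whitespace-normalised text."""
    root = ET.fromstring(xml_text)
    sections = {}
    for child in root:
        # Join the text fragments with spaces, then collapse runs of whitespace.
        sections[child.tag] = " ".join(" ".join(child.itertext()).split())
    return sections


sections = split_top_level_sections(SAMPLE_ARTICLE)
for name, text in sections.items():
    print(name, "->", text)
```

Once the parts are separated, a search can be restricted to one of them, for example only the body text.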

ami search:

General syntax: ami -p <cproject directory> search --dictionary <path>

It searches the corpus for the dictionary keywords and creates a frequency data table and histogram.
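
The frequency table produced by a dictionary search can be pictured with a few lines of Python. This is a simplified sketch of the idea (case-insensitive whole-phrase matching) and not the algorithm used by `ami search`. The sample text and the three terms are illustrative, and the helper name `count_dictionary_terms` is hypothetical.

```python
import re
from collections import Counter


def count_dictionary_terms(text, terms):
    """Count case-insensitive whole-phrase occurrences of each term in the text."""
    counts = Counter()
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


# Illustrative text and terms only.
text = ("Lantana camara spreads quickly. Prosopis juliflora and lantana camara "
        "compete with native plants.")
terms = ["Lantana camara", "Prosopis juliflora", "Kudzu"]

counts = count_dictionary_terms(text, terms)
for term, n in counts.most_common():
    print(f"{term}: {n}")
```

Terms with a count of zero are kept in the table, which makes it easy to see which dictionary entries never occur in a corpus.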

ami dict:

General syntax: amidict -vv --dictionary <name of dictionary> --directory <path of the directory folder> --input <SPARQL endpoint output file> create --informat wikisparqlxml --sparqlmap wikidataURL=item,name=itemLabel,term=itemLabel --transformName wikidataID=EXTRACT(wikidataURL,.*/(.*))

The command is used to convert the sparql output of SPARQL endpoint into dictionary format.
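
The conversion can be sketched in Python to show what the mapping does: each result row of the SPARQL XML output becomes one dictionary entry, and the Wikidata ID is taken from the last part of the item URL. This is an illustration under stated assumptions, not the amidict implementation. The sample result is invented (its label is a placeholder), the regular expression is an assumed equivalent of the EXTRACT transform, and the output field names simply mirror those in the command above.

```python
import re
import xml.etree.ElementTree as ET

# Namespace of the W3C SPARQL Query Results XML Format.
NS = {"s": "http://www.w3.org/2005/sparql-results#"}

# Invented one-row result; the label is a placeholder, not the item's real label.
SAMPLE_RESULTS = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="item"/><variable name="itemLabel"/></head>
  <results>
    <result>
      <binding name="item"><uri>http://www.wikidata.org/entity/Q310208</uri></binding>
      <binding name="itemLabel"><literal xml:lang="en">Example label</literal></binding>
    </result>
  </results>
</sparql>"""


def sparql_xml_to_entries(xml_text):
    """Turn each SPARQL result row into a dict of dictionary-entry fields."""
    root = ET.fromstring(xml_text)
    entries = []
    for result in root.findall("s:results/s:result", NS):
        row = {}
        for binding in result.findall("s:binding", NS):
            value = binding.find("s:uri", NS)
            if value is None:
                value = binding.find("s:literal", NS)
            row[binding.get("name")] = value.text if value is not None else ""
        url = row.get("item", "")
        match = re.match(r".*/(.*)", url)  # keep what follows the last slash
        entries.append({
            "wikidataURL": url,
            "wikidataID": match.group(1) if match else "",
            "name": row.get("itemLabel", ""),
            "term": row.get("itemLabel", ""),
        })
    return entries


entries = sparql_xml_to_entries(SAMPLE_RESULTS)
print(entries[0])
```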

pyami/ ami_gui.py:

It is the Python version of the ami tool. It helps to read and analyse data by displaying search frequency graphs.
To run ami_gui.py, the following repositories must first be cloned onto the local machine from the command line.

git clone:

Each repository is cloned by giving this command at the command prompt (cmd): git clone <URL of the repository>

dictionary: git clone https://github.com/petermr/dictionary
CEVOpen: git clone https://github.com/petermr/CEVOpen
openDiagram: git clone https://github.com/petermr/openDiagram (8)

Running: ami_gui.py after cloning the repositories.

Command on cmd: C:\Users\HP PC\openDiagram\physchem\python>python ami_gui.py

Figure 5: Showing a tk window for ami_gui.py run for invasive species against invasive corpora for different sections.

WIKI DATA

It is a free and open knowledge base accessible to both humans and machines. It acts as central storage for the data of projects including Wikipedia, Wiktionary, Wikisource, and others. It is a document-oriented database, focusing on items, properties and identifiers (QID).

Link for wiki data main page: https://www.wikidata.org/wiki/Wikidata:Main_Page

Figure 6: Showing wikidata search page finding information of Lantana camara (aggressive invasive species).

Wiki data SPARQL query service

SPARQL is a semantic query language for formulating queries against knowledge databases.

Link for wiki data query service web page: https://query.wikidata.org/

Figure 7: showing wikidata query service page.

search_lib.py

To run search_lib.py

Give command on cmd- C:\Users\HP PC\openDiagram\physchem\python>python search_lib.py

Test report for search_lib.py : https://github.com/petermr/openDiagram/wiki/Test-Report-for-Search_lib

The basic workflow of the tools:

This is the process workflow for searching open repositories using linked data dictionaries, based on wiki data, retrieving information, and performing machine learning analysis to produce catalogs and knowledge graphs.

Figure 8: Showing workflow of the tools used for the work (By Radhu Ladani)

Chapter 4: WORK STRATEGY

5.1 Creation of Dictionary “plant_invasive”

A dictionary is a collection of terms accompanied by supporting information such as descriptions and Wikidata IDs. The purpose of the project dictionaries is to:

• Identify words and phrases (“entities”) within the documents.

• Establish connections between their meaning and context (“ontologies”).

• Assemble a subset of terms that express a high-level concept of plant invasive species and their properties.

Structure of dictionary:

The dictionary is stored as XML and JSON files. This section describes the specific elements and their associated attributes.
Dictionary/Title: the root element, carrying the title, which must be a single word and must match the base of the filename.
Header/Description: there are zero or more description elements in the header. These can include metadata about dates, maintenance and provenance.
Entry/Body: a dictionary’s primary component is its entries. An entry is a well-defined object that is typically associated with a Wikidata item. This assigns it a unique identifier (Q-number), obviating the need for ongoing identifier maintenance.
The dictionary consists of 469 plant invasive species terms, with their Wikidata IDs and synonyms in English as well as in other languages such as Chinese and French.
Link to the summary of dictionary content: https://github.com/petermr/CEVOpen/wiki/Miniproject:-Invasive-species
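
The structure described above (a titled root element, an optional description and a list of entries) can be sketched with Python's standard XML library. This is a minimal illustration and not the project's dictionary generator. The element name `desc`, the attribute names and the description text are assumptions, and the Wikidata ID shown is a placeholder, not a real identifier.

```python
import xml.etree.ElementTree as ET


def build_dictionary(title, description, entries):
    """Build a dictionary element with one description and a list of entries."""
    if len(title.split()) != 1:
        raise ValueError("the title must be a single word")
    root = ET.Element("dictionary", {"title": title})
    desc = ET.SubElement(root, "desc")  # the element name "desc" is an assumption
    desc.text = description
    for attributes in entries:
        ET.SubElement(root, "entry", attributes)
    return root


# Placeholder entry: the wikidataID below is not a real identifier.
entries = [{"term": "lantana camara", "name": "Lantana camara", "wikidataID": "Q_PLACEHOLDER"}]
root = build_dictionary("plant_invasive", "Example dictionary of invasive plants", entries)

filename = root.get("title") + ".xml"  # the title is also the base of the filename
xml_text = ET.tostring(root, encoding="unicode")
print(filename)
print(xml_text)
```

Deriving the filename from the title keeps the two in step, which is the constraint stated for the root element.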

Procedure to create dictionary-

Step 1. A list of 469 plant invasive species was downloaded from the GISD database site. https://github.com/petermr/CEVOpen/tree/master/dictionary/Invasive_species (file name export_gisd (1).csv )
Step 2. The Wikidata ID of each plant was recorded and a SPARQL query was created to build the dictionary ‘invasive_plant.xml’
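
With several hundred Wikidata IDs, typing the query by hand is error-prone, so the item list can be generated. The sketch below composes a simplified label query from a list of IDs. It is an assumption-laden illustration and not the exact query used in this work: the helper name `build_label_query` is hypothetical, and the query selects only the item and its label. The three IDs in the example are taken from the project's list.

```python
import re


def build_label_query(qids, language="en"):
    """Compose a SPARQL query that returns the label of each listed Wikidata item."""
    for qid in qids:
        if not re.fullmatch(r"Q[1-9][0-9]*", qid):
            raise ValueError(f"not a Wikidata QID: {qid}")
    # dict.fromkeys drops repeated IDs while keeping their original order.
    values = " ".join(f"wd:{qid}" for qid in dict.fromkeys(qids))
    return (
        "SELECT DISTINCT ?item ?itemLabel WHERE {\n"
        f"  VALUES ?item {{ {values} }}\n"
        f'  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}". }}\n'
        "}"
    )


query = build_label_query(["Q2086536", "Q190887", "Q310208", "Q190887"])
print(query)
```

The printed text can be pasted into the Wikidata query service page. Validating each ID first catches copy-and-paste mistakes before the query is sent.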

Creation of dictionary- Procedure:

Go to wiki data query page: https://query.wikidata.org/ and create a SPARQL query.
Run the following SPARQL query:

## Selecting the preferred label
SELECT Distinct * WHERE {
  VALUES ?item {
wd:Q2086536 wd:Q190887 wd:Q310208 wd:Q1316058 wd:Q932729 wd:Q2701053 wd:Q2673511 wd:Q311430 wd:Q386585 wd:Q1621617 wd:Q2666767 wd:Q402385 wd:Q149401 wd:Q4672006 wd:Q136648 wd:Q15634290 wd:Q26745 wd:Q2717414 wd:Q2706362 wd:Q161246 wd:Q161115 wd:Q159221 wd:Q2732431 wd:Q4692133 wd:Q1949712 wd:Q159717 wd:Q27835 wd:Q159570 wd:Q1482093 wd:Q750307 wd:Q2717846 wd:Q1160961 wd:Q2834268 wd:Q160097 wd:Q156904 wd:Q2703227 wd:Q1472735 wd:Q3534965 wd:Q682164 wd:Q161568 wd:Q40051889 wd:Q309785 wd:Q4763286 wd:Q2353550 wd:Q275620 wd:Q848254 wd:Q311521 wd:Q2860589 wd:Q87594900 wd:Q163675 wd:Q2716675 wd:Q161114 wd:Q1816048 wd:Q28367 wd:Q311451 wd:Q5712311 wd:Q2875199 wd:Q1366575 wd:Q3219428 wd:Q6395156 wd:Q26158 wd:Q22111300 wd:Q426965 wd:Q158397 wd:Q814421 wd:Q2930211 wd:Q10944544 wd:Q162271 wd:Q4117077 wd:Q159066 wd:Q12345850 wd:Q164128 wd:Q161538 wd:Q2265513 wd:Q18461 wd:Q786072 wd:Q2720939 wd:Q26615 wd:Q163958 wd:Q163559 wd:Q158048 wd:Q12207493 wd:Q15547168 wd:Q1431280 wd:Q1368577 wd:Q1768699 wd:Q857220 wd:Q310961 wd:Q48996866 wd:Q2998776 wd:Q2943755 wd:Q163004 wd:Q259033 wd:Q2715047 wd:Q50840658 wd:Q4925284 wd:Q41530893 wd:Q41531244 wd:Q50840675 wd:Q15629297 wd:Q4115083 wd:Q848784 wd:Q2068262 wd:Q9311819 wd:Q36125 wd:Q289811 wd:Q2583146 wd:Q5114539 wd:Q577669 wd:Q164574 wd:Q158722 wd:Q27282 wd:Q341600 wd:Q21177 wd:Q1524349 wd:Q3281716 wd:Q160100 wd:Q2979028 wd:Q727330 wd:Q2712208 wd:Q5149520 wd:Q5152272 wd:Q19848765 wd:Q15248508 wd:Q5173212 wd:Q470109 wd:Q161406 wd:Q160221 wd:Q135531 wd:Q15397864 wd:Q3005741 wd:Q15528510 wd:Q687435 wd:Q5199880 wd:Q7186137 wd:Q41137071 wd:Q381584 wd:Q1391422 wd:Q145781 wd:Q5247707 wd:Q161735 wd:Q1157813 wd:Q2702202 wd:Q311432 wd:Q150385 wd:Q311239 wd:Q1196309 wd:Q15533996 wd:Q24192276 wd:Q21162077 wd:Q163649 wd:Q690645 wd:Q157969 wd:Q181318 wd:Q367242 wd:Q165403 wd:Q11126839 wd:Q33466 wd:Q159760 wd:Q549727 wd:Q1926297 wd:Q41505 wd:Q157726 wd:Q306504 wd:Q159331 wd:Q159331 wd:Q744339 wd:Q148882 wd:Q15597725 wd:Q15535258 wd:Q2740464 wd:Q621082 
wd:Q5458608 wd:Q146684 wd:Q146136 wd:Q5494014 wd:Q164832 wd:Q164101 wd:Q3018415 wd:Q311194 wd:Q151046 wd:Q161879 wd:Q50380327 wd:Q50380330 wd:Q50410083 wd:Q1989614 wd:Q1634712 wd:Q847582 wd:Q1394846 wd:Q49550144 wd:Q376320 wd:Q3782768 wd:Q26354 wd:Q3926743 wd:Q1994229 wd:Q3926753 wd:Q3142116 wd:Q164149 wd:Q826602 wd:Q2716358 wd:Q20757547 wd:Q13919449 wd:Q931460 wd:Q2717256 wd:Q158110 wd:Q164091 wd:Q164181 wd:Q158913 wd:Q1661596 wd:Q158289 wd:Q47140799 wd:Q421057 wd:Q158035 wd:Q654064 wd:Q693409 wd:Q272754 wd:Q311175 wd:Q21175 wd:Q15504069 wd:Q11060111 wd:Q3163048 wd:Q148097 wd:Q822052 wd:Q311188 wd:Q311632 wd:Q6471912 wd:Q15345636 wd:Q332469 wd:Q15600070 wd:Q32465 wd:Q149265 wd:Q5364126 wd:Q35905 wd:Q15228001 wd:Q15228001 wd:Q1074201 wd:Q1848856 wd:Q3339674 wd:Q1768430 wd:Q157078 wd:Q10743709 wd:Q709649 wd:Q161083 wd:Q1076276 wd:Q29907 wd:Q15321400 wd:Q159413 wd:Q3338251 wd:Q5235063 wd:Q157513 wd:Q15129340 wd:Q1813408 wd:Q15227704 wd:Q162171 wd:Q158596 wd:Q6812536 wd:Q6820031 wd:Q6820033 wd:Q140905 wd:Q5699638 wd:Q2717688 wd:Q2719490 wd:Q2235813 wd:Q148532 wd:Q1073621 wd:Q158517 wd:Q15377318 wd:Q157307 wd:Q899338 wd:Q160407 wd:Q158130 wd:Q158875 wd:Q21310666 wd:Q20480616 wd:Q1040814 wd:Q13426501 wd:Q882908 wd:Q847939 wd:Q159743 wd:Q843726 wd:Q2355390 wd:Q15479065 wd:Q37083 wd:Q30166 wd:Q13936842 wd:Q310979 wd:Q144412 wd:Q311192 wd:Q141416 wd:Q162795 wd:Q7115068 wd:Q1640921 wd:Q15550928 wd:Q1209690 wd:Q3024698 wd:Q3595850 wd:Q3510828 wd:Q7142304 wd:Q7142302 wd:Q156790 wd:Q1766333 wd:Q3448230 wd:Q2068761 wd:Q135365 wd:Q42418587 wd:Q836721 wd:Q157419 wd:Q27657 wd:Q607380 wd:Q28557 wd:Q11090755 wd:Q3027867 wd:Q165227 wd:Q158468 wd:Q12024 wd:Q2900918 wd:Q271582 wd:Q3281616 wd:Q654001 wd:Q160343 wd:Q7199152 wd:Q15590374 wd:Q15247575 wd:Q4217123 wd:Q157571

  }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en".
    ?item rdfs:label ?itemLabel;
      skos:altLabel ?itemAltLabel;
      schema:description ?itemDescription.
  }
  OPTIONAL {
    ?wikipedia schema:about ?item;
      schema:isPartOf <https://en.wikipedia.org/>.
  }
  OPTIONAL {
    ?hiwikipedia schema:about ?item;
      schema:isPartOf <https://hi.wikipedia.org/>.
  }
  OPTIONAL {
    ?tawikipedia schema:about ?item;
      schema:isPartOf <https://ta.wikipedia.org/>.
  }
  OPTIONAL {
    ?eswikipedia schema:about ?item;
      schema:isPartOf <https://es.wikipedia.org/>.
  }
  OPTIONAL {
    ?frwikipedia schema:about ?item;
      schema:isPartOf <https://fr.wikipedia.org/>.
  }
  OPTIONAL {
    ?dewikipedia schema:about ?item;
      schema:isPartOf <https://de.wikipedia.org/>.
  }
  OPTIONAL {
    ?zhwikipedia schema:about ?item;
      schema:isPartOf <https://zh.wikipedia.org/>.
  }
  OPTIONAL {
    ?urwikipedia schema:about ?item;
      schema:isPartOf <https://ur.wikipedia.org/>.
  }
  OPTIONAL { ?item wdt:P627 ?IUCN_taxon_ID. }
  OPTIONAL { ?item wdt:P225 ?taxon_name. }
  OPTIONAL { ?item wdt:P1843 ?taxon_common_name. }
}
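
The VALUES list in the query is long and repeats a few identifiers (for example, wd:Q159331 appears twice). The following is a minimal Python sketch, assuming the Wikidata IDs are kept in a plain list, of how such a clause could be generated with duplicates removed. The function name build_values_clause is hypothetical and is not part of any CEVOpen tool.

```python
def build_values_clause(qids, variable="item"):
    """Return a SPARQL VALUES clause for the given Wikidata IDs.

    Duplicates are dropped while the original order is kept.
    """
    unique = list(dict.fromkeys(qids))  # dict keys preserve insertion order
    terms = " ".join(f"wd:{qid}" for qid in unique)
    return f"VALUES ?{variable} {{ {terms} }}"


if __name__ == "__main__":
    # Two IDs taken from the list above, one of them repeated.
    print(build_values_clause(["Q159331", "Q159331", "Q744339"]))
```

The generated clause can be pasted into the query in place of the hand-written list.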

After the query returned results, the SPARQL endpoint link was used to download the results as a SPARQL file.
Open the SPARQL file in Notepad.
Use amidict for SPARQL mapping with the following command:

amidict -vv --dictionary invasive_plants --directory plantinvasivespecies --input sparql/sparql4 create --informat wikisparqlxml --sparqlmap wikidataURL=item,wikipediaPage=wikipedia,name=itemLabel,term=itemLabel,Description=itemDescription,_p627_iucn_taxonid=IUCN_taxon_ID,_p225_taxon_name=taxon_name,_p1843_taxon_common_name=t --transformName wikidataID=EXTRACT(wikidataURL,.*/(.*)) --synonyms=itemAltLabel

Commit changes in GitHub.
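
The --sparqlmap option pairs each dictionary attribute with a SPARQL variable, and --transformName derives the wikidataID from the item URL. The sketch below illustrates both ideas in Python. It is only an illustration, not amidict's implementation: the helper names are hypothetical, and it assumes that EXTRACT keeps the first capture group of the regular expression.

```python
import re


def parse_sparqlmap(spec):
    """Split 'attribute=variable,attribute=variable' into a dict."""
    return dict(pair.split("=", 1) for pair in spec.split(","))


def unknown_variables(mapping, query_variables):
    """Return mapped SPARQL variables that the query does not define."""
    return sorted(set(mapping.values()) - set(query_variables))


def extract_id(url, pattern=r".*/(.*)"):
    """Assumed behaviour of EXTRACT(wikidataURL,.*/(.*)): first capture group."""
    match = re.fullmatch(pattern, url)
    return match.group(1) if match else None


if __name__ == "__main__":
    mapping = parse_sparqlmap("wikidataURL=item,name=itemLabel,term=itemLabel")
    print(unknown_variables(mapping, ["item", "itemLabel", "itemDescription"]))
    print(extract_id("http://www.wikidata.org/entity/Q159331"))
```

A check of this kind flags any mapped variable whose spelling differs from the variable names used in the query, before the dictionary is built.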
The dictionary Invasive_plant has about 460 invasive plant entries.
Attributes in the dictionary are: Wikidata IDs, Wikidata URLs, description, common name, plant IUCN status, image, map view, taxon, taxon common name, synonyms, etc. for all the species mentioned. However, "map view" is not present for every species.

Figure 9: Showing invasive plant dictionary page.

The dictionary provides the following attributes, as well as metadata about the different entities:

The description parameter is a string that describes the entry. It is frequently generated directly from Wikidata.

The name is the preferred name for the term. It is case-sensitive and frequently appears in the text; the name and the term may or may not be identical.
The taxon_common_name is the common name of the plant taxon.
The Wikidata ID and URL are the Wikidata item's identifiers.
The Wikipedia page is referred to as Wikipedia. It is frequently used as a term that gives detailed information about the species.
The map view gives information about the geographical location of the plant.
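
As a rough illustration of how these attributes could be read programmatically, the sketch below parses a small dictionary fragment with Python's standard library. The XML layout (entry elements carrying attributes, with synonym child elements) and all values in the sample are assumptions made for the example; the real dictionary file may be organised differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: attribute names follow the list above, but the
# layout and the placeholder values are assumptions, not real data.
SAMPLE = """
<dictionary title="invasive_plant">
  <entry name="Lantana camara" term="Lantana camara"
         wikidataID="Q_PLACEHOLDER" description="placeholder description">
    <synonym>placeholder synonym</synonym>
  </entry>
</dictionary>
"""


def load_entries(xml_text):
    """Return one dict per entry, with its synonyms collected in a list."""
    root = ET.fromstring(xml_text.strip())
    entries = []
    for entry in root.iter("entry"):
        record = dict(entry.attrib)
        record["synonyms"] = [s.text for s in entry.iter("synonym")]
        entries.append(record)
    return entries


if __name__ == "__main__":
    for record in load_entries(SAMPLE):
        print(record["name"], record["wikidataID"], record["synonyms"])
```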

Dictionary link: https://github.com/petermr/CEVOpen/tree/master/dictionary/Invasive_species

5.2 Test search tools with DICTIONARY against MINICORPUS

Mini corpus ‘invasive’ was developed:

• Query: getpapers -q "(invasive plant species)" -x -p -o invasive -k 100 -f invasive/log.txt

• The command getpapers starts the process. -q specifies the query to be searched, entered in quotation marks as in "(invasive plant species)". -o specifies the output directory, and the parameter that follows it is the name of the directory, which is invasive in our case. -x and -p request that XML and PDF files be included in the download, -k 100 limits the search to 100 files, and -f writes a log to the given file.

• getpapers was used to create the corpus of plant invasive species.

• An HTML database was created using ami search.

Command: C:\Users\HP PC\CEVOpen\minicorpora>ami -p "invasive" search --dictionary invasive_plant.xml eo_activity.xml plant_compound.xml
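
To make the download step repeatable, the getpapers options described above can be assembled from named parameters. The sketch below is one possible way to do this in Python; the function getpapers_args is hypothetical, and it only builds the argument list, so it runs without getpapers being installed.

```python
import shlex


def getpapers_args(query, outdir, limit, xml=True, pdf=True, logfile=None):
    """Build the getpapers argument list from named parameters."""
    args = ["getpapers", "-q", query, "-o", outdir, "-k", str(limit)]
    if xml:
        args.append("-x")  # include full-text XML
    if pdf:
        args.append("-p")  # include PDF files
    if logfile:
        args += ["-f", logfile]  # write a log file
    return args


if __name__ == "__main__":
    args = getpapers_args("(invasive plant species)", "invasive", 100,
                          logfile="invasive/log.txt")
    # Print a shell-quoted form of the command for the notebook record.
    print(shlex.join(args))
```

If getpapers is installed, the same list could be passed to subprocess.run(args, check=True) to start the download.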

Link to the corpora: https://github.com/petermr/CEVOpen/tree/master/minicorpora

Additionally, we are looking at other dictionaries, such as country, to gain insights from the open scientific literature into the association between invasive plants and their geographical locations.

The link for other dictionaries: https://github.com/petermr/CEVOpen/tree/master/dictionary

Chapter 6: RESULT AND DISCUSSION

The following results were obtained from the materials and methods described above; they enable us to address the scientific question regarding plant invasive species and the countries associated with them.

Result of ami section

On successful completion of the ami section command, the papers are divided into the following sections:

Figure 10: Showing ami section result of the downloaded paper.

Result of ami search

On successful completion of the ami search command, a complete data table is generated, with the search results presented in rows. (Link: https://github.com/petermr/CEVOpen/blob/master/dictionary/Invasive_species/full.dataTables.html )

Figure 11: Showing data table picture.

Result of ami_gui.py

On successful completion of the ami_gui.py command for the corpus ‘invasive’ with the Invasive species dictionary, histograms are produced for the various sections of the papers. After analysing the data, a list of 25 plant invasive species was prepared that were common to various sections of the papers, such as title, table and references.

Figure 12: Designed Excel Sheet for ami_gui.py results, containing the species which are present in different sections of downloaded papers along with the new species and the words used to address invasive species.

Link to the Excel sheet: https://github.com/petermr/CEVOpen/blob/master/dictionary/Invasive_species/Plant_invasive%20(1).xlsx

Figure 13: Bar Chart representing the name of plant invasive species present in the downloaded literature.

The bar chart above shows that the ‘abstract’ and ‘title’ sections of the papers contain most of the names of plant invasive species. For example, all 25 species listed appeared in the ‘abstract’ section.

Figure 14: The bar chart represents the comparative visualisation of plant invasive species with the countries.

According to the bar chart above, the United States, Europe and China are the regions most frequently mentioned in the downloaded papers.

Link for the spreadsheet data used to make bar chart for plant invasive species and countries mentioned: https://github.com/petermr/CEVOpen/blob/master/dictionary/Invasive_species/plantinvasivespecies_country_data.csv

DISCUSSION

Results were obtained through fully automated metadata extraction from a corpus of plant invasive species literature in the supported XML and PDF formats. The corpus was downloaded with the getpapers toolkit, and the ami section command divides the articles into sections such as front, body, back, floats, and groups, each of which contains unique insights. The open scientific literature on plant invasive species and countries was analysed using ami search, which reveals associations between different invasive plant species and the countries mentioned.
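
The idea of associating species with countries can be illustrated with a simple co-occurrence count. The sketch below is not how ami search works internally; it assumes plain case-insensitive substring matching over document text, and the function name and the sample sentences are invented for the example.

```python
from collections import Counter
from itertools import product


def cooccurrence(documents, species_terms, country_terms):
    """Count documents in which a species and a country both appear."""
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        found_species = [s for s in species_terms if s.lower() in lowered]
        found_countries = [c for c in country_terms if c.lower() in lowered]
        # Every species/country pair found in this document scores once.
        counts.update(product(found_species, found_countries))
    return counts


if __name__ == "__main__":
    # Invented sentences standing in for paper sections.
    docs = [
        "Lantana camara is invasive in Australia.",
        "Lantana camara spreads in India and Australia.",
    ]
    result = cooccurrence(docs, ["Lantana camara"], ["Australia", "India"])
    for (species, country), n in result.most_common():
        print(species, country, n)
```

A table of such counts is the kind of input a species-by-country bar chart needs.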

According to the data from the "invasive" corpus, species such as Lantana camara, Prosopis juliflora and Alliaria petiolata are the most studied and are counted among the world's most aggressive plant invasive species. The United States and Europe are the regions most frequently mentioned in the downloaded literature. One possible reason is the high rate of tourism across these countries, which can lead to species invasion, but further studies are required before any conclusion can be drawn. Many federal and state agencies are involved in regulating invasive species, yet there are still no accepted standards for regulating all types of invasive species in all geographic areas. The information resources included here come from the open access literature and are provided for information only; they should not be considered complete, as the dictionary and the resources are subject to change over time or as new information is added. The present study can be extended to retrieve information about the chemical compounds present in or extracted from invasive plant species, which will help in studying their spread.

FUTURE PROSPECTS

The present data, collected using the “invasive” corpus of 100 papers, will be used for geolocation tagging with GeoPandas and will be presented on a world map depicting the regions. The work will be continued by Bhavini Malhotra with the help of other co-interns. I think the map view presentation will give readers a different view and increase their interest in the study.

Here are a few examples of map view presentations that I created using a Microsoft Excel feature. They show the countries mentioned in the downloaded literature, marked with numbers and shades of a single colour. A higher number and a darker shade indicate that a country was mentioned more often in the downloaded literature for a particular invasive plant species.

Figure 15: An example of map view presentation for Alliaria petiolata and countries mentioned: United States, Canada, New Zealand, South Africa, Czech Republic.

Figure 16: An example of map view presentation for Prosopis juliflora and countries mentioned: United States, China, Australia, India, Germany, South Africa, Sri Lanka, Turkey.

Figure 17: An example of map view presentation for Impatiens glandulifera and countries mentioned: United States, China, Australia, India, New Zealand, South Africa, France.

Figure 18: An example of map view presentation for Lantana camara and countries mentioned: United States, China, Australia, New Zealand.

Figure 19: An example of map view presentation for Lythrum salicaria and countries mentioned: United States, Australia, New Zealand, South Africa, Argentina.

This is just an initial prototype presentation, which will be modified later by co-interns using GeoPandas. The link for GeoPandas mapping is: https://geopandas.org/docs/user_guide/mapping.html
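
The shading used in the prototype maps can be reproduced by binning mention counts into a small number of shade levels. The sketch below shows one simple way to do this in plain Python. The function name is hypothetical, and the counts in the example are placeholders chosen for illustration, not values from the study.

```python
def shade_levels(counts, levels=4):
    """Map each country's mention count to a shade index (0 = lightest)."""
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = high - low or 1  # avoid division by zero when all counts match
    return {
        country: min(levels - 1, (value - low) * levels // span)
        for country, value in counts.items()
    }


if __name__ == "__main__":
    # Placeholder counts, for illustration only.
    example = {"United States": 9, "Canada": 5, "New Zealand": 1}
    for country, shade in shade_levels(example).items():
        print(country, shade)
```

In GeoPandas, the counts could instead be joined to country geometries and passed as the column argument of GeoDataFrame.plot, which assigns the colours itself.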

Chapter 7: CONCLUSION

CEVOpen's goal is to work as an open-source project for discovering, aggregating, and semantically enriching scholarly documents that contain significant amounts of information about plant invasive species, such as their geographical locations and the countries they are found in. The objective of the project is to create an automated system capable of reading scientific literature and extracting its structure and meaning, with a particular emphasis on plant invasive species.

Analysis of invasive plant species was based on open access literature, with a particular emphasis on their geographical location or countries. This was made possible by creating a dictionary for searching the scientific literature and by developing a corpus of invasive plant species, with the information retrieved with the assistance of getpapers. The project’s main findings are based on open access literature on plant invasive species and the countries mentioned.

Chapter 8: REFERENCES

CEVOpen repository, Dr. Peter Murray-Rust, https://github.com/petermr/CEVOpen
Quote about invasive species by Sean Hanna, https://www.quotetab.com/quote/by-sean-hanna/once-an-invasive-species-arrives-its-about-impossible-to-get-rid-of-it
Time report about invasive species spread, https://time.com/4375137/invasive-species-united
Rosadiuk, E. “6 Of the World's Most Invasive Plant Species.” The Richest, 9 Aug. 2014, www.therichest.com/rich-list/6-of-the-worlds-most-invasive-plant-species/.
Zevit, Pamela & Bio, R. (2021). Battling the Alien Invasion! An overview of invasive plant species impacts in the Georgia Basin
Langmaier M and Lapin K (2020) A Systematic Review of the Impact of Invasive Alien Plants on Forest Regeneration in European Temperate Forests. Front. Plant Sci. 11:524969. doi: 10.3389/fpls.2020.524969
Natalie Van Hoose, Global forecast assesses countries’ invasive species risk, response capacity https://phys.org/news/2016-08-global-countries-invasive-species-response.html
Langmaier M and Lapin K (2020) A Systematic Review of the Impact of Invasive Alien Plants on Forest Regeneration in European Temperate Forests. Front. Plant Sci. 11:524969. doi: 10.3389/fpls.2020.524969
https://www.nwf.org/Educational-Resources/Wildlife-Guide/Threats-to-Wildlife/Invasive-Species.
Pyšek P, Jarošík V, Hulme PE, Pergl J, Hejda M, Schaffner U, Vilà M. A global assessment of invasive plant impacts on resident species, communities and ecosystems: the interaction of impact measures, invading species' traits and environment. Glob Chang Biol. 2012 May;18(5):1725–37. doi: 10.1111/j.1365-2486.2011.02636.x. PMCID: PMC3597245
Martin, P.A., Shackelford, G.E., Bullock, J.M. et al. Management of UK priority invasive alien plants: a systematic review protocol. Environ Evid 9, 1 (2020). https://doi.org/10.1186/s13750-020-0186-y
Control mechanisms, National Invasive Species Information Center, https://www.invasivespeciesinfo.gov/subject/control-mechanisms
openDiagram repository, Dr. Peter Murray-Rust, https://github.com/petermr/openDiagram
openVirus repository, Dr. Peter Murray-Rust, https://github.com/petermr/openVirus

Video link

https://drive.google.com/file/d/1gJd47RJK79S0szlrFgAzZJEwK-GYFQZk/view?usp=sharing