Dictionary: Overview

Giulia edited this page Feb 2, 2021 · 1 revision

This overview is a modified version of the dictionary overview given in the openVirus project.

Purpose

The purpose of Dictionaries in the CEVOpen project is:

  • to identify words and phrases ("entities") in the documents (running text and images).
  • to provide (computable) links to their meaning and context ("ontologies").
  • to collect a subset of terms representing a high-level concept ("species", "pests", "chemical compound",...).

The benefits include:

  • understanding the meanings of words.
  • background reading.
  • aggregation ("searching") for the same or related entities in the corpus (collection of documents).
  • building computable knowledge networks/graphs.
  • classifying documents.

This can be described as ontological annotations in semantic networks.

Possible uses

There are many established uses of such annotations:

Improved reading

We are often put off by unfamiliar terms, e.g. "trichome". Wikipedia has an article on trichomes (https://en.wikipedia.org/wiki/Trichome):

Trichomes (/ˈtraɪkoʊmz/ or /ˈtrɪkoʊmz/), from the Greek τρίχωμα (trichōma) meaning "hair", are fine outgrowths or appendages on plants, algae, lichens, and certain protists. They are of diverse structure and function. Examples are hairs, glandular hairs, scales, and papillae.

Shown as a mouseover or footnote, such a definition can dramatically improve reading speed.

Searching and indexing

Annotations are easily aggregated in indexes or search engines.

Precision and checking

People may confuse trees (a group of diverse organisms) with plants (one of life's kingdoms which includes land plants and certain algae).

Relations between entities

As an example from Wikipedia (https://en.wikipedia.org/wiki/Phytophthora_infestans)

Phytophthora infestans is an oomycete or water mold, a microorganism that causes the serious potato and tomato disease known as late blight or potato blight.

This sentence links potato blight to Phytophthora infestans. Indeed we can write:

  • Potato blight isA disease
  • Potato blight isCausedBy Phytophthora infestans

Ami's annotations allow software to discover and use such relations. For example, we can find all diseases that are caused by (isCausedBy) oomycetes.
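
A minimal sketch of how such relations could be queried, assuming they are held as plain (subject, predicate, object) triples. The three statements are taken from the Wikipedia sentence above; the list layout and the function name diseases_caused_by are illustrative assumptions and are not part of ami.

```python
# Statements read off the Wikipedia sentence quoted above.
# The triple layout itself is an assumption made for this sketch.
TRIPLES = [
    ("potato blight", "isA", "disease"),
    ("potato blight", "isCausedBy", "Phytophthora infestans"),
    ("Phytophthora infestans", "isA", "oomycete"),
]


def diseases_caused_by(kind, triples):
    """Return the diseases whose causal agent isA `kind`, sorted by name."""
    is_a = {(s, o) for s, p, o in triples if p == "isA"}
    agents = {s for s, o in is_a if o == kind}
    diseases = {s for s, o in is_a if o == "disease"}
    return sorted(
        s for s, p, o in triples
        if p == "isCausedBy" and s in diseases and o in agents
    )


print(diseases_caused_by("oomycete", TRIPLES))
```

With only these three statements the query returns potato blight. A real knowledge graph would typically get the isA links from Wikidata rather than from a hand-written list.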

Disambiguation

What's "moss"? https://en.wikipedia.org/wiki/Moss_(disambiguation) tells us:

Moss is a small, soft, non-vascular plant that does not have flowers or seeds.

Moss may also refer to:

  • Moss (language), a musical language designed by Jackson Moore
  • Moss Bros, a menswear outfitters in the United Kingdom
  • Moss Brothers Aircraft, an English aircraft manufacturer (1936–1955)
  • Moss FK, a Norwegian football club

... and many more ...

We can label the different concepts by using a unique identifier system as in Wikidata.

Structure of a dictionary

Dictionaries have a simple format, best supported by XML or JSON (currently mainly XML). This defines certain elements and attributes (in <element att1="attval1" att2="attval2" ... > ). We are developing validation software. In general:

  • unknown elements are ignored
  • <desc>, <entry> and <alternative> are optional and repeatable.
  • all attributes except dictionary/@title are optional (at this stage)
  • order of elements and attributes is irrelevant (but worth making pretty and consistent)

Dictionary/title

This is the root element and contains the title which MUST be a single word and MUST be the base of the filename, e.g. pests.xml must have the structure

<dictionary title="pests">
...
</dictionary>

There is no XML namespace.
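
The title rule can be checked mechanically. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the function name check_title and the wording of the messages are assumptions made for illustration, and this is not the validation software mentioned above.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def check_title(filename, xml_text):
    """Return a list of problems with the root element (empty if none)."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != "dictionary":
        problems.append(f"root element is <{root.tag}>, expected <dictionary>")
    title = root.get("title")
    if title is None:
        problems.append("missing title attribute")
    else:
        # The title MUST be a single word ...
        if len(title.split()) != 1:
            problems.append("title is not a single word")
        # ... and MUST be the base of the filename.
        base = Path(filename).stem
        if title != base:
            problems.append(f"title {title!r} does not match filename base {base!r}")
    return problems


print(check_title("pests.xml", '<dictionary title="pests"></dictionary>'))
print(check_title("pests.xml", '<dictionary title="plants"></dictionary>'))
```

The first call reports no problems; the second reports that the title does not match the filename base.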

Header/description

There is a header of zero or more <desc> description elements, though we may enforce mandatory elements later. These can describe metadata such as dates, maintenance, provenance, authors etc. They are not yet standardised but will be. Here is a snippet from the eoPlant dictionary (contains plant species names):

<dictionary title="eoPlant">
<desc>A dictionary of 1678 plant names extracted mentioned in the 186 test articles downloaded from PubMed. Of the 1678 entries, 1567 had their names normalized and tagged with corresponding Wikidata IDs</desc>
<authors>Dr. Gitanjali Yadav, Ph.D., Computational Biology Laboratory, NIPGR National Institute of Plant Genome Research, Lecturer, University of Cambridge Dept. of Plant Sciences; Ambarish Kumar</authors>
<contributors>Shruthi Mohan; Emanuel Arruda, President https://www.verriclar.com, https://www.verriclar.com.br/; Peter Murray-Rust, Reader Emeritus in Molecular Informatics, Unilever Centre, Dept. Of Chemistry University of Cambridge</contributors>
<datasource>http://www.nipgr.ac.in/Essoildb/</datasource>
</dictionary>

Entry/body

The main components of a dictionary are its entries, whose format is still evolving slightly. An entry is a well-defined object which can normally be mapped / linked to a Wikidata item. This gives it a unique identifier (Q-number), removing the need to maintain identifiers. A typical entry (with the new element alternative and more use of desc with new attributes):

<dictionary title="miniterpenes">
  <entry term="borneol" wikipedia="borneol" wikidata="Q27089413" name="(-)-borneol" description="chemical compound" id="CM.myterpenes.0" term.hi="बोर्निऑल" term.it="borneolo" term.zh="冰片" regex="(\([+-]\)\-)?[Bb]orneol">
    <desc date="2020-07-22">added Bornyl-alcohol synonym</desc>
    <alternative>(-)-Bornyl alcohol</alternative>
  </entry>
...
</dictionary>

Entry attributes

  • the term is the unique lexical string (word) defining the entry. Terms are always lowercase and always start with a letter. The term may or may not be the linguistic entity in documents.
  • the name is the preferred name for the term. It is case-sensitive and will often occur in text. The name and the term may or may not be identical words.
  • term.xx can occur as language equivalents where xx is the appropriate 2- or 3-letter language code. See https://en.wikipedia.org/wiki/ISO_639-2. These can often be picked up from the links to Wikipedia pages from a Wikidata item (bottom of page). (Experimental).
  • regex is a regular expression for locating possible matches in text. The one in the example entry is meant to find (-)-borneol, (+)-borneol, and borneol.
  • description is a human-readable string describing the entry. However it is often created directly from Wikidata and may be used for grouping or disambiguation.
  • wikipedia is the name of the Wikipedia page. It is often the term (for single words). It may not have spaces and may have escaped punctuation. It resolves to (e.g. for EN) https://en.wikipedia.org/wiki/<wikipedia>.
  • wikidata is the identifier of the Wikidata item, always of the form Qddddd.. (occasionally Pddd...). It resolves to https://wikidata.org/wiki/<wikidata>. There is only one identifier for a Wikidata item and the relationships and graphs are language-independent.
  • id is a local autogenerated ID and is not stable.
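
As a rough illustration of how these attributes might be consumed, the sketch below parses a cut-down borneol entry with Python's standard library, builds the two links, collects the term.xx language equivalents and applies the regex to some text. The helper names (links, translations, find_matches) and the sample sentence are assumptions made for this example. The regex is written here as (\([+-]\)\-)?[Bb]orneol so that it matches the three forms listed above.

```python
import re
import xml.etree.ElementTree as ET

# Cut-down version of the example entry (attribute values from the text above).
ENTRY_XML = r'''<entry term="borneol" wikipedia="borneol" wikidata="Q27089413"
  name="(-)-borneol" term.it="borneolo" regex="(\([+-]\)\-)?[Bb]orneol"/>'''

entry = ET.fromstring(ENTRY_XML)


def links(entry):
    """Build the Wikipedia (EN) and Wikidata URLs for an entry."""
    return {
        "wikipedia": "https://en.wikipedia.org/wiki/" + entry.get("wikipedia"),
        "wikidata": "https://wikidata.org/wiki/" + entry.get("wikidata"),
    }


def translations(entry):
    """Map language code to term for every term.xx attribute."""
    prefix = "term."
    return {k[len(prefix):]: v for k, v in entry.attrib.items() if k.startswith(prefix)}


def find_matches(entry, text):
    """Return every substring of text matched by the entry's regex."""
    pattern = re.compile(entry.get("regex"))
    return [m.group(0) for m in pattern.finditer(text)]


text = "Both (-)-borneol and (+)-borneol occur; Borneol is also used."
print(links(entry))
print(translations(entry))
print(find_matches(entry, text))
```

Because all attributes are optional at this stage, real code would need to handle entries where wikipedia, wikidata or regex is missing; the sketch assumes they are present.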

Children of entry

We are introducing two children of entry:

  • <desc> has the same semantics as <desc> for dictionary.
  • <alternative> elements are alternative lexical forms for the term. There are deliberately no semantics. They may or may not be exact synonyms, and may or may not be narrower/broader terms. These ontological relations can often be obtained from Wikidata.

Using dictionaries

  • dictionaries will provide search terms (term, name, regex, alternative) for ami, Lucene/Solr or KNIME.
  • dictionaries provide a link to Wikipedia pages or Wikidata Items. Annotation software can create hyperlinks for humans to follow.
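
The two uses above can be sketched together: collect the search strings from a dictionary, then report which of them occur in a piece of text along with a Wikidata link. This is a simplified, case-insensitive substring search written for illustration only; it is not how ami, Lucene/Solr or KNIME work internally, it ignores the regex attribute, and the function names search_terms and annotate are assumptions.

```python
import xml.etree.ElementTree as ET

DICTIONARY_XML = '''<dictionary title="miniterpenes">
  <entry term="borneol" name="(-)-borneol" wikidata="Q27089413">
    <alternative>(-)-Bornyl alcohol</alternative>
  </entry>
</dictionary>'''


def search_terms(dictionary_xml):
    """Map each lowercased search string to the Wikidata ID of its entry."""
    root = ET.fromstring(dictionary_xml)
    lookup = {}
    for entry in root.findall("entry"):
        strings = [entry.get("term"), entry.get("name")]
        strings += [alt.text for alt in entry.findall("alternative")]
        for s in strings:
            if s:  # attributes are optional, so skip missing values
                lookup[s.strip().lower()] = entry.get("wikidata")
    return lookup


def annotate(text, lookup):
    """Return sorted (search string, Wikidata URL) pairs found in the text."""
    lowered = text.lower()
    return sorted(
        (s, "https://wikidata.org/wiki/" + q)
        for s, q in lookup.items()
        if q and s in lowered
    )


lookup = search_terms(DICTIONARY_XML)
print(annotate("The oil contained (-)-Bornyl alcohol.", lookup))
```

Here the text is matched only through the alternative form, which shows why alternatives are useful as search strings even though they carry no semantics of their own.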

Creating dictionaries

Conventional dictionaries take a lot of effort to create and maintain, particularly if they contain ontological relationships. Often only specialist maintainers can do this. ContentMine dictionaries avoid this problem by reducing the task to selecting relevant terms. Often this selection has already been made, in Wikipedia pages or other collections. Many dictionaries are thus "views" (subsets) of Wikidata. There are several ways of creating them (see other sections of this wiki).
