ShweataNHegde edited this page Jun 6, 2021 · 35 revisions

Welcome to the CEVOpen wiki! This page outlines the key components of the project and is intentionally kept short. To learn more, browse the wiki pages of this repository and of openVirus (https://github.com/petermr/openVirus/wiki).

1. Main components of intern activity:

1.1. Technology

1.1.1. (py)getpapers, ami

  • pygetpapers is the scraper developed in Python by Ayush Garg. It is based on getpapers (https://github.com/ContentMine/getpapers), which was written in Node.js. pygetpapers downloads scientific papers, primarily from the Europe PMC repository. You can read more about it here.
  • pyami (still a prototype; needs more documentation) is currently being developed by Peter Murray-Rust. It's a new Python-based open-source universal reader and analyser for scientific literature. Source code can be found here.

1.1.2. Dictionaries - Ontologies

Currently, our projects are based on building dictionaries. Each intern has their own dictionary, usually on a topic relevant to essential oils (EOs). The current list is:

  • Radhu Ladani - oil-producing plants
  • Radhu Ladani - biological activities of EOs
  • Kanishka Parashar - invasive plant species
  • Talha Hasan - EO compounds
  • Vasant Kumar - Plant parts
  • Shweata Hegde - Plant Genera
  • Shweata Hegde - organizations (e.g. Research Funders, Universities)
  • Ambreen Hamadani - country

1.1.3. How are the dictionaries created?

  • Most dictionaries are created from Wikidata SPARQL queries. You can take a look at individual dictionary wiki pages to know more.
  • You can also refer to this slide deck to understand the basics.
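As a rough illustration of the Wikidata route, the sketch below builds a SPARQL query for all items that are instances of a given class and turns results in the standard SPARQL JSON layout into simple dictionary entries. The function names, the entry fields (`term`, `wikidata_id`) and the sample data are assumptions made for this example; the actual queries used for each dictionary are on the individual dictionary pages. `P31` is Wikidata's "instance of" property and `Q11173` is "chemical compound".

```python
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata query service


def build_query(class_qid: str, language: str = "en", limit: int = 100) -> str:
    """Return a SPARQL query listing items that are instances of class_qid."""
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f"  ?item wdt:P31 wd:{class_qid} .\n"
        f'  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}". }}\n'
        "}\n"
        f"LIMIT {limit}"
    )


def bindings_to_entries(results: dict) -> list:
    """Turn SPARQL JSON results into simple entries (label + Wikidata ID)."""
    entries = []
    for row in results["results"]["bindings"]:
        uri = row["item"]["value"]
        entries.append({
            "term": row["itemLabel"]["value"],
            # The Wikidata ID is the last path segment of the entity URI
            "wikidata_id": uri.rsplit("/", 1)[-1],
        })
    return entries


# Hand-written sample in the SPARQL JSON results layout (placeholder, not real output)
sample = {
    "head": {"vars": ["item", "itemLabel"]},
    "results": {"bindings": [
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q0000001"},
         "itemLabel": {"type": "literal", "value": "example term"}},
    ]},
}

if __name__ == "__main__":
    print(build_query("Q11173", limit=5))
    print(bindings_to_entries(sample))
```

The query text can be pasted into the Wikidata Query Service to inspect the results before turning them into a dictionary.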

1.2. Mini-projects

  • chemotype
  • genotype
  • activities (medicinal)
  • phenotype - invasive species integration - how these fit together - an atlas

2. Prerequisites

Python is essential to run all of our software. Ensure you've installed it before proceeding further.

2.1. Install

2.1.1. pygetpapers (https://github.com/petermr/pygetpapers)

Run the following command on your command line to install pygetpapers:

pip install git+https://github.com/petermr/pygetpapers

If you have trouble installing using this method, you can find alternatives here.
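To confirm that the prerequisites are in place, a small check like the one below may help. It is only a sketch: the minimum Python version (3.7) is an assumption rather than a documented requirement, and the check assumes that the installed package and its command-line entry point are both named `pygetpapers`.

```python
import importlib.util
import shutil
import sys


def check_environment(package: str = "pygetpapers", min_python=(3, 7)) -> dict:
    """Report whether Python is recent enough and the package looks installed."""
    return {
        # min_python is an assumed threshold, not taken from the project docs
        "python_ok": sys.version_info >= min_python,
        # True if `import <package>` would find the module
        "package_importable": importlib.util.find_spec(package) is not None,
        # True if a command with the same name is on the PATH
        "command_on_path": shutil.which(package) is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

If `package_importable` is reported as "no", the install step above did not complete in the Python environment you are currently using.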

2.1.2. ami_gui.py

  • git clone https://github.com/petermr/openDiagram.git
  • Though ami_gui.py runs on the command line, you will have to make some changes to the source code to point the software to where all the projects outlined below lie on your local machine. PyCharm is recommended to edit the source code.

2.2. git clone

The project has gradually expanded and branched out into different research areas, so our work is spread across several repositories. These repositories hold the latest dictionaries, mini-corpora and software. To run ami_gui.py, you will have to clone (i.e., download to your local machine) the following repositories:

3. Overall Goal

To build a multilingual semantic Atlas of Volatile Phytochemistry.[1]

3.1. Subgoals

To build open-source, multiplatform tools which can discover, aggregate, clean, and semantify scholarly documents containing significant amounts of phytochemical VOCs[2]. Documents will cover the extraction and assay of oils, optionally with properties and activities.

3.2. About CEVOpen

Phytochemistry is the key component of this project and in the main, we will be analysing:

  • compounds (mainly VOC). Includes synonyms, structures, images
  • plants that create VOC/essential oils, again many synonyms, includes images
  • locations where the plant was harvested
  • activities reported for the oils
  • organizations involved

We will be analysing corpora for instances of the above, manually to validate the process and then automatically.
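A minimal sketch of what "automatic" analysis can look like at its simplest: counting whole-word, case-insensitive occurrences of dictionary terms in a piece of text. This is an illustration only, not how ami is implemented; the function name, the three-term compound list and the sample sentence are made up for the example.

```python
import re
from collections import Counter


def count_terms(text: str, terms: list) -> Counter:
    """Count case-insensitive whole-word occurrences of each dictionary term."""
    counts = Counter()
    for term in terms:
        # \b keeps "limonene" from matching inside a longer word
        pattern = r"\b" + re.escape(term) + r"\b"
        hits = re.findall(pattern, text, flags=re.IGNORECASE)
        if hits:
            counts[term] = len(hits)
    return counts


# Illustrative mini-dictionary and text (not taken from the project corpora)
compounds = ["limonene", "linalool", "alpha-pinene"]
text = ("The essential oil was rich in limonene and linalool. "
        "Limonene content varied with harvest location.")

if __name__ == "__main__":
    print(count_terms(text, compounds))
```

Comparing counts like these against a manually annotated mini-corpus is one way to validate the automatic step.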

3.3. Tools include:

  • APIs for repositories such as EPMC, bioRxiv preprints, and thesis collections.
  • Scrapers for semi-structured sites such as journals
  • standardised metadata (e.g. JATS)
  • PDF and HTML readers => XML or JSON
  • article sectioning (e.g. into JATS categories)
  • extraction of floats (tables, maps, images, diagrams, chemistry, maths*)
  • display and navigation of sections in a paper
  • aggregated statistics and machine learning
  • multilingual annotation (using Wikidata)
  • linking to the Wikidata knowledge graph

[*] not included in CEVOpen but extensible in future
[1] we need an engaging title. "Atlas" is often extended beyond maps (e.g. Atlas of The Human Body). For example, plantPart is an atlas of the plant. It works for me but may confuse others. Here are some ideas:

  • "Compendium of ..."
  • "Semantic Essence of phytochemistry". Essence == central meaning, and also volatiles

But please think creatively.

[2] Volatile Organic Compound
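To make the "article sectioning" item in 3.3 concrete, here is a small sketch that splits a JATS-style XML article into its top-level sections using only the Python standard library. The sample document and the function name are invented for illustration; real JATS files carry far more markup, and this is not the sectioning code used by the project's tools.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written document using real JATS element names (article, body, sec, title, p)
SAMPLE_JATS = """<article>
  <front><article-meta><title-group>
    <article-title>Example article</article-title>
  </title-group></article-meta></front>
  <body>
    <sec><title>Introduction</title><p>Background text.</p></sec>
    <sec><title>Methods</title><p>Oil was extracted by hydrodistillation.</p></sec>
    <sec><title>Results</title><p>Yield is reported here.</p></sec>
  </body>
</article>"""


def split_sections(jats_xml: str) -> dict:
    """Map each top-level section title to the text of its paragraphs."""
    root = ET.fromstring(jats_xml)
    sections = {}
    for sec in root.findall("./body/sec"):
        title = sec.findtext("title", default="(untitled)")
        # itertext() also collects text inside inline markup such as <italic>
        paragraphs = ["".join(p.itertext()).strip() for p in sec.findall("p")]
        sections[title] = " ".join(paragraphs)
    return sections


if __name__ == "__main__":
    for title, body in split_sections(SAMPLE_JATS).items():
        print(f"{title}: {body}")
```

Once an article is split this way, a dictionary search can be restricted to, say, the Methods section only.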

3.4. Required actions:

  • Coordination of EO-related and general dictionaries - conformance to a common standard.
  • Validation of gold-standard minicorpora (e.g. for training and validating machine learning)
  • If you are interested in contributing to the project on the Machine Learning front, you can take a look at the Our-Project-and-Machine-Learning page.

3.5. Update (2021-06-06):

We have a new set of interns joining us. Here we are summarizing goals for the next 6 months:

3.5.1. STRATEGY

With the new intake of interns, we are expanding our strategy for the next 6 months. We have been joined by Chaitanya Sharma, Bhavini Malhotra and Sagar Jadhav, and we are hoping to appoint another intern (InternX) shortly.

3.5.2. GOALS

The goal of these 6 months is to consolidate our current dictionaries, corpora, and code, and then to explore how they can be used. We'll think of this as a guide to phytochemistry of essential oils ("Atlas", "Compendium", etc.). Each of you will be creating a specific part of this and/or coordinating and customising it for a wide range of audiences.


4. Outreach

We've presented our work (mostly on openVirus) at various venues, including WikiCite, COAR and BarCamp. You can take a look at our Outreach page. If you're new to the project, going through our presentations is probably the best way to start understanding the pipeline.


5. Code of Conduct

All interns, volunteers and contributors should adhere to the code of conduct, outlined here. In short, it says "be respectful and helpful towards everyone".
