crop5

crop5 miniproject template

Five miniprojects for DBT/KARYA interns, 2021-09.
Duration: 2 months.
Each intern chooses a project from a list of 7 crops (see the TIGR2ESS 2019 workshop).
The project is phased, and some phases are iterative.

We take Maize (Zea mays, Zm) as a typical project. Each intern will substitute their crop.

Manually and rapidly (within hours) assess whether the literature on Zm + terpene synthases (TPS) is large enough to be useful. If not, select another plant. This may need communal discussion.
Each intern builds separate mini-dictionaries for:
- Zm genes or enzymes keyed on enzyme name. Start with Sagar's dictionary. We want to find what is mentioned in the literature.
- Zm enzyme products (mainly terpenes)
Search EPMC (Europe PMC) using the mini-dictionaries to assess scope and feasibility.
Increase the size or precision of the dictionaries by snowballing (particularly important for abbreviations, if they are common).
Refine the minicorpus to contain high-precision content on Zm enzymes. At this stage the minicorpus will be a collection of papers that are primarily about terpene synthases and their products in Zm.
Communally compare dictionaries and corpora (mainly by term frequency) to decide:
- which TPS are most important in each plant
- which terpenes are most important in each plant
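The term-frequency comparison can be sketched in a few lines of Python. This is only an illustration, not part of the ami toolchain: the function name `term_frequencies` and the toy terms and texts are hypothetical, and a real comparison would read the dictionary terms and the minicorpus text from files.

```python
from collections import Counter
import re


def term_frequencies(terms, texts):
    """Count case-insensitive whole-term occurrences of each term across texts."""
    counts = Counter()
    for term in terms:
        # Lookarounds stop a term from matching inside a longer word.
        pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
        counts[term] = sum(len(pattern.findall(text)) for text in texts)
    return counts


# Hypothetical toy data standing in for one crop's dictionary and minicorpus.
zm_terms = ["ZmTPS10", "linalool", "beta-caryophyllene"]
zm_texts = [
    "ZmTPS10 produces (E)-beta-farnesene; linalool was also detected.",
    "Linalool and beta-caryophyllene were emitted after herbivory.",
]

# Print the terms from most to least frequent.
for term, n in term_frequencies(zm_terms, zm_texts).most_common():
    print(term, n)
```

Running the same count for each crop gives ranked lists that can be compared side by side in the communal discussion.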

Each intern has their own wiki (e.g. Zea_mays).
They record everything daily on the wiki. For large data they create a subdirectory (see the TIGR2ESS projects at https://github.com/petermr/tigr2ess/tree/master/crops ).
They give a daily standup report with links to the wiki.
Create an initial minicorpus (Zm100) whose main purpose is to snowball terms, abbreviations, etc.
Create skeleton dictionaries by searching Zm100. The goal is to find out which genes or compounds are most frequently reported and what syntaxes are used.
- TPSgene => Zm100gene. Initially a list of enzyme names. Gradually add synonyms, new enzymes and abbreviations. NOTE: The genes may or may not have the form ZmTPS, ZmHMGR, etc. Abbreviations may be standard or highly variable. This will be messy, but valuable.
- eo_compounds => Zm100terp. A list of compounds created by terpene synthases. There will be many synonyms and possibly some abbreviations.
Each intern has a major project that they are responsible for, and a minor project that they help with.

Generating hand-created terms in a text file
Install pygetpapers and ami (see https://github.com/petermr/pygetpapers/blob/main/README.md ).
Run a pygetpapers query:

pygetpapers -q "terpene synthase volatile Camellia AND (((SRC:MED OR SRC:PMC OR SRC:AGR OR SRC:CBA) NOT (PUB_TYPE:"Review")))" -o CamelliaTPS -x -p -s

This creates a folder named "CamelliaTPS" containing the downloaded papers.

Interns can also use the following queries.

"terpene synthase volatile Mentha"

"terpene synthase volatile Citrus sinensis"

"terpene synthase volatile Zea mays"

"terpene synthase volatile Vitis vinifera"
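Since these queries differ only in the crop name, the command lines can be generated rather than typed by hand. The following is a minimal sketch under stated assumptions: the source/review filter and the `-x -p -s` flags are copied from the Camellia command above, while `build_command` and the output folder names are hypothetical. It only prints the commands; it does not run pygetpapers.

```python
import shlex

# Filter copied from the Camellia query: four EPMC sources, reviews excluded.
EPMC_FILTER = '(((SRC:MED OR SRC:PMC OR SRC:AGR OR SRC:CBA) NOT (PUB_TYPE:"Review")))'


def build_command(crop, output_dir):
    """Return a POSIX-shell-quoted pygetpapers command line for one crop."""
    query = f"terpene synthase volatile {crop} AND {EPMC_FILTER}"
    args = ["pygetpapers", "-q", query, "-o", output_dir, "-x", "-p", "-s"]
    return " ".join(shlex.quote(arg) for arg in args)


# Hypothetical output folder names, following the "CamelliaTPS" pattern.
crops = {
    "Mentha": "MenthaTPS",
    "Citrus sinensis": "CitrusTPS",
    "Zea mays": "ZeaMaysTPS",
    "Vitis vinifera": "VitisTPS",
}

for crop, output_dir in crops.items():
    print(build_command(crop, output_dir))
```

`shlex.quote` wraps the query in single quotes, so the inner double quotes around "Review" survive in a POSIX shell; on Windows the quoting would need to be adapted.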
Focus only on research articles.
Go through each paper, using Ctrl+F to search for TPS.
Collect gene name terms such as CsTPS, MonoTPS and so on. Put those terms into an Excel file as a list, then save the file as a plain-text file named gene.txt.
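As a possible aid to the manual search, candidate gene terms can be pulled out of text with a regular expression and written to a term list. This is a rough sketch, not a replacement for reading the papers: the pattern (up to six letters, then "TPS", then optional digits) is an assumption about how the names look, the helper names are hypothetical, and the output should be checked by hand before it becomes gene.txt.

```python
import re
import tempfile
from pathlib import Path

# Assumed name shape: optional short letter prefix + "TPS" + optional digits.
GENE_PATTERN = re.compile(r"\b[A-Za-z]{0,6}TPS\d*\b")


def harvest_gene_terms(text):
    """Return the sorted, de-duplicated TPS-like tokens found in text."""
    return sorted(set(GENE_PATTERN.findall(text)))


def write_term_list(terms, path):
    """Write one term per line to a plain-text file."""
    Path(path).write_text("\n".join(terms) + "\n", encoding="utf-8")


# Hypothetical sentence standing in for the text of a paper.
sample = "CsTPS1 and MonoTPS were up-regulated, whereas CsTPS1 and TPS21 were not."
terms = harvest_gene_terms(sample)
print(terms)

# Written to a temporary folder here; in practice the target would be gene.txt.
out_path = Path(tempfile.mkdtemp()) / "gene.txt"
write_term_list(terms, out_path)
print(out_path)
```

The pattern will miss names that do not contain "TPS" (for example ZmHMGR), so those still need to be collected manually.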
Use this command to create a dictionary:

amidict -v --dictionary eo_Gene --directory gene --input gene.txt create --informat list --outformats xml
Create a corpus using this command:

pygetpapers -q "terpene synthase TPS plant volatile" -o TPSvolatile -x -p -k <number of papers>
Test the dictionary. First split the papers in the corpus into sections:

ami -p "TPSvolatile" section

Then search the corpus with the dictionary:

ami -p "TPSvolatile" search --dictionary eo_Gene.xml