-
Notifications
You must be signed in to change notification settings - Fork 4
Classifying sequences with treesapp assign
TreeSAPP's gene-centric classification workflow assigns both a gene name and taxonomic label to amino acid sequences. It is capable of classifying either genomic or proteomic sequences in this manner, and they can be derived from genomes (isolate, single-cell amplified or metagenome-assembled) or metagenomes. The TreeSAPP subcommand used for classifying sequences is treesapp assign
. Nucleotide reference packages are currently not supported.
Supported versions >=0.11.0
The classification workflow begins with conceptually translating open reading frames (ORFs) using Prodigal 1 if the input FASTA or FASTQ file contains nucleotide sequences. This step is not necessary otherwise.
The protein ORFs are then queried against a set of profile hidden Markov models (HMMs) using hmmsearch
from the HMMER suite 2. Proteins that are deemed homologous are then divided into groups based on the regions they aligned to in the profile HMM and these groups are used in a profile alignment with hmmalign
to generate a multiple sequence alignment with reference and query sequences. The evolutionary placement algorithm (EPA-NG) is used to insert the query sequences into the reference phylogeny 3. Each placement, called a PQuery, is assigned a likelihood weight ratio, and distal and pendant length distances 4. These are used for further filtering out non-homologous sequences.
PQueries are assigned a taxonomic label based on the lowest common ancestor (LCA) of their descendants. Furthermore, a taxonomic rank is recommended by a linear model that models the correlation between taxonomic rank and phylogenetic distance. This is useful in situations where a PQuery is placed on an edge with few children such that the LCA is highly resolved but the phylogenetic distance between the query and ancestral reference state is large.
Finally, classified sequences are written to a classification table and PQueries are concatenated into a single JPlace file for each reference package's classifications. The latter can be used for visualization with iTOL 5.
Here is a diagram of the workflow:
For a list of all available arguments use treesapp assign -h
.
treesapp assign
only requires a single FASTA or FASTQ file containing either nucleotide or amino acid sequences.
It is capable of detecting the molecule type automatically, but this can also be indicated explicitly through the -m/--molecule
argument with the choices 'dna' and 'prot'. By default the output directory './output' would be written to unless a directory path is provided using the argument -o/--output
.
treesapp assign -i my.fasta -o treesapp_output/
In addition to the basics, we tend to also extract phylogenetically informative regions from multiple sequence alignments using BMGE with the flag --trim_align
6. This speeds up the phylogenetic placement stage without significantly harming classification performance (recall and precision).
Another well-used option is --num_procs
to determine the number of threads available to Prodigal (FASTA is split into chunks and Prodigal is run in parallel), HMMER, and EPA-NG.
If you're not interested in using the reference packages provided in the GitHub version (maybe you've happened upon our RefPkgs repository) and you want to classify sequences using different ones, you can direct TreeSAPP to these with the argument --refpkg_dir
.
treesapp assign
identifies homologs using all reference packages by default, although in many circumstances this is not necessary or desired. The argument for selecting the subset of reference packages to query is -t
. This requires one or more comma-separated reference package names (typically the short-form gene or protein name).
The selection can be found by running treesapp info -v
. The bottom chunk of standard output lists the gene/protein names mapped to reference package codes, descriptions and other information. Here is an example:
Name Molecule Tree builder RefPkg-type Leaf nodes Description Created Last-updated
DsrAB prot FastTree functional 560 Dissimilatory sulfite reductase, alpha and beta subunits 2021-04-19
HydA prot FastTree functional 93 [FeFe] hydrogenase, group A, B1, B3 2021-04-21
McrA prot FastTree functional 204 Methyl coenzyme M reductase alpha subunit 2021-04-23
McrB prot FastTree functional 122 Methyl-coenzyme M reductase beta subunit 2021-04-19
McrG prot FastTree functional 123 Methyl-coenzyme M reductase gamma subunit 2021-04-19
NapA prot FastTree functional 748 Periplasmic nitrate reductase, large subunit 2021-04-21
NifD prot FastTree functional 844 nitrogenase molybdenum-iron protein alpha chain 2021-04-22
NirK prot FastTree functional 325 nitrite reductase (NO-forming) 2021-04-21
NirS prot FastTree functional 523 nitrite reductase (NO-forming) / hydroxylamine reductase 2021-04-21
NorB prot FastTree functional 742 nitric oxide reductase large subunit B [COG3256] 2021-04-22 2021-04-22
NosZ prot FastTree functional 326 Nitrous-oxide reductase (TAT-dependent|Sec-dependent) 2021-04-22 2021-04-23
NxrA prot FastTree functional 866 nitrate reductase / nitrite oxidoreductase, alpha subunit 2021-04-23
NxrB prot FastTree functional 519 nitrate reductase / nitrite oxidoreductase, beta subunit 2021-04-22
To analyze just the McrABG subunits (last three rows), you could include -t McrA,McrB,McrG
in your treesapp assign
command.
treesapp assign
is also capable of assigning FPKM or TPM abundance values to classified sequences when the --rel_abund
flag is invoked. At least one FASTQ file is also required (using the -r
argument), and a second if the reads are paired-end and not interleaved (path provided to -2
). The --pairing
option is used to specify whether the library is paired-end ('pe') or single-end ('se').
BWA is used to both index the classified nucleotide ORF sequences, and align reads to this index 7. samsum
, a Python package dependency, parses the SAM file written and calculates normalized relative abundance values of each ORF.
Note: This routine is only available if a genome was provided and Prodigal was run - we cannot map DNA reads to protein sequences.
treesapp assign
supports restarting from incomplete output (i.e. checkpoints), as described in the general notes.
Additionally, the --stage
argument can be used to control what stages are completed, where all stages up to and including the one specified are ran.
There are a variety of output files generated by treesapp assign
in three subdirectories: final_outputs
, intermediates
and iTOL_output
.
The most useful files are the classification table and FASTA file containing all of the classified query sequences, already truncated to include just the homologous region, in the final_ouputs/
directory.
The intermediates/
directory contains many other files such as intermediate JPlace files, MSAs, and more. If the --delete
flag was invoked, however, this directory was removed.
The inputs for visualization with iTOL are located in iTOL_output/
.
The table containing functional and taxonomic classifications for the queries is final_outputs/classifications.tsv
under the output directory. Each row represents the classification for a single query sequence with the gene/protein name and taxonomic label it was classified as. Chances are, you are just interested in the first six columns.
Field | Description |
---|---|
Sample |
Name of the sample the ORF originated from. This is taken from the name of the FASTA file provided to -i /--fasta_input except when the --fpkm flag is used, in which case the name is taken from the FASTQ file(s). |
Query |
Name of the query protein sequence or conceptually translated ORF sequence. |
Marker |
Name of the reference package (i.e. gene or protein name) the query sequence was classified as. If you are unfamiliar with this name, treesapp info -v can give you a brief description. |
Start_pos |
Left-most alignment position on the query sequence to the reference profile HMM. * |
End_pos |
Length of the query sequence that was aligned to the reference profile HMM and subsequently used for phylogenetic placement. This could be, and usually is, a subsequence of the full-length ORF or protein sequence. |
Taxonomy |
Taxonomic label assigned to the query sequence based on LCA alone and the rank recommended by the reference package's linear model. ** |
Abundance |
The abundance of the query sequence in the dataset. If --fpkm was not used it is set to 1, otherwise it is a fragments per kilobase per million mappable reads (FPKM) value. |
E-value |
The Expectation value for the aligned query sequence calculated by hmmsearch . |
iNode |
The internal node number representing where on the reference phylogeny the query sequence was placed. |
LWR |
Likelihood weight ratio assigned to the query by EPA-NG. |
EvoDist |
Sum of all phylogenetic distances calculated by EPA-NG as well as the average node-to-tip distances. |
Distances |
A comma-separated list of the distances that summed to equal EvoDist
|
Notes:
* The total homologous sequence length was substituted for the Start_pos
and End_pos
fields in version 0.8.7
** The LCA-only taxonomy was removed in version 0.8.7
Within the root output directory, there is a directory called iTOL_output/
. Within there is a directory for each of the reference packages that query sequences were assigned to and within those are files you can use to visualize your placements in iTOL.
If you have an iTOL account already you should login first and once you are in your workspace click-and-drag the JPlace file from one of the iTOL output refpkg directories into a workspace. If you do not have an iTOL account (and don't want to get one) you can navigate to the "Upload" button at the top of the page and upload the JPlace file through that.
Once you are viewing the phylogeny you should toggle on the "Phylogenetic Placements" dataset. Then, to finish uploading the annotation files, click-and-drag the remaining files in that directory (e.g. simplebar.txt, labels.txt) onto the webpage et voila, you have a b-e-a-utiful phylogeny showing you where sequences in your dataset were placed.
To paint the reference package's phylogenetic tree by taxonomy in iTOL you will need to create the colour_strip.txt and colours_style.txt files using treesapp colour
. These files are also imported by clicking-and-dragging onto the browser window.
-
Hyatt, D., Chen, G.-L., Locascio, P. F., Land, M. L., Larimer, F. W., & Hauser, L. J. (2010). Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11, 119. https://doi.org/10.1186/1471-2105-11-119
-
Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics (Oxford, England), 14(9), 755–763. https://doi.org/btb114
-
Barbera, P., Kozlov, A. M., Czech, L., Morel, B., & Stamatakis, A. (2018). EPA-ng: Massively Parallel Evolutionary Placement of Genetic Sequences. Systematic Biology, 0(0), 291658. https://doi.org/10.1101/291658
-
Matsen, F. A., Hoffman, N. G., Gallagher, A., & Stamatakis, A. (2012). A format for phylogenetic placements. PLoS ONE, 7(2), 1–4. https://doi.org/10.1371/journal.pone.0031009
-
Letunic, I., & Bork, P. (2019). Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Research, 47(W1), W256–W259. https://doi.org/10.1093/nar/gkz239
-
Criscuolo, A., & Gribaldo, S. (2010). BMGE (Block Mapping and Gathering with Entropy): A new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evolutionary Biology, 10(1). https://doi.org/10.1186/1471-2148-10-210
-
Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. ArXiv Preprint ArXiv, 00(00), 3. https://doi.org/arXiv:1303.3997