Skip to content

Analysis run through

Rutger Vos edited this page Jun 5, 2015 · 40 revisions

A typical SUPERSMART run consists at least of the steps described on this page, in the order in which they are given here. The examples, such as for the primates and the piperaceae, follow these same steps:

  1. smrt taxize
  2. ... classify
  3. ... align
  4. ... orthologize
  5. ... bbmerge
  6. ... bbinfer
  7. ... bbreroot
  8. ... bbcalibrate
  9. ... consense
  10. ... bbdecompose
  11. ... clademerge
  12. ... cladeinfer
  13. ... cladegraft

The steps in a typical run are thus the execution of subcommands of the smrt program. For example, the first step is the smrt taxize subcommand. To find out how to use these subcommands when working in the environment they provide a usage help message that is displayed when issuing either smrt help <subcommand>, or smrt <subcommand> --help.

In the descriptions below, in addition to all required arguments all optional arguments are also described. For the most part these are names for input and output files. Unless you have good reasons it is probably best to go with these defaults because other subcommands downstream also assume them.

The behavior of some commands is additionally influenced by certain variables (in ALL CAPS), as described for the relevant subcommands. These variables can be specified either for your entire SUPERSMART installation, or just for a specific run. To update a variable for your entire installation, run the smrt-config command, which opens an editor to update the configuration file supersmart.ini. To update a variable for a specific run, you can set it as an environment variable with the SUPERSMART_ prefix. For example, to change the number of parallel nodes that a certain command can use to 2 (by default this is NODES=4), do this:

export SUPERSMART_NODES=2

Lastly, when running, each subcommand will issue a large stream of logging messages, which are colored in the terminal window. Green and blue messages (informational and debugging, respectively) simply inform of the progress of the analysis. Yellow warning messages indicate recoverable issues, though make sure that you understand what the messages mean. Red error messages identify serious, non-recoverable issues.

smrt taxize

This step takes either a text file with a list of species names, or a higher taxon name provided as a command line argument, and performs taxonomic name resolution. The result is a tab separated spreadsheet where each row represents a species that has been resolved, and the columns give the higher taxa (genus, family, etc.) for this species. To run taxize on a file, do as follows:

smrt taxize -i <filename>

where <filename> is the name of a text file that has on each line a species name to resolve. To run taxize on a higher taxon, do as follows:

smrt taxize -r <taxon>

where <taxon> is a higher taxon name, such as Primates.

Optional arguments

-o <outfile>

Specifies a different name for the output spreadsheet. The default is species.tsv

-e <rank>

Specifies to which taxonomic rank to expand the higher taxon. Typically the default, i.e. species, works best. In any case it has to be a taxonomic rank that is actually used by NCBI to annotate individual sequences with, so this could be below species level (e.g. a subspecies) but not above it.

smrt classify

This step creates a Newick tree file that represents the taxonomy for the species spreadsheet. As such, this tree file will have many polytomies (e.g. if more than two genera are in the same family), no branch lengths, and many unbranched internal nodes (e.g. for a monotypic genus there would be a nesting where the genus node has only a single child). This tree is used to generate starting trees.

Optional arguments

-i <infile>

Specifies a different name for the species spreadsheet. Default is species.tsv

#####-o <outfile> Specifies a different name for the output tree file. Default is classification-tree.dnd

smrt align

This step selects and aligns all phylota clusters that are relevant for the input species. During this step, many FASTA files are produced, as well as an index file called aligned.txt. The FASTA files are named after the phylota cluster they are based on, so the name consists of <seed gi>-<MRCA taxon ID>-<cluster ID>-<cluster type>.fa.

Optional arguments

-i <infile>

Specifies a different name for the species spreadsheet. Default is species.tsv

-o <outfile>

Specifies a different name for the output index file. Default is aligned.txt

smrt orthologize

The alignments from the previous step consist of sets of sequences that have been clustered, by the PhyLoTA pipeline, up to an algorithmically selected higher taxonomic level (e.g. a genus or a family): PhyLoTA cannot cluster all of GenBank simply because the number of pairwise comparisons would be too large. It is of course possible that the same marker has been sequenced more broadly across such clusters (e.g. among multiple genera or families). This step therefore assesses which alignments from the previous step are orthologous with one another, and merges them. The result consists of many FASTA files, each named clusterXXX.fa (where XXX is a unique integer), and an index file that lists all the created FASTA files, called merged.txt.

Optional arguments

-i <infile>

To specify the location of the index file from the previous step. Default is aligned.txt

-o <outfile>

To specify the location of the output index file. Default is merged.txt

smrt bbmerge

This step merges the orthologous clusters into a supermatrix. Candidate clusters to merge into the supermatrix are filtered according to the following variables:

  • BACKBONE_MAX_DISTANCE defines the maximum average pairwise uncorrected distance across the candidate alignment. If a candidate is more divergent than that it is omitted. Reasonable values are probably in the range of 0.05 to 0.1.
  • BACKBONE_MIN_COVERAGE defines the minimum number of alignments that every species in the supermatrix must participate in. If this value is too high, too many alignments will be concatenated and the supermatrix will become too big to handle in the next steps. Reasonable values are in the range of 3 to 5.

These values can be specified in two ways:

  1. by editing the configuration file using the smrt-config command. This means that the newly updated value will be used by all subsequent runs.
  2. by setting environment variables, prefixed with SUPERSMART_ (example: SUPERSMART_BACKBONE_MAX_DISTANCE).

Optional arguments

-a <filename>

Specifies the index file for the orthologous clusters. Default is merged.txt

-t <filename>

Specifies the species spreadsheet. Default is species.tsv

-o <outfile>

Specifies the supermatrix output file name. Default is supermatrix.phy

-i <taxon1,taxon2,...>

Specifies names of taxa present in the species spreadsheet that must be included in the supermatrix, regardless their marker coverage. By default, exemplar taxa are selected algorithmically.

-m <filename>

Specifies the name for an output spreadsheet summarizing the selected exemplar taxa and markers. Default is markers-backbone.tsv

smrt bbinfer

Constructs a backbone phylogeny using the supermatrix from the previous step. Three inference tools are available: ExaML, ExaBayes and RAxML. For the maximum-likelihood tools, a bootstrap analysis can be specified. The following variables are important:

  • For ExaML and RAxML, the substitution model can be specified using the variables EXAML_MODEL or RAXML_MODEL, respectively. These can be specified using smrt-config or by setting environment variables with the prefix SUPERSMART_ (e.g. SUPERSMART_EXAML_MODEL).
  • For ExaBayes, the number of runs (EXABAYES_NUMRUNS), the number of chains (EXABAYES_NUMCHAINS) and the number of generations (EXABAYES_NUMGENS) can be specified.

We have found that the defaults for these variables work reasonably well in most cases.

Optional arguments

-s <infile>

Specifies the name of the supermatrix. Default is supermatrix.phy

-t <treefile>

Specifies the name of the classification tree, whose polytomies will be randomly resolved to obtain a starting tree. Default is classification-tree.dnd.

-i <inference tool>

Specifies the name of the inference tool to use. Can be one of examl, raxml or exabayes

-b <bootstrap replicates>

Using an ML tool, specifies the number of bootstrap replicates.

-o <outfile>

Specifies the name of the output tree file. Default is backbone.dnd

-x

Specifies that all intermediate files are to be cleaned up.

smrt bbreroot

The tree(s) from the previous step are unrooted. As we are constructing time-calibrated species trees the trees must be rerooted. This step does this using one of two possible approaches:

  1. Using an outgroup. This outgroup can be a single species, a comma-separated list of species, or of higher taxa. For example, in the phylogeny of the Primates, the Strepsirrhini (lemurs, bush babies, and such) are the outgroup of the monkeys, apes and tarsiers. Therefore, this higher taxon can be used to root the tree. This approach of course requires some prior knowledge about what would be a good outgroup.
  2. Using the taxonomy. With this approach, the species spreadsheet is used to identify the rooting that minimizes the number of non-monophyletic higher taxa.

Using either approach, the output is a Newick tree file with the rerooted tree(s) called backbone-rerooted.tsv.

Optional arguments

-t <filename>

Specifies the species spreadsheet. Default is species.tsv

-b <filename>

Specifies the backbone tree file. Default is backbone.dnd

-g <taxon1,taxon2,...>

Specifies outgroup taxa.

-o <outfile>

Specifies the output file. Default is backbone-rerooted.dnd

-s

Smooths the placement of the root along the basal branch. The approach is to select a placement that most approximates the midpoint, such that the average root-to-tip path length is approximately equal for the taxa on both sides of the root.

smrt bbcalibrate

Time-calibrates the rooted backbone tree(s) using treePL. Calibration points for this step are specified using a tab-separated spreadsheet such as this. You can make such a file for example in Excel or in a text editor, but make sure that:

  1. The separators really are tab characters, not something else.
  2. The line breaks are UNIX line breaks, not Windows or MacOS.

The following columns are required, but may be used in any order:

  • NFos - a unique integer for every fossil
  • FossilName - a human readable name for the fossil
  • CrownvsStem - whether the fossil is a stem or a crown fossil
  • CalibratedTaxon - which higher taxon the fossil belongs to
  • MinAge - minimum age
  • MaxAge - maximum age
  • BestPracticeScore - currently not used, but refers to the fossil vetting ranking used by FossilCalibrations.org

The location of your fossil spreadsheet must be provided on the command line, so at a minimum the command must be invoked like this:

smrt bbcalibrate -f <fossils>

where <fossils> is the name of the spreadsheet file.

Note: Once the FossilCalibrations.org project makes available either a web service API or a data dump, this step will change accordingly, presumably to a version that provides (semi-)automated fossil harvesting functionality.

Optional arguments

The following arguments may optionally be provided, in addition to the mandatory -f <fossils> argument.

-t <treefile>

Specifies the rerooted backbone tree file. Default is backbone-rerooted.dnd

-s <supermatrix>

Specifies the supermatrix file. Default is supermatrix.phy

-o <outfile>

Specifies the output tree file. Default is chronogram.dnd

smrt consense

This step calculates a consensus tree over a set of Newick trees. This is typically done using the output from the previous bbcalibrate command, but it can equally be used with the output from the preceding commands that produce sets of Newick trees, i.e. bbreroot and bbinfer. The output of this command is a NEXUS file that uses the additional syntax for node annotations that is understood by the highly recommended tree viewer FigTree.

Optional arguments

-i <infile>

Specifies the input tree file. Default is chronogram.dnd

-o <outfile>

Specifies the output tree file (NEXUS format). Default is consensus.nex.

-b <burnin>

Specifies the fraction of burnin to omit. This option necessarily only applies to sets of trees that were generated in Bayesian analyses.

-h <heights>

Specifies how to summarize node heights, compatible with syntax for TreeAnnotator, i.e. one of keep, median, mean or ca. Default is ca.

-l <limit>

Support threshold above which a node is included in the consensus. Default is 0.0.

-s <fossils>

Specifies the fossil table file. If provided, the FossilNames are attached to their respective nodes, where possible.

-p

Flag to indicate that the support values are posterior probabilities (and should be embedded in the node annotations as such). If not provided, the support values are assumed to be bootstrap values.

smrt bbdecompose

Decomposes the backbone topology into monophyletic clades to be refined in clade-level analyses. For each of these clades, a directory named cladeXXX (where XXX is a unique integer) is created, in which candidate alignments are written. In addition, a tab-separated spreadsheet file - called markers-clades.tsv - is created that summarizes the selected markers. The selection process for these alignments is influenced by two variables:

  1. CLADE_MAX_DISTANCE specifies the maximum average pairwise uncorrected distance below which alignments are still selected (above this value they are assumed to be too saturated to be useful). Reasonable values appear to be around 0.1, higher values would select more alignments.
  2. CLADE_MIN_DENSITY specifies the minimum fraction of the total species diversity of the clade above which alignments are selected. For example, a value of 0.5 selects all alignments that have sequence data for at least half of the taxa in the clade.

These variables can be set using smrt-config or as environment variables with the SUPERSMART_ prefix.

Optional arguments

-b <treefile>

Specifies the backbone tree file to decompose. Default is consensus.nex

-g

Attempt to automatically add outgroup from sister genus for each clade, if there is sufficient marker overlap between clades.

-c <treefile>

Specifies the classification tree used for identifying possible outgroups (i.e. in combination with the -g argument). Default is classification-tree.dnd

-a <index file>

Specifies the index file with all candidate alignments. Default is aligned.txt.

-o <outfile>

Specifies the name for the output tab-separated spreadsheet that summarizes the selected markers. Default is markers-clades.tsv

smrt clademerge

The clade-level analyses are performed using *BEAST. To this end, the candidate alignments from the previous step must be merged in such a way that it is clear to which species each selected sequence belongs (as there are multiple alignments, and multiple sequences per species in each alignment). As an intermediate format, this merged data is written as richly-annotated NeXML. The maximum number of alignments that is selected is specified with the CLADE_MAX_MARKERS variable: all alignments are sorted by decreasing taxon coverage, and then the first CLADE_MAX_MARKERS are included in the merged data set.

Optional arguments

-e

Specifies that the candidate alignments must be enriched with additional haplotypes. The number of haplotypes that is added is specified with the CLADE_MAX_HAPLOTYPES variable.

smrt cladeinfer

Runs a *BEAST analysis for every clade identified in the preceding steps. The input files for *BEAST are generated by inserting values in a template file located in $SUPERSMART_HOME/data/beast/starbeast.xml. This template consists for the most part of BEAST XML statements, with some additional template syntax (shown inside [% square brackets %]) for inserting the data. If you are familiar with editing such XML files you should be able to make sense out of the syntax. The defaults that are in the template appear to be reasonable for many cases, but you are encouraged to modify the template as you see fit.

This step takes the NeXML files generated in the previous step (named cladeXXX/cladeXXX.xml), reads in the data and inserts it in the template file to produce a file in *BEAST syntax (named cladeXXX/cladeXXX-beast-in.xml). Once it starts running the analysis on the *BEAST input file, beast will start logging parameters to the file cladeXXX/cladeXXX.log. It is strongly recommended that you track whether the parameters are converging to "hairy caterpillars" and you achieve sufficient ESS by using the Tracer program. If for some clade it looks like an additional run would improve ESS you can append to your existing results by running:

smrt cladeinfer -n <generations> -a -x -b <burnin> -f <cladeXXX/cladeXXX.xml>

Optional arguments

-n <ngens>

Number of generations. The default, 100_000 is intended for testing and is far too low. Reasonable values are at least above 10_000_000

-s <sfreq>

Sampling frequency. Default is 1000.

-l <lfreq>

Logging frequency. Default is 1000.

-f <infile>

If provided, only the specified input NeXML file is analyzed, not all clades. This is useful for example if you want to rerun / expand the analysis for one clade, but not others.

-x

Specifies that previously generated *BEAST input files need to be rebuild. If this is not set, any already existing *BEAST files are re-used.

-a

Specifies that results from the current analysis are to be appended to previously existing results. This is useful if you have not quite accomplished high enough ESS for all parameters and so you want to extend the chain.

-b <fraction>

Specifies the fraction of burnin to omit from newly generated results before appending them to existing results.

smrt cladegraft

This, final, step grafts the clade-level analysis results onto the backbone topology. This results in a single, time-calibrated, ultrametric tree in NEXUS format with additional annotations as are understood by FigTree.

Optional arguments

-b <infile>

Specifies the name of the backbone tree file on which to graft the clades. Default is consensus.nex

-o <outfile>

Specifies the output tree file name. Default is final.nex

-c <clade tree>

Specifies the clade tree file if only grafting a single clade on the backbone.

-h <heights>

Specifies how to summarize the node heights in the clade trees. Values can be one of keep, ca, median or mean. Default is ca.