Adrian Baez-Ortega
Transmissible Cancer Group, University of Cambridge
mutree is a generalization and extension of Asif Tamuri's treesub pipeline. It makes use of RAxML [1] and parts of treesub itself (which in turn uses the Java libraries PAL [2] and BioJava [3]) in order to infer a phylogenetic tree and identify candidate recurrent coding-affecting mutations in it, from a coding DNA sequence alignment.
The pipeline generates:
- A maximum likelihood phylogenetic tree including bootstrap values in its branches (Newick format).
- A version of the ML tree showing all the annotated mutations in the branches where they occur (Nexus format).
- A version of the ML tree showing only the recurrent mutations in the branches where they occur (Nexus format). A nonsynonymous mutation in a branch of the tree is considered to be recurrent if another nonsynonymous mutation in the same gene has been found in a different branch.
- A text table with all the single-nucleotide substitutions found in the alignments, indicating whether they are nonsynonymous and recurrent.
mutree has been tested on an Ubuntu 14.04.4 system, and it should behave well in any Linux distribution. It should also work well on Mac OS X.
mutree depends on the installation of the following software:
- RAxML version 8.2.9 or later. mutree requires compiling the raxmlHPC-SSE3 and raxmlHPC-PTHREADS-SSE3 RAxML executables, which should work well in processors up to 5 years old.
- A recent Java runtime (1.6+), which might already be installed in your system.
- Although it is not required in order to run the pipeline, some visualisation tool is needed to open the output tree files. FigTree can read the Nexus format in which the substitution trees are output. The tree showing the bootstrap support values (in Newick format) can be opened using e.g. Dendroscope, or converted to a different format.
mutree already includes its own (slightly customized) version of the treesub pipeline, named 'treesub-TCG'. Therefore, installing treesub is not necessary, although in some cases it may have to be re-compiled (see NOTE below).
The following instructions describe the steps for installing mutree and all its components in an Ubuntu 14.04.4 system; they should be valid for any Ubuntu or Debian Linux distribution. The tools employed are also available for Mac and Windows (please consult their respective websites). mutree itself has not been tested on Mac or Windows systems, but it might work with an appropriate Bash shell.
- Install RAxML
You only need to install RAxML if either of the following commands prints nothing in the terminal:
    which raxmlHPC-PTHREADS-SSE3
    which raxmlHPC-SSE3
Go to the desired installation folder (in this example, the Software folder inside your home directory, ~/Software):
    cd ~/Software
Download and compile RAxML:
    wget https://github.com/stamatak/standard-RAxML/archive/v8.2.9.tar.gz
    tar zxvf v8.2.9.tar.gz
    rm v8.2.9.tar.gz
    cd standard-RAxML-8.2.9/
    make -f Makefile.SSE3.gcc
    rm *.o
    make -f Makefile.SSE3.PTHREADS.gcc
    rm *.o
Then, edit your ~/.bashrc file using:
    nano ~/.bashrc
and add the standard-RAxML-8.2.9 directory to your PATH variable. If the PATH variable is not defined, you can define it by adding the following line at the end of the ~/.bashrc file:
    export PATH=~/Software/standard-RAxML-8.2.9:$PATH
Then save and close the file (Ctrl-X).
- Install the Java Runtime Environment
You only need to install Java if the following command does not print anything in the terminal:
    which java
To install it, run:
    sudo apt-get install default-jre
The system will ask for your password; you need to have administrator permissions in your system in order to use sudo apt-get install.
- Install mutree
Go to the desired installation folder, and download and uncompress mutree (replace 2.xx with the latest version):
    cd ~/Software
    wget https://github.com/adrianbaezortega/mutree/archive/v2.xx.tar.gz
    tar zxvf v2.xx.tar.gz
    rm v2.xx.tar.gz
Then, edit your ~/.bashrc file using:
    nano ~/.bashrc
and add the mutree-2.xx/src directory to your PATH variable. If the PATH variable was not previously defined, its line should now look like:
    export PATH=~/Software/standard-RAxML-8.2.9:~/Software/mutree-2.xx/src:$PATH
Then save and close the file (Ctrl-X).
Either close the terminal and open a new one, or source the ~/.bashrc file in order to apply the changes:
    source ~/.bashrc
Then you should be able to run the following commands, which should print something like this:
    which raxmlHPC-PTHREADS-SSE3   # prints: [...]/standard-RAxML-8.2.9/raxmlHPC-PTHREADS-SSE3
    which raxmlHPC-SSE3            # prints: [...]/standard-RAxML-8.2.9/raxmlHPC-SSE3
    which java                     # prints: /usr/bin/java
    which mutree                   # prints: [...]/mutree-2.xx/src/mutree
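The same check can be done in one go. The following is a minimal sketch, not part of mutree: the function name check_tools is hypothetical, and it only reports whether each command can be found on the PATH.

```shell
# check_tools CMD...: report which of the given commands are on the PATH.
# Hypothetical helper for illustration; returns 0 only if all are found.
check_tools() {
    missing=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "found: $cmd"
        else
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}
```

For this installation it could be called as: check_tools raxmlHPC-PTHREADS-SSE3 raxmlHPC-SSE3 java mutree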
And now you can have fun!
NOTE: If you encounter problems while using mutree and they seem to be related to the treesub pipeline, you can try re-compiling it. You need to go to the treesub-TCG folder within the mutree installation directory, and re-compile treesub using Ant:
    cd ~/Software/mutree-2.xx/treesub-TCG
    export ANT_OPTS="-Xmx256m"
    ant compile jar
The pipeline requires the following input:
- Absolute path to a coding sequence (CDS) alignment file, in FASTA format (-i option). Each sequence in the file should be composed of a concatenation of multiple gene CDS sequences, all of which must be in frame (i.e. the concatenated sequence must contain codon bases only, and its length must be a multiple of 3). If the length of a CDS is not a multiple of 3, any trailing bases after the last codon have to be removed before adding the CDS to the concatenated sequence. Each sequence in the FASTA file represents a sample (taxon), and must be labeled with a unique sample name. Sample names cannot include any blank spaces, tabs, carriage returns, colons, commas, parentheses or square brackets. Each sequence must be on a single line, so that odd lines in the file contain the sample names, while even lines contain the sequences. The first sequence in the file will be used as an outgroup to root the tree, so this should be the reference sequence or a suitable outgroup sample. An example can be found in the file mutree-2.xx/examples/Alignment_H3HASO.fna (this has been adapted from one of treesub's example files).
- Absolute path to a "gene table" (-g option). This is mandatory unless the -f option is used. The gene table must be a tab-delimited file with no header and two columns: gene symbol and CDS start position (position of the first nucleotide in the concatenated sequence). This allows mapping each mutation to the gene where it occurs and finding recurrent mutations. An example can be found in the file mutree-2.xx/examples/GeneTable_H3HASO.txt (the gene symbols and positions have been defined arbitrarily for this example).
- Absolute path to an output directory (-o option). The directory will be created if necessary. The pipeline implements a checkpoint logging system, so in the event that the execution is interrupted before finishing, re-running mutree with the same output directory will resume the execution after the last successfully completed step.
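The format requirements above can be checked before launching a long run. The following is a minimal sketch under stated assumptions: the function names check_alignment and check_gene_table are hypothetical and not part of mutree, sample-name lines are assumed to start with the usual FASTA ">" character, and only the rules listed above are tested.

```shell
# check_alignment FILE: check the layout required for the -i input.
# Hypothetical helper; prints one message per problem, returns 1 if any.
check_alignment() {
    awk '
        NR % 2 == 1 {                       # odd lines: sample names
            if ($0 !~ /^>/) { print "line " NR ": expected a sample name starting with >"; bad = 1 }
            name = substr($0, 2)
            if (name ~ /[ \t\r:,()]/ || index(name, "[") || index(name, "]")) {
                print "line " NR ": forbidden character in sample name"; bad = 1
            }
            if (name in seen) { print "line " NR ": duplicate sample name"; bad = 1 }
            seen[name] = 1
            next
        }
        # even lines: sequences, which must consist of whole codons
        length($0) % 3 != 0 { print "line " NR ": sequence length is not a multiple of 3"; bad = 1 }
        END {
            if (NR % 2 == 1) { print "odd number of lines: a sequence is missing or wrapped"; bad = 1 }
            exit (bad ? 1 : 0)
        }
    ' "$1"
}

# check_gene_table FILE: check the layout required for the -g input
# (no header, two tab-separated columns, integer start position).
check_gene_table() {
    awk -F '\t' '
        NF != 2          { print "line " NR ": expected 2 tab-separated columns"; bad = 1; next }
        $2 !~ /^[0-9]+$/ { print "line " NR ": start position is not an integer"; bad = 1 }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Both functions print nothing and return 0 when the file passes, so they can be chained before the mutree call, e.g. check_alignment alignment.fna && check_gene_table gene_table.txt.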
mutree also accepts other optional input:
- Number of RAxML threads (-t option). This allows using the multi-threaded version of RAxML to substantially speed up the tree inference and the ancestral sequence reconstruction. This value can be any positive integer, and cannot be higher than the available number of processors. The default value is 1.
- Custom RAxML options for tree inference (-r option). This allows customizing the RAxML routine, which uses rapid bootstrapping followed by maximum likelihood search by default (see pipeline description below). Custom options must be specified as a single string within quotes, and must include all the required options for running RAxML, except for the options -s, -n, -w and -T, which cannot be used.
- Custom RAxML options for ancestral sequence reconstruction (-a option). This allows customizing the ASR settings, which consist of a GTR substitution model plus a Gamma model of rate heterogeneity by default (see pipeline description below). Custom options must be specified as a single string within quotes, and must include all the required options for running RAxML, except for the options -f, -s, -n, -w and -T, which cannot be used.
- Perform tree inference and rooting only (-f option). If this option is specified, only the first three steps of the pipeline will be run. Thus, in this case, it is not necessary to provide a gene table via -g, and there is also no need for the input alignment (-i) to be composed of coding sequences (unless the rest of the pipeline is to be run afterwards). This option does not require any arguments.
Following from this, the mutree command should look similar to the example below:
    mutree -i /path/to/alignment.fna -o /path/to/out_dir -g /path/to/gene_table.txt -t 8 -r "-m GTRGAMMA -# 10 -p 12345" -a "-m GTRGAMMA --HKY85 -M"
Most users should not need to use options -r and -a. The example input files can be used for a quick test run (without bootstrapping):
    mutree -i /path/to/mutree-2.xx/examples/Alignment_H3HASO.fna -g /path/to/mutree-2.xx/examples/GeneTable_H3HASO.txt -o /path/to/out_dir -r "-m GTRGAMMA -p 12345"
(Because the sequences in this arbitrary example have a high mutation density, there will be more than one nonsynonymous substitution in every gene, and therefore all the nonsynonymous substitutions will appear as recurrent. However, it is useful as a model of what the input and output should look like.)
Running mutree without any arguments or with the -h option will print the usage information; the -v option will print the program version only.
The pipeline is composed of six steps:
1. Input processing
The input FASTA alignment is transformed to PHYLIP format and the sequences are relabelled so that they are compatible with the tools employed. If the alignment contains sites composed only of undetermined characters ('N's) in all the sequences, a version without such sites will be generated as an input for step 2. The codons containing variable sites will be concatenated and written to a different file (in which 'N' characters will be replaced by 'A's), which will be used in step 4.
2. Maximum likelihood tree inference
RAxML is used to build a maximum likelihood (ML) phylogenetic tree from the input alignment. This can be a very expensive process. By default, rapid bootstrapping (with an extended majority-rule consensus tree stop criterion) is performed prior to a thorough ML tree search, which employs a GTR substitution model plus a Gamma model of rate heterogeneity (-f a -m GTRGAMMA -# autoMRE -x 931078 -p 272730 configuration; see the RAxML manual). However, custom RAxML options can be specified via mutree's -r option. Custom options must be specified between quotes (e.g. -r "-m GTRGAMMA -# 10 -p 12345"), and must include all the options required for running RAxML, except for the options -s, -n, -w and -T, which cannot be used.
3. Tree rooting
Treesub is used to root the ML tree by the outgroup sequence, which should be the first sequence in the input alignment FASTA file.
4. Ancestral sequence reconstruction
RAxML is used to perform marginal reconstruction of ancestral sequences from the input alignment and the ML tree. To allow the processing of very long coding sequences, this step runs on the set of codons that contain a mutation in any of the sequences. Here, RAxML employs a GTR substitution model plus a Gamma model of rate heterogeneity by default (-f A -m GTRGAMMA configuration; see the RAxML manual). However, custom RAxML options for ancestral sequence reconstruction can be specified via mutree's -a option. Custom options must be specified between quotes (e.g. -a "-m GTRGAMMA --HKY85 -M"), and must include all the options required for running RAxML, except for the options -f, -s, -n, -w and -T, which cannot be used.
5. Tree annotation
Treesub is used to annotate the mutations occurring in each branch of the tree, based on the reconstructed ancestral sequences. Mutations are assessed for their amino acid change. As part of this process, each branch receives a unique label.
6. Recurrent mutation identification
Finally, mutated codons' positions are translated into their actual positions in the original sequence, and the input gene table is then used to map each mutation to the gene (CDS) where it occurs. Nonsynonymous mutations that affect the same gene in different branches are marked as recurrent. A new tree is produced which shows only the identified recurrent mutations in each branch.
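The gene mapping and recurrence rule of the last step can be illustrated with a short awk sketch. This is not mutree's implementation: the function name find_recurrent and the three-column mutation list (branch label, position in the concatenated sequence, and N or S for nonsynonymous or synonymous) are hypothetical. Only the gene table layout (gene symbol and CDS start position, tab-delimited) follows the format described for the -g option.

```shell
# find_recurrent GENE_TABLE MUTATIONS
# Hypothetical helper: appends the gene and a yes/no recurrence flag to each
# mutation. A nonsynonymous mutation is flagged when nonsynonymous mutations
# in the same gene are seen in at least two different branches.
find_recurrent() {
    awk -F '\t' -v OFS='\t' '
        NR == FNR { n++; gene[n] = $1; start[n] = $2 + 0; next }   # first file: gene table
        {
            # The gene containing a position is the one with the largest start <= position.
            g = "NA"; best = 0
            for (i = 1; i <= n; i++)
                if (start[i] <= $2 + 0 && start[i] > best) { best = start[i]; g = gene[i] }
            m++; branch[m] = $1; pos[m] = $2; kind[m] = $3; mgene[m] = g
            # Count distinct branches carrying a nonsynonymous mutation in this gene.
            if ($3 == "N" && !((g, $1) in seen)) { seen[g, $1] = 1; nbranches[g]++ }
        }
        END {
            for (j = 1; j <= m; j++) {
                rec = (kind[j] == "N" && nbranches[mgene[j]] > 1) ? "yes" : "no"
                print branch[j], pos[j], mgene[j], kind[j], rec
            }
        }
    ' "$1" "$2"
}
```

Two nonsynonymous mutations on the same branch do not count as recurrent under this rule, which is why the sketch counts distinct branches per gene rather than mutations per gene.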
Each one of the pipeline steps will generate an intermediate folder within the specified output directory. The 'logs' folder contains the global execution log, as well as the checkpoint file, which records the current stage of the pipeline and can be modified in order to restart the execution at any given step: when re-running mutree with the same output folder as before, execution will resume at the step that follows the last step recorded in the checkpoint file.
The pipeline's final output will be stored in a folder named 'Output', and will consist of:
- A tab-delimited text file containing the information for all the identified mutations in the tree (branch/node, gene, position, codon and amino acid changes, and whether the mutation is nonsynonymous/recurrent).
- Five versions of the same phylogenetic tree:
  - Standard ML tree as produced by RAxML (Newick format).
  - Tree showing the bootstrap support values in its branch bifurcations (Newick format, unrooted; only if bootstrapping is performed).
  - Tree showing the branch and node labels employed in the output table of mutations (Newick format).
  - Tree showing all the mutations identified in each branch (Nexus format).
  - Tree showing the candidate recurrent mutations identified in each branch (Nexus format).
Please cite mutree as:
Adrian Baez-Ortega. mutree: A pipeline for phylogenetic tree inference and recurrent mutation discovery. Zenodo (2017). doi:10.5281/zenodo.583634.
Copyright © 2016–2017 Transmissible Cancer Group, University of Cambridge
Author: Adrian Baez-Ortega (ORCID 0000-0002-9201-4420; ab2324@cam.ac.uk)
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses.
[1] Stamatakis, A. 2006. RAxML-VI-HPC: Maximum Likelihood-based Phylogenetic Analyses with Thousands of Taxa and Mixed Models. Bioinformatics 22(21):2688–2690.
[2] Drummond, A., Strimmer, K. 2001. PAL: An object-oriented programming library for molecular evolution and phylogenetics. Bioinformatics 17:662–663.
[3] Holland, R.C.G., Down, T., Pocock, M., Prlić, A., Huen, D., James, K., Foisy, S., Dräger, A., Yates, A., Heuer, M., Schreiber, M.J. 2008. BioJava: an Open-Source Framework for Bioinformatics. Bioinformatics 24(18):2096–2097.