Building a concatenated gene tree

Why make a concatenated gene tree?

Concatenated gene trees (such as those for ribosomal proteins) are often used to improve the resolution of organism phylogenies, with the caveat that concatenating alignments for genes with different patterns of horizontal gene transfer can produce a false or muddled signal. Note that the result of this analysis is a phylogeny of the organisms, which represents a kind of consensus of the phylogenies of the chosen conserved genes within them.

Requirements for making a concatenated gene tree

ITEP contains scripts to help you make a concatenated gene tree (from which you can get an organism phylogeny, an overall phylogeny for a specific operon, etc.). However, to make such a tree, all of the alignments that you wish to concatenate must:

Represent the same group of organisms, and
Have exactly one gene per organism

The alignments should be in FASTA format and should have ITEP gene IDs so that ITEP can look up the organism to which each gene belongs.
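The requirements above can be checked before concatenating. The following is a minimal sketch, not part of ITEP: the function names are hypothetical, and it assumes that gene IDs have the form fig|<number>.<number>.peg.<number>, with the organism ID being the part before ".peg". It takes the alignments as FASTA text and reports any alignment that has more than one gene for an organism or that covers a different set of organisms than the first one.

```python
import re
from collections import Counter

# Assumed ITEP gene ID layout: fig|<number>.<number>.peg.<number>.
# The organism ID is taken to be the part before ".peg".
GENE_ID = re.compile(r"^(fig\|\d+\.\d+)\.peg\.\d+")


def organism_counts(fasta_text):
    """Count the genes per organism in one FASTA alignment given as a string."""
    counts = Counter()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            match = GENE_ID.match(line[1:].strip())
            if match is None:
                raise ValueError("not an ITEP gene ID: " + line)
            counts[match.group(1)] += 1
    return counts


def check_alignments(alignments):
    """Return a list of problems for a {name: fasta_text} mapping (empty if none)."""
    problems = []
    reference = None
    for name, text in sorted(alignments.items()):
        counts = organism_counts(text)
        duplicated = sorted(org for org, n in counts.items() if n > 1)
        if duplicated:
            problems.append(f"{name}: more than one gene for {', '.join(duplicated)}")
        organisms = set(counts)
        if reference is None:
            # The first alignment (in sorted order) defines the expected organisms.
            reference = organisms
        elif organisms != reference:
            problems.append(f"{name}: organism set differs from the first alignment")
    return problems
```

An empty list means that every alignment has exactly one gene per organism and that all alignments represent the same organisms.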

A set of alignments that meet all of these requirements can be made using the following procedure: First, identify a group of organisms that you want to use. Then make a file with the organism names (one on each line) and call

# This part gets cluster and run IDs for clusters that have exactly one copy in each input organism.
# However, they do not necessarily have exactly one copy in the organisms not in the input list.
$ cat [organism_list_file] | db_findClustersByOrganismList.py -a -u all_I_2.0_c_0.4_m_maxbit > [conserved_uniq_clusters_filename]

# This part gets the gene info and filters the results to only contain the organisms in your file. 
# (Note - this step assumes that the organism names don't appear in the annotations)
$ cat [conserved_uniq_clusters_filename] | db_getClusterGeneInformation.py | grep -F -f [organism_list_file] > [geneinfo_filename]

# Finally, this part makes un-aligned FASTA files for each cluster in the above geneinfo file
$ cat [geneinfo_filename] | getClusterFastas.py [foldername]

These FASTA files can be aligned with the tool of your choice, e.g. with MAFFT:

$ cd [foldername]
$ mkdir [newdir]
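# Skip [newdir] itself (and any other directory) in case it was created inside [foldername].
$ for file in *; do if [ -f "$file" ]; then mafft --auto "$file" > [newdir]/"$file"; fi; done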

where newdir is some folder you create to store all the alignments.

Figuring out which alignments you want to concatenate

[TODO - I need to write some nicer functions for filtering that list of clusters by annotation...]

Curating alignments

Before concatenating the alignments, we recommend that you curate them manually with other tools (e.g. Jalview), prune out low-quality portions with the wrapper script for Gblocks, or do both, to minimize the error in the resulting tree. See the tutorial entry Building alignments and trees for details of the Gblocks wrapper script.

Concatenating alignments

Once you have a directory containing ONLY the alignments you want to concatenate, they can be concatenated by running

$ catAlignments.py [alignment_directory] > [concatenated_alignment]

The script automatically identifies which proteins belong to the same organism (they must have ITEP IDs, which they will if you make the FASTA files with ITEP tools) and adds them to the alignment sequentially, so that the same protein is in the same position for each organism.
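To illustrate the idea, here is a minimal sketch of concatenation by organism. It is not the code of catAlignments.py: the function names are hypothetical, the alignments are passed as FASTA text rather than read from a directory, and gene IDs are assumed to have the form fig|<number>.<number>.peg.<number>.

```python
import re

# Assumed ITEP gene ID layout; the organism ID is the part before ".peg".
ORG_ID = re.compile(r"^(fig\|\d+\.\d+)\.peg\.\d+")


def read_alignment(fasta_text):
    """Parse FASTA text into {organism_id: aligned_sequence}."""
    seqs = {}
    org = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            match = ORG_ID.match(line[1:])
            if match is None:
                raise ValueError("not an ITEP gene ID: " + line)
            org = match.group(1)
            if org in seqs:
                raise ValueError("more than one gene for " + org)
            seqs[org] = ""
        elif line and org is not None:
            # Sequence lines may be wrapped, so append to the current record.
            seqs[org] += line
    return seqs


def concatenate(fasta_texts):
    """Join alignments in the given order, one combined sequence per organism."""
    combined = {}
    for index, text in enumerate(fasta_texts):
        seqs = read_alignment(text)
        if index == 0:
            combined = {org: "" for org in seqs}
        elif set(seqs) != set(combined):
            raise ValueError("alignments cover different organisms")
        # Every organism gets this gene appended at the same offset.
        for org, seq in seqs.items():
            combined[org] += seq
    return combined
```

Because each alignment is appended to every organism in the same order, a given gene occupies the same columns in all rows of the combined alignment, regardless of the order of the records within each input file.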

Making the tree

The tree can be made using the wrapper scripts described in Building alignments and trees or with your favorite treeing program. The ITEP tools support the Newick format, so please ask the program to write its output in that format.

The resulting tree will have ITEP organism IDs on the leaves. However, many of the ITEP scripts that use trees require sanitized organism names on the leaves instead (such trees are also much easier to interpret in programs such as FigTree for manipulation and analysis). Such a tree can be generated using the replaceOrgWithAbbrev.py function:

$ cat [newick_file] | replaceOrgWithAbbrev.py > [new_newick_file]

If your organism IDs in the input file are sanitized (e.g. fig_\d+_\d+ instead of fig|\d+.\d+), as happens when using the FastTree wrapper, specify the -s flag so that the function will correctly recognize them as organism IDs.
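The difference between the two ID forms can be sketched as follows. This is an illustration only, not the code of replaceOrgWithAbbrev.py: the function name, the sanitized keyword (which merely mirrors the meaning of the -s flag), and the name mapping passed in by the caller are all hypothetical.

```python
import re

# Raw IDs look like fig|83333.1; sanitized IDs look like fig_83333_1
# (both patterns follow the forms described in the text above).
RAW_ID = re.compile(r"fig\|(\d+)\.(\d+)")
SANITIZED_ID = re.compile(r"fig_(\d+)_(\d+)")


def replace_organism_ids(newick, names, sanitized=False):
    """Replace organism IDs in a Newick string using a {raw_id: leaf_name} mapping."""
    pattern = SANITIZED_ID if sanitized else RAW_ID

    def lookup(match):
        # Rebuild the raw form of the ID so one mapping serves both cases.
        raw_id = "fig|{}.{}".format(match.group(1), match.group(2))
        # IDs that are missing from the mapping are left untouched.
        return names.get(raw_id, match.group(0))

    return pattern.sub(lookup, newick)
```

With sanitized=False, the sanitized form fig_\d+_\d+ is not recognized at all, which is why the flag needs to be given for trees produced by the FastTree wrapper.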

Note - this function can also be used with any other file containing organism IDs or gene IDs (not just a Newick file) to get a quick view of which organisms they represent. However, the results cannot be used by any ITEP scripts that rely on having the original IDs. A better function for that purpose is db_addOrganismNameToTable.py, which preserves existing table structures and simply adds organism names (and optionally annotations).

Note 2 - The names of both replaceOrgWithAbbrev.py and db_addOrganismNameToTable.py will likely be changed in the near future to more accurately reflect the capabilities / uses of these two functions.
