Building a concatenated gene tree
Concatenated gene trees (such as those for ribosomal proteins) are often used to improve the resolution of organism phylogenies, with the caveat that concatenating alignments for genes with different patterns of horizontal gene transfer could lead to a false or muddled signal. Note that the result of this analysis is a phylogeny of the organisms, which represents a consensus of sorts on the phylogeny of the chosen conserved genes within them.
ITEP contains scripts to help you make a concatenated gene tree (from which you can get an organism phylogeny or an overall phylogeny for a specific operon, etc.). However, to make such a tree, all of the alignments that you wish to concatenate must:
- Represent the same group of organisms, and
- Have exactly one gene per organism
The alignments should be in FASTA format and should have ITEP gene IDs so that ITEP can look up the organism to which each gene belongs.
A set of alignments that meets all of these requirements can be made using the following procedure: First, identify a group of organisms that you want to use. Then make a file with the organism names (one per line) and run:
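The two requirements above can be checked before concatenating. The following is a minimal sketch, not part of ITEP: it assumes ITEP gene IDs have the form `fig|83333.1.peg.1234` with the organism ID being the `83333.1` part, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: ITEP gene IDs look like "fig|83333.1.peg.1234" and the
# organism ID is the "83333.1" part. Adjust the pattern if yours differ.
GENE_ID = re.compile(r"fig\|(\d+\.\d+)\.peg\.\d+")


def organisms_in_fasta(fasta_text):
    """Count how many FASTA headers belong to each organism ID."""
    counts = Counter()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            match = GENE_ID.search(line)
            if match is None:
                raise ValueError("No ITEP-style gene ID in header: " + line)
            counts[match.group(1)] += 1
    return counts


def check_alignments(alignments):
    """alignments: dict mapping a label (e.g. a file name) to FASTA text.

    Returns a list of problems; an empty list means every alignment has
    exactly one gene per organism and all cover the same organisms.
    """
    problems = []
    reference = None
    for label, text in sorted(alignments.items()):
        counts = organisms_in_fasta(text)
        for org, n in sorted(counts.items()):
            if n != 1:
                problems.append(f"{label}: organism {org} has {n} genes")
        if reference is None:
            reference = set(counts)
        elif set(counts) != reference:
            problems.append(f"{label}: organism set differs from the first alignment")
    return problems
```

To use it, read each alignment file into a string, pass them in a dict keyed by file name, and fix or drop any alignment that is reported.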
# This part gets cluster and run IDs for clusters that have exactly one copy in each input organism.
# However, they do not necessarily have exactly one copy in the organisms not in the input list.
$ cat [organism_list_file] | db_findClustersByOrganismList.py -a -u all_I_2.0_c_0.4_m_maxbit > [conserved_uniq_clusters_filename]
# This part gets the gene info and filters the results to only contain the organisms in your file.
# (Note - this step assumes that the organism names don't appear in the annotations)
$ cat [conserved_uniq_clusters_filename] | db_getClusterGeneInformation.py | grep -F -f [organism_list_file] > [geneinfo_filename]
# Finally, this part makes un-aligned FASTA files for each cluster in the above geneinfo file
$ cat [geneinfo_filename] | getClusterFastas.py [foldername]
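The selection rule described in the comments above (exactly one copy in each input organism, ignoring organisms outside the list) can be sketched in Python. This only illustrates the logic and is not how the ITEP scripts are implemented; the `(cluster_id, organism, gene_id)` tuple layout and the function name are assumptions made for the example.

```python
from collections import defaultdict


def conserved_unique_clusters(gene_rows, organisms):
    """Return sorted cluster IDs with exactly one gene in every listed organism.

    gene_rows: iterable of (cluster_id, organism, gene_id) tuples (hypothetical layout).
    organisms: the organisms of interest. Genes from any other organism are ignored.
    """
    wanted = set(organisms)
    per_cluster = defaultdict(lambda: defaultdict(int))
    for cluster_id, organism, _gene_id in gene_rows:
        if organism in wanted:  # exact match, not a substring search
            per_cluster[cluster_id][organism] += 1
    return sorted(
        cluster_id
        for cluster_id, counts in per_cluster.items()
        if set(counts) == wanted and all(n == 1 for n in counts.values())
    )
```

Because organisms are compared by exact equality on their own field, this kind of filter does not depend on organism names being absent from the annotations, which is the caveat of the `grep -F` step.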
These FASTA files can be aligned with your chosen tools, e.g. with MAFFT:
$ cd [foldername]
$ mkdir [newdir]
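$ for file in *; do [ -f "$file" ] && mafft --auto "$file" > [newdir]/"$file"; done
where newdir is some folder you create to store all the alignments. The -f test skips anything that is not a regular file, including newdir itself when it is created inside [foldername].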
[TODO - I need to write some nicer functions for filtering that list of clusters by annotation...]
We recommend that you curate the alignments manually using other tools (e.g. Jalview), use the wrapper script for Gblocks to prune out low-quality portions of the alignments, or a combination of these before concatenating them to minimize the error in the resulting tree. See the tutorial entry Building alignments and trees for details of the Gblocks wrapper script.
Once you have a directory containing ONLY the alignments you want to concatenate, concatenate them by running:
$ catAlignments.py [alignment_directory] > [concatenated_alignment]
The script will automatically identify which proteins belong to the same organism (they must have ITEP IDs, which they will if you make the FASTA files with ITEP tools) and sequentially add them to the alignment so that the same protein is in the same position for each organism.
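To make the concatenation step concrete, here is a minimal Python sketch of the idea: group sequences by organism and append each alignment in the same order for every organism. It is not the code of catAlignments.py; the gene ID pattern (`fig|83333.1.peg.1234`) and the function names are assumptions.

```python
import re

# Assumed ITEP gene ID shape; the captured group is the organism ID.
GENE_ID = re.compile(r"fig\|(\d+\.\d+)\.peg\.\d+")


def parse_fasta(text):
    """Return {organism_id: aligned_sequence} for one alignment in FASTA text."""
    records = {}
    org = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            match = GENE_ID.search(line)
            if match is None:
                raise ValueError("No ITEP-style gene ID in header: " + line)
            org = match.group(1)
            if org in records:
                raise ValueError("More than one gene for organism " + org)
            records[org] = ""
        elif org is not None:
            records[org] += line
    return records


def concatenate(alignment_texts):
    """Join alignments end to end, in the order given, one row per organism."""
    combined = {}
    for index, text in enumerate(alignment_texts):
        records = parse_fasta(text)
        if index == 0:
            combined = {org: "" for org in records}
        elif set(records) != set(combined):
            raise ValueError("Alignments cover different organisms")
        if len({len(seq) for seq in records.values()}) > 1:
            raise ValueError("Sequences within one alignment differ in length")
        for org, seq in records.items():
            combined[org] += seq
    return combined
```

The order of records inside each file does not matter, because rows are matched by organism ID rather than by position.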
The tree can be built using the wrapper scripts described in Building alignments and trees or with your favorite tree-building program. The ITEP tools support the Newick format, so please ask the program to output the tree in that format.
The resulting tree will have ITEP organism IDs on the leaves. However, many of the ITEP scripts that use trees require them to have sanitized organism names on the leaves instead (such trees are also much easier to interpret if put into programs such as FigTree for manipulation and analysis). Such a tree can be generated using replaceOrgWithAbbrev.py:
$ cat [newick_file] | replaceOrgWithAbbrev.py > [new_newick_file]
If the organism IDs in your input file are sanitized (e.g. they match fig_\d+_\d+ instead of fig\|\d+\.\d+), as happens when using the FastTree wrapper, specify the -s flag so that the function will correctly recognize them as organism IDs.
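The difference between the two ID forms can be illustrated with a short Python sketch. It is not the implementation of replaceOrgWithAbbrev.py: the patterns follow the ID shapes described above, while the function name and the name table are made up for the example.

```python
import re

# ID shapes as described in the text; both patterns are assumptions
# about what appears on the leaves of your tree.
RAW_ID = re.compile(r"fig\|\d+\.\d+")
SANITIZED_ID = re.compile(r"fig_\d+_\d+")


def replace_ids(newick, names, sanitized=False):
    """Replace organism IDs in a Newick string with names from `names`.

    `names` is keyed by the raw ID form, e.g. "fig|83333.1" (hypothetical table).
    IDs that are not in the table are left untouched.
    """
    pattern = SANITIZED_ID if sanitized else RAW_ID

    def lookup(match):
        found = match.group(0)
        key = found
        if sanitized:
            # Map fig_83333_1 back to fig|83333.1 before the lookup.
            _, taxon, version = found.split("_")
            key = f"fig|{taxon}.{version}"
        return names.get(key, found)

    return pattern.sub(lookup, newick)
```

With the raw pattern, a sanitized leaf such as `fig_83333_1` is simply not matched and passes through unchanged, which is why a separate mode is needed for sanitized input.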
Note - this function can also be used with any other file containing organism IDs or gene IDs (not just a Newick file) to get a quick view of which organisms they represent. However, the results cannot be used by any ITEP scripts that rely on the original IDs. For tabular data, a better choice is db_addOrganismNameToTable.py, which preserves the existing table structure and simply adds organism names (and optionally annotations).
Note 2 - The names of both replaceOrgWithAbbrev.py and db_addOrganismNameToTable.py will likely be changed in the near future to more accurately reflect the capabilities and uses of these two functions.