Building reference packages with TreeSAPP

Overview

This tutorial is meant for users who would like to analyze a gene family lacking an associated TreeSAPP reference package (refpkg). This tutorial is also useful to people who find a refpkg is completely useless to them and would rather start over than use treesapp update.

Supported versions >=0.10.2

Ingredients

The subcommand we will need to use for this tutorial is treesapp create. This subcommand's workflow is visualized below:

As you can see treesapp create depends on several other open source software. Some of these are also dependencies of treesapp assign (HMMER, EPA-NG, BMGE) and several are not (FastTree, RAxML-NG, MAFFT, MMSeqs2, OD-Seq).

To promote transparency and reproducibility, TreeSAPP assumes that the sequences (either nucleotide or protein) are accessioned on the NCBI and therefore available through BioPython's Entrez API. This allows TreeSAPP to automatically determine the taxonomic lineage assigned to each sequence without this information being provided. Hopefully this means less work for you! However, seeing as folks will invariably want to work with the latest sequences, there are ways to provide lineage information for unaccessioned sequences (below).

Step 1. Reference sequences

What databases can I retrieve sequences from?

Throughout the development of TreeSAPP I have frequented FunGene, the functional gene pipeline and repository. This database provides curated sequences (of varying reliability) for many popular function and taxonomic anchor genes. FunGene is also great since they have hidden Markov models (HMMs) for every gene available, providing a sense of what the full length gene is and allowing users to filter the sequences based on the percentage of an HMM covered. Since we're building a reference tree I suggest downloading all full length and near-full length genes, so setting that HMM coverage parameter to between 60 and 90%.

EggNOG is another source of curated orthologous groups. The breadth of families available here is astounding and more often than not they do have the sequences you're looking for. What I've found handy is first looking for the Cluster of Orthologous Gene (COG) identifier on NCBI's Conserved Domains database. Then I'd use that identifier to search for the sequences I'm interested in on EggNOG.
Unfortunately, the breadth of taxa covered by EggNOG is limited to just 5,090 organisms. In many cases this is good enough to build a 'seed' reference package, especially if you're building a reference package of a well-conserved, or housekeeping, gene. Yet, there are plenty of scenarios where the number of sequences downloaded is just too few for your objective and a more comprehensive reference package is required. More on that below.

Alternative sources

EggNOG and FunGene are great resources, though this leaves out the vast majority of biological sequence databases for no reason other than simplicity. While IMG, KEGG, UniProt, Pfam, and other similar databases are all useful they lack an API to retrieve lineage information, and therefore the lineages need to be provided through other means.

A table mapping sequence names to taxonomic lineages can be provided using the --seqs2lineage argument. This lineage table can either be tab or comma-separated with the fields:

SeqID	Organism	Lineage	Domain	Phylum	Class	Order	Family	Genus	Species

'SeqID' must be the first field and it can be a prefix of the query sequences that were placed. This is useful when the queries are contigs (i.e. nucleotide sequences) but the amino acid open-reading frames were classified. Providing the contig name as the 'SeqID' will work.

Some fields are redundant and so not all are required. If the “Lineage” field is provided, the separated rank fields (e.g. “Domain”, “Phylum”, etc.) are not needed and vice versa. If the lineage field is used there are two things that can help you and TreeSAPP: first off, other non-canonical ranks (e.g. sub-order, super-phylum, etc.) should be removed from the lineage. Secondly, it is best to explicitly include the rank-prefixes in the lineages (d__Archaea; p__Euryarchaeota) so TreeSAPP doesn’t have to do this later. The “Organism” field is also optional and the most resolved rank is used in its place if it is missing.

Step 2. Creating the reference package

The most basic refpkg consists of 4 files: a multiple sequence alignment file in FASTA format, a HMM, a tree in NEWICK format, and a table mapping identifiers in the phylogenetic tree to taxa. In addition to these files, the same NEWICK tree with bootstrap files can also be generated.

A basic command could look something like this:

$ treesapp create \
--fastx_input rpoB_proteins.faa --output rpoB_TreeSAPP_create \
--refpkg_name rpoB --cluster --similarity 0.90 \
--num_procs 8 --molecule prot

The --similarity argument specifies the proportional sequence similarity to cluster the input sequences at. This is very useful when the input fasta includes many sequences from closely related organisms. Using this option alone assumes that you've already clustered these sequences prior to running treesapp create. If this is not the case, users must include the --cluster flag for TreeSAPP to use vsearch for sequence clustering.

Other options that some may find useful are --profile, --guarantee, --accession2taxid, --fast, --min_taxonomic_rank, and --headless.

--profile can be used if you have a HMM that you want to use to extract domains from the input sequences. For example, maybe input sequences contain multiple different domains or multiple copies of the same domain. In some cases this is desired while in others it could complicate phylogenetic inference and the profile HMM, and may lead to false positive classifications. It is advised to use an HMM that models the domain you're interested in and to build a refpkg from these sequences only. This is also handy if you don't fully trust the sequences you're using to build a refpkg.

--fast invokes "fast mode" where FastTree is used to build the phylogenetic tree instead of RAxML-NG. The main advantage here is speed. Currently a tree with bootstrap values is not generated in fast-mode. We have not observed a significant difference in classification performance between RAxML-NG and FastTree phylogenies, so if you are building a large refpkg I suggest using fast mode.

--headless is useful when building refpkgs on a server and don't want to be prompted for any reason. Without --headless, when the --cluster flag is used the sequences of each cluster are presented to the user for them to manually select the sequence with the best annotation, sequence length, etc. There are cases when this isn't feasible and instead the longest sequence (cluster centroid) is used.

--guarantee can be used to ensure that sequences make it into the final set of reference sequences, regardless of whether they would be normally removed by filters and clustering. It's argument is a FASTA file with the sequences that you want to make it through. These do not need to be in the FASTA provided to --fasta_input.

--accession2taxid is a path to an NCBI accession2taxid file for more rapid accession-to-lineage mapping. These files can be downloaded from the NCBI's ftp site here. We use prot.accession2taxid. Note: its only worthwhile if you have many hundreds or thousands of sequences in your input that need to be assigned a lineage. Otherwise, it is probably faster to just wait for TreeSAPP to send queries to the Entrez database.

The --min_taxonomic_rank argument specifies the minimum taxonomic rank a sequence's taxonomic lineage must be resolved to. For example, using --min_taxonomic_rank g will ensure that all sequences in the reference package have been annotated to the genus, species or strain taxonomic rank. This is useful for removing poorly annotated sequences, such as those from amplicon or metagenome studies and hastily uploaded without properly determining the closest relative.

--deduplicate is useful to remove nearly identical sequences from a FASTA file before proceeding to build a reference package. Specifically, all sequences are clustered at 99.9% identity with MMSeqs2 where the longest sequence from each cluster is selected as the centroid/representative. Using --deduplicate can save time when taxonomic lineage information needs to be downloaded from Entrez.

Creating multiple versions

A significant time sink is waiting for TreeSAPP to download the lineages associated with each sequence accession. This time can be saved if you're building multiple RefPkgs from the same FASTA file, say by clustering at different sequence similarities. Additionally, if you want to maintain the reference package and update it as new versions of TreeSAPP are released, keeping the file that maps sequence names to their lineages will save much time.

After creating the first RefPkg, copy the file accession_id_lineage_map.tsv for reuse. Here is an example:

$ treesapp create -i rpoB_proteins.faa -o rpoB_create_90/ \
 -n 4 -c RpoB -p 0.90 --cluster -m prot
$ cp rpoB_create_90/intermediates/accession_id_lineage_map.tsv ./RpoB_accession_id_lineage_map.tsv
$ treesapp create -i rpoB_proteins.faa -o rpoB_create_80/ \
 -n 4 -c RpoB -p 0.80 --cluster -m prot --accession2lin RpoB_accession_id_lineage_map.tsv

Just be sure that you don't use the flag --delete during the first build as it will remove the intermediate files, including accession_id_lineage_map.tsv.

Step 3. Integration

The final step is to run those final commands that were printed as treesapp create completed. They should look something like this:

To integrate this package for use in TreeSAPP you must copy McrA/seed_refpkg/final_outputs/McrA_build.pkl to a directory containing other reference packages you want to analyse.
This may be in treesapp_venv/lib/python3.7/site-packages/treesapp//data/ or elsewhere

These need to be executed outside of treesapp create since it looks for all refpkg files in subdirectories of a single location. These are not copied automatically for two reasons:

Prevent over-writing existing refpkgs that users may want to retain (sentimental value, etc.)
Give the user an opportunity to inspect and test the refpkg before (blindly) using it for analyses

This brings us, dear reader, to evaluating the quality of refpkgs with treesapp purity.

An upcoming tutorial will show you how to estimate the taxonomic classification accuracy with your new refpkg using treesapp evaluate.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly