A simple tool to generate hierarchical clustering trees from nucleotide sequences using kmer spectrum distances. Included is a small test set of SARSCOV2 genomes downloaded from https://www.nlm.nih.gov/news/coronavirus_genbank.html.
This tool calculates the distance between nucleotide sequences in FASTA format by digesting them into kmer count vectors (effectively kmer spectra). The distances between all pairs of vectors are calculated and clustered to build a hierarchical clustering tree. A number of distance metrics and clustering methods are supported (see distance and clustering).
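The distance and clustering steps described above can be sketched with scipy, which is listed among the dependencies. This is a minimal illustration, not TreeMer's own code: the `spectra` dictionary and its counts are made up, and the assumption that kmers missing from a sequence count as zero is ours.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical kmer spectra for three sequences (illustrative values only).
spectra = {
    "seqA": {"AAA": 4, "AAC": 1},
    "seqB": {"AAA": 3, "AAC": 2},
    "seqC": {"AAC": 1, "CCC": 5},
}

# Align every spectrum on the union of observed kmers; absent kmers get 0 (assumption).
kmers = sorted(set().union(*spectra.values()))
names = list(spectra)
matrix = np.array(
    [[spectra[name].get(kmer, 0) for kmer in kmers] for name in names],
    dtype=float,
)

# Condensed pairwise distances, then agglomerative clustering on them.
condensed = pdist(matrix, metric="euclidean")
square = squareform(condensed)          # full symmetric matrix, e.g. for a heatmap
tree = linkage(condensed, method="ward")

print(square.round(3))
print(tree)
```

Each row of `tree` records one merge: the two cluster indices joined, the distance at which they were joined, and the size of the new cluster.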
Installation is straightforward; simply run
git clone git@github.com:ArthurVM/TreeMer.git
cd TreeMer
python3 -m pip install -r dependencies.txt
and you are good to go!
TreeMer takes a set of nucleotide sequences in FASTA format and generates kmer count files, structured as:
kmer0 count
kmer1 count
...
kmern count
in tab-separated format (denoting the kmer spectrum of the sequence). These kmer spectra are used to calculate pairwise distances, from which a hierarchical clustering tree is generated.
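As a rough illustration of the kmer count format above, the sketch below counts overlapping kmers in a single sequence and renders them as tab-separated lines. The function names `count_kmers` and `format_spectrum` are hypothetical, and the choices to count the forward strand only and to skip kmers containing non-ACGT characters are assumptions, not confirmed TreeMer behaviour.

```python
from collections import Counter

def count_kmers(seq, k=7):
    """Count every overlapping kmer in seq (forward strand only, an assumption)."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip kmers containing ambiguous bases such as N (assumption).
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts

def format_spectrum(counts):
    """Render counts as 'kmer<TAB>count' lines, sorted by kmer."""
    return "\n".join(f"{kmer}\t{n}" for kmer, n in sorted(counts.items()))

spectrum = count_kmers("ACGTACGTAC", k=3)
print(format_spectrum(spectrum))
```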
TreeMer outputs the following files:
HC_dendro.png - The hierarchical clustering dendrogram in .png format.
HC_tree.nwk - A text file containing the hierarchical clustering tree in Newick format.
heatmap.png - The heatmap of sequence distances in .png format.
heatmap.{D}.tsv - A heatmap file in .tsv format. {D} is the distance metric used.
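For readers curious how a hierarchical clustering result can end up as a Newick string like the one in HC_tree.nwk, here is one possible conversion from a scipy linkage. It is a sketch only: `to_newick` is a hypothetical helper, the input vectors are made up, and the branch-length convention (parent merge height minus child merge height) is an assumption that may differ from what TreeMer writes.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

def to_newick(node, labels, parent_height):
    """Recursively serialise a scipy ClusterNode as a Newick subtree."""
    # Branch length = height of the parent merge minus height of this node.
    branch = parent_height - node.dist
    if node.is_leaf():
        return f"{labels[node.id]}:{branch:.4f}"
    left = to_newick(node.get_left(), labels, node.dist)
    right = to_newick(node.get_right(), labels, node.dist)
    return f"({left},{right}):{branch:.4f}"

# Illustrative kmer count vectors for three sequences.
labels = ["seqA", "seqB", "seqC"]
vectors = np.array([[4, 1, 0], [3, 2, 0], [0, 1, 5]], dtype=float)

root = to_tree(linkage(vectors, method="ward", metric="euclidean"))
newick = to_newick(root, labels, root.dist) + ";"
print(newick)
```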
usage: TreeMer.py [-h] [-i I I] [-k K] [-m M] [-s]
[-d {distance metric}]
[-c {clustering method}]
[-g G]
[fa_files [fa_files ...]]
positional arguments:
fa_files An arbitrary number of sequence files in FASTA format.
optional arguments:
-h, --help show this help message and exit
-i I I Lower and upper bound percentiles to construct the
tree. E.g. 25 75 will generate a tree from kmers from
the 25th to the 75th percentiles in the total set of
kmers ordered by count.
-k K Kmer size to use in constructing genome comparison.
Default=7.
-m M The maximum count to return a kmer, e.g. return only
kmers with count <=10 if m=10. Default=return ALL.
-s Suppress the generation of kmer-spectra from sequence
files. This assumes that all positional arguments
provided to this tool are already kmer-spectra files
generated by genKmerCount. Default=False.
-d {euclidean,minkowski,cityblock,sqeuclidean,hamming,jaccard,chebyshev,canberra,braycurtis,yule}
Metric used in calculating distance between kmer
spectra. Default=euclidean.
-c {ward,single,complete,average,weighted,centroid,median}
Clustering method utilised to build the tree.
Default=ward.
-g G A tab separated text file containing geographic
locations for each sequence, with the sequence ID in
col0 and geolocation in col1. Default=False.
-v Verbose output mode. Default=False.
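The -i and -m filters can be pictured with the sketch below. It reflects one reading of the option descriptions, not TreeMer's actual implementation: `filter_by_percentile` and `filter_by_max_count` are hypothetical names, and thresholding on count values at the given percentiles (rather than on rank position) is an assumption.

```python
import numpy as np

def filter_by_percentile(counts, lower, upper):
    """Keep kmers whose count lies between the lower and upper percentiles."""
    values = np.array(list(counts.values()), dtype=float)
    lo, hi = np.percentile(values, [lower, upper])
    return {kmer: n for kmer, n in counts.items() if lo <= n <= hi}

def filter_by_max_count(counts, max_count):
    """Keep kmers with count <= max_count, as described for -m."""
    return {kmer: n for kmer, n in counts.items() if n <= max_count}

# Illustrative spectrum with one very abundant kmer.
counts = {"AAA": 1, "AAC": 2, "ACA": 3, "CAA": 4, "CCC": 100}

print(filter_by_percentile(counts, 25, 75))
print(filter_by_max_count(counts, 10))
```

Both filters drop the highly abundant kmer in this example, which is the kind of outlier such options are typically meant to remove.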
A dataset of complete SARSCOV2 genomes is provided with this tool, in the /TreeMer/SARSCOV2/SARSCOV2_WGS directory. This includes geolocations of each isolate in /TreeMer/SARSCOV2/geolocs.tsv.
The entire pipeline can be run using a single command from the TreeMer root directory:
python3 TreeMer.py SARSCOV2/SARSCOV2_WGS/* -k 7 -i 10 90 -d euclidean -c ward -g SARSCOV2/geolocs.tsv
In this instance, we calculate the Euclidean distance between 7mer frequency vectors after stripping out the 10% least and 10% most frequent kmers, and cluster using Ward's method. The resulting tree is:
A number of distance metrics and clustering methods are supported by this tool.
Distance metrics:
- Euclidean
- Minkowski
- Cityblock
- Sqeuclidean
- Hamming
- Jaccard
- Chebyshev
- Canberra
- Braycurtis
- Yule
Clustering methods:
- Ward
- Single
- Complete
- Average
- Weighted
- Centroid
- Median
Dependencies:
python3
argparse
scipy
numpy
matplotlib
seaborn