-
Notifications
You must be signed in to change notification settings - Fork 49
SNaQ
To run SNaQ, you need
-
data extracted from sequence alignments:
- a list of estimated unrooted gene trees (e.g. from RAxML), or
- a table of concordance factors (CF) (e.g. from BUCKy)
-
a starting topology (e.g. from Quartet MaxCut or ASTRAL, or RAxML tree from a single gene...)
List of estimated unrooted gene trees: plain text file with one tree per line
in parenthetical format.
Table of CF: plain text file or comma-separated file, obtained from TICR pipeline.
Starting topology: tree or network in parenthetical format
SNaQ estimates unrooted explicit networks with a maximum pseudolikelihood approach.
Read the data into Julia:
buckyCF = readTableCF("bucky-output/1_seqgen.CFs.csv")
or
raxmlCF = readTrees2CF("raxml/besttrees.tre", writeTab=false, writeSummary=false)
For the starting topology, the Quartet MaxCut tree is saved in one file, whereas the ASTRAL tree is saved at the end of a list of bootstrap trees:
tre = readTopology("bucky-output/1_seqgen.QMC.tre")
or
astraltree = readMultiTopology("astral/astral.tre")[102] # main tree with BS as node labels
Estimate the best network from bucky's quartet CF and hmax
number of hybridizations (make sure to have a folder called snaq
in the working directory):
net0 = snaq!(tre, buckyCF, hmax=0, filename="snaq/net0_bucky", seed=123)
net1 = snaq!(net0, buckyCF, hmax=1, filename="snaq/net1_bucky", seed=456)
We want to continue with hmax=2
, but faster. Let's tell Julia to
use multiple processors. When we do so, snaq!
will send its different
runs to different processors. The screen output is a little different.
using Distributed
addprocs(2) # tells Julia to add 2 worker cores
@everywhere using PhyloNetworks # tells these cores to load the PhyloNetworks package for themselves
net2 = snaq!(net1, buckyCF, hmax=2, filename="snaq/net2_bucky", seed=789, runs=5) # different runs will be sent to different cores
By default, each optimization will do 10 runs (runs=10
), but we choose runs=5
for speed in the tutorial.
To estimate the best network from raxml's gene trees, the command is the same,
except for the input data (and perhaps for the output file where we want
the results to do):
net1r = snaq!(net0, raxmlCF, hmax=1, filename="snaq/net1_raxml")
We can plot the estimated networks:
using PhyloPlots
plot(net0, :R);
again, if a plot window didn't pop up, an alternative is to save the plot as a pdf and open it outside of julia:
R"pdf"("plot3-net0.pdf", width=3, height=3);
plot(net0, :R);
R"dev.off()";
But taxon 6 was the outgroup, and it does not show as an outgroup. This is okay, because SNaQ infers an unrooted (or rather, semi-rooted) network. Let's reroot all of our networks with our outgroup.
# R"pdf"("plot4-net012.pdf", width=8, height=3); # to save as pdf if needed
R"layout(matrix(1:3,1,3))"; # to have a panel of 3 plots in 1 figure
R"par"(mar=[0.1,0.1,0.1,0.5]);
rootatnode!(net0, "6")
plot(net0, :R);
rootatnode!(net1, "6")
plot(net1, :R);
rootatnode!(net2, "6")
plot(net2, :R);
# R"dev.off()";
The minor hybrid edge crosses other edges, so let's rotate the top/down sisters to make the plot nicer. First, we need to know the number of the node around which we want to rotate edges.
# R"pdf"("plot5-net1.pdf", width=8, height=3); # to save as pdf if needed
R"layout"([1 2 3]); # for 3 plots in 1 figure. [1 2 3] is a 1x3 matrix (of plots here)
R"par"(mar=[0.1,0.1,0.1,0.5]);
plot(net1, :R, showNodeNumber=true);
rotate!(net1, -3)
plot(net1, :R);
plot(net1, :R, showGamma=true);
# R"dev.off()";
-
.out
file: contains the best network among all runs, and the best network per run, includes also the pseudolikelihood score and the computation time.(1,2,((4,(3)#H7:::0.8455008017543593):10.0,(6,(5,#H7:::0.15449919824564068):1.3478059001640503):0.8097232227650086):8.859048876499543); -Ploglik = 32.44166815275073 Dendroscope: (1,2,((4,(3)#H7):10.0,(6,(5,#H7):1.3478059001640503):0.8097232227650086):8.859048876499543); Elapsed time: 279.945323758 seconds, 10 attempted runs ------- List of estimated networks for all runs (sorted by log-pseudolik; the smaller, the better): (1,2,((4,(3)#H7:::0.8455008017543593):10.0,(6,(5,#H7:::0.15449919824564068):1.3478059001640503):0.8097232227650086):8.859048876499543);, with -loglik 32.44166815275073 (1,2,((6,(5,#H7:::0.15449929246440022):1.3478048315566553):0.8097228561399494,(4,(3)#H7:::0.8455007075355998):10.0):8.858993407546471);, with -loglik 32.44166815276231 (1,2,((6,(5,#H7:::0.15449950311468977):1.347797935277818):0.8097218999238613,(4,(3)#H7:::0.8455004968853103):10.0):8.859068579213334);, with -loglik 32.44166815299586 (1,2,((4,(3)#H7:::0.8455028784370155):10.0,(6,(5,#H7:::0.15449712156298445):1.3478623437305024):0.8097829889895752):9.686329026439614);, with -loglik 32.44681878021907 (1,2,((4,(3,#H7:0.0::0.12596024349126436):10.0):6.998525545802376,((5,6):0.6706815043923903)#H7:0.02644187240118026::0.8740397565087357):8.802233330267686);, with -loglik 38.47631220369748 (1,2,((4,(3,#H7:0.0::0.12669655104300853):5.137606913234892):9.662626208222877,((5,6):0.6721814535915894)#H7:0.025109110453433996::0.8733034489569915):9.999190138600072);, with -loglik 38.48682416318879 (1,2,(((5,6):0.48745060060410667,(3,(4)#H7:::0.8562805766179749):7.459921008438409):5.783680919305781,#H7:::0.14371942338202517):9.999680749757447);, with -loglik 39.165297532921166 (2,1,(((3,(4)#H7:::0.856702302080893):6.842325499857021,(5,6):0.48745123178986577):8.84844751722053,#H7:::0.1432976979191069):9.705601003398822);, with -loglik 39.165297532966704 (2,1,(((3,(4)#H7:::0.856716182654851):6.827117792791621,(5,6):0.4874498177842928):9.980179092466777,#H7:::0.14328381734514894):8.210595868229708);, with -loglik 39.16529753331082 (1,2,(((4,3):1.4572675240847626,(5,(6)#H7:::0.7580193251190865):1.31560016434676):9.979843400401359,#H7:::0.2419806748809134):7.098369544865319);, with -loglik 40.717647931470985 -------
-
.networks
file: contains the networks obtained by switching the hybrid node inside each cycle (because of potential identifiabiliy issues).(1,2,((4,(3)#H7:::0.8455008017543593):10.0,(6,(5,#H7:::0.15449919824564068):1.3478059001640503):0.8097232227650086):8.859048876499543);, with -loglik 32.44166815275073 (best network found, remaining sorted by log-pseudolik; the smaller, the better) (1,2,((4,(3,#H7:::0.17493951971036137):1.0060603344148846):2.251254356415373,(6,(5)#H7:::0.8250604802896386):1.1591996918589282):8.12886551182064);, with -loglik 38.56527982546186 (1,2,(#H7:::0.17739396674858887,(6,(5,(3,(4)#H7:::0.8226060332514111):8.054305321403243):0.033502492781334416):3.6927109132625615):5.3702784570971644);, with -loglik 86.3478945657121 (#H7:9.374999128965698::0.42574056046963116,4,(3,(5,(6,((1,2):8.859258713532823)#H7:0.8096935556192413::0.5742594395303688):1.8395730576771743):3.5649233234248294):0.4815558034289685);, with -loglik 119.5463693848075 (1,2,((4,(3,(5,(6)#H7:::0.5814752796026457):2.7705818610565034):0.0):1.7681838379153163,#H7:::0.41852472039735433):8.365569611574916);, with -loglik 226.60240574282727
-
.log
file: description of each run, convergence criterion, seed information -
.err
file: seed information on runs that failed, empty when nothing failed.In the case of a failed run, for example:
Total errors: 1 in seeds [4545]
, you could run the following function on the same settings that caused the error:snaqDebug(tree,data,seed=4545)
(to help us debug).
- For big datasets, do only one run first to get an idea of computing time:
net1 = snaq!(net0, buckyCF, hmax=1, filename="snaq/net1_bucky", runs=1)
- Increase the number of hybridizations sequentially:
hmax=0,1,2,...
, and use the best network ath-1
as starting point to estimate the best network ath
. - Whenever possible, do many independent runs (default
runs=10
) to avoid getting stuck on local maxima. We do not need to do all runs sequentially. We can parallize by doing different runs on different cores (or computers). If we are on a machine or on a cluster that has many different cores, we can ask Julia to use these multiple cores, andsnaq!
will send different runs to different cores, like we did earlier. To do this: - For long jobs, run as a script in the terminal:
julia runSNaQ.jl
, arguments to the script are passed to Julia as a vector calledARGS
. See the example scriptrunSNaQ.jl
in the folderdata_results/scripts/
. more on this topic in the package doc.
The network does not make any sense!
Recall:
- The root is meaningless, use the rooting functions (more here)
- Identifiability issues: check out the
.networks
file, and its pseudolikelihood scores (more here).looking at the plots: the difference between the best and second-scoring networks is the direction of gene flow. To see the likelihood scores of these alternative networks:vnet = readMultiTopology("snaq/net1_bucky.networks") plot(vnet[1], :R) # best network rootatnode!(vnet[1],"6") rootatnode!(vnet[2],"6") # R"pdf"("plot6-boot.pdf", width=6, height=3); # to save as pdf if needed R"layout"([1 2]); # [1 2] is a 1x2 matrix. [1,2] is a vector of length 2 R"par"(mar=[0.1,0.1,0.1,0.5]); plot(vnet[1], :R); plot(vnet[2], :R); #R"dev.off"();
; less snaq/net1_bucky.networks ; grep -Eo "with -loglik [0-9]+.[0-9]+" snaq/net1_bucky.networks
- Level-1 assumption: no intersecting cycles, so maybe no more hybridizations can be added if
number of taxa is small
- The bigger the network to estimate, the more genes are needed (compare bootstrap results between 30 genes and 300 genes).
- General documentation here
- Inside Julia:
?plot
- PhyloNetworks google group
- check the issues open on GitHub, open a new one if the problem that you are encountering is new
- email us
Next: how to choose the number of reticulations.
PhyloNetworks Workshop
- home
- example data
-
TICR pipeline:
from sequences to quartet CFs
- the data
- MrBayes on all genes
- BUCKy
- Quartet MaxCut
- RAxML & ASTRAL
- PhyloNetworks: from quartet CFs or gene trees to phylogenetic networks
- TICR test: is a population tree with ILS sufficient (vs network)?
- Continuous trait evolution on a network