Skip to content

Choice of number of hybridizations

ClaudiaSL edited this page Apr 2, 2022 · 7 revisions

Model selection tools are necessary to estimate the number of hybridizations (h). We use the log pseudolikelihood profile with h. A sharp improvement is expected until h reaches the best value and a slower, linear improvement thereafter.

Unfortunately, there is no clear-cut criterion to determine what constitutes a large and 'significant' improvement in the pseudolikelihood score. Even with a regular likelihood, AIC and BIC are not quite appropriate, because they are not meant to accommodate the exploding number of models when h gets bigger (the network space grows very fast). That being said, the pattern of score improvement can be used heuristically.

The negative log pseudolikelihood score is an attribute of the network object: net.loglik. The lower the better.

scores = [net0.loglik, net1.loglik, net2.loglik]
using RCall
R"pdf"("score-vs-h.pdf", width=4, height=4);
R"plot"(x=0:2, y=scores, type="b", xlab="number of hybridizations h",
        ylab="network score");
R"dev.off"();

Below are examples from 2 different data sets.



Goodness-of-fit of a candidate network

Recent work by Cai and Ane (2021) introduce a goodness-of-fit test for the adequacy of the multispecies network coalescent to see if a given population or species network explains the quartet concordance factor data adequately.

We need to install the new Julia package QuartetNetworkGoodnessFit.jl.

The commands below assume that we are in the directory for the data set on 15 taxa: n15.gamma0.30.20.2_n300. So first, navigate to get there. After the PhyloNetwork analyses on the other data set, we would need to do this:

cd ../n15.gamma0.30.20.2_n300/

We first read the CF table and get rid of the credibility intervals of the CFs:

using QuartetNetworkGoodnessFit, CSV
df = DataFrame(CSV.File("bucky-output/1_seqgen.CFs.csv"))
dat = df[:,[1,2,3,4,5,8,11,14]]

Then, we read the estimated network with 1 hybridization (our candidate network):

snaqnet1 = readTopology("snaq/net1_bucky.out")
rootatnode!(snaqnet1,"15")

We can plot the network:

using PhyloPlots
plot(snaqnet1,:R)

And now we fit the goodness-of-fit test (this could take a couple of minutes):

out = quarnetGoFtest!(snaqnet1, dat, false)

The first component of the output out[1] corresponds to the p-value of the overall goodness-of-fit test:

julia> out[1]
6.760380685802971e-6

This p-value implies that the candidate network (snaqnet1) does not explain the quartet CFs adequately. This is not surprising as the data in n15.gamma0.30.20.2_n300 corresponds to simulated data from a network with 3 hybridizations, and snaqnet1 has only 1.

The other components in the output out (as well as input arguments) are explained in the documentation here.

Next: perform and summarize a bootstrap analysis

PhyloNetworks Workshop

Clone this wiki locally