Simulation protocol #3

celinescornavacca · 2020-11-11T08:50:01Z

We are going to simulate our networks under the birth-hybridization process of Zhang et al 2018.
In this process, we have a speciation rate l (lambda), a hybridisation rate v (nu) and the time we run the process t_0.

From their paper: "The simulator starts from the time of origin (t_0) with one species. A species splits into two (speciation) with rate l, and two species merge into one (hybridization) with rate v. At the moment of k branches, the total rate of change is r_tot = k l + choose (k 2) * v. We generate a waiting time ~exp(r_tot) and a random variable u ~U(0,1), If u < kl=r_tot, we randomly select a branch to split; otherwise, we randomly select two branches to join, and generate an inheritance
probability c ~U(0,1). The simulator stops at time t_0."

Zhang et al. - 2018 - Bayesian Inference of Species Networks from Multil.pdf

No transfer from now, because branches with no length cannot be recovered.

Given such generated network, we extract a random tree, using the inheritance probabilities at the reticulated nodes: for each reticulated node x, we draw a number y ~ U(0,1) and we choose the first parent p_x if y < inheritanceProb of the edge p_x,x. If we do this,we have atree but not a binary one. Then we clean the tree from the dead nodes and 1-indegree 1-outdegree nodes, as Sarah explained in her document, and we obtain a binary tree. On this binary tree, we simulate N sequences with Seq-Gen-1.3.4 (currently under this model -mHKY -t3.0 -f0.3,0.2,0.2,0.3 this can be changed).
We do this K times, obtaining an alignment of length K*N in which each group of N sites has its tree.

celinescornavacca · 2020-11-11T09:27:45Z

In their paper, they use this parameters in the simulations

\lambda - \nu ~ exp(0.1) with mean 10
\tau_0 ~ exp(10) with mean 0.1
\nu / \lambda ~ Beta distribution with alpha = 1 and beta = 1 (same as uniform distribution between 0 and 1)
\gamma ~ Uniform distribution (between 0 and 1)

As I said in a slack message, this process gives Yule-type networks, that act as the Yule-type trees, see example below for trees:

Once they start having a certain amount of nodes, things get crazy and they start speciating and hybridising a lot.
It is not the focus of our paper to correctly simulate networks (we are not in a Bayesian framework, we do not need priors) and this is the best we can for now (for the next paper, we will do better, working on it).
So we can simply throw the networks we do not like as done in line 143-147 of this simulator, which only simulate networks and not trees and sequences.

RandomNetwork.txt
(in txt because py is not supported)

Sarah, you will find the parameters they use in the paper on lines 8-17, you can copy and paste this in your simulator and add a line similar to line 143, and you should be good.

celinescornavacca · 2020-11-11T11:11:47Z

I did play with the simulator for a while and I think that I found a way to set the parameters that gives in average networks with enough leaves and not too many reticulations.

I also suggest to use the ratio no_of_hybrids/no_of_leaves to do our choices (line 148)
Note that the choice of parameters has to be done in the "while simulateNetwork:" loop (line 46-48), not before as done in my previous file

RandomNetwork.txt

lutteropp · 2020-11-19T12:28:46Z

Great, it works! 👍 It is very slow, but it works! (The slightly adapted version is in https://github.com/lutteropp/NetRAX/blob/master/simulator/src/network/logic/celine_simulator.py)

Due to runtime issues, it is not realistic to exactly pre-specify wanted number of taxa and wanted number of reticulations with this simulator. Thus, we are categorizing the simulated datasets (accepting every network with at least 30 taxa and up to 10% reticulations) into low/medium/large buckets instead. For generating a set of acceptable datasets, we use https://github.com/lutteropp/NetRAX/blob/master/simulator/src/network/logic/simulate_celine_endless.py

Our current simulation setup is:

categorize simulated networks into low/medium/high buckets (in terms of #taxa/#reticulations)
use 5 random starting points in NetRAX
#of trees to simulate: each displayed tree
MSA-partition-size: 500, 1000 sites per tree
use both likelihood functions (average + best)

As a first step (to test the pipeline), we will also do the simulations on tiny networks (up to 10 taxa, up to 2 reticulations).

Slightly later, we will also run NetRAX with raxml-ng best tree (to check performance implications).

Much later (for second paper):

3 simulators (celine, sarah, empirical datasets)
identifiability issue, different networks can have same displayed trees --> graph isomorphism on rooted DAGs for checking network topology

The entire simulation pipeline can currently be found here:
https://github.com/lutteropp/NetRAX/tree/master/simulator/src/network/logic

stamatak · 2020-11-19T15:02:50Z

+ start from one RAxML tree

…

On 19.11.20 14:29, Sarah Lutteropp wrote: Great, it works! 👍 It is very slow, but it works! (The slightly adapted version is in https://github.com/lutteropp/NetRAX/blob/master/simulator/src/network/logic/celine_simulator.py) Due to runtime issues, it is not realistic to exactly pre-specify wanted number of taxa and wanted number of reticulations with this simulator. Thus, we are categorizing the simulated datasets (accepting every network with more than 30 taxa and up to 10% reticulations) into low/medium/large buckets instead. For generating a set of acceptable datasets, we use https://github.com/lutteropp/NetRAX/blob/master/simulator/src/network/logic/simulate_celine_endless.py Our current simulation setup is: * categorize simulated networks into low/medium/high buckets (in terms of #taxa/#reticulations) * use 5 random starting points in NetRAX * #of trees to simulate: each displayed tree * MSA-partition-size: 500, 1000 sites per tree * use both likelihood functions (average + best) As a first step (to test the pipeline), we will also do the simulations on tiny networks (up to 10 taxa, up to 2 reticulations). Slightly later, we will also run NetRAX with raxml-ng best tree (to check performance implications). Much later (for second paper): * 3 simulators (celine, sarah, empirical datasets) * identifiability issue, different networks can have same displayed trees --> graph isomorphism on rooted DAGs for checking network topology The entire simulation pipeline can currently be found here: https://github.com/lutteropp/NetRAX/tree/master/simulator/src/network/logic — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#3 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAGXB6XKIBIXF7PSO2LZ3JDSQUFQ5ANCNFSM4TRW4E2Q>.

-- Alexandros (Alexis) Stamatakis Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology www.exelixis-lab.org

lutteropp · 2020-11-20T19:52:04Z

There is one problem with our simulation setup from above - it ignores the displayed tree probabilities.

I implemented the following sampling types:

class SamplingType(Enum):
    STANDARD = 1 # randomly choose which tree to sample, then sample equal number of sites for each sampled tree - this is the only mode that uses the n_trees or m parameter for sampling
    PERFECT_SAMPLING = 2 # sample each displayed tree, and as many site as expected by the tree probability
    PERFECT_UNIFORM_SAMPLING = 3 # sample each displayed tree, with the same number of sites per tree (ignoring reticulation probabilities)
    SINGLE_SITE_SAMPLING = 4 # sample each site individually, with the reticulation probabilities in mind

What I wrote above would be PERFECT_UNIFORM_SAMPLING. I am running the small dataset simulations with this for now, but I expect NetRAX to not infer the correct reticulation probabilities in this setting. Instead, I believe PERFECT_SAMPLING is better. What are your thoughts on this?

stamatak · 2020-11-21T18:28:28Z

I'd tend to agree that perfect_sampling is better alexis

…

On 20.11.20 21:52, Sarah Lutteropp wrote: There is one problem with our simulation setup, and that is that it ignores the displayed tree probabilities. I implemented the following sampling types: |class SamplingType(Enum): STANDARD = 1 # randomly choose which tree to sample, then sample equal number of sites for each sampled tree - this is the only mode that uses the n_trees or m parameter for sampling PERFECT_SAMPLING = 2 # sample each displayed tree, and as many site as expected by the tree probability PERFECT_UNIFORM_SAMPLING = 3 # sample each displayed tree, with the same number of sites per tree (ignoring reticulation probabilities) SINGLE_SITE_SAMPLING = 4 # sample each site individually, with the reticulation probabilities in mind | What I wrote above would be PERFECT_UNIFORM_SAMPLING. I am running the small dataset simulations with this for now, but I expect NetRAX to not infer the correct reticulation probabilities in this setting. Instead, I believe PERFECT_SAMPLING is better. What are your thoughts on this? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#3 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAGXB6VXZ5IVVHMOQHDEGNLSQ3CHHANCNFSM4TRW4E2Q>.

-- Alexandros (Alexis) Stamatakis Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology www.exelixis-lab.org

celinescornavacca · 2020-11-23T07:59:45Z

Same here!

lutteropp · 2020-11-25T15:22:47Z

Okay, switched to PERFECT_SAMPLING as we all agree on that. Done in addb906.

(EDIT: Moved starting networks question to new issue)

stamatak · 2020-11-25T18:47:56Z

Well if that's the way it is, then let's add more starting networks ... An additional interesting number here would be to compare the LnL of the true network and the inferred networks to see how far off we are. Alexis

…

On 25.11.20 17:23, Sarah Lutteropp wrote: Okay, switching to PERFECT_SAMPLING as we all agree on that. Second problem: 5 random starting networks are not enough, even if the "true" network is a tree. RAxML-NG uses by default 10 parsimony + 10 random starting trees. I tried NetRAX (with only 5 random start networks) on a small (10 taxa) tree dataset and it got stuck in a local optimum. We currently have no parsimony networks in NetRAX. But we definitely need more than five random start networks with the current network search algorithm (adding comparison to RAxML-NG and start from best RAxML-tree right now). — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#3 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAGXB6X2AYV3A3W2EXM7ND3SRUONRANCNFSM4TRW4E2Q>.

-- Alexandros (Alexis) Stamatakis Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology www.exelixis-lab.org

lutteropp · 2020-11-25T19:35:05Z

@stamatak Already in there ;-) I copy-pasted the struct I use for the results, as this is the shortest code part where one can see what is currently evaluated

class Result:
    def __init__(self, dataset):
        self.dataset = dataset
        self.bic_true = 0
        self.logl_true = 0
        self.bic_inferred = 0
        self.logl_inferred = 0
        self.bic_inferred_with_raxml = 0
        self.logl_inferred_with_raxml = 0
        self.bic_raxml = 0
        self.logl_raxml = 0
        self.n_reticulations_inferred = 0
        self.n_reticulations_inferred_with_raxml = 0
        self.topological_distances = {}
        self.topological_distances_with_raxml = {}

(inferred_with_raxml means that RAxML best tree was used as single starting network)

celinescornavacca · 2020-11-25T21:14:52Z

Looking good!

lutteropp · 2020-11-29T23:58:31Z

This is what we currently measure: #14

lutteropp · 2020-11-30T00:46:09Z

I just realized that we may need both PERFECT_SAMPLING and PERFECT_UNIFORM_SAMPLING in our experiments:

PERFECT_SAMPLING is better for comparing loglikelihood and BIC score.
PERFECT_UNIFORM_SAMPLING is better for scoring well when comparing network topology. Because those Dendroscope topological distances do not take reticulation probabilities into account...

stamatak · 2020-11-30T08:20:28Z

sounds good,

…

On 30.11.20 02:46, Sarah Lutteropp wrote: I just realized that we need both PERFECT_SAMPLING and PERFECT_UNIFORM_SAMPLING in our experiments: * PERFECT_SAMPLING is better for comparing loglikelihood and BIC score. * PERFECT_UNIFORM_SAMPLING is better for comparing network topology. Because those Dendroscope topological distances do not take reticulation probabilities into account... — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAGXB6SRVQSQOIH4B47UXVLSSLTN3ANCNFSM4TRW4E2Q>.

-- Alexandros (Alexis) Stamatakis Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology www.exelixis-lab.org

lutteropp · 2020-12-06T01:18:49Z

There are three different settings on how to handle partitions:

PLLMOD_COMMON_BRLEN_UNLINKED: Each partition has its own independent branch lengths and reticulation probs.
PLLMOD_COMMON_BRLEN_LINKED: All partitions share the same branch lengths and reticulation probs.
PLLMOD_COMMON_BRLEN_SCALED: All partitions share the same base branch lengths and reticulation probs. In addition to that, each partition has its own scaling factor by which it is allowed to multiply the branch lengths for it.

... It appears to me that we need to include all these three possibilities in our experiments, as soon as we have more than one partition in the MSA.

stamatak · 2020-12-06T18:08:05Z

hm yes, we do, but maybe linked is not so easy to implement for a network (and one also already needs to be careful when implementing this for standard ML), the easiest is probably UNLINKED

…

On 06.12.20 03:19, Sarah Lutteropp wrote: There are three different settings on how to handle branch lengths: * PLLMOD_COMMON_BRLEN_UNLINKED: Each partition has its own independent branch lengths and reticulation probs. * PLLMOD_COMMON_BRLEN_LINKED: All partitions share the same branch lengths and reticulation probs. * PLLMOD_COMMON_BRLEN_SCALED: All partitions share the same branch lengths and reticulation probs. In addition to that, each partition has its own scaling factor by which it is allowed to multiply the branch lengths for it. ... It appears to me that we need all these three possibilities in our experiments. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAGXB6UJSWRGXMRFDHV75V3STLLYLANCNFSM4TRW4E2Q>.

-- Alexandros (Alexis) Stamatakis Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology www.exelixis-lab.org

lutteropp · 2020-12-06T18:14:06Z

@stamatak umm... all 3 versions are fully implemented in NetRAX already

stamatak · 2020-12-06T18:15:49Z

fine with me ... just commented

…

On 06.12.20 20:14, Sarah Lutteropp wrote: @stamatak <https://github.com/stamatak> umm... all 3 versions are fully implemented in NetRAX already — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAGXB6QJ5LIVOS2TCS2F6W3STPCXXANCNFSM4TRW4E2Q>.

-- Alexandros (Alexis) Stamatakis Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology www.exelixis-lab.org

lutteropp · 2020-12-06T18:31:29Z

should I switch the experiments to run with UNLINKED then? So far I used LINKED.

stamatak · 2020-12-07T05:46:26Z

I believe unlinked/scaled make more sense, with unlinked there's always the danger of over-parametrizing as each partition addes a whole lot of additional free branch length parameters ...

…

On 06.12.20 20:31, Sarah Lutteropp wrote: should I switch the experiments to run with UNLINKED then? So far I used LINKED. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAGXB6QGW227G6KRL2QTRXTSTPEY5ANCNFSM4TRW4E2Q>.

-- Alexandros (Alexis) Stamatakis Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology www.exelixis-lab.org

celinescornavacca · 2020-12-07T08:35:44Z

I would go with scaled but in the experiment linked should be enough since the data are linked (TODO for the second paper, use a scaled framework to simulate?)

PERFECT_UNIFORM_SAMPLING is better for scoring well when comparing network topology. Because those Dendroscope topological distances do not take reticulation probabilities into account...

I am not sure that I would do that, because we need in any case a way to check the gammas when reconstructing the network in the PERFECT_SAMPLING setting.

lutteropp · 2020-12-07T12:03:20Z

Cool, then let's stick to scaled (and if we under-estimate reticulations, switch to linked later) and only PERFECT_SAMPLING first :)

It is easier (in terms of total runtime, number of plots to look at, and interpretation) to not try every possible combination of parameters one could think of in the initial experiments...

stamatak · 2020-12-07T12:07:18Z

sounds good, still unlinked should also be tested, as we don't understand the effects of reticulations on branch length variance/differences yet

…

On 07.12.20 14:03, Sarah Lutteropp wrote: Cool, then let's stick to /scaled/ (and if we under-estimate reticulations, switch to linked later) and only /PERFECT_SAMPLING/ first :) It is easier (in terms of total runtime, number of plots to look at, and interpretation) to not try every possible combination of parameters one could think of in the initial experiments... — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAGXB6QDYZSGGOZCAE6S6KLSTTABRANCNFSM4TRW4E2Q>.

-- Alexandros (Alexis) Stamatakis Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology www.exelixis-lab.org

celinescornavacca · 2020-12-07T12:56:52Z

We can test it after the scaled one, @stamatak, to avoid the burden to test them at the same time?

lutteropp · 2020-12-07T13:16:30Z

Yes, let's make a separate experiment that only looks at the effect of scaled vs. unlinked brlens :)
EDIT: I created an issue for it: #28

lutteropp · 2020-12-07T13:26:08Z

(I can do the runs with tons of different parameters on the cluster, but my problem is interpreting, plotting, and summarizing all these results for so many different parameter combinations at once; and still seeing the connections there... it's making tons of plots)

lutteropp · 2020-12-07T13:40:15Z

(also, having to entirely re-run some broken experiment is more painful when it was on a lot of parameter combinations)

stamatak · 2020-12-07T18:45:19Z

I'd do it right away and would just push everything on the cluster, as I don't have an intuition how this will behave, it's better to just conduct all runs at once

…

On 07.12.20 15:40, Sarah Lutteropp wrote: (also, having to entirely re-run some broken experiment is more painful when it was on a lot of parameter combinations) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAGXB6USQK5GJG3M743FSR3STTLNBANCNFSM4TRW4E2Q>.

-- Alexandros (Alexis) Stamatakis Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology www.exelixis-lab.org

lutteropp mentioned this issue Nov 29, 2020

100 simulated MSA sites on 10-taxon-tree is not enough for RAxML #11

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simulation protocol #3

Simulation protocol #3

celinescornavacca commented Nov 11, 2020

celinescornavacca commented Nov 11, 2020 •

edited

Loading

celinescornavacca commented Nov 11, 2020

lutteropp commented Nov 19, 2020 •

edited

Loading

stamatak commented Nov 19, 2020 via email

lutteropp commented Nov 20, 2020 •

edited

Loading

stamatak commented Nov 21, 2020 via email

celinescornavacca commented Nov 23, 2020

lutteropp commented Nov 25, 2020 •

edited

Loading

stamatak commented Nov 25, 2020 via email

lutteropp commented Nov 25, 2020 •

edited

Loading

celinescornavacca commented Nov 25, 2020

lutteropp commented Nov 29, 2020

lutteropp commented Nov 30, 2020 •

edited

Loading

stamatak commented Nov 30, 2020 via email

lutteropp commented Dec 6, 2020 •

edited

Loading

stamatak commented Dec 6, 2020 via email

lutteropp commented Dec 6, 2020

stamatak commented Dec 6, 2020 via email

lutteropp commented Dec 6, 2020

stamatak commented Dec 7, 2020 via email

celinescornavacca commented Dec 7, 2020

lutteropp commented Dec 7, 2020

stamatak commented Dec 7, 2020 via email

celinescornavacca commented Dec 7, 2020

lutteropp commented Dec 7, 2020 •

edited

Loading

lutteropp commented Dec 7, 2020 •

edited

Loading

lutteropp commented Dec 7, 2020

stamatak commented Dec 7, 2020 via email

Simulation protocol #3

Simulation protocol #3

Comments

celinescornavacca commented Nov 11, 2020

celinescornavacca commented Nov 11, 2020 • edited Loading

celinescornavacca commented Nov 11, 2020

lutteropp commented Nov 19, 2020 • edited Loading

stamatak commented Nov 19, 2020 via email

lutteropp commented Nov 20, 2020 • edited Loading

stamatak commented Nov 21, 2020 via email

celinescornavacca commented Nov 23, 2020

lutteropp commented Nov 25, 2020 • edited Loading

stamatak commented Nov 25, 2020 via email

lutteropp commented Nov 25, 2020 • edited Loading

celinescornavacca commented Nov 25, 2020

lutteropp commented Nov 29, 2020

lutteropp commented Nov 30, 2020 • edited Loading

stamatak commented Nov 30, 2020 via email

lutteropp commented Dec 6, 2020 • edited Loading

stamatak commented Dec 6, 2020 via email

lutteropp commented Dec 6, 2020

stamatak commented Dec 6, 2020 via email

lutteropp commented Dec 6, 2020

stamatak commented Dec 7, 2020 via email

celinescornavacca commented Dec 7, 2020

lutteropp commented Dec 7, 2020

stamatak commented Dec 7, 2020 via email

celinescornavacca commented Dec 7, 2020

lutteropp commented Dec 7, 2020 • edited Loading

lutteropp commented Dec 7, 2020 • edited Loading

lutteropp commented Dec 7, 2020

stamatak commented Dec 7, 2020 via email

celinescornavacca commented Nov 11, 2020 •

edited

Loading

lutteropp commented Nov 19, 2020 •

edited

Loading

lutteropp commented Nov 20, 2020 •

edited

Loading

lutteropp commented Nov 25, 2020 •

edited

Loading

lutteropp commented Nov 25, 2020 •

edited

Loading

lutteropp commented Nov 30, 2020 •

edited

Loading

lutteropp commented Dec 6, 2020 •

edited

Loading

lutteropp commented Dec 7, 2020 •

edited

Loading

lutteropp commented Dec 7, 2020 •

edited

Loading