-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simulation protocol #3
Comments
In their paper, they use this parameters in the simulations \lambda - \nu ~ exp(0.1) with mean 10 As I said in a slack message, this process gives Yule-type networks, that act as the Yule-type trees, see example below for trees: Once they start having a certain amount of nodes, things get crazy and they start speciating and hybridising a lot. RandomNetwork.txt Sarah, you will find the parameters they use in the paper on lines 8-17, you can copy and paste this in your simulator and add a line similar to line 143, and you should be good. |
I did play with the simulator for a while and I think that I found a way to set the parameters that gives in average networks with enough leaves and not too many reticulations. I also suggest to use the ratio no_of_hybrids/no_of_leaves to do our choices (line 148) |
Great, it works! 👍 It is very slow, but it works! (The slightly adapted version is in https://github.com/lutteropp/NetRAX/blob/master/simulator/src/network/logic/celine_simulator.py) Due to runtime issues, it is not realistic to exactly pre-specify wanted number of taxa and wanted number of reticulations with this simulator. Thus, we are categorizing the simulated datasets (accepting every network with at least 30 taxa and up to 10% reticulations) into low/medium/large buckets instead. For generating a set of acceptable datasets, we use https://github.com/lutteropp/NetRAX/blob/master/simulator/src/network/logic/simulate_celine_endless.py Our current simulation setup is:
As a first step (to test the pipeline), we will also do the simulations on tiny networks (up to 10 taxa, up to 2 reticulations). Slightly later, we will also run NetRAX with raxml-ng best tree (to check performance implications). Much later (for second paper):
The entire simulation pipeline can currently be found here: |
+ start from one RAxML tree
…On 19.11.20 14:29, Sarah Lutteropp wrote:
Great, it works! 👍 It is very slow, but it works! (The slightly adapted
version is in
https://github.com/lutteropp/NetRAX/blob/master/simulator/src/network/logic/celine_simulator.py)
Due to runtime issues, it is not realistic to exactly pre-specify wanted
number of taxa and wanted number of reticulations with this simulator.
Thus, we are categorizing the simulated datasets (accepting every
network with more than 30 taxa and up to 10% reticulations) into
low/medium/large buckets instead. For generating a set of acceptable
datasets, we use
https://github.com/lutteropp/NetRAX/blob/master/simulator/src/network/logic/simulate_celine_endless.py
Our current simulation setup is:
* categorize simulated networks into low/medium/high buckets (in terms
of #taxa/#reticulations)
* use 5 random starting points in NetRAX
* #of trees to simulate: each displayed tree
* MSA-partition-size: 500, 1000 sites per tree
* use both likelihood functions (average + best)
As a first step (to test the pipeline), we will also do the simulations
on tiny networks (up to 10 taxa, up to 2 reticulations).
Slightly later, we will also run NetRAX with raxml-ng best tree (to
check performance implications).
Much later (for second paper):
* 3 simulators (celine, sarah, empirical datasets)
* identifiability issue, different networks can have same displayed
trees --> graph isomorphism on rooted DAGs for checking network topology
The entire simulation pipeline can currently be found here:
https://github.com/lutteropp/NetRAX/tree/master/simulator/src/network/logic
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#3 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAGXB6XKIBIXF7PSO2LZ3JDSQUFQ5ANCNFSM4TRW4E2Q>.
--
Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
www.exelixis-lab.org
|
There is one problem with our simulation setup from above - it ignores the displayed tree probabilities. I implemented the following sampling types:
What I wrote above would be PERFECT_UNIFORM_SAMPLING. I am running the small dataset simulations with this for now, but I expect NetRAX to not infer the correct reticulation probabilities in this setting. Instead, I believe PERFECT_SAMPLING is better. What are your thoughts on this? |
I'd tend to agree that perfect_sampling is better
alexis
…On 20.11.20 21:52, Sarah Lutteropp wrote:
There is one problem with our simulation setup, and that is that it
ignores the displayed tree probabilities.
I implemented the following sampling types:
|class SamplingType(Enum): STANDARD = 1 # randomly choose which tree to
sample, then sample equal number of sites for each sampled tree - this
is the only mode that uses the n_trees or m parameter for sampling
PERFECT_SAMPLING = 2 # sample each displayed tree, and as many site as
expected by the tree probability PERFECT_UNIFORM_SAMPLING = 3 # sample
each displayed tree, with the same number of sites per tree (ignoring
reticulation probabilities) SINGLE_SITE_SAMPLING = 4 # sample each site
individually, with the reticulation probabilities in mind |
What I wrote above would be PERFECT_UNIFORM_SAMPLING. I am running the
small dataset simulations with this for now, but I expect NetRAX to not
infer the correct reticulation probabilities in this setting. Instead, I
believe PERFECT_SAMPLING is better. What are your thoughts on this?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#3 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAGXB6VXZ5IVVHMOQHDEGNLSQ3CHHANCNFSM4TRW4E2Q>.
--
Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
www.exelixis-lab.org
|
Same here! |
Well if that's the way it is, then let's add more starting networks ...
An additional interesting number here would be to compare the LnL of the
true network and the inferred networks to see how far off we are.
Alexis
…On 25.11.20 17:23, Sarah Lutteropp wrote:
Okay, switching to PERFECT_SAMPLING as we all agree on that.
Second problem: 5 random starting networks are not enough, even if the
"true" network is a tree. RAxML-NG uses by default 10 parsimony + 10
random starting trees. I tried NetRAX (with only 5 random start
networks) on a small (10 taxa) tree dataset and it got stuck in a local
optimum.
We currently have no parsimony networks in NetRAX. But we definitely
need more than five random start networks with the current network
search algorithm (adding comparison to RAxML-NG and start from best
RAxML-tree right now).
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#3 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAGXB6X2AYV3A3W2EXM7ND3SRUONRANCNFSM4TRW4E2Q>.
--
Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
www.exelixis-lab.org
|
@stamatak Already in there ;-) I copy-pasted the struct I use for the results, as this is the shortest code part where one can see what is currently evaluated
(inferred_with_raxml means that RAxML best tree was used as single starting network) |
Looking good! |
This is what we currently measure: #14 |
I just realized that we may need both PERFECT_SAMPLING and PERFECT_UNIFORM_SAMPLING in our experiments:
|
sounds good,
…On 30.11.20 02:46, Sarah Lutteropp wrote:
I just realized that we need both PERFECT_SAMPLING and
PERFECT_UNIFORM_SAMPLING in our experiments:
* PERFECT_SAMPLING is better for comparing loglikelihood and BIC score.
* PERFECT_UNIFORM_SAMPLING is better for comparing network topology.
Because those Dendroscope topological distances do not take
reticulation probabilities into account...
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAGXB6SRVQSQOIH4B47UXVLSSLTN3ANCNFSM4TRW4E2Q>.
--
Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
www.exelixis-lab.org
|
There are three different settings on how to handle partitions:
... It appears to me that we need to include all these three possibilities in our experiments, as soon as we have more than one partition in the MSA. |
hm yes, we do, but maybe linked is not so easy to implement for a
network (and one also already needs to be careful when implementing this
for standard ML), the easiest is probably UNLINKED
…On 06.12.20 03:19, Sarah Lutteropp wrote:
There are three different settings on how to handle branch lengths:
* PLLMOD_COMMON_BRLEN_UNLINKED: Each partition has its own independent
branch lengths and reticulation probs.
* PLLMOD_COMMON_BRLEN_LINKED: All partitions share the same branch
lengths and reticulation probs.
* PLLMOD_COMMON_BRLEN_SCALED: All partitions share the same branch
lengths and reticulation probs. In addition to that, each partition
has its own scaling factor by which it is allowed to multiply the
branch lengths for it.
... It appears to me that we need all these three possibilities in our
experiments.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAGXB6UJSWRGXMRFDHV75V3STLLYLANCNFSM4TRW4E2Q>.
--
Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
www.exelixis-lab.org
|
@stamatak umm... all 3 versions are fully implemented in NetRAX already |
fine with me ... just commented
…On 06.12.20 20:14, Sarah Lutteropp wrote:
@stamatak <https://github.com/stamatak> umm... all 3 versions are fully
implemented in NetRAX already
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAGXB6QJ5LIVOS2TCS2F6W3STPCXXANCNFSM4TRW4E2Q>.
--
Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
www.exelixis-lab.org
|
should I switch the experiments to run with UNLINKED then? So far I used LINKED. |
I believe unlinked/scaled make more sense, with unlinked there's always
the danger of over-parametrizing as each partition addes a whole lot of
additional free branch length parameters ...
…On 06.12.20 20:31, Sarah Lutteropp wrote:
should I switch the experiments to run with UNLINKED then? So far I used
LINKED.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAGXB6QGW227G6KRL2QTRXTSTPEY5ANCNFSM4TRW4E2Q>.
--
Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
www.exelixis-lab.org
|
I would go with scaled but in the experiment linked should be enough since the data are linked (TODO for the second paper, use a scaled framework to simulate?)
I am not sure that I would do that, because we need in any case a way to check the gammas when reconstructing the network in the PERFECT_SAMPLING setting. |
Cool, then let's stick to scaled (and if we under-estimate reticulations, switch to linked later) and only PERFECT_SAMPLING first :) It is easier (in terms of total runtime, number of plots to look at, and interpretation) to not try every possible combination of parameters one could think of in the initial experiments... |
sounds good, still unlinked should also be tested, as we don't
understand the effects of reticulations on branch length
variance/differences yet
…On 07.12.20 14:03, Sarah Lutteropp wrote:
Cool, then let's stick to /scaled/ (and if we under-estimate
reticulations, switch to linked later) and only /PERFECT_SAMPLING/ first :)
It is easier (in terms of total runtime, number of plots to look at, and
interpretation) to not try every possible combination of parameters one
could think of in the initial experiments...
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAGXB6QDYZSGGOZCAE6S6KLSTTABRANCNFSM4TRW4E2Q>.
--
Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
www.exelixis-lab.org
|
We can test it after the scaled one, @stamatak, to avoid the burden to test them at the same time? |
Yes, let's make a separate experiment that only looks at the effect of scaled vs. unlinked brlens :) |
(I can do the runs with tons of different parameters on the cluster, but my problem is interpreting, plotting, and summarizing all these results for so many different parameter combinations at once; and still seeing the connections there... it's making tons of plots) |
(also, having to entirely re-run some broken experiment is more painful when it was on a lot of parameter combinations) |
I'd do it right away and would just push everything on the cluster, as I
don't have an intuition how this will behave, it's better to just
conduct all runs at once
…On 07.12.20 15:40, Sarah Lutteropp wrote:
(also, having to entirely re-run some broken experiment is more painful
when it was on a lot of parameter combinations)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAGXB6USQK5GJG3M743FSR3STTLNBANCNFSM4TRW4E2Q>.
--
Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
www.exelixis-lab.org
|
We are going to simulate our networks under the birth-hybridization process of Zhang et al 2018.
In this process, we have a speciation rate l (lambda), a hybridisation rate v (nu) and the time we run the process t_0.
From their paper: "The simulator starts from the time of origin (t_0) with one species. A species splits into two (speciation) with rate l, and two species merge into one (hybridization) with rate v. At the moment of k branches, the total rate of change is r_tot = k l + choose (k 2) * v. We generate a waiting time ~exp(r_tot) and a random variable u ~U(0,1), If u < kl=r_tot, we randomly select a branch to split; otherwise, we randomly select two branches to join, and generate an inheritance
probability c ~U(0,1). The simulator stops at time t_0."
Zhang et al. - 2018 - Bayesian Inference of Species Networks from Multil.pdf
No transfer from now, because branches with no length cannot be recovered.
Given such generated network, we extract a random tree, using the inheritance probabilities at the reticulated nodes: for each reticulated node x, we draw a number y ~ U(0,1) and we choose the first parent p_x if y < inheritanceProb of the edge p_x,x. If we do this,we have atree but not a binary one. Then we clean the tree from the dead nodes and 1-indegree 1-outdegree nodes, as Sarah explained in her document, and we obtain a binary tree. On this binary tree, we simulate N sequences with Seq-Gen-1.3.4 (currently under this model -mHKY -t3.0 -f0.3,0.2,0.2,0.3 this can be changed).
We do this K times, obtaining an alignment of length K*N in which each group of N sites has its tree.
The text was updated successfully, but these errors were encountered: