A short observation about per-displayed-tree-clvs #47
Lol. Turns out blob optimization is not the way to go with this new likelihood definition we are using. Instead, each node in the network needs to store 2^(number_of_reticulations_in_subnetwork_rooted_at_the_node) CLV vectors. I am currently working out the details (e.g., subnetwork-specific tree_ids that encode which reticulations were taken, combined via bitshifting), but the potential speedup looks promising :sunglasses:
If we had memory usage issues and needed a tunable in-between solution, we could keep multiple CLV vectors only at the megablob roots, recomputing all the other CLV vectors within a megablob on the fly while iterating over its trees. In that case, it would make sense to use Gray-code iteration order to minimize the number of CLV recomputations within a megablob. But with the low numbers of reticulations we are currently talking about, a still rather low number of taxa, and MSAs that are not insanely large, memory issues are not our priority right now.
Maybe we can even reduce the number of stored CLVs a bit more, by ignoring unreachable/dead areas (that is, areas where some inner node suddenly has no active children, due to how the active reticulations were chosen).
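A minimal sketch of the bookkeeping described above, assuming hypothetical helper names (none of these are actual NetRAX functions): the per-node CLV count of 2^k, a tree_id built as a bitmask of reticulation choices combined via bitshifting, and binary-reflected Gray code for the iteration order.

```cpp
#include <cassert>
#include <cstdint>

// Each node stores 2^k CLV vectors, where k is the number of
// reticulations in the subnetwork rooted at that node.
uint64_t clvs_per_node(unsigned reticulations_in_subnetwork) {
    return uint64_t{1} << reticulations_in_subnetwork;
}

// Subnetwork-specific tree_id: bit i encodes which parent was taken at
// the i-th reticulation below the node. Children's tree_ids are combined
// by shifting, so the bits of disjoint subnetworks never collide.
uint64_t combine_tree_ids(uint64_t left_id, uint64_t right_id,
                          unsigned right_reticulation_count) {
    return (left_id << right_reticulation_count) | right_id;
}

// Binary-reflected Gray code: consecutive indices differ in exactly one
// bit, i.e. consecutive displayed trees differ in a single reticulation
// choice, which minimizes CLV recomputation within a megablob.
uint64_t gray_code(uint64_t i) { return i ^ (i >> 1); }
```

The Gray-code order only matters for the recompute-on-the-fly fallback; with full per-node CLV storage, all 2^k vectors are kept anyway.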
But even if memory usage ends up being an issue in the future, the blobs per se do not make much sense anymore.
RIP blobs and Gray-code stuff... well, they still make sense for the "wrong" likelihood definition. Maybe we can use the wrong one as a heuristic during horizontal search, to speed things up a little...
In order to avoid pointer craziness, I will modify pll_update_partials in my libpll2 fork: instead of giving it a long list of operations, I will always pass just a single pll_operation_t, and I will specify which CLV vectors and scale buffers to use (for parent, left, and right) via the function call, thus entirely avoiding the use of partition->clv and partition->scale_buffer. I will also modify pll_compute_root_loglikelihood accordingly.
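A toy sketch of that single-operation calling convention. The types and the kernel here are placeholders, not the real libpll2 fork's API; the point is only the call shape: the caller hands over the concrete parent/left/right buffers directly, so nothing is resolved through partition->clv or partition->scale_buffer indices.

```cpp
#include <cassert>
#include <cstddef>

// Placeholder partition type; the real pll_partition_t carries models,
// transition matrices, scale buffers, etc.
struct toy_partition { std::size_t sites; };

// Placeholder kernel: real code would apply the per-branch transition
// matrices and handle scalers; here we just combine the two child CLVs
// elementwise to illustrate passing buffers explicitly.
void update_partial_single(const toy_partition& p,
                           double* parent_clv,
                           const double* left_clv,
                           const double* right_clv) {
    for (std::size_t i = 0; i < p.sites; ++i)
        parent_clv[i] = left_clv[i] * right_clv[i];
}
```

With this shape, each network node can own many CLV buffers (one per displayed subtree) and pick the right one per call, which is exactly what the per-displayed-tree storage needs.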
The speedup potential of this is HUGE, as it will give us back true incremental loglikelihood computation in networks!!! :-) (Also, I already have an idea for a very ad-hoc pseudolikelihood function that's quick to compute and easy to integrate -> the one I wanted to try from the beginning... it apparently makes no biological sense, but maybe it serves as a good speedup heuristic for identifying/pre-scoring promising move candidates.)
That all sounds very promising and makes sense.
…On 28.02.21 13:40, Sarah Lutteropp wrote:
--
Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
www.exelixis-lab.org
Done! It works nicely and is ready to be integrated into the master branch. 😎
Naively, one would store
number_of_network_nodes * number_of_displayed_trees
CLV vectors. However, there are regions in the network that are exactly the same across multiple displayed trees. Those regions can be identified with the blob optimization trick. One can actually save quite some memory and computation there by cleverly reusing CLVs / sharing them among multiple displayed trees. It will be tricky to make this work together with dynamic changes (network moves) to the network data structure.