
A short observation about per-displayed-tree-clvs #47

Open · lutteropp opened this issue Feb 24, 2021 · 10 comments

@lutteropp (Owner) commented Feb 24, 2021

Naively, one would store number_of_network_nodes * number_of_displayed_trees CLV vectors. However, there are regions in the network that are exactly the same across multiple displayed trees. Those regions can be identified with the blob optimization trick.

---> One can actually save quite some memory and computation there by cleverly reusing/sharing some CLVs among multiple displayed trees. It will be tricky to make this work together with dynamic changes (network moves) to the network data structure.
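
For a rough sense of scale, here is a minimal back-of-the-envelope sketch of the naive storage cost (all numbers and names are hypothetical, purely for illustration; this is not part of the codebase), assuming every displayed tree gets its own CLV at every network node:

```cpp
#include <cstddef>
#include <iostream>

int main() {
    // Hypothetical example numbers, purely for illustration.
    std::size_t num_network_nodes = 50;
    std::size_t num_reticulations = 4;
    std::size_t num_sites = 1000;
    std::size_t num_states = 4;      // DNA
    std::size_t num_rate_cats = 4;

    // Each reticulation doubles the number of displayed trees (at most).
    std::size_t num_displayed_trees = std::size_t{1} << num_reticulations;

    // Naive scheme: one CLV per (node, displayed tree) pair.
    std::size_t doubles_per_clv = num_sites * num_states * num_rate_cats;
    std::size_t total_doubles =
        num_network_nodes * num_displayed_trees * doubles_per_clv;

    std::cout << "displayed trees: " << num_displayed_trees << "\n"
              << "naive CLV memory: "
              << total_doubles * sizeof(double) / (1024.0 * 1024.0)
              << " MiB\n";
}
```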

@lutteropp (Owner, Author) commented

Lol. Turns out blob optimization is not the way to go with the new likelihood definition we are using. Instead, each node in the network needs to store 2^(number_of_reticulations_in_subnetwork_rooted_at_the_node) CLV vectors. I am currently working out the details (e.g., having subnetwork-specific tree_ids that encode which reticulations were taken, and combining them via bitshifting), but the potential speedup looks promising :sunglasses:
And now that I have a working naive per-displayed-tree-CLVs implementation (which stores 16 CLVs at each node in the example below) that I can refine, it's actually not that big of a coding effort once the details are figured out on paper! :nerd_face: (The annoying thing remains redirecting all these CLV-vector pointers from the pll_treeinfo_t object all the time.)
(I am sick right now :face_with_thermometer:... but I can still do research while sneezing all day :test_tube:)
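
Here is a minimal sketch of the bitshifting idea described above (struct and function names are my own and hypothetical, not the actual project code): each subnetwork keeps a compact tree_id whose bits record the choices made at its reticulations, and combining a left and a right child just shifts one id past the other's reticulation count.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>

// Hypothetical per-subnetwork tree identifier: bit i records the choice made
// at the i-th reticulation inside the subnetwork rooted at this node.
struct SubnetworkTreeId {
    std::uint64_t bits = 0;          // chosen reticulation branches
    unsigned num_reticulations = 0;  // reticulations in this subnetwork
};

// Combine the tree ids of a left and a right child into the parent's id.
// The left child's choices occupy the low bits; the right child's choices
// are shifted past them. With r reticulations below the parent, the parent
// thus needs 2^r CLV slots and uses the combined bits as the slot index.
inline SubnetworkTreeId combine(const SubnetworkTreeId& left,
                                const SubnetworkTreeId& right) {
    SubnetworkTreeId parent;
    parent.num_reticulations = left.num_reticulations + right.num_reticulations;
    assert(parent.num_reticulations <= 64);
    parent.bits = left.bits | (right.bits << left.num_reticulations);
    return parent;
}

int main() {
    SubnetworkTreeId left{0b01, 2};  // left subnetwork: 2 reticulations
    SubnetworkTreeId right{0b1, 1};  // right subnetwork: 1 reticulation
    SubnetworkTreeId parent = combine(left, right);
    std::cout << "parent CLV slot index: " << parent.bits << " of "
              << (1u << parent.num_reticulations) << "\n";
}
```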

Okay, if we had memory usage issues and needed some tunable in-between solution, we could keep multiple CLV vectors only at the megablob roots and recompute all the other CLV vectors within a megablob on the fly while iterating over its trees. In that case, it would make sense to use Gray-code iteration order to minimize the number of CLV recomputations within a megablob (see the sketch below). But with the low numbers of reticulations we are currently talking about, still rather low numbers of taxa, and not superduper insanely large MSAs, memory issues are currently not our priority.
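
A minimal sketch of the Gray-code iteration order mentioned above (illustrative only; the variable names are my assumptions): consecutive Gray codes differ in exactly one bit, so stepping through the displayed trees of a megablob in this order flips exactly one reticulation per step, and only the CLVs above that flipped reticulation would need recomputation.

```cpp
#include <cstdint>
#include <iostream>

int main() {
    unsigned num_reticulations = 3;  // hypothetical megablob size
    std::uint64_t num_trees = 1ull << num_reticulations;

    std::uint64_t prev_gray = 0;
    for (std::uint64_t i = 0; i < num_trees; ++i) {
        std::uint64_t gray = i ^ (i >> 1);  // binary-reflected Gray code
        if (i == 0) {
            std::cout << "tree 0: full CLV computation for configuration 0\n";
        } else {
            // Exactly one reticulation changes between consecutive configurations.
            std::uint64_t flipped = gray ^ prev_gray;
            unsigned reticulation_index = 0;
            while (((flipped >> reticulation_index) & 1ull) == 0) {
                ++reticulation_index;
            }
            std::cout << "tree " << i << ": flip reticulation "
                      << reticulation_index
                      << " -> recompute only the CLVs above it\n";
        }
        prev_gray = gray;
    }
}
```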

[attached figure: the example network referenced above]

Maybe we can even reduce the number of stored CLVs a bit more... by ignoring unreachable/dead areas (that is, areas that have some inner node with suddenly no active children, due to how the active reticulations were chosen).
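
A small sketch of this dead-node condition (a simplified, hypothetical node representation, not the actual data structure): an inner node is dead under a given reticulation configuration if none of its active children is alive.

```cpp
#include <vector>

// Hypothetical, simplified node representation: each child edge knows whether
// it is active under the current reticulation configuration.
struct Node;
struct ChildEdge {
    Node* child;
    bool active;  // false if this is the non-chosen branch of a reticulation
};
struct Node {
    std::vector<ChildEdge> children;  // empty for leaves
};

// A node is "dead" under the current configuration if it is an inner node
// with no living active children. No CLVs need to be stored for dead areas.
bool is_dead(const Node& node) {
    if (node.children.empty()) return false;  // leaves are always alive
    for (const ChildEdge& e : node.children) {
        if (e.active && !is_dead(*e.child)) return false;
    }
    return true;
}
```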

@lutteropp (Owner, Author) commented

But even if memory usage ends up being an issue in the future, the blobs per se do not make much sense anymore.
We could simply pick some arbitrary nodes for which we don't keep all their CLVs in memory, but rather recompute them on demand...

@lutteropp (Owner, Author) commented

RIP blobs and gray-code stuff... well, they still make sense for the "wrong" likelihood definition. Maybe we can use the wrong one as a heuristic during horizontal search, to speed things up a little...

@lutteropp (Owner, Author) commented

Even more CLV saving potential, but not worth the extra effort and dirty case distinctions needed:
[attached figure]

@lutteropp (Owner, Author) commented

Okay, actually this extra clv/computation saving is easier to implement than I thought. I'm confident that I can do it without much extra effort 🙂
[attached figure]

@lutteropp (Owner, Author) commented Feb 28, 2021

In order to avoid pointer craziness, I will modify pll_update_partials in my libpll2 fork: instead of giving it a long list of operations, I will always just pass a single pll_operation_t. And I will specify which CLV vectors and scale buffers to use (for parent, left, right) via the function call, thus entirely avoiding the use of partition->clv and partition->scale_buffer.

I will also modify pll_compute_root_loglikelihood accordingly.
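
A rough sketch of what such a modified entry point could look like (this is only an illustration of the idea; the function name, signature, and stand-in types are my assumptions, not the actual API of the libpll2 fork): a single operation plus explicit pointers to the parent/left/right CLVs and scale buffers, so partition->clv and partition->scale_buffer are never consulted.

```cpp
// Illustrative only: stand-in declarations for the relevant libpll types.
// The real definitions live in libpll's pll.h.
struct pll_partition_t;
struct pll_operation_t;

// Hypothetical single-operation variant of pll_update_partials: the caller
// hands over the exact buffers to read from and write to, so the function
// never looks anything up via partition->clv or partition->scale_buffer.
void pll_update_partials_single(pll_partition_t* partition,
                                const pll_operation_t* op,
                                double* parent_clv,
                                unsigned int* parent_scaler,
                                const double* left_clv,
                                const unsigned int* left_scaler,
                                const double* right_clv,
                                const unsigned int* right_scaler);

// pll_compute_root_loglikelihood would get an analogous variant that takes
// the root CLV and scale buffer explicitly instead of indices into the
// partition's internal arrays.
```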

@lutteropp (Owner, Author) commented

The speedup potential of this is HUGE, as it will give us back true incremental loglikelihood computation in networks!!! :-)

(Also, I already have an idea for a very ad-hoc pseudolikelihood function that's quick to compute and easy to integrate: the one I wanted to try from the beginning... it apparently makes no biological sense, but maybe it serves as a good speedup heuristic for identifying/pre-scoring promising move candidates.)

@stamatak (Collaborator) commented Mar 1, 2021 via email

@lutteropp (Owner, Author) commented

Topologies like this one complicate things, as they contain not only reticulations but also dead nodes. I have annotated the reticulation configurations for the displayed trees that need to be stored at each node in the right-hand side of the picture below:
[attached figure: untitled note, 07 03 2021, page 1]

Luckily, I have already figured out a solution.

  • Solution for processNodeTwoChildren: for each left tree, we need to check under which reticulation configurations (if any) it stands alone (i.e., the right child is dead). For each right tree, we need to check under which configurations (if any) it stands alone. And a combined tree from left and right can be valid under multiple reticulation configurations as well...
  • Also, each displayed tree stores multiple reticulation configurations. A new ReticulationConfigSet data structure makes sense (see the sketch after this list).
  • This also changes (simplifies?) the handling of reticulation children, as having an inactive reticulation child or a dead-node child is kinda the same thing.
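
A minimal sketch of what such a ReticulationConfigSet could look like (names and representation are my assumptions and may differ from the actual implementation): each configuration records, per reticulation, whether the first parent, the second parent, or either one (don't-care) was taken, and a set of such configurations is attached to each displayed tree stored at a node.

```cpp
#include <cstddef>
#include <vector>

// Choice taken at one reticulation for a given displayed tree.
enum class ReticulationState : unsigned char {
    DONT_CARE,    // this reticulation does not affect the subtree
    TAKE_FIRST,   // first parent branch is active
    TAKE_SECOND   // second parent branch is active
};

// One reticulation configuration: one state per reticulation in the network.
using ReticulationConfig = std::vector<ReticulationState>;

// The set of reticulation configurations under which a stored displayed tree
// (and its CLV) at a node is valid.
struct ReticulationConfigSet {
    std::vector<ReticulationConfig> configs;

    // Two configurations are compatible if no reticulation is forced to
    // different parents by them.
    static bool compatible(const ReticulationConfig& a,
                           const ReticulationConfig& b) {
        for (std::size_t i = 0; i < a.size(); ++i) {
            if (a[i] != ReticulationState::DONT_CARE &&
                b[i] != ReticulationState::DONT_CARE && a[i] != b[i]) {
                return false;
            }
        }
        return true;
    }
};
```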

@lutteropp (Owner, Author) commented

Done! It works nicely and is ready to be integrated into the master branch. 😎
