Derivative of Network Loglikelihood for brlen optimization #49

lutteropp · 2021-03-14T13:16:26Z

(from #43)

Use Newton-Raphson instead of Brent for optimizing branches (this will give us some niiiiiice additional speedup). This requires some Maths now, as I need to understand how to compute the derivative of network loglikelihood. Given the derivatives of the displayed tree loglikelihoods, this should be easy. I then need to understand how pll_modules computes likelihood derivatives for trees (something something sumtable), and how this can be generalized to networks.

Luckily, given the derivatives of displayed tree loglikelihoods, the derivative of the network loglikelihood is easy to compute (little mistake in the image, but not relevant here - the displayed tree loglikelihoods of course depend on the partition, also on brlens, model parameters, etc):

(EDIT: THIS IS WRONG, SEE LATER COMMENT)

Thus the main challenge here is to understand these sumtables used for tree loglh derivatives in pll_modules, being especially careful about how this works with virtual rerooting. And then I need to do this for the displayed trees displayed by the network and voilà, then we can switch from Brent brlen-opt to Newton-Raphson brlen-opt! :-)

lutteropp · 2021-03-15T11:38:05Z

One tricky thing: There can be displayed trees for which the current branch we want to optimize is inactive.
What happens to their sumtables? Since here we care about derivatives, we can safely skip those trees, not compute sumtables for them, and the derivative and second derivative of these tree loglikelihoods is zero.

There is even some more CLV saving potential in this (to be explored if later enough runtime gets spent in CLV updates again): If we can safely skip a tree, then we don't need to update its CLVs after virtual rerooting.

lutteropp · 2021-03-15T13:36:07Z

I made a mistake in the partition loglikelihood formulas above. New image coming soon.

lutteropp · 2021-03-15T14:08:39Z

Do we really need the correct derivative of phylogenetic network likelihood? It is apparently pretty ugly: https://en.wikipedia.org/wiki/Product_rule#Product_of_more_than_two_factors

lutteropp · 2021-03-15T15:31:06Z

Now I did the Math correctly, and it does not look good for Newton-Raphson on phylogenetic networks :-( We can still do it as it is known to converge quicker, but it does not help in getting rid of pll_compute_edge_loglikelihood.

lutteropp · 2021-03-15T15:33:40Z

Or maybe we can still do something here? I need to dig into how the Math is done for derivative of tree likelihood. After all, there is a product involved there, too (the product over per-site likelihoods ... plus, also a product over partitions).

lutteropp · 2021-03-15T15:45:36Z

Whichever trick is used for trees there should also work for the networks. Because with the approach I used here, I get into the same problem if there is only a single displayed tree. There has to be some further trick involved!

lutteropp · 2021-03-15T15:51:24Z

Okay... now I understand the text in Alexeys thesis (https://cme.h-its.org/exelixis/pubs/dissAlexey.pdf, page 26).
What this sumtable gives us is fast computation of:

L(b)
L'(b)
L''(b)

My problem was that I thought it only gives L'(b) and L''(b)

lutteropp · 2021-03-15T15:58:23Z

My confusion originated from this: For trees, we end up getting and using these:

(loglL(T))' -> first derivative of loglikelihood of tree T
(logL(T))'' -> second derivative of loglikelihood of tree T

But what the trick with the sumtable actually gives us are fast computations of (with respect to branch b):

L(b) -> persite likelihood of tree T
L'(b) -> first derivative of persite likelihood of tree T
L''(b) -> second derivative of persite likelihood of tree T

Instead of directly reusing the final result ((loglL(T))' and (loglL(T))'') from the trees case, I need to plug these faster-computed tree persite likelihoods and their derivatives into the network loglikelihood and loglikelihood derivative formulas.

lutteropp · 2021-03-15T16:38:53Z

I got it all together now.
The main difference (compared to the tree case) is that in order to compute network loglikelihood derivatives, we also need to use the sumtables to compute the displayed tree loglikelihoods. Only the tree loglikelihood derivatives are not enough for us.

lutteropp · 2021-03-15T16:53:31Z

Wait... do I really need the fully computed displayed tree loglikelihoods here?

For LikelihoodModel.BEST, we can do without explicitly computing them. Because there we essentially have one tree per partition...
For LikelihoodModel.AVERAGE, I currently see no way around actually computing the displayed tree loglikelihoods... But I am still not convinced that they are really needed.

--> Looks like there is no way around actually inserting the displayed tree persite loglikelihoods (and their derivatives) into the formulas for the networks case. Thus, another drawing will follow soon here. :-)

lutteropp · 2021-03-15T17:38:22Z

Turns out that for networks, we can't go down to a per-site computation basis if we have more than one displayed tree.

I am convinced about this in the LikelihoodType.AVERAGE case.
For the LikelihoodType.BEST case it depends on whether max{f', g'} / max{f, g} = max{f'/f, g'/g} always holds or not.

lutteropp · 2021-03-15T17:44:50Z

Okay, found a counterexample (for x > 0):

f(x) = 2x + 3
g(x) = 2x + 5
f'(x) = 2
g'(x) = 2
max{f'(x), g'(x)} = 2
max {f(x), g(x)} = 2x+5
max{f'(x), g'(x)} / max{f(x), g(x)} = 2/(2x+5)
max {f'/f, g'/g} = max{2/(2x+3), 2/(2x+5)} = 2/(2x+3)

----> max{f', g'} / max{f, g} != max{f'/f, g'/g}

---> Thus, no matter which likelihood model (out of AVERAGE and BEST) we are using, as soon as we have more than one displayed tree, we cannot profit from going down to the persite-level.

lutteropp · 2021-03-15T17:49:02Z

This makes it clear that for networks, we need to compute all three of logL(T|P), (logL(T|P))', (logL(T|P))'' with help of the sumtable.

The only question remains whether it makes sense to go down to a per-site level in case the network has only one displayed tree. Special treatment for the single-tree case has led to numerical quirks when computing network loglikelihood before (there, while mathematically the same, the different operations (like, adding a log(exp()) around it or not) led to a slight numerical difference that confused BIC).

lutteropp · 2021-03-27T16:21:33Z

Computation of second network loglikelihood derivative caused numerical problems (underflow/overflow). Thus, I had to play a bit with the formula:

lutteropp · 2021-03-27T16:43:09Z

Damn the Newton Raphson in pll-modules is awful, also the second network loglikelihood derivative is such a high number that it does not fit into a double. Luckily, what Newton-Raphson actually needs is just the quotient of first and second derivative, as well as the sign of the second derivative...

I'll just switch to implementing vanilla Newton-Raphson myself, using https://www.cup.uni-muenchen.de/ch/compchem/geom/nr.html and Alexis lecture slides ...

lutteropp · 2021-03-27T17:02:28Z

Or no wait, maybe I can adapt the pll_modules implementation to directly take the quotient as argument...

lutteropp · 2021-03-27T17:34:28Z

Or no wait, maybe I can adapt the pll_modules implementation to directly take the quotient as argument...

I tried this (passing the quotient and trimming too large and too small values before passing them to the pll-modules NR implementation), but it made NR never find a better branch length.

Switching to writing my own NR implementation now.

lutteropp · 2021-03-27T17:45:18Z

Mhm. It also happens with my own NR implementation... maybe the derivatives are still wrong?

lutteropp · 2021-03-27T17:53:04Z

I checked how Newton-Raphson decides on the next branch length to propose: It essentially subtracts first_logl_derivative/second_logl_derivative from the previous branch length. With these values I get for network first and second logl derivative (computed on a network with 0 reticulations), NR cannot work:

Starting with network logl: -771.9593909
proposing brlen: 0.067574

Network partition likelihood derivatives for partition 0:
partition_lh: 1.971781113e-182
partition_lh_prime: 6.673864004e-06
partition_lh_prime_prime: 4.482882894e+181
partition_logl: -418.3915497
partition_logl_prime: 3.384688068e+176
partition_logl_prime_prime: 2.273519543e+363

Network partition likelihood derivatives for partition 1:
partition_lh: 2.80180277e-154
partition_lh_prime: 322.845173
partition_lh_prime_prime: 6.284610245e+108
partition_logl: -353.5678413
partition_logl_prime: 1.152276586e+156
partition_logl_prime_prime: -1.32774133e+312

Network loglikelihood derivatives:
network_logl_prime: 3.384688068e+176
network_logl_prime_prime: 2.273519543e+363
network_logl_prime / network_logl_prime_prime: 1.488743775e-187

Maybe something is wrong with my network loglikelihood derivative formula?

lutteropp · 2021-03-27T17:58:05Z

I also printed the loglikelihood and its derivatives for the single displayed tree of the network:

For partition 0:
tree_logl: -418.3915497
tree_logl_prime: -11.91731155
tree_logl_prime_prime: 418.2681682

For partition 1:
tree_logl: -353.5678413
tree_logl_prime: 5.777172868
tree_logl_prime_prime: 250.5172939

lutteropp · 2021-03-27T18:00:11Z

From looking at these numbers, I infer that something must have gone wrong when computing partition_logl_prime and partition_logl_prime_prime.

lutteropp · 2021-03-27T18:12:41Z

Formulas I used (same here for LikelihoodModel.AVERAGE and LikelihoodModel.BEST, as there is only 1 displayed tree with tree_prob 1.0):

partition_lh = exp(tree_logl) * tree_prob
partition_lh_prime = exp(tree_logl_prime) * tree_prob
partition_lh_prime_prime = exp(tree_logl_prime_prime) * tree_prob

partition_logl = log(partition_lh)
partition_logl_prime = (partition_lh_prime / partition_lh)
partition_logl_prime_prime = exp(log(partition_lh_prime_prime) - log(partition_lh)) - exp(2*log(partition_lh_prime) - 2*log(partition_lh))

Of course, in this case partition_logl_prime should be the same as tree_logl_prime. And partition_logl_prime_prime should be the same as tree_logl_prime_prime. The question is: Why isn't this the case?

lutteropp · 2021-03-27T18:15:10Z

Maybe I misunderstood these formulas from Alexeys PhD thesis, which I used for deriving the partition_logl_prime and partition_logl_prime_prime formulas?

lutteropp · 2021-03-27T18:44:12Z

I got it now: My mistake was that I derived the log. Instead, one should see it as a function g. Example:

Partition loglh, 2 displayed trees, LikelihoodModel.AVERAGE:
logl(P) = logl(T_1|P) * p(T_1) + logl(T_2|P) * p(T_2)

Now, think as if we already have tree loglikelihoods. Instead of logl, we write g, which gives us:

g(P) = g(T_1|P) * p(T_1) + g(T_2|P) * p(T_2)

And then, we easily see how to derive it:
g'(P) = g'(T_1|P) * p(T_1) + g'(T_2|P) * p(T_2).

---> We can compute the partition loglikelihood derivatives out of the displayed tree loglikelihood derivatives.

lutteropp · 2021-03-27T18:56:56Z

Mhm. But if I follow this, then the network loglikeihood derivative is simply the sum of the partition loglikelihood derivatives... I tried this as well, but this als didn't work out with Newton Raphson.

lutteropp · 2021-03-27T19:00:49Z

It looks much better now though. Now the error is "Exceeded maximum number of iterations" and the proposed branch lengths appear at least somewhat reasonable...

lutteropp · 2021-03-27T19:07:54Z

But if this is the correct network loglikelihood derivatives definition now (-> sum of partition loglikelihood derivatives, computed out of displayed tree loglikelihood derivatives), then we don't even need to compute tree_logl out of the sumtable anymore...

I'm still leaving it in as optional, as this allows for BRENT with sumtable combination.

lutteropp · 2021-03-27T19:18:57Z

OK it looks like I got Newton-Raphson Branch-Length optimization for networks to work. It is lots of dark voodoo though!!!

lutteropp · 2021-03-27T19:20:04Z

Posting the correct maths for network loglikelihood derivatives here soon. (I'm still on vacation, after all)

lutteropp · 2021-03-27T20:06:44Z

Mhmm... no. Now NR only works for zero-reticulation networks. The network loglikelihood derivatives are still wrong if there is more than one displayed tree.

lutteropp · 2021-03-27T20:53:33Z

Could still be a bug instead of a mistake in the formulas now.

It looks like I am currently not getting all the sumtables, as soon has we have reticulations in the network. -> EDIT: Nope, the number of sumtables was fine. There simply were some displayed trees where the branch is inactive...

lutteropp · 2021-03-27T23:00:33Z

Looks like I got Newton Raphson to also work with reticulations now!!! 🎉 It really is dark voodoo though... 👻

lutteropp · 2021-03-28T02:22:46Z

Mhm, but it nearly never converges if we have reticulations present... maybe the network loglikelihood derivative formulas are still wrong. :-(

lutteropp · 2021-03-28T02:29:10Z

I think I got it now. I still had some exp and log calls that were wrongly placed in the network partition logl derivatives. But I cannot test it now, as it is time to go to sleep first.

lutteropp · 2021-03-28T04:05:46Z

Now I definitely figured out my mistake. Here are the correct formulas for network partition loglikelihood derivatives, in the LikelihoodModel.AVERAGE case:

The problem was all the time that the sumtable stuff from pll-modules gives the first and second derivatives of tree-loglikelihood, but in order to get the first and second derivatives of tree-likelihood, it does not suffice to throw an exp on it. One needs to go down to a persite level!

lutteropp · 2021-03-28T04:27:43Z

--> Instead of using pll_compute_logl_derivatives, I need to write my own function pll_compute_lh_derivatives. For this, I need to use these formulas:

lutteropp · 2021-03-28T11:16:04Z

I figured out how to reuse the tree *loglikelihood * derivatives in my computation of tree likelihood
derivatives:

EDIT: The formula for the second derivative was wrong. I fixed the image:

lutteropp · 2021-03-28T11:54:47Z

The formulas for derivatives of partition loglikelihood if we have LikelihoodModel.BEST:

The overall network loglikelihood derivatives:

lutteropp · 2021-03-28T12:45:24Z

The Math does not look less complicated, if instead of network loglikelihood derivatives, we would look at network likelihood derivatives:

lutteropp · 2021-03-28T12:57:46Z

Okay. Here is my plan, sticking with network loglikelihood derivatives, as the code I have so far uses them (meaning, I just have to fix the partition loglikelihood derivatives and the tree likelihood derivatives in my code):
1.) Code this stuff.
2.) Test this stuff.
3.) Write it more formally in LaTeX.

... damn, I am really bad at being on vacation/ not working on weekends. :-(

lutteropp · 2021-03-28T22:33:12Z

Now it works 😎
Newton Raphson for LikelihoodModel.AVERAGE is only slightly faster than Brent_reroot. I believe this is due to the large amount of high-precision arithmetics one has to do in order to compute the partition loglikelihood derivatives and tree likelihood derivatives there.

But if we have LikelihoodModel.BEST, then Newton Raphson is MUCH faster than Brent_reroot! :-)

I am still doing some minor code optimizations (e.g., no need to precompute pmatrix diagonal more than once for each set of displayed trees), then I can post some speedups.

lutteropp · 2021-03-29T00:13:40Z

Okay, now also for LikelihoodModel.AVERAGE, Newton Raphson is nearly twice as fast as Brent_reroot. I changed the code such that pmatrix derivatives are only precomputed once per partition and then shared among the displayed trees.

stamatak · 2021-03-29T07:47:53Z

great stuff

…

On 29.03.21 03:13, Sarah Lutteropp wrote: Okay, now also for LikelihoodModel.AVERAGE, Newton Raphson is nearly twice as fast as Brent_reroot. I changed the code such that pmatrix derivatives are only precomputed once and then shared among the displayed trees. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#49 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAGXB6RI7SFTE3FJSYPQCRTTF7H5DANCNFSM4ZE656PA>.

-- Alexandros (Alexis) Stamatakis Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology www.exelixis-lab.org

lutteropp · 2021-03-31T13:43:07Z

Second derivative of tree likelihood still had a mistake, I fixed it now (this only affected LikelihoodModel.AVERAGE):

Derivative of Network Loglikelihood for brlen optimization #49

Derivative of Network Loglikelihood for brlen optimization #49

Comments

lutteropp commented Mar 14, 2021 • edited Loading

lutteropp commented Mar 15, 2021

lutteropp commented Mar 15, 2021

lutteropp commented Mar 15, 2021 • edited Loading

lutteropp commented Mar 15, 2021 • edited Loading

lutteropp commented Mar 15, 2021 • edited Loading

lutteropp commented Mar 15, 2021

lutteropp commented Mar 15, 2021 • edited Loading

lutteropp commented Mar 15, 2021 • edited Loading

lutteropp commented Mar 15, 2021 • edited Loading

lutteropp commented Mar 15, 2021 • edited Loading

lutteropp commented Mar 15, 2021

lutteropp commented Mar 15, 2021 • edited Loading

lutteropp commented Mar 15, 2021 • edited Loading

lutteropp commented Mar 27, 2021 • edited Loading

lutteropp commented Mar 27, 2021 • edited Loading

lutteropp commented Mar 27, 2021

lutteropp commented Mar 27, 2021 • edited Loading

lutteropp commented Mar 27, 2021

lutteropp commented Mar 27, 2021 • edited Loading

lutteropp commented Mar 27, 2021

lutteropp commented Mar 27, 2021

lutteropp commented Mar 27, 2021

lutteropp commented Mar 27, 2021 • edited Loading

lutteropp commented Mar 27, 2021 • edited Loading

lutteropp commented Mar 27, 2021 • edited Loading

lutteropp commented Mar 27, 2021

lutteropp commented Mar 27, 2021

lutteropp commented Mar 27, 2021

lutteropp commented Mar 27, 2021

lutteropp commented Mar 27, 2021

lutteropp commented Mar 27, 2021 • edited Loading

lutteropp commented Mar 27, 2021

lutteropp commented Mar 28, 2021

lutteropp commented Mar 28, 2021

lutteropp commented Mar 28, 2021 • edited Loading

lutteropp commented Mar 28, 2021

lutteropp commented Mar 28, 2021 • edited Loading

lutteropp commented Mar 28, 2021 • edited Loading

lutteropp commented Mar 28, 2021 • edited Loading

lutteropp commented Mar 28, 2021 • edited Loading

lutteropp commented Mar 28, 2021

lutteropp commented Mar 29, 2021 • edited Loading

stamatak commented Mar 29, 2021 via email

lutteropp commented Mar 31, 2021

lutteropp commented Mar 14, 2021 •

edited

Loading

lutteropp commented Mar 15, 2021 •

edited

Loading

lutteropp commented Mar 15, 2021 •

edited

Loading

lutteropp commented Mar 15, 2021 •

edited

Loading

lutteropp commented Mar 15, 2021 •

edited

Loading

lutteropp commented Mar 15, 2021 •

edited

Loading

lutteropp commented Mar 15, 2021 •

edited

Loading

lutteropp commented Mar 15, 2021 •

edited

Loading

lutteropp commented Mar 15, 2021 •

edited

Loading

lutteropp commented Mar 15, 2021 •

edited

Loading

lutteropp commented Mar 27, 2021 •

edited

Loading

lutteropp commented Mar 27, 2021 •

edited

Loading

lutteropp commented Mar 27, 2021 •

edited

Loading

lutteropp commented Mar 27, 2021 •

edited

Loading

lutteropp commented Mar 27, 2021 •

edited

Loading

lutteropp commented Mar 27, 2021 •

edited

Loading

lutteropp commented Mar 27, 2021 •

edited

Loading

lutteropp commented Mar 27, 2021 •

edited

Loading

lutteropp commented Mar 28, 2021 •

edited

Loading

lutteropp commented Mar 28, 2021 •

edited

Loading

lutteropp commented Mar 28, 2021 •

edited

Loading

lutteropp commented Mar 28, 2021 •

edited

Loading

lutteropp commented Mar 28, 2021 •

edited

Loading

lutteropp commented Mar 29, 2021 •

edited

Loading