Sampling from the DAG's topological fringe #29

willdumm · 2022-10-17T22:28:23Z

We would like to try sampling trees from the history DAG which are topologically strange, and use these preferentially in the sample-optimize-merge loop. Topological strangeness will be quantified as the sum of Robinson Foulds distance from a tree to any other tree in the DAG. We can compute this using the SubtreeWeight framework as follows:

We need first a map $N$ from any clade in the hDAG (that is, any set of taxa that can be realized as the union of a node's child clades) to the number of trees in the hDAG which contain a node with that clade. More on this in Counting the number of trees in which each node (or clade) takes part #30 .
I show here starting at line 115 that for any tree in the hDAG, the sum of the Robinson Foulds distances from that tree $t$ to every other tree in the DAG (a sum which I'll call $D_t$) is given by
$$D_t = K + \sum_{e\in t} (|T| - 2N(c_e))$$
where...
- $K$ is an easily computable constant which we won't worry about (it's the sum of number of clades in each tree in the DAG),
- the sum is over edges $e$ in the tree $t$,
- $c_e$ is the clade of the child node of the edge $e$ in the tree $t$,
- and $|T|$ is the number of trees in the hDAG.
Note that the sum, and therefore the value $D_t - K$ which we'll be maximizing in trees, will probably be negative. Therefore, all we need for computing the (K-shifted) sum RF distance on all trees in the DAG via SubtreeWeight is an edge function which returns this value $|T| - 2N(c_e)$ for any edge $e$.
Once we have that implemented via SubtreeWeight, trees with higher shifted-sum-RF distances $D_t - K$ are the ones which are more exotic topologically, in the context of the other trees in the DAG. We now want to weight these higher in the distribution we're sampling trees from. This can be done by using cached weights from SubtreeWeight<ShiftedSumRFDistances>.ComputeWeightBelow(), and sampling edges below each clade in proportion to the relative cached weights of each edge's child node, in the same fashion as has been done for uniform sampling with SubtreeWeight<TreeCount>. The way this works will be related to the solution to Sampling from the DAG with probabilities (and constraints) #28 and how we resolve the discussion on PR Uniform sampling #31

The text was updated successfully, but these errors were encountered:

willdumm changed the title ~~Sampling from the DAG's topological 'edge'~~ Sampling from the DAG's topological fringe Oct 17, 2022

willdumm mentioned this issue Oct 17, 2022

Counting the number of trees in which each node (or clade) takes part #30

Open

willdumm added the proposed feature label Oct 17, 2022

willdumm mentioned this issue Oct 18, 2022

Uniform sampling #31

Merged

willdumm mentioned this issue Nov 10, 2022

Summary of goals #36

Open

willdumm assigned ognian- Nov 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sampling from the DAG's topological fringe #29

Sampling from the DAG's topological fringe #29

willdumm commented Oct 17, 2022 •

edited

Loading

Sampling from the DAG's topological fringe #29

Sampling from the DAG's topological fringe #29

Comments

willdumm commented Oct 17, 2022 • edited Loading

willdumm commented Oct 17, 2022 •

edited

Loading