Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimized DAG contains nodes with identical LeafSets and CGs but different node IDs. #86

Open
willdumm opened this issue Aug 6, 2024 · 0 comments

Comments

@willdumm
Copy link
Contributor

willdumm commented Aug 6, 2024

The following exploration uses the optimized DAG /fh/fast/matsen_e/mbarker/larch/luka_optimized_10_iterations.pb. That DAG was produced by optimizing /fh/fast/matsen_e/wdumm/luka_larch/Lukas_dag.pb, which was prepared in Python from a gctree parsimony forest on a gcreplay alignment.

@marybarker should confirm, but I suspect the larch command was something like

larch-usher -i Lukas_dag.pb \
    -o luka_optimized_10_iterations.pb \
    -c 10 \
    --sample-method rf-maxsum

There's an issue with DAGs produced by Larch where nodes with the same child
clades and compact genome are given different node IDs, and it seems these
node IDs are being used to distinguish nodes (which they should not be).

Here's an exploration that illustrates the problem with the optimized hDAG referenced above:

>>> import historydag as hdag
>>> dag = hdag.mutation_annotated_dag.load_MAD_protobuf_file(
...     "luka_optimized_10_iterations.pb",
...     compact_genomes=True
... )
>>> cdag = dag.copy()

We'll just trim to MP trees so it's quicker to work with:

>>> cdag.trim_optimal_weight()
229
>>> cdag.summary()
<class 'historydag.mutation_annotated_dag.CGHistoryDag'>
Label fields:	('compact_genome', 'node_id')
Nodes:	2922
Edges:	31619
Histories:	1095840
Unique leaves in DAG:	120
Average number of parents of internal nodes:	4.898643825838687

In histories in the DAG:
Leaf node count range: 120 to 120
Total node count range: 162 to 180
Average pairwise RF distance:	37.334056849448594
Parsimony score range 229 to 229

This matches the information about the (trimmed) dag that Larch reported.

Now, to verify my claim that there are duplicate nodes assigned different
node_ids:

>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> for n in cdag.preorder(skip_ua_node=True):
...     d[(n.label.compact_genome, n.child_clades())].append(n)
... 
>>> from collections import Counter
>>> Counter(len(ls) for _, ls in d.items())
Counter({1: 2907, 2: 6, 3: 1})
>>> false_duplicates = [ls for _, ls in d.items() if len(ls) > 1]
>>> len(false_duplicates)
7
>>> [[len(n.clade_union()) for n in nl] for nl in false_duplicates]
[[6, 6], [5, 5], [2, 2], [2, 2, 2], [2, 2], [2, 2], [2, 2]]
>>> [[n.label.node_id for n in nl] for nl in false_duplicates]
[['113', '42071'], ['129', '128526'], ['3852', '14186'],
['5051', '14460', '826'], ['999', '3771'], ['1019', '114159'],
['1109', '43398']]

So, the duplicate nodes in this case are relatively close to leaves, and there
aren't many of them.

Here's what happens when we get rid of node IDs. We can do this here because
there are no ambiguities in leaf sequences, and leaf sequences are unique.

If node IDs were being assigned correctly, we should end up with a new hDAG
with the same number of nodes, edges, and histories, but here we expect to end
up with 8 fewer nodes, since that's the number of duplicates we saw above:

>>> ccdag = cdag.remove_label_fields(["node_id"])
>>> ccdag.summary()
<class 'historydag.mutation_annotated_dag.CGHistoryDag'>
Label fields:	('compact_genome',)
Nodes:	2914
Edges:	31595
Histories:	967020
Unique leaves in DAG:	120
Average number of parents of internal nodes:	4.909090909090909

In histories in the DAG:
Leaf node count range: 120 to 120
Total node count range: 162 to 180
Average pairwise RF distance:	37.82642904635102
Parsimony score range 229 to 229

unique node IDs didn't allow collapsing before (I verified this, but left it
out here for clarity), but in their absence over 200
edges can be eliminated by collapsing, reducing the number of unique MP trees here to
12k from over 1M reported by Larch.

>>> ccdag.convert_to_collapsed()
>>> ccdag.summary()
<class 'historydag.mutation_annotated_dag.CGHistoryDag'>
Label fields:	('compact_genome',)
Nodes:	2857
Edges:	31353
Histories:	12385
Unique leaves in DAG:	120
Average number of parents of internal nodes:	4.96200219218122

In histories in the DAG:
Leaf node count range: 120 to 120
Total node count range: 162 to 174
Average pairwise RF distance:	39.23777424136683
Parsimony score range 229 to 229

In summary, I think there are two issues, the first quite a bit more serious
than the second:

  1. First, when optimized trees, or perhaps moves, are being merged into the
    hDAG by Larch, either node comparisons aren't working correctly (using
    only LeafSets and compact genomes), so nodes are comparing unequal when in
    fact they are equal, or node ids are being assigned before merging, and
    they're erroneously being used in the node comparison.
  2. Second, I don't remember if Larch attempts to do some kind of collapsing of
    edges without mutations before or during merge, but if so that doesn't
    seem to be working, since collapsing is possible after removing node IDs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant