Skip to content

Latest commit

 

History

History
114 lines (77 loc) · 6.29 KB

README.md

File metadata and controls

114 lines (77 loc) · 6.29 KB

Minimalistic Seed Heuristic for A*

minSH is a short working implementation that aligns sequences $A$ and $B$ end to end using a minimal number of edit operations (substitutions, insertions and deletions). As a side effect, it computes the exact edit distance $ed(A,B)$ with near-linear scaling, given limited divergence $d=ed(A,B)/max(|A|, |B|)$. astar.py implements A* with seed heuristic $h_{seed}(i,j) = \Big| \big\{ s \in Seeds_{\geq i} \mid s \notin B \big\} \Big|$ in a short and simple way:

def build_seed_heuristic(A, B, k):
    """Builds the admissible seed heuristic for A and B with k-mers."""
    
    kmers = { B[j:j+k] for j in range(len(B)-k+1) }                 # O(nk): Index all kmers from B. O(n) with rolling hash
    seeds = [ A[i:i+k] for i in range(0, len(A)-k+1, k) ]           # O(n): Split A into non-overlapping seeds of length k.
    is_seed_missing = [ s not in kmers for s in seeds ]             # O(n): Is seed unseen in B, so >=1 edit must be made to align it.
    suffix_sum = np.cumsum(is_seed_missing[::-1])[::-1]             # O(n): How many of the remaining seeds have to be edited
    
    return lambda ij, k=k: suffix_sum[ ceildiv(ij[0], k) ]          # O(1): How many seeds starting after the current position i have to be edited?

Next, we just use the seed heuristic for a starndard A* on the alignment graph A x B:

h_seed = build_seed_heuristic(A, B, k=log(len(A)))
astar(A, B, h_seed)

Background

Sequence alignment as a shortest path problem

We can look at aligning sequences A and B as a process of sequentially aligning longer and longer prefixes of $A$ ($A[\dots i]$) to prefixes of $B$ ($B[\dots j]$) by matching, substituting, inserting or deleting single letters (minimizing edit distance). Thus, finding an alignment with a minimal number of edit operations is equivalent to finding a shortest path starting from node $s=(0,0)$ and ending at node $t=(|A|, |B|)$ in a graph of prefixes (also called edit graph or alignment graph), where each edge corresponds to one operation (with cost $0$ for a match or $1$ otherwise). This graph representation enables us to apply general shortest path algorithms.

Dijkstra's shortest path algorithm

The simplest algorithm we can use is Dijkstra's algorithm which finds a shortest path of length $d$ by sequentially exploring nodes $u$ by increasing distance $dist(s,u)$ from the start node $s$ and until expanding the end node $t$. The problem is that the search circles around $s$ regardless of where the target $t$ is, so in our 2D lattice graph the number of explored nodes with $dist(s,u) < d$ grows quadratically in $d$. For most data (e.g. genetic) the edit distance $d$ grows proportionally to $|s|$ and $|t|$, so the whole algorithm becomes quadratic which is practically infeasible for long sequences.

A* algorithm

The A* algirthm is a generalization of Dijkstra's algorithm that explores the nodes $u$ not just by their distance from the start $dist(s,u)$ but also adding an estimation of the remaining distance to the target $dist(s,u) + h(u,t)$. This heuristic function $h(u,t)$ allows for a potentially very direct search towards the target but it has to be designed depending on specific knowledge of the graph/task to be:

  1. admissible (i.e. to never exceed the remaining distance $d(u,t)$), or otherwise the found path may not be shortest.
  2. accurate in estimating $dist(s,u)$, or otherwise the search will not be directly going to $t$
  3. fast to be computed for each explored node, or otherwise, the A* algorithm will be slow in practice

Usage

To use in your projects, simply add this repository as a dependency:

pip install git+https://github.com/pesho-ivanov/minSH.git

Then call the align function with two strings A and B:

import math
from minsh.astar import h_dijkstra, align, build_seedh, print_stats

A = "ACGT" * 100
B = "AGCT" * 100

alphabet_size = 4
k = math.ceil(math.log(len(A), alphabet_size))
h_seed = build_seedh(A, B, k)

g_seed = align(A, B, h_seed)
print_stats(A, B, k, g_seed)

The library can also be used as a standalone command-line script:

$ astar data/small_A.fa data/small_B.fa
> Aligning sequences with len(A)=18, k=3:
> Edit distance: 3
> Error rate: 16.67%
> Explored band: 3.39

To explore additional tooling - clone the repository and install the dependencies:

git clone https://github.com/pesho-ivanov/minSH && cd minSH
pip install -r requirements.txt
python scripts/test.py      # tests
python scripts/generate.py  # synthetic FASTA files

TODO

Optimizations:

  • rolling hash: for linear time precomputation
  • greedy matching (aka sliding)
  • pruning, using index trees

Presentation:

  • visualization of the alignment (png, gif)
  • interactivity for adding patches
  • report stats
  • benchmark

Related work

minSH is inspired by minGPT to be small, clean, interpretable and educational re-implementation of the recent aligning approach based on the A* shortest path algorithm.

Detailed Publications on A* for alignment

AStarix semi-global seq-to-graph aligner:

A*PA global seq-to-seq aligner:

  • Groot Koerkamp and Ivanov (preprint 2023) — Applies SH to global alignment (edit distance). Generalizes SH with chaining, inexact matches, gap costs (for higher error rates). Optimizes SH with pruning (for near-linear scaling with length), and A* with diagonal transition (for faster quadratic mode). Licensed under the Mozilla Public License, Version 2.0. In short, you are free to use and abuse, but give it back to the community.