esl_distance.tex


The \eslmod{distance} module implements routines for inferring
mutational distances between pairs of aligned sequences, and for
constructing distance matrices from multiple sequence alignments.

The API for the \eslmod{distance} module is summarized in
Table~\ref{tbl:distance_api}.

\begin{table}[hbp]
\begin{center}
{\small
\begin{tabular}{|ll|}\hline
   \apisubhead{Pairwise distances for aligned text sequences}\\
\hyperlink{func:esl_dst_CPairId()}{\ccode{esl\_dst\_CPairId()}} & Pairwise identity of two aligned text strings.\\
\hyperlink{func:esl_dst_CJukesCantor()}{\ccode{esl\_dst\_CJukesCantor()}} & Jukes-Cantor distance for two aligned strings.\\
   \apisubhead{Pairwise distances for aligned digital seqs}\\
\hyperlink{func:esl_dst_XPairId()}{\ccode{esl\_dst\_XPairId()}} & Pairwise identity of two aligned digital seqs.\\
\hyperlink{func:esl_dst_XJukesCantor()}{\ccode{esl\_dst\_XJukesCantor()}} & Jukes-Cantor distance for two aligned digitized seqs.\\
   \apisubhead{Distance matrices for aligned text sequences}\\
\hyperlink{func:esl_dst_CPairIdMx()}{\ccode{esl\_dst\_CPairIdMx()}} & NxN identity matrix for N aligned text sequences. \\
\hyperlink{func:esl_dst_CDiffMx()}{\ccode{esl\_dst\_CDiffMx()}} & NxN difference matrix for N aligned text sequences.\\
\hyperlink{func:esl_dst_CJukesCantorMx()}{\ccode{esl\_dst\_CJukesCantorMx()}} & NxN Jukes/Cantor distance matrix for N aligned text seqs.\\
   \apisubhead{Distance matrices for aligned digital sequences}\\
\hyperlink{func:esl_dst_XPairIdMx()}{\ccode{esl\_dst\_XPairIdMx()}} & NxN identity matrix for N aligned digital seqs.\\
\hyperlink{func:esl_dst_XDiffMx()}{\ccode{esl\_dst\_XDiffMx()}} & NxN difference matrix for N aligned digital seqs.\\
\hyperlink{func:esl_dst_XJukesCantorMx()}{\ccode{esl\_dst\_XJukesCantorMx()}} & NxN Jukes/Cantor distance matrix for N aligned digital seqs.\\
\hline
\end{tabular}
}
\end{center}
\caption{The \eslmod{distance} API.}
\label{tbl:distance_api}
\end{table}


\subsection{Example of using the distance API}

The example code below opens a multiple sequence alignment file and
reads an alignment from it, then uses one of the routines from the
\eslmod{distance} module to calculate a fractional identity matrix
from it. The example then finds the average, minimum, and maximum of
the values in the identity matrix.

\input{cexcerpts/distance_example}

\subsection{Definition of pairwise identity and pairwise difference}

Given a pairwise sequence alignment of length $L$, between two
sequences of $n_1$ and $n_2$ residues ($n_1 \leq L$, $n_2 \leq L$),
where the $L$ aligned symbol pairs are classified and counted as
$c_{\mbox{ident}}$ identities, $c_{\mbox{mismat}}$ mismatches, and
$c_{\mbox{indel}}$ pairs that have a gap symbol in either or both
sequences ($c_{\mbox{ident}} + c_{\mbox{mismat}} + c_{\mbox{indel}} =
L$), \esldef{pairwise sequence identity} is defined as:

\[
   \mbox{pid} = \frac{c_{\mbox{ident}}}{\mbox{MIN}(n_1, n_2)},
\]

and \esldef{pairwise sequence difference} is defined as
\[
   \mbox{diff} = 1 - \mbox{pid} = \frac{\mbox{MIN}(n_1,n_2) - c_{\mbox{ident}}}{\mbox{MIN}(n_1, n_2)}.
\]

Both pid and diff range from 0 to 1. 

In the unusual case where $\mbox{MIN}(n_1,n_2)=0$ -- that is, one of
the aligned sequences consists entirely of gaps -- the percent
identity $0/0$ is defined as 0. The calculation is robust against
length 0 sequences, which do arise in real applications. (Not just in
bad input, either. For example, this arises when dealing with subsets
of the columns of a multiple alignment.)

There are many ways that pairwise identity might be calculated,
because there are a variety of choices for the denominator. In Easel,
identity calculations are used primarily in \emph{ad hoc} sequence
weight calculations for multiple sequence alignments, as part of
profile HMM or profile SCFG construction. Multiple alignments will
often contain short sequence fragments. We want to deal robustly with
cases where two short fragments may have little overlap, or none at
all. The most obvious calculation of pairwise identity,
$c_{\mbox{ident}} / c_{\mbox{ident}} + c_{\mbox{mismat}}$, is not
robust, because alignments with few aligned residues (either because
they are highly gappy, or they are partially overlapping fragments)
may receive artifactually high identities. Other definitions,
$c_{\mbox{ident}} / L$ or $c_{\mbox{ident}} / \mbox{MEAN}(n_1, n_2)$
or $c_{\mbox{ident}} / \mbox{MAX}(n_1, n_2)$ are also not robust,
sharing the disadvantage that good alignments of fragments to longer
sequences would be scored as artifactually low identities.


\subsection{Generalized Jukes-Cantor distances}

The Jukes-Cantor model of DNA sequence evolution assumes that all
substitutions occur at the same rate $\alpha$
\citep{JukesCantor69}. It is a reversible, multiplicative evolutionary
model. It implies equiprobable stationary probabilities. The
\esldef{Jukes/Cantor distance} is the maximum likelihood estimate of
the number of substitutions per site that have occurred between the
two sequences, correcting for multiple substitutions that may have
occurred the same site. Given an ungapped pairwise alignment of length
$L$ consisting of $c_{\mbox{ident}}$ identities and
$c_{\mbox{mismat}}$ mismatches (observed substitutions)
($c_{\mbox{ident}} + c_{\mbox{mismat}} = L$, the fractional observed
difference $D$ is defined as

\[
  D = \frac{c_{\mbox{mismat}}}{c_{\mbox{ident}} + c_{\mbox{mismat}}},
\]

and the Jukes-Cantor distance $d$ is defined in terms of $D$ as:

\[
  d = -\frac{3}{4} \log \left( 1 - \frac{4}{3} D \right)
\]

The Jukes/Cantor model does not allow insertions or deletions.  When
calculating ``Jukes/Cantor distances'' for gapped alignments, gap
symbols are simply ignored, and the same calculations above are
applied.

The Jukes-Cantor model readily generalizes from the four-letter DNA
alphabet to any alphabet size $K$, using the same definition of
observed fractional difference $D$. A \esldef{generalized Jukes-Cantor
distance} is:

\[
  d = -\frac{K-1}{K} \log \left( 1 - \frac{K}{K-1} D \right).
\]

The large-sample variance of this estimate $d$ is:

\[
   \sigma^2 = e^\frac{2Kd}{K-1} \frac{D(1-D)}{L'}
\]

where $L'$ is the total number of columns counted, exclusive of gaps,
$L' = c_{\mbox{ident}} + c_{\mbox{mismat}}$.

If the observed $D \geq \frac{K-1}{K}$, the maximum likelihood
Jukes-Cantor distance is infinity, as is the variance. In this case,
both $d$ and $V$ are returned as \ccode{HUGE\_VAL}.