Theoretical Background

Susann Vorberg edited this page Jun 4, 2018 · 1 revision

Theory

Markov Random Field Model

We can use a graphical model called a Markov Random Field (MRF) to encode interdependencies between positions in the multiple sequence alignment of a protein family.

MRF model

Single-position amino acid preferences are encoded in single emission potentials $v_i(a)$ (stored as an $L \times 20$ matrix of log-potentials). A potential of $v_i(a) = 0$ for all $a$ at a given position $i$ denotes uniform frequencies for all amino acids at that position. Increasing one $v_i(a)$ by $1$ makes that amino acid $e=2.72\ldots$ times more likely to occur relative to the other amino acids at that position.
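As a small illustration of the statement above, the following sketch normalizes the log-potentials of a single position into probabilities. It assumes all couplings are zero, so only $v_i$ matters; the function name `single_site_probs` is made up for this example and is not part of CCMgen.

```python
import math

def single_site_probs(v_i):
    """Turn the log-potentials of one position into probabilities.

    Assumes no pairwise couplings, so P(a) is proportional to exp(v_i(a)).
    """
    m = max(v_i)  # subtract the maximum for numerical stability
    weights = [math.exp(x - m) for x in v_i]
    total = sum(weights)
    return [x / total for x in weights]

# All potentials zero: every amino acid has probability 1/20.
uniform = single_site_probs([0.0] * 20)

# Raise one potential by 1: that amino acid becomes e times more
# likely than each of the other 19 amino acids.
boosted = single_site_probs([1.0] + [0.0] * 19)
ratio = boosted[0] / boosted[1]
```

Note that the factor $e$ applies to the ratio against each other amino acid; the absolute probability changes from $1/20$ to $e/(19+e)$ because of the normalization.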

We also encode the preference of two amino acids $a$ and $b$ to occur together at two different positions $i$ and $j$ using a pairwise emission potential $w_{i,j}(a,b)$ (stored as an $L \times L \times 20 \times 20$ matrix of log-potentials). A coupling potential of $0$ means the two amino acids co-occur as often as expected from their independent frequencies, while a positive potential means they occur together more often than expected under independence.

From this model, we can define the probability of observing the amino acid $a$ at sequence position $i$, given the potentials $\vec v$ and $\vec w$ and all positions except for $i$:

\[ P(x_i = a \mid \vec v, \vec w, (x_1 \ldots x_L \setminus x_i)) \propto \exp\left( v_i(a) + \sum_{j=1 \atop j \ne i}^L w_{i,j}(a, x_j) \right) \]

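With this conditional distribution, we can use Gibbs sampling to draw new sequences: starting from an initial sequence, we repeatedly pick a position and resample its amino acid given all other positions.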
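A single Gibbs update under the conditional distribution above could look roughly like the sketch below. It is not CCMgen's implementation: the nested-list layout `v[i][a]` and `w[i][j][a][b]`, the integer encoding of amino acids, and the function names are assumptions made for illustration.

```python
import math
import random

def conditional_probs(seq, i, v, w):
    """Return P(x_i = a | all other positions) for every state a.

    seq is a list of integer-encoded residues, v[i][a] are single
    potentials and w[i][j][a][b] are pairwise potentials (assumed layout).
    """
    logits = []
    for a in range(len(v[i])):
        s = v[i][a]
        for j, b in enumerate(seq):
            if j != i:
                s += w[i][j][a][b]
        logits.append(s)
    m = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    return [x / total for x in weights]

def gibbs_step(seq, v, w, rng):
    """Resample one randomly chosen position of seq in place."""
    i = rng.randrange(len(seq))
    probs = conditional_probs(seq, i, v, w)
    seq[i] = rng.choices(range(len(probs)), weights=probs)[0]
    return i

# Toy model with 3 positions and 2 states, all potentials zero.
v = [[0.0, 0.0] for _ in range(3)]
w = [[[[0.0, 0.0], [0.0, 0.0]] for _ in range(3)] for _ in range(3)]
seq = [0, 1, 0]
for _ in range(10):
    gibbs_step(seq, v, w, random.Random())
```

Here the position to update is chosen uniformly at random; sweeping over the positions in a fixed order is an equally common choice.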

Sampling from Phylogenies

Now that we have a strategy for evolving sequences from a parental sequence, we can use a phylogenetic tree to encode evolutionary relationships between sequences. Beginning with an initial sequence $s_0$, we evolve along a Markov chain for each descendant sequence $s_i$ by choosing a number of mutations proportional to the evolutionary distance $d_i$ and sampling each descendant independently using the Gibbs sampler.
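The procedure described above can be sketched as a recursive walk over the tree. Everything specific here is an assumption for illustration: the `(name, branch_length, children)` tuple layout, the `rate` factor that converts a distance into a number of mutation steps, and `toy_mutate`, which merely stands in for a real Gibbs update.

```python
import random

def evolve_tree(node, parent_seq, mutate, rate, rng, out):
    """Evolve a copy of parent_seq along the branch leading to node.

    node is (name, branch_length, children). The number of mutation
    steps is proportional to the branch length (rate is hypothetical).
    Leaf sequences are collected in the dict out.
    """
    name, dist, children = node
    seq = list(parent_seq)  # each child evolves its own copy
    for _ in range(round(rate * dist)):
        mutate(seq, rng)
    if not children:
        out[name] = seq
    for child in children:
        evolve_tree(child, seq, mutate, rate, rng, out)
    return out

def toy_mutate(seq, rng):
    # Placeholder for a Gibbs step: overwrite one random position.
    seq[rng.randrange(len(seq))] = rng.randrange(20)

tree = ("root", 0.0, [
    ("A", 0.5, []),
    ("inner", 0.2, [("B", 0.3, []), ("C", 0.3, [])]),
])
leaves = evolve_tree(tree, [0] * 10, toy_mutate, rate=10,
                     rng=random.Random(0), out={})
```

Because siblings start from copies of the same parental sequence, leaves that share a long path from the root end up more similar to each other than leaves that diverge early.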

Evolving along a tree

CCMgen provides several choices for phylogenies to evolve along.

Sampling without Phylogeny

If no phylogenetic dependencies between sequences are desired, you can use a 'star-shaped' phylogenetic tree with an evolutionary distance long enough to ensure that sequences are drawn independently.
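To make the topology concrete, a star-shaped tree can be written as a Newick string in which every leaf hangs directly off the root. The helper below is a hypothetical sketch and does not reflect how CCMgen builds its star topology internally; the leaf names and the branch length are placeholders.

```python
def star_newick(n_leaves, branch_length):
    """Return a Newick string with n_leaves children attached to the root."""
    leaves = ",".join(f"seq{k}:{branch_length}" for k in range(n_leaves))
    return f"({leaves});"

star = star_newick(3, 5.0)
# star == "(seq0:5.0,seq1:5.0,seq2:5.0);"
```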

A "star" "tree"

Sampling from a Reconstructed Phylogenetic Tree

You can provide a phylogenetic tree in Newick (.dnd) format to use it for evolving sequences.

Sampling from a Perfect Binary Tree

Finally, you can evolve sequences along a perfect binary tree as well.

A binary tree
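For illustration, a perfect binary tree with a given depth can be generated as a Newick string by repeatedly pairing up neighbouring nodes. This is a sketch under assumptions: the helper name, the leaf names, and the uniform branch length are invented here, and CCMgen's own tree construction may differ.

```python
def perfect_binary_newick(depth, branch_length):
    """Return a Newick string for a perfect binary tree with 2**depth leaves."""
    nodes = [f"seq{k}" for k in range(2 ** depth)]
    while len(nodes) > 1:
        # Join neighbouring nodes pairwise into one parent per pair.
        nodes = [
            f"({nodes[k]}:{branch_length},{nodes[k + 1]}:{branch_length})"
            for k in range(0, len(nodes), 2)
        ]
    return nodes[0] + ";"

binary = perfect_binary_newick(2, 0.1)
# binary == "((seq0:0.1,seq1:0.1):0.1,(seq2:0.1,seq3:0.1):0.1);"
```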