This is an implementation of three of the most important algorithms that every student interested in bioinformatics should know (in my humble opinion!).
First we initialize and fill in the scoring matrix, since we are following the dynamic programming approach. The idea is that the score of the best possible alignment ending at (i, j) in the matrix is equal to the score of the best alignment ending just before those positions, at (i-1, j-1), plus the score for aligning Xi and Yj (residue i in protein sequence X and residue j in protein sequence Y).
For scoring we use PAM250 for matches and mismatches, and a fixed score of -9 for gaps.
After finding the score of the optimum alignment, which is the highest score in the bottom row or right-hand column of the matrix, we trace back through the matrix to recover that alignment.
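The fill and traceback steps can be sketched roughly as follows. This is a minimal illustration, not the code from this repository: `toy_score` is a hypothetical stand-in for the PAM250 lookup, the function name `overlap_align` is invented here, and the sketch assumes that leading gaps are free (row 0 and column 0 are initialized to 0) and that the usual gap moves from (i-1, j) and (i, j-1) are considered alongside the diagonal move.

```python
GAP = -9  # fixed gap score, as stated in the text


def toy_score(a, b):
    """Hypothetical stand-in for a PAM250 lookup (NOT the real matrix)."""
    return 5 if a == b else -4


def overlap_align(x, y, score=toy_score, gap=GAP):
    n, m = len(x), len(y)
    # F[i][j]: best score of an alignment ending at x[i-1], y[j-1].
    # Assumption: row 0 and column 0 stay 0, so leading gaps are free.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # align Xi, Yj
                F[i - 1][j] + gap,                            # gap in y
                F[i][j - 1] + gap,                            # gap in x
            )
    # Optimum: highest score in the bottom row or right-hand column.
    cells = [(F[n][j], n, j) for j in range(m + 1)]
    cells += [(F[i][m], i, m) for i in range(n + 1)]
    best, i, j = max(cells)
    # Unaligned tail after the best cell, stored in reverse order.
    top = list(reversed(x[i:] + "-" * (m - j)))
    bot = list(reversed("-" * (n - i) + y[j:]))
    # Trace back until the first row or column is reached.
    while i > 0 and j > 0:
        if F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            top.append(x[i - 1])
            bot.append(y[j - 1])
            i -= 1
            j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            top.append(x[i - 1])
            bot.append("-")
            i -= 1
        else:
            top.append("-")
            bot.append(y[j - 1])
            j -= 1
    # Whatever is left over becomes free leading gaps.
    while i > 0:
        top.append(x[i - 1])
        bot.append("-")
        i -= 1
    while j > 0:
        top.append("-")
        bot.append(y[j - 1])
        j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bot))
```

With the toy scores, `overlap_align("XXABC", "ABCYY")` returns `(15, "XXABC--", "--ABCYY")`. Scores obtained with the real PAM250 matrix will differ from the toy values.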
Input two protein sequences, each with a maximum length of 100 residues, in capital letters.
HEAGAWGHE
PAWHEA
Output will be the score of the alignment along with the pairwise aligned sequences.
20
HEAGAWGHE-
---PAW-HEA
First we perform Star Alignment by finding the pairwise alignment score for each pair of protein sequences and forming the distance matrix. Then we pick as the center the sequence with the minimum total distance to all other sequences. The remaining sequences are then aligned to the center in decreasing order of similarity (increasing distance from the center sequence).
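The center selection and ordering step could look something like the sketch below. It takes a precomputed symmetric distance matrix as input, because the text does not say how distances are derived from the pairwise scores; the function name `pick_center` is hypothetical.

```python
def pick_center(dist):
    """Return (center index, other indices ordered by distance to center).

    dist is assumed to be a square, symmetric matrix (list of lists)
    already derived from the pairwise alignment scores.
    """
    totals = [sum(row) for row in dist]
    # Center: the sequence with the minimum total distance to the others.
    center = min(range(len(dist)), key=lambda k: totals[k])
    # Remaining sequences, closest to the center first.
    others = [k for k in range(len(dist)) if k != center]
    order = sorted(others, key=lambda k: dist[center][k])
    return center, order
```

For example, with `dist = [[0, 2, 5], [2, 0, 1], [5, 1, 0]]` the totals are 7, 3 and 6, so sequence 1 is the center and the others are added in the order `[2, 0]`.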
Now, to get a better alignment for more divergent sequences, we improve on the multiple sequence alignment obtained so far by performing Block Based alignment. This is how we do it: we find blocks of at least two columns within the alignment that contain gaps or mismatches, then apply star_alignment to these blocks alone. If realigning a block improves the overall score of the total alignment, we switch to the realigned block.
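As a rough illustration of the block search only (not the realignment), the sketch below scans an existing alignment for runs of columns that contain a gap or a mismatch. It assumes that a block means at least two consecutive such columns and reports each block as a half-open `(start, end)` column range; `find_blocks` is a hypothetical name.

```python
def find_blocks(alignment, min_cols=2):
    """Return half-open (start, end) column ranges of candidate blocks."""
    ncols = len(alignment[0])
    blocks = []
    start = None
    # Iterate one past the last column so a trailing run gets closed.
    for c in range(ncols + 1):
        noisy = False
        if c < ncols:
            column = {row[c] for row in alignment}
            # A column qualifies if it has a gap or more than one residue.
            noisy = "-" in column or len(column) > 1
        if noisy and start is None:
            start = c
        elif not noisy and start is not None:
            if c - start >= min_cols:
                blocks.append((start, c))
            start = None
    return blocks
```

Under these assumptions, a lone gapped or mismatched column between conserved columns is skipped, and only longer runs are handed to the realignment step.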
Input the number of protein sequences to be aligned, followed by each sequence on a separate line in capital letters.
5
TAGCTACCAGGA
CAGCTACCAGG
TAGCTACCAGT
CAGCTATCGCGGC
CAGCTACCAGGA
Output will be the overall score of the multiple sequence alignment, followed by the aligned sequences, one per line, in the same order as the input sequences.
240
TAGCTA-C-CAGGA
CAGCTA-C-CAGG-
TAGCTA-C-CA-GT
CAGCTATCGC-GGC
CAGCTA-C-CAGGA
Here we construct a statistical model by first building a profile from the given multiple sequences, calculating the probability of each residue appearing at its respective position. In this process we use a pseudocount of 2 to avoid zero probabilities. After that we locally align a given long sequence to the multiple sequences, using the profile to score the alignment, and then report the subsequence with the highest alignment score.
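The profile-building step might be sketched as below. Several details are assumptions rather than facts from this project: the alphabet is taken to be every symbol that occurs in the input (including the gap character), the pseudocount is added to every symbol's count in every column, and `build_profile` is a hypothetical name. The local alignment against the profile is not shown.

```python
PSEUDOCOUNT = 2  # pseudocount value given in the text


def build_profile(msa, pseudocount=PSEUDOCOUNT):
    """Return one {symbol: probability} dict per alignment column."""
    # Assumption: the alphabet is whatever appears in the input, gaps included.
    alphabet = sorted(set("".join(msa)))
    # Each column holds len(msa) observations plus one pseudocount per symbol.
    denom = len(msa) + pseudocount * len(alphabet)
    profile = []
    for c in range(len(msa[0])):
        column = [row[c] for row in msa]
        profile.append(
            {s: (column.count(s) + pseudocount) / denom for s in alphabet}
        )
    return profile
```

Under these assumptions, for the four sequences `HVLIP`, `H-MIP`, `HVL-P`, `LVLIP` the alphabet has 7 symbols, so `H` in the first column gets probability (3 + 2) / (4 + 2 * 7) = 5/18, and a symbol never seen in a column still gets 2/18 instead of zero.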
Input the number of sequences to build the profile from, then each sequence on a separate line, followed by the long sequence.
4
HVLIP
H-MIP
HVL-P
LVLIP
LIVPHHVPIPVLVIHPVLPPHIVLHHIHVHIHLPVLHIVHHLVIHLHPIVL
Output will show the aligned subsequence with the highest score.
H-L-P