This is the code for my master's thesis, titled "Unsupervised Learning of a Hidden Markov Model of a Family of Gene Structures from Unaligned Genomic Sequences".
Supervisor: Prof. Dr. Mario Stanke, the inventor of the gene prediction tool Augustus.
In the rapidly advancing field of genomics, an increasing number of genomes are being
sequenced and are awaiting structural annotation. This thesis introduces cgphmm, a
pioneering tool that utilizes alignment-free comparative analysis across multiple species.
The core innovation of cgphmm is the use of a modified Hidden Markov Model combined
with unsupervised learning using gradient descent, a distinctive approach that serves as
an important proof of concept in the field of genomic analysis. A probabilistic model is
constructed that accurately captures the genic structure and sequences of a given exon
family. This is achieved through an unsupervised learning process involving sequences
from different species believed to contain a homologous exon. The model can then be
used for simultaneous exon annotation across all input species. Furthermore, cgphmm
is highly accurate in predicting exon-intron boundaries of human coding exons. The
overall runtime scales linearly with the number of species. This feature is particularly
beneficial in the era of big data, where the number of sequenced genomes is growing
exponentially.
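
The idea of fitting an HMM to unlabeled sequences by gradient descent can be illustrated with a minimal sketch. It is not the cgphmm implementation: the two-state model, the toy sequences, the fixed transition matrix, and all function names are assumptions made for illustration, and the gradient is approximated by finite differences as a stand-in for automatic differentiation. The forward algorithm computes the log-likelihood of each sequence, and the emission parameters are updated to reduce the negative log-likelihood.

```python
import math

ALPHABET = "ACGT"

def softmax(logits):
    """Map unconstrained real numbers to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def emissions_from_logits(logits):
    """One emission distribution over ALPHABET per hidden state."""
    return [dict(zip(ALPHABET, softmax(row))) for row in logits]

def log_likelihood(seq, start, trans, emit):
    """Forward algorithm with per-position scaling; returns log P(seq)."""
    n = len(start)
    alpha = [start[s] * emit[s][seq[0]] for s in range(n)]
    log_p = 0.0
    for t in range(len(seq)):
        if t > 0:
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][seq[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_p

def loss(logits, seqs, start, trans):
    """Negative log-likelihood of all (unlabeled) training sequences."""
    emit = emissions_from_logits(logits)
    return -sum(log_likelihood(s, start, trans, emit) for s in seqs)

def gradient_step(logits, seqs, start, trans, lr=0.02, eps=1e-6):
    """One gradient descent step using a finite-difference gradient."""
    base = loss(logits, seqs, start, trans)
    updated = [row[:] for row in logits]
    for i in range(len(logits)):
        for j in range(len(logits[i])):
            bumped = [row[:] for row in logits]
            bumped[i][j] += eps
            grad = (loss(bumped, seqs, start, trans) - base) / eps
            updated[i][j] = logits[i][j] - lr * grad
    return updated

# Toy data and a hypothetical two-state model (assumptions, not thesis data).
seqs = ["ACGTGGTAAG", "CCGTGAGTAT"]
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
logits = [[0.1, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.1]]

initial_loss = loss(logits, seqs, start, trans)
for _ in range(100):
    logits = gradient_step(logits, seqs, start, trans)
final_loss = loss(logits, seqs, start, trans)
print(initial_loss, final_loss)
```

Parameterizing the emissions through a softmax keeps every update a valid probability distribution, so no projection step is needed after each gradient update.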
[Figure: The transitions of the HMM in use.]
Installation and usage:
- git clone --recursive https://github.com/MattesMrzik/CGP-HMM
- cd viterbi_cc
- make
- chmod u+x Viterbi
- cd src
- python3 cgphmm.py
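
The build steps above produce a program named Viterbi. Assuming it performs standard Viterbi decoding (finding the most probable hidden state path for a sequence), the algorithm can be sketched in a few lines. The two-state intron/exon model, its probabilities, and the function names below are hypothetical and chosen only for illustration; they are not taken from the repository.

```python
import math

def viterbi(seq, states, log_start, log_trans, log_emit):
    """Return the most probable state path for seq and its log-probability."""
    # best[s]: log-probability of the best path that ends in state s
    best = {s: log_start[s] + log_emit[s][seq[0]] for s in states}
    backpointers = []
    for symbol in seq[1:]:
        prev = best
        best = {}
        pointers = {}
        for s in states:
            r_best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            best[s] = prev[r_best] + log_trans[r_best][s] + log_emit[s][symbol]
            pointers[s] = r_best
        backpointers.append(pointers)
    # Trace back from the best final state
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]

def path_log_prob(seq, path, log_start, log_trans, log_emit):
    """Joint log-probability of a sequence and one given state path."""
    total = log_start[path[0]] + log_emit[path[0]][seq[0]]
    for t in range(1, len(seq)):
        total += log_trans[path[t - 1]][path[t]] + log_emit[path[t]][seq[t]]
    return total

# Hypothetical toy model: AT-rich introns, GC-rich exons.
STATES = ("intron", "exon")
LOG_START = {s: math.log(0.5) for s in STATES}
LOG_TRANS = {
    "intron": {"intron": math.log(0.9), "exon": math.log(0.1)},
    "exon": {"intron": math.log(0.1), "exon": math.log(0.9)},
}
LOG_EMIT = {
    "intron": {b: math.log(p) for b, p in zip("ACGT", (0.45, 0.05, 0.05, 0.45))},
    "exon": {b: math.log(p) for b, p in zip("ACGT", (0.05, 0.45, 0.45, 0.05))},
}

seq = "AAAAGGGGGAAAA"
path, score = viterbi(seq, STATES, LOG_START, LOG_TRANS, LOG_EMIT)
print(list(zip(seq, path)))
```

Working in log space avoids numerical underflow on long genomic sequences, where products of many small probabilities would otherwise round to zero.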
- get_exons_df.py creates a data frame (df) containing all exons.
- get_internal_exon.py selects suitable exons, maps coordinates, and extracts FASTA sequences.
- select_good_exons_for_training.py selects exons that can be used for training and copies them to a new directory.
- multi_run.py was used to run and evaluate cgphmm.
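
The coordinate mapping and sequence extraction mentioned above can be illustrated with a small sketch. It does not reproduce the scripts in this repository: the function names, the 0-based half-open coordinate convention, the flank size, and the toy sequence are all assumptions. The sketch cuts an exon plus flanking bases out of a chromosome sequence and reports where the exon lies inside the extracted window, including the reverse-strand case.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence over the alphabet ACGTN."""
    return seq.translate(_COMPLEMENT)[::-1]

def extract_exon_window(chrom_seq, exon_start, exon_end, strand="+", flank=2):
    """Extract an exon with flanks; coordinates are 0-based, half-open.

    Returns the window sequence (read on the given strand) and the exon's
    start and end inside that window.
    """
    win_start = max(0, exon_start - flank)
    win_end = min(len(chrom_seq), exon_end + flank)
    window = chrom_seq[win_start:win_end]
    if strand == "+":
        local_start = exon_start - win_start
        local_end = exon_end - win_start
    else:
        # On the reverse strand the window is read from its far end
        window = reverse_complement(window)
        local_start = win_end - exon_end
        local_end = win_end - exon_start
    return window, local_start, local_end

# Toy chromosome (an assumption) with the exon ATGGCC at positions 5 to 11.
fasta_text = ">chr1 toy example\nTTTAGATGGC\nCGTAAGTTT\n"
genome = parse_fasta(fasta_text)

window, start, end = extract_exon_window(genome["chr1"], 5, 11, strand="+")
print(window, window[start:end])

rc_window, rc_start, rc_end = extract_exon_window(genome["chr1"], 5, 11, strand="-")
print(rc_window, rc_window[rc_start:rc_end])
```

Returning the exon position relative to the window makes it possible to compare a predicted exon boundary against the known one without going back to chromosome coordinates.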