This is the code of my Master's thesis, with the title "Unsupervised Learning of a Hidden Markov Model of a Family of Gene Structures from Unaligned Genomic Sequences".

MattesMrzik/CGP-HMM

CGP-HMM

This is the code of my Master's thesis, with the title "Unsupervised Learning of a Hidden Markov Model of a Family of Gene Structures from Unaligned Genomic Sequences".

Supervisor: Prof. Dr. Mario Stanke, the inventor of the gene prediction tool Augustus.

Abstract

In the rapidly advancing field of genomics, an increasing number of genomes are being sequenced and are awaiting structural annotation. This thesis introduces cgphmm, a pioneering tool that utilizes alignment-free comparative analysis across multiple species. The core innovation of cgphmm is the use of a modified Hidden Markov Model combined with unsupervised learning using gradient descent, a distinctive approach that serves as an important proof of concept in the field of genomic analysis. A probabilistic model is constructed that accurately captures the genic structure and sequences of a given exon family. This is achieved through an unsupervised learning process involving sequences from different species believed to contain a homologous exon. The model can then be used for simultaneous exon annotation across all input species. Furthermore, cgphmm is highly accurate in predicting exon-intron boundaries of human coding exons. The overall runtime scales linearly with the number of species. This feature is particularly beneficial in the era of big data, where the number of sequenced genomes is growing exponentially.

The transitions of the HMM in use:

See data/hmm.png
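
The abstract describes training an HMM without labels by gradient descent. As a rough illustration of the quantity such training would typically minimize, the sketch below computes the negative log-likelihood of unaligned sequences with the forward algorithm in log space. It is not the code of this repository: the two-state toy model, its probabilities, and the names `log_likelihood` and `negative_log_likelihood` are assumptions made for this example, and the gradient step itself (which an automatic differentiation framework would provide) is left out.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_likelihood(seq, log_init, log_trans, log_emit):
    """Forward algorithm in log space: returns log P(seq) under the HMM."""
    if not seq:
        raise ValueError("sequence must not be empty")
    n_states = len(log_init)
    # alpha[s] = log P(symbols so far, current state = s)
    alpha = [log_init[s] + log_emit[s][seq[0]] for s in range(n_states)]
    for symbol in seq[1:]:
        alpha = [
            logsumexp([alpha[p] + log_trans[p][s] for p in range(n_states)])
            + log_emit[s][symbol]
            for s in range(n_states)
        ]
    return logsumexp(alpha)


def negative_log_likelihood(seqs, log_init, log_trans, log_emit):
    """Loss over all input sequences (for example, one per species)."""
    return -sum(log_likelihood(s, log_init, log_trans, log_emit) for s in seqs)


# Hypothetical toy model: state 0 is "intron-like", state 1 is "exon-like".
INIT = [0.9, 0.1]
TRANS = [[0.8, 0.2], [0.3, 0.7]]
EMIT = [
    {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
]

LOG_INIT = [math.log(p) for p in INIT]
LOG_TRANS = [[math.log(p) for p in row] for row in TRANS]
LOG_EMIT = [{base: math.log(p) for base, p in row.items()} for row in EMIT]

if __name__ == "__main__":
    toy_sequences = ["ACGTTG", "TTGACC", "GGCAT"]
    print(negative_log_likelihood(toy_sequences, LOG_INIT, LOG_TRANS, LOG_EMIT))
```

Each sequence is scored independently and the per-sequence terms are summed, so in this sketch the cost of evaluating the loss grows in proportion to the number of input sequences.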

Install

Run

  • cd src
  • python3 cgphmm.py

Dataset creation

  • get_exons_df.py creates a data frame (df) containing all exons.
  • get_internal_exon.py selects suitable exons, maps their coordinates, and extracts the FASTA sequences.
  • select_good_exons_for_training.py selects the exons that can be used for training and copies them to a new directory.
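
The sequence extraction step can be pictured with the small sketch below. It does not reproduce the scripts listed above: `parse_fasta`, `extract_exon`, the 0-based half-open coordinate convention, and the `flank` parameter are all assumptions chosen for illustration.

```python
# Complement table for reverse-complementing minus-strand regions.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def parse_fasta(text):
    """Parse FASTA text into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-separated token as the record name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


def extract_exon(genome, chrom, start, end, strand="+", flank=0):
    """Return genome[chrom][start:end] plus `flank` bases on each side.

    Coordinates are assumed to be 0-based and half-open. Flanks are
    clipped at the sequence ends. Minus-strand regions are returned
    reverse-complemented.
    """
    seq = genome[chrom]
    lo = max(0, start - flank)
    hi = min(len(seq), end + flank)
    region = seq[lo:hi]
    if strand == "-":
        region = region.translate(COMPLEMENT)[::-1]
    return region


if __name__ == "__main__":
    genome = parse_fasta(">chr1 toy example\nACGTAC\nGTTTGA\n")
    print(extract_exon(genome, "chr1", 4, 8, flank=2))
```

Including flanking bases around each exon gives a model some intronic context on both sides, which matters when the goal is to locate exon-intron boundaries.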

Evaluation

  • The script multi_run.py was used to run the experiments and evaluate the results.
