A repository of algorithms for genomic sequencing.
The repository currently consists of an implementation of the Boyer-Moore algorithm for exact alignment of a sequencing read to a reference genome. The implementation uses a linear-time algorithm (the "Z algorithm") as described by Gusfield (lecture notes 4 & 5) to construct indexes of the query string used for the good suffix and bad character rules.
The full test suite consists of tests for correctness and performance of the implementation of the Boyer-Moore string matching algorithm, using the Naïve string matching algorithm as a baseline comparison.
Test cases for correctness are available in tests
. The following command will convert these test cases into the FASTA format required to run the full test suite:
./format_test.sh
Test cases for performance must be downloaded. The reference genomes are from the Dec. 2013 (GRCh38/hg38) assembly of the human genome; chromosomes 1 and 20 were used and can be downloaded here and here respectively; place the (decompressed) downloaded files in the root directory of the repository. The query string is included in FASTA format in the file p.fa
and is taken from Ben Langmead's slides.
The full test suite can then be run with the following command:
./run_test.sh
Ben Langmead's course was immensely helpful, and was a guide throughout developing this repository and learning about genomic sequencing.