This repository contains solutions to bioinformatics coding challenges from rosalind.info. Problems are organised by the various different locations:
- Python Village: initial problems to learn a few basics about the Python programming language.
- Bioinformatics Stronghold: problems to discover the algorithms underlying a variety of bioinformatics topics.
- Bioinformatics Armory: unlike the stronghold in the Armory we solve problems by using existing tools.
- Bioinformatics Textbook Track: problems associated with Bioinformatics Algorithms: An Active Learning Approach.
- Algorithmic Heights: exercises to accompany the book Algorithms.
This repository is written as a python module and uses poetry and typer.
Solutions for each problem are located in individual files inside the directory for each location.
You can install the versions of dependencies used here with:
poetry install
To run solutions within this environment run, e.g.:
poetry run rosalind ini2 rosalind_ini2.txt
To run the solution on the provided "Sample Dataset" from rosalind.info (which should reproduce the "Sample Output"), run the solution in "test" mode:
poetry run rosalind --test ini2
pytest-snapshot
is used to test solutions to problems. In many cases solutions
generated will and should exactly match the "Sample Output" given at
rosalind.info. In cases, where e.g. ordering is not important, the
expected solutions (in tests/expected
) have been updated to match code used
here, but are equally valid solutions.
To run the tests use:
poetry run pytest
To update the tests (adding or modifying snapshots / expected output) use:
poetry run pytest --snapshot-update
Note that some solutions (that use Entrez) require an email address. This should be set as an environment variable, e.g.:
export ENTREZ_EMAIL=rosalind.franklin@cam.ac.uk
- My rosalind profile: https://rosalind.info/users/danhalligan/
- INI1: Installing Python
- INI2: Variables and Some Arithmetic
- INI3: Strings and Lists
- INI4: Conditions and Loops
- INI5: Working with Files
- INI6: Dictionaries
- DNA: Counting DNA Nucleotides
- RNA: Transcribing DNA into RNA
- REVC: Complementing a Strand of DNA
- FIB: Rabbits and Recurrence Relations
- GC: Computing GC Content
- HAMM: Counting Point Mutations
- IPRB: Mendel's First Law
- PROT: Translating RNA into Protein
- SUBS: Finding a Motif in DNA
- CONS: Consensus and Profile
- FIBD: Mortal Fibonacci Rabbits
- GRPH: Overlap Graphs
- IEV: Calculating Expected Offspring
- LCSM: Finding a Shared Motif
- LIA: Independent Alleles
- MPRT: Finding a Protein Motif
- MRNA: Inferring mRNA from Protein
- ORF: Open Reading Frames
- PERM: Enumerating Gene Orders
- PRTM: Calculating Protein Mass
- REVP: Locating Restriction Sites
- SPLC: RNA Splicing
- LEXF: Enumerating k-mers Lexicographically
- LGIS: Longest Increasing Subsequence
- LONG: Genome Assembly as Shortest Superstring
- PMCH: Perfect Matchings and RNA Secondary Structures
- PPER: Partial Permutations
- PROB: Introduction to Random Strings
- SIGN: Enumerating Oriented Gene Orderings
- SSEQ: Finding a Spliced Motif
- TRAN: Transitions and Transversions
- TREE: Completing a Tree
- CAT: Catalan Numbers and RNA Secondary Structures
- CORR: Error Correction in Reads
- INOD: Counting Phylogenetic Ancestors
- KMER: k-Mer Composition
- KMP: Speeding Up Motif Finding
- LCSQ: Finding a Shared Spliced Motif
- LEXV: Ordering Strings of Varying Length Lexicographically
- MMCH: Maximum Matchings and RNA Secondary Structures
- PDST: Creating a Distance Matrix
- REAR: Reversal Distance
- RSTR: Matching Random Motifs
- SSET: Counting Subsets
- ASPC: Introduction to Alternative Splicing
- EDIT: Edit Distance
- EVAL: Expected Number of Restriction Sites
- MOTZ: Motzkin Numbers and RNA Secondary Structures
- SCSP: Interleaving Two Motifs
- SETO: Introduction to Set Operations
- SORT: Sorting by Reversals
- SPEC: Inferring Protein from Spectrum
- TRIE: Introduction to Pattern Matching
- CONV: Comparing Spectra with the Spectral Convolution
- DBRU: Constructing a De Bruijn Graph
- EDTA: Edit Distance Alignment
- FULL: Inferring Peptide from Full Spectrum
- INDC: Independent Segregation of Chromosomes
- LREP: Finding the Longest Multiple Repeat
- RNAS: Wobble Bonding and RNA Secondary Structures
- AFRQ: Counting Disease Carriers
- CTEA: Counting Optimal Alignments
- GLOB: Global Alignment with Scoring Matrix
- PCOV: Genome Assembly with Perfect Coverage
- PRSM: Matching a Spectrum to a Protein
- SGRA: Using the Spectrum Graph to Infer Peptides
- SUFF: Encoding Suffix Trees
- GASM: Genome Assembly Using Reads
- GCON: Global Alignment with Constant Gap Penalty
- LING: Linguistic Complexity of a Genome
- LOCA: Local Alignment with Scoring Matrix
- MGAP: Maximizing the Gap Symbols of an Optimal Alignment
- MULT: Multiple Alignment
- PDPL: Creating a Restriction Map
- SEXL: Sex-Linked Inheritance
- WFMD: The Wright-Fisher Model of Genetic Drift
- ASMQ: Assessing Assembly Quality with N50 and N75
- EBIN: Wright-Fisher's Expected Behavior
- FOUN: The Founder Effect and Genetic Drift
- GAFF: Global Alignment with Scoring Matrix and Affine Gap Penalty
- GREP: Genome Assembly with Perfect Coverage and Repeats
- OAP: Overlap Alignment
- SIMS: Finding a Motif with Modifications
- SMGB: Semiglobal Alignment
- MREP: Identifying Maximal Repeats
- NWCK: Distances in Trees
- NKEW: Newick Format with Edge Weights
- CTBL: Creating a Character Table
- CUNR: Counting Unrooted Binary Trees
- ROOT: Counting Rooted Binary Trees
- ITWV: Finding Disjoint Motifs in a Gene
- QRT: Quartets
- MEND: Inferring Genotype from a Pedigree
- CSTR: Creating a Character Table from Genetic Strings
- CHBP: Character-Based Phylogeny
- CSET: Fixing an Inconsistent Character Set
- OSYM: Isolating Symbols in Alignments
- SPTD: Phylogeny Comparison with Split Distance
- CNTQ: Counting Quartets
- ALPH: Alignment-Based Phylogeny
- RSUB: Identifying Reversing Substitutions
- LAFF: Local Alignment with Affine Gap Penalty
- EUBT: Enumerating Unrooted Binary Trees
- KSIM: Finding All Similar Motifs
- QRTD: Quartet Distance
- For "QRTD" I have cheated by using tqDist. For the solution to run you will
need to install
quartet_dist
and have it available in your path. Well done to anyone else who solved this properly!
- INI: Introduction to the Bioinformatics Armory
- GBK: GenBank Introduction
- FRMT: Data Formats
- MEME: New Motif Discovery
- NEED: Pairwise Global Alignment
- TFSQ: FASTQ format introduction
- PHRE: Read Quality Distribution
- PTRA: Protein Translation
- FILT: Read Filtration by Quality
- RVCO: Complementing a Strand of DNA
- SUBO: Suboptimal Local Alignment
- BPHR: Base Quality Distribution
- CLUS: Global Multiple Alignment
- ORFR: Finding Genes with ORFs
- BFIL: Base Filtration by Quality
- For "MEME" and "CLUS" I have not written a solution. Use the web interface as instructed.
- For "SUBO", you need to run the online interface, identify the 32-40 bp and then can use the solution here to count the occurrences of this in the sequences.
- BA1A: Compute the Number of Times a Pattern Appears in a Text
- BA1B: Find the Most Frequent Words in a String
- BA1C: Find the Reverse Complement of a String
- BA1D: Find All Occurrences of a Pattern in a String
- BA1E: Find Patterns Forming Clumps in a String
- BA1F: Find a Position in a Genome Minimizing the Skew
- BA1G: Compute the Hamming Distance Between Two Strings
- BA1H: Find All Approximate Occurrences of a Pattern in a String
- BA1I: Find the Most Frequent Words with Mismatches in a String
- BA1J: Find Frequent Words with Mismatches and Reverse Complements
- BA1K: Generate the Frequency Array of a String
- BA1L: Implement PatternToNumber
- BA1M: Implement NumberToPattern
- BA1N: Generate the d-Neighborhood of a String
- BA2A: Implement MotifEnumeration
- BA2B: Find a Median String
- BA2C: Find a Profile-most Probable k-mer in a String
- BA2D: Implement GreedyMotifSearch
- BA2E: Implement GreedyMotifSearch with Pseudocounts
- BA2F: Implement RandomizedMotifSearch
- BA2G: Implement GibbsSampler
- BA2H: Implement DistanceBetweenPatternAndStrings
- BA3A: Generate the k-mer Composition of a String
- BA3B: Reconstruct a String from its Genome Path
- BA3C: Construct the Overlap Graph of a Collection of k-mers
- BA3D: Construct the De Bruijn Graph of a String
- BA3E: Construct the De Bruijn Graph of a Collection of k-mers
- BA3F: Find an Eulerian Cycle in a Graph
- BA3G: Find an Eulerian Path in a Graph
- BA3H: Reconstruct a String from its k-mer Composition
- BA3I: Find a k-Universal Circular String
- BA3J: Reconstruct a String from its Paired Composition
- BA3K: Generate Contigs from a Collection of Reads
- BA3L: Construct a String Spelled by a Gapped Genome Path
- BA3M: Generate All Maximal Non-Branching Paths in a Graph
- BA4A: Translate an RNA String into an Amino Acid String
- BA4B: Find Substrings of a Genome Encoding a Given Amino Acid String
- BA4C: Generate the Theoretical Spectrum of a Cyclic Peptide
- BA4D: Compute the Number of Peptides of Given Total Mass
- BA4E: Find a Cyclic Peptide with Theoretical Spectrum Matching an Ideal Spectrum
- BA4F: Compute the Score of a Cyclic Peptide Against a Spectrum
- BA4G: Implement LeaderboardCyclopeptideSequencing
- BA4H: Generate the Convolution of a Spectrum
- BA4I: Implement ConvolutionCyclopeptideSequencing
- BA4J: Generate the Theoretical Spectrum of a Linear Peptide
- BA4K: Compute the Score of a Linear Peptide
- BA4L: Trim a Peptide Leaderboard
- BA4M: Solve the Turnpike Problem
- BA5A: Find the Minimum Number of Coins Needed to Make Change
- BA5B: Find the Length of a Longest Path in a Manhattan-like Grid
- BA5C: Find a Longest Common Subsequence of Two Strings
- BA5D: Find the Longest Path in a DAG
- BA5E: Find a Highest-Scoring Alignment of Two Strings
- BA5F: Find a Highest-Scoring Local Alignment of Two Strings
- BA5G: Compute the Edit Distance Between Two Strings
- BA5H: Find a Highest-Scoring Fitting Alignment of Two Strings
- BA5I: Find a Highest-Scoring Overlap Alignment of Two Strings
- BA5J: Align Two Strings Using Affine Gap Penalties
- BA5K: Find a Middle Edge in an Alignment Graph in Linear Space
- BA5L: Align Two Strings Using Linear Space
- BA5M: Find a Highest-Scoring Multiple Sequence Alignment
- BA5N: Find a Topological Ordering of a DAG
- BA6A: Implement GreedySorting to Sort a Permutation by Reversals
- BA6B: Compute the Number of Breakpoints in a Permutation
- BA6C: Compute the 2-Break Distance Between a Pair of Genomes
- BA6D: Find a Shortest Transformation of One Genome into Another by 2-Breaks
- BA6E: Find All Shared k-mers of a Pair of Strings
- BA6F: Implement ChromosomeToCycle
- BA6G: Implement CycleToChromosome
- BA6H: Implement ColoredEdges
- BA6I: Implement GraphToGenome
- BA6J: Implement 2-BreakOnGenomeGraph
- BA6K: Implement 2-BreakOnGenome
- BA7A: Compute Distances Between Leaves
- BA7B: Compute Limb Lengths in a Tree
- BA7C: Implement AdditivePhylogeny
- BA7D: Implement UPGMA
- BA7E: Implement the Neighbor Joining Algorithm
- BA7F: Implement SmallParsimony
- BA7G: Adapt SmallParsimony to Unrooted Trees
- BA8A: Implement FarthestFirstTraversal
- BA8B: Compute the Squared Error Distortion
- BA8C: Implement the Lloyd Algorithm for k-Means Clustering
- BA8D: Implement the Soft k-Means Clustering Algorithm
- BA8E: Implement Hierarchical Clustering
- BA9A: Construct a Trie from a Collection of Patterns
- BA9B: Implement TrieMatching
- BA9C: Construct the Suffix Tree of a String
- BA9D: Find the Longest Repeat in a String
- BA9E: Find the Longest Substring Shared by Two Strings
- BA9F: Find the Shortest Non-Shared Substring of Two Strings
- BA9G: Construct the Suffix Array of a String
- BA9H: Pattern Matching with the Suffix Array
- BA9I: Construct the Burrows-Wheeler Transform of a String
- BA9J: Reconstruct a String from its Burrows-Wheeler Transform
- BA9K: Generate the Last-to-First Mapping of a String
- BA9L: Implement BWMatching
- BA9M: Implement BetterBWMatching
- BA9N: Find All Occurrences of a Collection of Patterns in a String
- BA9O: Find All Approximate Occurrences of a Collection of Patterns in a String
- BA9P: Implement TreeColoring
- BA9Q: Construct the Partial Suffix Array of a String
- BA9R: Construct a Suffix Tree from a Suffix Array
- BA10A: Compute the Probability of a Hidden Path
- BA10B: Compute the Probability of an Outcome Given a Hidden Path
- BA10C: Implement the Viterbi Algorithm
- BA10D: Compute the Probability of a String Emitted by an HMM
- BA10E: Construct a Profile HMM
- BA10F: Construct a Profile HMM with Pseudocounts
- BA10G: Perform a Multiple Sequence Alignment with a Profile HMM
- BA10H: Estimate the Parameters of an HMM
- BA10I: Implement Viterbi Learning
- BA10J: Solve the Soft Decoding Problem
- BA10K: Implement Baum-Welch Learning
- BA11A: Construct the Graph of a Spectrum
- BA11B: Implement DecodingIdealSpectrum
- BA11C: Convert a Peptide into a Peptide Vector
- BA11D: Convert a Peptide Vector into a Peptide
- BA11E: Sequence a Peptide
- BA11F: Find a Highest-Scoring Peptide in a Proteome against a Spectrum
- BA11G: Implement PSMSearch
- BA11H: Compute the Size of a Spectral Dictionary
- BA11I: Compute the Probability of a Spectral Dictionary
- BA11J: Find a Highest-Scoring Modified Peptide against a Spectrum
- FIBO: Fibonacci Numbers
- BINS: Binary Search
- DEG: Degree Array
- INS: Insertion Sort
- DDEG: Double-Degree Array
- MAJ: Majority Element
- MER: Merge Two Sorted Arrays
- 2SUM: 2SUM
- BFS: Breadth-First Search
- CC: Connected Components
- MS: Merge Sort
- PAR: 2-Way Partition
- 3SUM: 3SUM
- DAG: Testing Acyclicity
- DIJ: Dijkstra's Algorithm
- INV: Counting Inversions
- TS: Topological Sorting
- BIP: Testing Bipartiteness
- PAR3: 3-Way Partition
- MED: Median
- QS: Quick Sort
- HEA: Building a Heap
- SQ: Square in a Graph
- BF: Bellman-Ford Algorithm
- CTE: Shortest Cycle Through a Given Edge
- HS: Heap Sort
- PS: Partial Sort
- HDAG: Hamiltonian Path in DAG
- NWC: Negative Weight Cycle
- SCC: Strongly Connected Components
- SC: Semi-Connected Graph
- GS: General Sink
- SDAG: Shortest Paths in DAG
- 2SAT: 2-Satisfiability