Text Algorithms in CPP

Text Algorithms is a C++ project that implements substring search and sequence alignment algorithms. This project can be useful for bioinformatics and other full-text search tasks.

Dependencies

The project requires the following dependencies:

CMake >= 3.15
C++17-compatible compiler

Build

To build the project, follow these steps:

Clone the repository:

git clone https://github.com/your-username/TextAlgorithms.git

Navigate to the project directory:

cd TextAlgorithms

Run the following commands:

cmake -S . -B ./build
cmake --build ./build

Usage

Substring Search

The project implements the Rabin-Karp algorithm for substring search. To use it, include the SubstringSearch.h header and call the rabinKarp function with the haystack and needle strings:

#include "SubstringSearch.h"

// ...

std::string haystack = "Madam, I'm Adam";
std::string needle = "am";
std::vector<int> matches = rabinKarp(haystack, needle);
// matches contains the positions of the needle occurrences in the haystack
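
For readers unfamiliar with the technique, the rolling-hash idea behind Rabin-Karp can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name rabinKarpSketch, the hash base and modulus, and the 0-based positions are all assumptions made for the example.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the project's rabinKarp (assumed: 0-based positions).
std::vector<int> rabinKarpSketch(const std::string& haystack, const std::string& needle) {
  std::vector<int> matches;
  const std::size_t n = haystack.size(), m = needle.size();
  if (m == 0 || m > n) return matches;

  // Illustrative constants; any base/modulus pair with few collisions would do.
  const std::uint64_t base = 256, mod = 1000000007ULL;

  // highPow = base^(m-1) mod mod, the weight of the character leaving the window.
  std::uint64_t highPow = 1;
  for (std::size_t i = 1; i < m; ++i) highPow = highPow * base % mod;

  // Hash the needle and the first window of the haystack.
  std::uint64_t needleHash = 0, windowHash = 0;
  for (std::size_t i = 0; i < m; ++i) {
    needleHash = (needleHash * base + static_cast<unsigned char>(needle[i])) % mod;
    windowHash = (windowHash * base + static_cast<unsigned char>(haystack[i])) % mod;
  }

  for (std::size_t i = 0;; ++i) {
    // Equal hashes may be a collision, so confirm with a direct comparison.
    if (windowHash == needleHash && haystack.compare(i, m, needle) == 0) {
      matches.push_back(static_cast<int>(i));
    }
    if (i + m >= n) break;
    // Roll the window: drop haystack[i], append haystack[i + m].
    const std::uint64_t out = static_cast<unsigned char>(haystack[i]) * highPow % mod;
    windowHash = (windowHash + mod - out) % mod;
    windowHash = (windowHash * base + static_cast<unsigned char>(haystack[i + m])) % mod;
  }
  return matches;
}
```

With the haystack and needle from the snippet above, this sketch returns positions 3 and 13.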

Sequence Alignment

The project implements the Needleman-Wunsch algorithm for sequence alignment. To use it, include the SequenceAlignment.h header and call the needlemanWunsch function with the two sequences and the similarity matrix:

#include "SequenceAlignment.h"

// ...

std::string seq1 = "GGGCGACACTCCACCATAGA";
std::string seq2 = "GGCGACACCCACCATACAT";
// similarityMatrix must be defined before this call; its definition is not shown in this README
std::vector<std::string> alignment = needlemanWunsch(seq1, seq2, similarityMatrix);
// alignment contains the two sequences aligned with gaps
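
The dynamic-programming scheme behind Needleman-Wunsch can be sketched roughly as below. This is an illustrative sketch only: the name needlemanWunschSketch is hypothetical, and it swaps the project's similarity matrix for fixed match, mismatch, and gap scores (assumed values 1, -1, -1) to stay self-contained.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical sketch: global alignment with fixed scores instead of a similarity matrix.
std::vector<std::string> needlemanWunschSketch(const std::string& a, const std::string& b,
                                               int match = 1, int mismatch = -1, int gap = -1) {
  const std::size_t n = a.size(), m = b.size();
  // score[i][j] = best score for aligning a[0..i) with b[0..j).
  std::vector<std::vector<int>> score(n + 1, std::vector<int>(m + 1, 0));
  for (std::size_t i = 1; i <= n; ++i) score[i][0] = static_cast<int>(i) * gap;
  for (std::size_t j = 1; j <= m; ++j) score[0][j] = static_cast<int>(j) * gap;

  for (std::size_t i = 1; i <= n; ++i) {
    for (std::size_t j = 1; j <= m; ++j) {
      const int diag = score[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
      score[i][j] = std::max({diag, score[i - 1][j] + gap, score[i][j - 1] + gap});
    }
  }

  // Trace back from the bottom-right corner, preferring the diagonal move.
  std::string alignedA, alignedB;
  std::size_t i = n, j = m;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 &&
        score[i][j] == score[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch)) {
      alignedA += a[--i];
      alignedB += b[--j];
    } else if (i > 0 && score[i][j] == score[i - 1][j] + gap) {
      alignedA += a[--i];  // gap in the second sequence
      alignedB += '-';
    } else {
      alignedA += '-';     // gap in the first sequence
      alignedB += b[--j];
    }
  }
  std::reverse(alignedA.begin(), alignedA.end());
  std::reverse(alignedB.begin(), alignedB.end());
  return {alignedA, alignedB};
}
```

Under these assumed scores, aligning "ACGT" with "AGT" yields "ACGT" and "A-GT". A real similarity matrix would replace the match/mismatch term with a per-pair lookup.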

Examples

Substring Search

Find all occurrences of the string "AAGCCTCTCAAT" in an HIV genome sequence stored in HIV.txt:

#include "SubstringSearch.h"
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

int main() {
  std::ifstream file("HIV.txt");
  if (!file) {
    std::cerr << "Cannot open HIV.txt\n";
    return 1;
  }
  // Read the whole file into a single string.
  std::string haystack((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());
  std::string needle = "AAGCCTCTCAAT";
  std::vector<int> matches = rabinKarp(haystack, needle);
  for (int match : matches) {
    std::cout << "Match at position " << match << '\n';
  }
  return 0;
}

Sequence Alignment

Align two DNA sequences using a similarity matrix:

#include "SequenceAlignment.h"
#include <iostream>
#include <string>
#include <vector>

int main() {
  std::string seq1 = "GGGCGACACTCCACCATAGA";
  std::string seq2 = "GGCGACACCCACCATACAT";
  // similarityMatrix must be defined before this call; its definition is not shown in this README
  std::vector<std::string> alignment = needlemanWunsch(seq1, seq2, similarityMatrix);
  std::cout << alignment[0] << '\n' << alignment[1] << '\n';
  return 0;
}

Matching regular expressions

The program checks whether a sequence over the alphabet {A, C, G, T} matches a regular expression.
The input of the program is a file with two lines. The first line contains the sequence to be checked for a match. The second line contains a pattern that includes characters from the alphabet and the following characters:

. -- matches any single character from the alphabet;
? -- matches any single character from the alphabet or the absence of a character;
+ -- matches zero or more repetitions of the previous element;
* -- matches any sequence of characters from the alphabet or the absence of characters.

The program outputs True or False depending on whether the given sequence matches the pattern.
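
One way to implement the rules above is a dynamic-programming table over prefixes of the sequence and the pattern. The sketch below is not the project's code: the name matchesPatternSketch is hypothetical, the whole sequence is assumed to have to match the whole pattern, and '+' is assumed to apply only to an immediately preceding letter or '.', treating that element plus the '+' as one unit that repeats zero or more times.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch of the matcher described above (full-match semantics assumed).
bool matchesPatternSketch(const std::string& text, const std::string& pattern) {
  const std::size_t n = text.size(), m = pattern.size();
  // dp[i][j] is true when text[0..i) matches pattern[0..j).
  std::vector<std::vector<bool>> dp(n + 1, std::vector<bool>(m + 1, false));
  dp[0][0] = true;

  for (std::size_t j = 1; j <= m; ++j) {
    const char p = pattern[j - 1];
    for (std::size_t i = 0; i <= n; ++i) {
      bool ok = false;
      if (p == '*') {
        // Any run of characters: match nothing, or absorb one more character.
        ok = dp[i][j - 1] || (i > 0 && dp[i - 1][j]);
      } else if (p == '?') {
        // Any single character, or no character at all.
        ok = dp[i][j - 1] || (i > 0 && dp[i - 1][j - 1]);
      } else if (p == '+') {
        if (j >= 2) {
          const char prev = pattern[j - 2];
          // Zero repetitions: skip the previous element together with '+'.
          ok = dp[i][j - 2];
          // One more repetition of the previous element.
          if (!ok && i > 0 && (prev == '.' || prev == text[i - 1])) ok = dp[i - 1][j];
        }
      } else {
        // A literal letter or '.', consuming exactly one character.
        ok = i > 0 && (p == '.' || p == text[i - 1]) && dp[i - 1][j - 1];
      }
      dp[i][j] = ok;
    }
  }
  return dp[n][m];
}
```

Under these assumptions, matchesPatternSketch("GGCGACACCCACCATACAT", "G?G*AC+A*A.") returns true. If '+' were meant to require at least one repetition, the "zero repetitions" branch would need to change.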

Example input:

GGCGACACCCACCATACAT
G?G*AC+A*A.

Example output:

True

K-similar strings

Strings s1 and s2 are k-similar (for some non-negative integer k) if it is possible to swap two letters in s1 exactly k times so that the resulting string is equal to s2.

The program checks k-similarity of two sequences over the alphabet {A, C, G, T}.
The input of the program is a file with two lines, s1 and s2. The output of the program is the smallest k for which s1 and s2 are k-similar. If the strings are not anagrams, the program prints an error message.
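
The smallest k can be found with a breadth-first search over strings reachable by one swap at a time. The following is a rough sketch under stated assumptions, not the project's implementation: the name minSwapsSketch is hypothetical, and it returns -1 for non-anagrams where the program would print an error message.

```cpp
#include <algorithm>
#include <queue>
#include <string>
#include <unordered_set>
#include <utility>

// Hypothetical sketch: minimum number of swaps turning s1 into s2, or -1 if impossible.
int minSwapsSketch(const std::string& s1, const std::string& s2) {
  // Strings with different letter counts can never be made equal by swaps.
  std::string a = s1, b = s2;
  std::sort(a.begin(), a.end());
  std::sort(b.begin(), b.end());
  if (a != b) return -1;

  std::queue<std::pair<std::string, int>> frontier;
  std::unordered_set<std::string> seen{s1};
  frontier.push({s1, 0});

  while (!frontier.empty()) {
    auto [cur, swaps] = frontier.front();
    frontier.pop();
    if (cur == s2) return swaps;

    // Find the first mismatch; it exists because cur != s2.
    std::size_t i = 0;
    while (cur[i] == s2[i]) ++i;

    // Only try swaps that fix position i and do not break a position that is already correct.
    for (std::size_t j = i + 1; j < cur.size(); ++j) {
      if (cur[j] == s2[i] && cur[j] != s2[j]) {
        std::swap(cur[i], cur[j]);
        if (seen.insert(cur).second) frontier.push({cur, swaps + 1});
        std::swap(cur[i], cur[j]);  // restore before trying the next j
      }
    }
  }
  return -1;  // not reached for anagrams
}
```

The search space grows quickly with the number of mismatched positions, so this approach suits short sequences; the pruning in the inner loop keeps it manageable for inputs of the size shown in this section.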

Example input:

GGCGACACC
AGCCGCGAC

Example output:

3

Minimum Window Substring

The program finds the minimum window substring for a sequence over the alphabet {A, C, G, T}. The input is a file containing two lines: s and t. A window substring of s is a substring that contains every character of t, counting duplicates. The output is the shortest window substring. If no window substring exists, the program returns an empty string.
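
A common way to solve this is a sliding window with two indices and a table of outstanding character counts. The sketch below assumes that approach; the name minWindowSketch is hypothetical and the code is not taken from the project.

```cpp
#include <array>
#include <string>

// Hypothetical sketch: shortest substring of s containing all characters of t with multiplicity.
std::string minWindowSketch(const std::string& s, const std::string& t) {
  if (t.empty() || s.size() < t.size()) return "";

  // need[c] > 0 means the window still lacks that many copies of c.
  std::array<int, 256> need{};
  for (unsigned char c : t) ++need[c];

  std::size_t missing = t.size();  // total characters of t not yet covered
  std::size_t bestStart = 0, bestLen = std::string::npos;
  std::size_t left = 0;

  for (std::size_t right = 0; right < s.size(); ++right) {
    const unsigned char in = s[right];
    if (need[in]-- > 0) --missing;  // this character was still required

    // Shrink from the left while the window remains valid.
    while (missing == 0) {
      if (right - left + 1 < bestLen) {
        bestStart = left;
        bestLen = right - left + 1;
      }
      const unsigned char out = s[left++];
      if (++need[out] > 0) ++missing;  // the window now lacks this character
    }
  }
  return bestLen == std::string::npos ? "" : s.substr(bestStart, bestLen);
}
```

Each character enters and leaves the window at most once, so the sketch runs in time linear in the length of s.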

Example input:

GGCGACACCCACCATACAT
TGT

Example output:

GACACCCACCATACAT

License

This project is licensed under the terms of the MIT license. See LICENSE for more information.

Astrodynamic/DNA_Analazer-Algorithms-for-working-with-text-in-CPP
