This repository contains the code for our project on designing and building a custom error correction algorithm for DNA sequences. The algorithm integrates both KmerCo and RobustBF.
We evaluated the algorithm through experiments on DNA sequences from four organisms: the African forest elephant (Loxodonta cyclotis), the Sunda flying lemur (Galeopterus variegatus), the gray mouse lemur (Microcebus murinus), and the common minke whale (Balaenoptera acutorostrata).
The experiments showed that our algorithm reduces the erroneous rate and increases the trustworthy rate. However, the algorithm was not efficient and did not match the standard of existing error-correcting algorithms such as Lighter. The project opens up several avenues for future work that could improve the correctness and efficiency of the overall process.
We undertook this project as part of the Satyendranath Bose Summer Internship Programme 2023 at the National Institute of Technology, Silchar.
- Download the FASTQ files for the four datasets from here. Alternatively, they can also be downloaded from here.
- Retrieve the sequences from the FASTQ file into a text file. The following `awk` and `sed` commands can be used for this:

      $ awk '{if(NR%4==2)print $0}' dataset.fastq > sequence_dataset.txt
      $ sed -i ':a; N; s/\n/ /; ta' sequence_dataset.txt

  The `awk` command keeps the second line of every four-line FASTQ record (the sequence), and the `sed` command joins those lines into a single space-separated line. Extract the sequences for all the datasets in the same way, and keep them in the same folder. The next step runs the algorithm on all the extracted sequences automatically.
- Run the `runner.sh` script with the inputs `path1` and `path2`, where `path1` is the absolute path to the repository folder (containing `main.c`), and `path2` is the absolute path to the folder containing the extracted sequences from the datasets in `.txt` format:

      ./runner.sh path1 path2

  Wait for the script to complete running. It will create a separate results folder inside the directory pointed to by `path1`, and put all the outputs obtained from the various datasets in that folder.
Alternatively, one can manually compile and run `main.c` with a text file containing the extracted sequences from a FASTQ file as input.
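The sequence extraction step described above can also be scripted for a whole folder of FASTQ files. The following is a minimal sketch and is not part of this repository: `extract_all` and its two directory arguments are hypothetical names, it assumes every input file ends in `.fastq` and uses the plain four-line record layout, and it uses `paste` rather than `sed` to join the sequence lines with spaces.

```shell
# Sketch only: extract_all is a hypothetical helper, not a script shipped here.
# Usage: extract_all <folder with .fastq files> <output folder for .txt files>
extract_all() {
    local in_dir="$1" out_dir="$2"
    local fq name

    mkdir -p "$out_dir"

    for fq in "$in_dir"/*.fastq; do
        # If the glob matched nothing, the literal pattern is left; skip it.
        [ -e "$fq" ] || continue

        name="$(basename "$fq" .fastq)"

        # Line 2 of every 4-line FASTQ record holds the bases. Keep those
        # lines, then join them into one space-separated line.
        awk 'NR % 4 == 2' "$fq" | paste -s -d ' ' - > "$out_dir/$name.txt"
    done
}
```

For example, `extract_all /path/to/fastq /path/to/sequences` (both paths are placeholders) writes one `.txt` file per dataset into the second folder, which can then be passed to `runner.sh` as `path2`.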
The references, resources, and research materials we used for the project are listed below:
- Sabuzima Nayak and Ripon Patgiri. 2023. KmerCo: A lightweight K-mer counting technique with a tiny memory footprint. arXiv:2305.07545.
- Sabuzima Nayak and Ripon Patgiri. 2021. robustBF: A High Accuracy and Memory Efficient 2D Bloom Filter.
- Sabuzima Nayak and Ripon Patgiri. 2021. countBF: A general-purpose high accuracy and space efficient counting bloom filter. 2021 17th International Conference on Network and Service Management (CNSM). IEEE, Izmir, Turkey, 355–359.
- Sabuzima Nayak and Ripon Patgiri. 2019. A Review on Role of Bloom Filter on DNA Assembly. IEEE Access 7 (2019), 66939–66954.
- Li Song, Liliana Florea, and Ben Langmead. 2014. Lighter: fast and memory-efficient sequencing error correction without counting. Genome Biology 15, 509.
- GitHub repository of Lighter, a fast and memory-efficient sequencing error corrector
- GitHub repository of RobustBF, a memory-efficient and highly accurate 2D Bloom filter
- GitHub repository of KmerCo, a lightweight k-mer counting technique with a tiny memory footprint
- GitHub repository of CountBF, a general-purpose, high-accuracy, and space-efficient counting Bloom filter
Note: The papers in the "Papers" directory of this repository have been taken directly from the links above and are included only for reference. All credit for those papers belongs to the respective authors and publishers.
We thank our guide and mentor, Dr. Ripon Patgiri, and our supervisor, Ms. Sabuzima Nayak, for supporting and guiding us during the internship. We also thank everyone associated with the Satyendranath Bose Summer Internship at the Department of CSE, NIT Silchar, including the relevant authorities and faculty members.
- Preetodeep Dev, B.Tech (CSE), Assam University, Silchar.
- Swayampakula Kedharnath, B.Tech (CSE), National Institute of Technology, Silchar.