papa-delta/NITS-SNBSIP-2023

A Custom Error Correction Algorithm for DNA Assembly

Abstract

This repository houses the code for our project on designing and building a custom error correction algorithm for DNA sequences. The algorithm integrates both KmerCo and RobustBF.

We evaluated the performance of the algorithm through experiments on DNA sequences from four organisms: the African forest elephant (Loxodonta cyclotis), the Sunda flying lemur (Galeopterus variegatus), the gray mouse lemur (Microcebus murinus), and the common minke whale (Balaenoptera acutorostrata).

Our experiments showed that the algorithm reduces the erroneous rate and increases the trustworthy rate. However, it is not yet as efficient as, or up to the standard of, current error correction algorithms (Lighter, for example). The project opens up several avenues for future work that could improve the correctness and efficiency of the overall process.

We undertook this project as part of the Satyendranath Bose Summer Internship Programme 2023 at the National Institute of Technology, Silchar.

Steps to run the error correction code:

  1. Download the FASTQ files for the four datasets from here. Alternatively, they can also be downloaded from here.

  2. Extract the sequences from each FASTQ file into a text file. The awk command below keeps the sequence line (the second line of every four-line FASTQ record), and the sed command then joins those lines into a single space-separated line:

    $ awk '{if(NR%4==2)print $0}' dataset.fastq > sequence_dataset.txt

    $ sed -i ':a; N; s/\n/ /; ta' sequence_dataset.txt

    Extract the sequences for all the datasets in the same way, and keep the resulting files in the same folder. The next step runs the algorithm on every extracted sequence file automatically.

  3. Run the runner.sh script with two arguments, path1 and path2, where path1 is the absolute path to the repository folder (the one containing main.c) and path2 is the absolute path to the folder containing the sequences extracted from the datasets in .txt format.

    ./runner.sh path1 path2

    Wait for the script to finish. It creates a separate results folder inside the directory pointed to by path1 and places the outputs obtained from each dataset in that folder.

Alternatively, one can manually compile and run main.c with a text file containing the extracted sequences from a FASTQ file as input.
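The extraction in step 2 has to be repeated for each dataset, so it may be convenient to wrap it in a small loop. The following is a minimal sketch, not part of the repository: the function name `extract_sequences`, its two arguments, and the `sequence_<name>.txt` output naming are assumptions made for illustration. It uses `paste` instead of the sed loop to join the lines, which should produce the same space-separated result.

```shell
#!/usr/bin/env bash
# Sketch only: extract_sequences, fastq_dir and out_dir are hypothetical names,
# not something provided by this repository.
extract_sequences() {
    local fastq_dir="$1" out_dir="$2"
    mkdir -p "$out_dir"
    for fq in "$fastq_dir"/*.fastq; do
        [ -e "$fq" ] || continue              # glob matched nothing: skip
        local name
        name="$(basename "$fq" .fastq)"
        # Keep line 2 of every 4-line FASTQ record (the sequence),
        # then join all sequences into one line separated by spaces.
        awk 'NR % 4 == 2' "$fq" | paste -s -d ' ' - > "$out_dir/sequence_${name}.txt"
    done
}
```

Calling `extract_sequences /path/to/fastq /path/to/sequences` would leave one .txt file per dataset in the second folder, which can then be passed as path2 in step 3.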

References

The references, resources, and research materials we used for the project are listed below:

Note: The papers that we've put inside the "Papers" directory of this repository have been taken directly from the aforementioned links, and have been put in here only for reference. All credits for those papers belong to the respective authors and publishers.

Acknowledgements:

We thank our guide and mentor, Dr. Ripon Patgiri, and our supervisor, Ms. Sabuzima Nayak, for supporting and guiding us during the internship. We also thank everyone associated with the Satyendranath Bose Summer Internship at the Department of CSE, NIT Silchar, including the relevant authorities and faculty members.

Authors:
