🧬🛠️De Bruijn Graph-Based Assembler

📝Introduction

This repository contains an implementation of a de Bruijn graph-based assembler, used here to assemble the genome of enterovirus A71. This genome is particularly interesting because it is short (7408 nucleotides), linear, and non-segmented.

The provided fastq file was generated using the ART program [Huang 2011] with the following command:

art_illumina -i eva71.fna -ef -l 100 -f 20 -o eva71 -ir 0 -dr 0 -ir2 0 -dr2 0 -na -qL 41 -rs 1539952693

The reads have maximum quality (41) and contain no insertions or deletions (the -ir, -dr, -ir2, and -dr2 rates are all set to 0). Only the reads corresponding to the 5' -> 3' strand are provided.
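As an illustration of the read format described above, here is a minimal sketch of how a FASTQ file can be read and its quality scores decoded. It assumes four lines per record and the Phred+33 encoding, under which a quality of 41 is written as the character `J`. The function names `read_fastq` and `phred_scores` are hypothetical and are not taken from this repository.

```python
import io


def read_fastq(handle):
    """Yield (sequence, quality string) pairs from a FASTQ stream.

    Assumes exactly four lines per record: header, sequence, '+', quality.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        sequence = handle.readline().strip()
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().strip()
        yield sequence, quality


def phred_scores(quality, offset=33):
    """Decode a quality string into integer scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


# Tiny in-memory example standing in for a real .fq file.
example = io.StringIO("@read1\nACGT\n+\nJJJJ\n")
reads = list(read_fastq(example))
scores = phred_scores(reads[0][1])
print(reads, scores)
```

With a real file, the same generator could be used as `read_fastq(open("data/eva71_plus_perfect.fq"))` to check that every decoded score equals 41.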

In the folder data/, you will find:

  • eva71.fna: the genome of the virus of interest
  • eva71_plus_perfect.fq: reads

🔄Dependency Installation

To set up the environment, run the following commands:

conda env create -f environment.yml
conda activate genome-assembler

🧑‍💻️Usage

Clone the repository and navigate to the project folder:

git clone git@github.com:zhukovanadezhda/genome-assembler.git
cd genome-assembler

Run the assembler with:

python3 debruijn/debruijn.py -i <input fastq file> -k <kmer size> -o <output file>

The options are:

  • -i: single-end fastq file
  • -k: k-mer size (optional, default is 21)
  • -o: file with the contigs
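To show what the k-mer size controls, here is a minimal sketch of de Bruijn graph construction. Each read is cut into overlapping k-mers, and each k-mer becomes an edge from its (k-1)-length prefix to its (k-1)-length suffix, weighted by how often the k-mer occurs. This is a simplified illustration under those assumptions, not the code in debruijn.py; the names `cut_kmers`, `build_graph`, and `walk` are hypothetical.

```python
from collections import Counter, defaultdict


def cut_kmers(read, k):
    """Yield every overlapping k-mer of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_graph(reads, k):
    """Build a de Bruijn graph as {prefix: {suffix: k-mer count}}."""
    counts = Counter(kmer for read in reads for kmer in cut_kmers(read, k))
    graph = defaultdict(dict)
    for kmer, weight in counts.items():
        graph[kmer[:-1]][kmer[1:]] = weight
    return graph


def walk(graph, start):
    """Extend a contig from `start` while the path is unambiguous.

    Stops at a node with zero or several successors, or when a node
    would be visited twice (to avoid looping forever on a cycle).
    """
    contig, node, seen = start, start, {start}
    while len(graph.get(node, {})) == 1:
        (node,) = graph[node]  # the single successor
        if node in seen:
            break
        seen.add(node)
        contig += node[-1]
    return contig


# Two overlapping toy reads; k=4 is only for readability.
reads = ["ACGTGCA", "GTGCATT"]
graph = build_graph(reads, k=4)
contig = walk(graph, "ACG")
print(contig)
```

A larger k makes repeated k-mers less likely, at the cost of requiring longer overlaps between reads.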

⚙️Testing

You can test the program by running:

pytest --cov=debruijn

🎁Example of Usage

To assemble the enterovirus A71 genome, you can run the following command:

python3 debruijn/debruijn.py -i data/eva71_plus_perfect.fq

This will produce a file named contigs.fasta, which contains the contig(s) generated by the assembler. If the program successfully resolves the genome, only one contig should be produced, representing the complete assembly of the enterovirus A71 genome.
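One way to check how many contigs were produced is to parse the output file and count its records. The sketch below assumes a standard FASTA layout (a `>` header line followed by one or more sequence lines); the header text in the example is made up, and the function name `read_fasta` is hypothetical.

```python
import io


def read_fasta(handle):
    """Return {header: sequence} for every record in a FASTA stream."""
    contigs = {}
    name = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
            contigs[name] = ""
        elif name is not None and line:
            # Sequence lines may be wrapped, so concatenate them.
            contigs[name] += line
    return contigs


# In-memory example standing in for contigs.fasta (header is illustrative).
example = io.StringIO(">contig_0 len=8\nACGT\nACGT\n")
contigs = read_fasta(example)
for name, sequence in contigs.items():
    print(name, len(sequence))
```

For the real output, `read_fasta(open("contigs.fasta"))` should return a single entry if the genome was fully resolved.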

Verifying the Assembly with BLAST

To confirm that the assembled genome matches the reference genome, you can use BLAST to compare the generated contig against the reference sequence. Follow these steps:

  1. Install BLAST+ using:
sudo apt install ncbi-blast+
  2. Create a BLAST database from the reference genome:
makeblastdb -in data/eva71.fna -dbtype nucl
  3. Run the BLAST comparison:
cd data
blastn -query ../contigs.fasta -db eva71.fna

Example BLAST Output

Here is an example of what the beginning of the BLAST report might look like:

>EVA71_BrCr_U22521 EVA71
Length=7408

Score = 13649 bits (7391),  Expect = 0.0
Identities = 7391/7391 (100%), Gaps = 0/7391 (0%)
Strand=Plus/Plus

Query  1     GTGGGTTGTCACCCACCCACAGGGTCCACTGGGCGCTAGTACACTGGTATCTCGGTACCT  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  12    GTGGGTTGTCACCCACCCACAGGGTCCACTGGGCGCTAGTACACTGGTATCTCGGTACCT  71
...

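This output shows that the contig aligns to the reference genome with 100% identity and no gaps over 7391 nucleotides, starting at position 12 of the 7408-nucleotide reference. The assembly therefore reproduces the reference exactly, apart from a few nucleotides missing at each end.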

✉️Contact

If you have any questions, feel free to contact me via email: nadiajuckova@gmail.com
