This repository contains an implementation of a de Bruijn graph-based assembler, used here to assemble the genome of enterovirus A71. This genome is particularly interesting because it is short (7408 nucleotides), linear, and non-segmented.
The provided fastq file was generated using the ART program [Huang 2011] with the following command:
art_illumina -i eva71.fna -ef -l 100 -f 20 -o eva71 -ir 0 -dr 0 -ir2 0 -dr2 0 -na -qL 41 -rs 1539952693
The reads have maximum quality (41) and contain no insertions or deletions (all insertion and deletion rates are set to 0 in the command above). Only reads from the 5' -> 3' strand are provided.
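To make the read format concrete, here is a minimal sketch of how such a FASTQ file could be read and its quality strings decoded. It is not the repository's own parser: the function names `read_fastq` and `phred_scores` are hypothetical, and the sketch assumes 4-line records with Phred+33 encoding, under which a score of 41 is written as the character `J`.

```python
from typing import Iterator, List, Tuple


def read_fastq(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (sequence, quality_string) for each 4-line FASTQ record."""
    with open(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:  # end of file
                break
            sequence = handle.readline().strip()
            handle.readline()  # '+' separator line, ignored
            quality = handle.readline().strip()
            yield sequence, quality


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer Phred scores.

    Assumption: Phred+33 encoding (offset 33).
    """
    return [ord(char) - offset for char in quality]
```

Under these assumptions, calling `phred_scores` on each quality string returned by `read_fastq("data/eva71_plus_perfect.fq")` should give only the value 41 if every base has maximum quality.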
In the folder debruijn-tp/data/, you will find:
- eva71.fna: the genome of the virus of interest
- eva71_plus_perfect.fq: the reads
Clone the repository and navigate to the project folder:
git clone git@github.com:zhukovanadezhda/genome-assembler.git
cd genome-assembler
Then set up the environment by running the following commands:
conda env create -f environment.yml
conda activate genome-assembler
Run the assembler with:
python3 debruijn/debruijn.py -i <input fastq file> -k <kmer size> -o <output file>
The options are:
- -i: single-end fastq file
- -k: k-mer size (optional, default is 21)
- -o: file in which to write the contigs
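To illustrate what the k-mer size controls, here is a minimal sketch of the general de Bruijn approach: reads are cut into k-mers, each k-mer becomes an edge between its (k-1)-prefix and its (k-1)-suffix, and a contig is read off by following an unbranched path. This is a simplified illustration, not the code in debruijn.py: the function names are hypothetical, and it ignores k-mer counts, tips, and bubbles.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Set


def cut_kmers(read: str, k: int) -> Iterator[str]:
    """Yield every k-mer of a read, sliding one base at a time."""
    for start in range(len(read) - k + 1):
        yield read[start:start + k]


def build_graph(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for kmer in cut_kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_linear_path(graph: Dict[str, Set[str]], start: str) -> str:
    """Extend a contig from `start` while each node has exactly one successor.

    The walk stops at a branch, at a dead end, or when a node repeats.
    """
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]  # the single successor
        if node in seen:
            break
        seen.add(node)
        contig += node[-1]  # each step adds one new base
    return contig


# Toy example: two overlapping reads from the sequence ATGGCGTGCA, with k = 4.
toy_graph = build_graph(["ATGGCGT", "GCGTGCA"], k=4)
toy_contig = walk_linear_path(toy_graph, "ATG")
```

In this toy case every 3-mer is unique, so the walk recovers the full sequence. With a smaller k, repeated (k-1)-mers create branches and the walk stops earlier, which is why the choice of k matters.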
You can test the program by running:
pytest --cov=debruijn
To assemble the enterovirus A71 genome, you can run the following command:
python3 debruijn/debruijn.py -i data/eva71_plus_perfect.fq
This will produce a file named contigs.fasta, which contains the contig(s) generated by the assembler. If the program successfully resolves the genome, only one contig should be produced, representing the complete assembly of the enterovirus A71 genome.
To confirm that the assembled genome matches the reference genome, you can use BLAST to compare the generated contig against the reference sequence. Follow these steps:
- Install BLAST+ using:
sudo apt install ncbi-blast+
- Create a BLAST database from the reference genome:
makeblastdb -in data/eva71.fna -dbtype nucl
- Run the BLAST comparison:
cd data
blastn -query ../contigs.fasta -db eva71.fna
Here is an example of what the beginning of the BLAST report might look like:
>EVA71_BrCr_U22521 EVA71
Length=7408
Score = 13649 bits (7391), Expect = 0.0
Identities = 7391/7391 (100%), Gaps = 0/7391 (0%)
Strand=Plus/Plus
Query  1   GTGGGTTGTCACCCACCCACAGGGTCCACTGGGCGCTAGTACACTGGTATCTCGGTACCT  60
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  12  GTGGGTTGTCACCCACCCACAGGGTCCACTGGGCGCTAGTACACTGGTATCTCGGTACCT  71
...
This output shows a perfect match (100% identity, no gaps) over the 7391 aligned nucleotides of the 7408-nucleotide reference genome, starting at reference position 12, indicating that the assembly was successful.
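As a quick cross-check that does not require BLAST, the contig can also be searched for directly in the reference sequence. The sketch below is only an illustration under stated assumptions: `read_fasta` and `locate_contig` are hypothetical helper names, both files are assumed to be plain FASTA, and only an exact match on the plus strand is looked for.

```python
from typing import Dict, List


def read_fasta(path: str) -> Dict[str, str]:
    """Return {header: sequence} for every record of a FASTA file."""
    chunks: Dict[str, List[str]] = {}
    name = None
    with open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:]
                chunks[name] = []
            elif name is not None:
                chunks[name].append(line)
    return {header: "".join(parts) for header, parts in chunks.items()}


def locate_contig(contig: str, reference: str) -> int:
    """Return the 1-based start of an exact plus-strand match, or 0 if none."""
    return reference.upper().find(contig.upper()) + 1
```

For example, after loading contigs.fasta and data/eva71.fna with `read_fasta`, calling `locate_contig` on each contig and the reference sequence returns a non-zero position when the contig is an exact substring of the reference. A result of 0 does not prove the assembly is wrong, since this check tolerates no mismatch and ignores the reverse strand.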
If you have any questions, feel free to contact me via email: nadiajuckova@gmail.com