Documentation of code used to conduct de novo genome assembly for the endangered Eulemur rufifrons from sequencing efforts conducted entirely with portable Nanopore sequencing in the host country of Madagascar.
See programs.txt for full list of software programs and versions used for project
Convert FAST5 files to pod5 (pod5 now available output for Dorado)
pod5 convert fast5 ./input/*.fast5 --output converted.pod5
pod5 merge e1.pod5 e2.pod5 --output merged.pod5
Download Dorado model
dorado download --model dna_r9.4.1_e8_sup@v3.6
Run Dorado
dorado basecaller -x "cuda:0" \
dna_r9.4.1_e8_sup@v3.6 \
pod5/eul_${SEQ}/ \
--emit-fastq > fastqs/eul_${SEQ}_pod5_sup.fastq
Concatenate, rename, and gzip
cat *.fastq > eul_all.fq
mv eul_all.fq eul_all.fastq
gzip eul_all.fastq > eul_all.fastq.gz
porechop -i fastqs/eul_all.fastq.gz -o fastqs/eul_trimmed.fastq -t 24
flye --nano-hq fastqs/eul_trimmed.fastq \
--out-dir results/flye \
--genome-size 3.0g \
--asm-coverage 40 \
-t 40
Broke polishing into three distinct stages for memory purposes
mini_align -i fastqs/eul_all.fastq -r results/flye/eulemur_rufifrons_all_assembly.fasta -m \
-p results/medaka/ -t 24
medaka consensus results/medaka/calls_to_draft.bam results/medaka/contigs_batch$BATCH.hdf \
--model r941_min_sup_g507 --batch 200 --threads 2 \
--regions $CONTIGS
medaka stitch results/medaka/*.hdf polished_eul_assembly.fasta
Map reads to assembly
minimap2 -t 20 -ax map-ont $FA $FQ | samtools sort -m 1G -o $OUT/aligned.bam -T tmp.ali
Generate read coverage histogram and manually determine cutoffs for low, medium, and high coverage
purge_haplotigs hist -b $BAM -g $FA -t 16
Analyze coverage by contigs using cutoffs determined above
purge_haplotigs cov -i aligned.bam.gencov -l 5 -m 35 -h 75 \
-o 75high/e_ruf_coverage_stats.csv \
-j 80 -s 80
Run purging pipeline
purge_haplotigs purge -g $FA -c 75high/e_ruf_coverage_stats.csv \
-t 16 \
-o 75high/e_ruf_curated
Build standard database
./kraken2-build --standard --threads 24 --db database/
Run Kraken2
kraken/kraken2 --db kraken/standard_db/ --threads 24 --use-names --report results/kraken/eul_ruf_contam_kraken.report.txt /scratch/lrh85/nanopore/results/repeat/eulemur_rufifrons_all.fasta.masked --classified-out results/kraken/classified/eul_ruf_contam.kraken.classified.out.txt --unclassified-out results/kraken/unclassified/eul_ruf_contam.kraken.unclassified.out.txt
Pull taxids from Kraken2 report and remove from genome
awk '/Bacteria/,/Viruses/' eul_ruf_all_kraken.report.txt | head -n -1 | cut -f 5 > bacteria_taxids.txt
grep -f bacteria_taxids.txt -Fw eul_ruf_all.kraken.classified.out.txt > bacteria_contigs.txt
cat bacteria_contigs.txt | cut -d " " -f1 > contigs_to_remove.txt
awk '(NR==FNR) { toRemove[$1]; next }
/^>/ { p=1; for(h in toRemove) if ( h ~ $0) p=0 }
p' contigs_to_remove.txt eulemur_rufifrons_all.fasta.masked > eulemur_rufirons_filtered_masked.fasta
Removing contigs smaller than 3k with seqkit (2.5.1)
cat eulemur_rufirons_filtered_masked.fasta | seqkit seq -m 3000 > eulruf_trim_filt_purged_3krm.fasta
FA=eulruf_trim_filt_purged_3krm.fasta
/cache/home/lrh85/RepeatMasker/RepeatMasker $FA -pa 54 -species Primate -xsmall
Scaffolding to Lemur catta reference genome (mLemCat1.pri)
OUT=eulruf_scaffold_lcattaREF
REF=ref_genomes/lemur_catta/ncbi_dataset/data/GCF_020740605.2/GCF_020740605.2_mLemCat1.pri_genomic.fa
RES=results/repeat/eulemur_rufifrons_filtered_maksed.fasta
#initial scaffolding
ragtag.py scaffold \
-o results/ragtag/$OUT \
-w -r -t 20 \
--mm2-params '-x map-ont' \
$REF \
$RES
Removing contigs which erroneously mapped to Y chromosome and unpolished mitochondrial genome
awk '(NR==FNR) { toRemove[$1]; next }
/^>/ { p=1; for(h in toRemove) if ( h ~ $0) p=0 }
p' rm.txt ragtag.scaffold.fasta > eulruf_filtered_purge_rmy.fasta
Adding polished mitochondrial genome produced with Flye meta mode (see below)
cat eulruf_filtered_purge_rmy.fasta polished_eulruf_mito_assembly.fasta > eulemur_rufifrons.fasta
Representative scripts to evaluate completeness of generated assembly by comparing to reference genome
quast $FA \
-r ref_genomes/lemur_catta/ncbi_dataset/data/RefSeq/GCF_020740605.2_mLemCat1.pri_genomic.fa \
-g ref_genomes/lemur_catta/ncbi_dataset/data/RefSeq/genomic.gff \
--nanopore fastqs/eul_trimmed.fasta \
-o results/results/quast/eul_filt_purge_3krm_masked_scaff \
--large --no-snps
Representative scripts to evaluate completeness of generated assembly using vertebrate- and primate-specific lineage databases, respectively.
busco -i $FASTA -m genome -l vertebrata_odb10 -c 24 -o $OUT_DIR -f --miniprot
busco -i $FASTA -m genome -l primates_odb10 -c 24 -o $OUT_DIR -f --miniprot
Convert ref and target genome to 2bit
```shell
faToTwoBit genome.fa genome.2bit
Chain target (human) to query (Eulemur rufifrons scaffolded to Lemur_catta)
/cache/home/lrh85/make_lastz_chains/make_chains.py \
--pd results/chains/rufi_scaf_lcatta_chained_human -f \
--cluster_executor slurm \
--cluster_queue p_deenr_1 \
human eulemur \
2bit/hg38.2bit 2bit/eulruf_scaffold_lcattaRS.2bit
Rename chromosomes back
CHAIN=rufi_lcatta
RPATH=/home/lrh85/make_lastz_chains/standalone_scripts/
$RPATH/rename_chromosomes_back.py --rename_table_query eul_rufifrons_chrom_rename_table.tsv hg38.eul_rufifrons.final.chain
Convert target GTF to bed-12
gtfToGenePred ref_genomes/human/hg38.ncbiRefSeq.gtf ref_genomes/human/temp.genePred
genePredToBed ref_genomes/human/temp.genePred ref_genomes/human/hg38.bed
Annotating against set of human genes from https://github.com/hillerlab/TOGA/tree/master/supply
TOGA_PATH=/home/lrh85/TOGA
BED_PATH=ref_genomes/human/GRCh38/toga.transcripts.bed
ISO_PATH=ref_genomes/human/GRCh38/toga.isoforms.hg38.txt
$TOGA_PATH/toga.py results/chains/rufi_scaf_lcatta_chained_human/hg38.eulruf.renamed.final.chain $BED_PATH 2bit/hg38.2bit 2bit/eulruf_scaffold_lcattaRS.2bit \
--pd results/toga/rufifrons --pn eul_rufifrons \
-i $ISO_PATH --chn 20 --cb 8,16,32,64,128,256 \
--cjn 100 \
--nd $TOGA_PATH/nextflow_logs --nc $TOGA_PATH/nextflow_config_files
Interpretting TOGA output by extracting Intact (I), Partially Intact (PI) or Uncertain Loss (UL) genes:
grep PROJECTION loss_summ_data.tsv | awk '{if ($3 == "I" || $3 == "PI" || $3 == "UL") print $2}' | sort > I_PI_UL.txt
sort -k4,4 query_annotation.bed -o query_annotation.sorted.bed
join -1 1 -2 4 I_PI_UL.txt query_annotation.sorted.bed -t $'\t' > query_annotation.I_PI_UL.bed
To generate table of gene loss
grep GENE loss_summ_data.tsv | cut -f3 | sort | uniq -c
Map reads to Lemur catta mitogenome
minimap2 -t 20 -ax map-ont ref_genomes/lemur_catta/ncbi_dataset/data/RefSeq/lcattaRS_mito.fasta fastqs/eul_all.fastq.gz > results/minimap/eulruf/eulruf_mito.sam
Subsample reads
seqkit sample -n 10000 eulruf_mito.fastq -o eulruf_mito_subsample.fastq
flye --nano-hq fastqs/eulruf_mito_subsample.fastq \
--out-dir results/flye/eulruf_mito_subsample \
--genome-size 17k --meta \
-t 40
medaka_consensus -b 100 -i fastqs/eulruf_mito_subsample.fastq -d results/flye/eulruf_mito_subsample/eulruf_mito.fasta -o results/medaka/ -t 24 -m r941_min_sup_g507