Skip to content

Calculate Assembly Statistics

Emily Delorean edited this page Oct 7, 2024 · 2 revisions

Use bbtools to calculate assembly stats. It is important to specify -Xmx5g t=4 or bbtools stats.sh will not run.

#!/bin/bash
#SBATCH --job-name=STATS
#SBATCH --ntasks-per-node=4
#SBATCH -o "%x_%j.o"
#SBATCH -e "%x_%j.e"

################
module load bbtools

echo ${file}
stats.sh -Xmx5g t=4 in=assembly.fa

Heterozygosity

This script compares the heterozygosity between the two genome assemblies using mummer alignments.

#!/bin/bash
#SBATCH --job-name="genome_heterozygosity"
#SBATCH -n 40
#SBATCH -t 1-10:00:00
#SBATCH --output="%x_%j.o" # job standard output file (%j replaced by job id)
#SBATCH --error="%x_%j.e" # job standard error file (%j replaced by job id)

module load mummer

query='HDA149v1.0_assembly.fasta'
ref='HDA330v1.0_assembly.fasta'
out='HDA149_onto_HDA330'

nucmer --prefix=${out} ${ref} ${query}

# Non repeat snps
show-snps -Clr ${out}.delta > ${out}_nonrepeat.snps

# Count the number of SNPS
wc -l HDA149_onto_HDA330_nonrepeat.snps
  • It generatestwo files. 1) the alignment: HDA149_onto_HDA330.delta and 2) the SNPs HDA149_onto_HDA330_nonrepeat.snps
  • Each SNP is on a single line in the .snps file. Use wc -l HDA149_onto_HDA330_nonrepeat.snps to count the number of lines and subtract the 5 header lines to count the number of SNPs. Divide the number of SNPs by the genome size and multiply by 100 to get the percent heterozygosity. For example, if there are 4088858 lines in the snps file and the chile pepper genome is 3.5 Gb then (4,088,853 SNPs / 3,500,000,000 bp) x 100 = 0.1168% heterozygosity.