Scripts and notes related to manuscript:
Stuart KC, Sherwin WB, Austin JJ, Bateson M, Eens M, Brandley MC, Rollins LA 2022. Historical museum samples enable the examination of divergent and parallel evolution during invasion. Molecular Ecology, 31(1): 1836-1852, doi.org/10.1111/mec.16353
This includes:
2021_11_20_notebook_10049: PopGen analysis
2021_11_20_notebook_10040: Outlier analysis
This page contains small vignette's on the following
The following is a summary of some SNP calling pipelines that I have used to process raw DArTseq data, but they may be applied to other types of reduced representation sequencing (RRS) raw data (alongside a reference genome).
Variant calling falls into a few primary steps:
For the first two steps of the process, you can mix and match mapping and variant calling programs, though some are best paired. The below code covers:
- Mapping with BWA mem and aln, then variant calling with Stacks Gstacks.
- Mapping with bowtie, then variant calling with GATK.
Ulitmately I used BWA aln as my mapping software of choice, as this is recommended for Illumina single-end reads shorter than ~70bp, which is good for DArT-seq.
Below is a list of the modules used (and versions).
module load stacks/2.2
module load fastqc/0.11.8
module load bwa/0.7.17
module load bowtie/2.3.5.1
module load java/8u121
module load samtools/1.10
module load picard/2.18.26
module load gatk/4.1.0.0
RAW_DIR=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/rawdata/batch2_split
OUTPUT_DIR=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/processing/samples_batch
BARCODE_DIR=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/processing/barcodes
process_radtags -p ${RAW_DIR}/ -o ${OUTPUT_DIR}/ -b ${BARCODE_DIR}/bardcode_named.txt -r -c -q --renz_1 pstI --renz_2 sphI
Run QC:
fastqc *.fq.gz --outdir=fastqc/
Count reads:
zcat *.fq.gz | echo $((`wc -l`/4))
Place your genome somewhere sensible and index it for BWA. Note that I've put my index files into a folder called bwa, and have given them a prefix (-p). '&> bwa/bwa_index.oe' allows me to keep a record of the output/error messages.
genome_fa=Sturnus_vulgaris_2.3.1.simp.fasta
bwa index -p bwa/stuv $genome_fa &> bwa/bwa_index.oe
bwa_db=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/genome/bwa/stuv
for i in cat sample_names.txt;
do
sample=$i
echo WORKING WITH ${sample}
fq_file=../../samples_batch/merged/${sample}.fq.gz
bam_file=${sample}.bam
bwa_db=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/genome/bwa/stuv
bwa mem -t 16 -M $bwa_db $fq_file | samtools view -bS | samtools sort > $bam_file
done
Check reads mapped successfully:
samtools flagstat SAMPLENAME.bam #sub out SAMPLENAME with a few sample names
Aligning with bwa aln is very similar. Note that in the above example I pipe the sam file straight into a sorted bam file, where as here I don't. It doesn't make too much of a difference when you are working on small data files, but when working on whole genome data moving the data into bam form straight away saves a lot of space.
for sample in cat sample_names.txt;
do
echo WORKING WITH ${sample}
bwa aln -t 16 -B 5 $bwa_db ../../samples_batch/merged/${sample}.fq.gz > ${sample}.sai
bwa samse $bwa_db ${sample}.sai ../../samples_batch/merged/${sample}.fq.gz > ${sample}.sam
samtools view -bS ${sample}.sam | samtools sort -o ${sample}.bam
done
Check reads mapped successfully
samtools flagstat ABGYY.bam
samtools flagstat ATB1.bam
ATB18: 2935527/4370005=67.17% mapped
NOR17: 1787288/2733999=65.37% mapped
As you can see, my historical samples had much lower mapping rates:
samtools flagstat HIST11.bam
samtools flagstat HIST12.bam
samtools flagstat HIST8.bam
HIST11: 146792/2349968=6.25% mapped
HIST12:275346/3076038=8.95% mapped
HIST8: 156867/2843194=5.52%% mapped
For BWA mem and aln, the output is a BAM file. Next, I called variants using gstacks. This is not a necessity; once you have the BAM files you can use different variant calling methods.
gstacks -I ./ -M ../historic_populations.txt -O ./
Now the variant calling is done, we can produce a VCF file using Stacks Populations
populations -P ./ -M ../historic_populations.txt --vcf
Populations is nice because it gives you some summary info at the end of the variant calling process:
Removed 0 loci that did not pass sample/population constraints from 412079 loci.
Kept 412079 loci, composed of 28225204 sites; 20 of those sites were filtered, 243589 variant sites remained.
27345713 genomic sites, of which 846899 were covered by multiple loci (3.1%).
Mean genotyped sites per locus: 68.47bp (stderr 0.00).
Population summary statistics (more detail in populations.sumstats_summary.tsv): aw: 11.272 samples per locus; pi: 0.17294; all/variant/polymorphic sites: 15515519/232371/142127; private alleles: 24046 hist: 4.8985 samples per locus; pi: 0.12483; all/variant/polymorphic sites: 18008520/91625/34830; private alleles: 13284 mv: 11.015 samples per locus; pi: 0.14832; all/variant/polymorphic sites: 19889591/226170/98305; private alleles: 4740 mw: 11.146 samples per locus; pi: 0.16776; all/variant/polymorphic sites: 15527851/232281/128802; private alleles: 13279 nc: 11.442 samples per locus; pi: 0.17365; all/variant/polymorphic sites: 15591084/235304/135530; private alleles: 16176 or: 11.45 samples per locus; pi: 0.16313; all/variant/polymorphic sites: 14915003/232899/117912; private alleles: 7942
You can even do some preliminary filtering on Stacks:
populations -P ./ -M ../historic_populations.txt --vcf -r 0.50 --min_maf 0.01 --write_random_snp -t 8
Though I prefer the filtering options on VCFtools personally, as there are more options. The flag 'write_random_snp' or 'write-single-snp' is useful to run here through as it keep only one SNP per locus.
You will find that this process has very similar steps to the above BWA and Stacks processes. The overall workflow of SNP variant calling is usually quite similar.
Create a Bowtie genome database:
bowtie2-build -f Sturnus_vulgaris_2.3.1.simp.fasta Sturnus_vulgaris_2.3.1.simp
You will also need to create a sequence dictionary for GATK.
java -Xmx10g -jar /apps/picard/2.18.26/picard.jar CreateSequenceDictionary R=Sturnus_vulgaris_2.3.1.simp.fasta O=Sturnus_vulgaris_2.3.1.simp.dict
samtools faidx Sturnus_vulgaris_2.3.1.simp.fasta
The below covers a few steps: bowtie2 mapping; samtools processed a SAM into a sorted BAM file; picard marks duplicates; picard index; GATK haplotype caller. Each of these steps is run independently for each sample.
for i in cat sample_names.txt;
do
echo working with $i;
GENOME=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/genome/Sturnus_vulgaris_2.3.1.simp
READS=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/processing/samples_batch/merged
bowtie2 -p 10 --phred33 --very-sensitive-local -x ${GENOME} -I 149 -X 900 --rg-id "$i" --rg SM:"$i" -U ${READS}/"$i".fq.gz -S "$i".sam
samtools view -bS ${i}.sam | samtools sort -o ${i}.bam
java -Xmx128g -jar /apps/picard/2.18.26/picard.jar MarkDuplicates INPUT="$i".bam OUTPUT="$i"_mark.bam METRICS_FILE="$i"_sort.metrics.txt MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
java -Xmx48g -jar /apps/picard/2.18.26/picard.jar BuildBamIndex I="$i"_mark.bam
gatk --java-options "-Xmx120G" HaplotypeCaller -R ${GENOME}.fasta -I ${i}_mark.bam -O ${i}.g.vcf -ERC GVCF --native-pair-hmm-threads 16;
done
Finally, we combine the individually called haplotypes into one VCF file.
GENOME=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/genome/Sturnus_vulgaris_2.3.1.simp.fasta
gatk --java-options "-Xmx120G" CombineGVCFs \
--reference $GENOME \
--variant SAMPLE_A.g.vcf \
...
--variant SAMPLE_Z.g.vcf \
--output historical_GATK_variantscombined.vcf
And finally perform joint genotyping on the samples
gatk --java-options "-Xmx120g" GenotypeGVCFs \
-R $GENOME \
-V historical_GATK_variantscombined.vcf \
-O historical_GATK_variantsgenotyped.vcf
GATK doesn't produce any summaries like populations, however we can make a quick count of the number of SNPs using the below:
grep -v "#" historical_BWAmem_GATK_variantsgenotyped.vcf | wc -l
Below is a summary of the number of variants I obtained at the end of the above pipeline, both without filtering and with some filtering. Note that the low number of SNPs called by the bowtie-GATK combo was not a result of mapping (bowtie produced high read mapping percentages). This leaves me to believe that GATK didn't do very well with the DArT-seq data, which is a bit strange as I have seen it used quite successfully on many reduced representation data sets. Most of my new projects work with WGS so I have not investigated this further, but will leave this result here in case another finds similar results in their own work and needs perepctive.