Sv4_HistoricalStarlings

Scripts and notes related to manuscript:

Stuart KC, Sherwin WB, Austin JJ, Bateson M, Eens M, Brandley MC, Rollins LA 2022. Historical museum samples enable the examination of divergent and parallel evolution during invasion. Molecular Ecology, 31(1): 1836-1852, doi.org/10.1111/mec.16353

This includes:

2021_11_20_notebook_10049: PopGen analysis

2021_11_20_notebook_10040: Outlier analysis

This page contains small vignette's on the following

DArTseq SNP variant calling

SNP variant calling for raw RRS (DArTseq) data (with reference)

The following is a summary of some SNP calling pipelines that I have used to process raw DArTseq data, but they may be applied to other types of reduced representation sequencing (RRS) raw data (alongside a reference genome).

Variant calling falls into a few primary steps:

Mapping raw reads to a reference genome (or for refernce free variant calling forming the pseudogenome and calling references to this)

Calling the variant SNPs for each sample based on the mapped reads

Filtering the called SNPs, however I won't cover this here as it is quite project specific.

For the first two steps of the process, you can mix and match mapping and variant calling programs, though some are best paired. The below code covers:

Mapping with BWA mem and aln, then variant calling with Stacks Gstacks.
Mapping with bowtie, then variant calling with GATK.

Ulitmately I used BWA aln as my mapping software of choice, as this is recommended for Illumina single-end reads shorter than ~70bp, which is good for DArT-seq.

Software

Below is a list of the modules used (and versions).

module load stacks/2.2
module load fastqc/0.11.8
module load bwa/0.7.17
module load bowtie/2.3.5.1
module load java/8u121
module load samtools/1.10
module load picard/2.18.26
module load gatk/4.1.0.0

Data cleaning: Stacks

RAW_DIR=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/rawdata/batch2_split
OUTPUT_DIR=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/processing/samples_batch
BARCODE_DIR=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/processing/barcodes

process_radtags -p ${RAW_DIR}/ -o ${OUTPUT_DIR}/ -b ${BARCODE_DIR}/bardcode_named.txt -r -c -q --renz_1 pstI --renz_2 sphI

Run QC:

fastqc *.fq.gz --outdir=fastqc/

Count reads:

zcat *.fq.gz | echo $((`wc -l`/4))

Mapping: BWA aln and mem

Place your genome somewhere sensible and index it for BWA. Note that I've put my index files into a folder called bwa, and have given them a prefix (-p). '&> bwa/bwa_index.oe' allows me to keep a record of the output/error messages.

genome_fa=Sturnus_vulgaris_2.3.1.simp.fasta
bwa index -p bwa/stuv $genome_fa &> bwa/bwa_index.oe
bwa_db=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/genome/bwa/stuv

Aligning with bwa mem

Here I have my sample names as a list within the file 'sample_names.txt'. We are working with single-end reads, so provide just one fastq file. If this was paired-end reduced representation or whole genome sequencing, we could provide a second R2 fastq file.

for i in cat sample_names.txt;
do
sample=$i
echo WORKING WITH ${sample}
fq_file=../../samples_batch/merged/${sample}.fq.gz
bam_file=${sample}.bam
bwa_db=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/genome/bwa/stuv
bwa mem -t 16 -M $bwa_db $fq_file | samtools view -bS | samtools sort > $bam_file
done

Check reads mapped successfully:

samtools flagstat SAMPLENAME.bam #sub out SAMPLENAME with a few sample names

Aligning with bwa aln

Aligning with bwa aln is very similar. Note that in the above example I pipe the sam file straight into a sorted bam file, where as here I don't. It doesn't make too much of a difference when you are working on small data files, but when working on whole genome data moving the data into bam form straight away saves a lot of space.

for sample in cat sample_names.txt;
do
echo WORKING WITH ${sample}
bwa aln -t 16 -B 5 $bwa_db ../../samples_batch/merged/${sample}.fq.gz > ${sample}.sai
bwa samse $bwa_db ${sample}.sai ../../samples_batch/merged/${sample}.fq.gz > ${sample}.sam
samtools view -bS ${sample}.sam | samtools sort -o ${sample}.bam
done

Check reads mapped successfully

samtools flagstat ABGYY.bam
samtools flagstat ATB1.bam

ATB18: 2935527/4370005=67.17% mapped
NOR17: 1787288/2733999=65.37% mapped

As you can see, my historical samples had much lower mapping rates:

samtools flagstat HIST11.bam
samtools flagstat HIST12.bam
samtools flagstat HIST8.bam

HIST11: 146792/2349968=6.25% mapped
HIST12:275346/3076038=8.95% mapped
HIST8: 156867/2843194=5.52%% mapped

Variant Calling: gstacks

For BWA mem and aln, the output is a BAM file. Next, I called variants using gstacks. This is not a necessity; once you have the BAM files you can use different variant calling methods.

gstacks -I ./ -M ../historic_populations.txt -O ./

Now the variant calling is done, we can produce a VCF file using Stacks Populations

populations -P ./ -M ../historic_populations.txt --vcf

Populations is nice because it gives you some summary info at the end of the variant calling process:

Removed 0 loci that did not pass sample/population constraints from 412079 loci.
Kept 412079 loci, composed of 28225204 sites; 20 of those sites were filtered, 243589 variant sites remained.

27345713 genomic sites, of which 846899 were covered by multiple loci (3.1%).
Mean genotyped sites per locus: 68.47bp (stderr 0.00).

Population summary statistics (more detail in populations.sumstats_summary.tsv): aw: 11.272 samples per locus; pi: 0.17294; all/variant/polymorphic sites: 15515519/232371/142127; private alleles: 24046 hist: 4.8985 samples per locus; pi: 0.12483; all/variant/polymorphic sites: 18008520/91625/34830; private alleles: 13284 mv: 11.015 samples per locus; pi: 0.14832; all/variant/polymorphic sites: 19889591/226170/98305; private alleles: 4740 mw: 11.146 samples per locus; pi: 0.16776; all/variant/polymorphic sites: 15527851/232281/128802; private alleles: 13279 nc: 11.442 samples per locus; pi: 0.17365; all/variant/polymorphic sites: 15591084/235304/135530; private alleles: 16176 or: 11.45 samples per locus; pi: 0.16313; all/variant/polymorphic sites: 14915003/232899/117912; private alleles: 7942

You can even do some preliminary filtering on Stacks:

populations -P ./ -M ../historic_populations.txt --vcf -r 0.50 --min_maf 0.01 --write_random_snp -t 8

Though I prefer the filtering options on VCFtools personally, as there are more options. The flag 'write_random_snp' or 'write-single-snp' is useful to run here through as it keep only one SNP per locus.

Alternate mapping and variant calling process: Bowtie + GATK

You will find that this process has very similar steps to the above BWA and Stacks processes. The overall workflow of SNP variant calling is usually quite similar.

Create a Bowtie genome database:

bowtie2-build -f Sturnus_vulgaris_2.3.1.simp.fasta Sturnus_vulgaris_2.3.1.simp

You will also need to create a sequence dictionary for GATK.

java -Xmx10g -jar /apps/picard/2.18.26/picard.jar CreateSequenceDictionary R=Sturnus_vulgaris_2.3.1.simp.fasta O=Sturnus_vulgaris_2.3.1.simp.dict
samtools faidx Sturnus_vulgaris_2.3.1.simp.fasta

The below covers a few steps: bowtie2 mapping; samtools processed a SAM into a sorted BAM file; picard marks duplicates; picard index; GATK haplotype caller. Each of these steps is run independently for each sample.


for i in cat sample_names.txt;
do
echo working with $i;
GENOME=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/genome/Sturnus_vulgaris_2.3.1.simp
READS=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/processing/samples_batch/merged
bowtie2 -p 10 --phred33 --very-sensitive-local -x ${GENOME} -I 149 -X 900 --rg-id "$i" --rg SM:"$i" -U ${READS}/"$i".fq.gz -S "$i".sam
samtools view -bS ${i}.sam | samtools sort -o ${i}.bam
java -Xmx128g -jar /apps/picard/2.18.26/picard.jar MarkDuplicates INPUT="$i".bam OUTPUT="$i"_mark.bam METRICS_FILE="$i"_sort.metrics.txt MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
java -Xmx48g -jar /apps/picard/2.18.26/picard.jar BuildBamIndex I="$i"_mark.bam
gatk --java-options "-Xmx120G" HaplotypeCaller -R ${GENOME}.fasta -I ${i}_mark.bam -O ${i}.g.vcf -ERC GVCF --native-pair-hmm-threads 16;
done

Finally, we combine the individually called haplotypes into one VCF file.

GENOME=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/genome/Sturnus_vulgaris_2.3.1.simp.fasta
gatk --java-options "-Xmx120G" CombineGVCFs \
--reference $GENOME \
--variant SAMPLE_A.g.vcf \
...
--variant SAMPLE_Z.g.vcf \
--output historical_GATK_variantscombined.vcf

And finally perform joint genotyping on the samples

gatk --java-options "-Xmx120g" GenotypeGVCFs \
-R $GENOME \
-V historical_GATK_variantscombined.vcf \
-O historical_GATK_variantsgenotyped.vcf

GATK doesn't produce any summaries like populations, however we can make a quick count of the number of SNPs using the below:

grep -v "#" historical_BWAmem_GATK_variantsgenotyped.vcf | wc -l

Closing remarks

Below is a summary of the number of variants I obtained at the end of the above pipeline, both without filtering and with some filtering. Note that the low number of SNPs called by the bowtie-GATK combo was not a result of mapping (bowtie produced high read mapping percentages). This leaves me to believe that GATK didn't do very well with the DArT-seq data, which is a bit strange as I have seen it used quite successfully on many reduced representation data sets. Most of my new projects work with WGS so I have not investigated this further, but will leave this result here in case another finds similar results in their own work and needs perepctive.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
Sv4_vignette		Sv4_vignette
2021_11_20_notebook_10040.pdf		2021_11_20_notebook_10040.pdf
2021_11_20_notebook_10049.pdf		2021_11_20_notebook_10049.pdf
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Sv4_HistoricalStarlings

SNP variant calling for raw RRS (DArTseq) data (with reference)

Software

Data cleaning: Stacks

Mapping: BWA aln and mem

Aligning with bwa mem

Aligning with bwa aln

Variant Calling: gstacks

Alternate mapping and variant calling process: Bowtie + GATK

Closing remarks

About

Releases

Packages

katarinastuart/Sv4_HistoricalStarlings

Folders and files

Latest commit

History

Repository files navigation

Sv4_HistoricalStarlings

SNP variant calling for raw RRS (DArTseq) data (with reference)

Software

Data cleaning: Stacks

Mapping: BWA aln and mem

Aligning with bwa mem

Aligning with bwa aln

Variant Calling: gstacks

Alternate mapping and variant calling process: Bowtie + GATK

Closing remarks

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages