Variant calling was performed according to GATK4 best practices for germline short variant discovery using a reproducible and scalable pipeline written with snakemake. This pipeline makes some minor adaptations to the standard workflow to workaround issues for non-model organisms such as the lack of a reliable reference SNP database for base quality recalibration.
As per GATK4 best practices, HaplotypeCaller
was used to call SNPs and
indels for every sample. Then genomicsDBImport
and GenotypeGVCFs
were
used to do joint-calling for all samples of every scaffold. We excluded
InDels in the final callset.
Our initial variant call set was then further filtered as follows:
1.Remove sites located within 5bp of any InDels.
bcftools filter -g 5 --threads 20 -O z -o Adigi.indel5bp.vcf.gz Adigi.gatk_raw.vcf.gz
SNPs left: 33,617,435
2.A generic hard filtering recommended by
gatk
(QD<10, QUAL<30, SOR>3, FS>60, MQ<40, MQRankSUM<-12.5, ReadPosRankSum<-8
).
In addition, only biallelic SNPs were included from here.
SNPs left: 20,040,500
3.Sites located in simple repeat regions identified by
mdust v2006.10.17
were removed.
mdust reference.fa -c |cut -f1,3,4 > genome.mdust.bed
vcftools --gzvcf Adigi.indel5bp_snp_hardfilter_passed_biallelic.vcf.gz \
--exclude-bed genome.mdust.bed \
--recode --recode-INFO-all --stdout | \
bgzip > Adigi.indel5bp_snp_hardfilter_passed_biallelic_mdust.vcf.gz
SNPs left: 19,981,271
4.Check for the presence of clones, siblings, or other close familial
relationships among the sequenced individuals based on pairwise kinship
coefficient estimated by vcftools --relatedness2
(Manichaikul et
al.,)[https://academic.oup.com/bioinformatics/article/26/22/2867/228512].
vcftools --gzvcf Adigi.indel5bp_snp_hardfilter_passed_biallelic_mdust.vcf.gz --relatedness2 --out indel5bp_snp_hardfilter_passed_biallelic_mdust
There were no pairs of samples with relatedness values indicating close kinship. (First-degree relatives are ~0.25, and 2nd-degree ~0.125, and 3rd degree 0.0625.)
5.We filtered call sets by site depth, missing genotypes and genotype quality, Sites with more than 10% missingness and a mean depth less than 10X or greater than two standard deviations of the mean depth in samples.
vcftools --gzvcf Adigi.indel5bp_snp_hardfilter_passed_biallelic_mdust.vcf.gz \
--max-missing 0.9 --minQ 30 --min-meanDP 10 --max-meanDP 33 \
--minDP 3 --minGQ 20 \
--remove-filtered-geno-all \
--recode --recode-INFO-all \
--stdout | bgzip > Adigi.v2.DPg90gdp3gq30.vcf.gz
SNPs left: 11,731,102
- We removed SNPs with Inbreeding Coefficient < -0.05. According to
this article on the gatk
website
negative values of
InbreedingCoeff
can be used as a proxy for poor mapping.
SNPs left: 9,772,006
- Monomorphic sites (SNPs that are non-reference in all samples) were removed since they contain no information in the following analysis, this account for 115,452 sites in the dataset.
bcftools filter -e 'AC=0 || AC==AN' --threads 20 Adigi.v2.DPg90gdp3gq30.Fis0.05_pass.vcf.gz |bgzip > Adigi.v2.filtered.vcf.gz
The final filtered variant set contained 9,656,554 SNPs
We then checked how many genotypes were missing in each sample and how this relates to overall sample sequencing depth
Figure 1: The statistics of average coverage depth and percentage of missing genotypes in all samples.
Figure 2: SNP distribution for all samples.
The samples with low coverage depth contain more missing genotypes, this
makes sense because we filtered genotypes with very low coverage to
ensure accuracy. Only one sample, RS3_S_250
had an appreciable
genotype missingness (almost 30%). We decided to keep it for the
purposes of population structure analysis but to remove it from analyses
based on phasing (haplotype based analysis).
After performing our population structure analysis (see later sections)
we noticed one single sample BR_5_121_S125_L004
from inshore were
clustered with south offshore samples. Later investigation (see
18.radseq_check) revealed that this was very
likely due sample label switching at some point during the sequencing
process. This sample was included in haplotype phasing
since its genotype calls were accurate and irrespective of its label it
was part of the populations under study. However, it was excluded from
all population structure, demography and selection analyses since these
relied on accurate assignment to sample locations.
Since population structure analyses, SMC analyses and SFS-based demographic modelling all have specific input requirements we performed additional filtering steps for each of those separately, starting from the SNP callset described above. See sections on those analyses for details of additional QC and filtering.