Skip to content

Latest commit

 

History

History
134 lines (101 loc) · 5.47 KB

02.quality_control.md

File metadata and controls

134 lines (101 loc) · 5.47 KB

Variant calling and quality control

Variant calling was performed according to GATK4 best practices for germline short variant discovery using a reproducible and scalable pipeline written with snakemake. This pipeline makes some minor adaptations to the standard workflow to workaround issues for non-model organisms such as the lack of a reliable reference SNP database for base quality recalibration.

As per GATK4 best practices, HaplotypeCaller was used to call SNPs and indels for every sample. Then genomicsDBImport and GenotypeGVCFswere used to do joint-calling for all samples of every scaffold. We excluded InDels in the final callset.

General filtering

Our initial variant call set was then further filtered as follows:

1.Remove sites located within 5bp of any InDels.

bcftools filter -g 5 --threads 20 -O z -o Adigi.indel5bp.vcf.gz Adigi.gatk_raw.vcf.gz

SNPs left: 33,617,435

2.A generic hard filtering recommended by gatk (QD<10, QUAL<30, SOR>3, FS>60, MQ<40, MQRankSUM<-12.5, ReadPosRankSum<-8). In addition, only biallelic SNPs were included from here.

SNPs left: 20,040,500

3.Sites located in simple repeat regions identified by mdust v2006.10.17 were removed.

mdust reference.fa -c |cut -f1,3,4 > genome.mdust.bed

vcftools --gzvcf Adigi.indel5bp_snp_hardfilter_passed_biallelic.vcf.gz \
  --exclude-bed genome.mdust.bed \
  --recode --recode-INFO-all --stdout | \
  bgzip > Adigi.indel5bp_snp_hardfilter_passed_biallelic_mdust.vcf.gz

SNPs left: 19,981,271

4.Check for the presence of clones, siblings, or other close familial relationships among the sequenced individuals based on pairwise kinship coefficient estimated by vcftools --relatedness2 (Manichaikul et al.,)[https://academic.oup.com/bioinformatics/article/26/22/2867/228512].

vcftools --gzvcf Adigi.indel5bp_snp_hardfilter_passed_biallelic_mdust.vcf.gz --relatedness2 --out indel5bp_snp_hardfilter_passed_biallelic_mdust

There were no pairs of samples with relatedness values indicating close kinship. (First-degree relatives are ~0.25, and 2nd-degree ~0.125, and 3rd degree 0.0625.)

5.We filtered call sets by site depth, missing genotypes and genotype quality, Sites with more than 10% missingness and a mean depth less than 10X or greater than two standard deviations of the mean depth in samples.

vcftools --gzvcf Adigi.indel5bp_snp_hardfilter_passed_biallelic_mdust.vcf.gz \
    --max-missing 0.9 --minQ 30 --min-meanDP 10 --max-meanDP 33 \
    --minDP 3 --minGQ 20 \
    --remove-filtered-geno-all \
    --recode --recode-INFO-all \
    --stdout | bgzip > Adigi.v2.DPg90gdp3gq30.vcf.gz

SNPs left: 11,731,102

  1. We removed SNPs with Inbreeding Coefficient < -0.05. According to this article on the gatk website negative values of InbreedingCoeff can be used as a proxy for poor mapping.

SNPs left: 9,772,006

  1. Monomorphic sites (SNPs that are non-reference in all samples) were removed since they contain no information in the following analysis, this account for 115,452 sites in the dataset.
bcftools filter -e 'AC=0 || AC==AN' --threads 20 Adigi.v2.DPg90gdp3gq30.Fis0.05_pass.vcf.gz |bgzip > Adigi.v2.filtered.vcf.gz

The final filtered variant set contained 9,656,554 SNPs

We then checked how many genotypes were missing in each sample and how this relates to overall sample sequencing depth

Figure 1: The statistics of average coverage depth and percentage of missing genotypes in all samples.

Figure 2: SNP distribution for all samples.

Notes

The samples with low coverage depth contain more missing genotypes, this makes sense because we filtered genotypes with very low coverage to ensure accuracy. Only one sample, RS3_S_250 had an appreciable genotype missingness (almost 30%). We decided to keep it for the purposes of population structure analysis but to remove it from analyses based on phasing (haplotype based analysis).

After performing our population structure analysis (see later sections) we noticed one single sample BR_5_121_S125_L004 from inshore were clustered with south offshore samples. Later investigation (see 18.radseq_check) revealed that this was very likely due sample label switching at some point during the sequencing process. This sample was included in haplotype phasing since its genotype calls were accurate and irrespective of its label it was part of the populations under study. However, it was excluded from all population structure, demography and selection analyses since these relied on accurate assignment to sample locations.

Since population structure analyses, SMC analyses and SFS-based demographic modelling all have specific input requirements we performed additional filtering steps for each of those separately, starting from the SNP callset described above. See sections on those analyses for details of additional QC and filtering.