02.quality_control.Rmd

---
title: "Variant calling and quality control"
output: github_document
---

```{r, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
library(tidyverse)
library(cowplot)
library(stringi)
library(ggpubr)
```

Variant calling was performed according to [GATK4 best practices](https://gatk.broadinstitute.org/hc/en-us/sections/360007226651-Best-Practices-Workflows) for germline short variant discovery using a reproducible and scalable pipeline written with [snakemake](https://github.com/bakeronit/snakemake-gatk4-non-model).  This pipeline makes some minor adaptations to the standard workflow to workaround issues for non-model organisms such as the lack of a reliable reference SNP database for base quality recalibration. 

As per GATK4 best practices, `HaplotypeCaller` was used to call SNPs and indels for every sample. Then `genomicsDBImport` and `GenotypeGVCFs `were used to do joint-calling for all samples of every scaffold. We excluded InDels in the final callset.

### General filtering

Our initial variant call set was then further filtered as follows:

1.Remove sites located within 5bp of any InDels.

```bash
bcftools filter -g 5 --threads 20 -O z -o Adigi.indel5bp.vcf.gz Adigi.gatk_raw.vcf.gz
```
**SNPs left: 33,617,435**

2.A generic hard filtering recommended by [gatk](https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants) (`QD<10, QUAL<30, SOR>3, FS>60, MQ<40, MQRankSUM<-12.5, ReadPosRankSum<-8`). In addition, only biallelic SNPs were included from here.

**SNPs left: 20,040,500**

3.Sites located in simple repeat regions identified by `mdust v2006.10.17` were removed.

```bash
mdust reference.fa -c |cut -f1,3,4 > genome.mdust.bed

vcftools --gzvcf Adigi.indel5bp_snp_hardfilter_passed_biallelic.vcf.gz \
  --exclude-bed genome.mdust.bed \
  --recode --recode-INFO-all --stdout | \
  bgzip > Adigi.indel5bp_snp_hardfilter_passed_biallelic_mdust.vcf.gz
```

**SNPs left: 19,981,271**

4.Check for the presence of clones, siblings, or other close familial relationships among the sequenced individuals based on pairwise kinship coefficient estimated by vcftools `--relatedness2` (Manichaikul et al.,)[https://academic.oup.com/bioinformatics/article/26/22/2867/228512].

```bash
vcftools --gzvcf Adigi.indel5bp_snp_hardfilter_passed_biallelic_mdust.vcf.gz --relatedness2 --out indel5bp_snp_hardfilter_passed_biallelic_mdust
```

There were no pairs of samples with relatedness values indicating close kinship. (First-degree relatives are ~0.25, and 2nd-degree ~0.125, and 3rd degree 0.0625.)

5.We filtered call sets by site depth, missing genotypes and genotype quality, Sites with more than 10% missingness and a mean depth less than 10X or greater than two standard deviations of the mean depth in samples.

```bash
vcftools --gzvcf Adigi.indel5bp_snp_hardfilter_passed_biallelic_mdust.vcf.gz \
    --max-missing 0.9 --minQ 30 --min-meanDP 10 --max-meanDP 33 \
    --minDP 3 --minGQ 20 \
    --remove-filtered-geno-all \
    --recode --recode-INFO-all \
    --stdout | bgzip > Adigi.v2.DPg90gdp3gq30.vcf.gz
```

**SNPs left: 11,731,102**

6. We removed SNPs with Inbreeding Coefficient < -0.05. According to [this article on the gatk website](https://gatk.broadinstitute.org/hc/en-us/articles/360035531992-Inbreeding-Coefficient) negative values of `InbreedingCoeff` can be used as a proxy for poor mapping. 

**SNPs left: 9,772,006**

7. Monomorphic sites (SNPs that are non-reference in all samples) were removed since they contain no information in the following analysis, this account for 115,452 sites in the dataset.

```bash
bcftools filter -e 'AC=0 || AC==AN' --threads 20 Adigi.v2.DPg90gdp3gq30.Fis0.05_pass.vcf.gz |bgzip > Adigi.v2.filtered.vcf.gz
```

**The final filtered variant set contained 9,656,554 SNPs**

We then checked how many genotypes were missing in each sample and how this relates to overall sample sequencing depth

```{r}
idepth <- read_tsv("data/hpc/vcf_filtering/Adigi.v2.filtered.idepth")
imiss <- read_tsv("data/hpc/vcf_filtering/Adigi.v2.filtered.imiss")

df <- left_join(idepth,imiss) %>%  
  tidyr::extract(INDV,into="site",regex="([^_]*)",remove = FALSE) %>% 
  tidyr::extract(INDV,into="sample_label",regex="([^_]*_[^_]*_[^_]*)_",remove = FALSE) %>% 
  mutate(loc = case_when(
    site=="AI" ~ "IN",
    site=="BR" ~ "IN",
    site=="AR" ~ "NO",    
    site=="RS1" ~ "SO",
    site=="RS2" ~ "SO",
    site=="RS3" ~ "SO",
  )) %>% 
  arrange(F_MISS) %>% 
  dplyr::mutate(rn = row_number())
```


```{r,fig.align='center',fig.retina=2,fig.height=10,fig.width=12}
source("scripts/color_scheme.R")

p1 <- ggplot(df) + geom_col(aes(x=reorder(sample_label,rn),y=F_MISS,fill=loc)) +
  labs(x="",y="Proportion of genotypes missing") + 
  scale_fill_manual(values=myCol) + 
  theme_pubr() + theme(legend.position = "none",axis.text.x = element_text(angle=90,size=8))

p2 <- ggplot(df) + geom_col(aes(x=reorder(sample_label,rn),y=MEAN_DEPTH,fill=loc)) +
  labs(x="",y="Mean Sequencing Depth") + 
  scale_fill_manual(values=myCol) + 
  theme_pubr() + 
  theme(axis.text.x = element_blank(), legend.position = "none")

plot_grid(p2,p1,nrow = 2,labels = c("A","B"))
```

**Figure 1:** The statistics of average coverage depth and percentage of missing genotypes in all samples.

```{r}
ggsave("figures/qcplot_suppfig1.png",width = 12,height = 8)
```


```{r,fig.align='center',fig.retina=2}
snpden <- read_tsv("data/hpc/vcf_filtering/Adigi.v2.filtered.snpden") %>% filter(SNP_COUNT>10) 
ggplot(snpden) + geom_histogram(aes(x=SNP_COUNT,y=..density..),binwidth = 20,color="black",fill="darkblue") +
  labs(x="SNP count per 10kb",y="Density") + theme_classic()
```

**Figure 2:** SNP distribution for all samples.

### Notes
The samples with low coverage depth contain more missing genotypes, this makes sense because we filtered genotypes with very low coverage to ensure accuracy. Only one sample, `RS3_S_250` had an appreciable genotype missingness (almost 30%). We decided to keep it for the purposes of population structure analysis but to remove it from analyses based on phasing (haplotype based analysis). 

After performing our population structure analysis (see later sections) we noticed one single sample `BR_5_121_S125_L004` from inshore were clustered with south offshore samples. Later investigation (see [18.radseq_check](18.radseq_check.md)) revealed that this was very likely due sample label switching at some point during the sequencing process. This sample was included in [haplotype phasing](03.phasing.md) since its genotype calls were accurate and irrespective of its label it was part of the populations under study.  However, it was excluded from all population structure, demography and selection analyses since these relied on accurate assignment to sample locations.

Since population structure analyses, SMC analyses and SFS-based demographic modelling all have specific input requirements we performed additional filtering steps for each of those separately, starting from the SNP callset described above.  See sections on those analyses for details of additional QC and filtering.