# Snakemake workflow: variant calling using the Genome Analysis Toolkit (GATK) best practices

## Table of contents

- Motivation
- Pipeline sections
  - Step 1 - Compile List of Output Files
  - Step 2 - Gather Genome Data
  - Step 3 - Perform Fastq Quality Control
  - Step 4 - Map Reads to Genome
  - Step 5 - Generate Mapping Quality Statistics
  - Step 6 - Perform Variant Calling
  - Step 7 - Perform Variant Filtering (Hard or Soft)
  - Step 8 - Annotate Variants and Calculate Allele Frequencies
- Project dependencies
- Where to start
- Directory structure
- Running the analysis
- Feedback and Issues
## Motivation

This repository contains a pipeline built with Snakemake for variant calling from Illumina-generated sequences, based on the GATK best practices for variant calling using either hard or soft filters.

Additionally, this pipeline aims to reproduce a recently published pipeline that optimized the GATK4 variant calling pipeline for Plasmodium falciparum (preprint). However, it is not limited to P. falciparum and can be used for any organism of interest.
## Pipeline sections

The pipeline handles paired-end reads. Below are the analysis sections in the Snakefile, grouped by step (a minimal sketch of the rule structure follows the list):
- **Step 1 - Compile List of Output Files**
  - `rule all`: gather all output files
- **Step 2 - Gather Genome Data**
  - `gather_genome_data`: aggregate genome data from the snpeff folder
  - `gatk_genome_dict`: create genome dictionary for gatk tools
  - `samtools_index`: index the genome fasta file
  - `bedops_gff2bed`: convert the genome annotation .gff to .bed file
- **Step 3 - Perform Fastq Quality Control**
  - `trim_reads`: trim adapters and low-quality bases using trimmomatic or fastp
- **Step 4 - Map Reads to Genome**
  - `bwa_index`: generate bwa genome-index files for mapping reads
  - `bwa_mem`: map reads to genome, fixmate, convert .sam to .bam and finally remove artifacts
  - `mark_duplicates`: mark duplicate reads using gatk MarkDuplicatesSpark or Samblaster
- **Step 5 - Generate Mapping Quality Statistics**
  - `samtools_idxstats`: calculate alignment statistics based on the reference sequence
  - `samtools_flagstats`: calculate and summarize various alignment statistics
  - `samtools_depth`: calculate the depth of coverage for each position in the genome
  - `gatk_insert_size_metrics`: collect insert size metrics
  - `gatk_alignment_summary_metrics`: generate a summary of alignment metrics from the BAM file
- **Step 6 - Perform Variant Calling**
  - `gatk_haplotypecaller`: call snps and indels via local re-assembly of haplotypes and generate gVCFs
  - `generate_sample_name_map`: generate a map of sample names and the respective vcf files
  - `gatk_genomics_db_import`: merge gVCFs into one genomic database
  - `gatk_genotype_gvcfs`: perform joint genotyping and generate the final VCF in which all samples have been jointly genotyped
- **Step 7 - Perform Variant Filtering (Hard or Soft)**
  - `bcftools_normalize`: normalize indels, left-align variants, split multiallelic sites into multiple rows and recover multiallelics from multiple rows
  - hard_filter_variants:
    - `gatk_split_variants`: separate snps and indels into separate vcf files
    - `gatk_filter_hard`: apply hard filters to snps and indels
    - `gatk_merge_vcfs`: merge snps and indels into one vcf file
  - soft_filter_variants:
    - `gatk_vqsr_indels`: perform variant quality score recalibration on indels
    - `gatk_apply_vqsr_indels`: apply variant quality score recalibration to indels
    - `gatk_vqsr_snps`: perform variant quality score recalibration on snps
    - `gatk_apply_vqsr_snps`: apply variant quality score recalibration to snps
  - `gatk_filter_pass`: filter out variants that do not pass the hard or soft filters
- **Step 8 - Annotate Variants and Calculate Allele Frequencies**
  - `snpeff_annotate_variants`: variant annotation and functional effect prediction
  - `gatk_variants_to_table`: extract variant information into a table
  - `python`: calculate allele frequencies and transform the summary table from wide to long format
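For orientation, below is a minimal, hypothetical sketch of how rules in a Snakemake workflow of this shape fit together. The sample list, paths and command line are illustrative assumptions, not the pipeline's actual rule definitions (those live in the Snakefile):

```python
# Hypothetical sketch of the Snakefile structure; names and paths are assumptions.
SAMPLES = ["sampleA", "sampleB"]  # the real pipeline detects these automatically

rule all:
    # Top-level target: compile the list of output files to be produced.
    input:
        expand("output/mapped/{sample}.bam", sample=SAMPLES)

rule bwa_mem:
    # Map paired-end reads to the genome and sort the alignments.
    input:
        r1="input/fastq/{sample}_R1.fastq.gz",
        r2="input/fastq/{sample}_R2.fastq.gz",
        genome="output/genome/genome.fasta",
    output:
        "output/mapped/{sample}.bam",
    threads: 4
    shell:
        "bwa mem -t {threads} {input.genome} {input.r1} {input.r2}"
        " | samtools sort -o {output} -"
```

Each rule declares its inputs and outputs, and Snakemake resolves the execution order from those dependencies.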
## Project dependencies

- Conda - an open-source package and environment management system that runs on various platforms, including Windows, macOS and Linux
- Snakemake - a workflow management system that aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment, together with a clean and modern specification language in Python style
## Where to start

- Install Conda for your operating system (the pipeline is currently tested on Linux and macOS); one common route is sketched below.
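As an illustrative example, the Miniconda installer can be downloaded and run as follows; this assumes Linux on x86_64, so pick the installer that matches your platform from the Miniconda download page:

```bash
# Download and run the Miniconda installer (Linux x86_64 shown; adjust for your OS).
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```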
- Clone this project using the following command in your terminal:

  ```bash
  git clone https://github.com/kevin-wamae/snakemake-illuminaVarGATK.git
  ```
- Navigate into the cloned directory using the command below. This will be the root directory of the project:

  ```bash
  cd snakemake-illuminaVarGATK
  ```
- Note: All subsequent commands should be run from the root directory of this project. However, users can modify the scripts to their liking.
## Directory structure

Below is the default directory structure:

- config/ - contains the Snakemake-configuration files
- input/ - contains input files
  - bed/ - contains the bed files for specifying the intervals of interest
  - fastq/ - contains the FastQ files
  - known_sites/ - contains the positive-training dataset for variant filtering
- output/ - contains numbered output directories from the analysis
- workflow/ - contains the Snakemake workflow files
  - envs/ - contains the Conda environment-configuration files
  - scripts/ - contains the scripts used in the pipeline

```
.
|-- config
|-- input
|   |-- bed
|   |-- fastq
|   `-- known_sites
|-- output
`-- workflow
    |-- envs
    `-- scripts
```
- This pipeline uses `glob_wildcards()` to match FastQ sample names and mate files in the `input/fastq/` directory, using the naming convention below:
  - `reads_R1.fastq.gz` = first mate
  - `reads_R2.fastq.gz` = second mate
- If you have a different naming convention (e.g. this), you can rename the FastQ files by executing the Python script in the workflow/scripts/ directory:

  ```bash
  python workflow/scripts/fastq_rename.py
  ```
- Therefore, the user can deposit their FastQ files in the `input/fastq/` directory or edit the `config/config.yaml` file to point to the directory with FastQ files, and the pipeline will automatically match the sample names and mate files. A sketch of how this matching works is shown below.
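For illustration, this is roughly how Snakemake's `glob_wildcards()` recovers sample names from the naming convention above; the exact pattern used in the Snakefile may differ:

```python
from snakemake.io import glob_wildcards

# Extract sample names from paired FastQ files, e.g.
# input/fastq/reads_R1.fastq.gz -> sample name "reads".
# The path pattern here is an assumption based on the convention above.
(samples,) = glob_wildcards("input/fastq/{sample}_R1.fastq.gz")
print(samples)  # e.g. ["reads"]
```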
- The configuration file (`config/config.yaml`) specifies additional resources and can be modified to suit one's needs (see the sketch after this list), such as:
  - Input files
  - Output directories
  - The option to choose between tools and methods, e.g.:
    - `fastp` or `trimmomatic` for read trimming
    - `gatk MarkDuplicatesSpark` or `samblaster` for marking duplicates
    - hard or soft filtering of variants
  - Other parameters, such as the number of threads to use
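As a purely hypothetical sketch, entries of this kind might look as follows; the actual key names and values are defined in `config/config.yaml` and should be checked there:

```yaml
# Illustrative sketch only; the real key names in config/config.yaml may differ.
threads: 4                 # number of cores to use for each step
fastq_dir: input/fastq/    # directory containing the FastQ files
read_trimming: fastp       # alternative: trimmomatic
mark_duplicates: gatk      # alternative: samblaster
variant_filtering: hard    # alternative: soft
```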
## Running the analysis

After navigating into the root directory of the project, run the analysis by executing the following commands in your terminal:
- Create a conda analysis environment by running the command below in your terminal. This will create a conda environment named `variant-calling-gatk` and install Snakemake, SnpEff and Graphviz (for visualizing the workflow) in the environment. Note: This only needs to be done once.

  ```bash
  conda env create --file workflow/envs/environment.yaml
  ```
- Activate the conda environment by running the command below in your terminal. Note: This needs to be done every time you exit and restart your terminal and want to re-run this pipeline.

  ```bash
  conda activate variant-calling-gatk
  ```
- Execute the shell script below to create the SnpEff database for variant annotation. This will download the P. falciparum genome data from PlasmoDB and create a database in the output/ directory. Note: This is an important step because the genome-FASTA and GFF files are required for read mapping and variant calling. The script can also be modified to suit one's needs, such as downloading the genome files for your organism of interest:

  ```bash
  bash workflow/scripts/create_snpeff_db.sh
  ```
- Finally, execute the whole Snakemake pipeline by running the following command in your terminal:

  ```bash
  snakemake --use-conda --cores 2 --jobs 1
  ```
- This will run the whole pipeline using a maximum of two cores and one job in parallel. The `--cores` flag specifies the number of cores to use for each job and the `--jobs` flag specifies the number of jobs to run in parallel.
- If you want to run the pipeline using more resources, you can increase the number of cores and jobs. For example, to run the pipeline using 4 cores and 2 jobs in parallel, run the following command:

  ```bash
  snakemake --use-conda --cores 4 --jobs 2
  ```
- Additionally, you can change the `threads` entry in line 3 of the configuration file (`config/config.yaml`) to specify the number of cores to use for each step in the pipeline.
- Once the analysis is complete, look through the output/ directory to view the results of the analysis.
- Summary statistics can be generated with standalone scripts in the `workflow/scripts/` directory:
  - To do this, create a conda environment with the following command:

    ```bash
    conda env create --file workflow/envs/variant-calling-stats.yaml
    ```
  - Activate the conda environment by running the following command:

    ```bash
    conda activate variant-calling-stats
    ```
  - To generate a summary of the raw reads, run the following command and look through the `stats_1_raw_fastq.tsv` file in the project directory:

    ```bash
    python workflow/scripts/get_raw_fastq_stats.py
    ```
  - To generate a summary of the trimmed reads, run the following command and look through the `stats_2_trimmed_fastq.tsv` file:

    ```bash
    python workflow/scripts/get_trimmed_fastq_stats.py
    ```
  - To generate a summary of the mapped reads, run the following command and look through the `stats_3_mapped_reads.tsv` file:

    ```bash
    python workflow/scripts/get_mapped_reads_stats.py
    ```
  - To generate a summary of the variants called, run the following command and look through the `stats_4_variant_calling.tsv` file:

    ```bash
    bash workflow/scripts/get_variant_calling_stats.sh
    ```
  - Exit this conda environment by running the following command (`conda deactivate` takes no environment name; it deactivates the currently active environment):

    ```bash
    conda deactivate
    ```
- Finally, you can deactivate the variant-calling conda environment by running the following command:

  ```bash
  conda deactivate
  ```
## Feedback and Issues

Report any issues or bugs by opening an issue here, or contact me via email at wamaekevin[at]gmail.com