# Snakemake workflow: variant calling using the Genome Analysis Toolkit (GATK) best practices

## Table of contents

- Motivation
- Pipeline sections
  - Step 1 - Compile List of Output Files
  - Step 2 - Gather Genome Data
  - Step 3 - Perform Fastq Quality Control
  - Step 4 - Map Reads to Genome
  - Step 5 - Generate Mapping Quality Statistics
  - Step 6 - Perform Variant Calling
  - Step 7 - Perform Variant Filtering (Hard or Soft)
  - Step 8 - Annotate Variants and Calculate Allele Frequencies
- Project dependencies
- Where to start
- Directory structure
- Running the analysis
- Feedback and Issues
## Motivation

This repository contains a pipeline built with Snakemake for variant calling from Illumina-generated sequences, based on the GATK best practices for variant calling using either hard or soft filters.

Additionally, this pipeline aims to reproduce a recently published pipeline that optimized the GATK4 variant calling pipeline for Plasmodium falciparum (preprint). However, it is not limited to P. falciparum and can be used for any organism of interest.
## Pipeline sections

The pipeline handles paired-end reads. Below are the analysis sections in the Snakefile, grouped by step (a minimal sketch of the rule structure follows the list):
- **Step 1 - Compile List of Output Files**
  - `rule all`: gather all output files
- **Step 2 - Gather Genome Data**
  - `gather_genome_data`: aggregate genome data from the snpeff folder
  - `gatk_genome_dict`: create genome dictionary for gatk tools
  - `samtools_index`: index the genome fasta file
  - `bedops_gff2bed`: convert the genome annotation .gff to .bed file
- **Step 3 - Perform Fastq Quality Control**
  - `trim_reads`: trim adapters and low-quality bases using trimmomatic or fastp
- **Step 4 - Map Reads to Genome**
  - `bwa_index`: generate bwa genome-index files for mapping reads
  - `bwa_mem`: map reads to genome, fixmate, convert .sam to .bam and finally remove artifacts
  - `mark_duplicates`: mark duplicate reads using gatk MarkDuplicatesSpark or Samblaster
- **Step 5 - Generate Mapping Quality Statistics**
  - `samtools_idxstats`: calculate alignment statistics based on the reference sequence
  - `samtools_flagstats`: calculate and summarize various alignment statistics
  - `samtools_depth`: calculate the depth of coverage for each position in the genome
  - `gatk_insert_size_metrics`: collect insert size metrics
  - `gatk_alignment_summary_metrics`: generate a summary of alignment metrics from the BAM file
- **Step 6 - Perform Variant Calling**
  - `gatk_haplotypecaller`: call snps and indels via local re-assembly of haplotypes and generate gVCFs
  - `generate_sample_name_map`: generate a map of sample names and the respective vcf files
  - `gatk_genomics_db_import`: merge gVCFs into one genomic database
  - `gatk_genotype_gvcfs`: perform joint genotyping and generate the final VCF in which all samples have been jointly genotyped
- **Step 7 - Perform Variant Filtering (Hard or Soft)**
  - `bcftools_normalize`: normalize indels, left-align variants, split multiallelic sites into multiple rows and recover multiallelics from multiple rows
  - hard_filter_variants:
    - `gatk_split_variants`: separate snps and indels into separate vcf files
    - `gatk_filter_hard`: apply hard filters to snps and indels
    - `gatk_merge_vcfs`: merge snps and indels into one vcf file
  - soft_filter_variants:
    - `gatk_vqsr_indels`: perform variant quality score recalibration on indels
    - `gatk_apply_vqsr_indels`: apply variant quality score recalibration to indels
    - `gatk_vqsr_snps`: perform variant quality score recalibration on snps
    - `gatk_apply_vqsr_snps`: apply variant quality score recalibration to snps
  - `gatk_filter_pass`: filter out variants that do not pass the hard or soft filters
- **Step 8 - Annotate Variants and Calculate Allele Frequencies**
  - `snpeff_annotate_variants`: variant annotation and functional effect prediction
  - `gatk_variants_to_table`: extract variant information into a table
  - `python`: calculate allele frequencies and transform the summary table from wide to long format
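For orientation, below is a minimal, hypothetical sketch of how rules in a Snakemake workflow of this shape fit together. The sample list, paths and command line are illustrative assumptions, not the pipeline's actual rule definitions (those live in the Snakefile):

```python
# Hypothetical sketch of the Snakefile structure; names and paths are assumptions.
SAMPLES = ["sampleA", "sampleB"]  # the real pipeline detects these automatically

rule all:
    # Top-level target: compile the list of output files to be produced.
    input:
        expand("output/mapped/{sample}.bam", sample=SAMPLES)

rule bwa_mem:
    # Map paired-end reads to the genome and sort the alignments.
    input:
        r1="input/fastq/{sample}_R1.fastq.gz",
        r2="input/fastq/{sample}_R2.fastq.gz",
        genome="output/genome/genome.fasta",
    output:
        "output/mapped/{sample}.bam",
    threads: 4
    shell:
        "bwa mem -t {threads} {input.genome} {input.r1} {input.r2}"
        " | samtools sort -o {output} -"
```

Each rule declares its inputs and outputs, and Snakemake resolves the execution order from those dependencies.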
## Project dependencies

- Conda - an open-source package and environment management system that runs on various platforms, including Windows, macOS and Linux
- Snakemake - a workflow management system that aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment, together with a clean and modern specification language in Python style
## Where to start

- Install Conda for your operating system (the pipeline is currently tested on Linux and macOS); one common route is sketched below.
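As an illustrative example, the Miniconda installer can be downloaded and run as follows; this assumes Linux on x86_64, so pick the installer that matches your platform from the Miniconda download page:

```bash
# Download and run the Miniconda installer (Linux x86_64 shown; adjust for your OS).
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```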
- Clone this project using the following command in your terminal:

  ```bash
  git clone https://github.com/kevin-wamae/snakemake-illuminaVarGATK.git
  ```
- Navigate into the cloned directory using the command below. This will be the root directory of the project:

  ```bash
  cd snakemake-illuminaVarGATK
  ```
- Note: All subsequent commands should be run from the root directory of this project. However, users can modify the scripts to their liking.
## Directory structure

Below is the default directory structure:

- config/ - contains the Snakemake-configuration files
- input/ - contains input files
  - bed/ - contains the bed files for specifying the intervals of interest
  - fastq/ - contains the FastQ files
  - known_sites/ - contains the positive-training dataset for variant filtering
- output/ - contains numbered output directories from the analysis
- workflow/ - contains the Snakemake workflow files
  - envs/ - contains the Conda environment-configuration files
  - scripts/ - contains the scripts used in the pipeline

```
.
|-- config
|-- input
|   |-- bed
|   |-- fastq
|   `-- known_sites
|-- output
`-- workflow
    |-- envs
    `-- scripts
```
- This pipeline uses `glob_wildcards()` to match FastQ sample names and mate files in the `input/fastq/` directory, using the naming convention below:
  - `reads_R1.fastq.gz` = first mate
  - `reads_R2.fastq.gz` = second mate
- If you have a different naming convention (e.g. this), you can rename the FastQ files by executing the Python script in the workflow/scripts/ directory:

  ```bash
  python workflow/scripts/fastq_rename.py
  ```
- Therefore, the user can deposit their FastQ files in the `input/fastq/` directory or edit the `config/config.yaml` file to point to the directory with FastQ files, and the pipeline will automatically match the sample names and mate files. A sketch of how this matching works is shown below.
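For illustration, this is roughly how Snakemake's `glob_wildcards()` recovers sample names from the naming convention above; the exact pattern used in the Snakefile may differ:

```python
from snakemake.io import glob_wildcards

# Extract sample names from paired FastQ files, e.g.
# input/fastq/reads_R1.fastq.gz -> sample name "reads".
# The path pattern here is an assumption based on the convention above.
(samples,) = glob_wildcards("input/fastq/{sample}_R1.fastq.gz")
print(samples)  # e.g. ["reads"]
```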
- The configuration file (`config/config.yaml`) specifies additional resources and can be modified to suit one's needs (see the sketch after this list), such as:
  - Input files
  - Output directories
  - The option to choose between tools and methods, e.g.:
    - `fastp` or `trimmomatic` for read trimming
    - `gatk MarkDuplicatesSpark` or `samblaster` for marking duplicates
    - hard or soft filtering of variants
  - Other parameters, such as the number of threads to use
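As a purely hypothetical sketch, entries of this kind might look as follows; the actual key names and values are defined in `config/config.yaml` and should be checked there:

```yaml
# Illustrative sketch only; the real key names in config/config.yaml may differ.
threads: 4                 # number of cores to use for each step
fastq_dir: input/fastq/    # directory containing the FastQ files
read_trimming: fastp       # alternative: trimmomatic
mark_duplicates: gatk      # alternative: samblaster
variant_filtering: hard    # alternative: soft
```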
## Running the analysis

After navigating into the root directory of the project, run the analysis by executing the following commands in your terminal:
- Create a conda analysis environment by running the command below in your terminal. This will create a conda environment named `variant-calling-gatk` and install Snakemake, SnpEff and Graphviz (for visualizing the workflow) in the environment. Note: This only needs to be done once.

  ```bash
  conda env create --file workflow/envs/environment.yaml
  ```
- Activate the conda environment by running the command below in your terminal. Note: This needs to be done every time you exit and restart your terminal and want to re-run this pipeline.

  ```bash
  conda activate variant-calling-gatk
  ```
- Execute the shell script below to create the SnpEff database for variant annotation. This will download the P. falciparum genome data from PlasmoDB and create a database in the output/ directory. Note: This is an important step because the genome-FASTA and GFF files are required for read mapping and variant calling. The script can also be modified to suit one's needs, such as downloading the genome files for your organism of interest:

  ```bash
  bash workflow/scripts/create_snpeff_db.sh
  ```
- Finally, execute the whole Snakemake pipeline by running the following command in your terminal:

  ```bash
  snakemake --use-conda --cores 2 --jobs 1
  ```
- This will run the whole pipeline using a maximum of two cores and one job in parallel. The `--cores` flag specifies the number of cores to use for each job and the `--jobs` flag specifies the number of jobs to run in parallel.
- If you want to run the pipeline using more resources, you can increase the number of cores and jobs. For example, to run the pipeline using 4 cores and 2 jobs in parallel, run the following command:

  ```bash
  snakemake --use-conda --cores 4 --jobs 2
  ```
- Additionally, you can change the `threads` entry in line 3 of the configuration file (`config/config.yaml`) to specify the number of cores to use for each step in the pipeline.
- Once the analysis is complete, look through the output/ directory to view the results of the analysis.
- Summary statistics can be generated with standalone scripts in the `workflow/scripts/` directory:
  - To do this, create a conda environment with the following command:

    ```bash
    conda env create --file workflow/envs/variant-calling-stats.yaml
    ```
  - Activate the conda environment by running the following command:

    ```bash
    conda activate variant-calling-stats
    ```
  - To generate a summary of the raw reads, run the following command and look through the `stats_1_raw_fastq.tsv` file in the project directory:

    ```bash
    python workflow/scripts/get_raw_fastq_stats.py
    ```
  - To generate a summary of the trimmed reads, run the following command and look through the `stats_2_trimmed_fastq.tsv` file:

    ```bash
    python workflow/scripts/get_trimmed_fastq_stats.py
    ```
  - To generate a summary of the mapped reads, run the following command and look through the `stats_3_mapped_reads.tsv` file:

    ```bash
    python workflow/scripts/get_mapped_reads_stats.py
    ```
  - To generate a summary of the variants called, run the following command and look through the `stats_4_variant_calling.tsv` file:

    ```bash
    bash workflow/scripts/get_variant_calling_stats.sh
    ```
  - Exit this conda environment by running the following command (`conda deactivate` takes no environment name; it deactivates the currently active environment):

    ```bash
    conda deactivate
    ```
- Finally, you can deactivate the variant-calling conda environment by running the following command:

  ```bash
  conda deactivate
  ```
## Feedback and Issues

Report any issues or bugs by opening an issue here, or contact me via email at wamaekevin[at]gmail.com