Genome Analysis Toolkit (GATK)

The GATK is the industry standard for identifying SNPs and indels in germline DNA and RNAseq data. Its scope is now expanding to include somatic short variant calling, and to tackle copy number (CNV) and structural variation (SV). In addition to the variant callers themselves, the GATK also includes many utilities to perform related tasks such as processing and quality control of high-throughput sequencing data, and bundles the popular Picard toolkit.

It is a collection of command-line tools for analyzing high-throughput seqiencing data with a primary focus on variant discovery.Tools can be used individually or chained together into complete workflows.

This page contains the GATK best practice pipelines.

Reference data pre-processing

Download Hg38 Reference Genome and unzip it

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz

BWA requires an idenxed reference genome. This step will produce 5 files (amb, ann, bwt, pac and sa) needed for the alignment of the reads to the genome. -a (algorithm) use bwtsw which works for most genomes unless they are under a certain size.

bwa index -a bwtsw hg38.fa or bwa index hg38.fa

Create sequence dictionary using the Picard tool CreateSequenceDictionary.jar. The tool will create and name the output as hg38.dict.

java -jar CreateSequenceDictionary.jar \
REFERENCE=hg38.fa

Index the reference genome using samtools

samtools faidx hg38.fa

Order of process

Downloading [https://gatk.broadinstitute.org/hc/en-us/articles/360036194592-Getting-started-with-GATK4]

Optimizations for code --java-options

Java heap size

 -Xms - set initial Java heap size
 -Xmx - set maximum Java heap size
 -Xss - set java thread stack size

-Xms4000m means the initial heap size will be 4Gb and this helps to not readjust later on.

References

[https://gatk.broadinstitute.org/hc/en-us/categories/360002302312]

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
Data pre-processing for variant discovery		Data pre-processing for variant discovery
Map and clean up short read sequence data efficiently		Map and clean up short read sequence data efficiently
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Genome Analysis Toolkit (GATK)

Reference data pre-processing

Optimizations for code --java-options

References

About

Releases

Packages

marianakhoul/GATK

Folders and files

Latest commit

History

Repository files navigation

Genome Analysis Toolkit (GATK)

Reference data pre-processing

Optimizations for code --java-options

References

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages