Deprecated. Updated benchmarks available at
This document describes the capabilities of the Sentieon DNAseq pipeline in matching different versions of the GATK germline pipelines. If you have any additional questions, please visit or contact technical support at Sentieon Inc. at
FASTQ files of NA12878 were downloaded from the FTP site:
Hg38 and other databases were downloaded from GATK resource bundle.
See here:
Arguments | File |
fasta | Homo_sapiens_assembly38.fasta |
known_Mills | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz |
known_1000G | 1000G_phase1.snps.high_confidence.hg38.vcf.gz |
known_dbsnp | dbsnp_146.hg38.vcf.gz |
calling_intervals_list | wgs_calling_regions.hg38.interval_list |
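The commands in this document refer to these files through shell variables such as `$fasta` and `$known_dbsnp`. A minimal sketch of the corresponding definitions follows. It assumes the resource bundle files sit in the current working directory, which the text does not state. `$fastq1` and `$fastq2` must also be set to the downloaded read files, whose names are not given here.

```shell
# Variable names and file names are taken from the table above.
# Assumption: the resource bundle files are in the current directory.
fasta=Homo_sapiens_assembly38.fasta
known_Mills=Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
known_1000G=1000G_phase1.snps.high_confidence.hg38.vcf.gz
known_dbsnp=dbsnp_146.hg38.vcf.gz
calling_intervals_list=wgs_calling_regions.hg38.interval_list

# Warn (without aborting) about any file that is not where the commands expect it.
for f in "$fasta" "$known_Mills" "$known_1000G" "$known_dbsnp" "$calling_intervals_list"; do
  [ -e "$f" ] || echo "missing: $f" >&2
done
```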
BWA 0.7.15-r1140:
bwa mem -M -Y -K 10000000 \
-R '@RG\tID:NA12878\tSM:NA12878\tPL:ILLUMINA' \
$fasta $fastq1 $fastq2 | \
samtools sort -o sorted.bam
samtools index sorted.bam
Sentieon:
sentieon bwa mem -M -Y -K 10000000 \
-R '@RG\tID:NA12878\tSM:NA12878\tPL:ILLUMINA' \
$fasta $fastq1 $fastq2 | \
sentieon util sort -i - \
-r $fasta -o sorted.bam --sam2bam
Picard:
java -jar picard.jar MarkDuplicates \
I=sorted.bam \
O=deduplicated.bam \
M=duplication.metrics
GATK 4:
gatk MarkDuplicates \
-I sorted.bam \
-O deduplicated.bam \
-M duplication.metrics
Sentieon:
sentieon driver -r $fasta -i sorted.bam \
--algo LocusCollector --fun score_info score.txt.gz
sentieon driver -r $fasta -i sorted.bam \
--algo Dedup --rmdup --score_info score.txt.gz deduped.bam
GATK 3.7/3.8:
java -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-I deduplicated.bam \
-R $fasta \
--knownSites $known_Mills \
--knownSites $known_1000G \
--knownSites $known_dbsnp \
-o bqsr.grp
java -jar GenomeAnalysisTK.jar \
-T PrintReads \
-R $fasta \
-I deduplicated.bam \
-BQSR bqsr.grp \
-o recalibrated.bam
GATK 4.0/4.1:
gatk BaseRecalibrator \
-I deduplicated.bam \
-R $fasta \
--known-sites $known_Mills \
--known-sites $known_1000G \
--known-sites $known_dbsnp \
-O bqsr.grp
gatk ApplyBQSR \
-R $fasta \
-I deduplicated.bam \
--bqsr-recal-file bqsr.grp \
-O recalibrated.bam
Sentieon:
sentieon driver -r $fasta \
-i deduped.bam \
--algo QualCal \
-k $known_dbsnp \
-k $known_1000G \
-k $known_Mills \
recal_data.table
*Sentieon variant callers can perform the recalibration on the fly using a pre-recalibration BAM plus the recalibration table. A recalibrated BAM can be generated with the ReadWriter algo.
# This step is optional
sentieon driver -i deduped.bam -q recal_data.table --algo ReadWriter recaled.bam
Command line to compare GATK and Sentieon DNAseq results (the GATK output is used as the baseline):
hap.py \
GATK.vcf.gz \
Sentieon.vcf.gz \
-o output_dir \
-r Homo_sapiens_assembly38.fasta \
--engine=vcfeval \
--engine-vcfeval-template hs38.sdf
Command line:
GATK 3.7/3.8:
java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R $fasta \
-L $calling_intervals_list \
-I recalibrated.bam \
-o output.g.vcf.gz
java -jar GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R $fasta \
-L $calling_intervals_list \
--variant output.g.vcf.gz \
--dbsnp $known_dbsnp \
-o output.vcf.gz
Sentieon:
sentieon driver -r $fasta \
-i deduped.bam \
-q recal_data.table \
--interval $calling_intervals_list \
--algo Haplotyper \
--emit_mode gvcf \
output.g.vcf.gz
sentieon driver -r $fasta \
--interval $calling_intervals_list \
--algo GVCFtyper \
-v output.g.vcf.gz \
--call_conf 10 \
--emit_conf 10 \
-d $known_dbsnp \
output.vcf.gz
Type | TRUTH TOTAL | TRUTH TP | TRUTH FN | QUERY TOTAL | QUERY FP | Recall | Precision | F1_Score |
INDEL | 848723 | 848238 | 485 | 874360 | 538 | 0.999429 | 0.999385 | 0.999407 |
SNP | 4001821 | 4000797 | 1024 | 4005753 | 1033 | 0.999744 | 0.999742 | 0.999743 |
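The metric columns can be recomputed from the count columns. The sketch below uses a hypothetical helper, `vcf_metrics`, and assumes that QUERY TOTAL contains only true and false positives (no calls of unknown status), so that precision equals 1 - FP / QUERY TOTAL. Under that assumption it reproduces the Recall, Precision and F1_Score values in the table above.

```shell
# vcf_metrics TRUTH_TP TRUTH_FN QUERY_TOTAL QUERY_FP
# Prints recall, precision and F1, each rounded to six decimals.
# Assumption: QUERY_TOTAL = true positives + false positives in the query.
vcf_metrics() {
  awk -v tp="$1" -v fn="$2" -v qtotal="$3" -v fp="$4" 'BEGIN {
    recall = tp / (tp + fn)              # TP / TRUTH TOTAL
    precision = 1 - fp / qtotal          # share of query calls that are correct
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    printf "%.6f %.6f %.6f\n", recall, precision, f1
  }'
}

vcf_metrics 848238 485 874360 538        # INDEL row
vcf_metrics 4000797 1024 4005753 1033    # SNP row
```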
Command line:
GATK 4.0:
gatk HaplotypeCaller \
-R $fasta \
-L $calling_intervals_list \
-I recalibrated.bam \
-O output.g.vcf.gz
gatk GenotypeGVCFs \
-R $fasta \
-L $calling_intervals_list \
-V output.g.vcf.gz \
--dbsnp $known_dbsnp \
-O output.vcf.gz
Sentieon:
sentieon driver -r $fasta \
-i deduped.bam \
-q recal_data.table \
--interval $calling_intervals_list \
--algo Haplotyper \
--emit_mode gvcf \
output.g.vcf.gz
sentieon driver -r $fasta \
--interval $calling_intervals_list \
--algo GVCFtyper \
-v output.g.vcf.gz \
--call_conf 10 \
--emit_conf 10 \
-d $known_dbsnp \
output.vcf.gz
Type | TRUTH TOTAL | TRUTH TP | TRUTH FN | QUERY TOTAL | QUERY FP | Recall | Precision | F1_Score |
INDEL | 849960 | 846375 | 3585 | 874364 | 2434 | 0.995782 | 0.997216 | 0.996499 |
SNP | 4003643 | 3998527 | 5116 | 4005750 | 3319 | 0.998722 | 0.999171 | 0.998947 |
Command line:
GATK 4.1:
gatk HaplotypeCaller \
-R $fasta \
-L $calling_intervals_list \
-I recalibrated.bam \
-O output.g.vcf.gz
gatk GenotypeGVCFs \
-R $fasta \
-L $calling_intervals_list \
-V output.g.vcf.gz \
--dbsnp $known_dbsnp \
-O output.vcf.gz
Sentieon:
sentieon driver -r $fasta \
-i deduped.bam \
-q recal_data.table \
--interval $calling_intervals_list \
--algo Haplotyper \
--emit_mode gvcf \
output.g.vcf.gz
sentieon driver -r $fasta \
--interval $calling_intervals_list \
--algo GVCFtyper \
-v output.g.vcf.gz \
-d $known_dbsnp \
--genotype_model multinomial \
output.vcf.gz
*Sentieon uses the option --genotype_model multinomial to match the output of the default newQual model in GATK 4.1.
Type | TRUTH TOTAL | TRUTH TP | TRUTH FN | QUERY TOTAL | QUERY FP | Recall | Precision | F1_Score |
INDEL | 855716 | 850790 | 4926 | 894426 | 10869 | 0.994243 | 0.987848 | 0.991035 |
SNP | 3999272 | 3990379 | 8893 | 4006624 | 11826 | 0.997776 | 0.997048 | 0.997412 |
Computing environment:
- Google Compute Engine
- n1-standard-32 (32 vCPUs, 120 GB memory)
- Local SSD Scratch Disk 2x375G
- centos-7-v20190619
Stage | Sentieon | GATK3.8 | GATK4.0 | GATK4.1 |
Alignment | 2:42:44 | 5:38:35 | 5:49:39 | 5:45:39 |
Dedup | 0:06:16 | 4:04:25 | 2:11:43 | 2:06:32 |
BQSR | 0:10:10 | 4:17:09 | 1:39:57 | 1:40:06 |
HaplotypeCaller | 0:41:02 | 3:21:37 | 6:56:53 | 5:37:52 |
GenotypeGVCFs | 0:00:55 | 2:04:08 | 2:02:55 | 2:05:22 |
Total | 3:41:07 | 19:25:54 | 18:41:07 | 17:15:31 |
Sentieon SpeedUp | -- | 5.3X | 5.1X | 4.7X |
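The speedup row is the ratio of each GATK total runtime to the Sentieon total runtime. As a minimal sketch, the hypothetical helpers `to_seconds` and `speedup` below convert the h:mm:ss totals from the table to seconds and print the ratio with one decimal.

```shell
# to_seconds H:MM:SS -> total number of seconds
to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# speedup BASELINE_TIME SENTIEON_TIME -> ratio with one decimal and an "X" suffix
speedup() {
  awk -v base="$(to_seconds "$1")" -v fast="$(to_seconds "$2")" \
    'BEGIN { printf "%.1fX\n", base / fast }'
}

speedup 19:25:54 3:41:07   # GATK3.8 total vs Sentieon total
speedup 18:41:07 3:41:07   # GATK4.0 total vs Sentieon total
speedup 17:15:31 3:41:07   # GATK4.1 total vs Sentieon total
```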
The benchmark was performed on two different instances. Both instances have an Intel® Xeon® Platinum 8124M CPU @ 3.00GHz with dual striped NVMe SSDs.
Instance | vCPU | Memory |
c5d.9xlarge | 36 | 72GB |
c5d.18xlarge | 72 | 144GB |
On both instances, HG001 30x was processed and completed in less than 90 core-hours.
Stage | c5d.9xlarge time (hh:mm) | c5d.9xlarge core-hours | c5d.18xlarge time (hh:mm) | c5d.18xlarge core-hours |
Alignment | 01:41 | 60.67 | 00:54 | 65.12 |
LocusCollector | 00:01 | 0.93 | 00:01 | 1.5 |
Dedup | 00:03 | 1.47 | 00:03 | 2.48 |
BQSR | 00:05 | 3.14 | 00:03 | 3.56 |
HC | 00:24 | 14.41 | 00:13 | 16.16 |
GVCFtyper | 00:01 | 0.3 | 00:01 | 0.34 |
Total | 02:24 | 80.92 | 01:24 | 89.16 |
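The core-hours figures appear to be the vCPU count multiplied by the wall-clock time of each stage. That is an inference from the numbers, not something the text states. The per-stage values add up to the totals in the last row, both of which are under 90 core-hours. A small sketch with a hypothetical helper, `total_core_hours`, sums each column:

```shell
# total_core_hours VALUE... -> sum of the given core-hour values, two decimals
total_core_hours() {
  printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.2f\n", sum }'
}

total_core_hours 60.67 0.93 1.47 3.14 14.41 0.3    # c5d.9xlarge stages
total_core_hours 65.12 1.5 2.48 3.56 16.16 0.34    # c5d.18xlarge stages
```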
For this evaluation, we used both HG001 and HG002 at a depth of about 50x from the PrecisionFDA Truth Challenge. The b37 reference is used for this benchmark.
Sample | Reads 1 | Reads 2 |
HG001 (50x) | HG001-NA12878-50x_1.fastq.gz | HG001-NA12878-50x_2.fastq.gz |
HG002 (50x) | HG002-NA24385-50x_1.fastq.gz | HG002-NA24385-50x_2.fastq.gz |
The truth sets for HG001 and HG002 can be found on the GIAB latest release page.
Name | File |
HG001 VCF | HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz |
HG001 BED | HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel.bed |
HG002 VCF | HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased.vcf.gz |
HG002 BED | HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_noinconsistent.bed |
Sample | Type | TP | FN | FP | Recall | Precision | F1_Score |
HG001 | INDEL | 359926 | 3112 | 10133 | 0.9914 | 0.9726 | 0.9819 |
HG001 | SNP | 2785549 | 1741 | 7236 | 0.9994 | 0.9974 | 0.9984 |
HG002 | INDEL | 462614 | 806 | 1085 | 0.9983 | 0.9977 | 0.9980 |
HG002 | SNP | 3046197 | 1640 | 5339 | 0.9995 | 0.9983 | 0.9989 |
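This table lists only TP, FN and FP, so the three metrics follow directly from those counts: recall = TP / (TP + FN), precision = TP / (TP + FP), and F1 = 2TP / (2TP + FN + FP). The helper name `count_metrics` below is hypothetical; the sketch only checks that these standard definitions agree with the table at four decimals.

```shell
# count_metrics TP FN FP -> recall precision F1, each rounded to four decimals
count_metrics() {
  awk -v tp="$1" -v fn="$2" -v fp="$3" 'BEGIN {
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fn + fp)   # same as the harmonic mean of the two
    printf "%.4f %.4f %.4f\n", recall, precision, f1
  }'
}

count_metrics 359926 3112 10133   # HG001 INDEL
count_metrics 462614 806 1085     # HG002 INDEL
```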
Sentieon DNAscope with a machine learning model further improves variant calling accuracy. Please see DNAscope Machine Learning Model for more details.