GATK Metrics Collectors


GATK metrics collectors aggregate descriptive statistics and/or summaries of collections of GATK reads. There are two types of collectors: single-level collectors that aggregate data for all reads, and multi-level collectors that aggregate data for all reads as well as at finer levels of granularity such as:

  • per sample
  • per library
  • per read group

Each collector, whether single or multi-level, needs to be able to run from within four different contexts:

  • a standalone walker tool
  • a standalone Spark tool
  • the org.broadinstitute.hellbender.tools.picard.analysis.CollectMultipleMetrics walker tool
  • the org.broadinstitute.hellbender.tools.spark.pipelines.metrics.CollectMultipleMetricsSpark Spark tool


In order to share a single implementation of a given metric across all of these contexts, GATK provides a metrics framework that prescribes a set of component classes and a factorization for metrics implementations. This factorization separates the processing of individual reads from the processing of an RDD, and from the acquisition of the RDD itself. These classes are described below. The GATK code base contains example collectors and code in the org.broadinstitute.hellbender.tools.examples.metrics.single and org.broadinstitute.hellbender.tools.examples.metrics.multi packages.

## Single-Level Collectors

A single-level collector minimally has the following component classes (where "X" in the class name represents the specific type of metrics being collected):



XMetrics

  • container class defining the aggregate metrics being collected
  • extends htsjdk.samtools.metrics.MetricBase
  • processes and collects metrics from a single read
  • has a combine method for combining/reducing results collected in parallel when the collector is run from a Spark tool
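
For concreteness, the sketch below shows what such a container might look like. The class, field, and method names (ExampleXMetrics, TOTAL_READS, addRead) are illustrative only; see the example collectors in the packages listed above for the real implementations.

```java
import java.io.Serializable;
import htsjdk.samtools.metrics.MetricBase;
import org.broadinstitute.hellbender.utils.read.GATKRead;

// Illustrative single-level metrics container: both the aggregate values and the
// per-read processing live in this one class.
public class ExampleXMetrics extends MetricBase implements Serializable {
    private static final long serialVersionUID = 1L;

    // public fields are picked up by the htsjdk metrics file writer
    public long TOTAL_READS = 0;
    public long UNMAPPED_READS = 0;

    // process a single read and update the aggregate values
    public ExampleXMetrics addRead(final GATKRead read) {
        ++TOTAL_READS;
        if (read.isUnmapped()) {
            ++UNMAPPED_READS;
        }
        return this;
    }

    // combine/reduce results that were collected in parallel on separate Spark partitions
    public ExampleXMetrics combine(final ExampleXMetrics other) {
        TOTAL_READS += other.TOTAL_READS;
        UNMAPPED_READS += other.UNMAPPED_READS;
        return this;
    }
}
```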

XMetricsArgumentCollection

  • defines the set of parameters for XMetrics collection
  • extends org.broadinstitute.hellbender.metrics.MetricsArgumentCollection
  • is used as a GATK command line argument collection
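
A corresponding argument collection might look like the following sketch; the exampleArgument shown here is invented for illustration, and the package of the @Argument annotation may vary by GATK version.

```java
import org.broadinstitute.barclay.argparser.Argument;  // GATK's command line @Argument annotation (package may vary by version)
import org.broadinstitute.hellbender.metrics.MetricsArgumentCollection;

// Illustrative metric-specific argument collection; becomes part of the enclosing
// tool's command line when declared as an argument collection.
public class ExampleXMetricsArgumentCollection extends MetricsArgumentCollection {
    private static final long serialVersionUID = 1L;

    @Argument(fullName = "exampleArgument",
              shortName = "EA",
              doc = "Hypothetical metric-specific argument.",
              optional = true)
    public int exampleArgument = 0;
}
```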

XMetricsCollectorSpark

  • serves as an adapter/bridge between an RDD and the (read-based) XMetrics
  • implements org.broadinstitute.hellbender.tools.spark.pipelines.metrics.MetricsCollectorSpark
  • processes an entire JavaRDD of GATKRead objects
  • delegates processing of a single read to the XMetrics class
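
The sketch below shows the core of such an adapter, reusing the illustrative ExampleXMetrics and ExampleXMetricsArgumentCollection classes from the sketches above. The MetricsCollectorSpark interface methods are paraphrased rather than reproduced; consult the interface source for the exact signatures.

```java
import htsjdk.samtools.SAMFileHeader;
import org.apache.spark.api.java.JavaRDD;
import org.broadinstitute.hellbender.utils.read.GATKRead;

// Illustrative adapter between an RDD of reads and the read-based ExampleXMetrics.
// A real collector would implement
// org.broadinstitute.hellbender.tools.spark.pipelines.metrics.MetricsCollectorSpark;
// that interface's exact method signatures are omitted here.
public class ExampleXMetricsCollectorSpark implements java.io.Serializable {
    private static final long serialVersionUID = 1L;

    private ExampleXMetricsArgumentCollection args;
    private ExampleXMetrics metrics;

    // cache the metric-specific arguments passed in by the enclosing tool
    public void initialize(final ExampleXMetricsArgumentCollection inputArgs) {
        this.args = inputArgs;
    }

    // fold each partition's reads into an ExampleXMetrics, then reduce the per-partition
    // results with combine(); single-read processing is delegated to ExampleXMetrics
    public void collectMetrics(final JavaRDD<GATKRead> filteredReads, final SAMFileHeader samHeader) {
        metrics = filteredReads.aggregate(
                new ExampleXMetrics(),                  // zero value
                (acc, read) -> acc.addRead(read),       // per-partition fold
                (left, right) -> left.combine(right));  // merge partition results
    }

    public ExampleXMetrics getMetrics() {
        return metrics;
    }
}
```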

Given this set of implementation classes, a non-Spark walker tool and a Spark tool can each be implemented by extending the appropriate base class and delegating to the implementation classes (the implementation classes can also be used from within CollectMultipleMetrics and CollectMultipleMetricsSpark - see the source code for examples):

CollectXMetrics

  • extends org.broadinstitute.hellbender.tools.picard.analysis.SinglePassSamProgram
  • delegates directly to XMetrics
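
A rough sketch of such a tool is shown below. It assumes the setup/acceptRead/finish hooks of the Picard-derived SinglePassSamProgram base class; the method names and the SAMRecordToGATKReadAdapter wrapping are assumptions to be checked against the GATK source, not a definitive implementation.

```java
import java.io.File;
import htsjdk.samtools.SAMFileHeader;
import htsjdk.samtools.SAMRecord;
import htsjdk.samtools.reference.ReferenceSequence;
import org.broadinstitute.hellbender.utils.read.SAMRecordToGATKReadAdapter;

// Illustrative standalone (non-Spark) tool; in the real framework this extends
// org.broadinstitute.hellbender.tools.picard.analysis.SinglePassSamProgram and is
// annotated as a GATK command line program.
public class CollectExampleXMetrics /* extends SinglePassSamProgram */ {

    private final ExampleXMetrics metrics = new ExampleXMetrics();

    // called once before any reads are processed (base-class hook, name assumed)
    protected void setup(final SAMFileHeader header, final File samFile) { }

    // called once per read; wraps the SAMRecord as a GATKRead and delegates directly
    // to the ExampleXMetrics container
    protected void acceptRead(final SAMRecord rec, final ReferenceSequence ref) {
        metrics.addRead(new SAMRecordToGATKReadAdapter(rec));
    }

    // called after the last read; this is where the metrics file would be written
    protected void finish() {
        // e.g., add `metrics` to an htsjdk MetricsFile and write it to the requested output
    }
}
```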

CollectXMetricsSpark

  • extends org.broadinstitute.hellbender.tools.spark.pipelines.metrics.MetricsCollectorSparkTool
  • delegates to XMetricsCollectorSpark (which in turn delegates processing of each read to XMetrics)


Note that of all of these classes, only CollectXMetrics, CollectXMetricsSpark, and XMetricsCollectorSpark are required to implement specific interfaces.

## Multi-Level Collectors

In a single-level collector, the XMetrics class serves as both the container for aggregate metrics, as well as the processing unit for individual reads. For multi-level collectors, the XMetrics class serves only as the metrics container (and must extend org.broadinstitute.hellbender.metrics.MultiLevelMetrics), and three additional component classes are required in order to take advantage of the multi-level distribution (sample/library/read group) provided by the metrics framework:

XMetricsCollector

  • extends the org.broadinstitute.hellbender.metrics.MultiLevelReducibleCollector class, which provides automatic distribution of reads across multiple units of collection (sample/library/read group)
  • processes and collects metrics from a single read by delegating to the distribution framework (MultiLevelReducibleCollector)
  • has (by convention) a combineUnit method for combining like unit levels that have been collected in parallel when the collector is run from a Spark tool
  • must provide a combine method for combining aggregate results that have been collected in parallel when the collector is run from a Spark tool; this method delegates to combineUnit (sketched below)

XMetricsPerUnitCollector

  • extends org.broadinstitute.hellbender.metrics.PerUnitMetricCollector
  • collects metrics for a single unit (sample, library or read group)
  • instances are created and maintained by the metrics framework
  • has a combine method for combining/reducing results from like units (sketched below)
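
The following framework-independent sketch illustrates the combineUnit/combine convention. In a real collector, MultiLevelReducibleCollector performs the per-unit bookkeeping that the Map stands in for here, and all class and method names other than combine and combineUnit are illustrative.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Illustrative per-unit collector: holds the metrics for one sample, library, or read group.
class ExampleXPerUnitCollector implements Serializable {
    private static final long serialVersionUID = 1L;

    final ExampleXMetrics metrics = new ExampleXMetrics();

    // reduce two per-unit collectors that describe the same unit
    ExampleXPerUnitCollector combine(final ExampleXPerUnitCollector other) {
        metrics.combine(other.metrics);
        return this;
    }
}

// Illustrative multi-level collector showing how combine delegates to combineUnit.
public class ExampleXMetricsCollector implements Serializable {
    private static final long serialVersionUID = 1L;

    // one per-unit collector per collection unit, keyed here by unit name for simplicity
    private final Map<String, ExampleXPerUnitCollector> unitCollectors = new HashMap<>();

    // combine two per-unit collectors covering the same unit (by convention, combineUnit)
    private ExampleXPerUnitCollector combineUnit(final ExampleXPerUnitCollector first,
                                                 final ExampleXPerUnitCollector second) {
        return first.combine(second);
    }

    // combine whole collectors produced in parallel on different Spark partitions by
    // pairing up like units and delegating each pair to combineUnit
    public ExampleXMetricsCollector combine(final ExampleXMetricsCollector source) {
        source.unitCollectors.forEach((unit, otherCollector) ->
                unitCollectors.merge(unit, otherCollector, this::combineUnit));
        return this;
    }
}
```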

XMetricsCollectorArgs

  • represents data extracted from a single read for this metric type
  • used as a type parameter for org.broadinstitute.hellbender.metrics.MultiLevelReducibleCollector
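
A minimal sketch of such an argument type follows; the mappingQuality field is purely illustrative.

```java
// Illustrative per-read argument type: a small value object carrying just the data
// extracted from one read that the per-unit collectors need.
public final class ExampleXMetricsCollectorArgs {
    private final int mappingQuality;   // hypothetical per-read value

    public ExampleXMetricsCollectorArgs(final int mappingQuality) {
        this.mappingQuality = mappingQuality;
    }

    public int getMappingQuality() {
        return mappingQuality;
    }
}
```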

The following schematic shows the general relationships of these collector component classes in the context of various tools, with the arrows indicating a "delegates to" relationship via composition or inheritance:


[Diagram: Metrics Components]

The general lifecycle of a Spark collector (XMetricsCollectorSpark in the diagram above) looks like this:



  • CollectorType collector = new CollectorType();
  • CollectorArgType args = // get metric-specific input arguments
  • collector.initialize(args); // pass the input arguments to the collector for initialization
  • ReadFilter filter = collector.getReadFilter(samFileHeader);
  • collector.collectMetrics(getReads().filter(filter), samFileHeader);
  • collector.saveMetrics(getReadSourceName(), getAuthHolder());


## Notes on Parallel Metrics Collection and Combine Methods

  • Some metrics may not be easily parallelizable. Collectors should use this framework only if the metric's results can be combined, with full fidelity, from aggregate data collected in parallel from multiple Spark partitions (see the illustration below).
  • The combine methods described above are only called by the framework when the collector is being run in parallel from a Spark context. Standalone tools process records serially and do not require combine functionality.
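
To illustrate the full-fidelity requirement, a metric such as a mean should be carried as combinable aggregates (a sum and a count) rather than as a pre-computed average, so that merging per-partition results yields exactly the same answer as a single serial pass. The sketch below is self-contained and not part of the framework.

```java
import java.io.Serializable;

// Illustrative metric that stays exact under parallel collection: the combinable
// state is a sum and a count, and the mean is derived only at the end.
public class MeanBaseQualityMetric implements Serializable {
    private static final long serialVersionUID = 1L;

    public long qualitySum = 0;
    public long baseCount = 0;

    // per-read (or per-base) accumulation
    public void add(final int quality) {
        qualitySum += quality;
        baseCount++;
    }

    // exact under any partitioning of the data, because addition is associative
    public MeanBaseQualityMetric combine(final MeanBaseQualityMetric other) {
        qualitySum += other.qualitySum;
        baseCount += other.baseCount;
        return this;
    }

    // derive the final value only after all partial results have been combined
    public double mean() {
        return baseCount == 0 ? 0.0 : (double) qualitySum / baseCount;
    }
}
```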