GATK Metrics Collectors


GATK metrics collectors aggregate descriptive statistics and/or summaries of collections of GATK reads. There are two types of collectors: single-level collectors that aggregate data for all reads, and multi-level collectors that aggregate data for all reads as well as at finer levels of granularity such as:

  • per sample
  • per library
  • per read group

Each collector, whether single or multi-level, needs to be able to run from within four different contexts:

  • a standalone walker tool
  • a standalone Spark tool
  • the org.broadinstitute.hellbender.tools.picard.analysis.CollectMultipleMetrics walker tool
  • the org.broadinstitute.hellbender.tools.spark.pipelines.metrics.CollectMultipleMetricsSpark Spark tool


In order to share a single implementation of a given metric across all of these contexts, GATK provides a metrics framework that prescribes a set of component classes and a factorization for metrics implementations. This factorization separates the processing of individual reads from the processing of an RDD, and from the acquisition of the RDD itself. These classes are described below. The GATK code base contains example collectors and code in the org.broadinstitute.hellbender.tools.examples.metrics.single and org.broadinstitute.hellbender.tools.examples.metrics.multi packages.

## Single-Level Collectors

A single-level collector minimally has the following component classes (where "X" in the class name represents the specific type of metrics being collected):



XMetrics

  • container class defining the aggregate metrics being collected
  • extends htsjdk.samtools.metrics.MetricBase
  • processes and collects metrics from a single read
  • has a combine method for combining/reducing results collected in parallel when the collector is run from a Spark tool
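
For concreteness, the sketch below shows what such a container might look like. The class, field, and method names (ExampleXMetrics, TOTAL_READS, addRead) are illustrative only; see the example collectors in the packages listed above for the real implementations.

```java
import java.io.Serializable;
import htsjdk.samtools.metrics.MetricBase;
import org.broadinstitute.hellbender.utils.read.GATKRead;

// Illustrative single-level metrics container: both the aggregate values and the
// per-read processing live in this one class.
public class ExampleXMetrics extends MetricBase implements Serializable {
    private static final long serialVersionUID = 1L;

    // public fields are picked up by the htsjdk metrics file writer
    public long TOTAL_READS = 0;
    public long UNMAPPED_READS = 0;

    // process a single read and update the aggregate values
    public ExampleXMetrics addRead(final GATKRead read) {
        ++TOTAL_READS;
        if (read.isUnmapped()) {
            ++UNMAPPED_READS;
        }
        return this;
    }

    // combine/reduce results that were collected in parallel on separate Spark partitions
    public ExampleXMetrics combine(final ExampleXMetrics other) {
        TOTAL_READS += other.TOTAL_READS;
        UNMAPPED_READS += other.UNMAPPED_READS;
        return this;
    }
}
```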

XMetricsArgumentCollection

  • defines the set of parameters for XMetrics collection
  • extends org.broadinstitute.hellbender.metrics.MetricsArgumentCollection
  • is used as a GATK command line argument collection
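
A corresponding argument collection might look like the following sketch; the exampleArgument shown here is invented for illustration, and the package of the @Argument annotation may vary by GATK version.

```java
import org.broadinstitute.barclay.argparser.Argument;  // GATK's command line @Argument annotation (package may vary by version)
import org.broadinstitute.hellbender.metrics.MetricsArgumentCollection;

// Illustrative metric-specific argument collection; becomes part of the enclosing
// tool's command line when declared as an argument collection.
public class ExampleXMetricsArgumentCollection extends MetricsArgumentCollection {
    private static final long serialVersionUID = 1L;

    @Argument(fullName = "exampleArgument",
              shortName = "EA",
              doc = "Hypothetical metric-specific argument.",
              optional = true)
    public int exampleArgument = 0;
}
```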

XMetricsCollectorSpark

  • serves as an adapter/bridge between an RDD and the (read-based) XMetrics
  • implements org.broadinstitute.hellbender.tools.spark.pipelines.metrics.MetricsCollectorSpark
  • processes an entire JavaRDD of GATKRead objects
  • delegates processing of a single read to the XMetrics class
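
The sketch below shows the core of such an adapter, reusing the illustrative ExampleXMetrics and ExampleXMetricsArgumentCollection classes from the sketches above. The MetricsCollectorSpark interface methods are paraphrased rather than reproduced; consult the interface source for the exact signatures.

```java
import htsjdk.samtools.SAMFileHeader;
import org.apache.spark.api.java.JavaRDD;
import org.broadinstitute.hellbender.utils.read.GATKRead;

// Illustrative adapter between an RDD of reads and the read-based ExampleXMetrics.
// A real collector would implement
// org.broadinstitute.hellbender.tools.spark.pipelines.metrics.MetricsCollectorSpark;
// that interface's exact method signatures are omitted here.
public class ExampleXMetricsCollectorSpark implements java.io.Serializable {
    private static final long serialVersionUID = 1L;

    private ExampleXMetricsArgumentCollection args;
    private ExampleXMetrics metrics;

    // cache the metric-specific arguments passed in by the enclosing tool
    public void initialize(final ExampleXMetricsArgumentCollection inputArgs) {
        this.args = inputArgs;
    }

    // fold each partition's reads into an ExampleXMetrics, then reduce the per-partition
    // results with combine(); single-read processing is delegated to ExampleXMetrics
    public void collectMetrics(final JavaRDD<GATKRead> filteredReads, final SAMFileHeader samHeader) {
        metrics = filteredReads.aggregate(
                new ExampleXMetrics(),                  // zero value
                (acc, read) -> acc.addRead(read),       // per-partition fold
                (left, right) -> left.combine(right));  // merge partition results
    }

    public ExampleXMetrics getMetrics() {
        return metrics;
    }
}
```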

Given this set of implementation classes, a non-Spark walker tool and a Spark tool can each be implemented by extending the appropriate base class and delegating to the implementation classes (the implementation classes can also be used from within CollectMultipleMetrics and CollectMultipleMetricsSpark - see the source code for examples):

CollectXMetrics

  • extends org.broadinstitute.hellbender.tools.picard.analysis.SinglePassSamProgram
  • delegates directly to XMetrics
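
A rough sketch of such a tool is shown below. It assumes the setup/acceptRead/finish hooks of the Picard-derived SinglePassSamProgram base class; the method names and the SAMRecordToGATKReadAdapter wrapping are assumptions to be checked against the GATK source, not a definitive implementation.

```java
import java.io.File;
import htsjdk.samtools.SAMFileHeader;
import htsjdk.samtools.SAMRecord;
import htsjdk.samtools.reference.ReferenceSequence;
import org.broadinstitute.hellbender.utils.read.SAMRecordToGATKReadAdapter;

// Illustrative standalone (non-Spark) tool; in the real framework this extends
// org.broadinstitute.hellbender.tools.picard.analysis.SinglePassSamProgram and is
// annotated as a GATK command line program.
public class CollectExampleXMetrics /* extends SinglePassSamProgram */ {

    private final ExampleXMetrics metrics = new ExampleXMetrics();

    // called once before any reads are processed (base-class hook, name assumed)
    protected void setup(final SAMFileHeader header, final File samFile) { }

    // called once per read; wraps the SAMRecord as a GATKRead and delegates directly
    // to the ExampleXMetrics container
    protected void acceptRead(final SAMRecord rec, final ReferenceSequence ref) {
        metrics.addRead(new SAMRecordToGATKReadAdapter(rec));
    }

    // called after the last read; this is where the metrics file would be written
    protected void finish() {
        // e.g., add `metrics` to an htsjdk MetricsFile and write it to the requested output
    }
}
```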

CollectXMetricsSpark

  • extends org.broadinstitute.hellbender.tools.spark.pipelines.metrics.MetricsCollectorSparkTool
  • delegates to XMetricsCollectorSpark (which in turn delegates processing of each read to XMetrics)


Note that of all of these classes, only CollectXMetrics, CollectXMetricsSpark, and XMetricsCollectorSpark are required to implement specific interfaces.

## Multi-Level Collectors

In a single-level collector, the XMetrics class serves as both the container for aggregate metrics, as well as the processing unit for individual reads. For multi-level collectors, the XMetrics class serves only as the metrics container (and must extend org.broadinstitute.hellbender.metrics.MultiLevelMetrics), and three additional component classes are required in order to take advantage of the multi-level distribution (sample/library/read group) provided by the metrics framework:

XMetricsCollector

  • extends the org.broadinstitute.hellbender.metrics.MultiLevelReducibleCollector class, which provides automatic distribution of reads across multiple units of collection (sample/library/read group)
  • processes and collects metrics from a single read by delegating to the distribution framework (MultiLevelReducibleCollector)
  • has (by convention) a combineUnit method for combining like unit levels that have been collected in parallel when the collector is run from a Spark tool
  • must provide a combine method for combining aggregate results that have been collected in parallel when the collector is run from a Spark tool; this method delegates to combineUnit (sketched below)

XMetricsPerUnitCollector

  • extends org.broadinstitute.hellbender.metrics.PerUnitMetricCollector
  • collects metrics for a single unit (sample, library or read group)
  • instances are created and maintained by the metrics framework
  • has a combine method for combining/reducing results from like units (sketched below)
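
The following framework-independent sketch illustrates the combineUnit/combine convention. In a real collector, MultiLevelReducibleCollector performs the per-unit bookkeeping that the Map stands in for here, and all class and method names other than combine and combineUnit are illustrative.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Illustrative per-unit collector: holds the metrics for one sample, library, or read group.
class ExampleXPerUnitCollector implements Serializable {
    private static final long serialVersionUID = 1L;

    final ExampleXMetrics metrics = new ExampleXMetrics();

    // reduce two per-unit collectors that describe the same unit
    ExampleXPerUnitCollector combine(final ExampleXPerUnitCollector other) {
        metrics.combine(other.metrics);
        return this;
    }
}

// Illustrative multi-level collector showing how combine delegates to combineUnit.
public class ExampleXMetricsCollector implements Serializable {
    private static final long serialVersionUID = 1L;

    // one per-unit collector per collection unit, keyed here by unit name for simplicity
    private final Map<String, ExampleXPerUnitCollector> unitCollectors = new HashMap<>();

    // combine two per-unit collectors covering the same unit (by convention, combineUnit)
    private ExampleXPerUnitCollector combineUnit(final ExampleXPerUnitCollector first,
                                                 final ExampleXPerUnitCollector second) {
        return first.combine(second);
    }

    // combine whole collectors produced in parallel on different Spark partitions by
    // pairing up like units and delegating each pair to combineUnit
    public ExampleXMetricsCollector combine(final ExampleXMetricsCollector source) {
        source.unitCollectors.forEach((unit, otherCollector) ->
                unitCollectors.merge(unit, otherCollector, this::combineUnit));
        return this;
    }
}
```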

XMetricsCollectorArgs

  • represents data extracted from a single read for this metric type
  • used as a type parameter for org.broadinstitute.hellbender.metrics.MultiLevelReducibleCollector
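
A minimal sketch of such an argument type follows; the mappingQuality field is purely illustrative.

```java
// Illustrative per-read argument type: a small value object carrying just the data
// extracted from one read that the per-unit collectors need.
public final class ExampleXMetricsCollectorArgs {
    private final int mappingQuality;   // hypothetical per-read value

    public ExampleXMetricsCollectorArgs(final int mappingQuality) {
        this.mappingQuality = mappingQuality;
    }

    public int getMappingQuality() {
        return mappingQuality;
    }
}
```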

The following schematic shows the general relationships of these collector component classes in the context of various tools, with the arrows indicating a "delegates to" relationship via composition or inheritance:


[Diagram: Metrics Components]

The general lifecycle of a Spark collector (XMetricsCollectorSpark in the diagram above) looks like this:



  • CollectorType collector = new CollectorType();
  • CollectorArgType args = // get metric-specific input arguments
  • collector.initialize(args); // pass the input arguments to the collector for initialization
  • ReadFilter filter = collector.getReadFilter(samFileHeader);
  • collector.collectMetrics(getReads().filter(filter), samFileHeader);
  • collector.saveMetrics(getReadSourceName(), getAuthHolder());


## Notes on Parallel Metrics Collection and Combine Methods

  • Some metrics may not be easily parallelizable. Collectors should use this framework only if the metric's results can be combined, with full fidelity, from aggregate data collected in parallel from multiple Spark partitions (see the illustration below).
  • The combine methods described above are only called by the framework when the collector is being run in parallel from a Spark context. Standalone tools process records serially and do not require combine functionality.
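
To illustrate the full-fidelity requirement, a metric such as a mean should be carried as combinable aggregates (a sum and a count) rather than as a pre-computed average, so that merging per-partition results yields exactly the same answer as a single serial pass. The sketch below is self-contained and not part of the framework.

```java
import java.io.Serializable;

// Illustrative metric that stays exact under parallel collection: the combinable
// state is a sum and a count, and the mean is derived only at the end.
public class MeanBaseQualityMetric implements Serializable {
    private static final long serialVersionUID = 1L;

    public long qualitySum = 0;
    public long baseCount = 0;

    // per-read (or per-base) accumulation
    public void add(final int quality) {
        qualitySum += quality;
        baseCount++;
    }

    // exact under any partitioning of the data, because addition is associative
    public MeanBaseQualityMetric combine(final MeanBaseQualityMetric other) {
        qualitySum += other.qualitySum;
        baseCount += other.baseCount;
        return this;
    }

    // derive the final value only after all partial results have been combined
    public double mean() {
        return baseCount == 0 ? 0.0 : (double) qualitySum / baseCount;
    }
}
```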