# GATK Metrics Collectors
GATK metrics collectors aggregate descriptive statistics and/or summaries of collections of GATK reads. There are two types of collectors: single-level collectors that aggregate data for all reads, and multi-level collectors that aggregate data for all reads as well as at finer levels of granularity such as:
- per sample
- per library
- per read group
Each collector, whether single or multi-level, needs to be able to run from within four different contexts:
- a standalone walker tool
- a standalone Spark tool
- the `org.broadinstitute.hellbender.tools.picard.analysis.CollectMultipleMetrics` walker tool
- the `org.broadinstitute.hellbender.tools.spark.pipelines.metrics.CollectMultipleMetricsSpark` Spark tool
In order to share a single implementation of a given metric across all of these contexts, GATK contains a metrics framework that prescribes a set of component classes and a factorization for metrics implementations. This helps to separate the processing of individual reads from the processing of an RDD, and from the acquisition of the RDD. These classes are described below. The GATK code base contains example collectors and code in the `org.broadinstitute.hellbender.tools.examples.metrics.single` and `org.broadinstitute.hellbender.tools.examples.metrics.multi` packages.
## Single-Level Collectors
A single-level collector minimally has the following component classes (where "X" in the class name represents the specific type of metrics being collected):
`XMetrics`
- container class defining the aggregate metrics being collected
- extends `htsjdk.samtools.metrics.MetricBase`
- processes and collects metrics from a single read
- has a `combine` method for combining/reducing results collected in parallel when the collector is run from a Spark tool
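As a concrete illustration, here is a minimal, self-contained sketch of an `XMetrics`-style container. The class name and fields are hypothetical; a real implementation would extend `htsjdk.samtools.metrics.MetricBase` and process `GATKRead` objects rather than the bare read lengths used here.

```java
// Simplified sketch of an XMetrics-style container (hypothetical names).
// Public fields follow the htsjdk metrics convention of upper-case
// field names that are serialized directly into the metrics file.
public class ExampleMetrics {
    public long READ_COUNT = 0;
    public long TOTAL_BASES = 0;

    // process a single read (represented here by its length)
    public void addRead(final int readLength) {
        READ_COUNT++;
        TOTAL_BASES += readLength;
    }

    // combine results collected in parallel on separate Spark partitions
    public ExampleMetrics combine(final ExampleMetrics other) {
        READ_COUNT += other.READ_COUNT;
        TOTAL_BASES += other.TOTAL_BASES;
        return this;
    }
}
```

Because `combine` only adds the partial counts, the result is identical whether the reads were processed serially or split across partitions.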
`XMetricsArgumentCollection`
- defines the set of parameters for `XMetrics` collection
- extends `org.broadinstitute.hellbender.metrics.MetricsArgumentCollection`
- is used as a GATK command line argument collection
`XMetricsCollectorSpark`
- serves as an adapter/bridge between an RDD and the (read-based) `XMetrics` class
- implements `org.broadinstitute.hellbender.tools.spark.pipelines.metrics.MetricsCollectorSpark`
- processes an entire `JavaRDD` of `GATKRead`s
- delegates processing of a single read to the `XMetrics` class
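The adapter's role can be sketched without Spark. The real class folds a `JavaRDD<GATKRead>` (for example via an aggregate-style operation); below, a list of per-partition read lists stands in for the RDD, integer read lengths stand in for reads, and all class and method names are illustrative rather than GATK's.

```java
import java.util.List;

// Illustrative stand-in for an XMetricsCollectorSpark-style adapter.
public class ExampleCollectorSparkSketch {
    // minimal metrics container (stands in for XMetrics)
    public static class Metrics {
        public long READ_COUNT = 0;
        void addRead(final int readLength) { READ_COUNT++; }
        Metrics combine(final Metrics other) {
            READ_COUNT += other.READ_COUNT;
            return this;
        }
    }

    // Stands in for collectMetrics(JavaRDD<GATKRead>, SAMFileHeader):
    // aggregate each "partition" separately, then combine the partial
    // results, mirroring how an RDD aggregate merges per-partition values.
    public static Metrics collectMetrics(final List<List<Integer>> partitions) {
        final Metrics result = new Metrics();
        for (final List<Integer> partition : partitions) {
            final Metrics partial = new Metrics();
            for (final int readLength : partition) {
                partial.addRead(readLength); // delegate each read to the metrics class
            }
            result.combine(partial);         // reduce step
        }
        return result;
    }
}
```

The adapter itself holds no per-read logic; everything it knows about a read is delegated to the metrics class, which is what lets the same metrics code run serially or in parallel.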
Given this set of implementation classes, one each of a Spark and non-Spark walker tool can be implemented by extending the respective base classes and delegating to the implementation classes (the implementation classes can also be used from within CollectMultipleMetrics and CollectMultipleMetricsSpark - see the source code for examples):
`CollectXMetrics`
- extends `org.broadinstitute.hellbender.tools.picard.analysis.SinglePassSamProgram`
- delegates directly to `XMetrics`
`CollectXMetricsSpark`
- extends `org.broadinstitute.hellbender.tools.spark.pipelines.metrics.MetricsCollectorSparkTool`
- delegates to `XMetricsCollectorSpark` (which in turn delegates processing of each read to `XMetrics`)
Note that of all of these classes, only `CollectXMetrics`, `CollectXMetricsSpark`, and `XMetricsCollectorSpark` are required to implement specific interfaces.
## Multi-Level Collectors
In a single-level collector, the `XMetrics` class serves as both the container for aggregate metrics and the processing unit for individual reads. For multi-level collectors, the `XMetrics` class serves only as the metrics container (and must extend `org.broadinstitute.hellbender.metrics.MultiLevelMetrics`), and three additional component classes are required in order to take advantage of the multi-level distribution (sample/library/read group) provided by the metrics framework:
`XMetricsCollector`
- extends the `org.broadinstitute.hellbender.metrics.MultiLevelReducibleCollector` class, which provides automatic distribution of reads across multiple units of collection (sample/library/read group)
- processes and collects metrics from a single read by delegating to the distribution framework (`MultiLevelReducibleCollector`)
- has (by convention) a `combineUnit` method for combining like unit levels that have been collected in parallel when the collector is run from a Spark tool
- must provide a `combine` method for combining aggregate results that have been collected in parallel when the collector is run from a Spark tool; this method delegates to `combineUnit`

`XMetricsPerUnitCollector`
- extends `org.broadinstitute.hellbender.metrics.PerUnitMetricCollector`
- collects metrics for a single unit (sample, library, or read group)
- instances are created and maintained by the metrics framework
- contains a combiner for combining/reducing like units
`XMetricsCollectorArgs`
- represents data extracted from a single read for this metric type
- used as a type parameter for `org.broadinstitute.hellbender.metrics.MultiLevelReducibleCollector`
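The multi-level distribution and the `combineUnit`/`combine` pairing can be sketched in miniature with a map from read-group name to per-unit collector (the framework also supports per-sample and per-library levels). All class and method names below are illustrative, not GATK's.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of multi-level distribution: each read is routed to the
// per-unit collector for its read group, created on demand.
public class MultiLevelSketch {
    public static class PerUnitCollector {
        public long READ_COUNT = 0;
        void addRead() { READ_COUNT++; }
        // combine two collectors for the same unit ("like units")
        PerUnitCollector combineUnit(final PerUnitCollector other) {
            READ_COUNT += other.READ_COUNT;
            return this;
        }
    }

    private final Map<String, PerUnitCollector> units = new HashMap<>();

    // route a read (represented by its read group name) to its unit
    public void acceptRead(final String readGroup) {
        units.computeIfAbsent(readGroup, rg -> new PerUnitCollector()).addRead();
    }

    // combine aggregate results from a parallel run by merging like units;
    // this is the combine-delegates-to-combineUnit pattern
    public MultiLevelSketch combine(final MultiLevelSketch other) {
        other.units.forEach((rg, unit) ->
                units.merge(rg, unit, PerUnitCollector::combineUnit));
        return this;
    }

    public long countFor(final String readGroup) {
        final PerUnitCollector u = units.get(readGroup);
        return u == null ? 0 : u.READ_COUNT;
    }
}
```

Note how the top-level `combine` does no metric arithmetic of its own: it only pairs up like units and delegates to `combineUnit`, which is the convention described above.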
The following schematic shows the general relationships of these collector component classes in the context of various tools, with the arrows indicating a "delegates to" relationship via composition or inheritance:
The general lifecycle of a Spark collector (XMetricsCollectorSpark in the diagram above) looks like this:
```java
CollectorType collector = new CollectorType()
CollectorArgType args = // get metric-specific input arguments
// pass the input arguments to the collector for initialization
collector.initialize(args);
ReadFilter filter = collector.getReadFilter(samFileHeader);
collector.collectMetrics( getReads().filter(filter), samFileHeader );
collector.saveMetrics(getReadSourceName(), getAuthHolder());
```
## Notes on Parallel Metrics Collection and Combine Methods
- Some metrics may not be easily parallelizable. Collectors should use this framework only if their results can be combined, with full fidelity, from aggregate data collected in parallel from multiple Spark partitions.
- The `combine` methods described above are only called by the framework when the collector is being run in parallel from a Spark context. Standalone tools process records serially and do not require combine functionality.
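A quick way to see the full-fidelity requirement: a mean is safely combinable because its sufficient statistics (sum and count) simply add across partitions, whereas a collector that kept only per-partition medians could not reconstruct the global median. A minimal sketch of the combinable case, using a hypothetical class that is not part of GATK:

```java
// A mean decomposes into (sum, count), which add across partitions,
// so combining partial aggregates loses no information.
public class MeanMetric {
    public double sum = 0;
    public long count = 0;

    public void addValue(final double v) {
        sum += v;
        count++;
    }

    // merge a partial aggregate computed on another partition
    public MeanMetric combine(final MeanMetric other) {
        sum += other.sum;
        count += other.count;
        return this;
    }

    public double mean() {
        return count == 0 ? 0 : sum / count;
    }
}
```

A metric whose final value cannot be decomposed this way (an exact median or mode over all reads, for example) would need a different strategy, such as collecting the full distribution per partition, before it could use this framework.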