Envelope User Guide

Fundamentals

Envelope runs a data pipeline as a graph of steps that are defined by the configuration. Each step can specify other steps as dependencies so that Envelope will wait for those steps to be submitted first. Steps that have no dependencies will run immediately. Steps that have common dependencies will run in parallel.
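The scheduling rule above can be illustrated with a minimal sketch. This is not Envelope's actual scheduler: the function name `execution_waves` and the dictionary representation of steps are hypothetical, and grouping steps into strict "waves" is a simplification of the behavior described above.

```python
from typing import Dict, List, Set


def execution_waves(dependencies: Dict[str, List[str]]) -> List[Set[str]]:
    """Group steps into waves; all steps in one wave could run in parallel.

    `dependencies` maps each step name to the names of the steps it depends on.
    This is an illustrative model only, not Envelope's implementation.
    """
    remaining = {step: set(deps) for step, deps in dependencies.items()}

    # Reject dependencies on steps that are not defined in the pipeline.
    for step, deps in remaining.items():
        unknown = deps - set(remaining)
        if unknown:
            raise ValueError(f"step '{step}' depends on unknown steps: {sorted(unknown)}")

    done: Set[str] = set()
    waves: List[Set[str]] = []
    while remaining:
        # A step is ready once every one of its dependencies has been submitted.
        ready = {step for step, deps in remaining.items() if deps <= done}
        if not ready:
            raise ValueError("circular dependency between steps")
        waves.append(ready)
        done |= ready
        for step in ready:
            del remaining[step]
    return waves
```

For example, with steps `a` and `b` having no dependencies, `c` and `d` both depending on `a` and `b`, and `e` depending on `c`, the sketch yields three waves: `{a, b}`, then `{c, d}`, then `{e}`.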

The key mapping of Envelope to Spark is that each Envelope data step is a DataFrame, which is created from either an external input or a derivation of one or more other data steps. Each data step’s DataFrame is registered by the step name as a Spark SQL temporary table so that subsequent queries can read directly from that step.

Each data step can optionally write to an external output. The way that a step is applied to the output, e.g. to append, to upsert, or to maintain a Type 2 slowly changing dimension (SCD), is specified with a planner. The way that a step groups keys together (which only occurs when the planner requires existing records) is specified with a partitioner.

Each data step that uses an input with a translator adds an additional data step that contains the input values that failed translation. The additional step has the same name as the original data step plus the suffix '_errored'.
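The naming convention for errored values can be sketched as follows. This is an illustration in plain Python rather than Envelope or Spark code: the function `translate_step` is hypothetical, and treating any raised exception as a translation error is an assumption made for the example.

```python
from typing import Any, Callable, Dict, Iterable, List


def translate_step(
    step_name: str,
    raw_values: Iterable[Any],
    translator: Callable[[Any], Any],
) -> Dict[str, List[Any]]:
    """Apply a translator and keep the failures under '<step_name>_errored'."""
    translated: List[Any] = []
    errored: List[Any] = []
    for raw in raw_values:
        try:
            translated.append(translator(raw))
        except Exception:
            # Keep the original input value so it can be inspected later.
            errored.append(raw)
    return {step_name: translated, step_name + "_errored": errored}
```

Calling `translate_step("traffic", ["1", "x", "3"], int)` returns the translated values under `traffic` and the untranslatable value `"x"` under `traffic_errored`.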

Envelope can loop over a sub-graph of the pipeline steps by using a loop step. See the looping guide for more on the loop step type.

When at least one of the external input steps is a stream, e.g. Kafka, the Envelope pipeline becomes a Spark Streaming job and the graph of steps is executed every micro-batch.

Configuration

Envelope can read Java properties, JSON, and HOCON configuration files.

Each pipeline configuration file must define:

  • One or more data steps with no dependencies

  • One or more data steps with an output

  • One compatible planner for each output

And can additionally define:

  • One application section for pipeline-level configurations

  • One or more steps with one or more dependencies

  • One or more loop steps

  • One or more steps with a partitioner
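The required elements listed above can be expressed as a small check over a parsed configuration. This is a sketch under stated assumptions, not Envelope's own validation: the function `validate_pipeline` is hypothetical, the steps are assumed to be available as a dictionary keyed by step name, and the key names `dependencies`, `output`, and `planner` are assumed for illustration. It does not check whether a planner is compatible with its output.

```python
from typing import Dict, List


def validate_pipeline(steps: Dict[str, dict]) -> List[str]:
    """Return a list of problems; an empty list means the basic rules hold."""
    problems: List[str] = []

    # Rule 1: at least one step must have no dependencies.
    if not any(not step.get("dependencies") for step in steps.values()):
        problems.append("no data step without dependencies")

    # Rule 2: at least one step must write to an output.
    with_output = [name for name, step in steps.items() if "output" in step]
    if not with_output:
        problems.append("no data step with an output")

    # Rule 3: every output needs a planner (compatibility is not checked here).
    for name in with_output:
        if "planner" not in steps[name]:
            problems.append(f"step '{name}' has an output but no planner")

    return problems
```

A pipeline with one input step and one dependent step that has both a planner and an output passes with no problems reported, while a pipeline whose only step has a dependency and an output but no planner is reported twice.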

For the full specification of Envelope configurations, see the configuration guide documentation page.

The final configuration is constructed with the following layers (in order of precedence):

Then the composite is resolved for substitutions and used throughout the pipeline.

Configurations and runtime information required for running Envelope in secure clusters can be found in the security guide.