Envelope User Guide

Table of Contents

Fundamentals
Configuration

Fundamentals

Envelope runs a data pipeline as a graph of steps defined by the configuration. A step can name other steps as dependencies, in which case Envelope waits for those steps to be submitted before it runs the step. Steps that have no dependencies run immediately. Steps that share the same dependencies run in parallel.
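The dependency rules above can be modelled with a small sketch. This is a simplified illustration in Python, not Envelope's actual scheduler: the function name `execution_waves` and the step names are hypothetical, and the sketch groups steps into strict "waves" where every step in a wave could run in parallel.

```python
from typing import Dict, List, Set


def execution_waves(dependencies: Dict[str, List[str]]) -> List[List[str]]:
    """Group steps into waves; each step in a wave has all of its dependencies done."""
    remaining = {step: set(deps) for step, deps in dependencies.items()}

    # Reject references to steps that were never defined.
    for step, deps in remaining.items():
        unknown = deps.difference(remaining)
        if unknown:
            raise ValueError(f"{step} depends on undefined steps: {sorted(unknown)}")

    done: Set[str] = set()
    waves: List[List[str]] = []
    while remaining:
        # A step is ready once every dependency has already been processed.
        ready = sorted(s for s, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("dependency cycle among: " + ", ".join(sorted(remaining)))
        waves.append(ready)
        done.update(ready)
        for s in ready:
            del remaining[s]
    return waves


# Hypothetical step names, for illustration only.
pipeline = {
    "orders": [],
    "customers": [],
    "enriched": ["orders", "customers"],
    "daily_totals": ["enriched"],
    "audit": ["enriched"],
}
waves = execution_waves(pipeline)
```

Here `orders` and `customers` have no dependencies and land in the first wave, `enriched` waits for both, and `daily_totals` and `audit` share a dependency and so fall into the same final wave.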

The key mapping of Envelope to Spark is that each Envelope data step is a DataFrame, which is created from either an external input or a derivation of one or more other data steps. Each data step’s DataFrame is registered by the step name as a Spark SQL temporary table so that subsequent queries can read directly from that step.

Each data step can optionally write to an external output. How a step is applied to the output (for example as an append, as an upsert, or to maintain a Type 2 slowly changing dimension) is specified with a planner. How a step groups keys together, which only happens when the planner requires existing records, is specified with a partitioner.

When a data step uses an input with a translator, Envelope adds a second data step that contains the input values that failed translation. This additional step takes the name of the original data step plus the suffix '_errored'.

Envelope can loop over a sub-graph of the pipeline steps by using a loop step. See the looping guide for more on the loop step type.

When at least one of the external input steps is a stream, e.g. Kafka, then the Envelope pipeline becomes a Spark Streaming job and the graph of steps will be executed every micro-batch.

Configuration

Envelope can read Java properties, JSON, and HOCON configuration files.

Each pipeline configuration file must define:

One or more data steps with no dependencies
One or more data steps with an output
One compatible planner for each output

And can additionally define:

One application section for pipeline-level configurations
One or more steps with one or more dependencies
One or more loop steps
Zero or more steps with a partitioner
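The required elements listed above can be expressed as a simple check. The following is a minimal sketch in Python over a plain dictionary, not Envelope's own validation: the function `check_pipeline`, the key names `dependencies`, `output`, and `planner`, and their nesting are assumptions made for illustration, and the sketch does not test whether a planner is compatible with its output.

```python
from typing import Any, Dict, List


def check_pipeline(steps: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of problems; an empty list means the minimum rules are met."""
    problems = []

    # Rule 1: at least one step has no dependencies.
    if not any(not step.get("dependencies") for step in steps.values()):
        problems.append("no step without dependencies")

    # Rule 2: at least one step has an output.
    with_output = [name for name, step in steps.items() if "output" in step]
    if not with_output:
        problems.append("no step with an output")

    # Rule 3: every output has a planner (compatibility is not checked here).
    for name in with_output:
        if "planner" not in steps[name]:
            problems.append(f"step '{name}' has an output but no planner")
    return problems


# Hypothetical pipeline; step names and type values are placeholders.
steps = {
    "raw": {"input": {"type": "kafka"}},
    "cleaned": {
        "dependencies": ["raw"],
        "output": {"type": "example-output"},
        "planner": {"type": "upsert"},
    },
}
ok_problems = check_pipeline(steps)

# The same pipeline with the planner removed fails the third rule.
broken = {name: dict(step) for name, step in steps.items()}
del broken["cleaned"]["planner"]
broken_problems = check_pipeline(broken)
```

The first call returns an empty list, while the second reports that the step `cleaned` has an output but no planner.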

For the full specification of Envelope configurations, see the configuration guide.

The final configuration is constructed with the following layers (in order of precedence):

System environment (see Optional System or ENV variables)
Primary Envelope configuration file
Command line parameters (either as HOCON or JSON; see ConfigFactory.parseString())

Then the composite is resolved for substitutions and used throughout the pipeline.
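The layering and substitution steps can be illustrated with a rough sketch. It assumes that the list above runs from highest to lowest precedence, which the text does not state explicitly, and it models each layer as a flat Python dictionary rather than using the real configuration library. The helper names `merge_layers` and `resolve`, the `${key}` handling, and all keys and values are hypothetical.

```python
import re
from typing import Any, Dict, List


def merge_layers(layers: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge flat key/value layers; earlier layers win over later ones (assumed order)."""
    merged: Dict[str, Any] = {}
    for layer in reversed(layers):  # apply the lowest-precedence layer first
        merged.update(layer)
    return merged


_SUBSTITUTION = re.compile(r"\$\{([^}]+)\}")


def resolve(config: Dict[str, Any]) -> Dict[str, Any]:
    """Replace ${key} references in string values using the merged composite."""

    def expand(value: Any) -> Any:
        if not isinstance(value, str):
            return value
        # Single pass only; a reference to a missing key raises KeyError.
        return _SUBSTITUTION.sub(lambda m: str(config[m.group(1)]), value)

    return {key: expand(value) for key, value in config.items()}


# Hypothetical layers, listed in the same order as the text above.
system_env = {"env": "prod"}
config_file = {"env": "dev", "application.name": "pipeline-${env}", "batch.seconds": 5}
command_line = {"batch.seconds": 10, "topic": "events-${env}"}

final = resolve(merge_layers([system_env, config_file, command_line]))
```

Under these assumptions, `env` resolves to the system environment value, `batch.seconds` comes from the configuration file rather than the command line, and substitutions are resolved only after all layers have been combined, so `topic` from the command line layer still picks up the overriding `env` value.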

Configurations and runtime information required for running Envelope in secure clusters can be found in the security guide.
