Envelope runs a data pipeline as a graph of steps that are defined by the configuration. Each step can specify other steps as dependencies so that Envelope will wait for those steps to be submitted first. Steps that have no dependencies will run immediately. Steps that have common dependencies will run in parallel.
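The dependency rules above can be illustrated with a small, self-contained sketch. It is not Envelope's scheduler: the function `execution_waves` and the step names are hypothetical, and grouping steps into discrete "waves" is a simplification of waiting for each step's dependencies to be submitted.

```python
from typing import Dict, List


def execution_waves(dependencies: Dict[str, List[str]]) -> List[List[str]]:
    """Group steps into waves: each step's dependencies sit in earlier waves.

    Hypothetical helper for illustration only, not part of Envelope.
    """
    all_steps = set(dependencies)
    for step, deps in dependencies.items():
        unknown = set(deps) - all_steps
        if unknown:
            raise ValueError(f"step '{step}' depends on unknown steps: {sorted(unknown)}")

    remaining = {step: set(deps) for step, deps in dependencies.items()}
    done = set()
    waves = []
    while remaining:
        # A step is ready once every one of its dependencies has been handled.
        ready = sorted(step for step, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        done.update(ready)
        for step in ready:
            del remaining[step]
    return waves


# Assumed example pipeline: two independent inputs feeding derived steps.
pipeline = {
    "orders": [],
    "customers": [],
    "enriched": ["orders", "customers"],
    "daily_totals": ["enriched"],
    "audit": ["enriched"],
}
waves = execution_waves(pipeline)
```

In this example the two steps with no dependencies form the first wave, and `daily_totals` and `audit` share the dependency `enriched`, so they land in the same wave and could run in parallel.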
The key mapping of Envelope to Spark is that each Envelope data step is a DataFrame, which is created from either an external input or a derivation of one or more other data steps. Each data step’s DataFrame is registered by the step name as a Spark SQL temporary table so that subsequent queries can read directly from that step.
Each data step can optionally write to an external output. The way that a step is applied to the output, for example to append, to upsert, or to maintain a Type 2 slowly changing dimension (SCD), is specified with a planner. The way that a step groups keys together (which only occurs when the planner requires existing records) is specified with a partitioner.
Each data step that uses an input with a translator will add an additional data step that contains the input values that errored on translation. The additional step has the same name as the original data step plus the suffix '_errored'.
Envelope can loop over a sub-graph of the pipeline steps by using a loop step. See the looping guide for more on the loop step type.
When at least one of the external input steps is a stream, e.g. Kafka, the Envelope pipeline becomes a Spark Streaming job and the graph of steps is executed every micro-batch.
Envelope can read Java properties, JSON, and HOCON configuration files.
Each pipeline configuration file must define:
- One or more data steps with no dependencies
- One or more data steps with an output
- One compatible planner for each output
And can additionally define:
- One application section for pipeline-level configurations
- One or more steps with one or more dependencies
- One or more loop steps
- One or more steps with a partitioner
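As a rough illustration of the three mandatory items, the sketch below checks a dictionary of steps against them. The function `check_required_sections` and the key names `dependencies`, `output`, and `planner` are assumptions made for this example; the configuration guide defines the real layout. The sketch only checks that a planner is present and does not check that it is compatible with the output.

```python
from typing import Any, Dict, List


def check_required_sections(steps: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of problems; an empty list means the three rules hold.

    Hypothetical helper: the key names used here are assumptions.
    """
    problems = []

    # Rule 1: at least one data step with no dependencies.
    if not any(not step.get("dependencies") for step in steps.values()):
        problems.append("no data step without dependencies")

    # Rule 2: at least one data step with an output.
    with_output = [name for name, step in steps.items() if "output" in step]
    if not with_output:
        problems.append("no data step with an output")

    # Rule 3: a planner for each output (compatibility is not checked here).
    for name in with_output:
        if "planner" not in steps[name]:
            problems.append(f"step '{name}' has an output but no planner")

    return problems


# Assumed example: 'clean' writes to an output but has no planner yet.
steps = {
    "raw": {"input": {"type": "example-input"}},
    "clean": {"dependencies": ["raw"], "output": {"type": "example-output"}},
}
problems_before = check_required_sections(steps)

steps["clean"]["planner"] = {"type": "example-planner"}
problems_after = check_required_sections(steps)
```

The first call reports the missing planner for `clean`; after a planner is added, the second call returns an empty list.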
For the full specification of Envelope configurations, see the configuration guide documentation page.
The final configuration is constructed with the following layers (in order of precedence):
- System environment (see Optional System or ENV variables)
- Primary Envelope configuration file
- Command line parameters (either as HOCON or JSON; see ConfigFactory.parseString())
Then the composite is resolved for substitutions and used throughout the pipeline.
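The layer-then-resolve idea can be sketched in a few lines. This is not Envelope's implementation: `compose` is a hypothetical function, configurations are simplified to flat string dictionaries, and only HOCON-style `${key}` substitutions are handled. The sketch does not decide which layer wins in Envelope; it takes the layers ordered from lowest to highest precedence and leaves that ordering to the caller.

```python
import re
from typing import Dict, List

SUBSTITUTION = re.compile(r"\$\{([^}]+)\}")


def compose(layers_low_to_high: List[Dict[str, str]]) -> Dict[str, str]:
    """Merge layers (later ones override earlier ones), then resolve ${key} references.

    Hypothetical helper for illustration; real HOCON resolution is richer.
    """
    merged: Dict[str, str] = {}
    for layer in layers_low_to_high:
        merged.update(layer)

    def resolve(value: str, seen: frozenset) -> str:
        def replace(match):
            key = match.group(1)
            if key in seen:
                raise ValueError(f"substitution cycle involving '{key}'")
            if key not in merged:
                raise KeyError(f"unresolved substitution '{key}'")
            return resolve(merged[key], seen | {key})

        return SUBSTITUTION.sub(replace, value)

    # Substitutions are resolved only after all layers have been merged.
    return {key: resolve(value, frozenset({key})) for key, value in merged.items()}


# Assumed example layers: 'override' takes precedence over 'base'.
base = {"base.dir": "/tmp/default", "input.path": "${base.dir}/in"}
override = {"base.dir": "/data/prod"}
resolved = compose([base, override])
```

Because resolution happens after merging, `input.path` picks up the overridden `base.dir` even though the substitution was written in the lower-precedence layer.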
Configurations and runtime information required for running Envelope in secure clusters can be found in the security guide.