# Satisfaction Key Concepts
Here are some of the key concepts of Satisfaction.
A Goal represents a dataset that we wish to create, along with how to generate that dataset. Satisfaction is about the data, and making sure that we can generate that data. The Satisfaction Engine will do everything it can to "satisfy" the Goal.
Satisfaction uses Evidence to find out whether a Goal has actually been satisfied. Most Goals also produce a specific dataset, called a DataOutput, which is a specific form of Evidence. If the Evidence says the Goal has already been satisfied, or the DataOutput says the dataset has already been created, Satisfaction will recognize this and not attempt to re-satisfy the Goal.
Oftentimes the datasets that Satisfaction creates are parameterized by certain Variables. For example, many datasets are generated on a daily or hourly basis, so we often have a different set of data for each period. These time-based Variables are known as TemporalVariables, but they aren't the only type of Variable.
A specific set of values for a set of Variables is known as a Witness. A specific dataset of a DataOutput for a given Witness is a DataInstance. For example, a possible Witness for an hourly job might be ( Variable("dt") -> "20150831", Variable("hour") -> "15" ).
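Concretely, a Witness amounts to a typed map from Variables to values. Here is a minimal sketch in Scala; the Witness constructor shown is an assumption about the API, not a confirmed signature:

```scala
// Sketch only: a Witness as a typed Map from Variables to values.
// The Witness(...) constructor here is an assumed shape of the API.
val witness = Witness(Map(
  Variable("dt")   -> "20150831",   // the daily partition
  Variable("hour") -> "15"          // the hourly partition
))
```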
If you are familiar with Hive, think of a DataOutput as analogous to a partitioned Hive table, and a DataInstance as a specific Hive partition. The Variables are analogous to the table's partition columns, and the Witness would be the values of those columns for the specific partition. Similarly, you could consider a DataOutput to be a top-level folder on some filesystem (like HDFS), and the DataInstance would be some sub-folder containing actual data, whose path would correspond to a specific Witness. Essentially, a Witness is a specific Variable substitution, or a key-value Map, but given its own type to be more explicit.
Each Goal has a set of Variables associated with it, which (usually) is the set of Variables associated with the Goal's DataOutput. An Evidence object will tell you if it has been satisfied for a specific Witness. Given a specific Witness, a DataOutput will be able to produce a specific DataInstance.
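Roughly, the checks described above might read like the following sketch; both method names (isSatisfied and getDataInstance) are illustrative assumptions, not the confirmed API:

```scala
// Illustrative only: isSatisfied and getDataInstance are assumed names.
if (evidence.isSatisfied(witness)) {
  // the Goal is already satisfied for this Witness: nothing to do
} else {
  // run the Goal's Satisfier to produce the data
}

// Given a Witness, a DataOutput yields the corresponding DataInstance.
val instance: DataInstance = dataOutput.getDataInstance(witness)
```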
To actually produce the dataset, some actual process or job needs to be run. Usually this is some sort of Hadoop job, like executing a Hive query, but it can be any sort of batch ETL processing. The object which actually runs the process extends the Satisfier trait, which has a satisfy method that takes a Witness and produces an ExecutionResult, describing whether the process succeeded or failed, along with various properties about the process. Satisfaction comes with Satisfiers for Hive and Hadoop MR, and others are being implemented for technologies like Spark and Spark SQL.
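In outline, the contract looks like the sketch below. The Satisfier trait and its satisfy method come straight from the description above; the simplified case classes are stand-ins for the framework's own types:

```scala
// Simplified stand-ins for the framework's types (sketch only).
case class Variable(name: String)
case class Witness(assignments: Map[Variable, String])

// Reports whether the process succeeded, plus properties about the run.
case class ExecutionResult(isSuccess: Boolean,
                           properties: Map[String, String] = Map.empty)

// The object that actually runs the process for a given Witness.
trait Satisfier {
  def satisfy(witness: Witness): ExecutionResult
}
```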
To be more flexible, a Goal needs a SatisfierFactory rather than just a Satisfier. This allows a Goal to be satisfied for multiple Witnesses at the same time, without sharing any state between them. If one's Satisfier is truly stateless, then a singleton Satisfier can be wrapped with the Goal.SatisfierFactory constructor.
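Continuing the sketch, a SatisfierFactory can be pictured as a function that hands out a Satisfier per satisfaction attempt; the zero-argument function shape is an assumption, but it illustrates why a truly stateless Satisfier can simply be shared:

```scala
// A stateless Satisfier that could safely be shared as a singleton;
// wrapping one like this is what Goal.SatisfierFactory is for.
object NoOpSatisfier extends Satisfier {
  def satisfy(witness: Witness): ExecutionResult =
    ExecutionResult(isSuccess = true)
}

// Sketch: the factory yields a Satisfier for each Witness being satisfied.
val singletonFactory: () => Satisfier = () => NoOpSatisfier
```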
Goals can have dependencies on other Goals. That is to say, one cannot start satisfying a Goal, or create the DataOutput of a Goal for a particular Witness, until the DataOutput from some other Goal has been produced. This is done through the Goal.addDependency() method. (Note that Goals themselves are immutable; Goal.addDependency() returns a new Goal with the dependency, rather than changing the state of some underlying structure. This is pretty standard for immutable Scala objects.)
For any particular Witness, Goal.addDependency adds a dependency on the DataInstance of the dependent Goal for the same Witness. There are times, however, when you want to depend on the DataInstance for a different Witness. For example, a Goal partitioned by dt, representing the "processing date", may depend upon events partitioned by the "transaction date", i.e. when the events actually occurred. Such a Goal would be dependent upon the DataInstance for the previous day.
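As a sketch (the two Goal values here, reportGoal and eventsGoal, are hypothetical; addDependency itself is from the text):

```scala
// addDependency returns a NEW Goal carrying the dependency;
// reportGoal itself is left unchanged, since Goals are immutable.
val reportWithDeps: Goal = reportGoal.addDependency(eventsGoal)

// Depending on the previous day's DataInstance would require mapping
// the Witness (e.g. shifting dt back one day); the API for that
// mapping isn't shown here.
```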
Data Engineers and Data Scientists should organize their jobs and workflows into cohesive units called "Tracks". A Track is a unit of deployment for a project, and should be associated with one or more "Top-Level" DataOutputs (usually one), which is the major result of all those jobs. A Track represents something which can be deployed to Satisfaction.
To create a Track, one would generally create a Git repo for a Scala project, which will contain the DSL for the Track. The developer would usually copy an existing project and modify it. (Eventually we will develop tools to generate skeleton projects from a template, something like Typesafe Activator or giter8.) It would be a standard Scala project, built with sbt, which imports the Satisfaction SBT plugin. More guidance on writing a Track is available on the page [How to create a Track](How to create a Track). The developer can add additional resources needed by the Track under src/main/resources (for example, complicated SQL queries in external files). The developer can also develop application logic in Java or Scala, such as Hive UDFs, Spark functions, or even logic needed to describe a complicated workflow; this can be added alongside the Track code under src/main/scala or src/main/java. Unit tests for this logic can be placed under src/test. (See the layout sketch below.)
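A typical Track project layout, following the description above (the project name is hypothetical; directory names follow standard sbt conventions):

```
my-track/
├── build.sbt             -- standard sbt build, importing the Satisfaction SBT plugin
├── src/main/scala/       -- the Track DSL, plus any Scala application logic
├── src/main/java/        -- optional Java logic, such as Hive UDFs
├── src/main/resources/   -- additional resources, e.g. SQL queries in external files
└── src/test/             -- unit tests for the application logic
```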
When the developer is ready to run the Track, she would run upload from the sbt console. This uploads the Track from the development box to a well-known location on HDFS (or potentially some other shared storage).
One thing to note is that Satisfaction has a notion of versioning baked in from the beginning. In the build.sbt file, the developer specifies a version for the Track, in the standard major.minor.patch format for package releases. The developer can further qualify the version by specifying a "variant" (e.g., building against a specific version of a third-party package, like Hive or Spark, for testing), or a "user" (e.g., "here is a track with changes made by joebob").
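For instance, a build.sbt might carry the version like this sketch (the name is hypothetical, and the exact setting keys for a "variant" or "user" qualifier are not shown in this text):

```scala
// build.sbt (sketch)
name    := "my-track"   // hypothetical Track name
version := "1.2.0"      // major.minor.patch, as described above
// A "variant" or "user" would further qualify this version;
// the precise settings for those are an open assumption here.
```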