(This is work-in-progress code that @JoshRosen wrote in grad school).
This is an experimental compiler for a new language based on Bloom.
The command line interface to the Bloom compiler is implemented by the `Compiler` object. During development, the easiest way to run the compiler is through `sbt run`; use `sbt "run --infile <sourcefile> --target [dataflow|rxflow]"` to compile `sourcefile` and generate either a GraphViz `.dot` file representing a dataflow graph (for the `dataflow` target) or JavaScript code (for the `rxflow` target).

To build a binary distribution of the compiler, use the sbt-native-packager sbt targets; for example, running `sbt stage` will stage a binary distribution in the `compiler/target/universal/stage` directory, where you can use `./compiler/target/universal/stage/bin/bloom-compiler` to run the compiler.
We don't have a language reference yet, but in the meantime there are some example programs that give a flavor of the language.
To compile a Bloom program, the compiler executes the following phases:
- Parse source
- Resolve names
- Assign types
- Analyze dependencies
- Stratify collections and rules
- Generate intermediate dataflow graph representation
- Calculate invalidation and rescan sets
- Emit source code to instantiate dataflow on the target runtime
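The phases above run as a straight pipeline, each phase consuming the previous phase's result. The compiler itself is written in Scala, so the JavaScript sketch below is purely illustrative: every function name and stub body is hypothetical, and only the phase ordering comes from the list above.

```javascript
// Illustrative only -- the real compiler is Scala. Each phase here is a
// hypothetical stub that threads a "program" object through the pipeline.
const phases = [
  function parseSource(p)        { return { ...p, ast: '(parsed)' }; },
  function resolveNames(p)       { return { ...p, names: 'resolved' }; },
  function assignTypes(p)        { return { ...p, types: 'assigned' }; },
  function analyzeDependencies(p){ return { ...p, deps: [] }; },
  function stratify(p)           { return { ...p, strata: [[]] }; },
  function buildDataflow(p)      { return { ...p, dataflow: {} }; },
  function invalidationAnalysis(p){ return { ...p, rescanSets: {} }; },
  function emitCode(p)           { return { ...p, output: '// generated code' }; },
];

function compile(source) {
  return phases.reduce((prog, phase) => phase(prog), { source });
}

// Usage: each phase's result accumulates on the program object.
const result = compile('table link, [from, to];');
```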
A goal of this implementation is to maintain a clean separation of concerns between the different compilation and analysis components. For example, the stratification logic is implemented in two files, one that contains all of the dependency analysis code, and another that declaratively describes the stratification in terms of dependencies.
Several components of the compiler are implemented using Kiama. We use its positional parser utilities, lazy attribute grammars, and tree-rewriting systems to address "the AST typing problem".
The runtime-specific dataflow code generators try to isolate the Bloom-specific code to a few operators that sit at the edges of the dataflow, plus some orchestration code that sits outside the dataflow. Because we avoid imposing Bloom-specific requirements on the dataflow operators, we can re-use existing dataflow systems by implementing at most one or two new operators.
The RxFlow backend translates a Bloom program into a JavaScript object that encapsulates a dataflow and exposes RxJS Observer and Observable interfaces for integration with external data sources.
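As a rough sketch of the shape of such a generated object: the names below (`makeLinksProgram`, `linkIn`, `linkOut`) are invented for illustration, and plain callback objects stand in for the RxJS Observer and Observable interfaces so the example stays self-contained.

```javascript
// Hand-written sketch of the *shape* of RxFlow-generated code; all names are
// hypothetical. Plain callbacks stand in for RxJS Observer/Observable.
function makeLinksProgram() {
  const subscribers = [];
  const link = new Set();  // backing state for a 'link' collection
  return {
    // Observer-like input side: external sources push tuples in.
    linkIn: {
      onNext(tuple) {
        const key = JSON.stringify(tuple);
        if (!link.has(key)) {
          link.add(key);
          subscribers.forEach(cb => cb(tuple));  // push output downstream
        }
      }
    },
    // Observable-like output side: external sinks subscribe to results.
    linkOut: {
      subscribe(cb) { subscribers.push(cb); }
    }
  };
}

// Usage:
const prog = makeLinksProgram();
const seen = [];
prog.linkOut.subscribe(t => seen.push(t));
prog.linkIn.onNext(['a', 'b']);
prog.linkIn.onNext(['a', 'b']);  // duplicate: set semantics suppress it
```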
This description is out of date, now that I'm moving to punctuation-based control flow:
To execute a single logical timestep of a Bloom program, an instance of the RxFlow JavaScript runtime performs the following actions:
- For all tables, check whether they have pending deletions; perform those deletions and record whether any records were actually deleted. Recording this is necessary because deletions may try to remove nonexistent tuples, in which case nothing actually changes.
- Determine the set of invalidated elements: for any table that performed deletions, look up the set of invalidated elements by indexing into a static structure generated by the compiler's invalidation analysis module.
- For each invalidated dataflow element, perform the appropriate calls to clear its cached state.
- In order, execute each stratum's dataflow until it reaches a fixed point:
  - To initiate the computation of a stratum, rescan the invalidated tables, then perform any pending insertions that were deferred in the previous timestep.
  - To detect termination:
    - Acyclic dataflow sources can emit "end-of-round" messages once they have produced all of their outputs; these messages propagate downstream until the dataflow sinks have finished the round.
    - For cyclic dataflows, we rely on the fact that every cycle includes either a scratch or a table; we can buffer insertions into these tables and call `flush()` on each table to initiate another cycle. If a round passes in which no tuples were flushed, we declare termination. This effectively breaks the cycles and replaces them with iterative execution of an acyclic dataflow; it is possible to implement a completely asynchronous, cyclic dataflow, but that requires more complex termination detection techniques.
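The (now somewhat outdated) timestep loop above can be sketched as follows. This is a toy model, not the generated runtime: `Table`, `tick`, and the `stratum.process` hook are all invented here, with pending-insertion/-deletion buffers and a flush-until-no-progress loop per stratum.

```javascript
// Toy sketch of one logical timestep; all names are hypothetical.
class Table {
  constructor() { this.data = new Set(); this.pendingIns = []; this.pendingDel = []; }
  applyDeletions() {
    let deleted = false;
    for (const t of this.pendingDel) {
      if (this.data.delete(t)) deleted = true;  // deleting a nonexistent tuple is a no-op
    }
    this.pendingDel = [];
    return deleted;  // record whether anything was actually removed
  }
  flush() {
    const flushed = this.pendingIns.filter(t => !this.data.has(t));
    flushed.forEach(t => this.data.add(t));
    this.pendingIns = [];
    return flushed;  // tuples newly inserted this round
  }
}

// invalidationSets: static map from table name -> dataflow elements to clear,
// standing in for the structure produced by the invalidation analysis module.
function tick(tables, strata, invalidationSets) {
  const invalidated = new Set();
  for (const [name, table] of Object.entries(tables)) {
    if (table.applyDeletions()) {
      (invalidationSets[name] || []).forEach(el => invalidated.add(el));
    }
  }
  invalidated.forEach(el => el.clear());  // drop cached state
  for (const stratum of strata) {
    let progress = true;
    while (progress) {  // run this stratum's dataflow to fixed point
      progress = false;
      for (const table of stratum.tables) {
        const flushed = table.flush();
        if (flushed.length > 0) {
          progress = true;
          flushed.forEach(t => stratum.process(t, tables));  // feed new tuples in
        }
      }
    }
  }
}

// Usage: one table; the stratum's rule derives t+1 for t < 3 (toy fixpoint).
const t = new Table();
t.pendingIns.push(0);
const stratum = {
  tables: [t],
  process(tuple) { if (tuple < 3) t.pendingIns.push(tuple + 1); }
};
tick({ t }, [stratum], {});
// t.data now contains 0, 1, 2, 3
```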
In this architecture, the `Table` element is somewhat Bloom-specific, but the other dataflow elements are completely generic; they only need to support a `clear`/`reset` method to clear cached state and possibly an end-of-round mechanism to cause buffering operators to emit all of their outputs.
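A minimal version of that generic element contract might look like the following. The class and method names are assumptions for illustration: `clear` resets cached state, and `endOfRound` stands in for the end-of-round mechanism that makes a buffering operator emit its held outputs.

```javascript
// Sketch of the generic dataflow-element contract described above; only
// clear() and endOfRound() are the Bloom-agnostic requirements. All names
// here are illustrative, and the "join" is a stub that just buffers tuples.
class BufferingJoinStub {
  constructor(downstream) {
    this.cache = [];              // cached state built up during a round
    this.buffer = [];             // outputs held back until end-of-round
    this.downstream = downstream; // callback to the next element
  }
  insert(tuple) {
    this.cache.push(tuple);
    this.buffer.push(tuple);      // a real join would emit matches here
  }
  clear() {                       // invalidation: drop all cached state
    this.cache = [];
    this.buffer = [];
  }
  endOfRound() {                  // flush buffered outputs downstream
    this.buffer.forEach(t => this.downstream(t));
    this.buffer = [];
    this.downstream('end-of-round');  // propagate the punctuation
  }
}

// Usage:
const out = [];
const el = new BufferingJoinStub(t => out.push(t));
el.insert(1);
el.insert(2);
el.endOfRound();
// out: [1, 2, 'end-of-round']
```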