Lessons learned from building predicate pushdown #1309

Jstein77 · 2024-06-26T16:17:36Z

Jstein77
Jun 26, 2024
Maintainer

Predicate Pushdown improvements, lessons learned, broader implications - a mostly technical retro

Improvements, in no particular order:

Metric time and time dimension filter pushdown support
Entity pushdown support
Consolidation of time range constraint handling into pushdown optimizer
Consolidation of filter specification onto list input model
Time filter expansion for cumulative windows etc.

Lessons learned:

Predicate pushdown is super hard, because it's all edge cases. I mean, I thought I knew that, but let's never again assume anything about this will be straightforward, because it won't be.
The way we construct Dataflow Plans is fairly robust from the perspective of ensuring that every metric computation is isolated from every other one. This, however, makes DAG optimizations exceedingly difficult to do holistically.
Interactions between optimizers are very difficult to reason about.
There is a fundamental linkage between how we do predicate pushdown, what operations we can support, and how we construct the Dataflow Plan.
There is a real risk of punching the predicate pushdown optimizer through a boundary that it should not traverse, because there is no container node that consistently allows us to say "pushdown can't proceed beyond this point."
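
To make the last point concrete, here is a minimal sketch of what an explicit barrier could look like. Everything in it (`Node`, `is_pushdown_barrier`, `push_down`) is hypothetical and not part of MetricFlow's Dataflow Plan API, and the plan is simplified to a linear chain of nodes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical plan node; `input` points toward the source."""

    name: str
    input: Optional["Node"] = None
    is_pushdown_barrier: bool = False  # assumption: a flag any node can carry
    filters: List[str] = field(default_factory=list)


def push_down(start: Node, predicate: str) -> Node:
    """Move the predicate toward the source, stopping before any barrier node.

    Returns the node that ends up holding the predicate.
    """
    current = start
    while current.input is not None and not current.input.is_pushdown_barrier:
        current = current.input
    current.filters.append(predicate)
    return current


# A toy chain: source <- cumulative window (barrier) <- join <- metric output.
source = Node("read_bookings")
window = Node("cumulative_window", input=source, is_pushdown_barrier=True)
join = Node("join_listings", input=window)
output = Node("compute_metric", input=join)

target = push_down(output, "booking__is_instant")
```

In this toy chain the predicate lands on `join_listings` and never reaches the window node or the source read. The hard part described above is that today there is no single node type that plays this role consistently.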

Broader implications:

It's my opinion that we should prototype a two-stage DataflowPlanBuilder to see if it makes things easier to reason about while producing more readable (and probably more efficient) SQL.

My idea is that the first pass would construct input sources as CTEs. We'd collect all of the elements requested in the query and determine which joins we need to make. We'd build a set of CTE nodes for each distinct denormalized metric source. We can apply a union filter against each CTE (i.e., a WHERE constraint that is effectively a big OR across all of the filters) and push down whatever we can past the join within the CTE.

Note the implicit empty filter - if there is no query filter and one metric requests booking__is_instant while the other does not have a filter, we cannot apply the booking__is_instant filter inside the CTE.
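
The union filter and the implicit empty filter can be sketched as follows. This is only an illustration under stated assumptions: `cte_where_clause` is a hypothetical helper, and filters are represented as plain SQL-like strings rather than MetricFlow's filter specs.

```python
from typing import Dict, List, Optional


def cte_where_clause(metric_filters: Dict[str, List[str]]) -> Optional[str]:
    """Return a WHERE clause that is safe to apply inside a shared CTE.

    Each metric's own filters are ANDed together, and the per-metric
    conditions are then ORed. If any metric has no filter, it needs every
    row, so nothing can be applied inside the CTE and None is returned.
    """
    branches: List[str] = []
    for filters in metric_filters.values():
        if not filters:
            return None  # implicit empty filter: this metric needs all rows
        branches.append("(" + " AND ".join(filters) + ")")
    if not branches:
        return None
    # Drop duplicate branches while keeping their original order.
    unique_branches = list(dict.fromkeys(branches))
    return " OR ".join(unique_branches)


both_filtered = cte_where_clause(
    {"instant_bookings": ["booking__is_instant"], "big_bookings": ["booking__amount > 100"]}
)
one_unfiltered = cte_where_clause(
    {"instant_bookings": ["booking__is_instant"], "bookings": []}
)
```

Here `both_filtered` is an OR of the two conditions, while `one_unfiltered` is `None` because the unfiltered `bookings` metric needs every row. Each metric branch still has to re-apply its own filter downstream, since the CTE only narrows the input to the union of what the branches need.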

Once we have those nodes we can build the metric branches more or less as we do today. They will point at the CTEs instead of the raw measure sources.

At that point we can eliminate or greatly simplify the source scan optimizer. Predicate pushdown also gets easier to reason about, because we can simply apply all of the filters for a metric branch to the CTE input as needed.
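
As a rough picture of the SQL shape this would produce, the sketch below renders one shared CTE and lets each metric branch read from it with its own filter. `render_query`, the table, and the column names are all made up for illustration, and scalar subqueries stand in for the real metric branches, which would group by dimensions and be joined together.

```python
import sqlite3
from typing import List, Optional, Tuple


def render_query(
    cte_name: str,
    cte_sql: str,
    branches: List[Tuple[str, str, Optional[str]]],
) -> str:
    """Render a shared CTE plus one scalar subquery per metric branch.

    Each branch is (metric_name, aggregation_expression, optional_filter).
    """
    columns = []
    for metric_name, agg_expr, where in branches:
        where_sql = f" WHERE {where}" if where else ""
        columns.append(f"  (SELECT {agg_expr} FROM {cte_name}{where_sql}) AS {metric_name}")
    return f"WITH {cte_name} AS (\n  {cte_sql}\n)\nSELECT\n" + ",\n".join(columns)


# Toy data, used only to check that the rendered SQL runs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (booking_id INTEGER, is_instant INTEGER)")
conn.executemany("INSERT INTO bookings VALUES (?, ?)", [(1, 1), (2, 0), (3, 1)])

sql = render_query(
    "bookings_src",
    "SELECT booking_id, is_instant FROM bookings",
    [
        ("instant_bookings", "COUNT(*)", "is_instant = 1"),
        ("bookings", "COUNT(*)", None),
    ],
)
row = conn.execute(sql).fetchone()
```

Both metrics scan `bookings_src` rather than the raw table, so the source is defined once and each branch's filter sits in one obvious place.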

I think if we do this well we could even allow for things like aggregate awareness in a more natural way, because the CTE builder could provide appropriate measure aggregations off of partially-aggregated inputs defined in the semantic manifest (or similar).
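
A minimal sketch of how a CTE builder might choose among partially-aggregated inputs, assuming they were declared somewhere with their measure, available dimensions, and a rough size. `AggregatedInput` and `pick_input` are hypothetical and do not correspond to anything in the semantic manifest today.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class AggregatedInput:
    """Hypothetical description of a partially-aggregated table."""

    table: str
    measure: str
    dimensions: FrozenSet[str]
    row_count: int  # rough size, used only to prefer smaller inputs


def pick_input(
    inputs: List[AggregatedInput],
    measure: str,
    requested_dimensions: FrozenSet[str],
) -> Optional[AggregatedInput]:
    """Pick the smallest input that has the measure and covers every requested dimension.

    Returns None when nothing qualifies, in which case the caller would fall
    back to the raw measure source.
    """
    candidates = [
        i for i in inputs
        if i.measure == measure and requested_dimensions <= i.dimensions
    ]
    return min(candidates, key=lambda i: i.row_count, default=None)


inputs = [
    AggregatedInput("bookings_daily", "bookings", frozenset({"metric_time__day"}), 1_000),
    AggregatedInput(
        "bookings_daily_by_country",
        "bookings",
        frozenset({"metric_time__day", "listing__country"}),
        50_000,
    ),
]
by_day = pick_input(inputs, "bookings", frozenset({"metric_time__day"}))
by_city = pick_input(inputs, "bookings", frozenset({"listing__city"}))
```

This only works for measures whose aggregation can be re-aggregated from partial results, such as SUM, COUNT, MIN, or MAX. Something like COUNT DISTINCT cannot be rolled up this way and would still need the raw source.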

This can be prototyped today off of extensions of what we have in metricflow_semantics. It probably wouldn't be production-viable without the entity graph we keep talking about, but at least we can experiment a little bit.

Jstein77 · 2024-06-26T16:23:02Z

Jstein77
Jun 26, 2024
Maintainer Author

Posting on behalf of @tlento!


Smallhi · 2024-07-07T14:26:53Z

Smallhi
Jul 7, 2024

Following this thread, looks very promising! 👀
