"Pass-through" or "re-defined" metrics #1465

siljamardla · 2024-10-18T12:24:37Z


Disclaimer: the phrasing and descriptions might be a bit messy, but I hope you'll see the main point :)

Feature

Update metric configurations to allow "re-defining" the same metric on top of multiple tables.
Specifically, both definitions mean exactly the same thing and are expected to result in exactly the same number.

When writing MetricFlow queries, it should be possible to manually specify which version of the metric (i.e. which table) to query, but there should also be a default behaviour that prefers the "better" table. What counts as better depends on the context. It might be:

the least granular table that has the appropriate dimensions available, to optimise for performance
the "original" table to optimise for accuracy

Use cases

I have two very frequent use cases in mind: using pre-aggregations and building self-service datasets.

Pre-aggregations

We define metric1 on top of fact_model1 and metric2 on top of fact_model2
We use saved queries and exports to generate a pre-aggregated table with metric1 and metric2 aggregated to date grain (i.e. a table with columns date, metric1, metric2)
We have many people often querying metric1 and metric2 on different time grains
or
We use saved queries and exports to generate a pre-aggregated table with metric1 and metric2 aggregated to city grain
We have many people often querying metric1 and metric2 on country grain and global grain

Instead of trying to cache each different grain or letting people scan the underlying big fact tables, we could explicitly specify that aggregating the column in the export results in exactly the same metric definition and value, with less compute effort.
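To make that claim concrete, here is a minimal Python sketch with made-up rows. The data and helper names are illustrative only, and it assumes metric1 is a plain sum, which is the additive case where rolling up a date-grain export reproduces the value computed from the fact table.

```python
from collections import defaultdict
from datetime import date

# Hypothetical fact rows: (order_date, amount). metric1 is assumed to be sum(amount).
fact_rows = [
    (date(2024, 1, 1), 10),
    (date(2024, 1, 1), 5),
    (date(2024, 1, 2), 7),
    (date(2024, 2, 1), 3),
]

def sum_by(rows, key):
    """Sum the value of each (dimension, value) row, grouped by key(dimension)."""
    totals = defaultdict(int)
    for dim, value in rows:
        totals[key(dim)] += value
    return dict(totals)

def month(d):
    return (d.year, d.month)

# Pre-aggregated export at date grain (what the saved query would materialise).
export_by_date = sum_by(fact_rows, key=lambda d: d)

# Month grain computed two ways: from the fact rows and from the smaller export.
month_from_fact = sum_by(fact_rows, key=month)
month_from_export = sum_by(export_by_date.items(), key=month)

assert month_from_fact == month_from_export
```

This only works because sums can be re-aggregated. A distinct count, for example, cannot be rolled up this way from a pre-aggregated column, so any such feature would need a way to exclude non-additive metrics.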

Building self-service datasets

I have an order table with metric1
I have an order event table (many events per order) with metric2 and metric3
I have an order pricing table (many rows per order) with metric4
I want to keep 100% of metric definitions in dbt metrics, nothing in SQL
Data users are asking for a self-service dataset on order grain

I would produce the self-service dataset with a saved query like this:

saved_queries:
  - name: mart_order
    description: order_id level data mart for self-service usage
    query_params:
      metrics:
        - metric1
        - metric2
        - metric3
        - metric4
      group_by:
        - Entity('order_id')
        - Dimension('order_id__order_attribute1')
        - Dimension('order_id__order_attribute2')

The output of this would be an order_id level table, that has columns with metric1, metric2, metric3, metric4.
By definition the order level data mart has fewer rows than the upstream order event and order pricing tables.

I want my data usage to be based on metrics. So I want people to look at some metrics glossary, find a metric and query it.
As of now, they will always be directed to the underlying fact tables. I would like to send them to this pre-calculated order level table, because it already contains many useful metrics for them, in one table.

Come to think of it, this is like a special case of the pre-aggregations, except the aggregation result is still rather detailed.

Specifications

Far from being fully figured out. If we had something like this:

semantic_models:
  - name: fact_order_event
    model: ref('fact_order_event')
    defaults:
      agg_time_dimension: order_event_created_date
    entities:
      - name: order_event
        expr: order_event_id
        type: primary
      - name: order
        expr: order_id
        type: foreign
      - name: customer
        expr: customer_id
        type: foreign
    measures:
      - name: count_order_events
        expr: 1
        agg: sum
        create_metric: true
    dimensions:
      - name: order_event_created_date
        type: time
        type_params:
          time_granularity: day
      - name: event_type
        type: categorical

metrics:
  # a regular metric defined on top of a fact table
  - name: count_order_rescheduled_events
    label: count_order_rescheduled_events
    type: simple
    type_params:
      measure: count_order_events
    filter: |
      {{ Dimension('order_event__event_type') }} = 'order_rescheduled'

And we would write a saved query for an export:

saved_queries:
  - name: mart_customer_metrics
    description: Saved query to export mart_customer_metrics
    query_params:
      metrics:
        - count_order_rescheduled_events
      group_by:
        - Entity('customer_id')
        - Entity('city_id')

And then define a "pass-through" metric on top of the data mart:

metrics:
  # a pass-through metric defined on top of an export
  - name: count_order_rescheduled_events
    label: count_order_rescheduled_events
    type: pass_through
    type_params:
      metrics: count_order_rescheduled_events
      agg: sum # could define here or pick up from the measure definition in the original metric
      table: mart_customer_metrics # could also call model, export etc

Come to think of it... the imaginary config here only contains one useful piece of information: the export name.
And the same export might have a lot of metrics inside. It would be really tedious to define all these metrics.

So turning this around, it could be as simple as specifying something extra in the saved query / export phase. Something like this:

saved_queries:
  - name: mart_customer_metrics
    description: Saved query to export mart_customer_metrics
    pass_through: enabled
    pass_through_include_metrics:
      # an optional list of metrics for which we would allow pass-through
      - metric1
      - metric2
    pass_through_exclude_metrics:
      # an optional list of metrics for which we would forbid pass-through (e.g. non-additive metrics)
      - metric3
      - metric4
    query_params:
      metrics:
        - count_order_rescheduled_events
      group_by:
        - Entity('customer_id')
        - Entity('city_id')

Based on this, MetricFlow could detect at SQL compilation time that whenever someone queries the count_order_rescheduled_events metric by customer, city, or any attribute of these, the data should be read from the mart_customer_metrics table instead of the underlying fact table(s).
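A rough sketch of the check this implies, in Python. Nothing here is MetricFlow's API: `Export`, `can_serve`, and the field names are hypothetical stand-ins for the imaginary config above. The sketch only handles the case where the requested group-bys are a subset of the export's group-bys. Resolving "any attribute of these" would need extra entity-to-dimension lookups that are not shown.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Export:
    """Hypothetical description of a saved query export with pass-through settings."""
    name: str
    metrics: frozenset
    group_by: frozenset
    pass_through: bool = False
    include_metrics: frozenset = frozenset()  # empty means "no allow-list"
    exclude_metrics: frozenset = frozenset()

def can_serve(export, metric, group_by):
    """Return True if the export could answer `metric` grouped by `group_by`."""
    if not export.pass_through or metric not in export.metrics:
        return False
    if metric in export.exclude_metrics:
        return False
    if export.include_metrics and metric not in export.include_metrics:
        return False
    # The export must contain every requested group-by so it can be rolled up.
    return set(group_by) <= export.group_by

mart = Export(
    name="mart_customer_metrics",
    metrics=frozenset({"count_order_rescheduled_events"}),
    group_by=frozenset({"customer_id", "city_id"}),
    pass_through=True,
)

# A customer-grain query can be rolled up from the customer-city export.
served = can_serve(mart, "count_order_rescheduled_events", ["customer_id"])
```

If `can_serve` returns False for every export, the query would simply fall back to the underlying fact table(s), as it does today.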

Concurrent pass-throughs

What if we have one export at customer-city grain and another at customer-product grain, and we ask for customer grain metrics?
There would have to be some kind of a rule to decide which export to prefer.
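One possible rule, sketched in Python under loud assumptions: among the exports that contain every requested group-by, prefer the one with the fewest rows, then the fewest group-by columns, then the name (for determinism). The export names and row counts below are invented for illustration. A real implementation might use warehouse statistics or an explicit priority setting instead.

```python
def pick_export(candidates, group_by):
    """Pick the cheapest export that can be rolled up to `group_by`.

    `candidates` maps export name -> (set of group-by columns, estimated row count).
    Returns None when no export qualifies, so the caller can fall back to the fact table.
    """
    wanted = set(group_by)
    usable = [
        (rows, len(cols), name)
        for name, (cols, rows) in candidates.items()
        if wanted <= cols
    ]
    return min(usable)[2] if usable else None

# Hypothetical exports and row estimates.
exports = {
    "mart_customer_city": ({"customer_id", "city_id"}, 1_200_000),
    "mart_customer_product": ({"customer_id", "product_id"}, 9_500_000),
}

# Both exports contain customer_id; the smaller one wins under this rule.
chosen = pick_export(exports, ["customer_id"])
```

This optimises for performance. Optimising for accuracy instead would mean always preferring the "original" fact table unless the user opts in.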
