Beyond incrementality: Incrementing Incremental Models #10521
alittlesliceoftom started this conversation in Ideas
Replies: 2 comments
-
NB: https://medium.com/@AtheonAnalytics/supercharging-dbt-how-we-extended-dbts-insert-by-period-to-reduce-snowflake-costs-88384e1538db describes another solution implementing replacement of subsets of data.
-
@alittlesliceoftom this discussion was referenced in #10672! please contribute and share thoughts there!
-
What:
This is a discussion to prompt thinking about how dbt implements 'incremental' materialisations going forward. It builds on previous similar discussions and also references a recent open sourcing of 'insert_by_timeperiod' (IBTP, by me @ M-KOPA), as well as other mentioned variants of the same idea such as 'insert_by_rank' (IBR).
Two key suggestions are developed below.
Prior Discussions:
#4174 - "The insert_by_period materialization should graduate to part of the main project"
Key points:
- insert_by_period is widely used
- The string-replacement filter approach feels hacky (see the sketch after this list)
- Most significantly: is loading many rows into one table the right problem, or is a multi-platform implementation of table sharding more useful?
- Sharding discussion: Feature Request: Sharded Tables #4457
- A request to add a batch reload key to incremental models, similar to what IBP/IBTP does in full refresh: "Allow incremental materialisation to handle batch full refreshes or introduce new materialisation to manage this behaviour" #8096
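For context, this is roughly how insert_by_period is used today (based on the dbt-labs experimental-features package; the source model and column names here are illustrative). The `__PERIOD_FILTER__` token is swapped out by string replacement inside the materialization, which is the part that feels hacky:

```sql
{{
  config(
    materialized = 'insert_by_period',  -- materialization from dbt-labs experimental features
    period = 'day',                     -- grain of each chunk
    timestamp_field = 'created_at',     -- column used to slice the build into periods
    start_date = '2020-01-01',
    stop_date = '2021-01-01'
  )
}}

select *
from {{ ref('events') }}   -- illustrative source model
-- __PERIOD_FILTER__ is replaced via string substitution with a predicate
-- on timestamp_field covering the period currently being built
where __PERIOD_FILTER__
```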
Why:
Most data that we load involves a stream of incoming data along some time dimension, a stream of sales or payments for example. To avoid re-processing all data that comes in, dbt advises using incremental models. But incremental logic is hard to get right, particularly (as Tristan lays out) when you have late-arriving facts, or more simply, mistakes in underlying data that are backfilled outside the incremental window.
Incremental models enable engineers to ignore data that has already been processed, but they don't enable you to backfill updates, or to backfill chunk by chunk. As an engineer you write code that must build one table covering 10 years of history, then insert one day at the end of it every day. The scale of these two steps is so different that on some warehouses (I have a lot of experience of MS Synapse) the 10-year step becomes a challenge.
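To make the asymmetry concrete, here is a minimal sketch of a standard incremental model (model and column names are illustrative): a full refresh rebuilds all 10 years in a single query, while day-to-day runs only touch rows newer than the current maximum.

```sql
{{ config(materialized='incremental', unique_key='payment_id') }}

select
    payment_id,
    customer_id,
    created_at,
    amount
from {{ ref('stg_payments') }}  -- illustrative staging model

{% if is_incremental() %}
  -- Only rows newer than what is already in the target table;
  -- a full refresh has to rebuild the entire history in one query.
  where created_at > (select max(created_at) from {{ this }})
{% endif %}
```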
Demand
We know that demand for more advanced incrementalism is high. At M-KOPA we have built an internal (now open-sourced) implementation called insert_by_timeperiod that has replaced many of our incremental models. This may partly be driven by our warehouse (Synapse), but the fact that IBP is the main source of dev on dbt-experimental-features points to more demand.
Learnings from Implementing IBTP
There are some real challenges arising from turning a single query into multiple queries.
Abstracting To: "Chunked Table Builder"
As mentioned in prior work, IBP/IBTP/IBR is similar to the idea of sharding out tables.
There's also the option of 'chunksize' and 'chunkgrain' configs that could be applied to both table and incremental materialisations.
A possible generalised feature set (a hypothetical config sketch follows the list):
- segmentation key (date in IBP/IBTP) in order to separate the table build into multiple batches (similar to e.g. batched inserts in pandas.DataFrame.to_sql())
- segmentation key grain, e.g. day vs week
- segmentation key beyond dates only
- backfill an arbitrary segmentation key range without needing to full refresh the model
- window functions (i.e. allow the select range to be bigger than the write range)
- create_table() macro: we just need to pass the right info to that, as long as the init calls the create table and passes args
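As a purely hypothetical sketch (none of these config keys exist in dbt today; `chunk_key`, `chunk_grain` and `lookback` are invented names, and the model/columns are illustrative), a generalised "chunked table builder" could look something like this:

```sql
{{
  config(
    materialized = 'table',         -- could equally apply to 'incremental'
    chunk_key    = 'payment_date',  -- hypothetical: segmentation key that splits the build into batches
    chunk_grain  = 'week',          -- hypothetical: grain of each batch (day vs week)
    lookback     = '3 days'         -- hypothetical: read range bigger than the write range
  )
}}

select
    payment_id,
    customer_id,
    payment_date,
    amount,
    -- window function: needs the read range to extend beyond the chunk being written
    sum(amount) over (partition by customer_id order by payment_date) as running_total
from {{ ref('stg_payments') }}      -- illustrative staging model
```

dbt would then own the loop: build the table chunk by chunk at the configured grain, read a slightly wider range than it writes to support window functions and late-arriving data, and allow an arbitrary chunk range to be rebuilt without a full refresh.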
Options:
- Core Implementation
- As Packages