Beyond incrementality: Incrementing Incremental Models #10521
alittlesliceoftom started this conversation in Ideas
Replies: 2 comments
-
NB: https://medium.com/@AtheonAnalytics/supercharging-dbt-how-we-extended-dbts-insert-by-period-to-reduce-snowflake-costs-88384e1538db describes another solution implementing replacement of subsets of data.
-
@alittlesliceoftom this discussion was referenced in #10672! please contribute and share thoughts there!
-
What:
This is a discussion to prompt thinking about how dbt implements 'incremental' materialisations going forward. It builds on previous similar discussions and also references a recent open sourcing of 'insert_by_timeperiod' (IBTP, by me @ M-KOPA), as well as other mentioned variants of the same idea such as 'insert_by_rank' (IBR).
Two key suggestions are developed below.
Prior Discussions:
#4174 - "The insert_by_period materialization should graduate to part of the main project"
Key points:
- insert_by_period is widely used
- The string-replacement filter approach feels hacky (see the sketch after this list)
- Most significantly: is loading many rows into one table the right problem, or is a multi-platform implementation of table sharding more useful?
- Sharding discussion: Feature Request: Sharded Tables #4457
- A request to add a batch reload key to incremental models, similar to what IBP/IBTP does in full refresh: "Allow incremental materialisation to handle batch full refreshes or introduce new materialisation to manage this behaviour" #8096
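For context, this is roughly how insert_by_period is used today (based on the dbt-labs experimental-features package; the source model and column names here are illustrative). The `__PERIOD_FILTER__` token is swapped out by string replacement inside the materialization, which is the part that feels hacky:

```sql
{{
  config(
    materialized = 'insert_by_period',  -- materialization from dbt-labs experimental features
    period = 'day',                     -- grain of each chunk
    timestamp_field = 'created_at',     -- column used to slice the build into periods
    start_date = '2020-01-01',
    stop_date = '2021-01-01'
  )
}}

select *
from {{ ref('events') }}   -- illustrative source model
-- __PERIOD_FILTER__ is replaced via string substitution with a predicate
-- on timestamp_field covering the period currently being built
where __PERIOD_FILTER__
```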
Why:
Most data that we load involves a stream of incoming data along some time dimension, a stream of sales or payments for example. To avoid re-processing all data that comes in, dbt advises using incremental models. But incremental logic is hard to get right, particularly (as Tristan lays out) when you have late-arriving facts, or more simply, mistakes in underlying data that are backfilled outside the incremental window.
Incremental models enable engineers to ignore data that has already been processed, but they don't enable you to backfill updates, or to backfill chunk by chunk. As an engineer you write code that must build one table covering 10 years of history, then insert one day at the end of it every day. The scale of these two steps is so different that on some warehouses (I have a lot of experience of MS Synapse) the 10-year step becomes a challenge.
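To make the asymmetry concrete, here is a minimal sketch of a standard incremental model (model and column names are illustrative): a full refresh rebuilds all 10 years in a single query, while day-to-day runs only touch rows newer than the current maximum.

```sql
{{ config(materialized='incremental', unique_key='payment_id') }}

select
    payment_id,
    customer_id,
    created_at,
    amount
from {{ ref('stg_payments') }}  -- illustrative staging model

{% if is_incremental() %}
  -- Only rows newer than what is already in the target table;
  -- a full refresh has to rebuild the entire history in one query.
  where created_at > (select max(created_at) from {{ this }})
{% endif %}
```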
Demand
We know that demand for more advanced incrementalism is high. At M-KOPA we have built an internal (now open-sourced) implementation called insert_by_timeperiod that has replaced many of our incremental models. This may partly be driven by our warehouse (Synapse), but the fact that IBP is the main source of dev on dbt-experimental-features points to more demand.
Learnings from Implementing IBTP
There are some real challenges arising from turning a single query into multiple queries.
Abstracting To: "Chunked Table Builder"
As mentioned in prior work, IBP/IBTP/IBR is similar to the idea of sharding out tables.
There's also the option of 'chunksize' and 'chunkgrain' configs that could be applied to both table and incremental materialisations.
A possible generalised feature set (a hypothetical config sketch follows the list):
- segmentation key (date in IBP/IBTP) in order to separate the table build into multiple batches (similar to e.g. batched inserts in pandas.DataFrame.to_sql())
- segmentation key grain, e.g. day vs week
- segmentation key beyond dates only
- backfill an arbitrary segmentation key range without needing to full refresh the model
- window functions (i.e. allow the select range to be bigger than the write range)
- create_table() macro: we just need to pass the right info to that, as long as the init calls the create table and passes args
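As a purely hypothetical sketch (none of these config keys exist in dbt today; `chunk_key`, `chunk_grain` and `lookback` are invented names, and the model/columns are illustrative), a generalised "chunked table builder" could look something like this:

```sql
{{
  config(
    materialized = 'table',         -- could equally apply to 'incremental'
    chunk_key    = 'payment_date',  -- hypothetical: segmentation key that splits the build into batches
    chunk_grain  = 'week',          -- hypothetical: grain of each batch (day vs week)
    lookback     = '3 days'         -- hypothetical: read range bigger than the write range
  )
}}

select
    payment_id,
    customer_id,
    payment_date,
    amount,
    -- window function: needs the read range to extend beyond the chunk being written
    sum(amount) over (partition by customer_id order by payment_date) as running_total
from {{ ref('stg_payments') }}      -- illustrative staging model
```

dbt would then own the loop: build the table chunk by chunk at the configured grain, read a slightly wider range than it writes to support window functions and late-arriving data, and allow an arbitrary chunk range to be rebuilt without a full refresh.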
Options:
- Core Implementation
- As Packages