Run incremental models piece by piece due to query timeout #1813
Replies: 2 comments
-
This is an interesting topic @franzoni315 🧠 -- thank you for bringing it up! How are you breaking your model into time slices? Are you creating one unique model per month (so 12 different models to span a full year? Or are you breaking it up some other way? As you may have seen already, our current guidance for BigQuery is column-based partitioning and incremental modeling. This guidance replaces the legacy functionality for date-sharded tables in BigQuery. Did you already read through merge behavior (incremental models) ? You might also want to take a look at partitioning by a date or timestamp, clustering keys in your model configuration, and this guide. |
Beta Was this translation helpful? Give feedback.
-
At my last company, we used Redshift but occasionally hit similar issues. I
I'm not sure if you have a model for each time slice (i.e. We made a model that looked like this: with source_data as (
select * from {{ ref('my_table) }}
where not is_deleted
{% if is_incremental() %}
and created_at >= dateadd(day, -1, (select max(date_day) from {{ this }}))
and created_at < dateadd(month, 1, (select max(date_day) from {{ this }}))
{% else %}
and created_at <= '2017-01-01'
{% endif %}
),
final as (
select
[...]
from source_data
)
select * from final Basically this behaves very similarly to insert_by_period. It has the benefit of using native functionality only (insert_by_period feels a bit clunky for me in my conservatism), but the downside that the initial backfill requires a lot of dbt invocations instead of being built in one go. |
Beta Was this translation helpful? Give feedback.
-
I have a large model that we decided to materialize as an incremental table in order to save costs. We use bigquery as our data warehouse and we opted for the pay-as-you-go pricing plan, that has limits for the execution time of the query. Thus, the problem we are dealing with is that the model does not finish to run, because it is huge and bigquery throws a timeout.
We currently have a workaround, which is to break the model into time slices (e.g. monthly) and manually run each piece with the incremental config turned on.
I would like to know if there a procedure that already does this manual processing or if there are better solutions available.
Thanks!
Beta Was this translation helpful? Give feedback.
All reactions