Run incremental models piece by piece due to query timeout #1813

franzoni315 · 2022-07-29T15:27:58Z

franzoni315
Jul 29, 2022

I have a large model that we decided to materialize as an incremental table in order to save costs. We use bigquery as our data warehouse and we opted for the pay-as-you-go pricing plan, that has limits for the execution time of the query. Thus, the problem we are dealing with is that the model does not finish to run, because it is huge and bigquery throws a timeout.

We currently have a workaround, which is to break the model into time slices (e.g. monthly) and manually run each piece with the incremental config turned on.

I would like to know if there a procedure that already does this manual processing or if there are better solutions available.

Thanks!

dbeatty10 · 2022-08-01T16:47:01Z

dbeatty10
Aug 1, 2022
Maintainer

This is an interesting topic @franzoni315 🧠 -- thank you for bringing it up!

How are you breaking your model into time slices? Are you creating one unique model per month (so 12 different models to span a full year? Or are you breaking it up some other way?

As you may have seen already, our current guidance for BigQuery is column-based partitioning and incremental modeling.

This guidance replaces the legacy functionality for date-sharded tables in BigQuery.

Did you already read through merge behavior (incremental models) ?

You might also want to take a look at partitioning by a date or timestamp, clustering keys in your model configuration, and this guide.

0 replies

joellabes · 2022-08-01T22:32:22Z

joellabes
Aug 1, 2022
Maintainer

At my last company, we used Redshift but occasionally hit similar issues. I ~~solved~~ worked around it by building an incremental model which only processed one month at a time. Based on what you said above

break the model into time slices (e.g. monthly) and manually run each piece with the incremental config turned on

I'm not sure if you have a model for each time slice (i.e. my_model_2022_01, my_model_2022_02 ...), or if you've got the same approach as me so apologies in advance if this is just the same strategy as you're already following!

We made a model that looked like this:

with source_data as (
    select * from {{ ref('my_table) }}
    where not is_deleted
    {% if is_incremental() %}
        and created_at >= dateadd(day, -1, (select max(date_day) from {{ this }}))
        and created_at < dateadd(month, 1, (select max(date_day) from {{ this }})) 
    {% else %}
        and created_at <= '2017-01-01'
    {% endif %}
),

final as (
    select 
        [...]
    from source_data
)

select * from final

Basically this behaves very similarly to insert_by_period. It has the benefit of using native functionality only (insert_by_period feels a bit clunky for me in my conservatism), but the downside that the initial backfill requires a lot of dbt invocations instead of being built in one go.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Run incremental models piece by piece due to query timeout #1813

{{title}}

Replies: 2 comments

{{title}}

{{title}}

Select a reply

Run incremental models piece by piece due to query timeout #1813

franzoni315 Jul 29, 2022

Replies: 2 comments

dbeatty10 Aug 1, 2022 Maintainer

joellabes Aug 1, 2022 Maintainer

franzoni315
Jul 29, 2022

dbeatty10
Aug 1, 2022
Maintainer

joellabes
Aug 1, 2022
Maintainer