Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure consistent current_time across microbatch models in an invocation #10830

Merged
merged 6 commits into from
Oct 15, 2024

Conversation

QMalcolm
Copy link
Contributor

@QMalcolm QMalcolm commented Oct 7, 2024

Resolves #10819

Problem

Different microbatch models in the same dbt run could end up with different current_time. This could cause a situation where two microbatch models operating on the same inputs could end up with more/less data if one microbatch model was executed significantly later than the other.

Solution

Make it so each microbatch model in an invocation is using the same current_time

Checklist

  • I have read the contributing guide and understand what's expected of me.
  • I have run this code in development, and it appears to resolve the stated issue.
  • This PR includes tests, or tests are not required or relevant for this PR.
  • This PR has no interface changes (e.g., macros, CLI, logs, JSON artifacts, config files, adapter interface, etc.) or this PR has already received feedback and approval from Product or DX.
  • This PR includes type annotations for new and modified functions.

@QMalcolm QMalcolm requested a review from a team as a code owner October 7, 2024 17:03
@cla-bot cla-bot bot added the cla:yes label Oct 7, 2024
Copy link

codecov bot commented Oct 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.15%. Comparing base (6b9c1da) to head (9cd6cea).
Report is 13 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10830      +/-   ##
==========================================
- Coverage   89.20%   89.15%   -0.05%     
==========================================
  Files         183      183              
  Lines       23402    23420      +18     
==========================================
+ Hits        20875    20880       +5     
- Misses       2527     2540      +13     
Flag Coverage Δ
integration 86.37% <100.00%> (-0.13%) ⬇️
unit 62.11% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
Unit Tests 62.11% <100.00%> (+<0.01%) ⬆️
Integration Tests 86.37% <100.00%> (-0.13%) ⬇️

core/dbt/materializations/incremental/microbatch.py Outdated Show resolved Hide resolved
core/dbt/config/runtime.py Outdated Show resolved Hide resolved
@@ -18,6 +18,7 @@ def __init__(
is_incremental: bool,
event_time_start: Optional[datetime],
event_time_end: Optional[datetime],
batch_current_time: Optional[datetime] = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be batch_invoked_at for consistency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm down to rename the variable. However, I think batch_invoked_at would give the wrong impression. This datetime is not when the batch was actually invoked. The datetime will be before the batch is actually invoked, by how much is dependent on how long the project takes to run and when the microbatch model is being worked on within that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about batch_default_end_time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or default_end_time? 🤔 Saying batch feels kinda redundant

@@ -18,6 +18,7 @@ def __init__(
is_incremental: bool,
event_time_start: Optional[datetime],
event_time_end: Optional[datetime],
batch_current_time: Optional[datetime] = None,
):
if model.config.incremental_strategy != "microbatch":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we in the right branch for batch_current_time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand, can you elaborate on "in the right branch"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is in reference to batch vs microbatch. For better or worse, we've been kinda using them interchangeably 🙈

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A cursory glance led me to believe that we have two branches – microbatch and default.
If we're not in the microbatch branch, I assume we're in the default branch.
If this is the case, does batch_current_time make sense here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TL;DR

batch_current_time (now renamed to default_end_time) is only relevant to microbatch models


This file (microbatch.py) and the class itself (MicrobatchBuilder) should only ever be invoked when running on a microbatch model. That logic for separating whether we in a "default" (a.k.a. non-microbatch) model or a microbatch model lives in run.py in the execute() method. Thus, if we're in this code (microbatch.py), we're inherently in a microbatch context / logic branch. Hope that all makes sense. I sometimes forget that the rest of the team doesn't automatically have the context of how microbatch models work 😅 Let me know if you want a walk through the microbatch code at large if that'd be useful 🙂

core/dbt/materializations/incremental/microbatch.py Outdated Show resolved Hide resolved
@QMalcolm QMalcolm requested a review from aranke October 7, 2024 19:48
Copy link
Member

@aranke aranke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@QMalcolm QMalcolm merged commit d18f50b into main Oct 15, 2024
61 of 62 checks passed
@QMalcolm QMalcolm deleted the qmalcolm--10819-consistent-current-time-for-microbatch branch October 15, 2024 21:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Set a singular "current_time" for microbatch models per invocation
2 participants