-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure consistent current_time
across microbatch models in an invocation
#10830
Ensure consistent current_time
across microbatch models in an invocation
#10830
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #10830 +/- ##
==========================================
- Coverage 89.20% 89.15% -0.05%
==========================================
Files 183 183
Lines 23402 23420 +18
==========================================
+ Hits 20875 20880 +5
- Misses 2527 2540 +13
Flags with carried forward coverage won't be shown. Click here to find out more.
|
@@ -18,6 +18,7 @@ def __init__( | |||
is_incremental: bool, | |||
event_time_start: Optional[datetime], | |||
event_time_end: Optional[datetime], | |||
batch_current_time: Optional[datetime] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be batch_invoked_at
for consistency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm down to rename the variable. However, I think batch_invoked_at
would give the wrong impression. This datetime is not when the batch was actually invoked. The datetime will be before the batch is actually invoked, by how much is dependent on how long the project takes to run and when the microbatch model is being worked on within that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about batch_default_end_time
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or default_end_time
? 🤔 Saying batch
feels kinda redundant
@@ -18,6 +18,7 @@ def __init__( | |||
is_incremental: bool, | |||
event_time_start: Optional[datetime], | |||
event_time_end: Optional[datetime], | |||
batch_current_time: Optional[datetime] = None, | |||
): | |||
if model.config.incremental_strategy != "microbatch": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we in the right branch for batch_current_time
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand, can you elaborate on "in the right branch"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is in reference to batch
vs microbatch
. For better or worse, we've been kinda using them interchangeably 🙈
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A cursory glance led me to believe that we have two branches – microbatch
and default
.
If we're not in the microbatch
branch, I assume we're in the default
branch.
If this is the case, does batch_current_time
make sense here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TL;DR
batch_current_time
(now renamed to default_end_time
) is only relevant to microbatch models
This file (microbatch.py
) and the class itself (MicrobatchBuilder
) should only ever be invoked when running on a microbatch model. That logic for separating whether we in a "default" (a.k.a. non-microbatch) model or a microbatch model lives in run.py in the execute() method. Thus, if we're in this code (microbatch.py), we're inherently in a microbatch context / logic branch. Hope that all makes sense. I sometimes forget that the rest of the team doesn't automatically have the context of how microbatch models work 😅 Let me know if you want a walk through the microbatch code at large if that'd be useful 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Resolves #10819
Problem
Different microbatch models in the same
dbt run
could end up with differentcurrent_time
. This could cause a situation where two microbatch models operating on the same inputs could end up with more/less data if one microbatch model was executed significantly later than the other.Solution
Make it so each microbatch model in an invocation is using the same
current_time
Checklist