-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor PrimitiveGroupValueBuilder to use MaybeNullBufferBuilder
#12623
Conversation
} else { | ||
let elem = array.as_primitive::<T>().value(row); | ||
self.group_values.push(elem); | ||
self.nulls.push(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems like this previous code manages self.nulls
even when there are and have been no nulls at all in the input
datafusion/physical-plan/src/aggregates/group_values/group_value_row.rs
Outdated
Show resolved
Hide resolved
Marking draft until I have benchmark numbers |
The benchmark numbers look good -- now I just need to debug one test and I will put this up for review |
has_null: bool, | ||
nullable: bool, | ||
/// Null state (when None, input is guaranteed not to have nulls) | ||
nulls: Option<MaybeNullBufferBuilder>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I collapsed nullable
and has_nulls
into an Option
and an Enum
which I think makes the intent much clearer (and harder to mess up the nullable
optimization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My benchmark shows a slowdown compared to the main branch. I experienced a similar slowdown before due to the use of Option, and I suspect that might be the case again with this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bencmarks seem to show a speedup, but I did not compare to just a bool. I'll make that change and rerun
I hacked up a version that checks the flag once per plan rather than once per batch with two separate column implementations in 36a2003. That way there is no Re-running benchmark to see if it makes a difference |
Avoiding the null check at all seems to have made things slightly faster
I think this PR is now ready for final review @jayzhan211 |
MaybeNullBufferBuilder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Thank you @jayzhan211 |
🚀 |
Nice 🚀 |
@@ -62,62 +62,96 @@ pub trait GroupColumn: Send + Sync { | |||
fn take_n(&mut self, n: usize) -> ArrayRef; | |||
} | |||
|
|||
/// Stores a collection of primitive group values which are known to have no nulls | |||
#[derive(Debug)] | |||
pub struct NonNullPrimitiveGroupValueBuilder<T: ArrowPrimitiveType> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think NonNull can be a (boolean) generic parameter (const NonNull: bool
) as well to avoid code duplication...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will look into it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added as a subtask of #12680
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which issue does this PR close?
Follow on to #12269 from @jayzhan211
Part of #12680
Rationale for this change
The initial implementation of the Column-wise group checking code (❤️ ) uses
Vec<bool>
to track null/validity but Arrow has a variety of optimized structures for doing so (BooleanBufferBuilder
, etc)I would like to get this code into great shape before we fill out support for other data types (like StringView and BinaryView)
I also felt that
PrimitiveGroupValueBuilder
could largely be a wrapper aroundPrimitiveBuilder
in arrow which would avoid the code duplication as well as take advantage of Arrow's optimizationsLet's use BooleanBufferBuilder to simplify the null handling in PrimitiveGroupValueBuilder as a start.
What changes are included in this PR?
MaybeNullBufferBuilder
to track nullsPrimitiveGroupValueBuilder
to useMaybeNullBufferBuilder
Planned follow ups:
ByteGroupValueBuilder
to useMaybeNullBufferBuilder
take_n
for the boolean builderAre these changes tested?
By existing CI and manual benchmarks
Are there any user-facing changes?
No
Performance results show significant performance for TPCH (where there are mostly non null values)
Clickbench also appears to have gotten marginally faster (though it is right on the noise margin)