-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid RowConverter for multi column grouping (10% faster clickbench queries) #12269
Conversation
@alamb The approach in this PR is to replace RowConverter and check the equality of the group by values by accessing the certain row and iterate all the group by expressions. The downside is that we need to have type-specific implementation but we could see it outperform Rows by eliminating the cost of I'm thinking of support only primitive, string, datetime those non-nested type. For other less common nested types maybe we just fallback to |
} | ||
|
||
impl<T: ArrowPrimitiveType> ArrayEq for PrimitiveGroupValueBuilder<T> { | ||
fn equal_to(&self, lhs_row: usize, array: &ArrayRef, rhs_row: usize) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
equal_to
and append_val
are two core functions.
equal_to
is to compare the incoming row with the row in group value builder
append_val
is to add row into group value builder
Thanks @jayzhan211 -- I will try and review this over the next day or two (I am catching up from being out last week and I am not back full time until this Thursday) |
} | ||
|
||
for (i, group_val) in group_values_v2.iter().enumerate() { | ||
if !compare_equal(group_val.as_ref(), *group_idx, &cols[i], row) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this is called in a loop, this can be optimized/specialized for certain cases like: do the arrays have any nulls or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get it how could I further optimize the loop based on nulls 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the idea would be change compare_equal to take advantage of cases when, for example, it was known the values couldn't be null (so checking Option
isn't needed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, indeed 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TLDR is I think this is a really neat idea @jayzhan211 -- in essence it seems to me this PR basically changes from Row comparison to Column by column comparison.
The theory with using RowCoverter at first I believe is that:
- It handled all possible types and combinations
- The theory was that time spent creating the Row would be paid back by faster comparisons by avoiding dynamic dispatch.
Your benchmark numbers seem to show different results 👌
I thought about how the performance could be so good and I suppose it does make sense because for most aggregate queries, many of the rows will go into an existing group -- so the cost of copying the input, just to find it isn't needed is outweighted
Also, this doesn't make sense to upstream to Arrow for me, it is group by specific implementation, so we need to maintain this in Datafusion. Would like an early feedback on this approach!
I looked at this and I think we could potentially reuse a lot of what is upstream in arrow-rs 's builders. I left comments
I am running the clickbench benchmarks to see if I can confirm the results. If so, I suggest we try and reuse the builders from arrow-rs as much as possible and see how elegant we can make this PR.
But all in all, really nicely done 👏
let mut group_values_v2 = self | ||
.group_values_v2 | ||
.take() | ||
.expect("Can not emit from empty rows"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a neat optimization as well -- as it saves a copy of the intermediate group values 👍
// } | ||
// } | ||
|
||
pub struct ByteGroupValueBuilderNaive<O> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to my comment above, this looks very similar to the GenericBinaryBuilder in arrow -- it would be great if we could simply reuse that instead of a significant amount of copying 🤔
fn build(self: Box<Self>) -> ArrayRef; | ||
} | ||
|
||
pub struct PrimitiveGroupValueBuilder<T: ArrowPrimitiveType>(Vec<Option<T::Native>>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks very similar to PrimitiveBuilder
in arrow-rs to me https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveBuilder.html (though I think PrimitiveBuilder
is likely faster / handles nulls better)
I wonder if you could implement ArrayEq
for PrimitiveBuilder
using the methods like https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveBuilder.html#method.values_slice
If so I think you would have a very compelling PR here
} | ||
|
||
for (i, group_val) in group_values_v2.iter().enumerate() { | ||
if !compare_equal(group_val.as_ref(), *group_idx, &cols[i], row) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the idea would be change compare_equal to take advantage of cases when, for example, it was known the values couldn't be null (so checking Option
isn't needed)
@@ -35,9 +35,4 @@ SELECT "URL", COUNT(*) AS c FROM hits GROUP BY "URL" ORDER BY c DESC LIMIT 10; | |||
SELECT 1, "URL", COUNT(*) AS c FROM hits GROUP BY 1, "URL" ORDER BY c DESC LIMIT 10; | |||
SELECT "ClientIP", "ClientIP" - 1, "ClientIP" - 2, "ClientIP" - 3, COUNT(*) AS c FROM hits GROUP BY "ClientIP", "ClientIP" - 1, "ClientIP" - 2, "ClientIP" - 3 ORDER BY c DESC LIMIT 10; | |||
SELECT "URL", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "DontCountHits" = 0 AND "IsRefresh" = 0 AND "URL" <> '' GROUP BY "URL" ORDER BY PageViews DESC LIMIT 10; | |||
SELECT "Title", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "DontCountHits" = 0 AND "IsRefresh" = 0 AND "Title" <> '' GROUP BY "Title" ORDER BY PageViews DESC LIMIT 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these queries removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I haven't implement DateTime builder, so couldn't pass the test
TLDR is I ran the benchmarks and it does appear to make a measurable performance improvement on several queries 👍
|
|
9d8dbea
to
5d904a3
Compare
@alamb I found Or maybe we should revert to the previous vector implementation since the performance doesn't differ a lot and it is easier for |
I think using Vec, if possible, is likely to be the best idea as Vec is well understood and highly optimized in Rust (and also has a fast conversion to Arrow arrays) |
@jayzhan211 BTW the results in #12269 (comment) are quite impressive This is a great example of using data driven decisions to come to a better internal architecture / approach (e.g showing with data that using single row comparisons is actually better than using RowCoverter). Very nice 👌 |
Query that compute with Datetime are slower 🤔 |
array: &ArrayRef, | ||
rhs_row: usize, | ||
) -> bool { | ||
array_row.equal_to(lhs_row, array, rhs_row) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nulls are handled in the equal_to
, is there any other place we could further optimize?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One idea I had is that you could defer actually copying the new rows into group_values so rather than calling the function once for each new group, you could call it once per batch, and it could insert all the new values in one function call
That would save some function call overhead as well as the downcasting of arrays and maybe would vectorize better
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have to agree with @alamb really nice PR with impressive performance improvements.
Left some minor comment on things I noticed while looking through the code.
|
||
fn append_val(&mut self, array: &ArrayRef, row: usize) { | ||
// non-null fast path | ||
if !self.nullable || !array.is_null(row) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me it also took sometime when reading to convince my self that ||
was correct. Maybe it would be easier to read if it we swaped the two cases and removed the !
? e.g
if self.nullable && array.is_null(row) {
self.group_values.push(T::default_value());
self.nulls.push(false);
self.has_null = true;
} else {
let elem = array.as_primitive::<T>().value(row);
self.group_values.push(elem);
self.nulls.push(true);
}
} | ||
|
||
fn size(&self) -> usize { | ||
let group_values_size = self.group_values.len(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that size should return allocated size in bytes, so this does not look correct for self.group_values
. I would expect us need to do something like
let group_values_size = self.group_values.iter().map(|column| column.size()).sum()
and add a size method to ArrayRowEq
so it can do some size calculation based on the field.
let len = cols.len(); | ||
let mut v = Vec::with_capacity(len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should give len
a better name like n_cols
or just "inline" it like
let len = cols.len(); | |
let mut v = Vec::with_capacity(len); | |
let mut v = Vec::with_capacity(cols.len()); |
because for me it currently just hurts readability.
let l = self.offsets[lhs_row].as_usize(); | ||
let r = self.offsets[lhs_row + 1].as_usize(); | ||
let existing_elem = unsafe { self.buffer.as_slice().get_unchecked(l..r) }; | ||
existing_elem.len() == rhs_elem.len() && rhs_elem == existing_elem |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to check the length manually? Will that not just be a part of the ==
for slices. (https://doc.rust-lang.org/1.81.0/src/core/slice/cmp.rs.html#59)
existing_elem.len() == rhs_elem.len() && rhs_elem == existing_elem | |
rhs_elem == existing_elem |
fn take_n(&mut self, n: usize) -> ArrayRef { | ||
debug_assert!(self.len() >= n); | ||
|
||
let mut nulls_count = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we never use the result in nulls_count
? Could it just be removed?
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
I think the challenge of processing in batch is that if we got multiple same row, we should push the first one in group values but reject another n-1 ones as duplicated row values. The dependency is not vectorizable, since we need to check them iteratively. |
@alamb @eejbyfeldt Thanks for your review, could take another quick look then merge this PR. TODOs of follow on
|
I am running the benchmarks one more time |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TLDR is this is great. I love it. Thank you so much @jayzhan211
I am sure we can make it better as follow on. Very impressive
🚀
let b = ByteGroupValueBuilder::<i64>::new(OutputType::Binary); | ||
v.push(Box::new(b) as _) | ||
} | ||
dt => todo!("{dt} not impl"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if this was an internal error rather than a panic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in #12620
/// are stored as a zero length string. | ||
offsets: Vec<O>, | ||
/// Null indexes in offsets, if `i` is in nulls, `offsets[i]` should be equals to `offsets[i+1]` | ||
nulls: Vec<usize>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to use Vec rather than Vec<bool>
as used in PrimitiveGroupValueBuilder
?
Maybe as a follow on PR we can explore using BooleanBufferBuilder
from arrow-rs directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not easy to handle take_n
logic with BooleanBufferBuilder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a shot at doing so in #12623 - I agree the take_n
logic was the trickiest
} | ||
} | ||
|
||
fn has_row_like_feature(data_type: &DataType) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit -- but since this list of types needs to remain in sync with GroupValuesColumn
it might make sense to move it into that module too
Like GroupValuesColumn::is_supported
or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in #12620
I plan to spend some time tomorrow morning (my time) perhaps making some tweaks to this code (e,g no panics) and filing a ticket to track follow on work per @jayzhan211 's comment here: #12269 (comment) |
I plan to file a few PRs to try an improve the code a bit before I file additional tickets to add more features. I will update this list: |
…ueries) (apache#12269) * row like group values to avoid rowconverter Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * comment out unused Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * implement to Arrow's builder Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * switch back to vector Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * clippy Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * optimize for non-null Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * use truncate Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix first N bug Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix null check Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fast path null Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix bug Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix error Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * clippy Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * adjust spill mode max mem Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * revert test_create_external_table_with_terminator_with_newlines_in_values Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix null handle bug Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * support binary Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * add binary test Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * use Vec<T> instead of Option<Vec<T>> Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * add test and doc Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * debug assert Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * mv & rename Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix take_n logic Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * address comment Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> --------- Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Which issue does this PR close?
Closes #9403
Rationale for this change
To avoid Row converter in multi group by clause, we add equality check for group values Array.
We can see a improvement on group by query (much more for string types). The downside is that this is type-specific design unlike Rows that covers all the types
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
Benchmark
Query after 37 is removed since DateTime is not supported