-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add support for Spilling of Distinct Aggregations #7791
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for meta-velox canceled.
|
8941cdc
to
ef8c5f6
Compare
ef8c5f6
to
c3752b8
Compare
7822210
to
641ff8c
Compare
@aditi-pandit Aditi, I see CI failures. Is this PR ready for review? How is it different from #8287 ? |
641ff8c
to
1baf80b
Compare
@mbasmanova : This PR is ready for review. Looking at the test failure. The code in this PR seemed simpler in both the DistinctAggregations and serialization logic for maintaining lengths. Also has more tests for ComplexType serialization. My apologies to @supermem613, this had been in the works for some time and I should've linked it in the issue. |
a50bf39
to
75a6721
Compare
Would you clarify a bit more how this PR works and how it is different / better than the other PR? PR description doesn't have any details and it will take significant amount of time and effort to reverse engineer the design from the code itself. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aditi-pandit Aditi, thank you for adding a description. This is very helpful. It sounds like it would be helpful to extract changes to SetAccumulator and AddressableNonNullValueList into 2 separate PRs.
velox/exec/SetAccumulator.h
Outdated
void addNullIndex(const char* buffer) { | ||
VELOX_CHECK(!nullIndex.has_value()); | ||
vector_size_t serializedNullIndex; | ||
memcpy(&serializedNullIndex, buffer, kVectorSizeT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are InputByteStream and OutputByteStream classes in velox/common/base/IOUtils.h which can be used here to make the code easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was able to use InputByteStream for deserialization functions. But since we random access during serialization(write), I didn't use OutputByteStream.
velox/exec/SetAccumulator.h
Outdated
HashStringAllocator* /*allocator*/) { | ||
VELOX_CHECK(!vector.isNullAt(i)); | ||
|
||
// The serialized value is the nullOffset (kNoNullIndex if no null is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to document the serialization format in one place and avoid repeating it in comments.
velox/exec/SetAccumulator.h
Outdated
// The serialized value is the nullOffset (kNoNullIndex if no null is | ||
// present) followed by the unique values ordered by index. | ||
auto serialized = vector.valueAt(i); | ||
auto size = serialized.size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps, include number of entries in the serialized data for extra safety
6acb842
to
015ac0a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, stamp the wrong PR
…f bytes (#8653) Summary: This is the first in a set of PRs to add support for spilling distinct aggregations (see full version in #7791). Spilling distinct aggregations needs support to spill SetAccumulators in which input values are cumulated. ComplexTypeSetAccumulators use AddressableNonNullValueList to serialize complex types. This PR adds new APIs to AddressableNonNullValueList so that it can copy/append a stream of bytes corresponding to a ComplexType value (array, map, struct). Pull Request resolved: #8653 Reviewed By: Yuhta Differential Revision: D53497000 Pulled By: mbasmanova fbshipit-source-id: 66d44d02a2c3bd5775725c8b8559feaed17c0813
…f bytes (facebookincubator#8653) Summary: This is the first in a set of PRs to add support for spilling distinct aggregations (see full version in facebookincubator#7791). Spilling distinct aggregations needs support to spill SetAccumulators in which input values are cumulated. ComplexTypeSetAccumulators use AddressableNonNullValueList to serialize complex types. This PR adds new APIs to AddressableNonNullValueList so that it can copy/append a stream of bytes corresponding to a ComplexType value (array, map, struct). Pull Request resolved: facebookincubator#8653 Reviewed By: Yuhta Differential Revision: D53497000 Pulled By: mbasmanova fbshipit-source-id: 66d44d02a2c3bd5775725c8b8559feaed17c0813
…f bytes (facebookincubator#8653) Summary: This is the first in a set of PRs to add support for spilling distinct aggregations (see full version in facebookincubator#7791). Spilling distinct aggregations needs support to spill SetAccumulators in which input values are cumulated. ComplexTypeSetAccumulators use AddressableNonNullValueList to serialize complex types. This PR adds new APIs to AddressableNonNullValueList so that it can copy/append a stream of bytes corresponding to a ComplexType value (array, map, struct). Pull Request resolved: facebookincubator#8653 Reviewed By: Yuhta Differential Revision: D53497000 Pulled By: mbasmanova fbshipit-source-id: 66d44d02a2c3bd5775725c8b8559feaed17c0813
…f bytes (facebookincubator#8653) Summary: This is the first in a set of PRs to add support for spilling distinct aggregations (see full version in facebookincubator#7791). Spilling distinct aggregations needs support to spill SetAccumulators in which input values are cumulated. ComplexTypeSetAccumulators use AddressableNonNullValueList to serialize complex types. This PR adds new APIs to AddressableNonNullValueList so that it can copy/append a stream of bytes corresponding to a ComplexType value (array, map, struct). Pull Request resolved: facebookincubator#8653 Reviewed By: Yuhta Differential Revision: D53497000 Pulled By: mbasmanova fbshipit-source-id: 66d44d02a2c3bd5775725c8b8559feaed17c0813
015ac0a
to
c1891b9
Compare
This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions! |
c1891b9
to
d45f971
Compare
@xiaoxmeng, @Yuhta : This is the PR for Spilling of Distinct Aggregations. Lets continue the review here. |
d45f971
to
37751f9
Compare
66c7c1c
to
8739d2c
Compare
8739d2c
to
4c0515d
Compare
HashAggregation supports spilling in general. However, it didn't support spilling for a DISTINCT aggregation. e.g. Like SQL query
SELECT c1, count(DISTINCT c0) FROM tmp GROUP BY c1
Distinct aggregations work by capturing the column parameter values in a SetAccumulator to track only the DISTINCT values. The outer aggregation is computed over the contents of these SetAccumulators.
To support spilling in this context multiple changes are needed:
-- This class needed new methods to copy what appears externally a contiguous set of bytes into a ByteStream to extract a spill from it.
-- A new method to append a stream of previously serialized contents from the spill was also added.
-- Since we need to serialize the length of each ComplexType entry in the SetAccumulator the append method was enhanced as well to return the number of serialized bytes for it as well.