-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reuse hash #11708
Reuse hash #11708
Conversation
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
SELECT "SearchPhrase", COUNT(*) AS c FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10; | ||
SELECT "SearchPhrase", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY u DESC LIMIT 10; | ||
SELECT "SearchEngineID", "SearchPhrase", COUNT(*) AS c FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "SearchPhrase" ORDER BY c DESC LIMIT 10; | ||
SELECT "UserID", COUNT(*) FROM hits GROUP BY "UserID" ORDER BY COUNT(*) DESC LIMIT 10; | ||
SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no change compared to main
This comment was marked as outdated.
This comment was marked as outdated.
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
hashes_buffer: Default::default(), | ||
rows_buffer, | ||
random_state: Default::default(), | ||
random_state: RandomState::with_seeds(0, 0, 0, 0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With fixed seed to ensure the hash value is the same for the same group values
for (index, hash) in hash_buffer.iter().enumerate() { | ||
indices[(*hash % *partitions as u64) as usize] | ||
let hash_values = batch | ||
.column_by_name("hash_value") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
I wonder if we can make "reusing hash" an optional feature? |
It makes sense to me. And it is possible to make it optional |
If it turns out that reusing hash values isn't a good idea, I think we could close the ticket as "won't do" Another thought, similarly to what @andygrove has been exploring in #5944 (comment) and I am working on in #7957 is to combine what is currently in different operators into a single one Like maybe we could combine repartition and group by 🤔 but that might also be a crazy idea before i have had enough ☕ this morning |
Another point is that generating the hash value is most expensive for string columns -- so maybe we could only reuse hash values when there are string columns as part of the groupby exprs |
Let's find out if combining group + repartition is reasonable, if yes, whether to reuse hash should be trivial since we don't need the overhead of record batch conversion anymore |
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
Closes #11680
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?