-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-2873] [SQL] using ExternalAppendOnlyMap to resolve OOM when aggregating #2029
Conversation
Can one of the admins verify this patch? |
@marmbrus |
People usually just summarize the benchmark itself and the results in description of the PR. For example: #1439 |
|
it's very sad about the result of benchmark above. the size of AppendOnlyMap is according to the number of keys for values with the same key merged i think it's not a good way by using ExternalAppendOnlyMap,fot it is too expensive when records with the same key spill to disk over and over again. otherwise, user can easily avoid OOM by raising spark.sql.shuffle.partitions to reduce the key numbsers i think the logic of ExternalAppendOnlyMap should Optimize. join seems have similar problems. meanwhile, both left and right table put into ExternalAppendOnlyMap is expensive too |
What were the actual results of the benchmark? It is acceptable for there to be some performance hit here. In cases where there are too many keys, its much better to spill to disk than to OOM, though you have a good point about just adding more partitions. |
Can one of the admins verify this patch? |
I've run a micro benchmark in my local with 50000 records,1500 keys.
|
Thanks for working on this, but we are trying to clean up the PR queue (in order to make it easier for us to review). Thus, I think we should close this issue for now and reopen when its ready for review. |
A new PR clone from PR 1822
Fix numbers of problems
Reuse the CompactBuffer from Spark Core to save memory and pointer dereferences as PR 1993
Hive UDAF not support external aggregate, for hive AggregationBuffer need serializable and hive GenericUDAFEvaluator has no method implement to merge two evaluators