Performance of small sums could be improved #921
Comments
The following PR partially addresses the issue. In the offline discussion, we concluded that providing an overload that takes a single segment size would be preferable. This overload would significantly reduce temporary storage size and improve performance.
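To make the difference between the two interfaces concrete, here is a minimal pure-Python sketch. It is not CUB's API: the function names `segmented_sum_offsets` and `segmented_sum_fixed` are hypothetical and only illustrate that a fixed-size variant needs a single integer where the general variant needs one begin and one end offset per segment.

```python
def segmented_sum_offsets(values, begin_offsets, end_offsets):
    """General form: every segment is described by its own begin/end offsets."""
    return [sum(values[b:e]) for b, e in zip(begin_offsets, end_offsets)]


def segmented_sum_fixed(values, segment_size):
    """Fixed-size form: one integer describes all segments."""
    num_segments = len(values) // segment_size
    return [
        sum(values[i * segment_size:(i + 1) * segment_size])
        for i in range(num_segments)
    ]


values = list(range(1, 13))  # 12 values forming 4 segments of 3
segment_size = 3

# The general form needs two offset arrays, each with one entry per segment.
begin_offsets = list(range(0, len(values), segment_size))
end_offsets = [b + segment_size for b in begin_offsets]

by_offsets = segmented_sum_offsets(values, begin_offsets, end_offsets)
by_size = segmented_sum_fixed(values, segment_size)
```

Both calls compute the same sums; only the description of the segments differs.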
I don't have a concrete high-priority issue that needs solving, but it may be surprising to users that avoiding the CUB segmented sum is much faster here.
The following CuPy code uses CUB by default on newer versions (ensure with CUPY_ACCELERATORS=cub):

Which means a factor of 35 slower than what would be close to optimal.
Now, as a NumPy dev, I accept that NumPy is also still bad at this: by about a factor of 10! CuPy without CUB was good at it, though.
But, maybe there is an easy win here that would remove the surprise of having to rewrite the code.
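As a rough illustration of the kind of rewrite meant here, the sketch below sums over a very short last axis in two ways. The array shape and variable names are assumptions, and NumPy stands in for CuPy so the snippet runs without a GPU; it shows only that the two forms agree, not how they perform.

```python
import numpy as np

# Hypothetical input: many rows, each a tiny "segment" of 3 values.
rng = np.random.default_rng(0)
a = rng.random((100_000, 3))

# Generic reduction over the short last axis. On the GPU, this is the
# kind of call that may be dispatched to a segmented reduction.
reduced = a.sum(axis=-1)

# Manual rewrite: add the columns elementwise, avoiding the reduction.
unrolled = a[:, 0] + a[:, 1] + a[:, 2]
```

With CuPy installed, replacing `import numpy as np` with `import cupy as np` should give the GPU equivalent of both forms, which can then be timed against each other.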