[CUDA][HIP] Fix bug in guess local worksize funcs and improve local worksize guessing in HIP adapter #1326
DPC++ PR: intel/llvm#12663
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1326 +/- ##
==========================================
- Coverage 14.82% 12.46% -2.36%
==========================================
Files 250 239 -11
Lines 36220 36080 -140
Branches 4094 4094
==========================================
- Hits 5369 4498 -871
- Misses 30800 31578 +778
+ Partials 51 4 -47
Looks good to me! 👍
Non-blocker, just a suggestion.
It may be worth adding "tidies up the HIP adapter version of this func" to the commit message, or splitting the changes into two commits, one per adapter: the first for the functional changes in the CUDA adapter and the second for the NFC/tidying-up changes in the HIP adapter.
Otherwise, thanks for addressing this inefficiency! LGTM
ec0ca91 to 4aceeda
Good shout @GeorgeWeb! Change made.
Friendly ping @ldrumm, I have a few patches in SYCL RT that depend on this being fixed 😸
Should the HIP adapter have the same calculation as the CUDA one?
I suppose there would be no harm in putting this in the HIP adapter as well.
786e114 to 254fdeb
Same logic now in HIP adapter. Friendly ping @ldrumm
254fdeb to 96c44da
This changes runtime behaviour for HIP and introduces more abstract, duplicated code, so I don't really think it's a refactor.
I think the hipOccupancyMaxPotentialBlockSize change should go in as a separate fix, and that you should revisit those lambdas.
b430b82 to 96fb3cd
I have changed the PR description to show all changes that have been included in this PR.
I would argue that it's OK to keep the HIP changes in this PR as well, since in UR the commits are not squashed when PRs are merged, and I have constructed my commits so that there are two clear, independent commits: one fixing a bug in the CUDA adapter and adding some common code, and another extending the functionality of the HIP adapter. In an ideal world it might be good to split these into two PRs, but with the slow pace of merging in UR I would argue for keeping these changes together, especially since this PR is blocking other PRs and so needs to be prioritized. Let me know your thoughts.
Change made, thanks.
86a14fe to ce33cfb
1d5263f to c2b46e0
void SetUp() override {
    program_name = "fill_2d";
    UUR_RETURN_ON_FATAL_FAILURE(urKernelExecutionTest::SetUp());
#define ENQUEUE_KERNEL_LAUNCH_TEST_1D_SIZES(SIZE) \
@kbenzie let me know what you think about these added tests. It's not ideal to introduce macro funcs, but I think this is the most readable way to run these tests for different sizes.
GoogleTest supports parameterized tests; they turn out similarly readable and don't require macros like this. I'd prefer to use GoogleTest features where possible.
Aha I will do this instead
Here's one example of us using it for urEnqueueUSMFill; there are others dotted around too.
How's the intel/llvm testing looking for this?
4d7e829 to 2fbff0c
Getting some failures on AWS that I can't reproduce locally. Slow going! Will let you know when it's sorted.
5cb8823 to 85f7bfc
1bbe83b to 12b3bf0
A bug in the CUDA adapter was sometimes generating Y and Z ranges that did not divide the global Y or Z dimension. This fixes that. Also moves some helper functions into ur/ur.hpp that may be reused by other adapters.
The HIP adapter was only finding a good sg size in the X dim. This changes it so that it now chooses a sg size that divides the global dim in the X, Y and Z dimensions. It also chooses a power-of-2 sg size in the X dim, which is what the CUDA adapter does. This may give some performance improvements.
Dispatch kernels on lots of different configurations
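For readers skimming the commit messages above, here is a minimal sketch of the idea they describe: pick a power-of-2 local size in X that divides the global X size, then spend the remaining thread budget on Y and Z, always rounding down to a divisor of the global size. The function names, the single maxThreads limit and the absence of per-dimension caps are assumptions made for illustration; this is not the code from the PR or from ur/ur.hpp.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: largest power of two that divides globalSize
// and does not exceed limit. Returns 1 if no larger power of two fits.
inline std::size_t largestPow2Divisor(std::size_t globalSize, std::size_t limit) {
    std::size_t result = 1;
    while (result * 2 <= limit && globalSize % (result * 2) == 0) {
        result *= 2;
    }
    return result;
}

// Hypothetical helper: largest divisor of globalSize that does not exceed
// limit. Rounds down (never up), so the result always divides globalSize.
inline std::size_t largestDivisorUpTo(std::size_t globalSize, std::size_t limit) {
    std::size_t d = limit < globalSize ? limit : globalSize;
    if (d == 0) {
        return 1; // degenerate input: fall back to a local size of 1
    }
    while (globalSize % d != 0) {
        --d; // terminates at the latest when d == 1
    }
    return d;
}

// Hypothetical sketch of a 3D local work size guess. maxThreads stands in
// for the device's per-block thread limit; real adapters also have
// per-dimension limits that are not modelled here.
inline std::array<std::size_t, 3>
guessLocalSize(const std::array<std::size_t, 3> &global, std::size_t maxThreads) {
    std::array<std::size_t, 3> local{1, 1, 1};
    local[0] = largestPow2Divisor(global[0], maxThreads);
    std::size_t budget = maxThreads / local[0];
    local[1] = largestDivisorUpTo(global[1], budget);
    budget /= local[1];
    local[2] = largestDivisorUpTo(global[2], budget);
    return local;
}
```

With a global size of {48, 10, 6} and a limit of 256 threads, this sketch picks 16 in X (the largest power of 2 dividing 48), 10 in Y, and 1 in Z because only 16 / 10 = 1 thread of budget remains.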
12b3bf0 to 69c43b4
@kbenzie this is good to go
okay cool, thanks
Hi there, I think there is a problem here, which is the function declaration of |
oneapi-src/unified-runtime#1326 --------- Co-authored-by: Kenneth Benzie (Benie) <k.benzie@codeplay.com>
Thanks @yingcong-wu, see here: #1460
oneapi-src/unified-runtime#1326 --------- Co-authored-by: Kenneth Benzie (Benie) <k.benzie@codeplay.com>
The guessLocalWorkSize func for cuda adapter was erroneously giving large Y or Z factors, rounding up when it should not. This ensures that the Y and Z factors always divide the global Y or Z dimension.
It also tidies up the HIP adapter version of this func.
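The invariant this commit message describes can be written down as a small check. The sketch below is only an illustration: isValidLocalSize and its maxThreads parameter are hypothetical names, not part of the adapters.

```cpp
#include <array>
#include <cstddef>

// Hypothetical checker: a guessed local size is acceptable only if every
// component is non-zero, divides the matching global size exactly, and
// the total number of work-items per group stays within maxThreads.
inline bool isValidLocalSize(const std::array<std::size_t, 3> &global,
                             const std::array<std::size_t, 3> &local,
                             std::size_t maxThreads) {
    std::size_t total = 1;
    for (std::size_t i = 0; i < 3; ++i) {
        if (local[i] == 0 || global[i] % local[i] != 0) {
            return false; // e.g. a Y factor of 4 for a global Y of 10
        }
        total *= local[i];
    }
    return total <= maxThreads;
}
```

A Y factor of 4 for a global Y of 10 fails this check, while a Y factor of 5 passes; that is the kind of case the fix is meant to rule out.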