[CUDA][HIP] Fix bug in guess local worksize funcs and improve local worksize guessing in HIP adapter #1326

Merged
3 commits merged into oneapi-src:main on Mar 19, 2024

Conversation

@hdelan (Contributor) commented Feb 8, 2024

The guessLocalWorkSize function in the CUDA adapter was erroneously producing large Y or Z factors, rounding up when it should not have. This change ensures that the Y and Z factors always divide the global Y and Z dimensions.

It also tidies up the HIP adapter version of this function.
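As a rough illustration of the idea behind the fix (an editorial sketch, not the adapter's actual code; all names are made up): only pick factors that exactly divide the global dimension, so the guessed Y and Z sizes can never round up past it.

#include <cstddef>

// Illustrative helper: largest value <= Cap that exactly divides GlobalSize.
// Using only exact divisors guarantees the guessed local size never rounds
// up past the global dimension, which is the Y/Z bug this PR fixes.
static size_t largestDividingFactor(size_t GlobalSize, size_t Cap) {
    for (size_t Factor = Cap; Factor > 1; --Factor) {
        if (GlobalSize % Factor == 0) {
            return Factor;
        }
    }
    return 1;
}

// Example: given a per-block thread budget and an already-chosen LocalX,
// pick LocalY and LocalZ so they divide GlobalY and GlobalZ respectively.
static void guessLocalYZ(size_t GlobalY, size_t GlobalZ, size_t MaxThreads,
                         size_t LocalX, size_t &LocalY, size_t &LocalZ) {
    LocalY = largestDividingFactor(GlobalY, MaxThreads / LocalX);
    LocalZ = largestDividingFactor(GlobalZ, MaxThreads / (LocalX * LocalY));
}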

@hdelan (Contributor, Author) commented Feb 8, 2024

DPC++ PR: intel/llvm#12663

@codecov-commenter commented Feb 8, 2024

Codecov Report

Attention: Patch coverage is 0%, with 94 lines in your changes missing coverage. Please review.

Project coverage is 12.46%. Comparing base (78ef1ca) to head (69c43b4).
Report is 153 commits behind head on main.

Files | Patch % | Missing lines
test/conformance/testing/include/uur/fixtures.h | 0.00% | 41 ⚠️
test/conformance/enqueue/urEnqueueKernelLaunch.cpp | 0.00% | 30 ⚠️
source/ur/ur.hpp | 0.00% | 23 ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1326      +/-   ##
==========================================
- Coverage   14.82%   12.46%   -2.36%     
==========================================
  Files         250      239      -11     
  Lines       36220    36080     -140     
  Branches     4094     4094              
==========================================
- Hits         5369     4498     -871     
- Misses      30800    31578     +778     
+ Partials       51        4      -47     

@konradkusiak97 (Contributor) left a comment

Looks good to me! 👍

@GeorgeWeb (Contributor) left a comment

Non-blocker, just a suggestion.
It may be worth adding "tidies up the HIP adapter version of this func" to the commit message, or splitting the changes into two commits, one per adapter: the first for the functional changes in the CUDA adapter and the second for the NFC/tidying-up changes in the HIP adapter.

Otherwise, thanks for addressing this inefficiency! LGTM

@hdelan hdelan force-pushed the refactor-guess-local-worksize branch 2 times, most recently from ec0ca91 to 4aceeda on February 12, 2024 18:15
@hdelan (Contributor, Author) commented Feb 12, 2024

> It may be worth adding "tidies up the HIP adapter version of this func" to the commit message, or splitting the changes into two commits, one per adapter: the first for the functional changes in the CUDA adapter and the second for the NFC/tidying-up changes in the HIP adapter.
>
> Otherwise, thanks for addressing this inefficiency! LGTM

Good shout @GeorgeWeb! Change made.

@hdelan (Contributor, Author) commented Feb 13, 2024

Friendly ping @ldrumm; I have a few patches in the SYCL RT that depend on this being fixed 😸

@npmiller (Contributor) commented

Should the HIP adapter have the same calculation as the CUDA one?

@hdelan (Contributor, Author) commented Feb 13, 2024

> Should the HIP adapter have the same calculation as the CUDA one?

I suppose there would be no harm in putting this in the HIP adapter as well.

@hdelan hdelan requested a review from a team as a code owner February 16, 2024 15:25
@hdelan hdelan force-pushed the refactor-guess-local-worksize branch from 786e114 to 254fdeb on February 16, 2024 15:27
@hdelan (Contributor, Author) commented Feb 16, 2024

> Should the HIP adapter have the same calculation as the CUDA one?

The same logic is now in the HIP adapter. Friendly ping @ldrumm

@hdelan hdelan force-pushed the refactor-guess-local-worksize branch from 254fdeb to 96c44da on February 16, 2024 15:35
@ldrumm (Contributor) left a comment

This changes runtime behaviour for HIP, and introduces more abstract duplicate code. Thus, I don't really think it's a refactor.

I think the hipOccupancyMaxPotentialBlockSize change should go in as a separate fix, and that you should revisit those lambdas

Resolved review threads (now outdated): source/adapters/cuda/enqueue.cpp (1), source/adapters/hip/enqueue.cpp (3)
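Since the review above mentions hipOccupancyMaxPotentialBlockSize, here is a minimal, self-contained sketch of how that HIP API is commonly used to obtain an occupancy-based block-size suggestion. The kernel and arguments here are made up for illustration; the adapter's real call site may differ.

#include <cstdio>

#include <hip/hip_runtime.h>

// Trivial kernel, only here so the occupancy query has something to inspect.
__global__ void fillKernel(int *Out, int Value, size_t N) {
    size_t I = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (I < N) {
        Out[I] = Value;
    }
}

int main() {
    int MinGridSize = 0;
    int BlockSize = 0;
    // Ask the HIP runtime for a block size that maximises occupancy for this
    // kernel; a local-size guesser would typically use the result as an upper
    // bound rather than as the final answer.
    if (hipOccupancyMaxPotentialBlockSize(&MinGridSize, &BlockSize, fillKernel,
                                          0 /*dynSharedMemPerBlk*/,
                                          0 /*blockSizeLimit*/) != hipSuccess) {
        return 1;
    }
    std::printf("suggested block size: %d (min grid size: %d)\n", BlockSize,
                MinGridSize);
    return 0;
}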
@hdelan hdelan force-pushed the refactor-guess-local-worksize branch 3 times, most recently from b430b82 to 96fb3cd on February 20, 2024 10:28
@hdelan hdelan changed the title from "[CUDA][HIP] Refactor guess local worksize funcs" to "[CUDA][HIP] Fix bug in guess local worksize funcs and improve local worksize guessing in HIP adapter" on Feb 20, 2024
@hdelan (Contributor, Author) commented Feb 20, 2024

> This changes runtime behaviour for HIP, and introduces more abstract duplicate code. Thus, I don't really think it's a refactor.

I have changed the PR description to list all the changes included in this PR.

> I think the hipOccupancyMaxPotentialBlockSize change should go in as a separate fix

I would argue that it's OK to keep the HIP changes in this PR as well, since in UR the commits are not squashed when PRs are merged, and since I have constructed my commits so that there are two clear, independent commits: one fixing a bug in the CUDA adapter and adding some common code, and another extending the functionality of the HIP adapter.

In an ideal world it might be better to split these into two PRs, but given the slow pace of merging in UR I would argue for keeping these changes together, especially since this PR is blocking other PRs and so needs to be prioritized. Let me know your thoughts.

> You should revisit those lambdas

Change made, thanks.

@hdelan hdelan force-pushed the refactor-guess-local-worksize branch 4 times, most recently from 86a14fe to ce33cfb on February 23, 2024 09:32
@hdelan hdelan force-pushed the refactor-guess-local-worksize branch 2 times, most recently from 1d5263f to c2b46e0 on March 7, 2024 15:05
void SetUp() override {
    program_name = "fill_2d";
    UUR_RETURN_ON_FATAL_FAILURE(urKernelExecutionTest::SetUp());
#define ENQUEUE_KERNEL_LAUNCH_TEST_1D_SIZES(SIZE) \
@hdelan (Contributor, Author) commented:

@kbenzie let me know what you think about these added tests. It's not ideal to introduce macro functions, but I think this is the most readable way to run these tests for different sizes.

Contributor commented:

GoogleTest supports parameterized tests; they turn out similarly readable and don't require macros like this. I'd prefer to use GoogleTest features where possible.
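For reference, a value-parameterized GoogleTest suite over a set of launch sizes might look like the sketch below. The fixture and test names are hypothetical, not the conformance suite's real fixtures.

#include <cstddef>

#include <gtest/gtest.h>

// Hypothetical fixture parameterized over a 1D global work size.
class KernelLaunch1DTest : public ::testing::TestWithParam<size_t> {};

TEST_P(KernelLaunch1DTest, Launches) {
    const size_t GlobalSize = GetParam();
    // ... set up the kernel and enqueue it over GlobalSize work-items ...
    EXPECT_GT(GlobalSize, 0u);
}

// One instantiation covers all the sizes; no per-size macro is needed.
// Link against gtest_main (or provide a main that calls RUN_ALL_TESTS).
INSTANTIATE_TEST_SUITE_P(VariousSizes, KernelLaunch1DTest,
                         ::testing::Values(size_t(1), size_t(31), size_t(1027),
                                           size_t(32768)));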

@hdelan (Contributor, Author) commented:

Aha, I will do this instead.

Contributor commented:

Here's one example of using it for urEnqueueUSMFill; there are others dotted around too.

@kbenzie kbenzie added the "v0.9.x" (Include in the v0.9.x release) label Mar 11, 2024
@kbenzie (Contributor) commented Mar 11, 2024

How's the intel/llvm testing looking for this?

@hdelan hdelan force-pushed the refactor-guess-local-worksize branch 2 times, most recently from 4d7e829 to 2fbff0c on March 13, 2024 16:27
@hdelan (Contributor, Author) commented Mar 13, 2024

> How's the intel/llvm testing looking for this?

Getting some failures on AWS that I can't reproduce locally. Slow going! Will let you know when it's sorted.

@kbenzie kbenzie removed the "ready to merge" (Added to PR's which are ready to merge) label Mar 13, 2024
@hdelan hdelan force-pushed the refactor-guess-local-worksize branch from 5cb8823 to 85f7bfc on March 14, 2024 16:11
@hdelan hdelan force-pushed the refactor-guess-local-worksize branch 2 times, most recently from 1bbe83b to 12b3bf0 on March 15, 2024 09:18
Commit 1: A bug in the CUDA adapter sometimes generated local Y and Z sizes that did not divide the global Y or Z dimension. This fixes that. It also moves some helper functions into ur/ur.hpp so they can be reused by other adapters.

Commit 2: The HIP adapter was only finding a good sg size in the X dimension. This changes it so that it now chooses an sg size that divides the global dimension in X, Y and Z. It also chooses a power-of-2 sg size in the X dimension, the same as the CUDA adapter does. This may give some performance improvements.

Commit 3: Dispatch kernels on lots of different configurations.
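A hedged sketch of what the second commit describes (all function names are illustrative, not the adapter's code): a power-of-2 local size in X bounded by the thread budget, with Y and Z then chosen to exactly divide their global dimensions.

#include <cstddef>

// Illustrative only: largest power of two <= Cap that divides GlobalX.
// Divisibility by 2^k implies divisibility by 2^(k-1), so walking the
// powers upwards finds the largest qualifying one.
static size_t powerOf2DividingFactor(size_t GlobalX, size_t Cap) {
    size_t Factor = 1;
    while (Factor * 2 <= Cap && GlobalX % (Factor * 2) == 0) {
        Factor *= 2;
    }
    return Factor;
}

// Illustrative only: largest value <= Cap that divides GlobalSize (same idea
// as the helper sketched under the PR description).
static size_t largestDividingFactor(size_t GlobalSize, size_t Cap) {
    for (size_t Factor = Cap; Factor > 1; --Factor) {
        if (GlobalSize % Factor == 0) {
            return Factor;
        }
    }
    return 1;
}

// Roughly what the commit describes: a power-of-2 X size capped by the
// thread budget, then Y and Z sizes that divide their global dimensions.
static void guessLocalSize(const size_t Global[3], size_t MaxThreads,
                           size_t Local[3]) {
    Local[0] = powerOf2DividingFactor(Global[0], MaxThreads);
    Local[1] = largestDividingFactor(Global[1], MaxThreads / Local[0]);
    Local[2] =
        largestDividingFactor(Global[2], MaxThreads / (Local[0] * Local[1]));
}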
@hdelan hdelan force-pushed the refactor-guess-local-worksize branch from 12b3bf0 to 69c43b4 on March 18, 2024 10:54
@hdelan (Contributor, Author) commented Mar 18, 2024

@kbenzie this is good to go

@kbenzie (Contributor) commented Mar 18, 2024

> @kbenzie this is good to go

okay cool, thanks

@hdelan hdelan added the "ready to merge" (Added to PR's which are ready to merge) label Mar 18, 2024
@kbenzie kbenzie merged commit ed1f8bf into oneapi-src:main Mar 19, 2024
50 checks passed
@yingcong-wu (Contributor) commented Mar 20, 2024

Hi there, I think there is a problem here: the function declaration of guessLocalWorkSize in source/adapters/cuda/enqueue.hpp was not updated accordingly.

See https://github.com/oneapi-src/unified-runtime/blob/ed1f8bf618c88eaabea6bde0f6c06fc265f3b49f/source/adapters/cuda/enqueue.hpp#L23C52-L23C70

steffenlarsen pushed a commit to intel/llvm that referenced this pull request on Mar 20, 2024 (oneapi-src/unified-runtime#1326, co-authored by Kenneth Benzie (Benie) <k.benzie@codeplay.com>)
@hdelan (Contributor, Author) commented Mar 20, 2024

> Hi there, I think there is a problem here: the function declaration of guessLocalWorkSize in source/adapters/cuda/enqueue.hpp was not updated accordingly.
>
> See https://github.com/oneapi-src/unified-runtime/blob/ed1f8bf618c88eaabea6bde0f6c06fc265f3b49f/source/adapters/cuda/enqueue.hpp#L23C52-L23C70

Thanks @yingcong-wu, see #1460

Labels
ready to merge (Added to PR's which are ready to merge), v0.9.x (Include in the v0.9.x release)
9 participants