[CUDA][HIP] Fix bug in guess local worksize funcs and improve local worksize guessing in HIP adapter #1326

Merged
3 commits merged into oneapi-src:main on Mar 19, 2024

Conversation

@hdelan (Contributor) commented Feb 8, 2024

The guessLocalWorkSize function in the CUDA adapter was erroneously producing large Y or Z factors, rounding up when it should not have. This change ensures that the Y and Z factors always divide the global Y and Z dimensions.

It also tidies up the HIP adapter version of this function.
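As a rough illustration of the idea behind the fix (an editorial sketch, not the adapter's actual code; all names are made up): only pick factors that exactly divide the global dimension, so the guessed Y and Z sizes can never round up past it.

#include <cstddef>

// Illustrative helper: largest value <= Cap that exactly divides GlobalSize.
// Using only exact divisors guarantees the guessed local size never rounds
// up past the global dimension, which is the Y/Z bug this PR fixes.
static size_t largestDividingFactor(size_t GlobalSize, size_t Cap) {
    for (size_t Factor = Cap; Factor > 1; --Factor) {
        if (GlobalSize % Factor == 0) {
            return Factor;
        }
    }
    return 1;
}

// Example: given a per-block thread budget and an already-chosen LocalX,
// pick LocalY and LocalZ so they divide GlobalY and GlobalZ respectively.
static void guessLocalYZ(size_t GlobalY, size_t GlobalZ, size_t MaxThreads,
                         size_t LocalX, size_t &LocalY, size_t &LocalZ) {
    LocalY = largestDividingFactor(GlobalY, MaxThreads / LocalX);
    LocalZ = largestDividingFactor(GlobalZ, MaxThreads / (LocalX * LocalY));
}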

@hdelan (Contributor, Author) commented Feb 8, 2024

DPC++ PR: intel/llvm#12663

@codecov-commenter commented Feb 8, 2024

Codecov Report

Attention: Patch coverage is 0%, with 94 lines in your changes missing coverage. Please review.

Project coverage is 12.46%. Comparing base (78ef1ca) to head (69c43b4).
Report is 153 commits behind head on main.

Files | Patch % | Missing lines
test/conformance/testing/include/uur/fixtures.h | 0.00% | 41 ⚠️
test/conformance/enqueue/urEnqueueKernelLaunch.cpp | 0.00% | 30 ⚠️
source/ur/ur.hpp | 0.00% | 23 ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1326      +/-   ##
==========================================
- Coverage   14.82%   12.46%   -2.36%     
==========================================
  Files         250      239      -11     
  Lines       36220    36080     -140     
  Branches     4094     4094              
==========================================
- Hits         5369     4498     -871     
- Misses      30800    31578     +778     
+ Partials       51        4      -47     

@konradkusiak97 (Contributor) left a comment

Looks good to me! 👍

@GeorgeWeb (Contributor) left a comment

Non-blocker, just a suggestion.
It may be worth adding "tidies up the HIP adapter version of this func" to the commit message, or splitting the changes into two commits, one per adapter: the first for the functional changes in the CUDA adapter and the second for the NFC/tidying-up changes in the HIP adapter.

Otherwise, thanks for addressing this inefficiency! LGTM

@hdelan hdelan force-pushed the refactor-guess-local-worksize branch 2 times, most recently from ec0ca91 to 4aceeda on February 12, 2024 18:15
@hdelan (Contributor, Author) commented Feb 12, 2024

> It may be worth adding "tidies up the HIP adapter version of this func" to the commit message, or splitting the changes into two commits, one per adapter: the first for the functional changes in the CUDA adapter and the second for the NFC/tidying-up changes in the HIP adapter.
>
> Otherwise, thanks for addressing this inefficiency! LGTM

Good shout @GeorgeWeb! Change made.

@hdelan (Contributor, Author) commented Feb 13, 2024

Friendly ping @ldrumm; I have a few patches in the SYCL RT that depend on this being fixed 😸

@npmiller (Contributor) commented

Should the HIP adapter have the same calculation as the CUDA one?

@hdelan (Contributor, Author) commented Feb 13, 2024

> Should the HIP adapter have the same calculation as the CUDA one?

I suppose there would be no harm in putting this in the HIP adapter as well.

@hdelan hdelan requested a review from a team as a code owner February 16, 2024 15:25
@hdelan hdelan force-pushed the refactor-guess-local-worksize branch from 786e114 to 254fdeb on February 16, 2024 15:27
@hdelan (Contributor, Author) commented Feb 16, 2024

> Should the HIP adapter have the same calculation as the CUDA one?

The same logic is now in the HIP adapter. Friendly ping @ldrumm

@hdelan hdelan force-pushed the refactor-guess-local-worksize branch from 254fdeb to 96c44da on February 16, 2024 15:35
@ldrumm (Contributor) left a comment

This changes runtime behaviour for HIP, and introduces more abstract duplicate code. Thus, I don't really think it's a refactor.

I think the hipOccupancyMaxPotentialBlockSize change should go in as a separate fix, and that you should revisit those lambdas

Resolved review threads (now outdated): source/adapters/cuda/enqueue.cpp (1), source/adapters/hip/enqueue.cpp (3)
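Since the review above mentions hipOccupancyMaxPotentialBlockSize, here is a minimal, self-contained sketch of how that HIP API is commonly used to obtain an occupancy-based block-size suggestion. The kernel and arguments here are made up for illustration; the adapter's real call site may differ.

#include <cstdio>

#include <hip/hip_runtime.h>

// Trivial kernel, only here so the occupancy query has something to inspect.
__global__ void fillKernel(int *Out, int Value, size_t N) {
    size_t I = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (I < N) {
        Out[I] = Value;
    }
}

int main() {
    int MinGridSize = 0;
    int BlockSize = 0;
    // Ask the HIP runtime for a block size that maximises occupancy for this
    // kernel; a local-size guesser would typically use the result as an upper
    // bound rather than as the final answer.
    if (hipOccupancyMaxPotentialBlockSize(&MinGridSize, &BlockSize, fillKernel,
                                          0 /*dynSharedMemPerBlk*/,
                                          0 /*blockSizeLimit*/) != hipSuccess) {
        return 1;
    }
    std::printf("suggested block size: %d (min grid size: %d)\n", BlockSize,
                MinGridSize);
    return 0;
}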
@hdelan hdelan force-pushed the refactor-guess-local-worksize branch 3 times, most recently from b430b82 to 96fb3cd on February 20, 2024 10:28
@hdelan hdelan changed the title from "[CUDA][HIP] Refactor guess local worksize funcs" to "[CUDA][HIP] Fix bug in guess local worksize funcs and improve local worksize guessing in HIP adapter" on Feb 20, 2024
@hdelan (Contributor, Author) commented Feb 20, 2024

> This changes runtime behaviour for HIP, and introduces more abstract duplicate code. Thus, I don't really think it's a refactor.

I have changed the PR description to list all the changes included in this PR.

> I think the hipOccupancyMaxPotentialBlockSize change should go in as a separate fix

I would argue that it's OK to keep the HIP changes in this PR as well, since in UR the commits are not squashed when PRs are merged, and since I have constructed my commits so that there are two clear, independent commits: one fixing a bug in the CUDA adapter and adding some common code, and another extending the functionality of the HIP adapter.

In an ideal world it might be better to split these into two PRs, but given the slow pace of merging in UR I would argue for keeping these changes together, especially since this PR is blocking other PRs and so needs to be prioritized. Let me know your thoughts.

> You should revisit those lambdas

Change made, thanks.

@hdelan hdelan force-pushed the refactor-guess-local-worksize branch 4 times, most recently from 86a14fe to ce33cfb on February 23, 2024 09:32
@hdelan hdelan force-pushed the refactor-guess-local-worksize branch 2 times, most recently from 1d5263f to c2b46e0 on March 7, 2024 15:05
void SetUp() override {
    program_name = "fill_2d";
    UUR_RETURN_ON_FATAL_FAILURE(urKernelExecutionTest::SetUp());
#define ENQUEUE_KERNEL_LAUNCH_TEST_1D_SIZES(SIZE) \
@hdelan (Contributor, Author) commented:

@kbenzie let me know what you think about these added tests. It's not ideal to introduce macro functions, but I think this is the most readable way to run these tests for different sizes.

Contributor commented:

GoogleTest supports parameterized tests; they turn out similarly readable and don't require macros like this. I'd prefer to use GoogleTest features where possible.
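For reference, a value-parameterized GoogleTest suite over a set of launch sizes might look like the sketch below. The fixture and test names are hypothetical, not the conformance suite's real fixtures.

#include <cstddef>

#include <gtest/gtest.h>

// Hypothetical fixture parameterized over a 1D global work size.
class KernelLaunch1DTest : public ::testing::TestWithParam<size_t> {};

TEST_P(KernelLaunch1DTest, Launches) {
    const size_t GlobalSize = GetParam();
    // ... set up the kernel and enqueue it over GlobalSize work-items ...
    EXPECT_GT(GlobalSize, 0u);
}

// One instantiation covers all the sizes; no per-size macro is needed.
// Link against gtest_main (or provide a main that calls RUN_ALL_TESTS).
INSTANTIATE_TEST_SUITE_P(VariousSizes, KernelLaunch1DTest,
                         ::testing::Values(size_t(1), size_t(31), size_t(1027),
                                           size_t(32768)));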

@hdelan (Contributor, Author) commented:

Aha, I will do this instead.

Contributor commented:

Here's one example of using it for urEnqueueUSMFill; there are others dotted around too.

@kbenzie kbenzie added the "v0.9.x" (Include in the v0.9.x release) label Mar 11, 2024
@kbenzie (Contributor) commented Mar 11, 2024

How's the intel/llvm testing looking for this?

@hdelan hdelan force-pushed the refactor-guess-local-worksize branch 2 times, most recently from 4d7e829 to 2fbff0c on March 13, 2024 16:27
@hdelan (Contributor, Author) commented Mar 13, 2024

> How's the intel/llvm testing looking for this?

Getting some failures on AWS that I can't reproduce locally. Slow going! Will let you know when it's sorted.

@kbenzie kbenzie removed the "ready to merge" (Added to PR's which are ready to merge) label Mar 13, 2024
@hdelan hdelan force-pushed the refactor-guess-local-worksize branch from 5cb8823 to 85f7bfc on March 14, 2024 16:11
@hdelan hdelan force-pushed the refactor-guess-local-worksize branch 2 times, most recently from 1bbe83b to 12b3bf0 on March 15, 2024 09:18
Commit 1: A bug in the CUDA adapter sometimes generated local Y and Z sizes that did not divide the global Y or Z dimension. This fixes that. It also moves some helper functions into ur/ur.hpp so they can be reused by other adapters.

Commit 2: The HIP adapter was only finding a good sg size in the X dimension. This changes it so that it now chooses an sg size that divides the global dimension in X, Y and Z. It also chooses a power-of-2 sg size in the X dimension, the same as the CUDA adapter does. This may give some performance improvements.

Commit 3: Dispatch kernels on lots of different configurations.
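A hedged sketch of what the second commit describes (all function names are illustrative, not the adapter's code): a power-of-2 local size in X bounded by the thread budget, with Y and Z then chosen to exactly divide their global dimensions.

#include <cstddef>

// Illustrative only: largest power of two <= Cap that divides GlobalX.
// Divisibility by 2^k implies divisibility by 2^(k-1), so walking the
// powers upwards finds the largest qualifying one.
static size_t powerOf2DividingFactor(size_t GlobalX, size_t Cap) {
    size_t Factor = 1;
    while (Factor * 2 <= Cap && GlobalX % (Factor * 2) == 0) {
        Factor *= 2;
    }
    return Factor;
}

// Illustrative only: largest value <= Cap that divides GlobalSize (same idea
// as the helper sketched under the PR description).
static size_t largestDividingFactor(size_t GlobalSize, size_t Cap) {
    for (size_t Factor = Cap; Factor > 1; --Factor) {
        if (GlobalSize % Factor == 0) {
            return Factor;
        }
    }
    return 1;
}

// Roughly what the commit describes: a power-of-2 X size capped by the
// thread budget, then Y and Z sizes that divide their global dimensions.
static void guessLocalSize(const size_t Global[3], size_t MaxThreads,
                           size_t Local[3]) {
    Local[0] = powerOf2DividingFactor(Global[0], MaxThreads);
    Local[1] = largestDividingFactor(Global[1], MaxThreads / Local[0]);
    Local[2] =
        largestDividingFactor(Global[2], MaxThreads / (Local[0] * Local[1]));
}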
@hdelan hdelan force-pushed the refactor-guess-local-worksize branch from 12b3bf0 to 69c43b4 on March 18, 2024 10:54
@hdelan (Contributor, Author) commented Mar 18, 2024

@kbenzie this is good to go

@kbenzie (Contributor) commented Mar 18, 2024

> @kbenzie this is good to go

okay cool, thanks

@hdelan hdelan added the "ready to merge" (Added to PR's which are ready to merge) label Mar 18, 2024
@kbenzie kbenzie merged commit ed1f8bf into oneapi-src:main Mar 19, 2024
50 checks passed
@yingcong-wu (Contributor) commented Mar 20, 2024

Hi there, I think there is a problem here: the function declaration of guessLocalWorkSize in source/adapters/cuda/enqueue.hpp was not updated accordingly.

See https://github.com/oneapi-src/unified-runtime/blob/ed1f8bf618c88eaabea6bde0f6c06fc265f3b49f/source/adapters/cuda/enqueue.hpp#L23C52-L23C70

steffenlarsen pushed a commit to intel/llvm that referenced this pull request on Mar 20, 2024 (oneapi-src/unified-runtime#1326, co-authored by Kenneth Benzie (Benie) <k.benzie@codeplay.com>)
@hdelan (Contributor, Author) commented Mar 20, 2024

> Hi there, I think there is a problem here: the function declaration of guessLocalWorkSize in source/adapters/cuda/enqueue.hpp was not updated accordingly.
>
> See https://github.com/oneapi-src/unified-runtime/blob/ed1f8bf618c88eaabea6bde0f6c06fc265f3b49f/source/adapters/cuda/enqueue.hpp#L23C52-L23C70

Thanks @yingcong-wu, see #1460

Labels
ready to merge (Added to PR's which are ready to merge), v0.9.x (Include in the v0.9.x release)
9 participants