[UR] Add default implementation for cooperative kernel functions #1246
Cooperative kernels can synchronize using device-scope barriers. These kernels use `urKernelSuggestMaxCooperativeGroupCountExp` to ensure that all work groups can run concurrently. When the maximum number of work groups is 1, these kernels behave the same as regular kernels. This PR adds a default implementation for `urKernelSuggestMaxCooperativeGroupCountExp` that returns 1. Also, it adds a default implementation for `urEnqueueCooperativeKernelLaunchExp` that calls `urEnqueueKernelLaunch`. Signed-off-by: Michael Aziz <michael.aziz@intel.com>
I've created intel/llvm#12367 as a draft to test these API functions. They're not presently used in SYCL, but my draft PR updates the SYCL runtime to use them for cooperative kernels. All tests, including the root group test, are presently passing.
Codecov Report

Coverage diff, main vs. #1246:

| Metric | main | #1246 | +/- |
|---|---|---|---|
| Coverage | 14.82% | 12.72% | -2.10% |
| Files | 250 | 238 | -12 |
| Lines | 36220 | 35346 | -874 |
| Branches | 4094 | 4010 | -84 |
| Hits | 5369 | 4498 | -871 |
| Misses | 30800 | 30844 | +44 |
| Partials | 51 | 4 | -47 |
OpenCL changes LGTM
I think the main design considerations for this interface are the following.
It would be good to have a summary of your knowledge of the above to help with review. Thanks.
LGTM for level-zero
I think CUDA and Level Zero support device-wide synchronization. I don't know about HIP, and OpenCL would require an extension.

I'm not sure I understand this question. CUDA has the following two functions:

L0 has two similar functions:

In addition to the kernel function, the CUDA occupancy query also requires a block size and dynamic memory usage. I'm not sure how L0 implements a similar behavior without these parameters. The two API pairs seem otherwise equivalent. I believe these functions can be used to provide implementations for the CUDA and L0 UR adapters.
OK thanks, I'll look into the HIP case.
I found the following two HIP functions that seem relevant: I'm not familiar with HIP but I would expect them to have the same semantics as the CUDA functions.
It looks like the UR function,
matches the interface of
I don't understand the semantics of the above ze function based on the limited documentation I read, nor how it relates to the ze architecture design (or whether it can be considered a valid subset of the CUDA/HIP function semantics), but the semantics of the cuda

From what I understand of the NVIDIA/AMD architectures, a suggested max group count wouldn't make sense without the two parameters mentioned above as input. I guess it must make sense for ze, because that is how they designed their API. But shouldn't the general UR API also take these two missing parameters in order to properly support HIP and NVIDIA devices?
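To illustrate why the work-group size and dynamic shared memory usage matter for this query, here is a toy occupancy model. It is only a sketch under stated assumptions: `DeviceLimits`, its fields, and `estimateMaxCooperativeGroupCount` are hypothetical names, and the arithmetic is a simplification, not the calculation that CUDA, HIP, or Level Zero actually perform.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical per-device limits. A real adapter would obtain comparable
// values from the backend's device query API.
struct DeviceLimits {
    std::uint32_t computeUnits;     // number of multiprocessors / compute units
    std::size_t maxThreadsPerUnit;  // resident work-items one unit can hold
    std::size_t sharedMemPerUnit;   // bytes of shared/local memory per unit
};

// Toy estimate of how many work-groups can be resident on the device at once.
// Both inputs can lower the result: larger groups and larger dynamic shared
// memory requests each reduce the number of groups that fit on one unit.
std::uint32_t estimateMaxCooperativeGroupCount(const DeviceLimits &dev,
                                               std::size_t localWorkSize,
                                               std::size_t dynamicSharedMemorySize) {
    if (localWorkSize == 0 || localWorkSize > dev.maxThreadsPerUnit) {
        return 0;  // the group cannot be scheduled at all in this model
    }
    std::size_t perUnit = dev.maxThreadsPerUnit / localWorkSize;
    if (dynamicSharedMemorySize > 0) {
        std::size_t byMemory = dev.sharedMemPerUnit / dynamicSharedMemorySize;
        perUnit = std::min(perUnit, byMemory);
    }
    return static_cast<std::uint32_t>(perUnit * dev.computeUnits);
}
```

In this model the same kernel gets a different answer depending on the two extra inputs, which is the reason for asking that the UR query accept them.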
I think this makes sense. I've updated the extension definition and the default implementations to include these two additional parameters. I'll update my draft SYCL runtime PR to use the new UR API.
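For readers following along, here is a minimal, self-contained sketch of the two default behaviours described in this PR: the query ignores its inputs and reports a single work-group, and the cooperative launch forwards to the regular launch. `Result`, `Kernel`, `Queue`, and all three function names are hypothetical stand-ins so the example compiles on its own; the real entry points use the Unified Runtime handle types and may take further arguments.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the runtime's result codes and handles.
enum class Result { Success, InvalidNullPointer };
struct Kernel {};
struct Queue { std::uint32_t launches = 0; };

// Stand-in for the regular launch path (urEnqueueKernelLaunch in the PR).
// It only counts launches so the forwarding can be observed.
Result enqueueKernelLaunch(Queue *queue, Kernel *kernel, std::uint32_t workDim,
                           const std::size_t *globalWorkSize,
                           const std::size_t *localWorkSize) {
    if (!queue || !kernel || !globalWorkSize) return Result::InvalidNullPointer;
    (void)workDim;
    (void)localWorkSize;
    ++queue->launches;
    return Result::Success;
}

// Default query: ignores every input and reports one work-group, so a
// cooperative kernel degenerates to a regular single-group kernel.
Result suggestMaxCooperativeGroupCountDefault(Kernel *hKernel,
                                              std::size_t localWorkSize,
                                              std::size_t dynamicSharedMemorySize,
                                              std::uint32_t *pGroupCountRet) {
    (void)hKernel;
    (void)localWorkSize;
    (void)dynamicSharedMemorySize;
    if (!pGroupCountRet) return Result::InvalidNullPointer;
    *pGroupCountRet = 1;
    return Result::Success;
}

// Default cooperative launch: forwards unchanged to the regular launch.
Result enqueueCooperativeKernelLaunchDefault(Queue *queue, Kernel *kernel,
                                             std::uint32_t workDim,
                                             const std::size_t *globalWorkSize,
                                             const std::size_t *localWorkSize) {
    return enqueueKernelLaunch(queue, kernel, workDim, globalWorkSize,
                               localWorkSize);
}
```

A caller would first ask for the suggested count, limit the number of work-groups to it, and then use the cooperative launch. With these defaults that always means a single work-group.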
    (void)hKernel;
    (void)localWorkSize;
    (void)dynamicSharedMemorySize;
    *pGroupCountRet = 1;
nit: Personally I'd just mark these as unimplemented for now, until they have complete implementations, since returning 1 for all inputs could be confusing and gives bad information.
But it's probably not a big deal.
@oneapi-src/unified-runtime-maintain, could you please provide a review?
Since these are new entry point implementations, you'll need to add them to the interface loader file for each adapter, like here for L0.
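As a rough illustration of what registering a new entry point in an adapter's interface loader involves, the sketch below fills a function-pointer table with the adapter's implementation. Every name here (`KernelExpTable`, `SuggestFn`, `adapterSuggestMaxCooperativeGroupCount`, `getKernelExpProcAddrTable`) is hypothetical and simplified; the actual table types and loader functions are the ones in the Unified Runtime adapter sources.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical result codes for this sketch.
enum class Result { Success, InvalidNullPointer };

// Hypothetical signature for the group-count query entry point.
using SuggestFn = Result (*)(std::size_t localWorkSize,
                             std::size_t dynamicSharedMemorySize,
                             std::uint32_t *pGroupCountRet);

// Hypothetical dispatch table: one function pointer per entry point.
// An entry left as nullptr is one the adapter has not registered.
struct KernelExpTable {
    SuggestFn pfnSuggestMaxCooperativeGroupCount = nullptr;
};

// The adapter's implementation of the entry point (default behaviour).
Result adapterSuggestMaxCooperativeGroupCount(std::size_t localWorkSize,
                                              std::size_t dynamicSharedMemorySize,
                                              std::uint32_t *pGroupCountRet) {
    (void)localWorkSize;
    (void)dynamicSharedMemorySize;
    if (!pGroupCountRet) return Result::InvalidNullPointer;
    *pGroupCountRet = 1;
    return Result::Success;
}

// Loader hook: this is where a new entry point has to be added, otherwise
// callers going through the table never reach the implementation.
Result getKernelExpProcAddrTable(KernelExpTable *table) {
    if (!table) return Result::InvalidNullPointer;
    table->pfnSuggestMaxCooperativeGroupCount =
        adapterSuggestMaxCooperativeGroupCount;
    return Result::Success;
}
```

The point of the sketch is only that an implementation that exists but is never written into the table stays unreachable, which is why each adapter's loader file needs updating.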
Thanks for the feedback, @aarongreig! I've made the change you requested for all four adapters. Do you know why this might have caused the HIP CI check to fail? I don't believe I've modified the failing test cases or any related code.
I think the failure was a CI issue we were having; it should go away now.
@kbenzie, can this PR be merged? I think it's ready.
We've got a bit of a backlog of ready-to-merge PRs; we'll aim to get this merged ASAP.