
[UR] Add default implementation for cooperative kernel functions #1246

Merged

merged 10 commits into oneapi-src:main from cooperative_kernel_functions on Feb 19, 2024

Conversation

@0x12CC 0x12CC commented Jan 12, 2024

Cooperative kernels can synchronize using device-scope barriers. These kernels
use `urKernelSuggestMaxCooperativeGroupCountExp` to ensure that all work groups
can run concurrently. When the maximum number of work groups is 1, these kernels
behave the same as regular kernels.

This PR adds a default implementation for
`urKernelSuggestMaxCooperativeGroupCountExp` that returns 1. Also, it adds a
default implementation for `urEnqueueCooperativeKernelLaunchExp` that calls
`urEnqueueKernelLaunch`.
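A minimal sketch of what such defaults could look like. The type names and signatures below are simplified stand-ins (the real definitions live in the UR headers, and the launch entry points take many more parameters); only the behavior described above is grounded in this PR.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-ins for the real UR types; error handling and the full
// launch parameter lists are omitted.
using ur_kernel_handle_t = void *;
using ur_result_t = int;
constexpr ur_result_t UR_RESULT_SUCCESS = 0;

// Stand-in for the existing regular launch entry point.
ur_result_t urEnqueueKernelLaunch(ur_kernel_handle_t hKernel) {
  (void)hKernel;
  return UR_RESULT_SUCCESS;
}

// Default: report that only one work group is guaranteed to run concurrently,
// so cooperative kernels degrade to regular-kernel behavior.
ur_result_t urKernelSuggestMaxCooperativeGroupCountExp(
    ur_kernel_handle_t hKernel, uint32_t *pGroupCountRet) {
  (void)hKernel;
  *pGroupCountRet = 1;
  return UR_RESULT_SUCCESS;
}

// Default: a cooperative launch simply forwards to the regular launch.
ur_result_t urEnqueueCooperativeKernelLaunchExp(ur_kernel_handle_t hKernel) {
  return urEnqueueKernelLaunch(hKernel);
}
```

With a group count of 1, a runtime that sizes its cooperative launches by this query never launches more groups than can synchronize, which is why forwarding to the regular launch is safe.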

Signed-off-by: Michael Aziz <michael.aziz@intel.com>
@0x12CC 0x12CC requested review from a team as code owners January 12, 2024 01:40
@0x12CC 0x12CC requested a review from JackAKirk January 12, 2024 01:40
0x12CC commented Jan 12, 2024

I've created intel/llvm#12367 as a draft to test these API functions. They're not presently used in SYCL, but my draft PR updates the SYCL runtime to use them for cooperative kernels. All tests, including the root group test, are presently passing.

codecov-commenter commented Jan 12, 2024

Codecov Report

Attention: 48 lines in your changes are missing coverage. Please review.

Comparison is base (78ef1ca) 14.82% compared to head (8a8d704) 12.72%.
Report is 11 commits behind head on main.

Files | Patch % | Lines
--- | --- | ---
...e/common/umf_pools/disjoint_pool_config_parser.cpp | 0.00% | 15 Missing ⚠️
source/common/umf_helpers.hpp | 0.00% | 9 Missing ⚠️
include/ur_print.hpp | 70.00% | 6 Missing ⚠️
source/common/logger/ur_logger_details.hpp | 61.53% | 5 Missing ⚠️
test/usm/usmPoolManager.cpp | 0.00% | 5 Missing ⚠️
source/loader/layers/tracing/ur_trcddi.cpp | 0.00% | 3 Missing ⚠️
source/loader/ur_libapi.cpp | 0.00% | 2 Missing ⚠️
source/adapters/null/ur_nullddi.cpp | 0.00% | 1 Missing ⚠️
source/loader/layers/validation/ur_valddi.cpp | 0.00% | 1 Missing ⚠️
source/loader/ur_ldrddi.cpp | 0.00% | 1 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1246      +/-   ##
==========================================
- Coverage   14.82%   12.72%   -2.10%     
==========================================
  Files         250      238      -12     
  Lines       36220    35346     -874     
  Branches     4094     4010      -84     
==========================================
- Hits         5369     4498     -871     
- Misses      30800    30844      +44     
+ Partials       51        4      -47     

☔ View full report in Codecov by Sentry.

@kbenzie kbenzie left a comment

OpenCL changes LGTM

JackAKirk commented Jan 16, 2024

I think the main design considerations for this interface are the following:

  • Which backends currently support the feature?

  • Do the backends that support the feature have interfaces supported by this unified interface?

It would be good to have a summary of your knowledge of the above to help with review. Thanks.

@nrspruit nrspruit left a comment


LGTM for level-zero

0x12CC commented Jan 16, 2024

  • Which backends currently support the feature?

I think CUDA and Level Zero support device-wide synchronization. I don't know about HIP, and OpenCL would require an extension.

  • Do the backends that support the feature have interfaces supported by this unified interface?

I'm not sure I understand this question. CUDA has the following two functions:

* [`cudaLaunchCooperativeKernel`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html#group__CUDART__EXECUTION_1g504b94170f83285c71031be6d5d15f73)

* [`cudaOccupancyMaxActiveBlocksPerMultiprocessor`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__HIGHLEVEL.html#group__CUDART__HIGHLEVEL_1g5a5d67a3c907371559ba692195e8a38c)

L0 has two similar functions:

* [`zeCommandListAppendLaunchCooperativeKernel`](https://spec.oneapi.io/level-zero/latest/core/api.html?highlight=coop#zecommandlistappendlaunchcooperativekernel)

* [`zeKernelSuggestMaxCooperativeGroupCount`](https://spec.oneapi.io/level-zero/latest/core/api.html?highlight=coop#ze__api_8h_1af0b050a6cc08132ef84a8618942ce125)
In addition to the kernel function, the CUDA occupancy query also requires a block size and dynamic memory usage. I'm not sure how L0 implements a similar behavior without these parameters. The two API pairs seem otherwise equivalent. I believe these functions can be used to provide implementations for the CUDA and L0 UR adapters.

@JackAKirk

  • Which backends currently support the feature?

I think CUDA and Level Zero support device-wide synchronization. I don't know about HIP, and OpenCL would require an extension.

  • Do the backends that support the feature have interfaces supported by this unified interface?

I'm not sure I understand this question. CUDA has the following two functions:

* [`cudaLaunchCooperativeKernel`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html#group__CUDART__EXECUTION_1g504b94170f83285c71031be6d5d15f73)

* [`cudaOccupancyMaxActiveBlocksPerMultiprocessor`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__HIGHLEVEL.html#group__CUDART__HIGHLEVEL_1g5a5d67a3c907371559ba692195e8a38c)

L0 has two similar functions:

* [`zeCommandListAppendLaunchCooperativeKernel`](https://spec.oneapi.io/level-zero/latest/core/api.html?highlight=coop#zecommandlistappendlaunchcooperativekernel)

* [`zeKernelSuggestMaxCooperativeGroupCount`](https://spec.oneapi.io/level-zero/latest/core/api.html?highlight=coop#ze__api_8h_1af0b050a6cc08132ef84a8618942ce125)

In addition to the kernel function, the CUDA occupancy query also requires a block size and dynamic memory usage. I'm not sure how L0 implements a similar behavior without these parameters. The two API pairs seem otherwise equivalent. I believe these functions can be used to provide implementations for the CUDA and L0 UR adapters.

OK thanks, I'll look into the HIP case.

0x12CC commented Jan 18, 2024

OK thanks, I'll look into the HIP case.

I found the following two HIP functions that seem relevant:

I'm not familiar with HIP but I would expect them to have the same semantics as the CUDA functions.

JackAKirk commented Jan 18, 2024

It looks like the UR function,

urKernelSuggestMaxCooperativeGroupCountExp(
    ur_kernel_handle_t hKernel, ///< [in] handle of the kernel object
    uint32_t *pGroupCountRet    ///< [out] pointer to maximum number of groups
)

matches the interface of

zeKernelSuggestMaxCooperativeGroupCount

I don't fully understand the semantics of the above ze function from the limited documentation I've read, how it relates to the Level Zero architecture design, or whether it can be considered a valid subset of the CUDA/HIP function semantics. The semantics of CUDA's cudaOccupancyMaxActiveBlocksPerMultiprocessor and its HIP equivalent, however, look clear to me. They look useless unless you provide the input parameters that are missing from urKernelSuggestMaxCooperativeGroupCountExp, namely the block dimension and the dynamic shared memory size. These two numbers determine occupancy: if a Compute Unit runs out of warps to host a complete block, or runs out of shared memory, the corresponding part of the CU is left idle. (A third number, registers per block, is also relevant for "non-register-spilled" occupancy, but since registers can be spilled it doesn't determine true occupancy, i.e. how many of a CU's warps/subgroups can run concurrently.)

From what I understand of the NVIDIA/AMD architectures, a suggested max group count wouldn't make sense without those two parameters as inputs. I guess it must make sense for ze, because that is how they designed their API. But for the general UR API, shouldn't it also have these two missing parameters, in order to properly support HIP/NVIDIA devices?
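The occupancy argument above can be made concrete with a back-of-the-envelope calculation. The per-CU limits below are invented for illustration, not taken from any particular GPU:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Illustrative per-compute-unit limits (made up for this example; real values
// come from device queries).
constexpr uint32_t kMaxWarpsPerCU = 48;         // resident warps per CU
constexpr uint32_t kSharedMemPerCU = 96 * 1024; // shared memory per CU, bytes
constexpr uint32_t kWarpSize = 32;

// Resident blocks per CU are bounded by whichever resource is exhausted
// first: warp slots or shared memory. This is why an occupancy query needs
// the block size and the dynamic shared memory usage as inputs.
uint32_t maxBlocksPerCU(uint32_t blockSize, uint32_t dynSharedMemPerBlock) {
  uint32_t warpsPerBlock = (blockSize + kWarpSize - 1) / kWarpSize;
  uint32_t byWarps = kMaxWarpsPerCU / warpsPerBlock;
  if (dynSharedMemPerBlock == 0)
    return byWarps;
  return std::min(byWarps, kSharedMemPerCU / dynSharedMemPerBlock);
}
```

With 256-thread blocks and no dynamic shared memory this gives 6 resident blocks per CU; raising the dynamic shared memory to 48 KiB per block cuts that to 2, even though the warp budget is unchanged, which is the effect the two extra parameters capture.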

Signed-off-by: Michael Aziz <michael.aziz@intel.com>
Signed-off-by: Michael Aziz <michael.aziz@intel.com>
@0x12CC 0x12CC requested a review from a team as a code owner January 18, 2024 20:44
0x12CC commented Jan 18, 2024

But for the general UR API, shouldn't it also have these two missing parameters, in order to properly support HIP/Nvidia devices?

I think this makes sense. I've updated the extension definition and the default implementations to include these two additional parameters. I'll update my draft SYCL runtime PR to use the new UR API.
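A sketch of what the extended default might look like. The parameter names match the `(void)` casts in the diff under review; the types are simplified stand-ins for the real UR definitions:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Simplified stand-ins for the real UR types.
using ur_kernel_handle_t = void *;
using ur_result_t = int;
constexpr ur_result_t UR_RESULT_SUCCESS = 0;

// Extended default: the occupancy inputs (work-group size and dynamic shared
// memory) are accepted so backends can use them, but the conservative default
// ignores them and reports a single group.
ur_result_t urKernelSuggestMaxCooperativeGroupCountExp(
    ur_kernel_handle_t hKernel, size_t localWorkSize,
    size_t dynamicSharedMemorySize, uint32_t *pGroupCountRet) {
  (void)hKernel;
  (void)localWorkSize;
  (void)dynamicSharedMemorySize;
  *pGroupCountRet = 1;
  return UR_RESULT_SUCCESS;
}
```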

Signed-off-by: Michael Aziz <michael.aziz@intel.com>
(void)hKernel;
(void)localWorkSize;
(void)dynamicSharedMemorySize;
*pGroupCountRet = 1;
@JackAKirk JackAKirk Jan 19, 2024
nit: Personally I'd just mark these unimplemented for now, until they have complete implementations, since returning 1 for every input could be confusing; it gives bad information.

But probably not a big deal.

0x12CC commented Jan 23, 2024

@oneapi-src/unified-runtime-maintain, could you please provide a review?

@aarongreig

Since these are new entry point implementations, you'll need to add them to the interface loader file for each adapter, like here for L0:

pDdiTable->pfnCooperativeKernelLaunchExp = nullptr;
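The requested change can be sketched as follows. The table field name comes from the line above; the table type, function-pointer type, and loader function are simplified, hypothetical stand-ins (the real tables are generated from the UR spec):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an adapter's dispatch (DDI) table, showing only
// the new experimental entry point.
using ur_pfnEnqueueCooperativeKernelLaunchExp_t = int (*)(void *);

struct ur_enqueue_exp_dditable_t {
  ur_pfnEnqueueCooperativeKernelLaunchExp_t pfnCooperativeKernelLaunchExp;
};

// Each adapter's interface loader fills in its table; entry points the
// adapter does not implement are set to nullptr so the loader can detect
// and report them.
void getEnqueueExpProcAddrTable(ur_enqueue_exp_dditable_t *pDdiTable) {
  pDdiTable->pfnCooperativeKernelLaunchExp = nullptr; // not implemented here
}
```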

Signed-off-by: Michael Aziz <michael.aziz@intel.com>
Signed-off-by: Michael Aziz <michael.aziz@intel.com>
0x12CC commented Jan 25, 2024

Thanks for the feedback, @aarongreig! I've made the change you requested for all four adapters. Do you know why this might have caused the HIP CI check to fail? I don't believe I've modified the failing test cases or any related code.

@aarongreig aarongreig left a comment

I think the fail was a CI issue we were having, it should go away now

0x12CC commented Jan 25, 2024

@kbenzie, can this PR be merged? I think it's ready.

@kbenzie kbenzie added the "ready to merge" label Jan 25, 2024
kbenzie commented Jan 25, 2024

@kbenzie, can this PR be merged? I think it's ready.

We've got a bit of a backlog of ready-to-merge PRs; we'll aim to get this merged ASAP.

0x12CC and others added 3 commits February 12, 2024 11:59
@kbenzie kbenzie merged commit 3fd11f1 into oneapi-src:main Feb 19, 2024
52 checks passed
@0x12CC 0x12CC deleted the cooperative_kernel_functions branch February 20, 2024 19:31