
[UR] Add default implementation for cooperative kernel functions #1246

Merged

merged 10 commits into oneapi-src:main from cooperative_kernel_functions on Feb 19, 2024

Conversation

@0x12CC 0x12CC commented Jan 12, 2024

Cooperative kernels can synchronize using device-scope barriers. These kernels
use `urKernelSuggestMaxCooperativeGroupCountExp` to ensure that all work groups
can run concurrently. When the maximum number of work groups is 1, these kernels
behave the same as regular kernels.

This PR adds a default implementation for
`urKernelSuggestMaxCooperativeGroupCountExp` that returns 1. Also, it adds a
default implementation for `urEnqueueCooperativeKernelLaunchExp` that calls
`urEnqueueKernelLaunch`.
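A minimal sketch of what such defaults could look like. The type names and signatures below are simplified stand-ins (the real definitions live in the UR headers, and the launch entry points take many more parameters); only the behavior described above is grounded in this PR.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-ins for the real UR types; error handling and the full
// launch parameter lists are omitted.
using ur_kernel_handle_t = void *;
using ur_result_t = int;
constexpr ur_result_t UR_RESULT_SUCCESS = 0;

// Stand-in for the existing regular launch entry point.
ur_result_t urEnqueueKernelLaunch(ur_kernel_handle_t hKernel) {
  (void)hKernel;
  return UR_RESULT_SUCCESS;
}

// Default: report that only one work group is guaranteed to run concurrently,
// so cooperative kernels degrade to regular-kernel behavior.
ur_result_t urKernelSuggestMaxCooperativeGroupCountExp(
    ur_kernel_handle_t hKernel, uint32_t *pGroupCountRet) {
  (void)hKernel;
  *pGroupCountRet = 1;
  return UR_RESULT_SUCCESS;
}

// Default: a cooperative launch simply forwards to the regular launch.
ur_result_t urEnqueueCooperativeKernelLaunchExp(ur_kernel_handle_t hKernel) {
  return urEnqueueKernelLaunch(hKernel);
}
```

With a group count of 1, a runtime that sizes its cooperative launches by this query never launches more groups than can synchronize, which is why forwarding to the regular launch is safe.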

Signed-off-by: Michael Aziz <michael.aziz@intel.com>
@0x12CC 0x12CC requested review from a team as code owners January 12, 2024 01:40
@0x12CC 0x12CC requested a review from JackAKirk January 12, 2024 01:40
0x12CC commented Jan 12, 2024

I've created intel/llvm#12367 as a draft to test these API functions. They're not presently used in SYCL, but my draft PR updates the SYCL runtime to use them for cooperative kernels. All tests, including the root group test, are presently passing.

codecov-commenter commented Jan 12, 2024

Codecov Report

Attention: 48 lines in your changes are missing coverage. Please review.

Comparison is base (78ef1ca) 14.82% compared to head (8a8d704) 12.72%.
Report is 11 commits behind head on main.

Files | Patch % | Lines
--- | --- | ---
...e/common/umf_pools/disjoint_pool_config_parser.cpp | 0.00% | 15 Missing ⚠️
source/common/umf_helpers.hpp | 0.00% | 9 Missing ⚠️
include/ur_print.hpp | 70.00% | 6 Missing ⚠️
source/common/logger/ur_logger_details.hpp | 61.53% | 5 Missing ⚠️
test/usm/usmPoolManager.cpp | 0.00% | 5 Missing ⚠️
source/loader/layers/tracing/ur_trcddi.cpp | 0.00% | 3 Missing ⚠️
source/loader/ur_libapi.cpp | 0.00% | 2 Missing ⚠️
source/adapters/null/ur_nullddi.cpp | 0.00% | 1 Missing ⚠️
source/loader/layers/validation/ur_valddi.cpp | 0.00% | 1 Missing ⚠️
source/loader/ur_ldrddi.cpp | 0.00% | 1 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1246      +/-   ##
==========================================
- Coverage   14.82%   12.72%   -2.10%     
==========================================
  Files         250      238      -12     
  Lines       36220    35346     -874     
  Branches     4094     4010      -84     
==========================================
- Hits         5369     4498     -871     
- Misses      30800    30844      +44     
+ Partials       51        4      -47     

☔ View full report in Codecov by Sentry.

@kbenzie kbenzie left a comment

OpenCL changes LGTM

JackAKirk commented Jan 16, 2024

I think the main design considerations for this interface are the following:

  • Which backends currently support the feature?

  • Do the backends that support the feature have interfaces supported by this unified interface?

It would be good to have a summary of your knowledge of the above to help with review. Thanks.

@nrspruit nrspruit left a comment


LGTM for level-zero

0x12CC commented Jan 16, 2024

  • Which backends currently support the feature?

I think CUDA and Level Zero support device-wide synchronization. I don't know about HIP, and OpenCL would require an extension.

  • Do the backends that support the feature have interfaces supported by this unified interface?

I'm not sure I understand this question. CUDA has the following two functions:

* [`cudaLaunchCooperativeKernel`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html#group__CUDART__EXECUTION_1g504b94170f83285c71031be6d5d15f73)

* [`cudaOccupancyMaxActiveBlocksPerMultiprocessor`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__HIGHLEVEL.html#group__CUDART__HIGHLEVEL_1g5a5d67a3c907371559ba692195e8a38c)

L0 has two similar functions:

* [`zeCommandListAppendLaunchCooperativeKernel`](https://spec.oneapi.io/level-zero/latest/core/api.html?highlight=coop#zecommandlistappendlaunchcooperativekernel)

* [`zeKernelSuggestMaxCooperativeGroupCount`](https://spec.oneapi.io/level-zero/latest/core/api.html?highlight=coop#ze__api_8h_1af0b050a6cc08132ef84a8618942ce125)
In addition to the kernel function, the CUDA occupancy query also requires a block size and dynamic memory usage. I'm not sure how L0 implements a similar behavior without these parameters. The two API pairs seem otherwise equivalent. I believe these functions can be used to provide implementations for the CUDA and L0 UR adapters.

@JackAKirk

  • Which backends currently support the feature?

I think CUDA and Level Zero support device-wide synchronization. I don't know about HIP, and OpenCL would require an extension.

  • Do the backends that support the feature have interfaces supported by this unified interface?

I'm not sure I understand this question. CUDA has the following two functions:

* [`cudaLaunchCooperativeKernel`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html#group__CUDART__EXECUTION_1g504b94170f83285c71031be6d5d15f73)

* [`cudaOccupancyMaxActiveBlocksPerMultiprocessor`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__HIGHLEVEL.html#group__CUDART__HIGHLEVEL_1g5a5d67a3c907371559ba692195e8a38c)

L0 has two similar functions:

* [`zeCommandListAppendLaunchCooperativeKernel`](https://spec.oneapi.io/level-zero/latest/core/api.html?highlight=coop#zecommandlistappendlaunchcooperativekernel)

* [`zeKernelSuggestMaxCooperativeGroupCount`](https://spec.oneapi.io/level-zero/latest/core/api.html?highlight=coop#ze__api_8h_1af0b050a6cc08132ef84a8618942ce125)

In addition to the kernel function, the CUDA occupancy query also requires a block size and dynamic memory usage. I'm not sure how L0 implements a similar behavior without these parameters. The two API pairs seem otherwise equivalent. I believe these functions can be used to provide implementations for the CUDA and L0 UR adapters.

OK thanks, I'll look into the HIP case.

0x12CC commented Jan 18, 2024

OK thanks, I'll look into the HIP case.

I found the following two HIP functions that seem relevant:

I'm not familiar with HIP but I would expect them to have the same semantics as the CUDA functions.

JackAKirk commented Jan 18, 2024

It looks like the UR function,

urKernelSuggestMaxCooperativeGroupCountExp(
    ur_kernel_handle_t hKernel, ///< [in] handle of the kernel object
    uint32_t *pGroupCountRet    ///< [out] pointer to maximum number of groups
)

matches the interface of

zeKernelSuggestMaxCooperativeGroupCount

I don't fully understand the semantics of the above ze function from the limited documentation I've read, how it relates to the Level Zero architecture design, or whether it can be considered a valid subset of the CUDA/HIP function semantics. The semantics of CUDA's cudaOccupancyMaxActiveBlocksPerMultiprocessor and its HIP equivalent, however, look clear to me. They look useless unless you provide the input parameters that are missing from urKernelSuggestMaxCooperativeGroupCountExp, namely the block dimension and the dynamic shared memory size. These two numbers determine occupancy: if a Compute Unit runs out of warps to host a complete block, or runs out of shared memory, the corresponding part of the CU is left idle. (A third number, registers per block, is also relevant for "non-register-spilled" occupancy, but since registers can be spilled it doesn't determine true occupancy, i.e. how many of a CU's warps/subgroups can run concurrently.)

From what I understand of the NVIDIA/AMD architectures, a suggested max group count wouldn't make sense without those two parameters as inputs. I guess it must make sense for ze, because that is how they designed their API. But for the general UR API, shouldn't it also have these two missing parameters, in order to properly support HIP/NVIDIA devices?
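The occupancy argument above can be made concrete with a back-of-the-envelope calculation. The per-CU limits below are invented for illustration, not taken from any particular GPU:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Illustrative per-compute-unit limits (made up for this example; real values
// come from device queries).
constexpr uint32_t kMaxWarpsPerCU = 48;         // resident warps per CU
constexpr uint32_t kSharedMemPerCU = 96 * 1024; // shared memory per CU, bytes
constexpr uint32_t kWarpSize = 32;

// Resident blocks per CU are bounded by whichever resource is exhausted
// first: warp slots or shared memory. This is why an occupancy query needs
// the block size and the dynamic shared memory usage as inputs.
uint32_t maxBlocksPerCU(uint32_t blockSize, uint32_t dynSharedMemPerBlock) {
  uint32_t warpsPerBlock = (blockSize + kWarpSize - 1) / kWarpSize;
  uint32_t byWarps = kMaxWarpsPerCU / warpsPerBlock;
  if (dynSharedMemPerBlock == 0)
    return byWarps;
  return std::min(byWarps, kSharedMemPerCU / dynSharedMemPerBlock);
}
```

With 256-thread blocks and no dynamic shared memory this gives 6 resident blocks per CU; raising the dynamic shared memory to 48 KiB per block cuts that to 2, even though the warp budget is unchanged, which is the effect the two extra parameters capture.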

Signed-off-by: Michael Aziz <michael.aziz@intel.com>
Signed-off-by: Michael Aziz <michael.aziz@intel.com>
@0x12CC 0x12CC requested a review from a team as a code owner January 18, 2024 20:44
0x12CC commented Jan 18, 2024

But for the general UR API, shouldn't it also have these two missing parameters, in order to properly support HIP/Nvidia devices?

I think this makes sense. I've updated the extension definition and the default implementations to include these two additional parameters. I'll update my draft SYCL runtime PR to use the new UR API.
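A sketch of what the extended default might look like. The parameter names match the `(void)` casts in the diff under review; the types are simplified stand-ins for the real UR definitions:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Simplified stand-ins for the real UR types.
using ur_kernel_handle_t = void *;
using ur_result_t = int;
constexpr ur_result_t UR_RESULT_SUCCESS = 0;

// Extended default: the occupancy inputs (work-group size and dynamic shared
// memory) are accepted so backends can use them, but the conservative default
// ignores them and reports a single group.
ur_result_t urKernelSuggestMaxCooperativeGroupCountExp(
    ur_kernel_handle_t hKernel, size_t localWorkSize,
    size_t dynamicSharedMemorySize, uint32_t *pGroupCountRet) {
  (void)hKernel;
  (void)localWorkSize;
  (void)dynamicSharedMemorySize;
  *pGroupCountRet = 1;
  return UR_RESULT_SUCCESS;
}
```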

Signed-off-by: Michael Aziz <michael.aziz@intel.com>
(void)hKernel;
(void)localWorkSize;
(void)dynamicSharedMemorySize;
*pGroupCountRet = 1;
@JackAKirk JackAKirk Jan 19, 2024
nit: Personally I'd just mark these unimplemented for now, until they have complete implementations, since returning 1 for every input could be confusing; it gives bad information.

But probably not a big deal.

0x12CC commented Jan 23, 2024

@oneapi-src/unified-runtime-maintain, could you please provide a review?

@aarongreig

Since these are new entry point implementations, you'll need to add them to the interface loader file for each adapter, like here for L0:

pDdiTable->pfnCooperativeKernelLaunchExp = nullptr;
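The requested change can be sketched as follows. The table field name comes from the line above; the table type, function-pointer type, and loader function are simplified, hypothetical stand-ins (the real tables are generated from the UR spec):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an adapter's dispatch (DDI) table, showing only
// the new experimental entry point.
using ur_pfnEnqueueCooperativeKernelLaunchExp_t = int (*)(void *);

struct ur_enqueue_exp_dditable_t {
  ur_pfnEnqueueCooperativeKernelLaunchExp_t pfnCooperativeKernelLaunchExp;
};

// Each adapter's interface loader fills in its table; entry points the
// adapter does not implement are set to nullptr so the loader can detect
// and report them.
void getEnqueueExpProcAddrTable(ur_enqueue_exp_dditable_t *pDdiTable) {
  pDdiTable->pfnCooperativeKernelLaunchExp = nullptr; // not implemented here
}
```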

Signed-off-by: Michael Aziz <michael.aziz@intel.com>
Signed-off-by: Michael Aziz <michael.aziz@intel.com>
0x12CC commented Jan 25, 2024

Thanks for the feedback, @aarongreig! I've made the change you requested for all four adapters. Do you know why this might have caused the HIP CI check to fail? I don't believe I've modified the failing test cases or any related code.

@aarongreig aarongreig left a comment

I think the fail was a CI issue we were having, it should go away now

0x12CC commented Jan 25, 2024

@kbenzie, can this PR be merged? I think it's ready.

@kbenzie kbenzie added the "ready to merge" label Jan 25, 2024
kbenzie commented Jan 25, 2024

@kbenzie, can this PR be merged? I think it's ready.

We've got a bit of a backlog of ready-to-merge PRs; we'll aim to get this merged ASAP.

0x12CC and others added 3 commits February 12, 2024 11:59
@kbenzie kbenzie merged commit 3fd11f1 into oneapi-src:main Feb 19, 2024
52 checks passed
@0x12CC 0x12CC deleted the cooperative_kernel_functions branch February 20, 2024 19:31