Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][CUDA] Improve kernel launch error handling for out-of-registers #12604

Conversation

GeorgeWeb
Copy link
Contributor

@GeorgeWeb GeorgeWeb commented Feb 5, 2024

This PR improves the handling of errors by specializing PI_ERROR_OUT_OF_RESOURCES.

Previously, in the CUDA backend we handled the out of resources launch error (for exceeded registers) as invalid work group size error. Now pairing the new specialized handling with the UR adapter change oneapi-src/unified-runtime#1318 to return the correct error code, we no longer output a misleading error message to users.
Also, added a fallback message for the generic out of resources error codes returned from APIs (e.g. for kernel launch).

Fixes issue: oneapi-src/unified-runtime#1308

@GeorgeWeb GeorgeWeb force-pushed the georgi/sycl-cuda-out-of-resources-registers-error branch from 11f09c1 to 7606868 Compare February 5, 2024 15:33
@GeorgeWeb GeorgeWeb force-pushed the georgi/sycl-cuda-out-of-resources-registers-error branch from 4b28554 to 7e258e9 Compare February 6, 2024 13:27
@GeorgeWeb
Copy link
Contributor Author

Heyo @intel/llvm-reviewers-runtime. May I get a look at this one? Thank you!

Copy link
Contributor

@steffenlarsen steffenlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@kbenzie kbenzie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please pull in the latest sycl branch changes then update the UNIFIED_RUNTIME_REPO and UNIFIED_RUNTIME_TAG variables as shown in this diff:

diff --git a/sycl/plugins/unified_runtime/CMakeLists.txt b/sycl/plugins/unified_runtime/CMakeLists.txt
index a081bdb010d8..536358cb36af 100644
--- a/sycl/plugins/unified_runtime/CMakeLists.txt
+++ b/sycl/plugins/unified_runtime/CMakeLists.txt
@@ -100,13 +100,13 @@ if(SYCL_PI_UR_USE_FETCH_CONTENT)
   endfunction()

   set(UNIFIED_RUNTIME_REPO "https://github.com/oneapi-src/unified-runtime.git")
-  # commit 9f783837089c970a22cda08f768aa3dbed38f0d3
-  # Merge: c015f892 b9442104
+  # commit 5083f4f96557672b7b6a55ea53347896d40549d7
+  # Merge: a97eed15 4c3f9abe
   # Author: Kenneth Benzie (Benie) <k.benzie@codeplay.com>
-  # Date:   Fri May 31 10:20:23 2024 +0100
-  #     Merge pull request #1533 from AllanZyne/sanitizer-buffer
-  #     [DeviceSanitizer] Support detecting out-of-bounds errors on sycl::buffer
-  set(UNIFIED_RUNTIME_TAG 9f783837089c970a22cda08f768aa3dbed38f0d3)
+  # Date:   Fri May 31 17:20:01 2024 +0100
+  #     Merge pull request #1397 from GeorgeWeb/georgi/check-allocation-error-on-event-from-native-handle
+  #     [CUDA][HIP] Catch and report bad_alloc errors for event object creation
+  set(UNIFIED_RUNTIME_TAG 5083f4f96557672b7b6a55ea53347896d40549d7)

   fetch_adapter_source(level_zero
     ${UNIFIED_RUNTIME_REPO}

sycl/plugins/unified_runtime/CMakeLists.txt Outdated Show resolved Hide resolved
@GeorgeWeb GeorgeWeb force-pushed the georgi/sycl-cuda-out-of-resources-registers-error branch from deb7c3e to a868e0f Compare May 31, 2024 17:02
@GeorgeWeb GeorgeWeb marked this pull request as ready for review May 31, 2024 17:03
@kbenzie
Copy link
Contributor

kbenzie commented Jun 3, 2024

@intel/llvm-gatekeepers please merge

@sommerlukas sommerlukas merged commit 9f1cee5 into intel:sycl Jun 3, 2024
14 checks passed
using std::string_view_literals::operator""sv;
auto Msg = "Exceeded the number of registers available on the hardware."sv;
if (std::string(e.what()).find(Msg) != std::string::npos) {
auto Errc = sycl::make_error_code(sycl::errc::nd_range);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GeorgeWeb, I was looking at the PR and got confused by it. Could you please clarify what is the key change here? PR description says that we used to display confusing error to users, but check for the exception message wasn't changed by the PR.

What was the confusing part then? I see that in #12363 we had a bug report which contains invalid message, but this test had been introduced a year ago before that issue was submitted in #9106. What am I missing here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In summary the wrong part of the message as per the PRs description was just the plugin error code description - PI_ERROR_INVALID_WORK_GROUP_SIZE should have been PI_ERROR_OUT_OF_LAUNCH_RESOURCES. The reason I've added the errc::nd_range error code check was just for more verbosity but it wasn't of importance here.


The real issue about "reporting a completely wrong message" in the OPs report (#12363) was due to a mistake on this line https://github.com/intel/llvm/pull/9106/files#diff-7525901710934f7bdb2ad36238c4b67163f112d3bd233db7af0b0078b5b01e80R3263 which was fixed by this UR cuda change oneapi-src/unified-runtime#1299

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants