Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CUDA][HIP][L0][NATIVECPU][OpenCL] Remove usage of die/terminate #1127

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

martygrant
Copy link
Contributor

@martygrant martygrant commented Nov 27, 2023

For #1078.

This PR attempts to replace all die/terminate calls with simply returning an appropriate error code. There may be a better code, or some kind of assert/abort may be more appropriate in some cases so any suggestions are appreciated.

intel-llvm run intel/llvm#12076

@martygrant martygrant added the enhancement New feature or request label Nov 27, 2023
@martygrant martygrant requested review from a team as code owners November 27, 2023 15:30
@@ -248,17 +245,19 @@ urEventWait(uint32_t numEvents, const ur_event_handle_t *phEventWaitList) {
UR_APIEXPORT ur_result_t UR_APICALL urEventRetain(ur_event_handle_t hEvent) {
const auto RefCount = hEvent->incrementReferenceCount();

detail::ur::assertion(RefCount != 0,
"Reference count overflow detected in urEventRetain.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This removes a specific error message with a generic one that does not relate to the specific error message.
If you want to keep specific errors then use setErrorMessage etc as e.g here: intel/llvm#10626

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are quite a few examples of this in this pr.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This removes a specific error message with a generic one that does not relate to the specific error message.

Agree.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Jack for these example I'll restore the original error message but use a standard assert(message) from the cassert header.

return UR_RESULT_ERROR_UNSUPPORTED_FEATURE;
}

UR_APIEXPORT ur_result_t UR_APICALL
urUsmP2PDisablePeerAccessExp([[maybe_unused]] ur_device_handle_t commandDevice,
[[maybe_unused]] ur_device_handle_t peerDevice) {

cl_adapter::die(
"Experimental P2P feature is not implemented for OpenCL adapter.");
return UR_RESULT_ERROR_UNSUPPORTED_FEATURE;
Copy link
Contributor

@JackAKirk JackAKirk Nov 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See here: https://github.com/intel/llvm/blob/88f1d0a2b64e2c61d208d4dcebf9b512f49f4860/sycl/plugins/unified_runtime/pi2ur.hpp#L108

This means that in dpc++ the error will simply be PI_ERROR_INVALID_VALUE which is useless to the user and is probably going to lead to a lot of confusion.

In the long run when it really could return the semantics of UR_RESULT_ERROR_UNSUPPORTED_FEATURE
it is probably worth considering whether this is perhaps also a degradation wrt end-user usefulness wrt the more specific "Experimental P2P feature is not implemented for OpenCL adapter."

There are many examples of this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't realise there wasn't a DPC++ equivalent error code for UR_RESULT_ERROR_UNSUPPORTED_FEATURE thanks for pointing this out. We would like to make a change to the SYCL runtime to handle these errors similar to a recent change made here intel/llvm@3756725 would this be acceptable?

Copy link
Contributor

@JackAKirk JackAKirk Nov 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't realise there wasn't a DPC++ equivalent error code for UR_RESULT_ERROR_UNSUPPORTED_FEATURE thanks for pointing this out. We would like to make a change to the SYCL runtime to handle these errors similar to a recent change made here intel/llvm@3756725 would this be acceptable?

I guess you could do that as things stand. Although from what I understand it would require that all backends that can call this API won't be able to return any other errors that map to PI_ERROR_INVALID_OPERATION, and this will have to be guaranteed to remain true into the future. But these are technical implementation questions, which I don't think are the key issue to consider.

Since also presumably all PI_ERROR_* will be removed at some point when PI is removed, maybe the above solution would only be very temporary, and I can see one argument that in the future it might be decided that UR_RESULT_ERROR_UNSUPPORTED_FEATURE will be returned for all such cases where a feature isn't supported. Then the runtime can pass a more specific message if it receives this error.

The above paragraph is more of a specification question which in my point of view is the right type of question to be asking at this stage, since there doesn't appear (tell me if I am wrong) a clear specification in UR for dealing with errors. For such a specification to be stable into the future I can think of some requirements it would need to satisfy (in a rough order of importance):

  • Support both adapter agnostic and adapter specific errors.
  • Be compatible with the error requirements of existing supported language runtime specifications (assuming that the language runtime specifications have error handling fit for purpose, and aren't themselves looking into changes in future versions (SYCL is one), in such a case the error requirements to take account of will be the future versions not the existing ones).
  • Be future proof for error requirements of adapters that might be supported in the future.

For sure the above requirements would take some amount of thought to satisfy, but I don't see (particularly the first two) as being impossible. And before solving this problem is properly undertaken I see it as likely that we will just largely go in circles, or at best zig-zag to a stable solution.

At the moment the process of e.g. this PR is:

step 1. make a change to move in a particular direction due to a single requirement: e.g. get rid of dies within UR
step 2: run into some constraints like e.g. the one in this particular thread.
step 3: solve the constraints and merge pr
step 4: realize that we didn't consider some other constraints and have to start the process over.

Of course it is just my opinion, and it doesn't block this PR, but I'd recommend going straight to the finish line and first plan an error handling spec that is fit for purpose.
The other issue is that without a clear spec people deal with errors in an inconsistent way. If once someone has ideally fully considered the requirements they add docs to https://oneapi-src.github.io/unified-runtime/core/INTRO.html#error-handling for contributors that would be useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @JackAKirk I think it's maybe best to continue the discussion for this in a new issue - #1161 as it would potentially make this a very large PR! I think we are going to go ahead with looking at handling the unsupported feature in the sycl runtime for now. Cheers

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, thanks.

@JackAKirk
Copy link
Contributor

I think there are quite a few degradations introduced in this PR wrt DPC++ usage. It would be good to consider the effect on the end-user for each category of changes made here.

source/adapters/hip/memory.cpp Outdated Show resolved Hide resolved
source/adapters/hip/common.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@jchlanda jchlanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the effort to get rid of asserting in favour of providing error message. I do think however, that greater care is needed to make sure that the error codes are meaningful. Admittedly, this is non-trivial and daunting task, but if we let it slip now, it's likely that it will never get fixed.

Similarly, like Jack pointed out, it would be nice to preserve messages (maybe with exception of those unimplemented API entry).

@@ -510,7 +557,9 @@ UR_APIEXPORT ur_result_t UR_APICALL urDeviceGetInfo(ur_device_handle_t hDevice,
UR_CHECK_ERROR(cuDeviceGetAttribute(
&ECCEnabled, CU_DEVICE_ATTRIBUTE_ECC_ENABLED, hDevice->get()));

detail::ur::assertion((ECCEnabled == 0) | (ECCEnabled == 1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you are at it, should it be rewritten to use logical or operator? I'm not sure if bitwise was intended or an error in the first place? (in a couple of places in this file).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it was likely a typo in the original diff. I've changed it over now, thanks.

default:
detail::ur::die("Invalid image format.");
return 0;
return UR_RESULT_ERROR_INVALID_IMAGE_SIZE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would UR_RESULT_ERROR_INVALID_IMAGE_FORMAT_DESCRIPTOR fit better here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to this, thanks.

@@ -306,8 +306,7 @@ UR_APIEXPORT ur_result_t UR_APICALL urMemImageCreate(
PixelTypeSizeBytes = 4;
break;
default:
detail::ur::die(
"urMemImageCreate given unsupported image_channel_data_type");
return UR_RESULT_ERROR_UNSUPPORTED_ENUMERATION;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the switch is over ur_image_format_t maybe return: UR_RESULT_ERROR_INVALID_IMAGE_FORMAT_DESCRIPTOR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to this, thanks.

switch (Format) {
case HIP_AD_FORMAT_UNSIGNED_INT8:
case HIP_AD_FORMAT_SIGNED_INT8:
*Size = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above switch/break.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this, thanks.

switch (ArrayDesc.Format) {
case CU_AD_FORMAT_UNSIGNED_INT8:
case CU_AD_FORMAT_SIGNED_INT8:
return 1;
*Size = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need break?
https://godbolt.org/z/P766Pvaf1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, I just spotted that too. The control will be transferred to the next statement and so on until it reaches a break or end of the switch. Maybe there are either no tests or not enabled tests to catch this now but it will be giving incorrect element byte sizes for when 8-bit and 16-bit formats are used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this, thanks.

@@ -843,23 +843,25 @@ UR_APIEXPORT ur_result_t UR_APICALL urEnqueueMemBufferFill(
}
}

static size_t imageElementByteSize(CUDA_ARRAY_DESCRIPTOR ArrayDesc) {
static ur_result_t imageElementByteSize(CUDA_ARRAY_DESCRIPTOR ArrayDesc,
int *Size) {
switch (ArrayDesc.Format) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need break?
https://godbolt.org/z/P766Pvaf1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this, thanks.

source/adapters/hip/common.cpp Outdated Show resolved Hide resolved
@@ -15,6 +15,16 @@
#include <hip/hip_runtime.h>
#include <ur/ur.hpp>

/**
* Call an UR API and, if the result is not UR_RESULT_SUCCESS, automatically
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Call an UR API and, if the result is not UR_RESULT_SUCCESS, automatically
* Call a UR API and, if the result is not UR_RESULT_SUCCESS, automatically

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this, thanks.

return ReturnValue(static_cast<uint64_t>(LocalMemSize));
}
case UR_DEVICE_INFO_ERROR_CORRECTION_SUPPORT: {
int EccEnabled = 0;
UR_CHECK_ERROR(hipDeviceGetAttribute(
&EccEnabled, hipDeviceAttributeEccEnabled, hDevice->get()));

detail::ur::assertion((EccEnabled == 0) | (EccEnabled == 1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as in CUDA re: binary/logical or.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this, thanks.

@@ -493,8 +546,9 @@ UR_APIEXPORT ur_result_t UR_APICALL urDeviceGetInfo(ur_device_handle_t hDevice,
// name instead, this is also what AMD OpenCL devices return.
if (strlen(Name) == 0) {
hipDeviceProp_t Props;
detail::ur::assertion(hipGetDeviceProperties(&Props, hDevice->get()) ==
hipSuccess);
if (hipGetDeviceProperties(&Props, hDevice->get()) != hipSuccess) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make some effort to try and translate native error returns to UR specific? There is a bunch of native calls (here and in CUDA: cuMemGetInfo, cuDeviceTotalMem etc.) that just hard code UR_RESULT_ERROR_INVALID_OPERATION/UR_RESULT_ERROR_INVALID_SIZE on failure. Previously they were wrapped up in assertion, so the value didn't really matter, but now, they will be served up all the way to the user.

@fabiomestre
Copy link
Contributor

I have updated the target branch of this PR from the adapters branch to the main branch.
Development in UR is moving back to main. The adapters branch will soon be deleted.

Copy link
Contributor

@Bensuo Bensuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes to command buffers LGTM

Copy link
Contributor

@isaacault isaacault left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bindless Images LGTM

… codes or replacing original

error message to use an assert from cassert header and some other minor fixes.
…checks with an assert instead of returning a generic UR error.
@kbenzie kbenzie added loader Loader related feature/bug level-zero L0 adapter specific issues cuda CUDA adapter specific issues hip HIP adapter specific issues opencl OpenCL adapter specific issues native-cpu Native CPU adapter specific issues and removed enhancement New feature or request labels Apr 10, 2024
ur_event_handle_t_ *Event = ur_cast<ur_event_handle_t_ *>(e);
if (!Event->hasExternalRefs())
die("urEventsWait must not be called for an internal event");
ur_event_handle_t_ *Event =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why drop the use of the variable e? then it will just be unused in the internals.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuda CUDA adapter specific issues hip HIP adapter specific issues level-zero L0 adapter specific issues loader Loader related feature/bug native-cpu Native CPU adapter specific issues opencl OpenCL adapter specific issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants