[UR][L0] Propagate OOM errors from `USMAllocationMakeResident` #1022

0x12CC · 2023-10-31T18:04:33Z

This change ensures that USM allocation APIs don't return `UR_RESULT_SUCCESS` when an out-of-memory (OOM) error occurs within `USMAllocationMakeResident`. It's similar to the change reverted in #972, but it forwards only OOM error results to avoid causing regressions; the behavior for other result values is unchanged.

0x12CC · 2023-10-31T18:17:01Z

@kbenzie, @jandres742, this PR should address the regressions discussed in intel/llvm#11312. I was able to reproduce the failures on an Intel Arc A770 device and they're related to other result values we don't handle.

Tests for this PR are in intel/llvm#11696.

source/adapters/level_zero/usm.cpp

jandres742 · 2023-10-31T22:12:16Z

Hi @0x12CC. Your changes now have:

  auto Result = USMAllocationMakeResident(USMSharedAllocationForceResidency,
                                          Context, Device, *ResultPtr, Size);
  if (Result == UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY ||
      Result == UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY) {
    return Result;
  }

So it seems that if an error other than OUT_OF_DEVICE_MEMORY is returned, we need to mask it with SUCCESS, because we don't handle those errors. Could you elaborate on the cases in which you are seeing the error? I would say it is better to handle those cases correctly rather than mask the error with SUCCESS.

0x12CC · 2023-11-01T15:15:58Z

> so it seems that if an error other than OUT_OF_DEVICE_MEMORY is returned, we need to mask it with SUCCESS, because we dont handle those errors. Could you elaborate on what cases you are seeing the error?

The matrix tests failing on Intel Arc are due to an unhandled error result. The call to zeContextMakeMemoryResident returns ZE_RESULT_ERROR_INVALID_ARGUMENT. This seems like a bug in L0 since zeContextMakeMemoryResident doesn't specify this value as a possible result. The argument values are not null so I'm not sure if there's anything we can do to handle this error.

> I would say it is better we handle correctly those cases rather than masking the error with SUCCESS.

I agree. The current implementation masks all errors with success. The change I'm suggesting in this PR is to forward OOM errors rather than mask them. I think that this implementation can be updated to include other error results once the previously described L0 issue is resolved. I can add a TODO comment for handling other errors here if you think it's appropriate.
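For readers following along, here is a minimal sketch of the three policies discussed in this thread. It is not the adapter code: `Result` is a hypothetical stand-in for `ur_result_t`, the function names are invented for illustration, and it assumes that both host and device OOM results are forwarded and that the change reverted in #972 forwarded every error.

```cpp
#include <cassert>

// Hypothetical stand-in for ur_result_t; only the values discussed here.
enum class Result { Success, OutOfHostMemory, OutOfDeviceMemory, InvalidArgument };

// Behavior before this PR: every residency failure is reported as success.
constexpr Result maskAll(Result) { return Result::Success; }

// Behavior proposed in this PR: forward OOM results, mask everything else.
// Other errors stay masked until the L0 issue described above is resolved.
constexpr Result forwardOomOnly(Result R) {
  const bool IsOom =
      R == Result::OutOfHostMemory || R == Result::OutOfDeviceMemory;
  return IsOom ? R : Result::Success;
}

// Assumed behavior of the reverted change: forward every result unchanged.
constexpr Result forwardAll(Result R) { return R; }

// Runtime checks contrasting the policies on the two interesting inputs.
inline void runChecks() {
  // An OOM result is hidden by maskAll but visible with the other two.
  assert(maskAll(Result::OutOfDeviceMemory) == Result::Success);
  assert(forwardOomOnly(Result::OutOfDeviceMemory) == Result::OutOfDeviceMemory);
  assert(forwardAll(Result::OutOfDeviceMemory) == Result::OutOfDeviceMemory);

  // The unexpected invalid-argument result only surfaces with forwardAll.
  assert(forwardOomOnly(Result::InvalidArgument) == Result::Success);
  assert(forwardAll(Result::InvalidArgument) == Result::InvalidArgument);
}
```

Under these assumptions, only the forward-everything policy would turn the unexpected invalid-argument result into an allocation failure, which is consistent with the matrix test failures described above.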

jandres742 · 2023-11-01T21:36:55Z

Thanks @0x12CC. I see it now. OK, cool, please add a TODO that also links the issue I just opened against the L0 spec, so we fix it there as well:

oneapi-src/level-zero-spec#240

0x12CC · 2023-11-06T21:00:29Z

@jandres742, I think this PR is ready to merge. I've added the requested TODOs and all tests are passing in intel/llvm#11696 except for the following:

********************
Failed Tests (1):
  SYCL :: Plugin/cuda-max-local-mem-size.cpp

This failure looks related to be53fb3 since it's in the CUDA adapter and the tests were passing before I merged the latest changes from the adapters branch.

jandres742 · 2023-11-07T06:02:22Z

Thanks @0x12CC.

@kbenzie : please merge when possible.

kbenzie · 2023-11-07T13:04:45Z

There's a failure on intel/llvm#11696 for the CUDA E2E tests. I suspect this might be due to that PR not containing the latest changes from the sycl branch. @0x12CC could you update that PR to see if we can get it passing before merging?

0x12CC · 2023-11-07T13:23:30Z

This failure is not fixed on the latest sycl branch. I believe it's caused by be53fb3 and is not related to these L0 changes.

kbenzie · 2023-11-07T13:28:37Z

> This failure is not fixed on the latest sycl branch.

Incorrect, intel/llvm#11454 fixed this yesterday. I'm also not going to merge this, as per the Adapter Change Process, until I see the intel/llvm checks all pass, so please update that PR.

0x12CC · 2023-11-07T13:36:45Z

Sorry for the confusion. I missed these commits in my last sync. I'm re-running the tests now with the sycl branch at intel/llvm@aa0171d.

kbenzie · 2023-11-07T13:39:20Z

Thank you @0x12CC

0x12CC · 2023-11-07T17:27:19Z

@kbenzie, all of the tests are now passing in intel/llvm#11696. I think this PR is ready to merge.

kbenzie · 2023-11-08T12:15:45Z

I've opened intel/llvm#11811, combining this with #1033 and #1028 so that they merge into intel/llvm at the same time, to accelerate merging.

0x12CC added 2 commits October 27, 2023 14:24

[UR][L0] Propagate errors from USMAllocationMakeResident

f2be823

This change ensures that USM allocation APIs don't return `UR_RESULT_SUCCESS` when an error occurs within `USMAllocationMakeResident`. Signed-off-by: Michael Aziz <michael.aziz@intel.com>

Fix error propagation

f056f97

Signed-off-by: Michael Aziz <michael.aziz@intel.com>

0x12CC requested a review from a team as a code owner October 31, 2023 18:04

Merge branch 'adapters' into l0_usm_error_checking_2

b205652

jandres742 reviewed Oct 31, 2023

source/adapters/level_zero/usm.cpp (outdated, resolved)

Fix result checks

bc7c0f4

Signed-off-by: Michael Aziz <michael.aziz@intel.com>

0x12CC requested a review from jandres742 November 1, 2023 15:16

jandres742 approved these changes Nov 1, 2023

Add TODO for handling other error results

fe469d7

Signed-off-by: Michael Aziz <michael.aziz@intel.com>

Merge branch 'adapters' into l0_usm_error_checking_2

5fb8292

kbenzie added the "ready to merge" label Nov 8, 2023

This was referenced Nov 8, 2023

[UR][L0] Add support for urAdapterGetLastError in L0 #1033

Merged

[UR][L0] Add support for zeCommandListHostSynchronize #1028

Merged

kbenzie merged commit ec7982b into oneapi-src:adapters Nov 8, 2023
48 checks passed

kbenzie added a commit to kbenzie/llvm that referenced this pull request Nov 8, 2023

[UR] Bump to ec7982bac6cb3a6b9ed610cd6b7cb41fcbc780dc

7348bc5

Combines the following L0 changes:

* oneapi-src/unified-runtime#1033
* oneapi-src/unified-runtime#1028
* oneapi-src/unified-runtime#1022

kbenzie mentioned this pull request Nov 8, 2023

[UR] Bump to ec7982bac6cb3a6b9ed610cd6b7cb41fcbc780dc intel/llvm#11811

Merged

0x12CC deleted the l0_usm_error_checking_2 branch November 8, 2023 14:47

againull pushed a commit to intel/llvm that referenced this pull request Nov 8, 2023

[UR] Bump to ec7982bac6cb3a6b9ed610cd6b7cb41fcbc780dc (#11811)

5cdc096

Combines the following L0 changes:

* oneapi-src/unified-runtime#1033
* oneapi-src/unified-runtime#1028
* oneapi-src/unified-runtime#1022

kbenzie added a commit to kbenzie/unified-runtime that referenced this pull request Nov 9, 2023

Revert "Merge pull request oneapi-src#1022 from 0x12CC/l0_usm_error_c…

8a59336

…hecking_2" This reverts commit ec7982b, reversing changes made to 62e6d2f.
