Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[UR][L0] Propagate OOM errors from USMAllocationMakeResident #1022

Merged
merged 6 commits into from
Nov 8, 2023

Conversation

0x12CC
Copy link
Contributor

@0x12CC 0x12CC commented Oct 31, 2023

This change ensures that USM allocation APIs don't return UR_RESULT_SUCCESS when an out of memory (OOM) error occurs within USMAllocationMakeResident. It's similar to the change reverted in #972, but only forwards OOM error results to avoid causing regressions; The behavior for other result values is unchanged.

This change ensures that USM allocation APIs don't return
`UR_RESULT_SUCCESS` when an error occurs within
`USMAllocationMakeResident`.

Signed-off-by: Michael Aziz <michael.aziz@intel.com>
Signed-off-by: Michael Aziz <michael.aziz@intel.com>
@0x12CC 0x12CC requested a review from a team as a code owner October 31, 2023 18:04
@0x12CC
Copy link
Contributor Author

0x12CC commented Oct 31, 2023

@kbenzie, @jandres742, this PR should address the regressions discussed in intel/llvm#11312. I was able to reproduce the failures on an Intel Arc A770 device and they're related to other result values we don't handle.

Tests for this PR are in intel/llvm#11696.

@jandres742
Copy link

@kbenzie, @jandres742, this PR should address the regressions discussed in intel/llvm#11312. I was able to reproduce the failures on an Intel Arc A770 device and they're related to other result values we don't handle.

Tests for this PR are in intel/llvm#11696.

hi @0x12CC . Your changes have now:

  auto Result = USMAllocationMakeResident(USMSharedAllocationForceResidency,
                                          Context, Device, *ResultPtr, Size);
  if (Result == UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY ||
      Result == UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY) {
    return Result;
  }

so it seems that if an error other than OUT_OF_DEVICE_MEMORY is returned, we need to mask it with SUCCESS, because we dont handle those errors. Could you elaborate on what cases you are seeing the error? I would say it is better we handle correctly those cases rather than masking the error with SUCCESS.

Signed-off-by: Michael Aziz <michael.aziz@intel.com>
@0x12CC
Copy link
Contributor Author

0x12CC commented Nov 1, 2023

so it seems that if an error other than OUT_OF_DEVICE_MEMORY is returned, we need to mask it with SUCCESS, because we dont handle those errors. Could you elaborate on what cases you are seeing the error?

The matrix tests failing on Intel Arc are due to an unhandled error result. The call to zeContextMakeMemoryResident returns ZE_RESULT_ERROR_INVALID_ARGUMENT. This seems like a bug in L0 since zeContextMakeMemoryResident doesn't specify this value as a possible result. The argument values are not null so I'm not sure if there's anything we can do to handle this error.

I would say it is better we handle correctly those cases rather than masking the error with SUCCESS.

I agree. The current implementation masks all errors with success. The change I'm suggesting in this PR is to forward OOM errors rather than mask them. I think that this implementation can be updated to include other error results once the previously described L0 issue is resolved. I can add a TODO comment for handling other errors here if you think it's appropriate.

@jandres742
Copy link

so it seems that if an error other than OUT_OF_DEVICE_MEMORY is returned, we need to mask it with SUCCESS, because we dont handle those errors. Could you elaborate on what cases you are seeing the error?

The matrix tests failing on Intel Arc are due to an unhandled error result. The call to zeContextMakeMemoryResident returns ZE_RESULT_ERROR_INVALID_ARGUMENT. This seems like a bug in L0 since zeContextMakeMemoryResident doesn't specify this value as a possible result. The argument values are not null so I'm not sure if there's anything we can do to handle this error.

I would say it is better we handle correctly those cases rather than masking the error with SUCCESS.

I agree. The current implementation masks all errors with success. The change I'm suggesting in this PR is to forward OOM errors rather than mask them. I think that this implementation can be updated to include other error results once the previously described L0 issue is resolved. I can add a TODO comment for handling other errors here if you think it's appropriate.

Thanks @0x12CC . I see it now. Ok, cool, please add a TODO, linking also the spec issue I just opened in the L0 spec so we fix that also there:

oneapi-src/level-zero-spec#240

Signed-off-by: Michael Aziz <michael.aziz@intel.com>
@0x12CC
Copy link
Contributor Author

0x12CC commented Nov 6, 2023

@jandres742, I think this PR is ready to merge. I've added the requested TODOs and all tests are passing in intel/llvm#11696 except for the following:

********************
Failed Tests (1):
  SYCL :: Plugin/cuda-max-local-mem-size.cpp

This failure looks related to be53fb3 since it's in the CUDA adapter and the tests were passing before I merged the latest changes from the adapters branch.

@jandres742
Copy link

@jandres742, I think this PR is ready to merge. I've added the requested TODOs and all tests are passing in intel/llvm#11696 except for the following:

********************
Failed Tests (1):
  SYCL :: Plugin/cuda-max-local-mem-size.cpp

This failure looks related to be53fb3 since it's in the CUDA adapter and the tests were passing before I merged the latest changes from the adapters branch.

thanks @0x12CC .

@kbenzie : please merge when possible.

@kbenzie
Copy link
Contributor

kbenzie commented Nov 7, 2023

@jandres742, I think this PR is ready to merge. I've added the requested TODOs and all tests are passing in intel/llvm#11696 except for the following:

********************
Failed Tests (1):
  SYCL :: Plugin/cuda-max-local-mem-size.cpp

This failure looks related to be53fb3 since it's in the CUDA adapter and the tests were passing before I merged the latest changes from the adapters branch.

thanks @0x12CC .

@kbenzie : please merge when possible.

There's a failure on intel/llvm#11696 for the CUDA E2E tests. I suspect this might be due to that PR not containing the latest changes from the sycl branch. @0x12CC coudl you update that PR to see if we can get it passing before merging?

@0x12CC
Copy link
Contributor Author

0x12CC commented Nov 7, 2023

@jandres742, I think this PR is ready to merge. I've added the requested TODOs and all tests are passing in intel/llvm#11696 except for the following:

********************
Failed Tests (1):
  SYCL :: Plugin/cuda-max-local-mem-size.cpp

This failure looks related to be53fb3 since it's in the CUDA adapter and the tests were passing before I merged the latest changes from the adapters branch.

thanks @0x12CC .
@kbenzie : please merge when possible.

There's a failure on intel/llvm#11696 for the CUDA E2E tests. I suspect this might be due to that PR not containing the latest changes from the sycl branch. @0x12CC coudl you update that PR to see if we can get it passing before merging?

This failure is not fixed on the latest sycl branch. I believe it's caused by be53fb3 and is not related to these L0 changes.

@kbenzie
Copy link
Contributor

kbenzie commented Nov 7, 2023

This failure is not fixed on the latest sycl branch.

Incorrect, intel/llvm#11454 fixed this yesterday. I'm also not going to merged this, as per the Adapter Change Process, until I see the intel/llvm checks all pass so please update that PR.

@0x12CC
Copy link
Contributor Author

0x12CC commented Nov 7, 2023

This failure is not fixed on the latest sycl branch.

Incorrect, intel/llvm#11454 fixed this yesterday. I'm also not going to merged this, as per the Adapter Change Process, until I see the intel/llvm checks all pass so please update that PR.

Sorry for the confusion. I missed these commits in my last sync. I'm re-running the tests now with the sycl branch at intel/llvm@aa0171d.

@kbenzie
Copy link
Contributor

kbenzie commented Nov 7, 2023

Thank you @0x12CC

@0x12CC
Copy link
Contributor Author

0x12CC commented Nov 7, 2023

@kbenzie, all of the tests are now passing in intel/llvm#11696. I think this PR is ready to merge.

@kbenzie
Copy link
Contributor

kbenzie commented Nov 8, 2023

I've opened intel/llvm#11811 combing this with #1033 and #1028 to merge into intel/llvm at the same time to accelerate merging.

@kbenzie kbenzie merged commit ec7982b into oneapi-src:adapters Nov 8, 2023
48 checks passed
@0x12CC 0x12CC deleted the l0_usm_error_checking_2 branch November 8, 2023 14:47
kbenzie added a commit to kbenzie/unified-runtime that referenced this pull request Nov 9, 2023
…hecking_2"

This reverts commit ec7982b, reversing
changes made to 62e6d2f.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready to merge Added to PR's which are ready to merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants