[UR][L0] Propagate OOM errors from USMAllocationMakeResident
#1022
This change ensures that USM allocation APIs don't return `UR_RESULT_SUCCESS` when an error occurs within `USMAllocationMakeResident`. Signed-off-by: Michael Aziz <michael.aziz@intel.com>
@kbenzie, @jandres742, this PR should address the regressions discussed in intel/llvm#11312. I was able to reproduce the failures on an Intel Arc A770 device and they're related to other result values we don't handle. Tests for this PR are in intel/llvm#11696.
hi @0x12CC. Your changes now have:

```cpp
auto Result = USMAllocationMakeResident(USMSharedAllocationForceResidency,
                                        Context, Device, *ResultPtr, Size);
if (Result == UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY ||
    Result == UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY) {
  return Result;
}
```

So it seems that if an error other than OUT_OF_DEVICE_MEMORY is returned, we need to mask it with SUCCESS because we don't handle those errors. Could you elaborate on the cases in which you are seeing the error? I would say it is better to handle those cases correctly rather than mask the error with SUCCESS.
The matrix tests failing on Intel Arc are due to an unhandled error result. The call to
I agree. The current implementation masks all errors with success. The change I'm suggesting in this PR is to forward OOM errors rather than mask them. I think that this implementation can be updated to include other error results once the previously described L0 issue is resolved. I can add a
Thanks @0x12CC. I see it now. OK, cool, please add a TODO, also linking the spec issue I just opened in the L0 spec so we fix it there as well:
@jandres742, I think this PR is ready to merge. I've added the requested TODOs and all tests are passing in intel/llvm#11696 except for the following:
This failure looks related to be53fb3 since it's in the CUDA adapter and the tests were passing before I merged the latest changes from the
Thanks @0x12CC. @kbenzie: please merge when possible.
There's a failure on intel/llvm#11696 for the CUDA E2E tests. I suspect this might be due to that PR not containing the latest changes from the
This failure is not fixed on the latest
Incorrect, intel/llvm#11454 fixed this yesterday. I'm also not going to merge this, as per the Adapter Change Process, until I see the intel/llvm checks all pass, so please update that PR.
Sorry for the confusion. I missed these commits in my last sync. I'm re-running the tests now with the
Thank you @0x12CC
@kbenzie, all of the tests are now passing in intel/llvm#11696. I think this PR is ready to merge.
I've opened intel/llvm#11811 combining this with #1033 and #1028 to merge into intel/llvm at the same time to accelerate merging.
Combines the following L0 changes:
* oneapi-src/unified-runtime#1033
* oneapi-src/unified-runtime#1028
* oneapi-src/unified-runtime#1022
This change ensures that USM allocation APIs don't return `UR_RESULT_SUCCESS` when an out of memory (OOM) error occurs within `USMAllocationMakeResident`. It's similar to the change reverted in #972, but only forwards OOM error results to avoid causing regressions; the behavior for other result values is unchanged.
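
The forwarding rule described above can be sketched in isolation. This is a minimal illustration, not the adapter's actual code: the `sketch::Result` enum and the `filterResidencyResult` helper are hypothetical stand-ins for `ur_result_t` and the check made after `USMAllocationMakeResident`, and treating both host and device OOM as forwarded results is an assumption.

```cpp
#include <cassert>

namespace sketch {

// Stand-in result codes for illustration only; real code would use
// ur_result_t from the Unified Runtime headers.
enum class Result { Success, OutOfHostMemory, OutOfDeviceMemory, Unknown };

// Hypothetical helper: decides what an allocation API reports after the
// residency call returns. OOM results are forwarded to the caller; every
// other result is still reported as success, as before the change.
constexpr Result filterResidencyResult(Result R) {
  if (R == Result::OutOfHostMemory || R == Result::OutOfDeviceMemory)
    return R;
  // TODO: forward other error results once they are handled.
  return Result::Success;
}

// Small self-check showing the intended behavior of the helper.
inline void selfCheck() {
  assert(filterResidencyResult(Result::OutOfDeviceMemory) ==
         Result::OutOfDeviceMemory);
  assert(filterResidencyResult(Result::Unknown) == Result::Success);
}

} // namespace sketch
```

Keeping the filter in one helper means that widening the set of forwarded errors later is a one-line change rather than an edit at every allocation entry point.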