[CUDA] Fix synchronization issue in urEnqueueMemImageCopy #1104
Conversation
What is cuMemcpyAtoA? From the docs, cudaMemcpy2DArrayToArray is synchronous, isn't it? Do you mean that these nominally synchronous data transfers between devices could complete after the event returned by urEnqueueMemImageCopy finishes?
For 1D images, urEnqueueMemImageCopy was using cuMemcpyAtoA, which does not have an asynchronous version. This means that, when the copy happens between two arrays in device memory, the call will be asynchronous and might complete after the event returned by urEnqueueMemImageCopy finishes. This commit fixes the issue by using cuMemcpy2DAsync to copy 1D images, setting the height to 1.
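A rough sketch of the approach described above: a 1D array-to-array copy expressed as a 2D copy of height 1 so it can be ordered on a stream. The function name and parameters here are illustrative, not the actual UR source.

```c
// Sketch only: requires the CUDA driver API and a GPU; names are hypothetical.
#include <cuda.h>

CUresult copy1DImageAsync(CUarray src, CUarray dst,
                          size_t srcOffsetBytes, size_t dstOffsetBytes,
                          size_t widthBytes, CUstream stream) {
  CUDA_MEMCPY2D cpy = {0};

  cpy.srcMemoryType = CU_MEMORYTYPE_ARRAY;
  cpy.srcArray = src;
  cpy.srcXInBytes = srcOffsetBytes;
  cpy.srcY = 0;

  cpy.dstMemoryType = CU_MEMORYTYPE_ARRAY;
  cpy.dstArray = dst;
  cpy.dstXInBytes = dstOffsetBytes;
  cpy.dstY = 0;

  cpy.WidthInBytes = widthBytes;
  cpy.Height = 1; /* 1D image expressed as a 2D copy of height 1 */

  /* Unlike cuMemcpyAtoA, this call is ordered with respect to `stream`,
     so an event recorded on the stream afterwards covers the copy. */
  return cuMemcpy2DAsync(&cpy, stream);
}
```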
cuMemcpyAtoA is a low-level (driver API) function that copies from one array to another array. cudaMemcpy2DArrayToArray is similar (but for 2D) and is part of the runtime API, which we don't use in UR. Both functions exhibit synchronous behaviour, but that doesn't mean they are always synchronous: if the copy happens between two memory regions on the device, it behaves asynchronously. More details in: https://docs.nvidia.com/cuda/cuda-driver-api/api-sync-behavior.html
So, the issue I'm trying to fix here is that cuMemcpyAtoA has asynchronous behaviour in some situations and, since there is no cuMemcpyAtoAAsync, it cannot be synchronized with the stream. The only solution that seems to work is to stop using that API and rely on cuMemcpy2DAsync.
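To make the hazard above concrete, here is a minimal sketch (assumed variable names, not the UR source) of the pattern the PR removes: because cuMemcpyAtoA has no stream parameter and no async variant, an event recorded on the queue's stream afterwards does not cover a device-to-device copy.

```c
// Sketch only: requires the CUDA driver API and a GPU; names are hypothetical.
#include <cuda.h>

void hazardSketch(CUarray src, CUarray dst, size_t bytes,
                  CUstream stream, CUevent event) {
  /* For device-to-device transfers this call may return before the copy
     completes, and there is no cuMemcpyAtoAAsync to tie it to `stream`. */
  cuMemcpyAtoA(dst, /*dstOffset=*/0, src, /*srcOffset=*/0, bytes);

  /* This event can therefore signal completion before the copy above
     has actually finished. */
  cuEventRecord(event, stream);
}
```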
Thank you for the explanation!
@oneapi-src/unified-runtime-cuda-write Would appreciate if someone could have a look at this PR |
Do you know if this is what … In short, are you sure this is the best solution?
These APIs are a bit confusing, but my understanding is that … For the entrypoint that this PR changes, the equivalent function in CUDA would be …
I don't think the previous behaviour is correct. I think that … In addition, even if implicit synchronization is allowed, I struggled to synchronize … So, I didn't run any benchmark because I couldn't find any alternative solution that behaves as expected, but I'm open to suggestions for other ways to fix this issue.
OK, I see, thanks. What is the corresponding test in test-e2e/cts for this API that requires this change to pass?
That's …
I see the test. OK it all makes sense to me. |
LGTM
E2E run: intel/llvm#11966