
[CUDA] Fix synchronization issue in urEnqueueMemImageCopy #1104

Merged
merged 4 commits into from
Jan 30, 2024

Conversation

fabiomestre
Contributor

@fabiomestre fabiomestre commented Nov 21, 2023

For 1D images, urEnqueueMemImageCopy was using cuMemcpyAtoA, which does not have an asynchronous version. This means that, when the memcpy happens between two arrays in device memory, the call behaves asynchronously with respect to the host and might complete after the event returned by urEnqueueMemImageCopy finishes.

This commit fixes the issue by using cuMemcpy2DAsync to copy 1D images, setting the height to 1.

E2E run: intel/llvm#11966
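The approach described above can be sketched roughly as follows. This is an illustrative sketch against the CUDA driver API, not the PR's actual code; the function and parameter names are mine.

```c
// Sketch: copy a 1-D CUDA array with cuMemcpy2DAsync by treating it as a
// 2-D copy of height 1, so the transfer is ordered on a stream.
// srcArray/dstArray/widthInBytes/hStream are assumed to exist in the caller;
// error handling is omitted for brevity.
#include <cuda.h>
#include <string.h>

CUresult copy1DImageAsync(CUarray srcArray, CUarray dstArray,
                          size_t widthInBytes, CUstream hStream) {
  CUDA_MEMCPY2D cpyDesc;
  memset(&cpyDesc, 0, sizeof(cpyDesc));

  cpyDesc.srcMemoryType = CU_MEMORYTYPE_ARRAY;
  cpyDesc.srcArray = srcArray;
  cpyDesc.dstMemoryType = CU_MEMORYTYPE_ARRAY;
  cpyDesc.dstArray = dstArray;
  cpyDesc.WidthInBytes = widthInBytes;
  cpyDesc.Height = 1; /* 1-D image expressed as a height-1 2-D copy */

  /* Unlike cuMemcpyAtoA, this call takes a stream, so events recorded
   * on hStream afterwards correctly wait for the copy to finish. */
  return cuMemcpy2DAsync(&cpyDesc, hStream);
}
```

This requires a CUDA-capable device and driver to run, so it is shown here only to make the height-1 trick concrete.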

@jinz2014

What is cuMemcpyAtoA?

Learning from the doc, cudaMemcpy2DArrayToArray is synchronous, isn't it?

Do you mean that synchronous data transfers between devices could complete after the event returned by urEnqueueMemImageCopy finishes, right?

@codecov-commenter

codecov-commenter commented Dec 4, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (9b97a5f) 15.46% compared to head (ffe9a51) 15.46%.
Report is 2 commits behind head on main.


Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1104      +/-   ##
==========================================
- Coverage   15.46%   15.46%   -0.01%     
==========================================
  Files         238      238              
  Lines       33883    33883              
  Branches     3747     3747              
==========================================
- Hits         5240     5239       -1     
  Misses      28593    28593              
- Partials       50       51       +1     


@fabiomestre
Contributor Author

fabiomestre commented Dec 4, 2023

> What is cuMemcpyAtoA?
>
> Learning from the doc, cudaMemcpy2DArrayToArray is synchronous, isn't it?
>
> Do you mean that synchronous data transfers between devices could complete after the event returned by urEnqueueMemImageCopy finishes, right?

cuMemcpyAtoA is a low-level API (driver API) that copies from one CUDA array to another. cudaMemcpy2DArrayToArray is similar (but for 2D) and is part of the runtime API, which we don't use in UR.

Both functions exhibit synchronous behaviour, but that doesn't mean they are always synchronous. If the copy happens between two memory regions on the device, it behaves asynchronously. More details in: https://docs.nvidia.com/cuda/cuda-driver-api/api-sync-behavior.html

So, the issue I'm trying to fix here is that cuMemcpyAtoA behaves asynchronously in some situations and, since there is no cuMemcpyAtoAAsync, it cannot be synchronized with the stream. The only solution that seems to work is to stop using that API and rely on cuMemcpy2DAsync instead.
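The ordering problem being described can be illustrated with a hypothetical sketch (not code from UR or this PR; names are illustrative):

```c
// Illustration of why cuMemcpyAtoA cannot be synchronized via a stream:
// it takes no stream argument, so a device-to-device copy it starts is
// not ordered against events recorded on any stream.
#include <cuda.h>

void brokenOrdering(CUarray src, CUarray dst, size_t bytes,
                    CUstream stream, CUevent event) {
  /* Device-to-device array copy: returns to the host while the copy
   * may still be in flight, and it is not tied to `stream`. */
  cuMemcpyAtoA(dst, 0, src, 0, bytes);

  /* The event is recorded on `stream`, which knows nothing about the
   * copy above, so it can signal before the copy completes. */
  cuEventRecord(event, stream);

  /* Waiting here does NOT guarantee the copy has finished -- this is
   * the intermittent failure mode the PR fixes. */
  cuEventSynchronize(event);
}
```

This is why the fix switches to cuMemcpy2DAsync, which takes the stream explicitly and is therefore ordered with the events recorded on it.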

@fabiomestre fabiomestre marked this pull request as ready for review December 4, 2023 17:48
@fabiomestre fabiomestre requested a review from a team as a code owner December 4, 2023 17:48
@jinz2014

jinz2014 commented Dec 4, 2023

Thank you for the explanation !

@fabiomestre fabiomestre changed the base branch from adapters to main December 5, 2023 16:26
@fabiomestre
Contributor Author

@oneapi-src/unified-runtime-cuda-write I'd appreciate it if someone could have a look at this PR.

@JackAKirk
Contributor

JackAKirk commented Dec 7, 2023

Do you know if this is what cudaMemcpyFromArrayAsync does? It seems to be the corresponding CUDA runtime API, i.e. does it have similar behaviour to cuMemcpy2DAsync?

Did you benchmark the effect that this change in implementation has on previously correct behaviour, as a function of e.g. array size? For example, what is the performance comparison between cuMemcpy2DAsync followed by a sync, and the synchronous version of cuMemcpyAtoA (between device and host)?

In short, are you sure this is the best solution?

@fabiomestre
Contributor Author

fabiomestre commented Dec 7, 2023

> Do you know if this is what cudaMemcpyFromArrayAsync does? It seems to be the corresponding CUDA runtime API, i.e. does it have similar behaviour to cuMemcpy2DAsync?

These APIs are a bit confusing, but my understanding is that cudaMemcpyFromArrayAsync is used to copy memory from an array handle (in device memory). The UR equivalent of this function would be urEnqueueMemImageRead().

For the entry point that this PR changes, the equivalent function in CUDA would be cudaMemcpyArrayToArray(), which doesn't have an Async version.

> Did you benchmark the effect that this change in implementation has on previously correct behaviour, as a function of e.g. array size? For example, what is the performance comparison between cuMemcpy2DAsync followed by a sync, and the synchronous version of cuMemcpyAtoA (between device and host)?
>
> In short, are you sure this is the best solution?

I don't think the previous behaviour is correct. I think that urEnqueueMemImageCopy() is expected to behave asynchronously: it takes a queue that can be synchronized on later. The previous behaviour was doing implicit synchronization, which doesn't sound right to me.

In addition, even if implicit synchronization is allowed, I struggled to synchronize cuMemcpyAtoA(). I tried synchronizing the null stream and a few other things, but the CTS test still failed.

So, I didn't run any benchmarks because I couldn't find any alternative solution that behaves as expected. But I'm open to suggestions for other solutions to this issue.
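For context, the asynchronous usage model described above might look like this from a UR client's point of view. This is a hedged sketch under the assumption that the standard UR entry points behave as in the headers; handle creation, region setup, and error checking are omitted.

```c
// Sketch: enqueue an image-to-image copy and synchronize on the returned
// event. With the cuMemcpy2DAsync-based implementation, waiting on the
// event reliably covers the device-to-device copy.
#include <stddef.h>
#include <ur_api.h>

void copyAndWait(ur_queue_handle_t queue, ur_mem_handle_t srcImage,
                 ur_mem_handle_t dstImage, ur_rect_region_t region) {
  ur_rect_offset_t origin = {0, 0, 0};
  ur_event_handle_t event = NULL;

  /* The copy is enqueued asynchronously on the queue's stream. */
  urEnqueueMemImageCopy(queue, srcImage, dstImage, origin, origin, region,
                        0, NULL, &event);

  /* Explicit synchronization point chosen by the caller, rather than an
   * implicit sync hidden inside the enqueue call. */
  urEventWait(1, &event);
}
```

This requires a UR adapter and device at runtime, so it is shown only to make the queue/event contract concrete.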

@JackAKirk
Contributor

JackAKirk commented Dec 7, 2023

> > Do you know if this is what cudaMemcpyFromArrayAsync does? It seems to be the corresponding CUDA runtime API, i.e. does it have similar behaviour to cuMemcpy2DAsync?
>
> These APIs are a bit confusing, but my understanding is that cudaMemcpyFromArrayAsync is used to copy memory from an array handle (in device memory). The UR equivalent of this function would be urEnqueueMemImageRead().
>
> For the entry point that this PR changes, the equivalent function in CUDA would be cudaMemcpyArrayToArray(), which doesn't have an Async version.
>
> > Did you benchmark the effect that this change in implementation has on previously correct behaviour, as a function of e.g. array size? For example, what is the performance comparison between cuMemcpy2DAsync followed by a sync, and the synchronous version of cuMemcpyAtoA (between device and host)? In short, are you sure this is the best solution?
>
> I don't think the previous behaviour is correct. I think that urEnqueueMemImageCopy() is expected to behave asynchronously: it takes a queue that can be synchronized on later. The previous behaviour was doing implicit synchronization, which doesn't sound right to me.
>
> In addition, even if implicit synchronization is allowed, I struggled to synchronize cuMemcpyAtoA(). I tried synchronizing the null stream and a few other things, but the CTS test still failed.
>
> So, I didn't run any benchmarks because I couldn't find any alternative solution that behaves as expected. But I'm open to suggestions for other solutions to this issue.

OK, I see, thanks. What is the corresponding test in test-e2e/cts for this API that requires this change to pass?

@fabiomestre
Contributor Author

> OK, I see, thanks. What is the corresponding test in test-e2e/cts for this API that requires this change to pass?

That's urEnqueueMemImageCopyTest:Success. It fails intermittently on my GPU (GT 1030).

@JackAKirk
Contributor

> > OK, I see, thanks. What is the corresponding test in test-e2e/cts for this API that requires this change to pass?
>
> That's urEnqueueMemImageCopyTest:Success. It fails intermittently on my GPU (GT 1030).

I see the test. OK it all makes sense to me.

@JackAKirk
Contributor

JackAKirk left a comment:

LGTM

@fabiomestre fabiomestre added and then removed the ready to merge label Jan 4, 2024
@fabiomestre fabiomestre added the ready to merge label Jan 5, 2024
@kbenzie kbenzie merged commit edb281f into oneapi-src:main Jan 30, 2024
51 checks passed