
[CUDA] Fix synchronization issue in urEnqueueMemImageCopy #1104

Merged
merged 4 commits into from
Jan 30, 2024

Conversation

fabiomestre
Contributor

@fabiomestre fabiomestre commented Nov 21, 2023

For 1D images, urEnqueueMemImageCopy was using cuMemcpyAtoA, which does not have an asynchronous version. This means that, when the memcpy happens between two arrays in device memory, the call behaves asynchronously with respect to the host and might complete after the event returned by urEnqueueMemImageCopy finishes.

This commit fixes the issue by using cuMemcpy2DAsync to copy 1D images, setting the height to 1.

E2E run: intel/llvm#11966
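The approach described above can be sketched roughly as follows. This is an illustrative sketch against the CUDA driver API, not the PR's actual code; the function and parameter names are mine.

```c
// Sketch: copy a 1-D CUDA array with cuMemcpy2DAsync by treating it as a
// 2-D copy of height 1, so the transfer is ordered on a stream.
// srcArray/dstArray/widthInBytes/hStream are assumed to exist in the caller;
// error handling is omitted for brevity.
#include <cuda.h>
#include <string.h>

CUresult copy1DImageAsync(CUarray srcArray, CUarray dstArray,
                          size_t widthInBytes, CUstream hStream) {
  CUDA_MEMCPY2D cpyDesc;
  memset(&cpyDesc, 0, sizeof(cpyDesc));

  cpyDesc.srcMemoryType = CU_MEMORYTYPE_ARRAY;
  cpyDesc.srcArray = srcArray;
  cpyDesc.dstMemoryType = CU_MEMORYTYPE_ARRAY;
  cpyDesc.dstArray = dstArray;
  cpyDesc.WidthInBytes = widthInBytes;
  cpyDesc.Height = 1; /* 1-D image expressed as a height-1 2-D copy */

  /* Unlike cuMemcpyAtoA, this call takes a stream, so events recorded
   * on hStream afterwards correctly wait for the copy to finish. */
  return cuMemcpy2DAsync(&cpyDesc, hStream);
}
```

This requires a CUDA-capable device and driver to run, so it is shown here only to make the height-1 trick concrete.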

@jinz2014

What is cuMemcpyAtoA?

Learning from the doc, cudaMemcpy2DArrayToArray is synchronous, isn't it?

Do you mean that synchronous data transfers between devices could complete after the event returned by urEnqueueMemImageCopy finishes, right?

@codecov-commenter

codecov-commenter commented Dec 4, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (9b97a5f) 15.46% compared to head (ffe9a51) 15.46%.
Report is 2 commits behind head on main.


Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1104      +/-   ##
==========================================
- Coverage   15.46%   15.46%   -0.01%     
==========================================
  Files         238      238              
  Lines       33883    33883              
  Branches     3747     3747              
==========================================
- Hits         5240     5239       -1     
  Misses      28593    28593              
- Partials       50       51       +1     


@fabiomestre
Contributor Author

fabiomestre commented Dec 4, 2023

> What is cuMemcpyAtoA?
>
> Learning from the doc, cudaMemcpy2DArrayToArray is synchronous, isn't it?
>
> Do you mean that synchronous data transfers between devices could complete after the event returned by urEnqueueMemImageCopy finishes, right?

cuMemcpyAtoA is a low-level API (driver API) that copies from one CUDA array to another. cudaMemcpy2DArrayToArray is similar (but for 2D) and is part of the runtime API, which we don't use in UR.

Both functions exhibit synchronous behaviour, but that doesn't mean they are always synchronous. If the copy happens between two memory regions on the device, it behaves asynchronously. More details in: https://docs.nvidia.com/cuda/cuda-driver-api/api-sync-behavior.html

So, the issue I'm trying to fix here is that cuMemcpyAtoA behaves asynchronously in some situations and, since there is no cuMemcpyAtoAAsync, it cannot be synchronized with the stream. The only solution that seems to work is to stop using that API and rely on cuMemcpy2DAsync instead.
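The ordering problem being described can be illustrated with a hypothetical sketch (not code from UR or this PR; names are illustrative):

```c
// Illustration of why cuMemcpyAtoA cannot be synchronized via a stream:
// it takes no stream argument, so a device-to-device copy it starts is
// not ordered against events recorded on any stream.
#include <cuda.h>

void brokenOrdering(CUarray src, CUarray dst, size_t bytes,
                    CUstream stream, CUevent event) {
  /* Device-to-device array copy: returns to the host while the copy
   * may still be in flight, and it is not tied to `stream`. */
  cuMemcpyAtoA(dst, 0, src, 0, bytes);

  /* The event is recorded on `stream`, which knows nothing about the
   * copy above, so it can signal before the copy completes. */
  cuEventRecord(event, stream);

  /* Waiting here does NOT guarantee the copy has finished -- this is
   * the intermittent failure mode the PR fixes. */
  cuEventSynchronize(event);
}
```

This is why the fix switches to cuMemcpy2DAsync, which takes the stream explicitly and is therefore ordered with the events recorded on it.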

@fabiomestre fabiomestre marked this pull request as ready for review December 4, 2023 17:48
@fabiomestre fabiomestre requested a review from a team as a code owner December 4, 2023 17:48
@jinz2014

jinz2014 commented Dec 4, 2023

Thank you for the explanation !

@fabiomestre fabiomestre changed the base branch from adapters to main December 5, 2023 16:26
@fabiomestre
Contributor Author

@oneapi-src/unified-runtime-cuda-write I'd appreciate it if someone could have a look at this PR.

@JackAKirk
Contributor

JackAKirk commented Dec 7, 2023

Do you know if this is what cudaMemcpyFromArrayAsync does? It seems to be the corresponding CUDA runtime API, i.e. does it have similar behaviour to cuMemcpy2DAsync?

Did you benchmark the effect that this change in implementation has on previously correct behaviour, as a function of e.g. array size? For example, what is the performance comparison between cuMemcpy2DAsync followed by a sync, and the synchronous version of cuMemcpyAtoA (between device and host)?

In short, are you sure this is the best solution?

@fabiomestre
Contributor Author

fabiomestre commented Dec 7, 2023

> Do you know if this is what cudaMemcpyFromArrayAsync does? It seems to be the corresponding CUDA runtime API, i.e. does it have similar behaviour to cuMemcpy2DAsync?

These APIs are a bit confusing, but my understanding is that cudaMemcpyFromArrayAsync is used to copy memory from an array handle (in device memory). The UR equivalent of this function would be urEnqueueMemImageRead().

For the entry point that this PR changes, the equivalent function in CUDA would be cudaMemcpyArrayToArray(), which doesn't have an Async version.

> Did you benchmark the effect that this change in implementation has on previously correct behaviour, as a function of e.g. array size? For example, what is the performance comparison between cuMemcpy2DAsync followed by a sync, and the synchronous version of cuMemcpyAtoA (between device and host)?
>
> In short, are you sure this is the best solution?

I don't think the previous behaviour is correct. I think that urEnqueueMemImageCopy() is expected to behave asynchronously: it takes a queue that can be synchronized on later. The previous behaviour was doing implicit synchronization, which doesn't sound right to me.

In addition, even if implicit synchronization is allowed, I struggled to synchronize cuMemcpyAtoA(). I tried synchronizing the null stream and a few other things, but the CTS test still failed.

So, I didn't run any benchmarks because I couldn't find any alternative solution that behaves as expected. But I'm open to suggestions for other solutions to this issue.
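For context, the asynchronous usage model described above might look like this from a UR client's point of view. This is a hedged sketch under the assumption that the standard UR entry points behave as in the headers; handle creation, region setup, and error checking are omitted.

```c
// Sketch: enqueue an image-to-image copy and synchronize on the returned
// event. With the cuMemcpy2DAsync-based implementation, waiting on the
// event reliably covers the device-to-device copy.
#include <stddef.h>
#include <ur_api.h>

void copyAndWait(ur_queue_handle_t queue, ur_mem_handle_t srcImage,
                 ur_mem_handle_t dstImage, ur_rect_region_t region) {
  ur_rect_offset_t origin = {0, 0, 0};
  ur_event_handle_t event = NULL;

  /* The copy is enqueued asynchronously on the queue's stream. */
  urEnqueueMemImageCopy(queue, srcImage, dstImage, origin, origin, region,
                        0, NULL, &event);

  /* Explicit synchronization point chosen by the caller, rather than an
   * implicit sync hidden inside the enqueue call. */
  urEventWait(1, &event);
}
```

This requires a UR adapter and device at runtime, so it is shown only to make the queue/event contract concrete.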

@JackAKirk
Contributor

JackAKirk commented Dec 7, 2023

> > Do you know if this is what cudaMemcpyFromArrayAsync does? It seems to be the corresponding CUDA runtime API, i.e. does it have similar behaviour to cuMemcpy2DAsync?
>
> These APIs are a bit confusing, but my understanding is that cudaMemcpyFromArrayAsync is used to copy memory from an array handle (in device memory). The UR equivalent of this function would be urEnqueueMemImageRead().
>
> For the entry point that this PR changes, the equivalent function in CUDA would be cudaMemcpyArrayToArray(), which doesn't have an Async version.
>
> > Did you benchmark the effect that this change in implementation has on previously correct behaviour, as a function of e.g. array size? For example, what is the performance comparison between cuMemcpy2DAsync followed by a sync, and the synchronous version of cuMemcpyAtoA (between device and host)? In short, are you sure this is the best solution?
>
> I don't think the previous behaviour is correct. I think that urEnqueueMemImageCopy() is expected to behave asynchronously: it takes a queue that can be synchronized on later. The previous behaviour was doing implicit synchronization, which doesn't sound right to me.
>
> In addition, even if implicit synchronization is allowed, I struggled to synchronize cuMemcpyAtoA(). I tried synchronizing the null stream and a few other things, but the CTS test still failed.
>
> So, I didn't run any benchmarks because I couldn't find any alternative solution that behaves as expected. But I'm open to suggestions for other solutions to this issue.

OK, I see, thanks. What is the corresponding test in test-e2e/cts for this API that requires this change to pass?

@fabiomestre
Contributor Author

> OK, I see, thanks. What is the corresponding test in test-e2e/cts for this API that requires this change to pass?

That's urEnqueueMemImageCopyTest:Success. It fails intermittently on my GPU (GT 1030).

@JackAKirk
Contributor

> > OK, I see, thanks. What is the corresponding test in test-e2e/cts for this API that requires this change to pass?
>
> That's urEnqueueMemImageCopyTest:Success. It fails intermittently on my GPU (GT 1030).

I see the test. OK it all makes sense to me.

@JackAKirk
Contributor

JackAKirk left a comment:

LGTM

@fabiomestre fabiomestre added and then removed the ready to merge label Jan 4, 2024
@fabiomestre fabiomestre added the ready to merge label Jan 5, 2024
@kbenzie kbenzie merged commit edb281f into oneapi-src:main Jan 30, 2024
51 checks passed