-
Notifications
You must be signed in to change notification settings - Fork 158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CUDA][HIP] Use device to get native context #425
Conversation
8243065
to
9669252
Compare
reading your changes, I have a question. For example, auto cudaDevice = sycl::get_nativesycl::backend::ext_oneapi_cuda(queue.get_device()); Is the type of cudaDevice "CUdevice" ? |
Hi @jinz2014 yes you are correct! |
cufft_run.txt |
396cd30
to
0ea3b84
Compare
AMD tests for lapack and blas all passing: 8 lapack nvidia test failing on GTX1050 but these tests are also failing on Nvidia blas tests passing |
I see all the buffer tests failing for the Logs: The failures are not because of the changes in this PR, but rather a recent change in the compiler. All these tests are expected to pass once oneapi-src/unified-runtime#1226 and intel/llvm#12297 are merged. |
c8a758d
to
878b981
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you be able to attach test logs again? Also does this change compile with the 2024.0 icpx release?
// Getting the primary context also sets it as the active context | ||
CUDA_ERROR_FUNC(cuDevicePrimaryCtxRetain, err, &desired, cudaDevice); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we expect a performance cost from this change? From what I understand cuCtxSetCurrent
was expected to be called only once before, assuming the context active was not changed outside of oneMKL.
This constructor is called once before each calls to blas functions so I am wary that the cost may add up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cost of cuDevicePrimaryContextRetain
is minimal once the primary context is not being initialized for the first time, which it should not be here. A simple benchmark like this:
for (int i = 0; i < NUM_ITERATIONS; i++) {
CHECK(cuDevicePrimaryCtxRetain(&context, device));
CHECK(cuDevicePrimaryCtxRelease(device));
}
Gives 32ns per loop, so calls to these funcs are almost free.
Using setup:
$ nvidia-smi
Mon Mar 25 16:27:32 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1050 Ti On | 00000000:01:00.0 Off | N/A |
| 31% 24C P8 N/A / 75W | 14MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM. I've tested the rocFFT and cuFFT backends with DPC++ 2024.0's icpx.
It would be good to see logs again after the rebase as Rbiessy suggests.
SYCL contexts have a many to one mapping to native contexts. Therefore it is necessary to get the desired native context from a SYCL device, as SYCL devices have a one to one mapping to native contexts.
878b981
to
8d14c8a
Compare
Some test results: CUDAgtx1050.txt HIPgfx90a_oneMKL_test.txt I am not sure how to build/run the FFT tests. Are there some build/test instructions that I can follow? |
Fixed. I can successfully build this branch with icpx 2024.0.2 for CUDA |
Thanks a lot @hdelan ! The instructions are here but need to be improved. The short answer is that you should just need to add |
Thanks @Rbiessy ! Building rocFFT is broken for me but this PR does not touch that code. Building with cuFFT is OK. Here is updated tests for all oneMKL for CUDA including cuBLAS, cuFFT, cuRAND, cuSOLVER: |
Thanks! LGTM |
Since oneapi-src/unified-runtime#999 it is no longer valid to get the native context from the SYCL context on a multi GPU system. The get native func for contexts has been deprecated for this reason. See intel/llvm#10975
Similar ticket: oneapi-src/oneDNN#1765