Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CUDA][HIP] Use device to get native context #425

Merged
merged 1 commit into from
Apr 1, 2024

Conversation

hdelan
Copy link
Contributor

@hdelan hdelan commented Dec 7, 2023

Since oneapi-src/unified-runtime#999 it is no longer valid to get the native context from the SYCL context on a multi GPU system. The get native func for contexts has been deprecated for this reason. See intel/llvm#10975

Similar ticket: oneapi-src/oneDNN#1765

@jinz2014
Copy link
Contributor

reading your changes, I have a question.

For example,

auto cudaDevice = sycl::get_nativesycl::backend::ext_oneapi_cuda(queue.get_device());

Is the type of cudaDevice "CUdevice" ?

@hdelan
Copy link
Contributor Author

hdelan commented Dec 11, 2023

Hi @jinz2014 yes you are correct!

@FMarno
Copy link
Contributor

FMarno commented Dec 11, 2023

cufft_run.txt
All the DFT changes look good to me and I've run the DFT tests successfully.
I'd like to see test logs for the other backends before I approve.

@hdelan
Copy link
Contributor Author

hdelan commented Dec 22, 2023

AMD tests for lapack and blas all passing:
test_amd.txt

8 lapack nvidia test failing on GTX1050 but these tests are also failing on develop branch:
test_cuda_lapack.txt

Nvidia blas tests passing
test_cuda_blas.txt

@muhammad-tanvir-1211
Copy link
Contributor

muhammad-tanvir-1211 commented Jan 12, 2024

I see all the buffer tests failing for the rocblas backend with PI_ERROR_INVALID_OPERATION.

Logs:
PR_425.txt

The failures are not because of the changes in this PR, but rather a recent change in the compiler. All these tests are expected to pass once oneapi-src/unified-runtime#1226 and intel/llvm#12297 are merged.

Copy link
Contributor

@Rbiessy Rbiessy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you be able to attach test logs again? Also does this change compile with the 2024.0 icpx release?

// Getting the primary context also sets it as the active context
CUDA_ERROR_FUNC(cuDevicePrimaryCtxRetain, err, &desired, cudaDevice);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we expect a performance cost from this change? From what I understand cuCtxSetCurrent was expected to be called only once before, assuming the context active was not changed outside of oneMKL.
This constructor is called once before each calls to blas functions so I am wary that the cost may add up.

Copy link
Contributor Author

@hdelan hdelan Mar 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cost of cuDevicePrimaryContextRetain is minimal once the primary context is not being initialized for the first time, which it should not be here. A simple benchmark like this:

for (int i = 0; i < NUM_ITERATIONS; i++) {
    CHECK(cuDevicePrimaryCtxRetain(&context, device));
    CHECK(cuDevicePrimaryCtxRelease(device));
}

Gives 32ns per loop, so calls to these funcs are almost free.

Using setup:

$ nvidia-smi
Mon Mar 25 16:27:32 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1050 Ti     On  | 00000000:01:00.0 Off |                  N/A |
| 31%   24C    P8              N/A /  75W |     14MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         

Copy link
Contributor

@hjabird hjabird left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM. I've tested the rocFFT and cuFFT backends with DPC++ 2024.0's icpx.
It would be good to see logs again after the rebase as Rbiessy suggests.

SYCL contexts have a many to one mapping to native contexts. Therefore
it is necessary to get the desired native context from a SYCL device, as
SYCL devices have a one to one mapping to native contexts.
@hdelan
Copy link
Contributor Author

hdelan commented Mar 26, 2024

Some test results:

CUDA

gtx1050.txt
Some failures due to precision also present on develop branch.

HIP

gfx90a_oneMKL_test.txt
Test failures in HIP are also present on the develop branch:
gfx90a_oneMKL_test_develop_branch.txt

I am not sure how to build/run the FFT tests. Are there some build/test instructions that I can follow?

@hdelan
Copy link
Contributor Author

hdelan commented Mar 26, 2024

In terms of building with icpx 2024.0.0 for CUDA. I am getting a segfault at linking with develop branch.

Fixed. LD_LIBRARY_PATH problems -_-

I can successfully build this branch with icpx 2024.0.2 for CUDA

@Rbiessy
Copy link
Contributor

Rbiessy commented Mar 27, 2024

Thanks a lot @hdelan ! The instructions are here but need to be improved.

The short answer is that you should just need to add -DENABLE_CUFFT_BACKEND=True -DENABLE_ROCFFT_BACKEND=True to also test the DFT domain with the native CUDA and HIP backends.
If you are explicitly setting -DTARGET_DOMAINS in your CMake command you will also need to append dft to the list, otherwise it will be enabled by default.
If you don't want to build and test the other domains again you can use -DTARGET_DOMAINS=dft.

@hdelan
Copy link
Contributor Author

hdelan commented Mar 27, 2024

Thanks @Rbiessy !

Building rocFFT is broken for me but this PR does not touch that code. Building with cuFFT is OK. Here is updated tests for all oneMKL for CUDA including cuBLAS, cuFFT, cuRAND, cuSOLVER:

gtx1050.txt

@ericlars
Copy link
Contributor

Thanks! LGTM

@Rbiessy
Copy link
Contributor

Rbiessy commented Mar 28, 2024

Thanks for the review. Let me know @lhuot or @mmeterel if you need more time, otherwise I will go ahead and merge this on Monday.

@Rbiessy Rbiessy merged commit 4635cad into oneapi-src:develop Apr 1, 2024
normallytangent pushed a commit to normallytangent/oneMKL that referenced this pull request Aug 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants