oneAPI 2023.0 supports CUDA devices using the "oneAPI for NVIDIA GPUs 2023.0" plugin. I am starting this exploratory discussion to evaluate the requirements and scope of work to support CUDA in dpctl via the oneAPI plugin. Here are the findings from my initial exploration:

System information:

OS: Ubuntu 22.04 Jammy

Initial setup steps:

a) Installed oneAPI following the installation guide.
NOTE: Watch out for installation issues on Ubuntu 22.04 (cstddef.h not found etc.) to work around do
b) I already had CUDA set up, having followed the CUDA guide to install it on my OS.
c) Downloaded the oneAPI for NVIDIA GPUs plugin and followed the installation guide.
NOTE: If you have multiple types of devices on the system (I have an OpenCL GPU driver and an L0 GPU driver for a gen9 integrated GPU, an OpenCL CPU driver for a gen9 CPU, and CUDA), you can compile the

Building dpctl with CUDA

a) Build dpctl with the customized oneAPI. The process for me was just to run
NOTE: be sure to remove the

Testing the install

a) After building and installing dpctl using the

>>> import dpctl
>>> dpctl.lsplatform()
Intel(R) FPGA Emulation Platform for OpenCL(TM) OpenCL 1.2 Intel(R) FPGA SDK for OpenCL(TM), Version 20.3
Intel(R) OpenCL OpenCL 3.0 LINUX
Intel(R) OpenCL HD Graphics OpenCL 3.0
Intel(R) Level-Zero 1.3
NVIDIA CUDA BACKEND CUDA 11.4

So far so good, the CUDA GPU is detected as expected.

b) Creating a queue on the CUDA device:

>>> q = dpctl.SyclQueue("cuda")
>>> q.sycl_device
<dpctl.SyclDevice [backend_type.cuda, device_type.gpu, NVIDIA GeForce GTX 1660 Ti] at 0x7f9e9287bf70>
>>> q.sycl_device.print_device_info()
Name NVIDIA GeForce GTX 1660 Ti
Driver version CUDA 11.4
Vendor NVIDIA Corporation
Filter string cuda:gpu:0

c) Try a basic tensor creation:

>>> import dpctl.tensor as dpt
>>> a = dpt.empty(10, device="cuda")
>>> a.sycl_device
<dpctl.SyclDevice [backend_type.cuda, device_type.gpu, NVIDIA GeForce GTX 1660 Ti] at 0x7f9e92fc5e70>
>>> print(a)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
>>> a
usm_ndarray([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
>>> a.usm_type
'device'

Initial thoughts

I got much farther than I had hoped. The plugin seamlessly exposed the CUDA device, queue creation works, and even memory allocation seems to have succeeded. The next step will be to test some basic operations on the tensor. @oleksandr-pavlyk can you suggest something? I doubt, though, that it will work out of the box. I think we will need to build dpctl with
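The interactive checks above (device discovery, queue creation, allocation) can be gathered into one small script. This is only a sketch: the function name `cuda_smoke_test` is made up for illustration, it assumes `dpctl.get_devices` accepts the string arguments `backend="cuda"` and `device_type="gpu"`, and it returns an empty list when dpctl is not installed or no CUDA device is visible.

```python
try:
    import dpctl
    import dpctl.tensor as dpt
except ImportError:  # dpctl is not installed in this environment
    dpctl = None


def cuda_smoke_test():
    """Return (filter_string, name, usm_type) for every visible CUDA GPU.

    Hypothetical helper: it only repeats the manual steps from the post,
    it is not part of dpctl.
    """
    if dpctl is None:
        return []
    results = []
    # Assumption: string values are accepted for backend and device_type.
    for dev in dpctl.get_devices(backend="cuda", device_type="gpu"):
        q = dpctl.SyclQueue(dev)         # queue bound to this device
        x = dpt.empty(10, sycl_queue=q)  # allocation only, no kernel launch
        results.append((dev.filter_string, dev.name, x.usm_type))
    return results


for entry in cuda_smoke_test():
    print(entry)
```

Note that `dpt.empty` only allocates memory, so this says nothing about whether kernels compiled for the CUDA target are available.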
@diptorupd Try For
@oleksandr-pavlyk As expected, running a kernel with a default compiled dpctl as-is will not work:

>>> a = dpt.arange(30, device=dev); b = dpt.roll(dpt.concat((dpt.ones(15, dtype=dpt.bool, device=dev), dpt.zeros(15, dtype=dpt.bool, device=dev))), 8); c = a[b]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/diptorupd/Desktop/devel/dpctl/dpctl/tensor/_ctors.py", line 642, in arange
hev, _ = ti._linspace_step(_start, _step, res, sycl_queue)
RuntimeError: Native API failed. Native API returns: -42 (PI_ERROR_INVALID_BINARY)
-42 (PI_ERROR_INVALID_BINARY)

However, after the following small patch

Author: Diptorup Deb <diptorup.deb@intel.com> 2023-03-14 23:47:18
Committer: Diptorup Deb <diptorup.deb@intel.com> 2023-03-14 23:47:18
Parent: 8f828f24ada9829ed4d9d5dc56e6d7f39dd9ac3c (Merge pull request #1118 from IntelPython/fix-build-break)
Branch: demo/cuda-support
Follows: 0.14.2
Precedes:
Compile with cuda support
----------------------------- dpctl/CMakeLists.txt -----------------------------
index 6ccca33dd..f8c08f105 100644
@@ -58,6 +58,7 @@ elseif(UNIX)
"${WARNING_FLAGS}"
"${SDL_FLAGS}"
"-fsycl "
+ "-fsycl-targets=nvptx64-nvidia-cuda,spir64-unknown-unknown "
)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -O3 ${CFLAGS}")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 ${CXXFLAGS}")
--------------------------- scripts/build_locally.py ---------------------------
index ff34c9d18..9c689ead9 100644
@@ -145,7 +145,7 @@ if __name__ == "__main__":
and args.compiler_root is None
):
args.c_compiler = "icx"
- args.cxx_compiler = "icpx" if "linux" in sys.platform else "icx"
+ args.cxx_compiler = "clang++" if "linux" in sys.platform else "icx"
args.compiler_root = None
else:
cr = args.compiler_root
@@ -153,7 +153,9 @@ if __name__ == "__main__":
if args.c_compiler is None:
args.c_compiler = "icx"
if args.cxx_compiler is None:
- args.cxx_compiler = "icpx" if "linux" in sys.platform else "icx"
+ args.cxx_compiler = (
+ "clang++" if "linux" in sys.platform else "icx"
+ )
else:
raise RuntimeError(
"Option 'compiler-root' must be provided when " There were a few warnings of the kind: >>> import dpctl
>>> import dpctl.tensor as dpt
>>> dev = "cuda"
>>> a = dpt.arange(30, device=dev)
>>> a
usm_ndarray([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
>>> print(a)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29]
>>> a.sycl_device
<dpctl.SyclDevice [backend_type.cuda, device_type.gpu, NVIDIA GeForce GTX 1660 Ti] at 0x7f2e89f698b0>
>>> b = dpt.roll(dpt.concat((dpt.ones(15, dtype=dpt.bool, device=dev), dpt.zeros(15, dtype=dpt.bool, device=dev))), 8); c = a[b]
>>> c
usm_ndarray([ 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22])
>>> c.sycl_device
<dpctl.SyclDevice [backend_type.cuda, device_type.gpu, NVIDIA GeForce GTX 1660 Ti] at 0x7f2e89f698b0>
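To cross-check the result of the masked selection, here is a host-side reference in plain Python. It is a sketch with a made-up helper name (`expected_masked_selection`). It builds the same mask (15 `True` values followed by 15 `False` values), rolls it right by 8 positions, and selects the matching entries of `range(30)`.

```python
def expected_masked_selection(n=30, n_true=15, shift=8):
    """Host-side reference for a[b], where b is a rolled boolean mask."""
    mask = [True] * n_true + [False] * (n - n_true)
    shift %= n
    # Rolling right by `shift` moves the last `shift` entries to the front.
    rolled = (mask[-shift:] + mask[:-shift]) if shift else mask
    return [i for i, keep in zip(range(n), rolled) if keep]


print(expected_masked_selection())
# [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
```

The values agree with the `usm_ndarray` printed for `c` on the CUDA device.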
With #1411, one can build

$ DPCTL_TARGET_CUDA=1 python scripts/build_locally.py --verbose

This creates a fat binary with SPV and PTX offload sections. The test suite passes using the CUDA backend. Since the GPU at my disposal is weak (GT 1030), I must run each test file individually:

$ find dpctl/tests/ -name "test_*.py" | ONEAPI_DEVICE_SELECTOR=cuda:gpu xargs -n 1 bash -c 'python -m pytest $0 --durations=3 || exit 255'

With a beefier GPU, running the test suite works out of the box:

$ ONEAPI_DEVICE_SELECTOR=cuda:gpu pytest --pyargs dpctl
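The per-file test run can also be driven from Python, which makes it explicit that `ONEAPI_DEVICE_SELECTOR` is passed to every pytest process. This is a sketch under assumptions: the helper name `run_tests_one_file_at_a_time` and its `runner` parameter are hypothetical, and stopping at the first failing file mirrors the `|| exit 255` in the shell command.

```python
import os
import subprocess
import sys
from pathlib import Path


def run_tests_one_file_at_a_time(test_dir, selector="cuda:gpu",
                                 runner=subprocess.call):
    """Run pytest once per test file; return the first failing file or None.

    `runner` is injectable so the loop can be exercised without pytest.
    """
    # Copy the current environment and restrict SYCL to the chosen device.
    env = dict(os.environ, ONEAPI_DEVICE_SELECTOR=selector)
    for test_file in sorted(Path(test_dir).rglob("test_*.py")):
        cmd = [sys.executable, "-m", "pytest", str(test_file), "--durations=3"]
        if runner(cmd, env=env) != 0:
            return test_file  # stop at the first failure
    return None
```

For example, `run_tests_one_file_at_a_time("dpctl/tests")` returns `None` when every file passes. Running each file in a fresh process means device memory is released between files, which may be why this approach copes better with a small GPU.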