Merge pull request #815 from hdelan/add-cuda-hip-usage-guides

Add initial CUDA and HIP usage guides

3 changed files with 256 additions and 0 deletions.

<%
OneApi=tags['$OneApi']
x=tags['$x']
X=x.upper()
%>

==========================
CUDA UR Reference Document
==========================

This document gives general guidelines on how to use UR to load and build
programs, and execute kernels on a CUDA device.

Device code
===========

A CUDA device image may be made of PTX and/or SASS, two different kinds of
device code for NVIDIA GPUs.

CUDA device images can be generated by a CUDA-capable compiler toolchain. Most
CUDA compiler toolchains are capable of generating PTX, SASS, or bundles of
PTX and SASS.
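
As an illustration, ``nvcc`` can emit each of these forms directly (the input
and output filenames here are hypothetical):

.. code-block:: console

   $ nvcc --ptx hello.cu -o hello.ptx
   $ nvcc --cubin -arch=sm_80 hello.cu -o hello.cubin
   $ nvcc --fatbin -arch=sm_80 hello.cu -o hello.fatbin

``--ptx`` produces PTX only, ``--cubin`` produces SASS for the named arch, and
``--fatbin`` produces a bundle containing both.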

PTX
---

PTX is a high-level NVIDIA ISA which can be JIT compiled at runtime by the CUDA
driver. In UR, this JIT compilation happens at ${x}ProgramBuild, where PTX is
assembled into device-specific SASS which can then run on the device.

PTX is forward compatible: PTX generated for ``.target sm_52`` will be JIT
compiled without issue for devices with a greater compute capability than
``sm_52``, whereas PTX generated for ``sm_80`` cannot be JIT compiled for an
``sm_60`` device.

An advantage of using PTX over SASS is that the same code can run on multiple
devices. However, PTX generated for an older arch may not give access to newer
hardware instructions, such as new atomic operations or tensor core
instructions.

JIT compilation adds some overhead at ${x}ProgramBuild, especially if the
program being loaded contains multiple kernels. The ``ptxjitcompiler`` keeps a
JIT cache, however, so this overhead is only paid the first time a program is
built. JIT caching may be turned off by setting the environment variable
``CUDA_CACHE_DISABLE=1``.

SASS
----

SASS is a device-specific binary which may be produced by ``ptxas`` or some
other tool. SASS is specific to an individual arch and is not portable across
arches.

A SASS file may be stored as a ``.cubin`` file by NVIDIA tools.

UR Programs
===========

A ${x}_program_handle_t has a one-to-one mapping with the CUDA driver object
`CUmodule <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MODULE.html#group__CUDA__MODULE>`_.

In UR for CUDA, a ${x}_program_handle_t can be created using
${x}ProgramCreateWithBinary with:

* A single PTX module, stored as a null-terminated ``uint8_t`` buffer.
* A single SASS module, stored as an opaque ``uint8_t`` buffer.
* A mixed PTX/SASS module, where the SASS module is the assembled PTX module.
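
For example, a PTX module read from disk might be turned into a program as
follows. This is a sketch, not a definitive usage: the handle and buffer names
are illustrative, and the parameters should be checked against the
${x}ProgramCreateWithBinary entry points documented elsewhere in this
specification.

.. parsed-literal::

   // pPTX points to a null-terminated PTX buffer of ptxLength bytes
   ${x}_program_handle_t program;
   ${x}ProgramCreateWithBinary(ctx, device, ptxLength, pPTX, nullptr, &program);
   ${x}ProgramBuild(ctx, program, nullptr);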

A ${x}_program_handle_t is valid only for a single architecture. If a CUDA
compatible binary contains device code for multiple NVIDIA architectures, it is
the user's responsibility to split these separate device images so that
${x}ProgramCreateWithBinary is only called with a device binary for a single
device arch.

If a program is large and contains many kernels, loading and/or JIT compiling
the program may have a high overhead. This can be mitigated by splitting the
program into multiple smaller programs (corresponding to PTX/SASS files). In
this way, an application only pays the overhead of loading and compiling the
kernels that it is likely to use.

Using PTX Modules in UR
-----------------------

A PTX module will be loaded and JIT compiled for the necessary architecture at
${x}ProgramBuild. If the PTX module has been generated for a compute capability
greater than the compute capability of the device, then ${x}ProgramBuild will
fail with the error ``CUDA_ERROR_NO_BINARY_FOR_GPU``.

A PTX module passed to ${x}ProgramBuild must contain only one PTX file.
Separate PTX files must be handled separately.

Arguments may be passed to the ``ptxjitcompiler`` via ${x}ProgramBuild.
Currently ``maxrregcount`` is the only supported argument.

.. parsed-literal::

   ${x}ProgramBuild(ctx, program, "maxrregcount=128");

Using SASS Modules in UR
------------------------

A SASS module will be loaded and checked for compatibility at ${x}ProgramBuild.
If the SASS module is incompatible with the device arch then ${x}ProgramBuild
will fail with the error ``CUDA_ERROR_NO_BINARY_FOR_GPU``.

Using Mixed PTX/SASS Bundles in UR
----------------------------------

Mixed PTX/SASS modules can be used to make a program with
${x}ProgramCreateWithBinary. At ${x}ProgramBuild the CUDA driver will check
whether the bundled SASS is compatible with the active device. If the SASS is
compatible then the ${x}_program_handle_t will be built from the SASS; if not,
the PTX will be used as a fallback and JIT compiled by the CUDA driver. If both
PTX and SASS are incompatible with the active device then ${x}ProgramBuild will
fail with the error ``CUDA_ERROR_NO_BINARY_FOR_GPU``.

UR Kernels
==========

Once ${x}ProgramCreateWithBinary and ${x}ProgramBuild have succeeded, kernels
can be fetched from programs with ${x}KernelCreate. ${x}KernelCreate must be
called with the exact name of the kernel in the PTX/SASS module. This name will
depend on the mangling used when compiling the kernel, so it is recommended to
examine the symbols in the PTX/SASS module before trying to extract kernels in
UR.

.. code-block:: console

   $ cuobjdump --dump-elf-symbols hello.cubin | grep mykernel
   _Z13mykernelv

At present it is not possible to query the names of the kernels in a UR program
for CUDA, so it is necessary to know the (mangled or otherwise) names of kernels
in advance or by some other means.
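
Continuing the ``cuobjdump`` example above, the mangled name can then be passed
to ${x}KernelCreate. This is a sketch; the handle names are illustrative:

.. parsed-literal::

   ${x}_kernel_handle_t kernel;
   ${x}KernelCreate(program, "_Z13mykernelv", &kernel);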

UR kernels can be dispatched with ${x}EnqueueKernelLaunch. The argument
``pGlobalWorkOffset`` can only be used if the kernels have been instrumented to
take the extra global offset argument. Use of the global offset is not
recommended for non-SYCL compiler toolchains. This parameter can be ignored if
the user does not wish to use the global offset.
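
A typical launch that ignores the global offset might look as follows. This is
a hedged sketch: the queue handle and work sizes are illustrative, and the
exact parameter list should be checked against the ${x}EnqueueKernelLaunch
entry point documented elsewhere in this specification.

.. parsed-literal::

   size_t globalSize[3] = {1024, 1, 1};
   size_t localSize[3]  = {32, 1, 1};
   // pGlobalWorkOffset passed as nullptr: no global offset is used
   ${x}EnqueueKernelLaunch(queue, kernel, 3, nullptr, globalSize, localSize,
                           0, nullptr, nullptr);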

Other Notes
===========

- The environment variable ``SYCL_PI_CUDA_MAX_LOCAL_MEM_SIZE`` can be set in
  order to exceed the default max dynamic local memory size. More information
  can be found
  `here <https://intel.github.io/llvm-docs/EnvironmentVariables.html#controlling-dpc-cuda-plugin>`_.
- The size of primitive datatypes may differ between host and device code. For
  instance, NVCC treats ``long double`` as 8 bytes for device and 16 bytes for
  host.
- In-kernel ``printf`` for NVPTX targets does not support the ``%z`` modifier.

Contributors
------------

* Hugh Delaney `hugh.delaney@codeplay.com <hugh.delaney@codeplay.com>`_

<%
OneApi=tags['$OneApi']
x=tags['$x']
X=x.upper()
%>

=============================
AMD HIP UR Reference Document
=============================

This document gives general guidelines on how to use UR to execute kernels on
an AMD HIP device.

Device code
===========

Unlike the NVPTX platform, AMDGPU does not use a device IR that can be JIT
compiled at runtime. Therefore, all device binaries must be precompiled for a
particular arch.

The naming of AMDGPU device code files may vary across different generations
of devices. ``.hsa`` or ``.hsaco`` are common extensions as of 2023.

HIPCC can generate device code for a particular arch using the ``--genco``
flag:

.. code-block:: console

   $ hipcc --genco hello.cu --amdgpu-target=gfx906 -o hello.hsaco

UR Programs
===========

A ${x}_program_handle_t has a one-to-one mapping with the HIP runtime object
`hipModule_t <https://docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___module.html>`__.

In UR for HIP, a ${x}_program_handle_t can be created using
${x}ProgramCreateWithBinary with:

* A single device code module

A ${x}_program_handle_t is valid only for a single architecture. If a HIP
compatible binary contains device code for multiple AMDGPU architectures, it is
the user's responsibility to split these separate device images so that
${x}ProgramCreateWithBinary is only called with a device binary for a single
device arch.

If the AMDGPU module is incompatible with the device arch then ${x}ProgramBuild
will fail with the error ``hipErrorNoBinaryForGpu``.

If a program is large and contains many kernels, loading the program may have a
high overhead. This can be mitigated by splitting the program into multiple
smaller programs. In this way, an application only pays the overhead of loading
the kernels that it is likely to use.
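
A precompiled ``.hsaco`` module can be turned into a program in the same way as
on the CUDA adapter. This is a sketch with illustrative handle and buffer
names; the parameters should be checked against the ${x}ProgramCreateWithBinary
entry points documented elsewhere in this specification.

.. parsed-literal::

   // pModule points to an AMDGPU device code module of moduleSize bytes
   ${x}_program_handle_t program;
   ${x}ProgramCreateWithBinary(ctx, device, moduleSize, pModule, nullptr, &program);
   ${x}ProgramBuild(ctx, program, nullptr);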

Kernels
=======

Once ${x}ProgramCreateWithBinary and ${x}ProgramBuild have succeeded, kernels
can be fetched from programs with ${x}KernelCreate. ${x}KernelCreate must be
called with the exact name of the kernel in the AMDGPU device code module. This
name will depend on the mangling used when compiling the kernel, so it is
recommended to examine the symbols in the AMDGPU device code module before
trying to extract kernels in UR code.

``llvm-objdump`` or ``readelf`` may not correctly view the symbols in an AMDGPU
device module. It may be necessary to call ``clang-offload-bundler`` first in
order to extract the ``ELF`` file that can be passed to ``readelf``.

.. code-block:: console

   $ clang-offload-bundler --unbundle --input=hello.hsaco --output=hello.o \
     --targets=hipv4-amdgcn-amd-amdhsa--gfx906 --type=o
   $ readelf hello.o -s | grep mykernel
   _Z13mykernelv

At present it is not possible to query the names of the kernels in a UR program
for HIP, so it is necessary to know the (mangled or otherwise) names of kernels
in advance or by some other means.

UR kernels can be dispatched with ${x}EnqueueKernelLaunch. The argument
``pGlobalWorkOffset`` can only be used if the kernels have been instrumented to
take the extra global offset argument. Use of the global offset is not
recommended for non-SYCL compiler toolchains. This parameter can be ignored if
the user does not wish to use the global offset.

Other Notes
===========

- In-kernel ``printf`` may not work for certain ROCm versions.

Contributors
------------

* Hugh Delaney `hugh.delaney@codeplay.com <hugh.delaney@codeplay.com>`_

@@ -14,5 +14,7 @@
     core/INTRO.rst
     core/PROG.rst
     core/CONTRIB.rst
+    core/CUDA.rst
+    core/HIP.rst
     exp-features.rst
     api.rst