[SYCL][HIP] Support of AMD matrix core instructions (#11485)

* Support one block AMD matrix core instructions for the `__gfx90a__` architecture.
* Support the `__builtin_amdgcn_mfma_i32_32x32x8i8`, `__builtin_amdgcn_mfma_i32_16x16x16i8`, `__builtin_amdgcn_mfma_f64_16x16x4f64`, `__builtin_amdgcn_mfma_f32_32x32x8bf16_1k`, `__builtin_amdgcn_mfma_f32_16x16x16bf16_1k`, `__builtin_amdgcn_mfma_f32_32x32x8f16` and `__builtin_amdgcn_mfma_f32_16x16x16f16` instructions.
* Add HIP matrix core support to the joint_matrix documentation.

Should be merged after #11215.

---------

Co-authored-by: Bing1 Yu <bing1.yu@intel.com>
Co-authored-by: mmoadeli <mahmoudmoadeli@codeplay.com>
intel · Oct 30, 2023 · 31481ce
1 parent 9c07b46
commit 31481ce
Showing 16 changed files with 1,268 additions and 32 deletions.
diff --git a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc
@@ -50,7 +50,7 @@ specification.*
 This extension is currently implemented in {dpcpp} only for devices
 that contain a matrix hardware, specifically Intel(R) Advanced Matrix
 Extensions (Intel(R) AMX), Intel(R) Xe Matrix Extensions (Intel(R)
-XMX) and Nvidia(R) Tensor Cores.
+XMX), Nvidia(R) Tensor Cores and AMD Matrix Cores(R).
 
 The `joint_matrix` type and the `joint_matrix_mad` function are
 optional kernel features as defined in section 5.7 of the core SYCL
@@ -67,8 +67,8 @@ implementation throws a synchronous exception with the
 
 == Overview
 Joint matrix is a SYCL extension for matrix hardware programming. It
-unifies targets like Intel AMX in CPUs, Intel XMX in Intel GPUs and
-Nvidia Tensor Cores. This provides a portable and performant API for
+unifies targets like Intel AMX in CPUs, Intel XMX in Intel GPUs,
+Nvidia Tensor Cores and AMD Matrix Cores(R). This provides a portable and performant API for
 users who want to build their own neural networks applications,
 perform custom optimizations, or experiment with new operations in a
 timely and performing manner.
@@ -921,7 +921,8 @@ the type of the A matrix must be the same as the type of the B
 matrix.
 
 IMPORTANT: When compiling for the `ext_oneapi_cuda` backend the target
-arch backend flag, `-Xsycl-target-backend --cuda-gpu-arch=sm_xx`, must
+arch backend flag, `-fsycl-targets=nvidia_gpu_sm_xx`
+(or equivalents, e.g. `-Xsycl-target-backend --cuda-gpu-arch=sm_xx`), must
 be used, where `sm_xx` must be a Compute Capability that is equal to
 or greater than the appropriate Minimum Compute Capability. When an
 executable has been compiled for `sm_xx`, if the executable is run on
@@ -971,6 +972,34 @@ multiple of 4 when `T` is `float`; where `T` is the type of the
 `joint_matrix` elements. When `T` is not `half` or `float` there are
 no restrictions to `stride`.
 
+==== AMD Matrix Cores Supported Combinations
+The complete set of matrix data types and dimensions supported by
+the `ext_oneapi_hip` backend is listed in the following
+table. In this backend's implementation, the A and B matrices must have the same type.
+Similarly, the C and D matrices must share the same type.
+
+IMPORTANT: The supported instructions run on the GFX90A architecture
+(MI200, MI210, MI250 and MI250X GPUs). When compiling for the `ext_oneapi_hip` backend, the
+target arch backend flag, `-fsycl-targets=amd_gpu_gfx90a`, must
+be used. An attempt to run the compiled code on an unsupported architecture will throw an error.
+
+
+[frame="none",options="header"]
+|======================
+| A and B type | C and D type | M | N | K
+.2+| `matrix_type::fp16`  .2+| `matrix_type::fp32`
+|32 |32 |8 
+|16 |16 |16
+.2+| `matrix_type::sint8`  .2+| `matrix_type::sint32`
+|32 |32 |8 
+|16 |16 |16
+.2+|`matrix_type::bf16`  .2+|`matrix_type::fp32`
+|32 |32 |8 
+|16 |16 |16
+.1+|`matrix_type::fp64`  .1+| `matrix_type::fp64`
+|16 |16 |4
+|======================
+
 === Revision History
 
 [frame="none",options="header"]
@@ -990,4 +1019,5 @@ the Intel-specifics to a separate extension document
 type, runtime query, and supported combinations appendix for Intel AMX
 and Intel XMX
 |7   |2023-04-11 |Jack Kirk |Add Nvidia Tensor Cores supported combinations
+|8   |2023-10-05 |Mahmoud Moadeli |Add AMD Matrix Core supported combinations
 |======================
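
The supported-combinations table added to the documentation above can be read as a small lookup problem: given the A/B type, the C/D type, and the M, N, K dimensions, is the combination one of the seven listed rows? The following is a minimal host-side sketch of that check in plain C++. The `matrix_type` enum here is a hypothetical stand-in that only mirrors the names in the table, and `combination`, `gfx90a_combinations` and `is_supported_on_gfx90a` are illustrative names that do not exist in the SYCL headers.

```cpp
#include <array>
#include <cstddef>

// Hypothetical mirror of the matrix_type names used in the table; this is
// not the extension's own enum.
enum class matrix_type { fp16, bf16, sint8, sint32, fp32, fp64 };

// One row of the table: A and B share a type, C and D share a type.
struct combination {
  matrix_type ab; // type of the A and B matrices
  matrix_type cd; // type of the C and D matrices
  std::size_t m, n, k;
};

// Rows transcribed from the "AMD Matrix Cores Supported Combinations" table.
static const std::array<combination, 7> gfx90a_combinations{{
    {matrix_type::fp16, matrix_type::fp32, 32, 32, 8},
    {matrix_type::fp16, matrix_type::fp32, 16, 16, 16},
    {matrix_type::sint8, matrix_type::sint32, 32, 32, 8},
    {matrix_type::sint8, matrix_type::sint32, 16, 16, 16},
    {matrix_type::bf16, matrix_type::fp32, 32, 32, 8},
    {matrix_type::bf16, matrix_type::fp32, 16, 16, 16},
    {matrix_type::fp64, matrix_type::fp64, 16, 16, 4},
}};

// Returns true only if the exact (types, M, N, K) tuple appears in the table.
inline bool is_supported_on_gfx90a(matrix_type ab, matrix_type cd,
                                   std::size_t m, std::size_t n,
                                   std::size_t k) {
  for (const combination &c : gfx90a_combinations) {
    if (c.ab == ab && c.cd == cd && c.m == m && c.n == n && c.k == k)
      return true;
  }
  return false;
}
```

For example, `is_supported_on_gfx90a(matrix_type::fp64, matrix_type::fp64, 16, 16, 4)` matches the last row, while the same types with a 32x32x8 shape match no row. A check of this kind only restates the documentation; it does not query the device.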
diff --git a/sycl/include/sycl/detail/defines.hpp b/sycl/include/sycl/detail/defines.hpp
@@ -39,9 +39,11 @@
 #define __SYCL_TYPE(x)
 #endif
 
-// joint matrix should only be included by default for SPIR or NVPTX backends
-#if defined __SPIR__ || defined __NVPTX__ || !defined __SYCL_DEVICE_ONLY__
+// joint matrix should only be included by default for SPIR, NVPTX or HIP(GFX90A
+// only) backends
+#if defined __SPIR__ || defined __NVPTX__ || !defined __SYCL_DEVICE_ONLY__ ||  \
+    defined __gfx90a__
 #ifndef SYCL_EXT_ONEAPI_MATRIX_VERSION
 #define SYCL_EXT_ONEAPI_MATRIX_VERSION 4
 #endif // SYCL_EXT_ONEAPI_MATRIX_VERSION
-#endif // __SPIR__ || __NVPTX__ || !__SYCL_DEVICE_ONLY
+#endif // __SPIR__ || __NVPTX__ || !__SYCL_DEVICE_ONLY__ || __gfx90a__
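
The effect of the `defines.hpp` change above is that `SYCL_EXT_ONEAPI_MATRIX_VERSION` is now also defined when compiling device code for `__gfx90a__`. The sketch below reproduces the same preprocessor condition in a standalone file and wraps the result in two helpers so that calling code can branch on it. The helper names `joint_matrix_version` and `has_joint_matrix` are hypothetical and are not part of the SYCL headers.

```cpp
// Same condition as the patched guard: SPIR, NVPTX, host compilation, or
// HIP device code for gfx90a.
#if defined __SPIR__ || defined __NVPTX__ || !defined __SYCL_DEVICE_ONLY__ ||  \
    defined __gfx90a__
#ifndef SYCL_EXT_ONEAPI_MATRIX_VERSION
#define SYCL_EXT_ONEAPI_MATRIX_VERSION 4
#endif // SYCL_EXT_ONEAPI_MATRIX_VERSION
#endif

// Hypothetical helper: the macro's value, or 0 when the guard did not
// define it for the current compilation pass.
constexpr int joint_matrix_version() {
#ifdef SYCL_EXT_ONEAPI_MATRIX_VERSION
  return SYCL_EXT_ONEAPI_MATRIX_VERSION;
#else
  return 0;
#endif
}

// Hypothetical helper: true when the feature-test macro is defined.
constexpr bool has_joint_matrix() { return joint_matrix_version() != 0; }
```

When this is built with an ordinary host compiler, `__SYCL_DEVICE_ONLY__` is not defined, so the guard is satisfied and the macro is defined. In a device pass for an AMD architecture other than gfx90a none of the four conditions would hold and the helpers would report the feature as absent.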