-
Notifications
You must be signed in to change notification settings - Fork 738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL][Docs][Joint matrix] Add overloads and restrictions for the offset load store #15499
Changes from 3 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -124,7 +124,7 @@ enum class layout { | |
|
||
Consequently, the layout argument `layout` in `joint_matrix_load` can | ||
take `ext_intel_packed` as argument to specify that the data has | ||
already been transformed into VNNI format. In this case, the `stride` | ||
already been transformed into VNNI format. In this case, the `Stride` | ||
argument of `joint_matrix_load` describes the number of elements | ||
between consecutive rows for packed layouts. | ||
|
||
|
@@ -148,28 +148,59 @@ template <typename Group, typename T, size_t Rows, size_t Cols, | |
access::decorated IsDecorated> | ||
void joint_matrix_store(Group g, | ||
const joint_matrix<Group, T, use::a, Rows, Cols, Layout> &res, | ||
multi_ptr<T, Space, IsDecorated> dest, size_t stride); | ||
multi_ptr<T, Space, IsDecorated> dest, size_t Stride); | ||
|
||
template <typename Group, typename T, size_t Rows, size_t Cols, | ||
layout Layout, access::address_space Space, | ||
access::decorated IsDecorated> | ||
void joint_matrix_store(Group g, | ||
const joint_matrix<Group, T, use::b, Rows, Cols, Layout> &res, | ||
multi_ptr<T, Space, IsDecorated> dest, size_t stride); | ||
multi_ptr<T, Space, IsDecorated> dest, size_t Stride); | ||
|
||
template <typename Group, typename T, size_t Rows, size_t Cols, | ||
layout Layout, typename PropertyListT> | ||
void joint_matrix_store(Group g, | ||
const joint_matrix<Group, T, use::a, Rows, Cols, Layout> &res, | ||
ext::oneapi::experimental::annotated_ptr<T, PropertyListT> dest, | ||
size_t stride); | ||
size_t Stride); | ||
|
||
template <typename Group, typename T, size_t Rows, size_t Cols, | ||
layout Layout, typename PropertyListT> | ||
void joint_matrix_store(Group g, | ||
const joint_matrix<Group, T, use::b, Rows, Cols, Layout> &res, | ||
ext::oneapi::experimental::annotated_ptr<T, PropertyListT> dest, | ||
size_t stride); | ||
size_t Stride); | ||
|
||
// Overloads for offset store | ||
template <typename Group, typename T, size_t Rows, size_t Cols, | ||
layout Layout, access::address_space Space, | ||
access::decorated IsDecorated> | ||
void joint_matrix_store(Group g, | ||
const joint_matrix<Group, T, use::a, Rows, Cols, Layout> &res, | ||
multi_ptr<T, Space, IsDecorated> base_dest, size_t RowIndex, | ||
size_t ColIndex, size_t Stride); | ||
|
||
template <typename Group, typename T, size_t Rows, size_t Cols, | ||
layout Layout, access::address_space Space, | ||
access::decorated IsDecorated> | ||
void joint_matrix_store(Group g, | ||
const joint_matrix<Group, T, use::b, Rows, Cols, Layout> &res, | ||
multi_ptr<T, Space, IsDecorated> base_dest, size_t RowIndex, | ||
size_t ColIndex, size_t Stride); | ||
|
||
template <typename Group, typename T, size_t Rows, size_t Cols, | ||
layout Layout, typename PropertyListT> | ||
void joint_matrix_store(Group g, | ||
const joint_matrix<Group, T, use::a, Rows, Cols, Layout> &res, | ||
ext::oneapi::experimental::annotated_ptr<T, PropertyListT> | ||
base_dest, size_t RowIndex, size_t ColIndex, size_t Stride); | ||
|
||
template <typename Group, typename T, size_t Rows, size_t Cols, | ||
layout Layout, typename PropertyListT> | ||
void joint_matrix_store(Group g, | ||
const joint_matrix<Group, T, use::b, Rows, Cols, Layout> &res, | ||
ext::oneapi::experimental::annotated_ptr<T, PropertyListT> | ||
base_dest, size_t RowIndex, size_t ColIndex, size_t Stride); | ||
|
||
} // namespace sycl::ext::intel::experimental::matrix | ||
``` | ||
|
@@ -462,6 +493,9 @@ The checked APIs are currently available in devices with the architecture | |
`architecture::intel_gpu_pvc`. The following restrictions apply to | ||
these checked APIs: | ||
|
||
- The `Stride` argument must be a multiple of 8 bytes. Also, `Stride` | ||
should not exceed `2^24^` bytes. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See my comment in the other file about the wording of this restriction. |
||
|
||
- The base pointer must be 4 bytes aligned. | ||
|
||
- For 8 bits data type, `RowIndex` must be a multiple of 4. For 16 bits | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -216,7 +216,7 @@ template <typename Group, typename T1, typename T2, | |
access::address_space Space, access::decorated IsDecorated> | ||
void joint_matrix_load(Group g, | ||
joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res, | ||
multi_ptr<T2, Space, IsDecorated> src, size_t stride, layout Layout); | ||
multi_ptr<T2, Space, IsDecorated> src, size_t Stride, layout Layout); | ||
|
||
// Only available when Layout != layout::dynamic | ||
// and when std::is_same_v<T1, std::remove_const_t<T2>> | ||
|
@@ -226,15 +226,15 @@ template <typename Group, typename T1, typename T2, | |
access::address_space Space, access::decorated IsDecorated> | ||
void joint_matrix_load(Group g, | ||
joint_matrix<Group, T1, Use, Rows, Cols, Layout> &res, | ||
multi_ptr<T2, Space, IsDecorated> src, size_t stride); | ||
multi_ptr<T2, Space, IsDecorated> src, size_t Stride); | ||
|
||
// Only available when std::is_same_v<T1, std::remove_const_t<T2>> | ||
template <typename Group, typename T1, typename T2, | ||
size_t Rows, size_t Cols, | ||
typename PropertyListT> | ||
void joint_matrix_load(Group g, | ||
joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res, | ||
annotated_ptr<T2, PropertyListT> src, size_t stride, layout Layout); | ||
annotated_ptr<T2, PropertyListT> src, size_t Stride, layout Layout); | ||
|
||
// Only available when Layout != layout::dynamic | ||
// and when std::is_same_v<T1, std::remove_const_t<T2>> | ||
|
@@ -243,7 +243,7 @@ template <typename Group, typename T1, typename T2, | |
typename PropertyListT> | ||
void joint_matrix_load(Group g, | ||
joint_matrix<Group, T1, Use, Rows, Cols, Layout> &res, | ||
annotated_ptr<T2, PropertyListT> src, size_t stride); | ||
annotated_ptr<T2, PropertyListT> src, size_t Stride); | ||
|
||
} // namespace sycl::ext::oneapi::experimental::matrix | ||
``` | ||
|
@@ -261,7 +261,7 @@ The second overload without a memory layout must not be used with a | |
The base pointer `src` of type `T` here determines the starting address of the | ||
matrix to be loaded from. `Layout` determines whether the data is | ||
being read in a row (`row_major`), column major (`col_major`) | ||
fashion. `stride` describes the number of elements between consecutive | ||
fashion. `Stride` describes the number of elements between consecutive | ||
rows for the row major layout, or between columns for the column major | ||
layout. | ||
|
||
|
@@ -301,13 +301,13 @@ template <typename Group, typename T1, typename T2, size_t Rows, size_t Cols, | |
access::address_space Space, access::decorated IsDecorated> | ||
void joint_matrix_store(Group g, | ||
const joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res, | ||
multi_ptr<T2, Space, IsDecorated> dest, size_t stride, layout Layout); | ||
multi_ptr<T2, Space, IsDecorated> dest, size_t Stride, layout Layout); | ||
|
||
template <typename Group, typename T1, typename T2, size_t Rows, size_t Cols, | ||
typename PropertyListT> | ||
void joint_matrix_store(Group g, | ||
const joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res, | ||
annotated_ptr<T2, PropertyListT> dest, size_t stride, layout Layout); | ||
annotated_ptr<T2, PropertyListT> dest, size_t Stride, layout Layout); | ||
|
||
} // namespace sycl::ext::oneapi::experimental::matrix | ||
``` | ||
|
@@ -317,7 +317,7 @@ registers back to memory. | |
The base pointer `dest` here determines the starting address of the | ||
matrix to be stored. `Layout` determines whether the data is being | ||
written in a row (`row_major`), column major (`col_major`) | ||
fashion. `stride` describes the number of elements between consecutive | ||
fashion. `Stride` describes the number of elements between consecutive | ||
rows for the row major layout, or between columns for the column major layout. | ||
|
||
The second overload of `joint_matrix_store` takes | ||
|
@@ -326,6 +326,86 @@ of `sycl::multi_ptr`. The property list associated with the | |
`annotated_ptr` argument represents the compile-time constant | ||
properties for cache control included in the SYCL extenion link:../../proposed/sycl_ext_intel_cache_controls.asciidoc[sycl_ext_intel_cache_controls] | ||
|
||
|
||
==== Offset Load | ||
```c++ | ||
namespace sycl::ext::oneapi::experimental::matrix { | ||
|
||
// Only available when std::is_same_v<T1, std::remove_const_t<T2>> | ||
template <typename Group, typename T1, typename T2, | ||
size_t Rows, size_t Cols, | ||
access::address_space Space, access::decorated IsDecorated> | ||
void joint_matrix_load(Group g, | ||
joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res, | ||
multi_ptr<T2, Space, IsDecorated> base_src, size_t RowIndex, | ||
size_t ColIndex, size_t Stride, layout Layout); | ||
|
||
// Only available when Layout != layout::dynamic | ||
// and when std::is_same_v<T1, std::remove_const_t<T2>> | ||
template <typename Group, typename T1, typename T2, | ||
size_t Rows, size_t Cols, | ||
use Use, layout Layout, | ||
access::address_space Space, access::decorated IsDecorated> | ||
void joint_matrix_load(Group g, | ||
joint_matrix<Group, T1, Use, Rows, Cols, Layout> &res, | ||
multi_ptr<T2, Space, IsDecorated> base_src, size_t RowIndex, | ||
size_t ColIndex size_t Stride); | ||
|
||
// Only available when std::is_same_v<T1, std::remove_const_t<T2>> | ||
template <typename Group, typename T1, typename T2, | ||
size_t Rows, size_t Cols, | ||
typename PropertyListT> | ||
void joint_matrix_load(Group g, | ||
joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res, | ||
annotated_ptr<T2, PropertyListT> base_src, size_t RowIndex, size_t | ||
ColIndex, size_t Stride, layout Layout); | ||
|
||
// Only available when Layout != layout::dynamic | ||
// and when std::is_same_v<T1, std::remove_const_t<T2>> | ||
template <typename Group, typename T1, typename T2, | ||
size_t Rows, size_t Cols, use Use, layout Layout, | ||
typename PropertyListT> | ||
void joint_matrix_load(Group g, | ||
joint_matrix<Group, T1, Use, Rows, Cols, Layout> &res, | ||
annotated_ptr<T2, PropertyListT> base_src, size_t RowIndex, size_t | ||
ColIndex, size_t Stride); | ||
|
||
} // namespace sycl::ext::oneapi::experimental::matrix | ||
``` | ||
|
||
These overloads of `joint_matrix_load` takes the pointer `base_src` to | ||
designate the base pointer of the global memory matrix. The | ||
coordinates `RowIndex` and `ColIndex` into the global matrix to | ||
calculate the pointer offset to load/store are given as separate | ||
arguments. | ||
|
||
==== Offset Store | ||
```c++ | ||
namespace sycl::ext::oneapi::experimental::matrix { | ||
|
||
// T1 must be the same as T2 | ||
template <typename Group, typename T1, typename T2, size_t Rows, size_t Cols, | ||
access::address_space Space, access::decorated IsDecorated> | ||
void joint_matrix_store(Group g, | ||
const joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res, | ||
multi_ptr<T2, Space, IsDecorated> base_dest, size_t RowIndex, | ||
size_t ColIndex, size_t Stride, layout Layout); | ||
|
||
template <typename Group, typename T1, typename T2, size_t Rows, size_t Cols, | ||
typename PropertyListT> | ||
void joint_matrix_store(Group g, | ||
const joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res, | ||
annotated_ptr<T2, PropertyListT> base_dest, size_t RowIndex, size_t | ||
ColIndex, size_t Stride, layout Layout); | ||
|
||
} // namespace sycl::ext::oneapi::experimental::matrix | ||
``` | ||
These overloads of `joint_matrix_store` takes the pointer `base_dest` to | ||
designate the base pointer of the global memory matrix. The | ||
coordinates `RowIndex` and `ColIndex` into the global matrix to | ||
calculate the pointer offset to load/store are given as separate | ||
arguments. | ||
|
||
==== Multiply and Add | ||
|
||
```c++ | ||
|
@@ -472,7 +552,7 @@ namespace sycl::ext::oneapi::experimental::matrix { | |
|
||
template <size_t Rows, size_t Cols, typename Group, typename T, | ||
typename Properties = empty_properties_t> | ||
void joint_matrix_prefetch(Group g, T* ptr, size_t stride, layout Layout, | ||
void joint_matrix_prefetch(Group g, T* ptr, size_t Stride, layout Layout, | ||
Properties properties = {}); | ||
|
||
} // namespace sycl::ext::oneapi::experimental::matrix | ||
|
@@ -979,7 +1059,7 @@ for (int i = 0; sizeof(combinations); i++) { | |
} | ||
``` | ||
|
||
=== Appendix: Supported Combinations Per Hardware | ||
=== Appendix: Supported Combinations and Restrictions Per Hardware | ||
The table below provides a list of the combinations that | ||
`joint_matrix` implementations support on each of Intel AMX and Intel | ||
XMX hardware. Note that these can be returned using | ||
|
@@ -1065,6 +1145,25 @@ architecture::intel_gpu_dg2_g11, architecture::intel_gpu_dg2_g12` | |
`architecture::intel_gpu_pvc` | ||
|====================== | ||
|
||
===== Restrictions on `architecture::intel_gpu_pvc` | ||
|
||
- The `Stride` argument to `joint_matrix_load` and | ||
`joint_matrix_store` must be a multiple of 8 bytes. Also, `Stride` | ||
should not exceed `2^24^` bytes. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The
|
||
|
||
- The base pointer argument to `joint_matrix_load` and | ||
`joint_matrix_store` must be 4 bytes aligned. | ||
|
||
- In the case of the offset overloads of `joint_matrix_load` and | ||
`joint_matrix_store`, for 8 bits data type, `RowIndex` must be a | ||
multiple of 4. For 16 bits data type, `RowIndex` must be a multiple | ||
of 2. So `RowIndex` must be a multiple of 4 divided by size of the | ||
element type (`4/sizeof(T)`). | ||
|
||
- If these restrictions are not satisfied, users can switch to slower | ||
implementations of `joint_matrix_load` and `joint_matrix_store` by | ||
setting the driver flag `IGC_JointMatrixLoadStoreOpt=1`. | ||
|
||
==== Nvidia Tensor Cores Supported Combinations | ||
The complete set of matrix data types and shapes that are supported by | ||
the `ext_oneapi_cuda` backend are represented in the following | ||
|
@@ -1118,11 +1217,11 @@ supported parameter combination is specified in the following table. | |
| `matrix_type::fp64` | `matrix_type::fp64` | `matrix_type::fp64` |8 |8 |4 | ||
|====================== | ||
|
||
IMPORTANT: The `stride` argument to `joint_matrix_load` and | ||
IMPORTANT: The `Stride` argument to `joint_matrix_load` and | ||
`joint_matrix_store` must be a multiple of 8 when `T` is `half`, and a | ||
multiple of 4 when `T` is `float`; where `T` is the type of the | ||
`joint_matrix` elements. When `T` is not `half` or `float` there are | ||
no restrictions to `stride`. | ||
no restrictions to `Stride`. | ||
|
||
IMPORTANT: For some devices it is important to use the sm version | ||
(Compute Capability) corresponding to the device that will run the | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why capitalize this parameter name? All the other parameter names start with a lower case letter. Our style is that function parameter names are lower case (snake_case) while template parameter names are upper case (CamelCase).
I see below that you have added parameter names
RowIndex
andColIndex
. These should berow_index
andcol_index
to be consistent.