Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][Docs][Joint matrix] Add overloads and restrictions for the offset load store #15499

Merged
merged 4 commits into from
Oct 22, 2024
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -124,7 +124,7 @@ enum class layout {

Consequently, the layout argument `layout` in `joint_matrix_load` can
take `ext_intel_packed` as argument to specify that the data has
already been transformed into VNNI format. In this case, the `stride`
already been transformed into VNNI format. In this case, the `Stride`
argument of `joint_matrix_load` describes the number of elements
between consecutive rows for packed layouts.

Expand All @@ -148,28 +148,59 @@ template <typename Group, typename T, size_t Rows, size_t Cols,
access::decorated IsDecorated>
void joint_matrix_store(Group g,
const joint_matrix<Group, T, use::a, Rows, Cols, Layout> &res,
multi_ptr<T, Space, IsDecorated> dest, size_t stride);
multi_ptr<T, Space, IsDecorated> dest, size_t Stride);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why capitalize this parameter name? All the other parameter names start with a lower case letter. Our style is that function parameter names are lower case (snake_case) while template parameter names are upper case (CamelCase).

I see below that you have added parameter names RowIndex and ColIndex. These should be row_index and col_index to be consistent.


template <typename Group, typename T, size_t Rows, size_t Cols,
layout Layout, access::address_space Space,
access::decorated IsDecorated>
void joint_matrix_store(Group g,
const joint_matrix<Group, T, use::b, Rows, Cols, Layout> &res,
multi_ptr<T, Space, IsDecorated> dest, size_t stride);
multi_ptr<T, Space, IsDecorated> dest, size_t Stride);

template <typename Group, typename T, size_t Rows, size_t Cols,
layout Layout, typename PropertyListT>
void joint_matrix_store(Group g,
const joint_matrix<Group, T, use::a, Rows, Cols, Layout> &res,
ext::oneapi::experimental::annotated_ptr<T, PropertyListT> dest,
size_t stride);
size_t Stride);

template <typename Group, typename T, size_t Rows, size_t Cols,
layout Layout, typename PropertyListT>
void joint_matrix_store(Group g,
const joint_matrix<Group, T, use::b, Rows, Cols, Layout> &res,
ext::oneapi::experimental::annotated_ptr<T, PropertyListT> dest,
size_t stride);
size_t Stride);

// Overloads for offset store
template <typename Group, typename T, size_t Rows, size_t Cols,
layout Layout, access::address_space Space,
access::decorated IsDecorated>
void joint_matrix_store(Group g,
const joint_matrix<Group, T, use::a, Rows, Cols, Layout> &res,
multi_ptr<T, Space, IsDecorated> base_dest, size_t RowIndex,
size_t ColIndex, size_t Stride);

template <typename Group, typename T, size_t Rows, size_t Cols,
layout Layout, access::address_space Space,
access::decorated IsDecorated>
void joint_matrix_store(Group g,
const joint_matrix<Group, T, use::b, Rows, Cols, Layout> &res,
multi_ptr<T, Space, IsDecorated> base_dest, size_t RowIndex,
size_t ColIndex, size_t Stride);

template <typename Group, typename T, size_t Rows, size_t Cols,
layout Layout, typename PropertyListT>
void joint_matrix_store(Group g,
const joint_matrix<Group, T, use::a, Rows, Cols, Layout> &res,
ext::oneapi::experimental::annotated_ptr<T, PropertyListT>
base_dest, size_t RowIndex, size_t ColIndex, size_t Stride);

template <typename Group, typename T, size_t Rows, size_t Cols,
layout Layout, typename PropertyListT>
void joint_matrix_store(Group g,
const joint_matrix<Group, T, use::b, Rows, Cols, Layout> &res,
ext::oneapi::experimental::annotated_ptr<T, PropertyListT>
base_dest, size_t RowIndex, size_t ColIndex, size_t Stride);

} // namespace sycl::ext::intel::experimental::matrix
```
Expand Down Expand Up @@ -462,6 +493,9 @@ The checked APIs are currently available in devices with the architecture
`architecture::intel_gpu_pvc`. The following restrictions apply to
these checked APIs:

- The `Stride` argument must be a multiple of 8 bytes. Also, `Stride`
should not exceed `2^24^` bytes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment in the other file about the wording of this restriction.


- The base pointer must be 4 bytes aligned.

- For 8 bits data type, `RowIndex` must be a multiple of 4. For 16 bits
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -216,7 +216,7 @@ template <typename Group, typename T1, typename T2,
access::address_space Space, access::decorated IsDecorated>
void joint_matrix_load(Group g,
joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res,
multi_ptr<T2, Space, IsDecorated> src, size_t stride, layout Layout);
multi_ptr<T2, Space, IsDecorated> src, size_t Stride, layout Layout);

// Only available when Layout != layout::dynamic
// and when std::is_same_v<T1, std::remove_const_t<T2>>
Expand All @@ -226,15 +226,15 @@ template <typename Group, typename T1, typename T2,
access::address_space Space, access::decorated IsDecorated>
void joint_matrix_load(Group g,
joint_matrix<Group, T1, Use, Rows, Cols, Layout> &res,
multi_ptr<T2, Space, IsDecorated> src, size_t stride);
multi_ptr<T2, Space, IsDecorated> src, size_t Stride);

// Only available when std::is_same_v<T1, std::remove_const_t<T2>>
template <typename Group, typename T1, typename T2,
size_t Rows, size_t Cols,
typename PropertyListT>
void joint_matrix_load(Group g,
joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res,
annotated_ptr<T2, PropertyListT> src, size_t stride, layout Layout);
annotated_ptr<T2, PropertyListT> src, size_t Stride, layout Layout);

// Only available when Layout != layout::dynamic
// and when std::is_same_v<T1, std::remove_const_t<T2>>
Expand All @@ -243,7 +243,7 @@ template <typename Group, typename T1, typename T2,
typename PropertyListT>
void joint_matrix_load(Group g,
joint_matrix<Group, T1, Use, Rows, Cols, Layout> &res,
annotated_ptr<T2, PropertyListT> src, size_t stride);
annotated_ptr<T2, PropertyListT> src, size_t Stride);

} // namespace sycl::ext::oneapi::experimental::matrix
```
Expand All @@ -261,7 +261,7 @@ The second overload without a memory layout must not be used with a
The base pointer `src` of type `T` here determines the starting address of the
matrix to be loaded from. `Layout` determines whether the data is
being read in a row (`row_major`), column major (`col_major`)
fashion. `stride` describes the number of elements between consecutive
fashion. `Stride` describes the number of elements between consecutive
rows for the row major layout, or between columns for the column major
layout.

Expand Down Expand Up @@ -301,13 +301,13 @@ template <typename Group, typename T1, typename T2, size_t Rows, size_t Cols,
access::address_space Space, access::decorated IsDecorated>
void joint_matrix_store(Group g,
const joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res,
multi_ptr<T2, Space, IsDecorated> dest, size_t stride, layout Layout);
multi_ptr<T2, Space, IsDecorated> dest, size_t Stride, layout Layout);

template <typename Group, typename T1, typename T2, size_t Rows, size_t Cols,
typename PropertyListT>
void joint_matrix_store(Group g,
const joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res,
annotated_ptr<T2, PropertyListT> dest, size_t stride, layout Layout);
annotated_ptr<T2, PropertyListT> dest, size_t Stride, layout Layout);

} // namespace sycl::ext::oneapi::experimental::matrix
```
Expand All @@ -317,7 +317,7 @@ registers back to memory.
The base pointer `dest` here determines the starting address of the
matrix to be stored. `Layout` determines whether the data is being
written in a row (`row_major`), column major (`col_major`)
fashion. `stride` describes the number of elements between consecutive
fashion. `Stride` describes the number of elements between consecutive
rows for the row major layout, or between columns for the column major layout.

The second overload of `joint_matrix_store` takes
Expand All @@ -326,6 +326,86 @@ of `sycl::multi_ptr`. The property list associated with the
`annotated_ptr` argument represents the compile-time constant
properties for cache control included in the SYCL extenion link:../../proposed/sycl_ext_intel_cache_controls.asciidoc[sycl_ext_intel_cache_controls]


==== Offset Load
```c++
namespace sycl::ext::oneapi::experimental::matrix {

// Only available when std::is_same_v<T1, std::remove_const_t<T2>>
template <typename Group, typename T1, typename T2,
size_t Rows, size_t Cols,
access::address_space Space, access::decorated IsDecorated>
void joint_matrix_load(Group g,
joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res,
multi_ptr<T2, Space, IsDecorated> base_src, size_t RowIndex,
size_t ColIndex, size_t Stride, layout Layout);

// Only available when Layout != layout::dynamic
// and when std::is_same_v<T1, std::remove_const_t<T2>>
template <typename Group, typename T1, typename T2,
size_t Rows, size_t Cols,
use Use, layout Layout,
access::address_space Space, access::decorated IsDecorated>
void joint_matrix_load(Group g,
joint_matrix<Group, T1, Use, Rows, Cols, Layout> &res,
multi_ptr<T2, Space, IsDecorated> base_src, size_t RowIndex,
size_t ColIndex size_t Stride);

// Only available when std::is_same_v<T1, std::remove_const_t<T2>>
template <typename Group, typename T1, typename T2,
size_t Rows, size_t Cols,
typename PropertyListT>
void joint_matrix_load(Group g,
joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res,
annotated_ptr<T2, PropertyListT> base_src, size_t RowIndex, size_t
ColIndex, size_t Stride, layout Layout);

// Only available when Layout != layout::dynamic
// and when std::is_same_v<T1, std::remove_const_t<T2>>
template <typename Group, typename T1, typename T2,
size_t Rows, size_t Cols, use Use, layout Layout,
typename PropertyListT>
void joint_matrix_load(Group g,
joint_matrix<Group, T1, Use, Rows, Cols, Layout> &res,
annotated_ptr<T2, PropertyListT> base_src, size_t RowIndex, size_t
ColIndex, size_t Stride);

} // namespace sycl::ext::oneapi::experimental::matrix
```

These overloads of `joint_matrix_load` takes the pointer `base_src` to
designate the base pointer of the global memory matrix. The
coordinates `RowIndex` and `ColIndex` into the global matrix to
calculate the pointer offset to load/store are given as separate
arguments.

==== Offset Store
```c++
namespace sycl::ext::oneapi::experimental::matrix {

// T1 must be the same as T2
template <typename Group, typename T1, typename T2, size_t Rows, size_t Cols,
access::address_space Space, access::decorated IsDecorated>
void joint_matrix_store(Group g,
const joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res,
multi_ptr<T2, Space, IsDecorated> base_dest, size_t RowIndex,
size_t ColIndex, size_t Stride, layout Layout);

template <typename Group, typename T1, typename T2, size_t Rows, size_t Cols,
typename PropertyListT>
void joint_matrix_store(Group g,
const joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res,
annotated_ptr<T2, PropertyListT> base_dest, size_t RowIndex, size_t
ColIndex, size_t Stride, layout Layout);

} // namespace sycl::ext::oneapi::experimental::matrix
```
These overloads of `joint_matrix_store` takes the pointer `base_dest` to
designate the base pointer of the global memory matrix. The
coordinates `RowIndex` and `ColIndex` into the global matrix to
calculate the pointer offset to load/store are given as separate
arguments.

==== Multiply and Add

```c++
Expand Down Expand Up @@ -472,7 +552,7 @@ namespace sycl::ext::oneapi::experimental::matrix {

template <size_t Rows, size_t Cols, typename Group, typename T,
typename Properties = empty_properties_t>
void joint_matrix_prefetch(Group g, T* ptr, size_t stride, layout Layout,
void joint_matrix_prefetch(Group g, T* ptr, size_t Stride, layout Layout,
Properties properties = {});

} // namespace sycl::ext::oneapi::experimental::matrix
Expand Down Expand Up @@ -979,7 +1059,7 @@ for (int i = 0; sizeof(combinations); i++) {
}
```

=== Appendix: Supported Combinations Per Hardware
=== Appendix: Supported Combinations and Restrictions Per Hardware
The table below provides a list of the combinations that
`joint_matrix` implementations support on each of Intel AMX and Intel
XMX hardware. Note that these can be returned using
Expand Down Expand Up @@ -1065,6 +1145,25 @@ architecture::intel_gpu_dg2_g11, architecture::intel_gpu_dg2_g12`
`architecture::intel_gpu_pvc`
|======================

===== Restrictions on `architecture::intel_gpu_pvc`

- The `Stride` argument to `joint_matrix_load` and
`joint_matrix_store` must be a multiple of 8 bytes. Also, `Stride`
should not exceed `2^24^` bytes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stride parameter is the number of elements, not the number of bytes. It would be better to reword this like:

The stride parameter to joint_matrix_load and joint_matrix_store has the following restrictions:

  • The value stride * sizeof(T1) must be a multiple of 8, and
  • The value of stride * sizeof(T1) must not exceed 224.


- The base pointer argument to `joint_matrix_load` and
`joint_matrix_store` must be 4 bytes aligned.

- In the case of the offset overloads of `joint_matrix_load` and
`joint_matrix_store`, for 8 bits data type, `RowIndex` must be a
multiple of 4. For 16 bits data type, `RowIndex` must be a multiple
of 2. So `RowIndex` must be a multiple of 4 divided by size of the
element type (`4/sizeof(T)`).

- If these restrictions are not satisfied, users can switch to slower
implementations of `joint_matrix_load` and `joint_matrix_store` by
setting the driver flag `IGC_JointMatrixLoadStoreOpt=1`.

==== Nvidia Tensor Cores Supported Combinations
The complete set of matrix data types and shapes that are supported by
the `ext_oneapi_cuda` backend are represented in the following
Expand Down Expand Up @@ -1118,11 +1217,11 @@ supported parameter combination is specified in the following table.
| `matrix_type::fp64` | `matrix_type::fp64` | `matrix_type::fp64` |8 |8 |4
|======================

IMPORTANT: The `stride` argument to `joint_matrix_load` and
IMPORTANT: The `Stride` argument to `joint_matrix_load` and
`joint_matrix_store` must be a multiple of 8 when `T` is `half`, and a
multiple of 4 when `T` is `float`; where `T` is the type of the
`joint_matrix` elements. When `T` is not `half` or `float` there are
no restrictions to `stride`.
no restrictions to `Stride`.

IMPORTANT: For some devices it is important to use the sm version
(Compute Capability) corresponding to the device that will run the
Expand Down