[SYCL] [DOC] Group sorting algorithm design review #11974

Open · wants to merge 5 commits into base: `sycl` · Changes from 2 commits
220 changes: 220 additions & 0 deletions sycl/doc/design/GroupSort.md
@@ -0,0 +1,220 @@
# Group sort algorithm

Group sorting algorithms are needed to sort data without launching additional kernels.
They are described by a SYCL 2020 extension specification:
[direct link to the extension specification][group_sort_spec].

[group_sort_spec]: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/proposed/sycl_ext_oneapi_group_sort.asciidoc

Example usage:

```cpp
#include <sycl/sycl.hpp>

namespace oneapi_exp = sycl::ext::oneapi::experimental;

// T is the element type of the buffer `buf`; `q` is a sycl::queue.
sycl::range<1> local_range{256};

// Predefine the radix sorter type to calculate the local memory size.
using RSorter = oneapi_exp::radix_sorter<T, oneapi_exp::sorting_order::descending>;

// Calculate the required temporary local memory size.
size_t temp_memory_size =
    RSorter::memory_required(sycl::memory_scope::work_group, local_range);

q.submit([&](sycl::handler& h) {
  auto acc = sycl::accessor(buf, h);
  auto scratch = sycl::local_accessor<std::byte, 1>({temp_memory_size}, h);
  h.parallel_for(
      sycl::nd_range<1>{local_range, local_range},
      [=](sycl::nd_item<1> id) {
        acc[id.get_local_id()] = oneapi_exp::sort_over_group(
            id.get_group(),
            acc[id.get_local_id()],
            RSorter(sycl::span{scratch.get_pointer(), temp_memory_size}));
      });
});
...
```
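
Since the API is experimental, user code would typically guard its use with the `SYCL_EXT_ONEAPI_GROUP_SORT` feature macro listed in the checklist below. A minimal sketch; the macro name comes from this document, and the fallback branch is purely illustrative:

```cpp
#ifdef SYCL_EXT_ONEAPI_GROUP_SORT
  // The extension is available: use sort_over_group / joint_sort as shown above.
#else
  // Illustrative fallback: sort on the host, e.g. with std::sort on a host accessor.
#endif
```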

## Design objectives

In the DPC++ headers and the DPC++ runtime we don't know which sorting algorithm is
best for a given architecture. Backends are better positioned to optimize the sorting
algorithm using low-level instructions.

Data types that backends should support:
[arithmetic types](https://en.cppreference.com/w/c/language/arithmetic_types) and `sycl::half`.

Comparators that backends should support: `std::less`, `std::greater`, and custom comparators.
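
For illustration, here is a sketch of how a custom comparator could be used with `joint_sort` and the `group_with_scratchpad` helper. The exact signatures are defined by the extension specification linked above; the buffer, work-group size, and comparator below are only illustrative:

```cpp
#include <sycl/sycl.hpp>

namespace oneapi_exp = sycl::ext::oneapi::experimental;

// Sort a buffer of ints in descending order with a custom comparator.
void sort_descending(sycl::queue& q, sycl::buffer<int, 1>& buf) {
  constexpr size_t local_size = 256;
  auto greater = [](int a, int b) { return a > b; };  // custom comparator

  // Ask the default sorter how much scratch memory it needs for this range.
  size_t scratch_size = oneapi_exp::default_sorter<decltype(greater)>::
      memory_required<int>(sycl::memory_scope::work_group, buf.size());

  q.submit([&](sycl::handler& h) {
    auto acc = sycl::accessor(buf, h);
    auto scratch = sycl::local_accessor<std::byte, 1>({scratch_size}, h);
    h.parallel_for(
        sycl::nd_range<1>{local_size, local_size}, [=](sycl::nd_item<1> it) {
          oneapi_exp::group_with_scratchpad gh{
              it.get_group(), sycl::span<std::byte>{&scratch[0], scratch_size}};
          // All work-items of the group cooperatively sort the whole range.
          oneapi_exp::joint_sort(gh, &acc[0], &acc[0] + acc.size(), greater);
        });
  });
}
```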

## Design

Overall, for backend support we need the following:
- A fallback implementation of the sorting algorithms for user-defined types, comparators, and/or sorters.

- Backend implementations for types, comparators, and/or sorters
that can be optimized using backend-specific instructions.

**NOTE**: It was decided that `radix_sorter` will be implemented only in the DPC++ headers,
since it is difficult to support such an algorithm at the backend level.

- A fallback implementation for the case where backends don't have optimized implementations yet.

- A Level Zero extension for the `memory_required` functions.
**Reviewer comment (Contributor):** We do not directly call into Level Zero from the SYCL RT, but instead we go through the Unified Runtime. Therefore, if we need a new API to do queries to low-level runtimes, then UR should also be updated.


The following should be implemented:

- [x] Sorter classes and their `operator()`, including the sorting algorithms.
- [x] Default sorter.
- [x] Radix sorter.
- [x] `joint_sort` and `sort_over_group` functions.
- [x] Traits to distinguish interfaces with `Compare` and `Sorter` parameters.
- [x] Checks when radix sort is applicable (arithmetic types only).
- [x] The `radix_order` enum class.
- [x] `group_with_scratchpad` predefined group helper.
- [x] `SYCL_EXT_ONEAPI_GROUP_SORT` feature macro.
- [ ] `sort_over_group` with `span`-based parameters.
- [ ] Level Zero extension for `memory_required` functions.
  - [ ] Specification.
  - [ ] Implementation.
- [ ] Backend support for sorting algorithms.
  - [ ] Default sorter.
  - [ ] Fallback library, used if the device doesn't implement the functions.

**Note**: The "tick" means that corresponding feature is implemented.

The sections below describe each component in more detail.

### DPC++ Headers

The DPC++ headers contain the following:
- The required definitions of the `joint_sort` and `sort_over_group` functions, the `radix_order`
enum class, and the `default_sorter` and `radix_sorter` classes with their corresponding `operator()`,
as well as other supporting classes and methods.

- Checks for whether radix sort is applicable to the provided data types.

- Traits to distinguish interfaces with `Compare` and `Sorter` parameters (a sketch of such a trait follows this list).

- A fallback solution for user-defined types, comparators, and/or sorters.
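
The exact mechanism the headers use for this trait is an implementation detail; the following hypothetical sketch only illustrates the idea of dispatching on whether a callable looks like a sorter (invocable with a group and a value) or a comparator (invocable with two values):

```cpp
#include <type_traits>
#include <utility>

// Hypothetical detection trait: a "sorter" is invocable as sorter(group, value).
template <typename T, typename Group, typename ValT, typename = void>
struct is_sorter : std::false_type {};

template <typename T, typename Group, typename ValT>
struct is_sorter<T, Group, ValT,
                 std::void_t<decltype(std::declval<T&>()(
                     std::declval<Group>(), std::declval<ValT>()))>>
    : std::true_type {};

// sort_over_group can then dispatch: the Sorter overload participates in overload
// resolution only when the callable looks like a sorter; otherwise the Compare
// overload is chosen. (Illustrative only, not the real header code.)
template <typename Group, typename ValT, typename Sorter>
std::enable_if_t<is_sorter<Sorter, Group, ValT>::value, ValT>
sort_over_group(Group g, ValT value, Sorter sorter) {
  return sorter(g, value);
}
```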

### Level Zero

To implement the `memory_required` methods of the sorters we need to know
how much temporary memory is required.
However, we don't have information about how much memory the backend compiler needs.
That's why we need a Level Zero function that calls into the backend and
provides the actual value to the SYCL code.

Required interfaces:
```cpp
// Returns whether the default work-group or sub-group sort is present in the builtins.
virtual bool DefaultGroupSortSupported(GroupSortMemoryScope::MemoryScope_t scope,
                                       GroupSortKeyType::KeyType_t keyType,
                                       bool isKeyValue,
                                       bool isJointSort) const;

// Returns the required amount of memory for the default joint work-group or sub-group
// sort devicelib builtin function, in bytes per work-group (or sub-group), >= 0,
// or -1 if the algorithm for the specified parameters is not implemented.
//
// totalItems -- number of elements to sort
// rangeSize  -- work-group or sub-group size, respectively
//
// For key-only sort pass valueTypeSizeInBytes = 0.
virtual long DefaultGroupJointSortMemoryRequired(GroupSortMemoryScope::MemoryScope_t scope,
                                                 long totalItems,
                                                 long rangeSize,
                                                 long keyTypeSizeInBytes,
                                                 long valueTypeSizeInBytes) const;

// Returns the required amount of memory for the default private-memory work-group or
// sub-group sort devicelib builtin function, in bytes per work-group (or sub-group), >= 0,
// or -1 if the algorithm for the specified parameters is not implemented.
//
// itemsPerWorkItem -- number of elements in the private array to sort
// rangeSize        -- work-group or sub-group size, respectively
//
// For key-only sort pass valueTypeSizeInBytes = 0.
virtual long DefaultGroupPrivateSortMemoryRequired(GroupSortMemoryScope::MemoryScope_t scope,
                                                   long itemsPerWorkItem,
                                                   long rangeSize,
                                                   long keyTypeSizeInBytes,
                                                   long valueTypeSizeInBytes) const;
```
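
A hedged sketch of how the SYCL runtime could use these queries (going through the Unified Runtime layer, as a reviewer notes above). The `driver` handle is hypothetical; the scope and key-type enumerations are taken from the interface above:

```cpp
// Decide whether the native builtin can be used and how much scratch memory to
// allocate for a joint work-group sort; -1 means "use the fallback library".
long QueryJointSortScratchSize(const BackendDriver& driver,  // hypothetical interface handle
                               GroupSortMemoryScope::MemoryScope_t scope,
                               GroupSortKeyType::KeyType_t keyType,
                               long totalItems, long rangeSize,
                               long keySizeInBytes, long valueSizeInBytes) {
  const bool isKeyValue = (valueSizeInBytes != 0);
  if (!driver.DefaultGroupSortSupported(scope, keyType, isKeyValue,
                                        /*isJointSort=*/true))
    return -1;  // no native support: link the fallback SPIR-V library instead
  return driver.DefaultGroupJointSortMemoryRequired(
      scope, totalItems, rangeSize, keySizeInBytes, valueSizeInBytes);
}
```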

### Fallback SPIR-V library
**Reviewer comment (Contributor):** It is expected that the device compiler implements these functions as part of some "extension", so that the SYCL RT can query whether there is native support for the functionality and link the fallback libraries if there is not. See the extension spec, which is more of a design doc.

What should the name of this "library"/"extension" be? Should there be several of them, so we can link in only those libraries which are actually used (in case they would be huge)?


If backend compilers can generate optimized implementations based on low-level instructions,
we need functions that they can recognize and optimize.
**Reviewer comment (Contributor):** @bader, @AlexeySachkov, and I were just talking about the "__devicelib" functions recently. I think we want to stop using these as the "contract" between DPC++ and IGC. In fact, the IGC team has complained that there is no formal specification for these "__devicelib" functions.

If we need to rely on optimized support in IGC, we should instead define a SPIR-V extension, and we should write a formal specification as we do for other SPIR-V extensions. This provides a more precise contract between DPC++ and IGC, and it also provides a formal specification that other backend vendors could implement if a third party wanted to implement an OpenCL (or even Level Zero) backend to DPC++.

**Reviewer comment (Contributor Author):** @gmlueck, does this relate to the sorting functions only, or to all functions in the device library, like cmath, complex, ...?

+@jinge90

**Reviewer comment (Contributor):** To all functions. The conversation that @bader, @AlexeySachkov, and I had earlier was about the existing usage in cmath, etc.

**Reviewer comment (Contributor Author):** I see. I'm afraid it requires a lot of effort to rewrite the API for IGC, the CPU backend, and other components. It's great that multiple teams have committed to making a lot of changes to their code.

**Reviewer comment (Contributor):** Is IGC currently providing implementations for these "__devicelib" functions, or are we relying on the fallback implementations?

**Reviewer comment (Contributor Author):** It was implemented in both IGC and the CPU backend a long time ago. They also have tests for this API.


If there are no implementations in a backend yet,
implementations from the fallback library will be called.

Interface for the library and backends:

```cpp
// for default sorting algorithm
void __devicelib_default_work_group_joint_sort_ascending_<encoded_param_types>(T* first, uint n, byte* scratch);

void __devicelib_default_work_group_joint_sort_descending_<encoded_param_types>(T* first, uint n, byte* scratch);

// for fixed-size arrays
//
// Reviewer comment (Contributor): the language extension spec does not mention any
// special handling for fixed-size arrays, so it is not clear where the built-ins in
// this section are going to be used - can this be clarified?

void __devicelib_default_work_group_private_sort_close_ascending_<encoded_param_types>(T* first, uint n, byte* scratch);

void __devicelib_default_work_group_private_sort_close_descending_<encoded_param_types>(T* first, uint n, byte* scratch);

void __devicelib_default_work_group_private_sort_spread_ascending_<encoded_param_types>(T* first, uint n, byte* scratch);

void __devicelib_default_work_group_private_sort_spread_descending_<encoded_param_types>(T* first, uint n, byte* scratch);

// for sub-groups
T __devicelib_default_sub_group_private_sort_ascending_<encoded_scalar_param_type>(T value);

T __devicelib_default_sub_group_private_sort_descending_<encoded_scalar_param_type>(T value);

// for key value sorting using the default algorithm
//
// Reviewer comment (Contributor): same as the previous comments: there is no mention
// of key-value sorting in the language spec, and some implementation detail is clearly
// implied here, though I don't understand which one. It should be explicitly spelled
// out that the high-level SYCL functions are built on top of these low-level functions
// using the following mapping ...

void __devicelib_default_work_group_joint_sort_ascending_<encoded_param_types>(T* keys_first, U* values_first, uint n, byte* scratch);

void __devicelib_default_work_group_joint_sort_descending_<encoded_param_types>(T* keys_first, U* values_first, uint n, byte* scratch);

// for key value sorting using fixed-size arrays
void __devicelib_default_work_group_private_sort_close_ascending_<encoded_param_types>(T* keys_first, U* values_first, uint n, byte* scratch);

void __devicelib_default_work_group_private_sort_close_descending_<encoded_param_types>(T* keys_first, U* values_first, uint n, byte* scratch);

void __devicelib_default_work_group_private_sort_spread_ascending_<encoded_param_types>(T* keys_first, U* values_first, uint n, byte* scratch);

void __devicelib_default_work_group_private_sort_spread_descending_<encoded_param_types>(T* keys_first, U* values_first, uint n, byte* scratch);

```

Notes:
- `T` and `U` are from the following list: `i8`, `i16`, `i32`, `i64`, `u8`, `u16`, `u32`, `u64`, `f16`, `f32`, `f64`.

  **Reviewer comment (Contributor):** For the functions which accept `T*` and `U*`, which combinations of types should be implemented in the fallback library? All 11×11, or only a subset?

- `encoded_param_types` is `T` prepended with `p1` for the global/private address space
and `p3` for shared local memory.

  **Reviewer comment (Contributor):** But not `U`?

- `first` is a pointer to the actual data to sort.
- The type of `n` (the number of elements) is `u32`.
- `keys_first` points to the "keys" for key-value sorting.
"Keys" are compared and moved during sorting.
- `scratch` is temporary storage (local or global) that can be used by backends.
The type of `scratch` is always `byte*`.
- `values_first` points to the "values" for key-value sorting. "Values" are only moved
to follow the order of the corresponding "keys" during sorting.

Examples:
```cpp
void __devicelib_default_work_group_joint_sort_ascending_p1i32_u32_p3i8(int* first, uint n, byte* scratch);
// Reviewer comment (Contributor): We have p1/p3 in the "mangling" but the actual
// operands are in the generic address space - is that expected?
// Also, an idea to discuss: we use the same device compiler on Linux and Windows, so
// the C++ mangling is stable. Could we use a normal C++ template with "extern template"
// to avoid manual mangling?

void __devicelib_default_work_group_joint_sort_descending_p1u32_u32_p1i8(uint* first, uint n, byte* scratch);
void __devicelib_default_work_group_joint_sort_ascending_p3u32_p3u32_u32_p1i8(uint* first_keys, uint* first_values, uint n, byte* scratch);
void __devicelib_default_work_group_private_sort_close_ascending_p1u32_p1u32_u32_p1i8(uint* first_keys, uint* first_values, uint n, byte* scratch);
double __devicelib_default_sub_group_private_sort_ascending_f64(double value);
```
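
To make the mapping a reviewer asks about above more concrete, here is a hypothetical sketch (not the actual header implementation) of how a high-level ascending `joint_sort` of `int` keys in global memory could lower to the corresponding builtin; `uint`/`byte` are spelled as fixed-width types here:

```cpp
#include <cstdint>

// Declaration matching the builtin listed above; the backend provides a native
// body, otherwise the fallback SPIR-V library supplies one.
extern "C" void __devicelib_default_work_group_joint_sort_ascending_p1i32_u32_p3i8(
    int* first, uint32_t n, int8_t* scratch);

// Hypothetical lowering in the DPC++ headers for joint_sort with int keys,
// an ascending comparator, and a local-memory scratchpad.
inline void joint_sort_int_ascending(int* first, int* last, int8_t* scratch) {
  __devicelib_default_work_group_joint_sort_ascending_p1i32_u32_p3i8(
      first, static_cast<uint32_t>(last - first), scratch);
}
```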

## Alternative Design

If it turns out that backends cannot provide specific improvements (e.g. special
instructions, hardware dispatch) compared to high-level SYCL code, then the
implementations of the sorting functions can be placed in the DPC++ headers
(no hardware backend or Level Zero support would be needed in that case).
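
As a rough illustration of this alternative, a work-group sort can be written entirely in SYCL, e.g. an odd-even transposition sort. This is only a sketch: real header implementations would pick a faster algorithm, and the function name is illustrative:

```cpp
#include <sycl/sycl.hpp>
#include <utility>

// Odd-even transposition sort of `n` elements shared by the whole work-group.
// Every work-item handles a strided subset of adjacent pairs in each phase.
template <typename Group, typename T, typename Compare>
void headers_only_joint_sort(Group g, T* first, size_t n, Compare comp) {
  const size_t lid = g.get_local_linear_id();
  const size_t local_size = g.get_local_linear_range();
  for (size_t phase = 0; phase < n; ++phase) {
    // Even phases compare pairs (0,1), (2,3), ...; odd phases (1,2), (3,4), ...
    for (size_t i = 2 * lid + (phase % 2); i + 1 < n; i += 2 * local_size) {
      if (comp(first[i + 1], first[i]))
        std::swap(first[i], first[i + 1]);
    }
    sycl::group_barrier(g);  // all pairs of a phase must finish before the next one
  }
}
```
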
1 change: 1 addition & 0 deletions sycl/doc/index.rst
@@ -54,6 +54,7 @@ Design Documents for the oneAPI DPC++ Compiler
design/SYCLNativeCPU
design/CommandGraph
design/OffloadDesign
design/GroupSort
New OpenCL Extensions <https://github.com/intel/llvm/tree/sycl/sycl/doc/design/opencl-extensions>
New SPIR-V Extensions <https://github.com/intel/llvm/tree/sycl/sycl/doc/design/spirv-extensions>
