-
Notifications
You must be signed in to change notification settings - Fork 738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL] [DOC] Group sorting algorithm design review #11974
base: sycl
Are you sure you want to change the base?
Conversation
Signed-off-by: Fedorov, Andrey <andrey.fedorov@intel.com>
@andreyfe1 can you please fix this error?
|
Hi @intel/dpcpp-doc-reviewers, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a handful of nits, otherwise it looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not ready to approve it, even though I'm only a codeowner of index.rst
All discussions are done within #3754. So, this PR just needs to be merged.
I do not see any approvals in that old PR. Tagging @gmlueck, @aelovikov-intel, @jinge90 to make a one more review pass and confirm that they agree with the proposed design
|
||
- Fallback implementation in case if backends don't have more optimized implementations yet. | ||
|
||
- Level Zero extension for `memory_required` functions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not directly call into Level Zero from SYCL RT, but instead we go through unified runtime. Therefore, if we need a new API to do queries to low-level runtimes, then UR should also be updated
long valueTypeSizeInBytes) const; | ||
``` | ||
|
||
### Fallback SPIR-V library |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is expected that device compiler implements those functions as part of some "extension" so that SYCL RT can query if there is native support for that functionality and link fallback libraries if there is not. See extension spec, which is more of a design doc.
What should be the name of this "library"/"extension"? Should there be several of them so we can only link-in those libraries which are actually used (in case they would be huge)?
|
||
void __devicelib_default_work_group_joint_sort_descending_<encoded_param_types>(T* first, uint n, byte* scratch); | ||
|
||
// for fixed-size arrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
language extension spec does not mention any special handling for fixed-size arrays and therefore it is not clear to me where built-ins from this section are going to be used - can it be clarified?
|
||
T __devicelib_default_sub_group_private_sort_descending_<encoded_scalar_param_type>(T value); | ||
|
||
// for key value sorting using the default algorithm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as previous comments: there is no mention of key-value sorting in the language spec and some implementation detail is clearly implied here, even though I don't understand which one - I think that it should be explicitly spelled out that high-level SYCL functions are built on top these low-level functions using the following mapping ...
Notes: | ||
- `T`, `U` are from the following list `i8`, `i16`, | ||
`i32`, `i64`, `u8`, `u16`, `u32`, `u64`, `f16`, `f32`, `f64`. | ||
- `encoded_param_types` is `T` prepended with `p1` for global/private address |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But not U
?
``` | ||
|
||
Notes: | ||
- `T`, `U` are from the following list `i8`, `i16`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For functions which accept T*
and U*
which combinations of different types should be implemented in the fallback library? All 11*11, or only some sub-set?
|
||
Examples: | ||
```cpp | ||
void __devicelib_default_work_group_joint_sort_ascending_p1i32_u32_p3i8(int* first, uint n, byte* scratch); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have p1
/p3
in the "mangling" but actual operands are generic address space, is that expected?
Also, just an idea to discuss, we use the same device compiler on Lin/Win so the C++ mangling is stable. Can we use a normal C++ template with "extern template" to avoid manual mangling?
### Fallback SPIR-V library | ||
|
||
If backend compilers can generate optimized implementations based on low-level instructions, | ||
we need a function that they can take and optimize. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bader, @AlexeySachkov, and I were just talking about the "__devicelib" functions recently. I think we want to stop using these as the "contract" between DPC++ and IGC. In fact, the IGC team has complained that there is no formal specification for these "__devicelib" functions.
If we need to rely on optimized support in IGC, we should instead define a SPIR-V extension, and we should write a formal specification as we do for other SPIR-V extensions. This provides a more precise contract between DPC++ and IGC, and it also provides a formal specification that other backend vendors could implement if a third party wanted to implement an OpenCL (or even Level Zero) backend to DPC++.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To all functions. The conversation that @bader, @AlexeySachkov, and I had earlier was about the existing usage in cmath, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I'm afraid it requires a lot of efforts to rewrite API for IGC, CPU backend, and other components. That's great that multiple teams have committed to make a lot of changes for their code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is IGC currently providing implementations for these "__devicelib" functions, or are we relying on the fallback implementations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was implemented in IGC and CPU backend both long time ago. They also have tests for such API
Replacing accidentally closed #3754.
All discussions are done within #3754. So, this PR just needs to be merged.
Signed-off-by: Fedorov, Andrey andrey.fedorov@intel.com