[ESIMD] Optimize the simd stride constructor #12553
Conversation
✅ With the latest revision this PR passed the C/C++ code formatter.
simd(base, stride) calls previously were lowered into a long sequence of INSERT and ADD operations. That sequence is replaced with a vector equivalent:

vbase = broadcast base
vstride = broadcast stride
vstride_coef = {0, 1, 2, 3, ... N-1}
vec_result = vbase + vstride * vstride_coef;

Signed-off-by: Klochkov, Vyacheslav N <vyacheslav.n.klochkov@intel.com>
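As a scalar model of the vector lowering described above (a sketch only: the function name `make_stride_vector` is hypothetical, and `std::array` stands in for the backend vector type on which these become single vector ADD/MUL instructions):

```cpp
#include <array>

// Hypothetical scalar model of the new lowering:
// vec_result = vbase + vstride * {0, 1, ..., N-1}
template <typename T, int N>
constexpr std::array<T, N> make_stride_vector(T base, T stride) {
  std::array<T, N> vbase{};        // broadcast base
  std::array<T, N> vstride_coef{}; // coefficient vector {0, 1, ..., N-1}
  for (int i = 0; i < N; ++i) {
    vbase[i] = base;
    vstride_coef[i] = static_cast<T>(i);
  }
  std::array<T, N> result{};
  // On the real vector type this is one vector MUL plus one vector ADD,
  // instead of N-1 scalar INSERT/ADD pairs.
  for (int i = 0; i < N; ++i)
    result[i] = static_cast<T>(vbase[i] + stride * vstride_coef[i]);
  return result;
}
```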
std::index_sequence<Is...>) {
  return vector_type_t<T, N>{(T)(Base + ((T)Is) * Stride)...};
constexpr auto make_vector_impl(T Base, T Stride, std::index_sequence<Is...>) {
  using CppT = typename element_type_traits<T>::EnclosingCppT;
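A minimal self-contained analogue of the `std::index_sequence` pack expansion shown in the diff (substituting `std::array` for the backend `vector_type_t` and omitting the `element_type_traits` machinery, so this is a sketch, not the actual ESIMD code):

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Expands to {Base + 0*Stride, Base + 1*Stride, ..., Base + (N-1)*Stride}
// in a single braced-init-list, with no runtime loop.
template <typename T, std::size_t N, std::size_t... Is>
constexpr std::array<T, N> make_vector_impl(T Base, T Stride,
                                            std::index_sequence<Is...>) {
  return {static_cast<T>(Base + static_cast<T>(Is) * Stride)...};
}

template <typename T, std::size_t N>
constexpr std::array<T, N> make_vector(T Base, T Stride) {
  return make_vector_impl<T, N>(Base, Stride, std::make_index_sequence<N>{});
}
```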
I remember you considering optimizing this for low values of N, did that end up not being worth it?
I did the initial research for float types and found such tuning not worth it. To answer your question here and to show the IR, I used the int type this time and found 2 cases where the old code is 1 instruction faster/shorter:
| Type | Old {num math ops : ops} | New {num math ops : ops} |
|---|---|---|
| simd<int, 1> | 0: | 0: |
| simd<int, 2> * | 1: 1xADD | 2: 1xADD, 1xMUL |
| simd<int, 3> * | 3: 2xADD, 1xSHL | 4: 2xADD, 2xMUL (it split the 3-elem vec into a 2-elem vec + 1-elem vec) |
| simd<int, 4> | 5: 3xADD, 1xSHL, 1xMUL | 2: 1xADD, 1xMUL |
| simd<float, 1> | 0: | 0: |
| simd<float, 2> | 1: 1xADD | 1: 1xMAD |
| simd<float, 3> | 2: 2xADD | 2: 2xMAD (3-elem vector ops were split -> 2-elem + 1-elem) |
| simd<float, 4> | 3: 3xADD | 1: 1xMAD |
I added a few lines of code to tune for integral types and N <= 3: ea002b5
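A hedged sketch of what such a small-N tuning branch could look like (hypothetical function in plain C++ with `std::array`, not the actual commit): for integral element types with N <= 3 it falls back to the additive running-sum sequence, which per the table above is one instruction shorter, and uses the vector formula otherwise.

```cpp
#include <array>
#include <cstddef>
#include <type_traits>

// Hypothetical sketch: pick the lowering based on element type and N.
template <typename T, std::size_t N>
constexpr std::array<T, N> make_stride_vector_tuned(T base, T stride) {
  std::array<T, N> result{};
  if constexpr (std::is_integral_v<T> && N <= 3) {
    // Old-style running sum: N-1 ADDs, no MUL.
    T acc = base;
    for (std::size_t i = 0; i < N; ++i) {
      result[i] = acc;
      acc = static_cast<T>(acc + stride);
    }
  } else {
    // New-style vector formula: base + i * stride (one MUL + one ADD,
    // or a single MAD for float, on the real vector type).
    for (std::size_t i = 0; i < N; ++i)
      result[i] = static_cast<T>(base + static_cast<T>(i) * stride);
  }
  return result;
}
```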
The old sequence produces 1 fewer instruction in the final GPU code. Signed-off-by: Klochkov, Vyacheslav N <vyacheslav.n.klochkov@intel.com>