Re-implement SYCL backend parallel_for to improve bandwidth utilization #1976

Open

mmichel11 wants to merge 47 commits into main from dev/mmichel11/parallel_for_vectorize

Conversation

@mmichel11 commented Dec 19, 2024

High Level Description
This PR improves hardware bandwidth utilization of oneDPL's SYCL backend parallel_for pattern through two ideas:

  • Process multiple input iterations per work-item, which involves a switch to an nd_range kernel combined with a sub-group / work-group strided indexing approach.
  • To generate wide loads for small types, implement a path that vectorizes loads / stores by processing adjacent indices within a single work-item. This is combined with the above approach to maximize hardware bandwidth utilization. Vectorization is only applied to fundamental types smaller than 4 bytes (e.g. uint16_t, uint8_t) in a contiguous container. (A sketch of the combined strategy follows this list.)
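
Roughly, the combined strategy looks like the following simplified sketch. The stride here is taken as a parameter and the names (__process, the += 1 operation) are illustrative only; the actual kernel derives the stride from sub-group / work-group geometry:

    #include <cstddef>
    #include <cstdint>
    #include <sycl/sycl.hpp>

    constexpr std::size_t __vec_size = 4;       // adjacent elements per iteration
    constexpr std::size_t __iters_per_item = 8; // strided iterations per work-item

    void
    __process(const sycl::nd_item<1>& __item, std::uint16_t* __data, std::size_t __n, std::size_t __stride)
    {
        const std::size_t __base = __item.get_global_linear_id() * __vec_size;
        for (std::size_t __i = 0; __i != __iters_per_item; ++__i)
        {
            const std::size_t __idx = __base + __i * __stride;
            // Full path: the bounds check is hoisted out of the inner loop so the
            // adjacent accesses below can compile to a single wide load / store.
            if (__idx + __vec_size <= __n)
                for (std::size_t __j = 0; __j != __vec_size; ++__j)
                    __data[__idx + __j] += 1;
        }
    }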

Implementation Details

  • Parallel for bricks have been reworked in the following manner:
    • Each brick contains a pack of ranges within its template parameters to define tuning parameters.
    • The following static integral members are defined (implemented with inheritance):
      • __can_vectorize
      • __preferred_vector_size (1 if __can_vectorize is false)
      • __preferred_iters_per_item
    • The following public member functions are defined (see the sketch after this list):
      • __scalar_path (explicitly called for small input sizes)
      • __vector_path (optional for algorithms that are not vectorizable, e.g. binary_search)
      • An overloaded function call operator which dispatches to the appropriate strategy
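
As a rough illustration of this brick layout: the static member and member function names below follow the list above, while the hard-coded values, the walk1_like brick, and the base-class logic are assumptions rather than the PR's actual code:

    #include <cstdint>

    template <typename... _Ranges>
    struct walk_vector_or_scalar_base
    {
        // In the real implementation this is derived from the ranges
        // (contiguous containers of small fundamental types); hard-coded here.
        constexpr static bool __can_vectorize = true;
        constexpr static std::uint8_t __preferred_vector_size = __can_vectorize ? 4 : 1;
        constexpr static std::uint8_t __preferred_iters_per_item = 8;
    };

    template <typename _F, typename _Range>
    struct walk1_like : walk_vector_or_scalar_base<_Range>
    {
        using __base_t = walk_vector_or_scalar_base<_Range>;
        _F __f;

        template <typename _IsFull, typename _ItemId>
        void
        __scalar_path(_IsFull, _ItemId __idx, _Range __rng) const
        {
            __f(__rng[__idx]);
        }

        template <typename _IsFull, typename _ItemId>
        void
        __vector_path(_IsFull, _ItemId /*__idx*/, _Range /*__rng*/) const
        {
            // ...wide loads / stores over __base_t::__preferred_vector_size elements...
        }

        // Dispatches to the appropriate strategy at compile time.
        template <typename _IsFull, typename _ItemId>
        void
        operator()(_IsFull __is_full, _ItemId __idx, _Range __rng) const
        {
            if constexpr (__base_t::__can_vectorize)
                __vector_path(__is_full, __idx, __rng);
            else
                __scalar_path(__is_full, __idx, __rng);
        }
    };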

To implement this approach, the parallel_for kernel rewrite from #1870 was adopted, with additional changes to handle the vectorization paths. Additionally, generic vectorization and strided-loop utilities have been defined with the intention that they be applicable to other portions of the codebase as well. Tests have been expanded to ensure coverage of the vectorization paths.

This PR supersedes #1870. Initially, the plan was to merge this PR into #1870, but after comparing the diffs, I believe the most straightforward approach is to target this directly to main.

@mmichel11 added this to the 2022.8.0 milestone on Dec 19, 2024
@mmichel11 marked this pull request as ready for review on December 19, 2024 19:17
@mmichel11 changed the title from "[Draft] Re-implement SYCL backend parallel_for to improve bandwidth utilization" to "Re-implement SYCL backend parallel_for to improve bandwidth utilization" on Dec 19, 2024
After inspecting the assembly, 128-byte memory operations are performed instead of 512-byte ones. Processing 512 bytes per sub-group still seems to be the best value after experimentation.

…ute work for small inputs

This reverts commit e4cbceb. Small sizes are slightly slower, and for horizontal vectorization no "real" benefit is observed.
Small but measurable overheads can be observed for small inputs, where runtime dispatch in the kernel is present to check for the correct path to take. Letting the compiler handle the small-input case in the original kernel shows the best performance.

We now flatten the user-provided ranges and find the minimum-sized type to estimate the best __iters_per_work_item. This benefits performance in calls that wrap multiple buffers in a single input / output through a zip_iterator (e.g. dpct::scatter_if in the SYCLomatic compatibility headers).

…t that for pattern launches exactly n work items

Due to the revert of the vectorization path, the original test provides sufficient coverage.

@mmichel11 force-pushed the dev/mmichel11/parallel_for_vectorize branch from 085eaf5 to 505bdf3 on December 19, 2024 22:13
{
template <typename _Tp>
void
operator()(__lazy_ctor_storage<_Tp> __storage) const
Why do you pass the __storage parameter by value?

__par_backend_hetero::access_mode::read_write>(
__tag, ::std::forward<_ExecutionPolicy>(__exec), __first1, __last1, __first2, __f);
auto __n = __last1 - __first1;
if (__n <= 0)
In which case can __n < 0 actually be true?


// Path that intentionally disables vectorization for algorithms with a scattered access pattern (e.g. binary_search)
template <typename... _Ranges>
class walk_scalar_base
Why is walk_scalar_base declared as a class, while

template <typename _ExecutionPolicy, typename _F, typename _Range>
struct walk1_vector_or_scalar : public walk_vector_or_scalar_base<_Range>

is declared as a struct?

void
__scalar_path(_IsFull, const _ItemId __idx, _Range __rng) const
{

The empty line probably isn't required here.

__vector_path(_IsFull __is_full, const _ItemId __idx, _Range __rng) const
{
// This is needed to enable vectorization
auto __raw_ptr = __rng.begin();

  1. I think __raw_ptr isn't a very good name: begin() is usually associated with an iterator, while "raw" usually suggests a pointer.
  2. Do we really need the local variable __raw_ptr here? Can we pass __rng.begin() directly into the __vector_walk call?

@SergeyKopienko commented Dec 23, 2024

So now we have 3 entities with a defined constexpr static bool __can_vectorize:

  1. class walk_vector_or_scalar_base
  2. class walk_scalar_base
  3. struct __brick_shift_left

Do these constexpr variables really have different semantics?

And if the semantics of these entities are the same, maybe it makes sense to do some re-design to have only one entity __can_vectorize?

@SergeyKopienko

In some respects the implementation details remind me of the tag-dispatching designed by @rarutyun, but with some differences: for example, walk2_vectors_or_scalars carries not only the information about whether vectorization or parallelization should be executed, but also two variants of the functional code and an operator() with a compile-time condition check to run one code path or the other.

But what if, instead of two different functions

    template <typename _IsFull, typename _ItemId>
    void
    __vector_path(_IsFull __is_full, const _ItemId __idx, _Range __rng) const
    {
        // This is needed to enable vectorization
        auto __raw_ptr = __rng.begin();
        oneapi::dpl::__par_backend_hetero::__vector_walk<__base_t::__preferred_vector_size>{__n}(__is_full, __idx, __f,
                                                                                                 __raw_ptr);
    }

    // _IsFull is ignored here. We assume that boundary checking has been already performed for this index.
    template <typename _IsFull, typename _ItemId>
    void
    __scalar_path(_IsFull, const _ItemId __idx, _Range __rng) const
    {

        __f(__rng[__idx]);
    }

we had two functions with the same name and signature, except for the first parameter type, which would be used as a tag?

Please take a look at __parallel_policy_tag_selector_t for details.
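
For illustration, the suggestion applied to the two quoted functions might look like this. The tag types and the unified name __walk_path are hypothetical; __f, __n, __base_t, and _Range are the members and parameters of the enclosing brick from the quoted snippet:

    struct __vector_path_tag {};
    struct __scalar_path_tag {};

    template <typename _IsFull, typename _ItemId>
    void
    __walk_path(__vector_path_tag, _IsFull __is_full, const _ItemId __idx, _Range __rng) const
    {
        // Same body as __vector_path above; the tag selects the overload.
        auto __raw_ptr = __rng.begin();
        oneapi::dpl::__par_backend_hetero::__vector_walk<__base_t::__preferred_vector_size>{__n}(
            __is_full, __idx, __f, __raw_ptr);
    }

    template <typename _IsFull, typename _ItemId>
    void
    __walk_path(__scalar_path_tag, _IsFull, const _ItemId __idx, _Range __rng) const
    {
        __f(__rng[__idx]);
    }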

@@ -784,6 +785,32 @@ union __lazy_ctor_storage
}
};

// Utility to explicitly call the destructor of __lazy_ctor_storage as a callback functor
struct __lazy_ctor_storage_deleter

Probably I don't understand something, but why does this struct have "lazy" in its name?
It looks like some kind of visitor-pattern implementation that calls the destructor for each element in a container.
What exactly is lazy here?

@danhoeflinger commented Dec 31, 2024
I believe it is a callable deleter for __lazy_ctor_storage, which is storage with a delayed ("lazy") constructor. Perhaps it would be better to instead add a static member function to the __lazy_ctor_storage union, get_deleter_callable(), which returns a lambda that deletes a __lazy_ctor_storage& passed as an argument. This would remove any confusion and group these together.
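
A minimal sketch of that suggestion; get_deleter_callable is hypothetical and the union's real members are elided:

    template <typename _Tp>
    union __lazy_ctor_storage
    {
        _Tp __v;
        // ... existing construction members ...

        static auto
        get_deleter_callable()
        {
            // Returns a callable that explicitly destroys the storage passed by reference.
            return [](__lazy_ctor_storage<_Tp>& __storage) { __storage.__v.~_Tp(); };
        }
    };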

void
operator()(std::false_type, _IdxType __start_idx, _LoadOp __load_op, _Acc... __acc) const
{
std::uint8_t __elements = std::min(std::size_t{__vec_size}, std::size_t{__n - __start_idx});
Do we assume here that the result of std::min(std::size_t{}, std::size_t{}) will always fit into the std::uint8_t type?

I think it makes sense: __vec_size is 4 or less, but __n - __start_idx can only be assumed to fit within size_t (and you don't want to overflow before the min). The result will be 4 or less, which fits in 8 bits.
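
A worked example of that reasoning (the values are illustrative):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>

    // The min is computed entirely in std::size_t, so __n - __start_idx cannot
    // overflow a narrower type before being clamped to __vec_size (<= 4).
    constexpr std::size_t __vec_size = 4;
    constexpr std::size_t __n = 1000000;
    constexpr std::size_t __start_idx = 999998;
    constexpr std::uint8_t __elements =
        std::min(std::size_t{__vec_size}, std::size_t{__n - __start_idx}); // == 2, fits in 8 bits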

@SergeyKopienko

One more point: __vector_path and __scalar_path tell me about some path, but not about an implementation.
Maybe it would be better to rename them to ..._impl?

@danhoeflinger left a comment

First round of review. I've not gotten to all the details yet, but this is enough to be interesting.

Comment on lines +808 to +809
template <template <typename...> typename _WrapperType, typename... _Ts>
struct __min_nested_type_size<_WrapperType<_Ts...>>
I wonder if this formulation leaves us open to bugs in the future, with no restrictions on what _WrapperType could be.
What we probably want is something like tuple-like from C++23.

Would we be better off limiting this to std::tuple and oneDPL's tuple with explicit partial specializations? Or limiting it via some enable_if magic?

Right now any templated type is reduced to its template arguments, which isn't always correct. Imagine a contrived user-provided type for an input range that has a template argument which isn't used as a member field:

template <typename T>
struct __my_converting_type
{
    std::uint8_t var;
    T get_conversion() { return T{var}; }
};

This would match the _WrapperType flavor, I think, and return the wrong result if I understand the intention correctly. We would want such a type to use sizeof.

Godbolt link
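
For concreteness, the explicit-specialization variant might look like this. Only the __min_nested_type_size name comes from the diff; the `value` member and the exact formulation are assumptions:

    #include <algorithm>
    #include <cstddef>
    #include <tuple>

    // Fallback: a non-tuple type contributes its own size.
    template <typename _T>
    struct __min_nested_type_size
    {
        constexpr static std::size_t value = sizeof(_T);
    };

    // Match only the tuple types we actually flatten, instead of any template.
    template <typename... _Ts>
    struct __min_nested_type_size<std::tuple<_Ts...>>
    {
        constexpr static std::size_t value = std::min({__min_nested_type_size<_Ts>::value...});
    };
    // ...plus an analogous specialization for oneDPL's internal tuple.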

Comment on lines +132 to +136
// To ensure that the large submitter gets tested on all devices, set the switch point to 10,000 only when compiling
// oneDPL tests.
#if TEST_FOR_ALGORITHM_LARGE_SUBMITTER
return 10000;
#else
I think we try to avoid letting testing-specific code seep into the main repo, though I understand the need here to gain coverage.
Could we instead add one large test to the "normal" test suite which would hit the large submitter, enabled only under the same circumstances? I understand the desire to limit the test time of the suite, but this change both infects the main repo with test specifics and adds coverage of this code in situations it will never encounter in the wild, while not covering any real sizes.

I'd really prefer not to, but if we do have to have this, I'd suggest uglifying the name.

const std::uint32_t __sub_group_size = __sub_group.get_local_linear_range();
const std::uint32_t __sub_group_id = __sub_group.get_group_linear_id();
const std::uint32_t __sub_group_local_id = __sub_group.get_local_linear_id();
const std::size_t __work_group_id = __item.get_group().get_group_linear_id();
Seems like we could move this out of the branch and use it on both sides.

Comment on lines +91 to +94
static inline std::tuple<std::size_t, std::size_t, bool>
__stride_recommender(const sycl::nd_item<1>& __item, std::size_t __count, std::size_t __iters_per_work_item,
std::size_t __adj_elements_per_work_item, std::size_t __work_group_size)
{
Is this a general utility which might be useful for other commutative operations beyond just parallel_for, or is there a reason you believe it to be specific to this algorithm / kernel?

If we think it might be useful, we could lift it to a general utility level. Obviously we don't need to incorporate it elsewhere in this PR. An alternative is to file an issue to explore this and only lift it if we find utility.


std::forward<_Range2>(__result))
unseq_backend::walk2_vectors_or_scalars<_ExecutionPolicy, _CopyBrick, std::decay_t<_Range1>,
std::decay_t<_Range2>>{
{}, _CopyBrick{}, static_cast<std::size_t>(__n)},
I really dislike having to pass {} as the first argument here. I'm not sure I even really understand why it's necessary; is it for the base class?

Can we just define constructors which accept only the brick and size, to avoid this issue?
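
A sketch of such a constructor; the member names are guesses based on the call site above, and the base is assumed default-constructible:

    #include <cstddef>
    #include <utility>

    template <typename _ExecutionPolicy, typename _Brick, typename _Range1, typename _Range2>
    struct walk2_vectors_or_scalars : walk_vector_or_scalar_base<_Range1, _Range2>
    {
        _Brick __brick;
        std::size_t __n;

        // Accepts only the brick and size; the empty base is default-initialized.
        walk2_vectors_or_scalars(_Brick __b, std::size_t __count)
            : __brick(std::move(__b)), __n(__count)
        {
        }
    };

    // The call site then drops the leading {}:
    //     walk2_vectors_or_scalars<...>{_CopyBrick{}, static_cast<std::size_t>(__n)}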

struct custom_brick
#if _ONEDPL_BACKEND_SYCL
template <typename Comp, typename T, typename _Range, search_algorithm func>
struct custom_brick : oneapi::dpl::unseq_backend::walk_scalar_base<_Range>
Let's fix the naming of this while we're touching all its instances: __custom_brick.

void
__scalar_path(_IsFull, const _Idx __idx, const _Range1 __rng1, _Range2 __rng2) const
{

Suggested change: remove the empty line.
auto __raw_ptr3 = __rng3.begin();

oneapi::dpl::__internal::__lazy_ctor_storage<_ValueType1> __rng1_vector[__base_t::__preferred_vector_size];
oneapi::dpl::__internal::__lazy_ctor_storage<_ValueType2> __rng2_vector[__base_t::__preferred_vector_size];
@danhoeflinger commented Dec 31, 2024

I think it should be possible to combine the walk*_vectors_or_scalars structs with some complicated fold expressions, lambdas, tuples, and std::apply.

Take a look at the first answer of https://stackoverflow.com/questions/7230621/how-can-i-iterate-over-a-packed-variadic-template-argument-list. I think you could do something similar, chaining instructions together by returning tuples and then using std::apply.

Here is an example I was playing with: https://godbolt.org/z/vc8dK4ed6

In the end, I'm not sure whether (1) it's actually possible and (2) it's worth the complexity to consolidate these structs, but it's worth considering...
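
For reference, the core of that pattern can be as small as the following (the names are mine, not from the PR): apply an operation to every element of a pack, collecting the results in a tuple that can be fed into the next stage with another std::apply.

    #include <tuple>
    #include <utility>

    template <typename _Op, typename... _Accs>
    auto
    __apply_to_pack(_Op __op, std::tuple<_Accs...> __accs)
    {
        // Expands __op over the pack and re-packs the results for the next stage.
        return std::apply([&](auto&... __acc) { return std::make_tuple(__op(__acc)...); }, __accs);
    }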
