Re-implement SYCL backend parallel_for to improve bandwidth utilization #1976

Open

mmichel11 wants to merge 47 commits into main from dev/mmichel11/parallel_for_vectorize

Conversation

@mmichel11 commented Dec 19, 2024

High Level Description
This PR improves hardware bandwidth utilization of oneDPL's SYCL backend parallel_for pattern through two ideas:

  • Process multiple input iterations per work-item, which involves a switch to an nd_range kernel combined with a sub-group / work-group strided indexing approach.
  • To generate wide loads for small types, implement a path that vectorizes loads / stores by processing adjacent indices within a single work-item. This is combined with the above approach to maximize hardware bandwidth utilization. Vectorization is only applied to fundamental types smaller than 4 bytes (e.g. uint16_t, uint8_t) in a contiguous container. (A sketch of the combined strategy follows this list.)
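
Roughly, the combined strategy looks like the following simplified sketch. The stride here is taken as a parameter and the names (__process, the += 1 operation) are illustrative only; the actual kernel derives the stride from sub-group / work-group geometry:

    #include <cstddef>
    #include <cstdint>
    #include <sycl/sycl.hpp>

    constexpr std::size_t __vec_size = 4;       // adjacent elements per iteration
    constexpr std::size_t __iters_per_item = 8; // strided iterations per work-item

    void
    __process(const sycl::nd_item<1>& __item, std::uint16_t* __data, std::size_t __n, std::size_t __stride)
    {
        const std::size_t __base = __item.get_global_linear_id() * __vec_size;
        for (std::size_t __i = 0; __i != __iters_per_item; ++__i)
        {
            const std::size_t __idx = __base + __i * __stride;
            // Full path: the bounds check is hoisted out of the inner loop so the
            // adjacent accesses below can compile to a single wide load / store.
            if (__idx + __vec_size <= __n)
                for (std::size_t __j = 0; __j != __vec_size; ++__j)
                    __data[__idx + __j] += 1;
        }
    }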

Implementation Details

  • Parallel for bricks have been reworked in the following manner:
    • Each brick contains a pack of ranges within its template parameters to define tuning parameters.
    • The following static integral members are defined (implemented with inheritance):
      • __can_vectorize
      • __preferred_vector_size (1 if __can_vectorize is false)
      • __preferred_iters_per_item
    • The following public member functions are defined (see the sketch after this list):
      • __scalar_path (explicitly called for small input sizes)
      • __vector_path (optional for algorithms that are not vectorizable, e.g. binary_search)
      • An overloaded function call operator which dispatches to the appropriate strategy
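
As a rough illustration of this brick layout: the static member and member function names below follow the list above, while the hard-coded values, the walk1_like brick, and the base-class logic are assumptions rather than the PR's actual code:

    #include <cstdint>

    template <typename... _Ranges>
    struct walk_vector_or_scalar_base
    {
        // In the real implementation this is derived from the ranges
        // (contiguous containers of small fundamental types); hard-coded here.
        constexpr static bool __can_vectorize = true;
        constexpr static std::uint8_t __preferred_vector_size = __can_vectorize ? 4 : 1;
        constexpr static std::uint8_t __preferred_iters_per_item = 8;
    };

    template <typename _F, typename _Range>
    struct walk1_like : walk_vector_or_scalar_base<_Range>
    {
        using __base_t = walk_vector_or_scalar_base<_Range>;
        _F __f;

        template <typename _IsFull, typename _ItemId>
        void
        __scalar_path(_IsFull, _ItemId __idx, _Range __rng) const
        {
            __f(__rng[__idx]);
        }

        template <typename _IsFull, typename _ItemId>
        void
        __vector_path(_IsFull, _ItemId /*__idx*/, _Range /*__rng*/) const
        {
            // ...wide loads / stores over __base_t::__preferred_vector_size elements...
        }

        // Dispatches to the appropriate strategy at compile time.
        template <typename _IsFull, typename _ItemId>
        void
        operator()(_IsFull __is_full, _ItemId __idx, _Range __rng) const
        {
            if constexpr (__base_t::__can_vectorize)
                __vector_path(__is_full, __idx, __rng);
            else
                __scalar_path(__is_full, __idx, __rng);
        }
    };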

To implement this approach, the parallel_for kernel rewrite from #1870 was adopted, with additional changes to handle the vectorization paths. Additionally, generic vectorization and strided-loop utilities have been defined with the intention that they be applicable to other portions of the codebase as well. Tests have been expanded to ensure coverage of the vectorization paths.

This PR supersedes #1870. Initially, the plan was to merge this PR into #1870, but after comparing the diffs, I believe the most straightforward approach is to target this directly to main.

@mmichel11 added this to the 2022.8.0 milestone on Dec 19, 2024
@mmichel11 marked this pull request as ready for review on December 19, 2024 19:17
@mmichel11 changed the title from "[Draft] Re-implement SYCL backend parallel_for to improve bandwidth utilization" to "Re-implement SYCL backend parallel_for to improve bandwidth utilization" on Dec 19, 2024
After inspecting the assembly, 128-byte memory operations are performed instead of 512-byte ones. Processing 512 bytes per sub-group still seems to be the best value after experimentation.

…ute work for small inputs

This reverts commit e4cbceb. Small sizes are slightly slower, and for horizontal vectorization no "real" benefit is observed.
Small but measurable overheads can be observed for small inputs, where runtime dispatch in the kernel is present to check for the correct path to take. Letting the compiler handle the small-input case in the original kernel shows the best performance.

We now flatten the user-provided ranges and find the minimum-sized type to estimate the best __iters_per_work_item. This benefits performance in calls that wrap multiple buffers in a single input / output through a zip_iterator (e.g. dpct::scatter_if in the SYCLomatic compatibility headers).

…t that for pattern launches exactly n work items

Due to the revert of the vectorization path, the original test provides sufficient coverage.

@mmichel11 force-pushed the dev/mmichel11/parallel_for_vectorize branch from 085eaf5 to 505bdf3 on December 19, 2024 22:13
{
template <typename _Tp>
void
operator()(__lazy_ctor_storage<_Tp> __storage) const
Why do you pass the __storage parameter by value?

__par_backend_hetero::access_mode::read_write>(
__tag, ::std::forward<_ExecutionPolicy>(__exec), __first1, __last1, __first2, __f);
auto __n = __last1 - __first1;
if (__n <= 0)
In which case can __n < 0 actually be true?


// Path that intentionally disables vectorization for algorithms with a scattered access pattern (e.g. binary_search)
template <typename... _Ranges>
class walk_scalar_base
Why is walk_scalar_base declared as a class, while

template <typename _ExecutionPolicy, typename _F, typename _Range>
struct walk1_vector_or_scalar : public walk_vector_or_scalar_base<_Range>

is declared as a struct?

void
__scalar_path(_IsFull, const _ItemId __idx, _Range __rng) const
{

The empty line probably isn't required here.

__vector_path(_IsFull __is_full, const _ItemId __idx, _Range __rng) const
{
// This is needed to enable vectorization
auto __raw_ptr = __rng.begin();

  1. I think __raw_ptr isn't a very good name: begin() is usually associated with an iterator, while "raw" usually suggests a pointer.
  2. Do we really need the local variable __raw_ptr here? Can we pass __rng.begin() directly into the __vector_walk call?

@SergeyKopienko commented Dec 23, 2024

So now we have 3 entities with a defined constexpr static bool __can_vectorize:

  1. class walk_vector_or_scalar_base
  2. class walk_scalar_base
  3. struct __brick_shift_left

Do these constexpr variables really have different semantics?

And if the semantics of these entities are the same, maybe it makes sense to do some re-design to have only one entity __can_vectorize?

@SergeyKopienko

In some respects the implementation details remind me of the tag-dispatching designed by @rarutyun, but with some differences: for example, walk2_vectors_or_scalars carries not only the information about whether vectorization or parallelization should be executed, but also two variants of the functional code and an operator() with a compile-time condition check to run one code path or the other.

But what if, instead of two different functions

    template <typename _IsFull, typename _ItemId>
    void
    __vector_path(_IsFull __is_full, const _ItemId __idx, _Range __rng) const
    {
        // This is needed to enable vectorization
        auto __raw_ptr = __rng.begin();
        oneapi::dpl::__par_backend_hetero::__vector_walk<__base_t::__preferred_vector_size>{__n}(__is_full, __idx, __f,
                                                                                                 __raw_ptr);
    }

    // _IsFull is ignored here. We assume that boundary checking has been already performed for this index.
    template <typename _IsFull, typename _ItemId>
    void
    __scalar_path(_IsFull, const _ItemId __idx, _Range __rng) const
    {

        __f(__rng[__idx]);
    }

we had two functions with the same name and signature, except for the first parameter type, which would be used as a tag?

Please take a look at __parallel_policy_tag_selector_t for details.
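
For illustration, the suggestion applied to the two quoted functions might look like this. The tag types and the unified name __walk_path are hypothetical; __f, __n, __base_t, and _Range are the members and parameters of the enclosing brick from the quoted snippet:

    struct __vector_path_tag {};
    struct __scalar_path_tag {};

    template <typename _IsFull, typename _ItemId>
    void
    __walk_path(__vector_path_tag, _IsFull __is_full, const _ItemId __idx, _Range __rng) const
    {
        // Same body as __vector_path above; the tag selects the overload.
        auto __raw_ptr = __rng.begin();
        oneapi::dpl::__par_backend_hetero::__vector_walk<__base_t::__preferred_vector_size>{__n}(
            __is_full, __idx, __f, __raw_ptr);
    }

    template <typename _IsFull, typename _ItemId>
    void
    __walk_path(__scalar_path_tag, _IsFull, const _ItemId __idx, _Range __rng) const
    {
        __f(__rng[__idx]);
    }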

@@ -784,6 +785,32 @@ union __lazy_ctor_storage
}
};

// Utility to explicitly call the destructor of __lazy_ctor_storage as a callback functor
struct __lazy_ctor_storage_deleter

Probably I don't understand something, but why does this struct have "lazy" in its name?
It looks like some kind of visitor-pattern implementation that calls the destructor for each element in a container.
What exactly is lazy here?

@danhoeflinger commented Dec 31, 2024
I believe it is a callable deleter for __lazy_ctor_storage, which is storage with a delayed ("lazy") constructor. Perhaps it would be better to instead add a static member function to the __lazy_ctor_storage union, get_deleter_callable(), which returns a lambda that deletes a __lazy_ctor_storage& passed as an argument. This would remove any confusion and group these together.
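
A minimal sketch of that suggestion; get_deleter_callable is hypothetical and the union's real members are elided:

    template <typename _Tp>
    union __lazy_ctor_storage
    {
        _Tp __v;
        // ... existing construction members ...

        static auto
        get_deleter_callable()
        {
            // Returns a callable that explicitly destroys the storage passed by reference.
            return [](__lazy_ctor_storage<_Tp>& __storage) { __storage.__v.~_Tp(); };
        }
    };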

void
operator()(std::false_type, _IdxType __start_idx, _LoadOp __load_op, _Acc... __acc) const
{
std::uint8_t __elements = std::min(std::size_t{__vec_size}, std::size_t{__n - __start_idx});
Do we assume here that the result of std::min(std::size_t{}, std::size_t{}) will always fit into the std::uint8_t type?

I think it makes sense: __vec_size is 4 or less, but __n - __start_idx can only be assumed to fit within size_t (and you don't want to overflow before the min). The result will be 4 or less, which fits in 8 bits.
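
A worked example of that reasoning (the values are illustrative):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>

    // The min is computed entirely in std::size_t, so __n - __start_idx cannot
    // overflow a narrower type before being clamped to __vec_size (<= 4).
    constexpr std::size_t __vec_size = 4;
    constexpr std::size_t __n = 1000000;
    constexpr std::size_t __start_idx = 999998;
    constexpr std::uint8_t __elements =
        std::min(std::size_t{__vec_size}, std::size_t{__n - __start_idx}); // == 2, fits in 8 bits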

@SergeyKopienko

One more point: __vector_path and __scalar_path tell me about some path, but not about an implementation.
Maybe it would be better to rename them to ..._impl?

@danhoeflinger left a comment

First round of review. I've not gotten to all the details yet, but this is enough to be interesting.

Comment on lines +808 to +809
template <template <typename...> typename _WrapperType, typename... _Ts>
struct __min_nested_type_size<_WrapperType<_Ts...>>
I wonder if this formulation leaves us open to bugs in the future, with no restrictions on what _WrapperType could be.
What we probably want is something like tuple-like from C++23.

Would we be better off limiting this to std::tuple and oneDPL's tuple with explicit partial specializations? Or limiting it via some enable_if magic?

Right now any templated type is reduced to its template arguments, which isn't always correct. Imagine a contrived user-provided type for an input range that has a template argument which isn't used as a member field:

template <typename T>
struct __my_converting_type
{
    std::uint8_t var;
    T get_conversion() { return T{var}; }
};

This would match the _WrapperType flavor, I think, and return the wrong result if I understand the intention correctly. We would want such a type to use sizeof.

Godbolt link
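
For concreteness, the explicit-specialization variant might look like this. Only the __min_nested_type_size name comes from the diff; the `value` member and the exact formulation are assumptions:

    #include <algorithm>
    #include <cstddef>
    #include <tuple>

    // Fallback: a non-tuple type contributes its own size.
    template <typename _T>
    struct __min_nested_type_size
    {
        constexpr static std::size_t value = sizeof(_T);
    };

    // Match only the tuple types we actually flatten, instead of any template.
    template <typename... _Ts>
    struct __min_nested_type_size<std::tuple<_Ts...>>
    {
        constexpr static std::size_t value = std::min({__min_nested_type_size<_Ts>::value...});
    };
    // ...plus an analogous specialization for oneDPL's internal tuple.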

Comment on lines +132 to +136
// To ensure that the large submitter gets tested on all devices, set the switch point to 10,000 only when compiling
// oneDPL tests.
#if TEST_FOR_ALGORITHM_LARGE_SUBMITTER
return 10000;
#else
I think we try to avoid letting testing-specific code seep into the main repo, though I understand the need here to gain coverage.
Could we instead add one large test to the "normal" test suite which would hit the large submitter, enabled only under the same circumstances? I understand the desire to limit the test time of the suite, but this change both infects the main repo with test specifics and adds coverage of this code in situations it will never encounter in the wild, while not covering any real sizes.

I'd really prefer not to, but if we do have to have this, I'd suggest uglifying the name.

const std::uint32_t __sub_group_size = __sub_group.get_local_linear_range();
const std::uint32_t __sub_group_id = __sub_group.get_group_linear_id();
const std::uint32_t __sub_group_local_id = __sub_group.get_local_linear_id();
const std::size_t __work_group_id = __item.get_group().get_group_linear_id();
Seems like we could move this out of the branch and use it on both sides.

Comment on lines +91 to +94
static inline std::tuple<std::size_t, std::size_t, bool>
__stride_recommender(const sycl::nd_item<1>& __item, std::size_t __count, std::size_t __iters_per_work_item,
std::size_t __adj_elements_per_work_item, std::size_t __work_group_size)
{
Is this a general utility which might be useful for other commutative operations beyond just parallel_for, or is there a reason you believe it to be specific to this algorithm / kernel?

If we think it might be useful, we could lift it to a general utility level. Obviously we don't need to incorporate it elsewhere in this PR. An alternative is to file an issue to explore this and only lift it if we find utility.


std::forward<_Range2>(__result))
unseq_backend::walk2_vectors_or_scalars<_ExecutionPolicy, _CopyBrick, std::decay_t<_Range1>,
std::decay_t<_Range2>>{
{}, _CopyBrick{}, static_cast<std::size_t>(__n)},
I really dislike having to pass {} as the first argument here. I'm not sure I even really understand why it's necessary; is it for the base class?

Can we just define constructors which accept only the brick and size, to avoid this issue?
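
A sketch of such a constructor; the member names are guesses based on the call site above, and the base is assumed default-constructible:

    #include <cstddef>
    #include <utility>

    template <typename _ExecutionPolicy, typename _Brick, typename _Range1, typename _Range2>
    struct walk2_vectors_or_scalars : walk_vector_or_scalar_base<_Range1, _Range2>
    {
        _Brick __brick;
        std::size_t __n;

        // Accepts only the brick and size; the empty base is default-initialized.
        walk2_vectors_or_scalars(_Brick __b, std::size_t __count)
            : __brick(std::move(__b)), __n(__count)
        {
        }
    };

    // The call site then drops the leading {}:
    //     walk2_vectors_or_scalars<...>{_CopyBrick{}, static_cast<std::size_t>(__n)}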

struct custom_brick
#if _ONEDPL_BACKEND_SYCL
template <typename Comp, typename T, typename _Range, search_algorithm func>
struct custom_brick : oneapi::dpl::unseq_backend::walk_scalar_base<_Range>
Let's fix the naming of this while we're touching all its instances: __custom_brick.

void
__scalar_path(_IsFull, const _Idx __idx, const _Range1 __rng1, _Range2 __rng2) const
{

Suggested change: remove the empty line.
auto __raw_ptr3 = __rng3.begin();

oneapi::dpl::__internal::__lazy_ctor_storage<_ValueType1> __rng1_vector[__base_t::__preferred_vector_size];
oneapi::dpl::__internal::__lazy_ctor_storage<_ValueType2> __rng2_vector[__base_t::__preferred_vector_size];
@danhoeflinger commented Dec 31, 2024

I think it should be possible to combine the walk*_vectors_or_scalars structs with some complicated fold expressions, lambdas, tuples, and std::apply.

Take a look at the first answer of https://stackoverflow.com/questions/7230621/how-can-i-iterate-over-a-packed-variadic-template-argument-list. I think you could do something similar, chaining instructions together by returning tuples and then using std::apply.

Here is an example I was playing with: https://godbolt.org/z/vc8dK4ed6

In the end, I'm not sure whether (1) it's actually possible and (2) it's worth the complexity to consolidate these structs, but it's worth considering...
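
For reference, the core of that pattern can be as small as the following (the names are mine, not from the PR): apply an operation to every element of a pack, collecting the results in a tuple that can be fed into the next stage with another std::apply.

    #include <tuple>
    #include <utility>

    template <typename _Op, typename... _Accs>
    auto
    __apply_to_pack(_Op __op, std::tuple<_Accs...> __accs)
    {
        // Expands __op over the pack and re-packs the results for the next stage.
        return std::apply([&](auto&... __acc) { return std::make_tuple(__op(__acc)...); }, __accs);
    }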
