Reworked wait_context approach + scalability improvement in task_group #1345

pavelkumbrasev · 2024-04-15T10:28:56Z

Description

Introduce a per-thread reference_vertex to address the scalability problem caused by a single wait_context.
task_group has a flat reference counting scheme, i.e., there is one central reference counter that every created task increments and decrements during execution.
This approach works fine while tasks are big and are submitted from a small number of threads (<8).
When multiple threads start a tree-like algorithm with a lot of tasks, the overall performance of the application degrades drastically as the number of threads grows, because of the high synchronization cost on that counter.
This patch uses a per-thread reference counter in task_group.
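
The difference between the two schemes can be sketched as follows. This is a minimal illustration, not the code of the patch: root_counter and thread_vertex are hypothetical stand-ins for wait_context and reference_vertex, the updates counter exists only to make the effect measurable, and the threads are simulated one after another so that the counts are deterministic.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the central wait_context: one shared counter.
struct root_counter {
    std::atomic<std::uint64_t> refs{0};
    std::atomic<std::uint64_t> updates{0};  // how often the shared counter was written
    void reserve() { refs.fetch_add(1); updates.fetch_add(1); }
    void release() { refs.fetch_sub(1); updates.fetch_add(1); }
};

// Hypothetical stand-in for a per-thread reference_vertex. It holds exactly
// one reference on the root while its own count is non-zero.
struct thread_vertex {
    root_counter& root;
    std::atomic<std::uint64_t> refs{0};
    explicit thread_vertex(root_counter& r) : root(r) {}
    void reserve() {
        if (refs.fetch_add(1) == 0) root.reserve();  // first local reference
    }
    void release() {
        if (refs.fetch_sub(1) == 1) root.release();  // last local reference
    }
};

// Every simulated thread registers `tasks` references and then drops them.
// Returns the number of updates that reached the shared root counter.
std::uint64_t count_root_updates(unsigned threads, unsigned tasks, bool per_thread) {
    root_counter root;
    for (unsigned t = 0; t < threads; ++t) {
        thread_vertex vertex(root);  // one vertex per simulated thread
        for (unsigned i = 0; i < tasks; ++i) {
            if (per_thread) vertex.reserve(); else root.reserve();
        }
        for (unsigned i = 0; i < tasks; ++i) {
            if (per_thread) vertex.release(); else root.release();
        }
    }
    assert(root.refs.load() == 0);  // every reference was returned
    return root.updates.load();
}
```

With 4 simulated threads and 1000 tasks each, the flat scheme writes the shared counter 8000 times, while the per-thread scheme writes it 8 times (one reserve and one release per thread).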

Fixes # - issue number(s) if exists

- git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details)

Type of change

bug fix - change that fixes an issue
new feature - change that adds functionality
tests - change in tests
infrastructure - change in infrastructure and CI
documentation - documentation update

Tests

added - required for new features and some bug fixes
not needed

Documentation

updated in # - add PR number
needs to be updated
not needed

Breaks backward compatibility

Yes
No
Unknown

Notify the following users

@vossmjp @kboyarinov

Other information

src/tbb/task.cpp

include/oneapi/tbb/detail/_task.h

include/oneapi/tbb/detail/_task_handle.h

include/oneapi/tbb/task_group.h

aleksei-fedotov · 2024-04-16T12:25:05Z

src/tbb/task.cpp

@@ -221,6 +221,34 @@ void notify_waiters(std::uintptr_t wait_ctx_addr) {
    governor::get_thread_data()->my_arena->get_waiting_threads_monitor().notify(is_related_wait_ctx);
 }

+d1::wait_tree_vertex_interface* get_thread_reference_vertex(d1::wait_tree_vertex_interface* wc) {
+    __TBB_ASSERT(wc, nullptr);
+    auto& dispatcher = *governor::get_thread_data()->my_task_dispatcher;


Can thread data be uninitialized here? Would get_thread_data_if_initialized() with the assert __TBB_ASSERT(governor::get_thread_data() != nullptr, "") be enough here?

It may not be initialized yet. Consider this example:

tbb::task_group tg; tg.run([] {}); // get_thread_reference_vertex will be called, but TBB is not yet initialized

include/oneapi/tbb/collaborative_call_once.h

aleksei-fedotov · 2024-04-22T10:19:03Z

src/tbb/task.cpp

@@ -221,6 +221,34 @@ void notify_waiters(std::uintptr_t wait_ctx_addr) {
    governor::get_thread_data()->my_arena->get_waiting_threads_monitor().notify(is_related_wait_ctx);
 }

+d1::wait_tree_vertex_interface* get_thread_reference_vertex(d1::wait_tree_vertex_interface* wc) {


Perhaps, rename wc to top_wait_context.

Are there plans to pass something other than a pointer to wait_context into this function?

Renamed.

Not really.

aleksei-fedotov · 2024-04-22T10:22:56Z

src/tbb/task.cpp

+    auto& dispatcher = *governor::get_thread_data()->my_task_dispatcher;
+
+    d1::reference_vertex* ref_counter{nullptr};
+    auto pos = dispatcher.m_reference_vertex_map.find(wc);


I noticed that dispatcher.m_reference_vertex_map is used many times within this function. To improve readability, I suggest introducing an alias for it first and then using the alias in the rest of the function.

Suggested change

auto pos = dispatcher.m_reference_vertex_map.find(wc);

auto& reference_map = dispatcher.m_reference_vertex_map;

auto pos = reference_map.find(wc);

aleksei-fedotov · 2024-04-22T13:48:37Z

src/tbb/task.cpp

+    } else {
+        constexpr std::size_t max_reference_vertex_map_size = 1000;
+        if (dispatcher.m_reference_vertex_map.size() > max_reference_vertex_map_size) {
+            for (auto it = dispatcher.m_reference_vertex_map.begin(); it != dispatcher.m_reference_vertex_map.end();) {


If I understand correctly, this loop might take a significant amount of time in some situations. Shall we introduce a threshold that limits how much time a thread can spend cleaning the container?
Perhaps put a TODO note about this, at least.

Added TODO
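
One possible shape for such a threshold is a budget on the number of entries inspected per call. The sketch below is only an illustration of that idea: vertex_map, make_map, and prune_idle are hypothetical names, and a plain integer stands in for the child count that the real map would read from each reference_vertex.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical map: key -> number of live children of the vertex.
using vertex_map = std::unordered_map<int, std::uint64_t>;

// Builds a map with `idle` entries whose child count is 0 and `busy`
// entries whose child count is 1.
vertex_map make_map(int idle, int busy) {
    vertex_map map;
    for (int i = 0; i < idle; ++i) map.emplace(i, 0);
    for (int i = 0; i < busy; ++i) map.emplace(idle + i, 1);
    return map;
}

// Erases idle entries (child count == 0) but inspects at most `budget`
// entries, so one call cannot stall the calling thread for long.
// Returns the number of erased entries.
std::size_t prune_idle(vertex_map& map, std::size_t budget) {
    std::size_t erased = 0;
    std::size_t inspected = 0;
    for (auto it = map.begin(); it != map.end() && inspected < budget; ++inspected) {
        if (it->second == 0) {
            it = map.erase(it);  // erase returns the iterator to the next element
            ++erased;
        } else {
            ++it;
        }
    }
    return erased;
}
```

A budget bounds the work per call but can leave idle entries behind, so the map may stay above the size limit until later calls finish the job.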

aleksei-fedotov · 2024-04-22T14:17:53Z

include/oneapi/tbb/detail/_task.h

+        std::uint64_t ref = m_ref_count.fetch_sub(static_cast<std::uint64_t>(delta)) - static_cast<std::uint64_t>(delta);
+        if (ref == 0) {


Perhaps, this would make this part a bit more readable.

Suggested change

std::uint64_t ref = m_ref_count.fetch_sub(static_cast<std::uint64_t>(delta)) - static_cast<std::uint64_t>(delta);

if (ref == 0) {

std::uint64_t prev_ref_value = m_ref_count.fetch_sub(static_cast<std::uint64_t>(delta));

if (prev_ref_value == static_cast<std::uint64_t>(delta)) {

I'm not sure. @kboyarinov what do you think?

I agree with Aleksei, since the arithmetic operation after the long fetch_sub expression is easy to overlook.
If you are worried about the logic, I guess the check could also be if (prev_ref_value - delta == 0), but I am not sure about that.
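
For reference, std::atomic::fetch_sub returns the value held before the subtraction, so the two forms discussed here detect the same "last reference released" condition. A small sketch with hypothetical helper names that compares them:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Form used in the patch: derive the new value from the returned old one.
bool release_checks_new_value(std::atomic<std::uint64_t>& count, std::uint32_t delta) {
    std::uint64_t ref = count.fetch_sub(static_cast<std::uint64_t>(delta)) - static_cast<std::uint64_t>(delta);
    return ref == 0;
}

// Suggested form: compare the returned old value with delta directly.
bool release_checks_prev_value(std::atomic<std::uint64_t>& count, std::uint32_t delta) {
    std::uint64_t prev_ref_value = count.fetch_sub(static_cast<std::uint64_t>(delta));
    return prev_ref_value == static_cast<std::uint64_t>(delta);
}

// Runs both forms on separate counters that start from the same value and
// reports whether they reach the same verdict.
bool forms_agree(std::uint64_t initial, std::uint32_t delta) {
    std::atomic<std::uint64_t> a{initial};
    std::atomic<std::uint64_t> b{initial};
    return release_checks_new_value(a, delta) == release_checks_prev_value(b, delta);
}

// Convenience wrapper: does releasing `delta` from `initial` hit zero?
bool hits_zero(std::uint64_t initial, std::uint32_t delta) {
    std::atomic<std::uint64_t> count{initial};
    return release_checks_prev_value(count, delta);
}
```

Because the arithmetic is unsigned 64-bit, prev_ref_value - delta == 0 holds exactly when prev_ref_value == delta, so that third spelling would give the same result as well.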

aleksei-fedotov · 2024-04-22T14:20:59Z

include/oneapi/tbb/detail/_task.h

+        }
+    }
+
+    void release(std::uint32_t delta = 1) override {


The name delta suggests that the value can be negative, but the only possible parameter values here are positive. Consider naming the parameter differently here and in other similar places. Perhaps release_num or simply num would suit better.

It is identical to wait_context.

aleksei-fedotov · 2024-04-22T14:27:26Z

include/oneapi/tbb/detail/_task.h

+    void release(std::uint32_t, const d1::execution_data&) override {
+        __TBB_ASSERT(false,
+            "This method is overloaded only to fulfill the base class interface requirements, and thus, it should not be called.");
+    }


Usually it means that the inheritance is incorrect. Introducing an intermediate interface would not only avoid writing such implementations but would also improve code readability.

@kboyarinov what do you think?
One extra interface class, or one method with an assert in the private section?
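
The "extra interface class" option could look roughly like the sketch below. All names are hypothetical and execution_data is an empty placeholder for d1::execution_data; the point is only that a vertex which cannot support the two-argument release never has to override it.

```cpp
#include <cassert>
#include <cstdint>

struct execution_data {};  // placeholder for d1::execution_data

// Hypothetical narrow interface: only what every vertex can honour.
class counting_vertex_interface {
public:
    virtual void reserve(std::uint32_t delta = 1) = 0;
    virtual void release(std::uint32_t delta = 1) = 0;
    virtual ~counting_vertex_interface() = default;
};

// Hypothetical extension for vertices that also accept execution_data.
class task_aware_vertex_interface : public counting_vertex_interface {
public:
    using counting_vertex_interface::release;  // keep the one-argument overload visible
    virtual void release(std::uint32_t delta, const execution_data& ed) = 0;
};

// A vertex that only counts implements the narrow interface, so no
// "must not be called" override with an assert is needed.
class simple_vertex final : public counting_vertex_interface {
    std::uint64_t m_count{0};
public:
    void reserve(std::uint32_t delta = 1) override { m_count += delta; }
    void release(std::uint32_t delta = 1) override {
        assert(m_count >= delta);
        m_count -= delta;
    }
    std::uint64_t count() const { return m_count; }
};

// Reserves and releases the given amounts and returns what is left.
std::uint64_t remaining_after(std::uint32_t reserved, std::uint32_t released) {
    simple_vertex vertex;
    vertex.reserve(reserved);
    vertex.release(released);
    return vertex.count();
}
```

The trade-off raised in the thread remains: this adds one more class to the hierarchy, whereas the asserting override keeps a single interface at the cost of a method that must never be called.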

include/oneapi/tbb/task_group.h

aleksei-fedotov · 2024-04-22T15:57:05Z

include/oneapi/tbb/task_group.h

-template<typename F>
-class function_task : public task {
-    const F m_func;
-    wait_context& m_wait_ctx;
-    small_object_allocator m_allocator;
-
-    void finalize(const execution_data& ed) {
-        // Make a local reference not to access this after destruction.
-        wait_context& wo = m_wait_ctx;
-        // Copy allocator to the stack
-        auto allocator = m_allocator;
-        // Destroy user functor before release wait.
-        this->~function_task();
-        wo.release();
-
-        allocator.deallocate(this, ed);
-    }
-    task* execute(execution_data& ed) override {
-        task* res = d2::task_ptr_or_nullptr(m_func);
-        finalize(ed);
-        return res;
-    }
-    task* cancel(execution_data& ed) override {
-        finalize(ed);
-        return nullptr;
-    }
-public:
-    function_task(const F& f, wait_context& wo, small_object_allocator& alloc)
-        : m_func(f)
-        , m_wait_ctx(wo)
-        , m_allocator(alloc) {}
-
-    function_task(F&& f, wait_context& wo, small_object_allocator& alloc)
-        : m_func(std::move(f))
-        , m_wait_ctx(wo)
-        , m_allocator(alloc) {}
-};


Was that removed because it is now possible to reuse the function_task above, which inherits from task_handle_task?

Their layouts differ, however. The PR is marked as backward compatible, though. Is that because we are not certain about header-only backward incompatibilities, or am I missing something?

You will need to recompile the whole application.
It is explained here:
#1371