
Faster CPU (Arg-)Reductions #6989

Merged: 2 commits into isl-org:main on Oct 3, 2024

Conversation

manuelvogel12 (Contributor)

This pull request aims to speed up Reductions and ArgReductions with single outputs on the CPU.

Type

  • Bug fix (non-breaking change which fixes an issue): Fixes #
  • New feature (non-breaking change which adds functionality). Resolves #
  • Breaking change (fix or feature that would cause existing functionality to not work as expected) Resolves #

Motivation and Context

For Reductions (e.g. sum), the problem with the current code is that multiple threads access a shared array in each loop iteration; this can be fixed by accumulating into a single thread-local variable instead.
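
For illustration, here is a minimal sketch of the pattern described above. The function name `Sum`, the `thread_results` array, and the `num_threads` parameter are assumptions for the example, not the actual Open3D kernel code:

```cpp
#include <omp.h>

#include <cstdint>
#include <vector>

// Sketch: accumulate into a thread-local variable and write to the
// shared per-thread slot only once, instead of touching shared memory
// in every loop iteration.
double Sum(const std::vector<double>& data, int num_threads) {
    std::vector<double> thread_results(num_threads, 0.0);
#pragma omp parallel num_threads(num_threads)
    {
        double local_sum = 0.0;  // thread-local accumulator
#pragma omp for schedule(static)
        for (int64_t i = 0; i < (int64_t)data.size(); ++i) {
            local_sum += data[i];
        }
        // One shared write per thread, after the loop finishes.
        thread_results[omp_get_thread_num()] = local_sum;
    }
    double total = 0.0;
    for (double r : thread_results) {
        total += r;  // cheap sequential combine of the partial sums
    }
    return total;
}
```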

With the current code, ArgReductions with a single output (e.g. points.argmax()) run on a single CPU core, even though the operation can be parallelized in the same way as the reductions.
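
The analogous sketch for a single-output arg-reduction looks roughly like this (again illustrative; `ArgMax` and the variable names are assumptions, not the PR's code):

```cpp
#include <omp.h>

#include <cstdint>
#include <limits>
#include <vector>

int64_t ArgMax(const std::vector<float>& data, int num_threads) {
    std::vector<int64_t> best_idx(num_threads, 0);
    std::vector<float> best_val(
            num_threads, -std::numeric_limits<float>::infinity());
#pragma omp parallel num_threads(num_threads)
    {
        const int tid = omp_get_thread_num();
        int64_t local_idx = 0;
        float local_val = -std::numeric_limits<float>::infinity();
#pragma omp for schedule(static)
        for (int64_t i = 0; i < (int64_t)data.size(); ++i) {
            if (data[i] > local_val) {
                local_val = data[i];
                local_idx = i;
            }
        }
        best_idx[tid] = local_idx;  // single shared write per thread
        best_val[tid] = local_val;
    }
    // Sequential combine of the per-thread candidates.
    int winner = 0;
    for (int t = 1; t < num_threads; ++t) {
        if (best_val[t] > best_val[winner]) winner = t;
    }
    return best_idx[winner];
}
```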

This PR fixes both problems. On my PC, it makes these reductions about 4x faster (a 400% speedup) and these arg-reductions about 20x faster (a 2000% speedup).

Note that this only applies to single-output operations performed on the CPU.

Checklist:

  • I have run python util/check_style.py --apply to apply Open3D code style
    to my code.
  • This PR changes Open3D behavior or adds new functionality.
    • Both C++ (Doxygen) and Python (Sphinx / Google style) documentation is
      updated accordingly.
    • I have added or updated C++ and / or Python unit tests OR included test
      results
      (e.g. screenshots or numbers) here.
  • I will follow up and update the code if CI fails.
  • For fork PRs, I have selected Allow edits from maintainers.

Description

The benchmarking was done on a 24-core CPU, using the sum() and argmax() functions on arrays with 150M elements.


update-docs bot commented Sep 26, 2024

Thanks for submitting this pull request! The maintainers of this repository would appreciate if you could update the CHANGELOG.md based on your changes.

ssheorey (Member) left a comment:

Hi @manuelvogel12 thanks for this PR! Speeding up basic ops is really useful. Can I request some additional benchmarking? Essentially, we want to ensure there are no speed regressions for different use cases. An example of multiple scenarios:

  • number of points: 10K, 100K, 1M, 10M, 100M
  • number of CPU cores: 4, 8, 16, 24 (when counting cores, please ignore hyperthreading)

You can use (and update if needed) `cpp/benchmarks/core/Reduction.cpp` for benchmarking this. [Build with the CMake option `-D BUILD_BENCHMARKS=ON`.]

You can use `taskset --cpu-list 0-7 open3dbenchmark...` to restrict the benchmark to CPU cores 0-7 inclusive. [On hyperthreading CPUs, this corresponds to 4 physical cores.]

```cpp
static void LaunchArgReductionParallelDim(const Indexer& indexer,
                                          func_t reduce_func,
                                          scalar_t identity) {
    int64_t num_output_elements = indexer.NumOutputElements();
#pragma omp parallel for schedule(static) \
        num_threads(utility::EstimateMaxThreads())
    for (int64_t output_idx = 0; output_idx < num_output_elements;
```
ssheorey (Member):
Since usually num_output_elements << num_threads, we should collapse the two for loops to improve thread utilization with `#pragma omp for collapse(2)`. I believe Visual Studio now supports this (OpenMP v3), but it would be good to check Windows CI.
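
For reference, a minimal sketch of the suggested collapse pattern; the loop bounds and `process` are placeholders, not the actual kernel:

```cpp
// collapse(2) fuses both loops into one iteration space, so threads
// stay busy even when the outer loop count is below the thread count.
// It requires that iterations be independent across BOTH loops.
#pragma omp parallel for collapse(2) schedule(static)
for (int64_t out = 0; out < num_output_elements; ++out) {
    for (int64_t in = 0; in < num_inputs_per_output; ++in) {
        process(out, in);  // placeholder for the reduction body
    }
}
```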

benjaminum (Contributor):

I just want to point to #6626, since this PR will add new OpenMP code.

manuelvogel12 (Author):

I don't think it's possible to collapse the two loops, since the inner loop has a loop-carried dependency. What could be done is to call LaunchArgReductionKernelTwoPass in place of the inner loop. Another option would be to use OpenMP's reduction clause directly.
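
For what it's worth, the declare-reduction route would look roughly like this sketch (illustrative types and names; note that MSVC's default OpenMP 2.0 mode does not support `declare reduction`):

```cpp
#include <cstdint>
#include <limits>
#include <vector>

struct ArgMax {
    int64_t idx;
    float val;
};

// User-defined OpenMP reduction combining per-thread argmax candidates.
#pragma omp declare reduction(argmax : ArgMax :                     \
        omp_out = (omp_in.val > omp_out.val ? omp_in : omp_out))    \
        initializer(omp_priv = omp_orig)

int64_t ParallelArgMax(const std::vector<float>& data) {
    ArgMax best{0, -std::numeric_limits<float>::infinity()};
#pragma omp parallel for reduction(argmax : best)
    for (int64_t i = 0; i < (int64_t)data.size(); ++i) {
        if (data[i] > best.val) {
            best = {i, data[i]};  // each thread updates its private copy
        }
    }
    return best.idx;
}
```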

manuelvogel12 (Author) commented on Sep 30, 2024:

When simply launching LaunchArgReductionKernelTwoPass inside the outer for loop, it improves the performance when reducing over the large dimension, but worsens the performance when reducing over the small dimension.

The screenshot shows relative time, so +4.5 means this approach would be 4.5 times slower in that benchmark, while -0.78 (i.e. 0.22x the original time) means roughly 4.5 times faster.

[Benchmark screenshot: relative time change per scenario]

Therefore, I think that when the number of output elements is smaller than the number of threads, it is best either not to optimize the outer loop at all, or to find a good way to choose the number of threads for the inner loop.

ssheorey (Member):

Thanks @manuelvogel12 for the detailed benchmarking. This looks good to me.

To address @benjaminum's point: we will be switching from OpenMP to TBB soon (PR #6626). This helps quite a bit with the lack of good OpenMP support on Windows and Apple, simplifies the code, and comes with the benefit of composable parallelism. We also expect (some) speedup from TBB's work-stealing scheduler model.

I think we can merge this PR now for its speed boost and then switch this code to TBB along with the rest of the code base. @manuelvogel12 I would encourage you to check out PR #6626 and help us test it.
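
For context, a reduction in TBB would look roughly like this sketch (illustrative only, not the planned #6626 code; `TbbSum` is a made-up name):

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

#include <functional>
#include <vector>

double TbbSum(const std::vector<double>& data) {
    return tbb::parallel_reduce(
            tbb::blocked_range<size_t>(0, data.size()), 0.0,
            [&](const tbb::blocked_range<size_t>& r, double acc) {
                // The work-stealing scheduler hands out ranges
                // dynamically, keeping all cores busy.
                for (size_t i = r.begin(); i != r.end(); ++i) {
                    acc += data[i];
                }
                return acc;
            },
            std::plus<double>());  // combine partial sums
}
```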

```cpp
// Per-thread partial results: each thread records its best index and
// value, instead of contending on shared memory in every iteration.
std::vector<int64_t> thread_results_idx(num_threads, 0);
std::vector<scalar_t> thread_results_val(num_threads, identity);

#pragma omp parallel for schedule(static) \
```
ssheorey (Member):

Can we simplify this with for loop collapse in OpenMP?

manuelvogel12 (Author):

Here are the benchmarking results. I always reduced along dim 0, with varying shapes.
Since this PR aims only at single-output reductions, only the tests with one-dimensional input shapes (e.g. (100) or (10000)) are relevant here.

This is for all cores (24 cores, 48 threads):
[Benchmark screenshot: all 24 cores]

This is 4 cores with SMT (taskset --cpu-list 0-3,24-27):
[Benchmark screenshot: 4 cores with SMT]

This is 8 cores without SMT (taskset --cpu-list 0-7):
[Benchmark screenshot: 8 cores without SMT]

Interpreting the results, I would say my earlier claims (roughly 4x speedup for these reductions and 20x speedup for these arg-reductions) hold.

ssheorey merged commit db00e33 into isl-org:main on Oct 3, 2024. 36 of 39 checks passed.