
Faster CPU (Arg-)Reductions #6989

Merged: 2 commits into isl-org:main on Oct 3, 2024

Conversation

manuelvogel12 (Contributor)

This pull request aims to speed up Reductions and ArgReductions with single outputs on the CPU.

Type

  • Bug fix (non-breaking change which fixes an issue): Fixes #
  • New feature (non-breaking change which adds functionality). Resolves #
  • Breaking change (fix or feature that would cause existing functionality to not work as expected) Resolves #

Motivation and Context

For Reductions (e.g. sum), the problem with the current code is that multiple threads access a shared array in each loop iteration; this can be fixed by accumulating into a single thread-local variable instead.
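
For illustration, here is a minimal sketch of the pattern described above. The function name `Sum`, the `thread_results` array, and the `num_threads` parameter are assumptions for the example, not the actual Open3D kernel code:

```cpp
#include <omp.h>

#include <cstdint>
#include <vector>

// Sketch: accumulate into a thread-local variable and write to the
// shared per-thread slot only once, instead of touching shared memory
// in every loop iteration.
double Sum(const std::vector<double>& data, int num_threads) {
    std::vector<double> thread_results(num_threads, 0.0);
#pragma omp parallel num_threads(num_threads)
    {
        double local_sum = 0.0;  // thread-local accumulator
#pragma omp for schedule(static)
        for (int64_t i = 0; i < (int64_t)data.size(); ++i) {
            local_sum += data[i];
        }
        // One shared write per thread, after the loop finishes.
        thread_results[omp_get_thread_num()] = local_sum;
    }
    double total = 0.0;
    for (double r : thread_results) {
        total += r;  // cheap sequential combine of the partial sums
    }
    return total;
}
```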

With the current code, ArgReductions with a single output (e.g. points.argmax()) run on a single CPU core, even though the operation can be parallelized in the same way as the reductions.
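
The analogous sketch for a single-output arg-reduction looks roughly like this (again illustrative; `ArgMax` and the variable names are assumptions, not the PR's code):

```cpp
#include <omp.h>

#include <cstdint>
#include <limits>
#include <vector>

int64_t ArgMax(const std::vector<float>& data, int num_threads) {
    std::vector<int64_t> best_idx(num_threads, 0);
    std::vector<float> best_val(
            num_threads, -std::numeric_limits<float>::infinity());
#pragma omp parallel num_threads(num_threads)
    {
        const int tid = omp_get_thread_num();
        int64_t local_idx = 0;
        float local_val = -std::numeric_limits<float>::infinity();
#pragma omp for schedule(static)
        for (int64_t i = 0; i < (int64_t)data.size(); ++i) {
            if (data[i] > local_val) {
                local_val = data[i];
                local_idx = i;
            }
        }
        best_idx[tid] = local_idx;  // single shared write per thread
        best_val[tid] = local_val;
    }
    // Sequential combine of the per-thread candidates.
    int winner = 0;
    for (int t = 1; t < num_threads; ++t) {
        if (best_val[t] > best_val[winner]) winner = t;
    }
    return best_idx[winner];
}
```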

This PR fixes both problems. On my PC, it makes these reductions about 4x faster (a 400% speedup) and these arg-reductions about 20x faster (a 2000% speedup).

Note that this only applies to single-output operations performed on the CPU.

Checklist:

  • I have run python util/check_style.py --apply to apply Open3D code style
    to my code.
  • This PR changes Open3D behavior or adds new functionality.
    • Both C++ (Doxygen) and Python (Sphinx / Google style) documentation is
      updated accordingly.
    • I have added or updated C++ and / or Python unit tests OR included test
      results
      (e.g. screenshots or numbers) here.
  • I will follow up and update the code if CI fails.
  • For fork PRs, I have selected Allow edits from maintainers.

Description

The benchmarking was done on a 24-core CPU, using the sum() and argmax() functions on arrays with 150M elements.


update-docs bot commented Sep 26, 2024

Thanks for submitting this pull request! The maintainers of this repository would appreciate if you could update the CHANGELOG.md based on your changes.

ssheorey (Member) left a comment:

Hi @manuelvogel12 thanks for this PR! Speeding up basic ops is really useful. Can I request some additional benchmarking? Essentially, we want to ensure there are no speed regressions for different use cases. An example of multiple scenarios:

  • number of points: 10K, 100K, 1M, 10M, 100M
  • number of CPU cores: 4, 8, 16, 24 (when counting cores, please ignore hyperthreading)

You can use (and update if needed) `cpp/benchmarks/core/Reduction.cpp` for benchmarking this. [Build with the CMake option `-D BUILD_BENCHMARKS=ON`.]

You can use `taskset --cpu-list 0-7 open3dbenchmark...` to restrict the benchmark to CPU cores 0-7 inclusive. [On hyperthreading CPUs, this corresponds to 4 physical cores.]

```cpp
static void LaunchArgReductionParallelDim(const Indexer& indexer,
                                          func_t reduce_func,
                                          scalar_t identity) {
    int64_t num_output_elements = indexer.NumOutputElements();
#pragma omp parallel for schedule(static) \
        num_threads(utility::EstimateMaxThreads())
    for (int64_t output_idx = 0; output_idx < num_output_elements;
```
ssheorey (Member):
Since usually num_output_elements << num_threads, we should collapse the two for loops to improve thread utilization with `#pragma omp for collapse(2)`. I believe Visual Studio now supports this (OpenMP v3), but it would be good to check Windows CI.
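
For reference, a minimal sketch of the suggested collapse pattern; the loop bounds and `process` are placeholders, not the actual kernel:

```cpp
// collapse(2) fuses both loops into one iteration space, so threads
// stay busy even when the outer loop count is below the thread count.
// It requires that iterations be independent across BOTH loops.
#pragma omp parallel for collapse(2) schedule(static)
for (int64_t out = 0; out < num_output_elements; ++out) {
    for (int64_t in = 0; in < num_inputs_per_output; ++in) {
        process(out, in);  // placeholder for the reduction body
    }
}
```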

benjaminum (Contributor):

I just want to point to #6626, since this PR will add new OpenMP code.

manuelvogel12 (Author):

I don't think it's possible to collapse the two loops, since the inner loop has a loop-carried dependency. What could be done is to call LaunchArgReductionKernelTwoPass in place of the inner loop. Another option would be to use OpenMP's reduction clause directly.
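
For what it's worth, the declare-reduction route would look roughly like this sketch (illustrative types and names; note that MSVC's default OpenMP 2.0 mode does not support `declare reduction`):

```cpp
#include <cstdint>
#include <limits>
#include <vector>

struct ArgMax {
    int64_t idx;
    float val;
};

// User-defined OpenMP reduction combining per-thread argmax candidates.
#pragma omp declare reduction(argmax : ArgMax :                     \
        omp_out = (omp_in.val > omp_out.val ? omp_in : omp_out))    \
        initializer(omp_priv = omp_orig)

int64_t ParallelArgMax(const std::vector<float>& data) {
    ArgMax best{0, -std::numeric_limits<float>::infinity()};
#pragma omp parallel for reduction(argmax : best)
    for (int64_t i = 0; i < (int64_t)data.size(); ++i) {
        if (data[i] > best.val) {
            best = {i, data[i]};  // each thread updates its private copy
        }
    }
    return best.idx;
}
```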

manuelvogel12 (Author) commented on Sep 30, 2024:

When simply launching LaunchArgReductionKernelTwoPass inside the outer for loop, it improves the performance when reducing over the large dimension, but worsens the performance when reducing over the small dimension.

The screenshot shows relative time, so +4.5 means this approach would be 4.5 times slower in that benchmark, while -0.78 (i.e. 0.22x the original time) means roughly 4.5 times faster.

[Benchmark screenshot: relative time change per scenario]

Therefore, I think that when the number of output elements is smaller than the number of threads, it is best either not to optimize the outer loop at all, or to find a good way to choose the number of threads for the inner loop.

ssheorey (Member):

Thanks @manuelvogel12 for the detailed benchmarking. This looks good to me.

To address @benjaminum's point: we will be switching from OpenMP to TBB soon (PR #6626). This helps quite a bit with the lack of good OpenMP support on Windows and Apple, simplifies the code, and comes with the benefit of composable parallelism. We also expect (some) speedup from TBB's work-stealing scheduler model.

I think we can merge this PR now for its speed boost and then switch this code to TBB along with the rest of the code base. @manuelvogel12 I would encourage you to check out PR #6626 and help us test it.
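
For context, a reduction in TBB would look roughly like this sketch (illustrative only, not the planned #6626 code; `TbbSum` is a made-up name):

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

#include <functional>
#include <vector>

double TbbSum(const std::vector<double>& data) {
    return tbb::parallel_reduce(
            tbb::blocked_range<size_t>(0, data.size()), 0.0,
            [&](const tbb::blocked_range<size_t>& r, double acc) {
                // The work-stealing scheduler hands out ranges
                // dynamically, keeping all cores busy.
                for (size_t i = r.begin(); i != r.end(); ++i) {
                    acc += data[i];
                }
                return acc;
            },
            std::plus<double>());  // combine partial sums
}
```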

```cpp
// Per-thread partial results: each thread records its best index and
// value, instead of contending on shared memory in every iteration.
std::vector<int64_t> thread_results_idx(num_threads, 0);
std::vector<scalar_t> thread_results_val(num_threads, identity);

#pragma omp parallel for schedule(static) \
```
ssheorey (Member):

Can we simplify this with for loop collapse in OpenMP?

manuelvogel12 (Author):

Here are the benchmarking results. I always reduced along dim 0, with varying shapes.
Since this PR aims only at single-output reductions, only the tests with one-dimensional input shapes (e.g. (100) or (10000)) are relevant here.

This is for all cores (24 cores, 48 threads):
[Benchmark screenshot: all 24 cores]

This is 4 cores with SMT (taskset --cpu-list 0-3,24-27):
[Benchmark screenshot: 4 cores with SMT]

This is 8 cores without SMT (taskset --cpu-list 0-7):
[Benchmark screenshot: 8 cores without SMT]

Interpreting the results, I would say my earlier claims (roughly 4x speedup for these reductions and 20x speedup for these arg-reductions) hold.

ssheorey merged commit db00e33 into isl-org:main on Oct 3, 2024. 36 of 39 checks passed.