Expose boolean builder contents #5760
Conversation
/// Returns the current values buffer as a slice
///
/// Boolean values are bit-packed into bytes, so to read the i-th boolean,
/// you must compute `(values_slice[i / 8] & (1 << (i % 8))) != 0`.
Perhaps we could direct people to https://docs.rs/arrow-buffer/latest/arrow_buffer/bit_util/fn.get_bit.html ?
It should also be faster
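For illustration, here is a minimal sketch comparing the manual bit probe from the doc comment with arrow-buffer's `get_bit` (assuming its documented signature `fn get_bit(data: &[u8], i: usize) -> bool`):

```rust
use arrow_buffer::bit_util::get_bit;

fn main() {
    // 0b0000_0101: bits 0 and 2 are set, i.e. [true, false, true, false, ...]
    let values_slice: &[u8] = &[0b0000_0101];

    // Manual bit extraction, as described in the doc comment above
    let i = 2;
    let manual = (values_slice[i / 8] & (1 << (i % 8))) != 0;

    // The encapsulated helper from arrow-buffer
    let helper = get_bit(values_slice, i);

    assert_eq!(manual, helper);
    assert!(manual);
}
```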
Will do so, but out of curiosity, did you benchmark that the current implementation is actually faster? A quick ASM comparison makes me a little surprised that this is the case (see the sketch of the three variants after this comment):

- Indexing a bitmap is a common operation that optimizing compilers know how to deal with pretty well, trivially turning `x / 8` into `x >> 3`, `x % 8` into `x & 7`, and `int & (1 << N)`, where `N` can be proven at compile time to be in bounds, into a bit test. So effectively we're comparing a `bt`/`setb` pair to a table lookup + `test`/`setne` pair here, and from https://uops.info/table.html, it does not look like `bt`/`setb` is particularly badly implemented on current CPUs.
- Furthermore, since the table is specified to be a `const`, and thus does not have a dedicated memory location in global program data, extra instructions had to be generated to push the table to the stack on every call to the function, adding a memory store to the hot code path (which worries me, because stores have an unfortunate tendency to create superscalar memory hazards that worsen the latency of surrounding loads).
- The store can be avoided (or, more accurately, turned into a load, which will reliably hit L1 if the function is called in the kind of tight loops where its performance matters) by turning the `const` into a `static`, as shown in the linked godbolt snippet.

Again, I'm just analyzing the assembly by eye here; I did not take the time to set up a proper microbenchmark. I'm just surprised that it makes a positive difference.
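For concreteness, here is a sketch of the three variants this discussion compares. The function names are mine, and the table-based versions only mirror the shape of the current `get_bit` rather than reproducing the library code verbatim:

```rust
/// Naive version: the compiler lowers `/ 8` and `% 8` to shifts/masks
/// and the `& (1 << _)` comparison to a single bit-test instruction.
fn bit_naive(data: &[u8], i: usize) -> bool {
    (data[i / 8] & (1 << (i % 8))) != 0
}

/// Table-based version with a `const` table: the table has no dedicated
/// memory location, so it may be materialized on the stack at each call.
const BIT_MASK: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];
fn bit_const_table(data: &[u8], i: usize) -> bool {
    (data[i / 8] & BIT_MASK[i % 8]) != 0
}

/// Table-based version with a `static` table: the table lives in global
/// program data, turning the per-call store into a (usually L1-resident) load.
static BIT_MASK_STATIC: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];
fn bit_static_table(data: &[u8], i: usize) -> bool {
    (data[i / 8] & BIT_MASK_STATIC[i % 8]) != 0
}
```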
Yes, it was benchmarked and made a positive difference when called in a hot loop (as it often is). Regardless, it is better encapsulated, and if we find a better way to implement this in the future, all call sites can benefit.
Just to be clear, overall I agree that the simple fact we're having this conversation is a very good argument in favor of adding a note about the encapsulated `get_bit` to the docs, instead of suggesting a manual implementation. So I changed the doc in the latest commit of this branch, which I believe should be enough to make it mergeable. If you want, we can move the remainder of this discussion about `get_bit()` to a dedicated issue.

But those benchmark results surprised me, and I couldn't find the associated benchmark in the arrow repository. Given how tricky it is to correctly benchmark such small functions (you need to be very careful about which optimization barriers you use and how you place them, in order not to unrealistically pessimize the code you're benchmarking), I preferred to write another benchmark myself (the barrier technique is sketched below).
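To illustrate the kind of optimization barrier I mean, here is a Criterion-style sketch of the random-indexing scenario, using `std::hint::black_box` to hide the probed index from the optimizer. The names and setup are illustrative, not my exact benchmark code:

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn hidden_constant(c: &mut Criterion) {
    // A bitmap whose contents the optimizer can see, probed at an index it cannot
    let data = vec![0b1010_1010u8; 1024];
    c.bench_function("hidden_constant/bit_naive", |b| {
        b.iter(|| {
            // black_box keeps the index opaque, so the compiler cannot
            // constant-fold or hoist the bit probe out of the loop
            let i = black_box(42usize);
            black_box((data[i / 8] & (1 << (i % 8))) != 0)
        })
    });
}

criterion_group!(benches, hidden_constant);
criterion_main!(benches);
```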
Here's how the implementations compare in a random indexing scenario (asked to probe N bit indices which are initially resident in CPU registers, but whose values are hidden from the compiler's optimizer):
hidden_constant/bit_naive          time:   [1.3571 ns 1.3582 ns 1.3593 ns]
                                   thrpt:  [2.9426 Gelem/s 2.9452 Gelem/s 2.9475 Gelem/s]
hidden_constant/bit_const_table    time:   [1.5597 ns 1.5729 ns 1.5898 ns]
                                   thrpt:  [2.5161 Gelem/s 2.5431 Gelem/s 2.5646 Gelem/s]
hidden_constant/bit_static_table   time:   [1.4149 ns 1.4161 ns 1.4174 ns]
                                   thrpt:  [2.8221 Gelem/s 2.8248 Gelem/s 2.8271 Gelem/s]
Assembly analysis
At the top, we find the `bit_naive` implementation, whose assembly is very unsurprising (just the godbolt code above, duplicated 4 times by the 4-way unrolling and with a bit of shuffling from LLVM's instruction scheduler).

Then comes `bit_static_table`, which as expected is faster than the `bit_const_table` implementation, but hampered by the extra overhead of loading data from memory and then doing a `test`/`setne` pair. The assembly is again unsurprising given what we've seen previously.

And as expected, `bit_const_table` is slowest, though it surprises me how much of a performance difference that single table store (which was sadly not completely moved out of the benchmark loop, but was at least somewhat amortized by the 4-way unrolling that I added) makes with respect to the `bit_static_table` version.
And here's how the implementations compare in a linear iteration scenario (the compiler knows that we're iterating linearly over the bits of a bitmap and is free to optimize accordingly; a sketch of this access pattern follows the numbers):
linear/bit_naive time: [38.233 µs 38.382 µs 38.558 µs]
thrpt: [6.7988 Gelem/s 6.8298 Gelem/s 6.8565 Gelem/s]
linear/bit_const_table time: [38.010 µs 38.100 µs 38.195 µs]
thrpt: [6.8634 Gelem/s 6.8803 Gelem/s 6.8967 Gelem/s]
linear/bit_static_table time: [52.366 µs 52.398 µs 52.432 µs]
thrpt: [4.9997 Gelem/s 5.0030 Gelem/s 5.0060 Gelem/s]
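The linear scenario has essentially the following shape (a sketch under my assumptions about the benchmark, not its verbatim code):

```rust
/// Count the set bits of a bitmap by probing every index in order.
/// Because the compiler can see that `i` advances linearly, it can
/// strength-reduce the per-bit address arithmetic into a scanning loop.
fn count_ones_linear(data: &[u8]) -> usize {
    let mut count = 0;
    for i in 0..data.len() * 8 {
        if (data[i / 8] & (1 << (i % 8))) != 0 {
            count += 1;
        }
    }
    count
}
```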
Assembly analysis
At the top, we find the `bit_naive` and `bit_const_table` implementations. They both generate the exact same, pretty optimal assembly in this easier scenario, as this time the compiler's optimizer managed to optimize out the table store of `bit_const_table` and just generate a linear bit-probing loop.

Sadly, `bit_static_table` did not get so lucky. To my surprise, it looks like the compiler did not manage to prove that the `BIT_MASK_STATIC` array does not get modified and so did not generate the same code as for the `const BIT_MASK`; instead, there are some surprising SSE instructions in there that I do not have time to look into right now.
All numbers are from an AMD Zen 3 CPU (AMD Ryzen 5 5600G), compiled with rustc v1.78 on mostly-default `cargo bench` settings (I use the LLD linker instead of BFD, but it should not affect this particular benchmark).
Overall, I would argue that this shows that the naive version is always fastest, has fewer performance edge cases in the face of compiler optimizer mishaps, and thus should become the new implementation of `get_bit`.
But if you disagree with my conclusion, please feel free to re-run my benchmark with `cargo bench` on any machine whose performance is of interest to you, or point me to the benchmark that you used to conclude that the `bit_naive` implementation is slower than the `bit_const_table` one, so that I can figure out what it's doing differently and whether that points to an issue in my benchmark or yours.
It was years ago, but I believe it was one of the benchmarks involving boolean array iteration. That being said, most of those use cases now use BitChunks or UnalignedBitChunks, so it is possible very few hot kernels even use this method anymore (the word-at-a-time idea is sketched below).

If you wanted to file a PR to optimise this function, I think that would be well received.
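For context, here is a first-principles sketch of the word-at-a-time idea behind BitChunks; this is not the actual arrow-buffer API, just the access pattern it enables:

```rust
/// Process a bitmap 64 bits at a time instead of probing individual
/// bits, which is why kernels built this way rarely call `get_bit`.
fn count_ones_chunked(data: &[u8]) -> usize {
    let mut chunks = data.chunks_exact(8);
    let mut count = 0;
    for chunk in &mut chunks {
        // Reinterpret 8 bytes as one u64 word and use a popcount instruction
        let word = u64::from_le_bytes(chunk.try_into().unwrap());
        count += word.count_ones() as usize;
    }
    // Handle the remaining tail bytes that don't fill a full u64
    for byte in chunks.remainder() {
        count += byte.count_ones() as usize;
    }
    count
}
```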
Which issue does this PR close?
Closes #5757.
Rationale for this change
This lets me expose a more consistent interface in my prototype for #5700, where builder contents can be accessed in place as a slice for more types.
What changes are included in this PR?
Boolean builder contents are exposed as a slice of bytes, as suggested by @tustvold in #5757.
Are there any user-facing changes?
This adds a `values_slice()` method to `BooleanBuilder`, which exposes the bit-packed data as a slice of bytes. Since that's a bit different from other `values_slice()` methods due to the bit-packing involved, I added a one-liner explanation to the docs of how one would access the i-th boolean from this slice.
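For example, usage could look like the following sketch (assuming the `values_slice()` added by this PR returns the bit-packed bytes as `&[u8]`):

```rust
use arrow_array::builder::BooleanBuilder;
use arrow_buffer::bit_util::get_bit;

fn main() {
    let mut builder = BooleanBuilder::new();
    builder.append_value(true);
    builder.append_value(false);
    builder.append_value(true);

    // The new accessor: bit-packed contents, one bit per appended value
    let bits = builder.values_slice();
    assert!(get_bit(bits, 0));
    assert!(!get_bit(bits, 1));
    assert!(get_bit(bits, 2));
}
```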