Expose boolean builder contents #5760
Conversation
/// Returns the current values buffer as a slice
///
/// Boolean values are bit-packed into bytes, so to read the i-th boolean,
/// you must compute `(values_slice[i / 8] & (1 << (i % 8))) != 0`.
Perhaps we could direct people to https://docs.rs/arrow-buffer/latest/arrow_buffer/bit_util/fn.get_bit.html ?
It should also be faster
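For illustration, here is a minimal sketch comparing the manual bit probe from the doc comment with arrow-buffer's `get_bit` (assuming its documented signature `fn get_bit(data: &[u8], i: usize) -> bool`):

```rust
use arrow_buffer::bit_util::get_bit;

fn main() {
    // 0b0000_0101: bits 0 and 2 are set, i.e. [true, false, true, false, ...]
    let values_slice: &[u8] = &[0b0000_0101];

    // Manual bit extraction, as described in the doc comment above
    let i = 2;
    let manual = (values_slice[i / 8] & (1 << (i % 8))) != 0;

    // The encapsulated helper from arrow-buffer
    let helper = get_bit(values_slice, i);

    assert_eq!(manual, helper);
    assert!(manual);
}
```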
Will do so, but out of curiosity, did you benchmark that the current implementation is actually faster? A quick ASM comparison makes me a little surprised that this is the case (see the sketch of the three variants after this comment):

- Indexing a bitmap is a common operation that optimizing compilers know how to deal with pretty well, trivially turning `x / 8` into `x >> 3`, `x % 8` into `x & 7`, and `int & (1 << N)`, where `N` can be proven at compile time to be in bounds, into a bit test. So effectively we're comparing a `bt`/`setb` pair to a table lookup + `test`/`setne` pair here, and from https://uops.info/table.html, it does not look like `bt`/`setb` is particularly badly implemented on current CPUs.
- Furthermore, since the table is specified to be a `const`, and thus does not have a dedicated memory location in global program data, extra instructions had to be generated to push the table to the stack on every call to the function, adding a memory store to the hot code path (which worries me, because stores have an unfortunate tendency to create superscalar memory hazards that worsen the latency of surrounding loads).
- The store can be avoided (or, more accurately, turned into a load, which will reliably hit L1 if the function is called in the kind of tight loops where its performance matters) by turning the `const` into a `static`, as shown in the linked godbolt snippet.

Again, I'm just analyzing the assembly by eye here; I did not take the time to set up a proper microbenchmark. I'm just surprised that it makes a positive difference.
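For concreteness, here is a sketch of the three variants this discussion compares. The function names are mine, and the table-based versions only mirror the shape of the current `get_bit` rather than reproducing the library code verbatim:

```rust
/// Naive version: the compiler lowers `/ 8` and `% 8` to shifts/masks
/// and the `& (1 << _)` comparison to a single bit-test instruction.
fn bit_naive(data: &[u8], i: usize) -> bool {
    (data[i / 8] & (1 << (i % 8))) != 0
}

/// Table-based version with a `const` table: the table has no dedicated
/// memory location, so it may be materialized on the stack at each call.
const BIT_MASK: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];
fn bit_const_table(data: &[u8], i: usize) -> bool {
    (data[i / 8] & BIT_MASK[i % 8]) != 0
}

/// Table-based version with a `static` table: the table lives in global
/// program data, turning the per-call store into a (usually L1-resident) load.
static BIT_MASK_STATIC: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];
fn bit_static_table(data: &[u8], i: usize) -> bool {
    (data[i / 8] & BIT_MASK_STATIC[i % 8]) != 0
}
```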
Yes, it was benchmarked and made a positive difference when called in a hot loop (as it often is). Regardless, it is better encapsulated, and if we find a better way to implement this in the future, all call sites can benefit.
Just to be clear, overall I agree that the simple fact we're having this conversation is a very good argument in favor of adding a note about the encapsulated `get_bit` to the docs, instead of suggesting a manual implementation. So I changed the doc in the latest commit of this branch, which I believe should be enough to make it mergeable. If you want, we can move the remainder of this discussion about `get_bit()` to a dedicated issue.

But those benchmark results surprised me, and I couldn't find the associated benchmark in the arrow repository. Given how tricky it is to correctly benchmark such small functions (you need to be very careful about which optimization barriers you use and how you place them, in order not to unrealistically pessimize the code you're benchmarking), I preferred to write another benchmark myself (the barrier technique is sketched below).
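To illustrate the kind of optimization barrier I mean, here is a Criterion-style sketch of the random-indexing scenario, using `std::hint::black_box` to hide the probed index from the optimizer. The names and setup are illustrative, not my exact benchmark code:

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn hidden_constant(c: &mut Criterion) {
    // A bitmap whose contents the optimizer can see, probed at an index it cannot
    let data = vec![0b1010_1010u8; 1024];
    c.bench_function("hidden_constant/bit_naive", |b| {
        b.iter(|| {
            // black_box keeps the index opaque, so the compiler cannot
            // constant-fold or hoist the bit probe out of the loop
            let i = black_box(42usize);
            black_box((data[i / 8] & (1 << (i % 8))) != 0)
        })
    });
}

criterion_group!(benches, hidden_constant);
criterion_main!(benches);
```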
Here's how the implementations compare in a random indexing scenario (asked to probe N bit indices which are initially resident in CPU registers, but whose values are hidden from the compiler's optimizer):
hidden_constant/bit_naive          time:   [1.3571 ns 1.3582 ns 1.3593 ns]
                                   thrpt:  [2.9426 Gelem/s 2.9452 Gelem/s 2.9475 Gelem/s]
hidden_constant/bit_const_table    time:   [1.5597 ns 1.5729 ns 1.5898 ns]
                                   thrpt:  [2.5161 Gelem/s 2.5431 Gelem/s 2.5646 Gelem/s]
hidden_constant/bit_static_table   time:   [1.4149 ns 1.4161 ns 1.4174 ns]
                                   thrpt:  [2.8221 Gelem/s 2.8248 Gelem/s 2.8271 Gelem/s]
Assembly analysis
At the top, we find the `bit_naive` implementation, whose assembly is very unsurprising (just the godbolt code above, duplicated 4 times by the 4-way unrolling and with a bit of shuffling from LLVM's instruction scheduler).

Then comes `bit_static_table`, which as expected is faster than the `bit_const_table` implementation, but hampered by the extra overhead of loading data from memory and then doing a `test`/`setne` pair. The assembly is again unsurprising given what we've seen previously.

And as expected, `bit_const_table` is slowest, though it surprises me how much of a performance difference that single table store (which was sadly not completely moved out of the benchmark loop, but was at least somewhat amortized by the 4-way unrolling that I added) makes with respect to the `bit_static_table` version.
And here's how the implementations compare in a linear iteration scenario (the compiler knows that we're iterating linearly over the bits of a bitmap and is free to optimize accordingly; a sketch of this access pattern follows the numbers):
linear/bit_naive time: [38.233 µs 38.382 µs 38.558 µs]
thrpt: [6.7988 Gelem/s 6.8298 Gelem/s 6.8565 Gelem/s]
linear/bit_const_table time: [38.010 µs 38.100 µs 38.195 µs]
thrpt: [6.8634 Gelem/s 6.8803 Gelem/s 6.8967 Gelem/s]
linear/bit_static_table time: [52.366 µs 52.398 µs 52.432 µs]
thrpt: [4.9997 Gelem/s 5.0030 Gelem/s 5.0060 Gelem/s]
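The linear scenario has essentially the following shape (a sketch under my assumptions about the benchmark, not its verbatim code):

```rust
/// Count the set bits of a bitmap by probing every index in order.
/// Because the compiler can see that `i` advances linearly, it can
/// strength-reduce the per-bit address arithmetic into a scanning loop.
fn count_ones_linear(data: &[u8]) -> usize {
    let mut count = 0;
    for i in 0..data.len() * 8 {
        if (data[i / 8] & (1 << (i % 8))) != 0 {
            count += 1;
        }
    }
    count
}
```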
Assembly analysis
At the top, we find the `bit_naive` and `bit_const_table` implementations. They both generate the exact same, pretty optimal assembly in this easier scenario, as this time the compiler's optimizer managed to optimize out the table store of `bit_const_table` and just generate a linear bit-probing loop.

Sadly, `bit_static_table` did not get so lucky. To my surprise, it looks like the compiler did not manage to prove that the `BIT_MASK_STATIC` array does not get modified and so did not generate the same code as for the `const BIT_MASK`; instead, there are some surprising SSE instructions in there that I do not have time to look into right now.
All numbers are from an AMD Zen 3 CPU (AMD Ryzen 5 5600G), compiled with rustc v1.78 on mostly-default `cargo bench` settings (I use the LLD linker instead of BFD, but it should not affect this particular benchmark).
Overall, I would argue that this shows that the naive version is always fastest, has fewer performance edge cases in the face of compiler optimizer mishaps, and thus should become the new implementation of `get_bit`.
But if you disagree with my conclusion, please feel free to re-run my benchmark with `cargo bench` on any machine whose performance is of interest to you, or point me to the benchmark that you used to conclude that the `bit_naive` implementation is slower than the `bit_const_table` one, so that I can figure out what it's doing differently and whether that points to an issue in my benchmark or yours.
It was years ago, but I believe it was one of the benchmarks involving boolean array iteration. That being said, most of those use cases now use BitChunks or UnalignedBitChunks, so it is possible very few hot kernels even use this method anymore (the word-at-a-time idea is sketched below).

If you wanted to file a PR to optimise this function, I think that would be well received.
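For context, here is a first-principles sketch of the word-at-a-time idea behind BitChunks; this is not the actual arrow-buffer API, just the access pattern it enables:

```rust
/// Process a bitmap 64 bits at a time instead of probing individual
/// bits, which is why kernels built this way rarely call `get_bit`.
fn count_ones_chunked(data: &[u8]) -> usize {
    let mut chunks = data.chunks_exact(8);
    let mut count = 0;
    for chunk in &mut chunks {
        // Reinterpret 8 bytes as one u64 word and use a popcount instruction
        let word = u64::from_le_bytes(chunk.try_into().unwrap());
        count += word.count_ones() as usize;
    }
    // Handle the remaining tail bytes that don't fill a full u64
    for byte in chunks.remainder() {
        count += byte.count_ones() as usize;
    }
    count
}
```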
Which issue does this PR close?
Closes #5757.
Rationale for this change
This lets me expose a more consistent interface in my prototype for #5700, where builder contents can be accessed in place as a slice for more types.
What changes are included in this PR?
Boolean builder contents are exposed as a slice of bytes, as suggested by @tustvold in #5757.
Are there any user-facing changes?
This adds a `values_slice()` method to `BooleanBuilder`, which exposes the bit-packed data as a slice of bytes. Since that's a bit different from other `values_slice()` methods due to the bit-packing involved, I added a one-liner explanation to the docs of how one would access the i-th boolean from this slice.
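For example, usage could look like the following sketch (assuming the `values_slice()` added by this PR returns the bit-packed bytes as `&[u8]`):

```rust
use arrow_array::builder::BooleanBuilder;
use arrow_buffer::bit_util::get_bit;

fn main() {
    let mut builder = BooleanBuilder::new();
    builder.append_value(true);
    builder.append_value(false);
    builder.append_value(true);

    // The new accessor: bit-packed contents, one bit per appended value
    let bits = builder.values_slice();
    assert!(get_bit(bits, 0));
    assert!(!get_bit(bits, 1));
    assert!(get_bit(bits, 2));
}
```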