chore: Remove USE_ALIGNED_ACCESS and enhance BYTE_ORDER handling #2456
base: unstable
}
}
}
#endif
I'm wondering whether the performance difference between this fast path and the vanilla algorithm is large.
Personally, I think https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/bitmap_ops.h#L160-L242 would be better. I'll run a benchmark there.
I've checked the current impl, and I guess the fast path would be faster😅
You can use https://quick-bench.com/ which can generate a chart.
I reverted that part because the underlying code is too slow 😅
Besides, I changed it to use memcpy to generalize it. I'll try a benchmark later.
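For readers following along, here is a minimal sketch of what a memcpy-based word-at-a-time path can look like. It is not the code from this PR: the function name `AndBytes` and the choice of a bitwise AND over two byte buffers are assumptions made for illustration. The idea is that `std::memcpy` into a local `uint64_t` is well-defined for any alignment, and optimizing compilers commonly lower it to a plain load or store where the target allows that.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the PR): computes dst[i] &= src[i] for len bytes.
inline void AndBytes(uint8_t *dst, const uint8_t *src, size_t len) {
  size_t i = 0;
  // Word-at-a-time path: copy 8 bytes into locals instead of casting
  // the byte pointers to uint64_t*, so no alignment assumption is needed.
  for (; i + sizeof(uint64_t) <= len; i += sizeof(uint64_t)) {
    uint64_t a = 0;
    uint64_t b = 0;
    std::memcpy(&a, dst + i, sizeof(a));
    std::memcpy(&b, src + i, sizeof(b));
    a &= b;
    std::memcpy(dst + i, &a, sizeof(a));
  }
  // Tail: handle the remaining bytes one at a time.
  for (; i < len; ++i) {
    dst[i] &= src[i];
  }
}
```

Because the operation is bitwise, the result does not depend on byte order, and the same function can be called on buffers that start at odd addresses. Whether it matches a hand-written aligned fast path still has to be measured.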
apache/arrow@76cebfa
Luckily, Arrow has a similar test here. On my macOS M1 Pro with Release -O2:
BenchmarkBitmapVisitBitsetAnd/32768/0 753392 ns 749634 ns 937 bytes_per_second=41.687M/s
BenchmarkBitmapVisitBitsetAnd/131072/0 2986097 ns 2985449 ns 234 bytes_per_second=41.8698M/s
BenchmarkBitmapVisitBitsetAnd/32768/1 746267 ns 746040 ns 939 bytes_per_second=41.8878M/s
BenchmarkBitmapVisitBitsetAnd/131072/1 2991597 ns 2990679 ns 234 bytes_per_second=41.7965M/s
BenchmarkBitmapVisitBitsetAnd/32768/2 747519 ns 747314 ns 940 bytes_per_second=41.8164M/s
BenchmarkBitmapVisitBitsetAnd/131072/2 2985102 ns 2984500 ns 234 bytes_per_second=41.8831M/s
The code is no different from bit-hacking and
@git-hulk @PragmaTwice I've pasted the results in #2456 (comment). The code runs at the same speed as the highly optimized code on macOS, and x86 would share this optimization.
@PragmaTwice would you mind checking again?
Redis uses USE_ALIGNED_ACCESS in many places as the "fast path" for ARM-like architectures. However, I think modern compilers can handle this kind of optimization well, so this part of the code is removed.
Besides, some vendored libraries use the BYTE_ORDER macro, which might not be defined in these files, so I use BYTE_ORDER instead.
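As a rough illustration of the byte-order concern, the sketch below checks byte order without depending on a macro that a system header may or may not define. It assumes the GCC/Clang predefined macros `__BYTE_ORDER__`, `__ORDER_LITTLE_ENDIAN__` and `__ORDER_BIG_ENDIAN__`; whether these are the macros this PR switches to is not stated on this page. The names `SKETCH_HAS_COMPILE_TIME_ORDER`, `IsLittleEndian` and `IsLittleEndianRuntime` are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Assumption: GCC and Clang predefine these macros without any #include.
// Other compilers may not, so a runtime probe is kept as a fallback.
#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__) && \
    defined(__ORDER_BIG_ENDIAN__)
#define SKETCH_HAS_COMPILE_TIME_ORDER 1
#else
#define SKETCH_HAS_COMPILE_TIME_ORDER 0
#endif

// Hypothetical helper: inspects the first byte in memory of the integer 1.
inline bool IsLittleEndianRuntime() {
  const uint32_t probe = 1;
  uint8_t first = 0;
  std::memcpy(&first, &probe, 1);
  return first == 1;
}

// Hypothetical helper: prefers the compile-time answer when it is available.
inline bool IsLittleEndian() {
#if SKETCH_HAS_COMPILE_TIME_ORDER
  return __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__;
#else
  return IsLittleEndianRuntime();
#endif
}
```

Guarding on `defined(...)` matters here because an undefined macro in an `#if` comparison evaluates to 0 on both sides, which can silently select the wrong branch.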