-
Notifications
You must be signed in to change notification settings - Fork 626
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bulk processed pixel conversions #2383
Comments
I appreciate the extension for minimalism alone, very nice if that default'able extension method to an internal trait would work out to 3× speedups for some of the higher-level operations. In general, yes, anything that can be designed for batching but isn't deserves a critical look. What's the main hold-up for turning your linked implementation sketch and commit into a PR? |
The holdup is calling this method internally, as most APIs only call from_color 3-4 times (once per channel or alpha) per pixel at a time. |
An example issue would be FromColor as already mentioned: fn from_color_bulk(output: &mut [Self], input: &[Rgb<f32>])
where
Self: Sized,
{
// Safe because `Foo` is `#[repr(transparent)]`
let input: &[f32] = unsafe {
std::slice::from_raw_parts(
input.as_ptr() as *const f32,
input.len() * std::mem::size_of::<f32>(),
)
};
// Same goes for T
let output: &mut [u8] = unsafe {
std::slice::from_raw_parts_mut(
output.as_ptr() as *mut u8,
output.len() * std::mem::size_of::<u8>(),
)
};
u8::from_bulk_primitive(input, output);
} Here i have to somehow turn a buffer of |
I'm supportive of this investigation, though we should make sure to benchmark carefully and to double check that the same performance gains aren't possible the current interfaces. I poked around in godbolt a bit with the Another point to mention is that |
So far we've avoided ever dealing with |
This is why im still toying around with all of this. |
SIMD will be much better here on x86 since Rust make a lot efforts on f32 -> u8, with NEON conversion is done in one instruction and there will be no difference, some improvements will be only for x86 where conversion f32 -> u8 goes without explicit saturation and Rust makes additional checks. You can check the time that Rust spending to check that it won't cause an UB on x86: #[inline(always)]
fn from_primitive(float: f32) -> u8 {
unsafe {
(normalize_float(float, u8::MAX as f32) + 0.5).to_int_unchecked::<u8>()
}
} On Other things in loop should not make real overhead. |
I observed that the compiler can actually emit pretty fast SIMD code for scalar implementations, given that proper slice sizes are used (such as casting Rgb). So it might be enough to just provide scalar bulk ops and let the compiler take the wheel. |
I believe this should be packaged into three stages of PR/Work that ill explain below, that i would like to hear some thoughts on. Ill end each case with progress that i have made in that direction.
|
I think we should start by adding a benchmark of the user-facing functionality that we want to optimize. In this case, I believe it would either be measuring either From there, the next step would be a proof-of-concept implementation showing that the overall performance difference that is possible. It is also an opportunity to experiment with whether there's other ways to get similar performance gains. If results from the previous step are compelling, then we can figure out the right sequence to merge things in. Your proposed sequence of first defining the API, then adding specialized implementations, and following up with further optimizations where beneficial seems broadly reasonable. This is also where we'd have to answer questions about usage of unsafe (ideally we'd avoid unless absolutely necessary) and using explicit SIMD |
Adding internal API to bulk process pixel conversions with per-conversion friendly SIMD/SWIR would speed up many image ops in this crate. I would like to hear some feedback on the general concept and its applications for the future.
Draft
I'm willing to take a shot at implementing these APIs as far as possible, I have done some bikeshedding and benchmarks as seen here. I have observed a 3x speedup when converting a 16mb f32 buffer to u8, while taking these new constraints into account.
Generally, most conversion APIs would need to be extended like this, though this new API would not be mandatory to implement as a scalar fallback should always be possible.
The text was updated successfully, but these errors were encountered: