SSE2 support #6

scottlamb · 2024-09-19T20:46:40Z

This deinterleaves 8-bit pairs via the PACKUSWB instruction: either shifting right to get the high 8 bits of every 16, or ANDing to get the low 8 bits of every 16. I took that idea from compiling the following code:

https://godbolt.org/z/enaMY7v4o

```rust
// Needs nightly Rust: std::simd is gated behind #![feature(portable_simd)].
use std::arch::x86_64;
use std::simd::i8x16;

pub unsafe fn process(
    top_uyvy_addr: *const u8,
    bot_uyvy_addr: *const u8,
    top_y_addr: *mut u8,
    bot_y_addr: *mut u8,
    u_addr: *mut u8,
    v_addr: *mut u8,
) {
    let [top_uv, bot_uv] = [
        (top_uyvy_addr, top_y_addr),
        (bot_uyvy_addr, bot_y_addr),
    ].map(|(uyvy_addr, y_addr)| {
        let uyvy = std::ptr::read_unaligned(uyvy_addr as *const [i8x16; 4]);
        let (uv_hi, y_hi) = uyvy[0].deinterleave(uyvy[1]);
        let (uv_lo, y_lo) = uyvy[2].deinterleave(uyvy[3]);
        std::ptr::write_unaligned(y_addr as *mut i8x16, y_hi);
        std::ptr::write_unaligned(y_addr.add(16) as *mut i8x16, y_lo);
        [uv_hi, uv_lo]
    });
    let uv = [
        i8x16::from(x86_64::_mm_avg_epu8(top_uv[0].into(), bot_uv[0].into())),
        i8x16::from(x86_64::_mm_avg_epu8(top_uv[1].into(), bot_uv[1].into())),
    ];
    let (u, v) = uv[0].deinterleave(uv[1]);
    std::ptr::write_unaligned(u_addr as *mut i8x16, u);
    std::ptr::write_unaligned(v_addr as *mut i8x16, v);
}
```

The performance of the new SSE2 path (`explicit_sse2` in the output below) is surprisingly good: about 24 GiB/s cold and 73 GiB/s hot. There is some noise in all of these measurements.

```
cold/memcpy_baseline    time:   [6.1385 ms 6.1503 ms 6.1639 ms]
                        thrpt:  [35.090 GiB/s 35.168 GiB/s 35.236 GiB/s]
                 change:
                        time:   [+0.8184% +1.1318% +1.4649%] (p = 0.00 < 0.05)
                        thrpt:  [-1.4438% -1.1191% -0.8117%]
                        Change within noise threshold.
cold/libyuv             time:   [8.9479 ms 8.9586 ms 8.9708 ms]
                        thrpt:  [24.111 GiB/s 24.144 GiB/s 24.173 GiB/s]
                 change:
                        time:   [+1.8972% +2.0382% +2.2017%] (p = 0.00 < 0.05)
                        thrpt:  [-2.1543% -1.9975% -1.8619%]
                        Performance has regressed.
cold/explicit_avx2_double
                        time:   [8.5369 ms 8.5513 ms 8.5665 ms]
                        thrpt:  [25.249 GiB/s 25.294 GiB/s 25.336 GiB/s]
                 change:
                        time:   [+12.556% +12.813% +13.076%] (p = 0.00 < 0.05)
                        thrpt:  [-11.564% -11.358% -11.155%]
                        Performance has regressed.
cold/explicit_avx2_single
                        time:   [8.0669 ms 8.0752 ms 8.0852 ms]
                        thrpt:  [26.752 GiB/s 26.785 GiB/s 26.812 GiB/s]
                 change:
                        time:   [+1.6387% +1.7825% +1.9258%] (p = 0.00 < 0.05)
                        thrpt:  [-1.8894% -1.7513% -1.6123%]
                        Performance has regressed.
cold/explicit_sse2      time:   [8.9443 ms 8.9541 ms 8.9652 ms]
                        thrpt:  [24.126 GiB/s 24.156 GiB/s 24.182 GiB/s]
cold/auto_avx2_64       time:   [32.122 ms 32.139 ms 32.158 ms]
                        thrpt:  [6.7260 GiB/s 6.7300 GiB/s 6.7335 GiB/s]
                 change:
                        time:   [+0.4201% +0.4922% +0.5622%] (p = 0.00 < 0.05)
                        thrpt:  [-0.5591% -0.4898% -0.4184%]
                        Change within noise threshold.
Benchmarking cold/auto_vanilla_64: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.3s, or reduce sample count to 90.
cold/auto_vanilla_64    time:   [53.057 ms 53.092 ms 53.132 ms]
                        thrpt:  [4.0708 GiB/s 4.0739 GiB/s 4.0766 GiB/s]
                 change:
                        time:   [-0.5970% -0.5149% -0.4217%] (p = 0.00 < 0.05)
                        thrpt:  [+0.4235% +0.5176% +0.6006%]
                        Change within noise threshold.

hot/memcpy_baseline     time:   [74.306 µs 74.385 µs 74.477 µs]
                        thrpt:  [90.755 GiB/s 90.867 GiB/s 90.964 GiB/s]
                 change:
                        time:   [-0.1252% +0.1649% +0.5301%] (p = 0.39 > 0.05)
                        thrpt:  [-0.5273% -0.1646% +0.1254%]
                        No change in performance detected.
hot/libyuv              time:   [107.00 µs 107.04 µs 107.09 µs]
                        thrpt:  [63.116 GiB/s 63.145 GiB/s 63.170 GiB/s]
                 change:
                        time:   [+4.9819% +5.1083% +5.2261%] (p = 0.00 < 0.05)
                        thrpt:  [-4.9665% -4.8600% -4.7455%]
                        Performance has regressed.
hot/explicit_avx2_double
                        time:   [90.068 µs 90.113 µs 90.155 µs]
                        thrpt:  [74.973 GiB/s 75.008 GiB/s 75.045 GiB/s]
                 change:
                        time:   [+19.614% +20.304% +21.006%] (p = 0.00 < 0.05)
                        thrpt:  [-17.360% -16.877% -16.398%]
                        Performance has regressed.
hot/explicit_avx2_single
                        time:   [79.458 µs 79.556 µs 79.655 µs]
                        thrpt:  [84.856 GiB/s 84.961 GiB/s 85.066 GiB/s]
                 change:
                        time:   [+6.9429% +7.3397% +7.6897%] (p = 0.00 < 0.05)
                        thrpt:  [-7.1406% -6.8378% -6.4921%]
                        Performance has regressed.
hot/explicit_sse2       time:   [92.316 µs 92.406 µs 92.511 µs]
                        thrpt:  [73.063 GiB/s 73.146 GiB/s 73.218 GiB/s]
hot/auto_avx2_64        time:   [920.02 µs 920.20 µs 920.42 µs]
                        thrpt:  [7.3435 GiB/s 7.3453 GiB/s 7.3467 GiB/s]
                 change:
                        time:   [+0.7899% +0.8556% +0.9185%] (p = 0.00 < 0.05)
                        thrpt:  [-0.9102% -0.8483% -0.7837%]
                        Change within noise threshold.
Benchmarking hot/auto_vanilla_64: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.1s, enable flat sampling, or reduce sample count to 50.
hot/auto_vanilla_64     time:   [1.6063 ms 1.6069 ms 1.6075 ms]
                        thrpt:  [4.2048 GiB/s 4.2064 GiB/s 4.2078 GiB/s]
                 change:
                        time:   [-0.8669% -0.8069% -0.7479%] (p = 0.00 < 0.05)
                        thrpt:  [+0.7536% +0.8134% +0.8745%]
                        Change within noise threshold.
```
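
For readers unfamiliar with the PACKUSWB trick described at the top, here is a minimal sketch of it using the stable `std::arch::x86_64` intrinsics. It is not the code from this PR: the function names `deinterleave_packus` and `deinterleave_scalar` are made up for illustration, and it handles a single 32-byte chunk rather than whole UYVY rows. On little-endian x86, the even-indexed byte of each pair is the low 8 bits of a 16-bit lane (U/V in UYVY) and the odd-indexed byte is the high 8 bits (Y).

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference: even-indexed bytes go to the first array, odd-indexed to the second.
pub fn deinterleave_scalar(input: &[u8; 32]) -> ([u8; 16], [u8; 16]) {
    let mut even = [0u8; 16];
    let mut odd = [0u8; 16];
    for i in 0..16 {
        even[i] = input[2 * i];
        odd[i] = input[2 * i + 1];
    }
    (even, odd)
}

/// Hypothetical SSE2 version built on PACKUSWB (`_mm_packus_epi16`).
#[cfg(target_arch = "x86_64")]
pub fn deinterleave_packus(input: &[u8; 32]) -> ([u8; 16], [u8; 16]) {
    let mut even = [0u8; 16];
    let mut odd = [0u8; 16];
    // SAFETY: SSE2 is part of the x86_64 baseline; all loads and stores are
    // unaligned and stay inside the 32-byte input and 16-byte outputs.
    unsafe {
        let lo = _mm_loadu_si128(input.as_ptr() as *const __m128i);
        let hi = _mm_loadu_si128(input.as_ptr().add(16) as *const __m128i);
        // AND keeps the low 8 bits of every 16-bit lane (the even-indexed bytes).
        let mask = _mm_set1_epi16(0x00ff);
        let even_v = _mm_packus_epi16(_mm_and_si128(lo, mask), _mm_and_si128(hi, mask));
        // A logical right shift moves the high 8 bits down (the odd-indexed bytes).
        let odd_v = _mm_packus_epi16(_mm_srli_epi16::<8>(lo), _mm_srli_epi16::<8>(hi));
        // Every lane is now in 0..=255, so PACKUSWB's unsigned saturation never
        // triggers and the pack simply narrows 16 bits to 8.
        _mm_storeu_si128(even.as_mut_ptr() as *mut __m128i, even_v);
        _mm_storeu_si128(odd.as_mut_ptr() as *mut __m128i, odd_v);
    }
    (even, odd)
}

/// Fallback so the sketch still compiles on other architectures.
#[cfg(not(target_arch = "x86_64"))]
pub fn deinterleave_packus(input: &[u8; 32]) -> ([u8; 16], [u8; 16]) {
    deinterleave_scalar(input)
}
```

Calling `deinterleave_packus(&chunk)` on 32 bytes of UYVY would return the 16 chroma bytes first and the 16 luma bytes second, matching what `deinterleave_scalar` produces.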


Xaeroxe

Looks good!

scottlamb requested a review from Xaeroxe September 19, 2024 20:46

Xaeroxe approved these changes Sep 19, 2024

scottlamb merged commit 42326fb into main Sep 19, 2024
3 checks passed

scottlamb deleted the sse2 branch September 19, 2024 20:57
