Commit
This deinterleaves 8-bit pairs via the PACKUSWB instruction: either shifting right to get the high 8 bits of every 16, or ANDing to get the low 8 bits of every 16. I took that idea from compiling the following code: https://godbolt.org/z/enaMY7v4o

```rust
#![feature(portable_simd)] // std::simd is nightly-only

use std::arch::x86_64;
use std::simd::i8x16;

pub unsafe fn process(
    top_uyvy_addr: *const u8,
    bot_uyvy_addr: *const u8,
    top_y_addr: *mut u8,
    bot_y_addr: *mut u8,
    u_addr: *mut u8,
    v_addr: *mut u8,
) {
    // For each row, split 64 UYVY bytes into chroma (even bytes) and luma
    // (odd bytes), storing the 32 luma bytes immediately.
    let [top_uv, bot_uv] = [
        (top_uyvy_addr, top_y_addr),
        (bot_uyvy_addr, bot_y_addr),
    ].map(|(uyvy_addr, y_addr)| {
        let uyvy = std::ptr::read_unaligned(uyvy_addr as *const [i8x16; 4]);
        let (uv_hi, y_hi) = uyvy[0].deinterleave(uyvy[1]);
        let (uv_lo, y_lo) = uyvy[2].deinterleave(uyvy[3]);
        std::ptr::write_unaligned(y_addr as *mut i8x16, y_hi);
        std::ptr::write_unaligned(y_addr.add(16) as *mut i8x16, y_lo);
        [uv_hi, uv_lo]
    });
    // Average the chroma of the two rows, then split U from V.
    let uv = [
        i8x16::from(x86_64::_mm_avg_epu8(top_uv[0].into(), bot_uv[0].into())),
        i8x16::from(x86_64::_mm_avg_epu8(top_uv[1].into(), bot_uv[1].into())),
    ];
    let (u, v) = uv[0].deinterleave(uv[1]);
    std::ptr::write_unaligned(u_addr as *mut i8x16, u);
    std::ptr::write_unaligned(v_addr as *mut i8x16, v);
}
```

Its performance is surprisingly good: 24 GiB/s cold, 73 GiB/s hot (see `explicit_sse2` below). There is some noise in all these measurements.

```
cold/memcpy_baseline    time:   [6.1385 ms 6.1503 ms 6.1639 ms]
                        thrpt:  [35.090 GiB/s 35.168 GiB/s 35.236 GiB/s]
                 change:
                        time:   [+0.8184% +1.1318% +1.4649%] (p = 0.00 < 0.05)
                        thrpt:  [-1.4438% -1.1191% -0.8117%]
                        Change within noise threshold.
cold/libyuv             time:   [8.9479 ms 8.9586 ms 8.9708 ms]
                        thrpt:  [24.111 GiB/s 24.144 GiB/s 24.173 GiB/s]
                 change:
                        time:   [+1.8972% +2.0382% +2.2017%] (p = 0.00 < 0.05)
                        thrpt:  [-2.1543% -1.9975% -1.8619%]
                        Performance has regressed.
cold/explicit_avx2_double
                        time:   [8.5369 ms 8.5513 ms 8.5665 ms]
                        thrpt:  [25.249 GiB/s 25.294 GiB/s 25.336 GiB/s]
                 change:
                        time:   [+12.556% +12.813% +13.076%] (p = 0.00 < 0.05)
                        thrpt:  [-11.564% -11.358% -11.155%]
                        Performance has regressed.
cold/explicit_avx2_single
                        time:   [8.0669 ms 8.0752 ms 8.0852 ms]
                        thrpt:  [26.752 GiB/s 26.785 GiB/s 26.812 GiB/s]
                 change:
                        time:   [+1.6387% +1.7825% +1.9258%] (p = 0.00 < 0.05)
                        thrpt:  [-1.8894% -1.7513% -1.6123%]
                        Performance has regressed.
cold/explicit_sse2      time:   [8.9443 ms 8.9541 ms 8.9652 ms]
                        thrpt:  [24.126 GiB/s 24.156 GiB/s 24.182 GiB/s]
cold/auto_avx2_64       time:   [32.122 ms 32.139 ms 32.158 ms]
                        thrpt:  [6.7260 GiB/s 6.7300 GiB/s 6.7335 GiB/s]
                 change:
                        time:   [+0.4201% +0.4922% +0.5622%] (p = 0.00 < 0.05)
                        thrpt:  [-0.5591% -0.4898% -0.4184%]
                        Change within noise threshold.
Benchmarking cold/auto_vanilla_64: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.3s, or reduce sample count to 90.
cold/auto_vanilla_64    time:   [53.057 ms 53.092 ms 53.132 ms]
                        thrpt:  [4.0708 GiB/s 4.0739 GiB/s 4.0766 GiB/s]
                 change:
                        time:   [-0.5970% -0.5149% -0.4217%] (p = 0.00 < 0.05)
                        thrpt:  [+0.4235% +0.5176% +0.6006%]
                        Change within noise threshold.

hot/memcpy_baseline     time:   [74.306 µs 74.385 µs 74.477 µs]
                        thrpt:  [90.755 GiB/s 90.867 GiB/s 90.964 GiB/s]
                 change:
                        time:   [-0.1252% +0.1649% +0.5301%] (p = 0.39 > 0.05)
                        thrpt:  [-0.5273% -0.1646% +0.1254%]
                        No change in performance detected.
hot/libyuv              time:   [107.00 µs 107.04 µs 107.09 µs]
                        thrpt:  [63.116 GiB/s 63.145 GiB/s 63.170 GiB/s]
                 change:
                        time:   [+4.9819% +5.1083% +5.2261%] (p = 0.00 < 0.05)
                        thrpt:  [-4.9665% -4.8600% -4.7455%]
                        Performance has regressed.
hot/explicit_avx2_double
                        time:   [90.068 µs 90.113 µs 90.155 µs]
                        thrpt:  [74.973 GiB/s 75.008 GiB/s 75.045 GiB/s]
                 change:
                        time:   [+19.614% +20.304% +21.006%] (p = 0.00 < 0.05)
                        thrpt:  [-17.360% -16.877% -16.398%]
                        Performance has regressed.
hot/explicit_avx2_single
                        time:   [79.458 µs 79.556 µs 79.655 µs]
                        thrpt:  [84.856 GiB/s 84.961 GiB/s 85.066 GiB/s]
                 change:
                        time:   [+6.9429% +7.3397% +7.6897%] (p = 0.00 < 0.05)
                        thrpt:  [-7.1406% -6.8378% -6.4921%]
                        Performance has regressed.
hot/explicit_sse2       time:   [92.316 µs 92.406 µs 92.511 µs]
                        thrpt:  [73.063 GiB/s 73.146 GiB/s 73.218 GiB/s]
hot/auto_avx2_64        time:   [920.02 µs 920.20 µs 920.42 µs]
                        thrpt:  [7.3435 GiB/s 7.3453 GiB/s 7.3467 GiB/s]
                 change:
                        time:   [+0.7899% +0.8556% +0.9185%] (p = 0.00 < 0.05)
                        thrpt:  [-0.9102% -0.8483% -0.7837%]
                        Change within noise threshold.
Benchmarking hot/auto_vanilla_64: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.1s, enable flat sampling, or reduce sample count to 50.
hot/auto_vanilla_64     time:   [1.6063 ms 1.6069 ms 1.6075 ms]
                        thrpt:  [4.2048 GiB/s 4.2064 GiB/s 4.2078 GiB/s]
                 change:
                        time:   [-0.8669% -0.8069% -0.7479%] (p = 0.00 < 0.05)
                        thrpt:  [+0.7536% +0.8134% +0.8745%]
                        Change within noise threshold.
```
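For reference, the PACKUSWB trick described at the top of this message can be sketched with plain SSE2 intrinsics roughly as follows. This is an illustration only, not the code in this commit: the function name `deinterleave32` and the scalar fallback are hypothetical, and the sketch handles a single 32-byte chunk rather than whole UYVY rows.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Hypothetical helper: splits 32 interleaved bytes (a0 b0 a1 b1 ...) into
/// the 16 even-indexed bytes and the 16 odd-indexed bytes.
#[cfg(target_arch = "x86_64")]
pub fn deinterleave32(input: &[u8; 32]) -> ([u8; 16], [u8; 16]) {
    let mut even = [0u8; 16];
    let mut odd = [0u8; 16];
    // SAFETY: SSE2 is part of the x86_64 baseline, and every unaligned
    // load/store below stays inside the fixed-size arrays.
    unsafe {
        let a = _mm_loadu_si128(input.as_ptr() as *const __m128i);
        let b = _mm_loadu_si128(input.as_ptr().add(16) as *const __m128i);
        // AND keeps the low 8 bits of every 16-bit lane (even-indexed bytes,
        // since x86 is little-endian).
        let mask = _mm_set1_epi16(0x00ff);
        let lo_a = _mm_and_si128(a, mask);
        let lo_b = _mm_and_si128(b, mask);
        // A logical right shift moves the high 8 bits of every 16-bit lane
        // (odd-indexed bytes) down into the low byte.
        let hi_a = _mm_srli_epi16::<8>(a);
        let hi_b = _mm_srli_epi16::<8>(b);
        // PACKUSWB narrows 16-bit lanes to bytes with unsigned saturation.
        // Every lane is already in 0..=255, so nothing is clamped.
        let even_v = _mm_packus_epi16(lo_a, lo_b);
        let odd_v = _mm_packus_epi16(hi_a, hi_b);
        _mm_storeu_si128(even.as_mut_ptr() as *mut __m128i, even_v);
        _mm_storeu_si128(odd.as_mut_ptr() as *mut __m128i, odd_v);
    }
    (even, odd)
}

/// Scalar fallback so the sketch also builds on other architectures.
#[cfg(not(target_arch = "x86_64"))]
pub fn deinterleave32(input: &[u8; 32]) -> ([u8; 16], [u8; 16]) {
    let mut even = [0u8; 16];
    let mut odd = [0u8; 16];
    for i in 0..16 {
        even[i] = input[2 * i];
        odd[i] = input[2 * i + 1];
    }
    (even, odd)
}
```

Applied to UYVY data, the even-indexed output holds the interleaved U/V chroma bytes and the odd-indexed output holds the Y bytes; a second pass over the averaged chroma would separate U from V.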