This project is the author's first use of explicit SIMD. Lessons learned:
- Pixel format conversions are a straightforward use of SIMD because huge blocks of pixels are just shuffled around and lightly processed independently—no variable-width encodings, branching, or masking required. This does not require the level of sophistication you may have heard of in such works as simdjson.
- We get diminishing marginal benefits as we move from "scalar instructions only" to "autovectorized SIMD" to "novice-quality explicit SIMD" to "expert quality explicit SIMD". We don't claim to have reached expert quality here, yet appear to be approximately memory bandwidth-limited (thus have optimal performance, if not energy efficiency) in cold CPU cache situations.
- x86_64 SIMD (including AVX2) is much more challenging to write than ARM SIMD (NEON). In particular, NEON makes it very straightforward to unzip parallel sequences, while it's a challenge in AVX2 to shuffle/permute things to the correct lanes. More on that below.
- AVX2 is more powerful and potentially performant than NEON. For example, it supports masked operations which could be used to avoid needing a scalar fallback path for the last few pixels in each row.
- Godbolt's Compiler Explorer can be useful. clang, g++, and Rust each have vector types and matching shuffle operations that are more general than e.g. AVX2 shuffle/permute intrinsics. In particular, one compiler shuffle call may produce a sequence of a few vendor shuffle/permute intrinsics/instructions with matching constants. It may not be possible to directly take advantage of this in the final code because these don't seem to combine well with vendor intrinsics, and because (in the case of Rust) they're not available on stable (tracking issue). But Compiler Explorer allows you to write a kernel that does the shuffle you want and look at the resulting assembly; it's then straightforward to transcribe this to intrinsics use. This can be a bit tedious, but as the total lines of code here are quite minimal and could represent the majority of a program's CPU time if written without SIMD, it's worthwhile.
A couple good references:
- The Intel intrinsics guide is the authoritative description of what x86_64 intrinsics/instructions are meant to do.
- uops.info has the best information about the latency, throughput, etc. of the instructions on specific processors, based on Intel/AMD vendor documentation and microbenchmarks.
One key challenge with AVX2 is that its 256-bit registers are mostly separate 128-bit lanes. This comes up in a variety of ways:
- many of the "swizzle" (shuffle, permute) operations have the phrase
"within 128-bit lanes" (e.g.
_mm256_shuffle_epi8
a.k.a.vpshufb
). It's cheap and easy to move data within these lanes; crossing 128-bit lanes seems to be a separate step (typically via_mm256_permute4x64_epi64
a.k.a.vpermq
, or the more flexible but more expensive_mm256_permutevar8x32_epi32
a.k.a.vpermd
). - some operations combine the top or bottom 128-bit lanes from two different
256-bit registers rather than processing a full single register (e.g.
_mm256_unpackhi_epi16
and_mm256_unpacklo_epi16
). You then may need to use a couplevpermq
operations to get them back in the order you expect. If you're persistent and lucky, you might find use them in a sequence that crosses them and then back again without the permutes.
There are many interesting discussions of this on the Internet, including the following:
- https://stackoverflow.com/questions/66664367/interleave-two-vectors
- https://stackoverflow.com/questions/52982211/best-way-to-shuffle-across-avx-lanes
- https://community.intel.com/t5/Intel-ISA-Extensions/Cross-lane-operations-how/td-p/819068
As mentioned above, Compiler Explorer can be a big help with this challenge.
Here is an example in C++ using clang's
__builtin_shufflevector
(live in compiler explorer,
clang docs):
#include <stdint.h>
typedef int8_t i8x32 __attribute__((__vector_size__(32)));
typedef int32_t i32x8 __attribute__((__vector_size__(32)));
typedef int64_t i64x4 __attribute__((__vector_size__(32)));
typedef int64_t i64x2 __attribute__((__vector_size__(16)));
void pre(i8x32 in, i8x32 *out) {
*out = __builtin_shufflevector(in, in,
// Lower 128-bit lane.
1, 3, 5, 7, 9, 11, 13, 15, // lower half: 8 Y components.
0, 4, 8, 12, 2, 6, 10, 14, // upper half: (4 * U), (4 * V).
// Upper 128-bit lane (same layout).
16+1, 16+3, 16+5, 16+7, 16+9, 16+11, 16+13, 16+15, // lower half: 8 Y components.
16+0, 16+4, 16+8, 16+12, 16+2, 16+6, 16+10, 16+14 // upper half: (4 * U), (4 * V).
);
}
void y(i64x4 data0, i64x4 data1, i64x4 *y) {
*y = __builtin_shufflevector(data0, data1, 0, 2, 4, 6);
}
void double_block(i32x8 uv0, i32x8 uv1, i32x8 *u, i32x8 *v) {
// (u0 u2 v0 v2) (u1 u3 v1 v3) (u4 u6 v4 v6) (u5 u7 v5 v7)
*u = __builtin_shufflevector(uv0, uv1, 0, 4, 1, 5, 8, 12, 9, 13);
*v = __builtin_shufflevector(uv0, uv1, 2, 6, 3, 7, 10, 14, 11, 15);
}
void single_block_split(i32x8 uv, i32x8 *uv_out) {
// (u0 u2 v0 v2) (u1 u3 v1 v3)
*uv_out = __builtin_shufflevector(uv, uv, 0, 4, 1, 5, 2, 6, 3, 7);
}
Compiler Explorer also supports similar operations with Rust's nightly compiler
and std::simd::simd_swizzle!
.
The NEON portions of this code were much easier than AVX2 to write but may not
be optimal. In particular, looking at them with Compiler Explorer shows extra
register moves that seem unnecessary and likely add some latency. Dropping to
inline assembly would allow us to optimize further, at the minor cost of
requiring duplication to support both aarch64
and (32-bit) arm
platforms
if the latter is desired.
One reason no great effort has been made for optimization here is that on macOS, the end goal may be to feed data to Video Toolbox. As that framework accepts a variety of pixel formats and appears to convert them efficiently. The author has not looked into whether this happens by SIMD code or by dedicated silicon, but the result is good nonetheless.