Intel 8360Y AVX-512 #12470
-
+1 for this potential feature request. The 3rd gen Xeon-SP CPU (Ice Lake) has the Crypto-NI AVX instruction set for crypto acceleration, which uses the same software framework as QAT (minus the driver stack). Because these instructions (VAES, GFNI, IFMA, VPCLMULQDQ) come from the Sunny Cove microarchitecture, client CPUs such as 11th gen Tiger Lake have them too. Intel has also just released the Ice Lake-D series for edge deployments, which is suitable for NAS/OpenZFS systems. According to the official doc or my analysis, per-core crypto (AES-256-GCM) throughput can be ~2.x to ~3.x times that of 2nd gen Xeon-SP. To the best of my knowledge, there is no compression acceleration in the Ice Lake CPU itself, but Ice Lake can still use QAT offload. Hash/dedup (SHA256) can also be accelerated by the SHA extensions (SHA-NI), which previous Xeon-SP CPUs don't support. SHA256 throughput can improve by 4~5 times, putting it on par with other CPUs (AMD, ARM).
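To check whether a given Linux box actually exposes the extensions listed above, something like the following sketch may help. It assumes the flag spellings Linux uses in `/proc/cpuinfo` (`vaes`, `vpclmulqdq`, `gfni`, `avx512ifma`, `sha_ni`, and so on); the helper names `parse_flags` and `report` are made up for illustration and are not part of OpenZFS.

```python
# Sketch: report which crypto-related CPU flags the kernel advertises.
# Flag names follow the spelling Linux uses in /proc/cpuinfo (an assumption
# for other platforms, where this file or these names may not exist).
WANTED = ("aes", "vaes", "vpclmulqdq", "gfni", "avx512ifma", "sha_ni", "avx512f")


def parse_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def report(cpuinfo_text, wanted=WANTED):
    """Map each wanted flag name to True/False depending on its presence."""
    flags = parse_flags(cpuinfo_text)
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or file unreadable: everything reports "no"
    for name, present in report(text).items():
        print(f"{name:12s} {'yes' if present else 'no'}")
```

On an Ice Lake Xeon one would expect every entry to report `yes`; on a 2nd gen Xeon-SP, `vaes`, `gfni` and `sha_ni` should come back `no`.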
-
As you've remarked, there's work underway to support existing implementations of various algorithms that can take advantage of AVX extensions. Last I looked, though, the number of algorithms with ready-made AVX512 implementations was limited, since the extensions have only shipped in a small number of chips, and it can be difficult to reformulate an algorithm to take advantage of them if it doesn't neatly fit.

As far as newer QAT support, I've not seen anyone working on it, but I am not an all-knowing oracle. If you've got a working implementation, clean it up if need be to pass the various lint checks and open a PR; I'm sure people with QAT available would enjoy being able to use the hardware, even if it may have complications sometimes.

If you'd like the checksum acceleration PRs to go faster, you could try contributing to them. The BLAKE3 PR (#12918) seems to be trying to make it easier to have generic pluggable hashing with a consistent interface (and has, I believe, a branch separate from the PR with SHA2 implementations integrated, though IDK how updated it is), and there's also another PR (#12549) proposing integrating a number of SHA256/SHA512 vectorized implementations. (You could also make your own, but I'd discourage that if there are already two implementations to decide between.)

If you'd like to integrate something to leverage VAES et al, great, go for it. I've not played with that, because I don't often use OpenZFS native encryption for my own data, and I don't have any VAES-capable hardware, but it should be in a reasonable state to plug in. (It'd be nice if the crypto stuff microbenchmarked the different implementations rather than having a hardcoded "fastest" enumeration, but such is life.)
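To illustrate what "microbenchmark the implementations instead of hardcoding a fastest one" could look like, here is a minimal sketch. Everything in it is hypothetical: `pick_fastest`, the two toy byte-sum functions, and the `CANDIDATES` table are stand-ins for real competing implementations (say, a generic AES-GCM and a VAES one), not OpenZFS code.

```python
import time


def sum_loop(buf):
    """Toy 'generic' implementation: byte sum modulo 2**32, one byte at a time."""
    total = 0
    for b in buf:
        total = (total + b) & 0xFFFFFFFF
    return total


def sum_builtin(buf):
    """Toy 'accelerated' implementation of the same function."""
    return sum(buf) & 0xFFFFFFFF


# Hypothetical registry of interchangeable implementations.
CANDIDATES = {"loop": sum_loop, "builtin": sum_builtin}


def pick_fastest(candidates, sample, rounds=5):
    """Time every candidate on `sample`; return (fastest_name, best_times)."""
    timings = {}
    for name, func in candidates.items():
        best = float("inf")
        for _ in range(rounds):
            start = time.perf_counter()
            func(sample)
            best = min(best, time.perf_counter() - start)
        timings[name] = best  # keep the best of several runs to reduce noise
    fastest = min(timings, key=timings.get)
    return fastest, timings


if __name__ == "__main__":
    name, timings = pick_fastest(CANDIDATES, bytes(range(256)) * 4096)
    for impl, secs in timings.items():
        print(f"{impl:8s} {secs * 1e3:.3f} ms")
    print("selected:", name)
```

The point of the design is that the selection adapts to whatever hardware it runs on, at the cost of a short benchmark at startup.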
As far as other platform-specific optimizations go, fletcher4 already provides vectorized implementations that take it from stupidly fast to preposterously so on various platforms (I should contribute the ones I modified that improve it a bit, though the difference between, say, 35 and 38 GB/s isn't really that significant in most use cases), and the same goes for RAIDZ parity to some extent. It turns out getting LZ4 to vectorize well is difficult enough that the LZ4 devs haven't managed it yet (Intel apparently has an implementation baked into their proprietary libraries, but that's not especially helpful). Newer ZSTD has some limited vectorization hooks that we don't (I believe) currently turn on, because we'd need to pay a bunch of overhead for saving/restoring FPU state, and when I last benchmarked it (I think on zstd 1.5.0), it wasn't really a win a lot of the time. I've not looked at whether anyone's tried vectorizing gzip. (I suppose one could convince the ZLE pass ZFS does to use BMI instructions when available, if nobody's done that already.) If you can find better opportunities for optimization that I haven't mentioned, great! But that's my quick summary of where I'd look for platform-specific gains and the state of them.
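For context on why fletcher4 vectorizes so readily, the scalar recurrence is just four running sums over 32-bit words. A rough reference sketch follows; it assumes little-endian word order and 64-bit wrapping accumulators, and the function name `fletcher4` here is only illustrative, not the OpenZFS entry point.

```python
import struct

MASK64 = (1 << 64) - 1  # accumulators wrap at 64 bits


def fletcher4(data):
    """Scalar fletcher4 over little-endian 32-bit words; returns (a, b, c, d)."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4 bytes")
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64  # plain sum of the words
        b = (b + a) & MASK64     # sum of the running sums
        c = (c + b) & MASK64     # and so on, two more levels
        d = (d + c) & MASK64
    return a, b, c, d


if __name__ == "__main__":
    block = struct.pack("<4I", 1, 2, 3, 4)
    print([hex(x) for x in fletcher4(block)])
```

Each step is only additions with no data-dependent branches, which is the kind of loop that maps well onto wide SIMD lanes; LZ4's match-finding, by contrast, is branchy and data-dependent.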
-
Hi All,
I have built a test bed on the new Intel 8360Y, which has QAT/AVX capability built in through the AVX512 instruction set, and I can see there is work underway to exploit AVX. I previously built a DKMS hack to get zfs to use an 8950 QAT PCIe module. Is there any support for the 8360Y, or can I recompile zfs to take advantage of its capabilities?
I have successfully built QAT-Engine with openssl and ipsec and am getting some very, very good results.
Are there any pointers on taking advantage of this, or any API I should target when preparing a PR, to get zfs exploiting the new 3rd Gen chips?
Many thanks. Any assistance, pointers or help would be greatly appreciated.