Intel 8360Y AVX-512 #12470

jimthedj65 · 2021-08-10T20:35:35Z


Hi All,

I have built a test bed on the new Intel 8360Y, which has QAT/AVX capability built in through the AVX-512 instruction set, and I can see there is work underway to exploit AVX. I previously built a DKMS hack to get ZFS to use an 8950 QAT PCIe module. Is there any support for the 8360Y, or can I recompile ZFS to take advantage of its capabilities?

I have successfully built QAT-Engine with OpenSSL and IPsec and am getting very good results.

Are there any pointers on taking advantage of this, or on an API to target for a PR that gets ZFS exploiting the new 3rd Gen chips?

Many thanks; any assistance, pointers, or help is greatly appreciated.

xutao323 · 2022-03-01T07:29:31Z


+1 for this potential feature request.

The 3rd gen Xeon-SP CPU (Ice Lake) has the Crypto-NI AVX instruction set for crypto acceleration, using the same software framework as QAT (minus the driver stack). As it comes from the Sunny Cove microarchitecture (VAES, GFNI, IFMA, VPCLMULQDQ), desktop CPUs like 11th gen Tiger Lake also have this feature. Intel just released the Ice Lake D series for edge, which is also usable for NAS/OpenZFS. According to the official doc and my own analysis, per-core crypto (AES-256-GCM) throughput can be accelerated by ~2.x to ~3.x times compared to 2nd gen Xeon-SP.

To the best of my knowledge, there's no compression acceleration in the Ice Lake CPU itself, though Ice Lake can still support QAT offload.

Hash/dedup (SHA256) can also be accelerated by the SHA extension (SHA-NI). Previous Xeon-SP CPUs don't support SHA-NI. SHA256 throughput can be accelerated by 4~5 times, on par with other CPUs (AMD, ARM).
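
For anyone wanting to confirm which of these extensions a given box actually exposes, here is a minimal sketch that reads the `flags` line from Linux's `/proc/cpuinfo`. The flag spellings (`vaes`, `vpclmulqdq`, `gfni`, `avx512ifma`, `sha_ni`, `avx512f`) are the names the Linux kernel reports; the helper names `parse_cpu_flags` and `report_features` are made up for this illustration and are not part of OpenZFS.

```python
# Linux /proc/cpuinfo flag names for the extensions mentioned above.
WANTED_FLAGS = {
    "avx512f": "AVX-512 Foundation",
    "avx512ifma": "AVX-512 IFMA",
    "vaes": "Vector AES",
    "vpclmulqdq": "Vector carry-less multiply",
    "gfni": "Galois Field New Instructions",
    "sha_ni": "SHA extensions",
}


def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def report_features(cpuinfo_text, wanted=WANTED_FLAGS):
    """Map each wanted flag name to True/False depending on its presence."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or /proc is unavailable
    for name, present in report_features(text).items():
        status = "yes" if present else "no"
        print(f"{name:12s} {status:3s} ({WANTED_FLAGS[name]})")
```

This only shows what the kernel advertises; whether a given ZFS build uses an extension is a separate question.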

1 reply

jimthedj65 Jul 16, 2022
Author

Thank you for the remarks, and yes, edge with 11th gen is an interesting use case.

rincebrain · 2022-03-01T08:05:44Z

Collaborator

As you've remarked, there's work underway to support existing implementations of various algorithms that can take advantage of AVX extensions. Last I looked, though, the number of algorithms with ready-made AVX-512 implementations was limited, since the extensions have only shipped in a very limited number of chips, and it can be difficult to reformulate an algorithm to take advantage of them if it doesn't neatly fit.

As for newer QAT support, I've not seen anyone working on it, but I am not an all-knowing oracle. If you've got a working implementation, clean it up as needed to pass the various lint checks and open a PR; I'm sure people with QAT available would enjoy being able to use the hardware, even if it may have complications sometimes.

If you'd like the checksum acceleration PRs to go faster, you could try contributing to them - the BLAKE3 PR (#12918) seems to be trying to make it easier to have generic pluggable hashing with a consistent interface (and has, I believe, a branch separate from the PR with SHA2 implementations integrated, though IDK how updated it is), and there's also another PR (#12549) proposing integrating a number of SHA256/SHA512 vectorized implementations. (You could also make your own, but I'd discourage that if there are already two implementations to decide between.)

If you'd like to integrate something to leverage VAES et al, great, go for it. I've not played with that, because I don't often use OpenZFS native encryption for my own data, and I don't have any VAES-capable hardware, but it should be in a reasonable state to plug in. (It'd be nice if the crypto stuff microbenchmarked the different implementations rather than having a hardcoded "fastest" enumeration, but such is life.)
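
To make the "microbenchmark rather than hardcode" idea concrete, here is a rough sketch of the selection pattern: time each candidate on the same buffer and keep the quickest. It is not how the OpenZFS crypto code is structured; `pick_fastest` is a hypothetical helper, and the `hashlib` functions are only stand-ins for what would really be several implementations of the same algorithm (generic, AES-NI, VAES, and so on).

```python
import hashlib
import time


def pick_fastest(candidates, data, rounds=5):
    """Time each callable on `data`; return (name of fastest, all timings).

    The best (minimum) time over `rounds` runs is used for each candidate,
    which reduces the influence of one-off scheduling noise.
    """
    timings = {}
    for name, func in candidates.items():
        best = float("inf")
        for _ in range(rounds):
            start = time.perf_counter()
            func(data)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    # Stand-in candidates; a real selector would compare implementations
    # of one algorithm, not different algorithms.
    candidates = {
        "sha256": lambda d: hashlib.sha256(d).digest(),
        "sha512": lambda d: hashlib.sha512(d).digest(),
        "blake2b": lambda d: hashlib.blake2b(d).digest(),
    }
    winner, timings = pick_fastest(candidates, b"\0" * (1 << 20))
    for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
        print(f"{name:8s} {seconds * 1e3:8.3f} ms")
    print("selected:", winner)
```

The appeal of selecting at runtime is that the choice follows the hardware actually present, instead of a fixed ordering that may not hold on newer chips.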

As far as other platform-specific optimizations go: fletcher4 already provides vectorized implementations that take it from stupidly fast to preposterously so on various platforms (I should contribute the ones I modified that improve it a bit, though the difference between, say, 35 and 38 GB/s isn't really that significant in most use cases), and similarly RAIDZ parity to some extent. It turns out getting LZ4 to vectorize well is difficult enough that the LZ4 devs haven't managed it yet (Intel apparently has an implementation baked into their proprietary libraries, but that's not especially helpful). Newer ZSTD has some limited vectorization hooks that we don't (I believe) currently turn on, because we'd need to pay a bunch of overhead for saving/restoring FPU state, and when I last benchmarked it (on zstd 1.5.0, I think), it wasn't really a win a lot of the time. I've not looked at whether anyone's tried vectorizing gzip. (I suppose one could convince the ZLE pass ZFS does to use BMI instructions when available, if nobody's done that already.)
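
For readers unfamiliar with fletcher4, here is a small scalar sketch of the recurrence it computes: four 64-bit running sums over 32-bit input words. This is an illustration rather than the OpenZFS code; it assumes little-endian words and an input length that is a multiple of 4 bytes, and the function name `fletcher4` is just a label used here.

```python
import struct

MASK64 = (1 << 64) - 1  # emulate uint64_t wraparound


def fletcher4(data):
    """Scalar fletcher4 over little-endian 32-bit words; returns (a, b, c, d)."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4 bytes")
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64  # sum of words
        b = (b + a) & MASK64     # sum of the running a values
        c = (c + b) & MASK64     # sum of the running b values
        d = (d + c) & MASK64     # sum of the running c values
    return a, b, c, d


if __name__ == "__main__":
    sample = struct.pack("<II", 1, 2)
    print(fletcher4(sample))
```

Each step is only a handful of additions, which helps explain why the checksum is already so cheap and why further speedups matter little in most workloads.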

If you can find better opportunities for optimization that I haven't mentioned, great! But that's my quick summary of where I'd look for platform-specific gains and the state of them.

1 reply

jimthedj65 Jul 16, 2022
Author

Thanks, this is really helpful and has given me food for thought; I'm so glad I checked in to see these answers. I will have the hardware soon and will explore going from my 2nd Gen QAT offload with a discrete module to 3rd Gen, and see where we can take advantage.
