mbedTLS AES code, intrinsics vs. assembly, alignment #5593

Open
magnumripper opened this issue Nov 30, 2024 · 39 comments

Comments

@magnumripper
Copy link
Member

I believe it builds asm for now. Brief tests with intrinsics show a slight performance drop. From PR comments:

@solardiz said:

I notice there are pieces of inline asm code in mbedTLS, which use non-VEX SSE instructions. Hopefully this works OK, but there's risk of it being slow (or of VEX-encoded code being slow afterwards) without vzeroupper on transitions (which would also be slow, just without the risk of being an order of magnitude slower). I don't suggest changing this yet, just writing down this note.

@magnumripper said:

 * \note AESNI is only supported with certain compilers and target options:
 * - Visual Studio: supported
 * - GCC, x86-64, target not explicitly supporting AESNI:
 *   requires MBEDTLS_HAVE_ASM.
 * - GCC, x86-32, target not explicitly supporting AESNI:
 *   not supported.
 * - GCC, x86-64 or x86-32, target supporting AESNI: supported.
 *   For this assembly-less implementation, you must currently compile
 *   `library/aesni.c` and `library/aes.c` with machine options to enable
 *   SSE2 and AESNI instructions: `gcc -msse2 -maes -mpclmul` or
 *   `clang -maes -mpclmul`.
 * - Non-x86 targets: this option is silently ignored.
 * - Other compilers: this option is silently ignored.
 *
 * \note
 * Above, "GCC" includes compatible compilers such as Clang.
 * The limitations on target support are likely to be relaxed in the future.

Perhaps we do need some tweak to ensure intrinsics and not asm, but I just now manually built with -mavx2 -maes -mpclmul per the above, and that resulted in a 62% larger aes.a and definitely worse performance (not a lot, but worse).

Disregarding the performance drop, we do have @CC_CPU@ from configure.ac to put in Makefile.in (it will add -mavx2 for my laptop), but -maes -mpclmul would need to be added too. I assume those two can be added even for machines not supporting them (because of the cpuid checking), but it would need testing, and they obviously can't be added blindly - the machine could be a SPARC and/or the compiler could be one that hasn't got a clue what those options are.
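For what it's worth, the intrinsics path could also be gated at compile time on the predefined macros that gcc/clang set when those options are in effect - a sketch only; HAVE_AESNI_INTRINSICS is a made-up name, not something in the tree:

/* Sketch: gcc/clang define __AES__ and __PCLMUL__ when -maes and -mpclmul
 * (or an -march that implies them) are in effect, so the intrinsics
 * implementation can be selected only where the compiler actually knows
 * the options, and skipped on e.g. SPARC or with older compilers. */
#if (defined(__x86_64__) || defined(__i386__)) && \
    defined(__AES__) && defined(__PCLMUL__)
#define HAVE_AESNI_INTRINSICS 1 /* machine options allow AES-NI intrinsics */
#else
#define HAVE_AESNI_INTRINSICS 0 /* fall back to asm or plain C */
#endif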

@solardiz
Copy link
Member

Upstream is also considering dropping the assembly in favor of intrinsics: Mbed-TLS/mbedtls#8231

@solardiz
Copy link
Member

solardiz commented Dec 1, 2024

Experiment 1:

I tried adding -maes -mpclmul to CFLAGS inside the mbedtls directory. The aesni.o text section became about twice as large (from 1445 to 3168 bytes). Speeds have changed: o5logon became 5% slower, but keepass 25% faster with intrinsics.

This is without any use of VEX yet, even though most of the rest of john is built with AVX512BW. Apparently, we're not propagating the main SIMD flags from the main CFLAGS to this sub-make - maybe a bug on its own.

Experiment 2:

On top of the above, added also -mavx512bw to CFLAGS inside the mbedtls directory. The aesni.o text section's size is now in between asm and the above, at 2709 bytes, and there are indeed VEX-encoded instructions in there (including vaes*). o5logon speed is also in between asm and the experiment above (so it's partially recovered), keepass is as good as the above.

@solardiz
Copy link
Member

solardiz commented Dec 1, 2024

Apparently, we're not propagating the main SIMD flags from the main CFLAGS to this sub-make - maybe a bug on its own.

Looks the same for other subdirectories we have with third-party crypto code that we build into .a files.
Separately but related, we also have non-VEX inline asm SIMD code in ed25519-donna/ed25519-donna-64bit-x86.h.
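As an aside, the concern here is the classic AVX/SSE transition penalty. A minimal illustration (not code from either tree; the function name is made up):

#include <stddef.h>
#include <immintrin.h>

/* Hypothetical caller: does some AVX2 work, then control passes to code
 * that may execute legacy (non-VEX) SSE instructions, such as the inline
 * asm mentioned above. */
static void avx_work_then_legacy_sse(float *dst, const float *src, size_t n)
{
    size_t i;

    for (i = 0; i + 8 <= n; i += 8)
        _mm256_storeu_ps(dst + i,
                         _mm256_add_ps(_mm256_loadu_ps(dst + i),
                                       _mm256_loadu_ps(src + i)));

    /* Clear the upper YMM halves so that subsequent non-VEX SSE code does
     * not trigger the (potentially order-of-magnitude) transition penalty. */
    _mm256_zeroupper();
}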

@solardiz solardiz added this to the Potentially 2.0.0 milestone Dec 1, 2024
@magnumripper magnumripper self-assigned this Dec 1, 2024
magnumripper added a commit to magnumripper/john that referenced this issue Dec 2, 2024
This is for x86 AES-NI, intrinsics version.

Closes openwall#5593
@magnumripper
Copy link
Member Author

magnumripper commented Dec 2, 2024

For legacy makes, perhaps we could just add -maes -mpclmul for AVX and beyond? Or just ignore it and rely on the asm. The -native targets will get it automagically of course.

@solardiz
Copy link
Member

solardiz commented Dec 2, 2024

For legacy makes, perhaps we could just add -maes -mpclmul for AVX and beyond?

I've just checked - Intel Sandy Bridge - the first microarch to introduce AVX - already had both of these as well. So I was tempted to say yes.

But then it gets interesting - it was also a time of paranoia about possible backdoors in AES-NI, which is probably why Intel added a way to disable AES-NI from the firmware (including in a way that this can't be re-enabled later on a live system), and many systems shipped with it default-disabled (and default-locked)! I still have a server like this where I cannot easily re-enable AES-NI remotely. I wonder what this means not only for our Makefile.legacy, but also for our regular autoconf'ed builds. I may check.

@magnumripper
Copy link
Member Author

Provided the mbedTLS code checks cpuid sufficiently, it's more a matter of the compiler supporting these flags. A compiler supporting AVX should also support -maes.
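For reference, here is what such a runtime check boils down to (a stand-alone sketch; the real check lives in mbedTLS's aesni.c):

#include <cpuid.h>

/* CPUID leaf 1, ECX bit 25 reports AES-NI. On systems where firmware has
 * disabled AES-NI, this bit reads as 0, so a binary built with -maes still
 * falls back correctly as long as this check is made before entering the
 * AES-NI code path. */
static int cpu_has_aesni(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx >> 25) & 1;
}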

magnumripper added a commit to magnumripper/john that referenced this issue Dec 2, 2024
It would already be supported by the -native targets.

See openwall#5593
claudioandre-br pushed a commit to claudioandre-br/JohnTheRipper that referenced this issue Dec 2, 2024
This is for x86 AES-NI, intrinsics version.

Closes openwall#5593
claudioandre-br pushed a commit to claudioandre-br/JohnTheRipper that referenced this issue Dec 2, 2024
It would already be supported by the -native targets.

See openwall#5593
magnumripper added a commit to magnumripper/john that referenced this issue Dec 3, 2024
This is for x86 AES-NI, intrinsics version.

Closes openwall#5593
magnumripper added a commit to magnumripper/john that referenced this issue Dec 3, 2024
It would already be supported by the -native targets.

See openwall#5593
magnumripper added a commit that referenced this issue Dec 3, 2024
It would already be supported by the -native targets.

See #5593
@solardiz
Copy link
Member

solardiz commented Dec 4, 2024

keepass 25% faster with intrinsics.

After merging these changes and doing a clean build, I checked whether this same speedup for keepass appeared - yes, it did. (Along with the expected slight slowdown for o5logon.)

However, here's what I did not expect:

On top of all the merged changes, I went into the mbedtls directory and removed all 3 -m* flags from CFLAGS, and built just this part (like with make clean; make; cd ..; make). And guess what - keepass became faster yet, by another 14% on top of the 25% previously seen from intrinsics. Yet that's with asm code again, so I'd expect it to have become slower like it was with the asm code before. I still have an older john binary using the asm code, and it still runs slower, as expected. So this speedup with re-enabled asm is really puzzling.

It could be that these performance differences we're seeing are not so much asm vs. intrinsics, but e.g. whether something just happens to be well-aligned in a given build/run or not.

@solardiz
Copy link
Member

solardiz commented Dec 4, 2024

Tried this:

+++ b/src/keepass_fmt_plug.c
@@ -176,7 +176,7 @@ static void set_salt(void *salt)
 static int transform_key(char *masterkey, unsigned char *final_key)
 {
        SHA256_CTX ctx;
-       unsigned char hash[32];
+       unsigned char hash[32] JTR_ALIGN(16);
        int ret = 0;
 
        // First, hash the masterkey

The speeds look unchanged - still +25% with intrinsics, +25%+14% with re-enabled asm. So I guess the asm is actually faster, and I guess that previously the alignment just happened to be wrong.

@solardiz
Copy link
Member

solardiz commented Dec 4, 2024

I guess that previously the alignment just happened to be wrong.

I've just tried deliberately misaligning hash (by making it a macro). Got only a slight slowdown. So that's not it. Maybe it's alignment elsewhere?

Anyway, we should probably ensure hash is aligned.
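One way to do that deliberate misalignment, for illustration (a reconstruction of the idea, not the exact edit):

/* Over-allocate an aligned buffer and offset the pointer by one byte, so
 * that `hash` inside transform_key() is guaranteed to be misaligned. */
unsigned char hash_buf[32 + 1] JTR_ALIGN(16);
#define hash (hash_buf + 1)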

@magnumripper
Copy link
Member Author

I didn't run valgrind for a while, perhaps time for that. OTOH, I consistently get 1-7% better speed with intrinsics on my Intel MacBook - could that be a data point? What CPU(s) are you testing? Also, did you try different compilers?

@solardiz
Copy link
Member

solardiz commented Dec 4, 2024

I didn't run valgrind for a while, perhaps time for that.

I'd use perf if I were not running these tests in a VM. Edit: I mean, I can and did use perf even in this VM, but the hardware performance counters are inaccessible, so I only get to see what pieces of code the CPU runs most.

What CPU(s) are you testing? Also, did you try different compilers?

The above is with Tiger Lake, gcc 11, and in a VM. I didn't try anything else yet. I wasn't planning on benchmarking this much. Of course, I'd need a different setup for proper benchmarking, but I was surprised by the relative speed differences here.

@magnumripper
Copy link
Member Author

magnumripper commented Dec 4, 2024

Here are my test results on an i9-13900KF. Speed is for KeePass -cost=50000 using 4 threads, no HT. I ran the lot several times and there were no significant variations.

Original speed with no AES-NI:

Benchmarking: KeePass [AES/Argon2 256/256 AVX2]... (4xOMP) DONE
Speed for cost 1 (t (rounds)) of 50000, cost 2 (m) of 0, cost 3 (p) of 0, cost 4 (KDF [0=Argon2d 2=Argon2id 3=AES]) of 3
Raw:    494 c/s real, 124 c/s virtual

MbedTLS using asm, 10.3x from the above, and baseline for all below

Benchmarking: KeePass [AES/Argon2 256/256 AVX2]... (4xOMP) DONE
Speed for cost 1 (t (rounds)) of 50000, cost 2 (m) of 0, cost 3 (p) of 0, cost 4 (KDF [0=Argon2d 2=Argon2id 3=AES]) of 3
Raw:    5122 c/s real, 1308 c/s virtual

MbedTLS as a lib (trivial patch never committed), +22%

Benchmarking: KeePass [AES/Argon2 256/256 AVX2]... (4xOMP) DONE
Speed for cost 1 (t (rounds)) of 50000, cost 2 (m) of 0, cost 3 (p) of 0, cost 4 (KDF [0=Argon2d 2=Argon2id 3=AES]) of 3
Raw:    6272 c/s real, 1635 c/s virtual

Current bleeding-jumbo (b1d063a), +5% for a total of +29%

Benchmarking: KeePass [AES/Argon2 256/256 AVX2]... (4xOMP) DONE
Speed for cost 1 (t (rounds)) of 50000, cost 2 (m) of 0, cost 3 (p) of 0, cost 4 (KDF [0=Argon2d 2=Argon2id 3=AES]) of 3
Raw:    6606 c/s real, 1676 c/s virtual

MbedTLS intrinsics + @solardiz unroll + alignment -2% (more like ±0 on repeated runs)

Benchmarking: KeePass [AES/Argon2 256/256 AVX2]... (4xOMP) DONE
Speed for cost 1 (t (rounds)) of 50000, cost 2 (m) of 0, cost 3 (p) of 0, cost 4 (KDF [0=Argon2d 2=Argon2id 3=AES]) of 3
Raw:    6456 c/s real, 1638 c/s virtual

MbedTLS intrinsics + @solardiz unroll ±0%

Benchmarking: KeePass [AES/Argon2 256/256 AVX2]... (4xOMP) DONE
Speed for cost 1 (t (rounds)) of 50000, cost 2 (m) of 0, cost 3 (p) of 0, cost 4 (KDF [0=Argon2d 2=Argon2id 3=AES]) of 3
Raw:    6482 c/s real, 1638 c/s virtual

It's puzzling you get better speeds with asm, while I get worse. But here's another random data point:

Using OpenSSL EVP (trivial patch never committed). This is +29% over current bleeding-jumbo 😢 and +67% over baseline

Benchmarking: KeePass [AES/Argon2 256/256 AVX2]... (4xOMP) DONE
Speed for cost 1 (t (rounds)) of 50000, cost 2 (m) of 0, cost 3 (p) of 0, cost 4 (KDF [0=Argon2d 2=Argon2id 3=AES]) of 3
Raw:    8540 c/s real, 2173 c/s virtual 

The EVP test was with as little code as I could get away with (mostly init/update/final). Back when we dropped all/most of EVP from Jumbo, EVP needed lots of silly calls just to enable AES-NI at all, and it also was not thread-safe unless you added callbacks. I'm not sure if the callbacks are still needed, but AES-NI apparently just works now. I haven't looked at the OpenSSL code yet to see why it's so much faster.
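Roughly this kind of thing (a sketch from memory, not the actual patch; the exact cipher/mode used in the format is omitted here):

#include <openssl/evp.h>

/* Minimal one-shot AES-256-ECB encryption via EVP. Modern OpenSSL selects
 * its AES-NI code path internally; no explicit engine setup is needed. */
static int evp_aes256_ecb_encrypt(const unsigned char key[32],
                                  const unsigned char *in, int in_len,
                                  unsigned char *out)
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int len, ok = 0;

    if (ctx &&
        EVP_EncryptInit_ex(ctx, EVP_aes_256_ecb(), NULL, key, NULL) == 1) {
        EVP_CIPHER_CTX_set_padding(ctx, 0); /* raw blocks, no padding */
        ok = EVP_EncryptUpdate(ctx, out, &len, in, in_len) == 1 &&
             EVP_EncryptFinal_ex(ctx, out + len, &len) == 1;
    }
    EVP_CIPHER_CTX_free(ctx);
    return ok;
}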

@solardiz
Copy link
Member

solardiz commented Dec 4, 2024

It's puzzling why you get better speeds with asm, while I get worse.

The thing is, I was also getting worse speeds with asm (and still am, using the old binary I saved) before the switch to intrinsics. I get better speeds with asm when I remove just the -m* flags from mbedtls/Makefile after the switch to intrinsics (i.e., revert to asm). It doesn't appear you have tried that, have you?

@magnumripper
Copy link
Member Author

Ah, my "MbedTLS using asm" baseline is actually f88688e meaning it did get -mavx2 but not the others. Good call, I'll test using the commit before that tomorrow.

@solardiz
Copy link
Member

solardiz commented Dec 5, 2024

my "MbedTLS using asm" baseline is actually f88688e meaning it did get -mavx2 but not the others. Good call, I'll test using the commit before that

That's not what I meant. Sure please do create a better baseline, but also please literally try the revert-to-asm approach where I observed the puzzling speedup - that is, starting from the latest commit, edit the generated mbedtls/Makefile to remove the -m* flags. I don't know why, but for me this resulted in very different performance than we had with asm before.

@magnumripper
Copy link
Member Author

magnumripper commented Dec 5, 2024

But that must be exactly the same as "the commit before that" (which translates to 40752fe). It is the git history's version of mbedtls with no -m flags but including your subl/decl and clobber patches.

Verifying... yep, while aes.a does differ in md5sum (should just be a timestamp; the size is the same), mbedtls/*.o all end up with the same md5sums, as they should.

@magnumripper
Copy link
Member Author

b40a1a2 could be the baseline though. The only difference from 40752fe is it also excludes your patches.

@solardiz
Copy link
Member

While I still don't know why re-enabling asm is, for me, a lot faster than it was before being disabled, I have now managed to achieve even better speeds with intrinsics by disabling MAY_NEED_TO_ALIGN, which was only enabled for the intrinsics case and was apparently responsible for the worse speed I was getting with intrinsics vs. re-enabled asm. I think this setting should be unneeded with our usage, if this comment in aes.c is true and if I understand it correctly:

/* VIA Padlock and our intrinsics-based implementation of AESNI require
 * the round keys to be aligned on a 16-byte boundary. We take care of this
 * before creating them, but the AES context may have moved (this can happen
 * if the library is called from a language with managed memory), and in later
 * calls it might have a different alignment with respect to 16-byte memory.
 * So we may need to realign.

This says that Mbed TLS takes care of the alignment when creating the round keys (I guess placing them accordingly within the larger context?) and I hope we're not copying them to differently-aligned context structs (if we are, we'll need to fix that).
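Put differently, key setup computes an offset into the context's key buffer so that the round keys start on a 16-byte boundary - roughly like this (a simplified rendering of the existing mbedtls_aes_rk_offset() logic, for illustration only):

#include <stdint.h>

/* Round keys live inside ctx->buf; key setup starts them at an offset (in
 * 32-bit words) chosen so they land on a 16-byte boundary. As long as the
 * context is never copied or moved afterwards, that alignment still holds
 * at encryption time, so the per-call realignment should be unnecessary. */
static unsigned rk_offset(const uint32_t *buf)
{
    unsigned delta = ((uintptr_t) buf & 0x0F) / 4; /* misalignment, in words */

    return delta ? 4 - delta : 0; /* words to skip to reach 16-byte alignment */
}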

BTW, I found --format=@aes useful for quicker testing of just the maybe-affected formats.

solardiz added a commit to solardiz/john that referenced this issue Dec 11, 2024
Mbed TLS takes care of the alignment when creating the round keys (placing
them accordingly within the larger context struct) and we hope we're not
copying them to differently-aligned context structs (if we are, we'll need
to fix that).

See openwall#5593
@solardiz
Copy link
Member

FWIW, completely disabling MAY_NEED_TO_ALIGN actually caused some crashes in our CI bots on my fork of the repo - it turns out this also disables taking care of the alignment when creating the round keys. So I am now trying to disable this setting later in the file, near the comment quoted above, so that the change only affects the encryption function. This retains the speedup.

@solardiz
Copy link
Member

Pushed the above change to this repo now (passed testing in my fork).

I also have the below, not pushed since no measurable speedup (but it probably does speed up the key setup a tiny bit by skipping a call to a function provided by aesni.c):

+++ b/src/mbedtls/aes.c
@@ -535,7 +535,12 @@ void mbedtls_aes_xts_free(mbedtls_aes_xts_context *ctx)
 MBEDTLS_MAYBE_UNUSED static unsigned mbedtls_aes_rk_offset(uint32_t *buf)
 {
 #if defined(MAY_NEED_TO_ALIGN)
-    int align_16_bytes = 0;
+    static int align_16_bytes = 0;
+
+    if (align_16_bytes > 0)
+        goto align_16_bytes;
+    if (align_16_bytes < 0)
+        return 0;
 
 #if defined(MBEDTLS_VIA_PADLOCK_HAVE_CODE)
     if (aes_padlock_ace == -1) {
@@ -553,6 +558,7 @@ MBEDTLS_MAYBE_UNUSED static unsigned mbedtls_aes_rk_offset(uint32_t *buf)
 #endif
 
     if (align_16_bytes) {
+align_16_bytes:
         /* These implementations needs 16-byte alignment
          * for the round key array. */
         unsigned delta = ((uintptr_t) buf & 0x0000000fU) / 4;
@@ -561,6 +567,8 @@ MBEDTLS_MAYBE_UNUSED static unsigned mbedtls_aes_rk_offset(uint32_t *buf)
         } else {
             return 4 - delta; // 16 bytes = 4 uint32_t
         }
+    } else {
+        align_16_bytes = -1;
     }
 #else /* MAY_NEED_TO_ALIGN */
     (void) buf;

@magnumripper
Copy link
Member Author

Good stuff. I remember seeing that realigning comment and made a mental note to examine it later, then immediately forgot about it 🙄

BTW, I found --format=@aes useful for quicker testing of just the maybe-affected formats.

Yes, I'm currently revising the OpenCL AES code (using --format=@aes,+opencl for regression testing) and I assumed many formats would lack "AES" in their algorithm name, but to my surprise that is not the case at all.

@solardiz
Copy link
Member

not pushed since no measurable speedup (but it probably does speed up the key setup a tiny bit

@magnumripper Maybe you can benchmark this? My "no measurable speedup" may be simply because my system isn't idle.

@magnumripper
Copy link
Member Author

Will do

@magnumripper magnumripper reopened this Dec 11, 2024
@magnumripper magnumripper changed the title mbedTLS AES code, intrinsics vs. assembly mbedTLS AES code, intrinsics vs. assembly, alignment Dec 11, 2024
@solardiz
Copy link
Member

Tried benchmarking the align_16_bytes caching on our "well", which is otherwise idle, using o5logon. The results are inconclusive. It'd take lots of benchmarks, perhaps of different builds, to get conclusive results.

@solardiz
Copy link
Member

not pushed since no measurable speedup (but it probably does speed up the key setup a tiny bit

And here's maybe why not: for me, mbedtls_aes_rk_offset was getting inlined before, but with these changes it's not anymore.

@solardiz
Copy link
Member

Indeed, with inline added this provides some speedup. So I'll probably push now.

@solardiz
Copy link
Member

Loop unrolling also helps. Now doing it in the same way as aesce.c does it.

@solardiz
Copy link
Member

solardiz commented Dec 11, 2024

Pushed the unrolling. Another way to do it could be:

+++ b/src/mbedtls/aesni.c
@@ -96,16 +96,6 @@ int mbedtls_aesni_crypt_ecb(mbedtls_aes_context *ctx,
 
 #if !defined(MBEDTLS_BLOCK_CIPHER_NO_DECRYPT)
     if (mode == MBEDTLS_AES_DECRYPT) {
-        if (nr == 10)
-            goto rounds_10_dec;
-        if (nr == 12)
-            goto rounds_12_dec;
-        state = _mm_aesdec_si128(state, *++rk);
-        state = _mm_aesdec_si128(state, *++rk);
-rounds_12_dec:
-        state = _mm_aesdec_si128(state, *++rk);
-        state = _mm_aesdec_si128(state, *++rk);
-rounds_10_dec:
         state = _mm_aesdec_si128(state, *++rk);
         state = _mm_aesdec_si128(state, *++rk);
         state = _mm_aesdec_si128(state, *++rk);
@@ -115,22 +105,20 @@ rounds_10_dec:
         state = _mm_aesdec_si128(state, *++rk);
         state = _mm_aesdec_si128(state, *++rk);
         state = _mm_aesdec_si128(state, *++rk);
+        if (__builtin_expect(nr > 10, 1)) {
+            state = _mm_aesdec_si128(state, *++rk);
+            state = _mm_aesdec_si128(state, *++rk);
+            if (__builtin_expect(nr > 12, 1)) {
+                state = _mm_aesdec_si128(state, *++rk);
+                state = _mm_aesdec_si128(state, *++rk);
+            }
+        }
         state = _mm_aesdeclast_si128(state, *++rk);
     } else
 #else
     (void) mode;
 #endif
     {
-        if (nr == 10)
-            goto rounds_10_enc;
-        if (nr == 12)
-            goto rounds_12_enc;
-        state = _mm_aesenc_si128(state, *++rk);
-        state = _mm_aesenc_si128(state, *++rk);
-rounds_12_enc:
-        state = _mm_aesenc_si128(state, *++rk);
-        state = _mm_aesenc_si128(state, *++rk);
-rounds_10_enc:
         state = _mm_aesenc_si128(state, *++rk);
         state = _mm_aesenc_si128(state, *++rk);
         state = _mm_aesenc_si128(state, *++rk);
@@ -140,6 +128,14 @@ rounds_10_enc:
         state = _mm_aesenc_si128(state, *++rk);
         state = _mm_aesenc_si128(state, *++rk);
         state = _mm_aesenc_si128(state, *++rk);
+        if (__builtin_expect(nr > 10, 1)) {
+            state = _mm_aesenc_si128(state, *++rk);
+            state = _mm_aesenc_si128(state, *++rk);
+            if (__builtin_expect(nr > 12, 1)) {
+                state = _mm_aesenc_si128(state, *++rk);
+                state = _mm_aesenc_si128(state, *++rk);
+            }
+        }
         state = _mm_aesenclast_si128(state, *++rk);
     }
 

For me, this other way results in pre-loading of round keys into registers, which is probably unneeded, and it hurts a little bit.

@magnumripper magnumripper removed their assignment Dec 11, 2024
@solardiz
Copy link
Member

I observed that on older gcc (such as gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)), the memcpy used by the intrinsics code is not "free" (is not compiled to a simple [v]movdqu instruction, which it is on newer gcc). Apparently, this causes a significant slowdown (up to 2x on keepass vs. the asm code on that same old system).

We might want to come up with, and switch to, an API that accepts a guaranteed-aligned buffer, but then we'd deviate from upstream Mbed-TLS much more.

@magnumripper
Copy link
Member Author

the memcpy used by the intrinsics code is not "free" (is not compiled to a simple [v]movdqu instruction, which it is on newer gcc)

Could we add an intrinsic for [v]movdqu (if applicable, at compile time) with no overhead?

@solardiz
Copy link
Member

Could we add an intrinsic for [v]movdqu (if applicable, at compile time) with no overhead?

Yes, I think that's _mm_loadu_si128 and _mm_storeu_si128 as appropriate. This could even be accepted upstream.
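Roughly what that would look like for the encrypt path (a simplified sketch, not upstream code; the real function also handles decryption and selects the round count dynamically):

#include <immintrin.h>

/* Load/store the data blocks with explicit unaligned SSE2 intrinsics so
 * that even older compilers emit single [v]movdqu instructions instead of
 * calling memcpy. The round keys (rk) are assumed 16-byte aligned, as
 * arranged by key setup. Build with -maes (or an -march implying it). */
static void aesni_encrypt_block(const __m128i *rk, unsigned nr,
                                const unsigned char input[16],
                                unsigned char output[16])
{
    __m128i state = _mm_loadu_si128((const __m128i *) input);
    unsigned i;

    state = _mm_xor_si128(state, rk[0]);         /* initial AddRoundKey */
    for (i = 1; i < nr; i++)
        state = _mm_aesenc_si128(state, rk[i]);  /* rounds 1 .. nr-1 */
    state = _mm_aesenclast_si128(state, rk[nr]); /* final round */
    _mm_storeu_si128((__m128i *) output, state);
}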

@solardiz
Copy link
Member

Could we add an intrinsic for [v]movdqu (if applicable, at compile time) with no overhead?

I've just implemented this. Testing.

@solardiz
Copy link
Member

I pushed the load/store intrinsics usage and it seems to work well so far.

@solardiz
Copy link
Member

Some other observations are:

  1. We waste about 2% on runtime cached cpuid checks - we could save this by moving them out of the loop (an API change?); see the sketch after this list.
  2. We waste some other little bit of performance on passing the always-zero return value from the lowest-level AES encryption code to higher levels, where we eventually drop it. We could optimize this out, maybe submit such change upstream.
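A sketch of item 1 (hypothetical wrapper, not existing mbedTLS or JtR API; the include paths are an assumption): resolve the capability once per buffer rather than once per block.

#include <stddef.h>
#include "aes.h"
#include "aesni.h"

/* Hypothetical multi-block helper: the (cached) AES-NI support check is
 * consulted once for the whole buffer instead of inside every per-block
 * encryption call. */
static void aes_ecb_encrypt_blocks(mbedtls_aes_context *ctx,
                                   const unsigned char *in,
                                   unsigned char *out, size_t nblocks)
{
    size_t i;

    if (mbedtls_aesni_has_support(MBEDTLS_AESNI_AES)) {
        for (i = 0; i < nblocks; i++, in += 16, out += 16)
            mbedtls_aesni_crypt_ecb(ctx, MBEDTLS_AES_ENCRYPT, in, out);
    } else {
        for (i = 0; i < nblocks; i++, in += 16, out += 16)
            mbedtls_internal_aes_encrypt(ctx, in, out);
    }
}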

@magnumripper
Copy link
Member Author

For future reference: many relevant comments are also in #5591.
