Replace our AES code with the one from MbedTLS 3.6.2 #5591
Conversation
Force-pushed from a1939b7 to 869dc3d
Tested on Intel Linux and on Intel and M1 macOS. Needs testing on other archs. The AES-NI and AES-CE support is now disabled with
Boost (for KeePass) is only 5-6x on super, while it's 12-13x on my Intel MacBook. Edit: Here, it's 9x on super:
BTW the default KeePass benchmark (like in the relbench test above) includes old, low-iteration test vectors. To see the difference better, use
New
That's a 13.78x boost (Intel MacBook, single core), but the speed is of course prohibitively low anyway and shows KeePass is extremely hard to crack even without Argon2. The KDF-AES at this cost will encrypt exactly 1 GB with AES-256 for each candidate - great for testing AES performance - so we're doing 1.93 GB/s per core now, and from some googling I believe that's on par? At least in the right ballpark. The same test vector edit can be made to opencl_keepass_fmt_plug.c for testing our shared OpenCL AES, which comes in two flavors. Using the "bitsliced AES" (default for GPU - I kinda doubt bitsliced is a correct description, but it does some things in parallel), Super's 1080 does 20.5 GB/s, which I believe could be improved by at least 10x (and forcing it to run our older, even worse, vanilla AES, it's down to 1.15 GB/s, which is essentially useless for anything but decrypting a short verifier). Edit: The faster GPU code is at https://www.github.com/cihangirtezcan/CUDA_AES - kudos to them for sharing. I should have a look at it.
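For context, here's a rough sketch of the AES-KDF hot loop being benchmarked, assuming the usual KeePass transform (a 32-byte value encrypted in place with AES-256 ECB for N rounds). It uses OpenSSL's legacy AES API purely for illustration, not John's actual wrappers; the point is just that each round costs 32 bytes of AES-256, so on the order of 33 million rounds is roughly the 1 GB per candidate mentioned above:

```c
#include <stdint.h>
#include <openssl/aes.h>   /* legacy AES API, used here only for illustration */

/* KeePass-style AES-KDF sketch: `seed` is the AES-256 key, `data` is the
 * 32-byte value transformed in place, two 16-byte blocks per round,
 * i.e. rounds * 32 bytes of AES-256 work per candidate. */
static void aes_kdf_sketch(const uint8_t seed[32], uint8_t data[32],
                           uint64_t rounds)
{
    AES_KEY key;

    AES_set_encrypt_key(seed, 256, &key);
    while (rounds--) {
        AES_encrypt(data, data, &key);            /* first 16-byte block */
        AES_encrypt(data + 16, data + 16, &key);  /* second 16-byte block */
    }
}
```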
Tests are passing:
BSD:
Build: freebsd13.4 64-bit x86_64 AVX2 AC OMP
SIMD: AVX2, interleaving: MD4:4 MD5:5 SHA1:2 SHA256:1 SHA512:1
[...]
clang version: 18.1.6 (https://github.com/llvm/llvm-project.git llvmorg-18.1.6-0-g1118c2e05e67) (gcc 4.2.1 compatibility)
[...]
Parsed terminal locale: UTF-8
AES hardware acceleration: AES-NI
Windows:
Build: cygwin 64-bit x86_64 AVX2 AC OMP
SIMD: AVX2, interleaving: MD4:3 MD5:3 SHA1:1 SHA256:1 SHA512:1
[...]
gcc version: 12.4.0
[....]
AES hardware acceleration: AES-NI
Cygwin version: 3.5.4-1.x86_64, 2024-08-25 16:52 UTC
Red Hat 8:
Build: linux-gnu 64-bit x86_64 AVX512BW AC OMP
SIMD: AVX512BW, interleaving: MD4:3 MD5:3 SHA1:1 SHA256:1 SHA512:1
[...]
gcc version: 8.5.0
[...]
Parsed terminal locale: UNDEF
AES hardware acceleration: AES-NI
ARM:
Build: linux-gnu 64-bit aarch64 ASIMD AC OMP
SIMD: ASIMD, interleaving: MD4:2 MD5:2 SHA1:1 SHA256:1 SHA512:1
[...]
gcc version: 13.2.0
[...]
Parsed terminal locale: UTF-8
AES hardware acceleration: AES-CE
But some tests are only possible after merging.
In the examples below, it's not clear to me whether we should say that we're doing nothing or simply keep quiet and omit the extra line.
I think it should be kept as-is. Any platform "can" have AES hardware acceleration of some sort, and if it doesn't OR we do not utilize it, we simply say "no", which makes it very clear: it's all done in software. BTW apparently mbedTLS also supported something called VIA PadLock. I had never heard of it, but I briefly tried to resurrect it (scratching my head) and then realized they had dropped the support in June (but some comments were left in the code).
It's a pity we had to drop `const` from a few places, but I guess otherwise we'd have to modify the mbedTLS code more (adding `const` to files from there)?
I did hear of it - it's one of the earlier (possibly the earliest, like 20 years by now) implementations of hardware crypto primitives acceleration on x86, in VIA's energy-efficient single-core CPUs. So you could have an SBC (industrial or automotive single-board computer) with only passive cooling yet have non-awful cryptographic performance (for its time). There were also some small form factor or blade servers built on those CPUs, e.g. with multi-server chassis offered by Dell and once one of the cheapest rented dedicated server options at OVH, so it was conceivable someone would run John there like 10 years ago. But this is probably irrelevant enough now that it's too late for us to bother introducing support.
You could help them drop the remnants (e.g. do it here, then send them a pull request for the changes).
I notice you update
Not only low, but also two different iteration counts. We should avoid that - e.g., maybe see what hashcat is using and standardize on that?
You might want to open a separate issue for that.
I thought it didn't make sense at all - it wasn't the "cipher key" that was const, but the "AES_CTX". I have yet to investigate how that could ever be const with no warnings.
I notice there are pieces of inline asm code in mbedTLS, which use non-VEX SSE instructions. Hopefully this works OK, but there's risk of it being slow (or of VEX-encoded code being slow afterwards) without
Oh, you could be right. Or could it be specific uses where the context is supposed to remain unchanged?
Oh I thought they did, so I did not 😎 I'll try it out.
I haven't delved much into it but I think it "prefers" the intrinsics (perhaps as long as the compiler supports them? What other cause would there be for having both?) and they are planning to drop the assembler.
Force-pushed from c18eac4 to 54b8fa6
Next commit will amend them and add Makefiles, replacing our older AES-NI aware code that didn't "just work" with any and all formats.
This one supports AES-NI (Intel) and AES-CE (ARM, including Apple Silicon) and does not depend on yasm as it's primarily written in C with intrinsics. Unlike the old code that was only used for o5logon, this code kicks in for any format using AES. Great boosts seen with AES-heavy formats. The AES-CBC function was modified so it accepts sizes not a multiple of block size, and does what OpenSSL and others do: treat the last block as a full one, possibly writing past the end of the output buffer. Closes openwall#4314
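A minimal sketch of the "accept any size, round up to whole blocks" behavior described above (hypothetical wrapper and names, not the actual John code; mbedtls_aes_crypt_cbc itself still requires a multiple of the block size):

```c
#include <stddef.h>
#include <mbedtls/aes.h>

/* Hypothetical wrapper: round a partial trailing block up to a full one.
 * Like OpenSSL's behavior described above, this may write up to 15 bytes
 * past the end of the caller's output buffer, so callers must size for it. */
int aes_cbc_decrypt_any_length(mbedtls_aes_context *ctx, unsigned char iv[16],
                               const unsigned char *in, unsigned char *out,
                               size_t length)
{
    length = (length + 15) / 16 * 16;   /* round up to the AES block size */
    return mbedtls_aes_crypt_cbc(ctx, MBEDTLS_AES_DECRYPT, length, iv, in, out);
}
```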
* \note AESNI is only supported with certain compilers and target options:
* - Visual Studio: supported
* - GCC, x86-64, target not explicitly supporting AESNI:
* requires MBEDTLS_HAVE_ASM.
* - GCC, x86-32, target not explicitly supporting AESNI:
* not supported.
* - GCC, x86-64 or x86-32, target supporting AESNI: supported.
* For this assembly-less implementation, you must currently compile
* `library/aesni.c` and `library/aes.c` with machine options to enable
* SSE2 and AESNI instructions: `gcc -msse2 -maes -mpclmul` or
* `clang -maes -mpclmul`.
* - Non-x86 targets: this option is silently ignored.
* - Other compilers: this option is silently ignored.
*
* \note
* Above, "GCC" includes compatible compilers such as Clang.
* The limitations on target support are likely to be relaxed in the future.

Perhaps we do need some tweak to ensure intrinsics and not asm, but I did just now manually build with
I think I'll merge now and take care of this later. Disregarding the performance drop, we do have

EDIT: issue #5593 opened for this, do not reply here
* just like we do here. - magnum
*/
length = (length + 15) / 16 * 16;
}
Looks like the indentation in this "chug along" block is inconsistent with that of the original file. Sorry I didn't notice this before you inserted the pristine files commit - in fact, I didn't even notice the existence of this block of code (although after my initial review, you mentioned you added it).
Oh right, I meant to fix that but then forgot about it as it was hard to see. I'm used to emacs adopting whatever style is already there but I think I lost that functionality somehow.
printf("AES-NI: not built\n"); | ||
#else | ||
printf("AES-NI: not applicable\n"); | ||
printf("AES hardware acceleration: %s\n", |
I think we should move the AES hardware acceleration line to be near the SIMD line in the beginning, not at the end of the list.
FWIW, testing the merged changes on Intel Tiger Lake, I am seeing a ~15% slowdown for o5logon compared to our old yasm code. Interrupting the old code in gdb, I see it ran this:
For the new, I see this:
or this:
or this:
These new pieces of code even look like they're relatively high overhead - the density of AES-NI instructions is lower.
Looking at
Also, none of the
Also, editing the source files under
Except one, for

Here's an attempt at correcting this for one

+++ b/src/mbedtls/aesni.c
@@ -456,40 +456,40 @@ int mbedtls_aesni_crypt_ecb(mbedtls_aes_context *ctx,
const unsigned char input[16],
unsigned char output[16])
{
- asm ("movdqu (%3), %%xmm0 \n\t" // load input
- "movdqu (%1), %%xmm1 \n\t" // load round key 0
+ asm ("movdqu (%4), %%xmm0 \n\t" // load input
+ "movdqu (%2), %%xmm1 \n\t" // load round key 0
"pxor %%xmm1, %%xmm0 \n\t" // round 0
- "add $16, %1 \n\t" // point to next round key
- "subl $1, %0 \n\t" // normal rounds = nr - 1
- "test %2, %2 \n\t" // mode?
+ "add $16, %2 \n\t" // point to next round key
+ "subl $1, %1 \n\t" // normal rounds = nr - 1
+ "test %3, %3 \n\t" // mode?
"jz 2f \n\t" // 0 = decrypt
"1: \n\t" // encryption loop
- "movdqu (%1), %%xmm1 \n\t" // load round key
+ "movdqu (%2), %%xmm1 \n\t" // load round key
AESENC(xmm1_xmm0) // do round
- "add $16, %1 \n\t" // point to next round key
- "subl $1, %0 \n\t" // loop
+ "add $16, %2 \n\t" // point to next round key
+ "subl $1, %1 \n\t" // loop
"jnz 1b \n\t"
- "movdqu (%1), %%xmm1 \n\t" // load round key
+ "movdqu (%2), %%xmm1 \n\t" // load round key
AESENCLAST(xmm1_xmm0) // last round
#if !defined(MBEDTLS_BLOCK_CIPHER_NO_DECRYPT)
"jmp 3f \n\t"
"2: \n\t" // decryption loop
- "movdqu (%1), %%xmm1 \n\t"
+ "movdqu (%2), %%xmm1 \n\t"
AESDEC(xmm1_xmm0) // do round
- "add $16, %1 \n\t"
- "subl $1, %0 \n\t"
+ "add $16, %2 \n\t"
+ "subl $1, %1 \n\t"
"jnz 2b \n\t"
- "movdqu (%1), %%xmm1 \n\t" // load round key
+ "movdqu (%2), %%xmm1 \n\t" // load round key
AESDECLAST(xmm1_xmm0) // last round
#endif
"3: \n\t"
- "movdqu %%xmm0, (%4) \n\t" // export output
- :
- : "r" (ctx->nr), "r" (ctx->buf + ctx->rk_offset), "r" (mode), "r" (input), "r" (output)
- : "memory", "cc", "xmm0", "xmm1");
+ "movdqu %%xmm0, %0 \n\t" // export output
+ : "+m" (*(uint8_t(*)[16]) output)
+ : "r" (ctx->nr), "r" (ctx->buf + ctx->rk_offset), "r" (mode), "r" (input)
+ : "cc", "xmm0", "xmm1");
return 0;

Other things I noticed while doing this experiment:
This is for just one asm block, but I think others are similar.
Here's an attempt at fixing this for that same first large asm block; the resulting machine code is unchanged for me:

+++ b/src/mbedtls/aesni.c
@@ -456,7 +456,8 @@ int mbedtls_aesni_crypt_ecb(mbedtls_aes_context *ctx,
const unsigned char input[16],
unsigned char output[16])
{
- asm ("movdqu (%3), %%xmm0 \n\t" // load input
+ uint32_t n = ctx->nr, *p = ctx->buf + ctx->rk_offset;
+ asm ("movdqu (%4), %%xmm0 \n\t" // load input
"movdqu (%1), %%xmm1 \n\t" // load round key 0
"pxor %%xmm1, %%xmm0 \n\t" // round 0
"add $16, %1 \n\t" // point to next round key
@@ -486,10 +487,10 @@ int mbedtls_aesni_crypt_ecb(mbedtls_aes_context *ctx,
#endif
"3: \n\t"
- "movdqu %%xmm0, (%4) \n\t" // export output
- :
- : "r" (ctx->nr), "r" (ctx->buf + ctx->rk_offset), "r" (mode), "r" (input), "r" (output)
- : "memory", "cc", "xmm0", "xmm1");
+ "movdqu %%xmm0, %3 \n\t" // export output
+ : "+r" (n), "+r" (p), "+r" (mode), "=m" (*(uint8_t(*)[16]) output)
+ : "r" (input)
+ : "cc", "xmm0", "xmm1");
return 0;

I don't know if we want to be fixing these issues or maybe look into avoiding the asm blocks entirely. Since I think these are upstream code bugs (unless such clobbering is somehow allowed and I'm unaware? unlikely) we might want to notify upstream and maybe submit fixes there.
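For anyone not deep into GCC inline asm: the core problem is modifying an operand that was declared input-only. A minimal standalone x86 illustration (not mbedTLS code), showing the buggy pattern and the `+r` fix used in the patch above:

```c
#include <stdio.h>

/* Buggy: the asm decrements %0, but "n" is declared as an input-only
 * operand ("r"), so the compiler is allowed to assume it is unchanged. */
static int count_down_buggy(int n)
{
    asm ("1: subl $1, %0 \n\t"
         "   jnz 1b      \n\t"
         : /* no outputs */
         : "r" (n)
         : "cc");
    return n; /* may still be the original value, or garbage */
}

/* Fixed: "+r" declares the operand as read-write, like the "+r" (n),
 * "+r" (p), "+r" (mode) constraints in the patch above. */
static int count_down_fixed(int n)
{
    asm ("1: subl $1, %0 \n\t"
         "   jnz 1b      \n\t"
         : "+r" (n)
         :
         : "cc");
    return n; /* reliably 0 */
}

int main(void)
{
    printf("%d %d\n", count_down_buggy(5), count_down_fixed(5));
    return 0;
}
```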
Upon a closer look, all others specify
Sent a PR with this one-liner fix upstream.
I included only the AES pieces in my comment above, but FWIW I also frequently saw it run SHA-1 SHA-NI code from OpenSSL.
Tested,
There is a measurable impact on performance.

$ sudo snap revert john-the-ripper --revision=686 # <== YASM
john-the-ripper reverted to 1.9J1+7df682c
$ john --test --format=o5logon
Will run 8 OpenMP threads
Benchmarking: o5logon, Oracle O5LOGON protocol [SHA1 AES 32/64]... (8xOMP) DONE
Many salts: 27901K c/s real, 3500K c/s virtual
Only one salt: 16433K c/s real, 2060K c/s virtual
$ john --test --format=o5logon
Will run 8 OpenMP threads
Benchmarking: o5logon, Oracle O5LOGON protocol [SHA1 AES 32/64]... (8xOMP) DONE
Many salts: 26853K c/s real, 3373K c/s virtual
Only one salt: 16015K c/s real, 2008K c/s virtual
$ sudo snap revert john-the-ripper --revision=687 # <== mbed lib
john-the-ripper reverted to 1.9J1+b3bd5ea
$ john --test --format=o5logon
Will run 8 OpenMP threads
Benchmarking: o5logon, Oracle O5LOGON protocol [SHA1 AES 32/64]... (8xOMP) DONE
Many salts: 17735K c/s real, 2233K c/s virtual
Only one salt: 12124K c/s real, 1517K c/s virtual
$ john --test --format=o5logon
Will run 8 OpenMP threads
Benchmarking: o5logon, Oracle O5LOGON protocol [SHA1 AES 32/64]... (8xOMP) DONE
Many salts: 17260K c/s real, 2177K c/s virtual
Only one salt: 12263K c/s real, 1535K c/s virtual
[EDITED]
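A quick back-of-the-envelope on the o5logon figures above (plain arithmetic, illustrative only):

```c
#include <stdio.h>

int main(void)
{
    /* Many-salts c/s from the first run of each build above. */
    double yasm = 27901e3, mbed = 17735e3;
    printf("many salts: %.0f%% slower\n", 100.0 * (yasm - mbed) / yasm);

    /* One-salt figures from the same runs. */
    yasm = 16433e3; mbed = 12124e3;
    printf("one salt:   %.0f%% slower\n", 100.0 * (yasm - mbed) / yasm);
    return 0;
}
```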
First, you need to state what hardware was used. I have seen a slight regression on some, and a slight boost on others.
I'm afraid that with no support for AES-NI the values will be around 8M and 6M:
A 2x gain in the worst-case scenario. In any case, I note that there is room for improvement (which will be made if/when possible). I'm not the right person for testing, I don't have the hardware. Anyway, what I have is AMD.
I think @claudioandre-br is right. I did observe a ~15% regression for o5logon on Intel Tiger Lake (one core, as I can't easily make this system otherwise-idle). It is conceivable that it would be ~35% on some other CPU (and at many threads).
Right. I think there's actually a lot of room for improvement relative to the yasm speeds as well if we're ever willing to interleave multiple instances of the computation (different API).
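To illustrate the interleaving idea (a hypothetical sketch with AES-NI intrinsics, assuming pre-expanded AES-128 round keys; not an existing John API): processing two independent blocks per call lets the latency of one aesenc chain hide behind the other.

```c
#include <wmmintrin.h>  /* AES-NI intrinsics; build with -maes */

/* Hypothetical sketch of interleaving: two independent blocks go through
 * the AES-128 rounds together, so the latency of one aesenc chain is
 * hidden behind the other.  rk[] holds the 11 pre-expanded round keys. */
static void aes128_encrypt_x2(const __m128i rk[11], __m128i *a, __m128i *b)
{
    __m128i x = _mm_xor_si128(*a, rk[0]);
    __m128i y = _mm_xor_si128(*b, rk[0]);

    for (int i = 1; i < 10; i++) {
        x = _mm_aesenc_si128(x, rk[i]);
        y = _mm_aesenc_si128(y, rk[i]);
    }
    *a = _mm_aesenclast_si128(x, rk[10]);
    *b = _mm_aesenclast_si128(y, rk[10]);
}
```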
I can reproduce:
In the top Makefile, john depends on aes.a. Should that not suffice? I'm a bit rusty.
Yes, I figured the same. I don't mind keeping this as-is, although it would have been nice for everything to be rebuilt when necessary by a simple top-level
Out of curiosity I had a look at this now and lo and behold, AES encryption/decryption never writes to the ctx. MbedTLS doesn't declare them const though.
Testing on our old "bull" FX-8120, o5logon became about 30% slower with our current Mbed-TLS AES-NI code (with my loop unrolling) than it was with our previous AES-NI code (requiring yasm, which we have installed there). I guess our new key setup is still slower. At the same time, keepass became several times faster since it wasn't using AES-NI before.
I've tried switching from intrinsics to asm (removed the
Also tried reverting my loop unrolling commits while keeping the intrinsics and VAES - got same poor speeds at o5logon, but 20% better speeds at keepass (still a lot slower than the maximum seen with the asm code). So apparently both VAES and unrolling hurt on this CPU.
... but then I tried removing just
Could be an alignment thing. Maybe unlike on newer CPUs, here
No, that's not it. In fact, the generated code uses
Oh, I see now. It first does two 64-bit reads, saves them to a 128-bit aligned stack location, and then reads that with
It is not hard to detect this performance regression [1]. So someone who needs to deal with o5logon might be interested in using an old commit and getting maximum performance with YASM. All other users, on the other hand, have gained something.

[1] https://github.com/openwall/john-packages/wiki/Oracle-O5LOGON-notes#using-cloud-servers
With the loadu/storeu intrinsics, the keepass speeds on "bull" are now better than ever, and reverting the loop unrolling hurts them. So the unrolling is a good thing, after all.
With unrolling reverted:
o5logon still has the known performance regression compared to yasm, and I don't know exactly why we have it - the key setup code doesn't look that bad at first glance, but then I didn't compare it against yasm's closely.
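For reference, a minimal sketch of what using the loadu/storeu intrinsics mentioned above looks like (hypothetical helper, assuming SSE2; not the actual mbedTLS/John code). loadu/storeu compile to movdqu, so the compiler no longer needs the two 64-bit reads plus the copy through an aligned stack slot described earlier:

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128 / _mm_storeu_si128 */
#include <stdint.h>

/* Hypothetical helper: XOR one 16-byte block with a round key using
 * unaligned loads/stores, with no alignment assumptions and no
 * intermediate copy through an aligned stack location. */
static void xor_block_unaligned(uint8_t *dst, const uint8_t *src,
                                const uint8_t *round_key)
{
    __m128i s = _mm_loadu_si128((const __m128i *) src);
    __m128i k = _mm_loadu_si128((const __m128i *) round_key);

    _mm_storeu_si128((__m128i *) dst, _mm_xor_si128(s, k));
}
```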
This one supports AES-NI (Intel) and AES-CE (ARM, including Apple Silicon) and does not depend on yasm as it's written in C with intrinsics. Unlike the old code that was only used for o5logon, this code kicks in for any format using AES. Great boosts seen with AES-heavy formats.
Closes #4314