
Replace our AES code with the one from MbedTLS 3.6.2 #5591

Merged: 2 commits, Nov 30, 2024

Conversation

magnumripper (Member)

This one supports AES-NI (Intel) and AES-CE (ARM, including Apple Silicon) and does not depend on yasm as it's written in C with intrinsics. Unlike the old code that was only used for o5logon, this code kicks in for any format using AES. Great boosts seen with AES-heavy formats.

Closes #4314
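For reference, "written in C with intrinsics" means code roughly along these lines - a minimal sketch of a single AES-128 block encryption using the AES-NI intrinsics, not the actual mbedTLS implementation (which also handles AES-192/256, decryption, and runtime CPU detection):

#include <immintrin.h>  /* AES-NI intrinsics; needs -maes (or an -march implying it) */

/* Encrypt one 16-byte block; rk[] holds the 11 expanded AES-128 round keys. */
static void aes128_encrypt_block(const __m128i rk[11],
                                 const unsigned char in[16],
                                 unsigned char out[16])
{
    __m128i b = _mm_loadu_si128((const __m128i *) in);

    b = _mm_xor_si128(b, rk[0]);              /* round 0: AddRoundKey */
    for (int i = 1; i < 10; i++)
        b = _mm_aesenc_si128(b, rk[i]);       /* rounds 1..9 */
    b = _mm_aesenclast_si128(b, rk[10]);      /* final round (no MixColumns) */

    _mm_storeu_si128((__m128i *) out, b);
}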

@magnumripper magnumripper force-pushed the AES-NI branch 3 times, most recently from a1939b7 to 869dc3d Compare November 29, 2024 02:04
@magnumripper (Member Author) commented Nov 29, 2024

Tested on Intel Linux and on Intel and M1 macOS. Needs testing on other archs.

The AES-NI and AES-CE support is now disabled by ./configure --disable-simd, and the support status is shown in the --list=build-info output.

@magnumripper (Member Author) commented Nov 29, 2024

The boost (for KeePass) is only 5-6x on super, while it's 12-13x on my Intel MacBook.

Edit: Here, it's 9x on super:

$ ../run/relbench -v before.txt after.txt | sort -k2,2nr -k4,4nr | headtail -n15
Maximum:			9.11809 real, 9.06389 virtual
Ratio:	9.11809 real, 9.06389 virtual	KeePass:Raw
Ratio:	6.26058 real, 6.22896 virtual	AxCrypt:Raw
Ratio:	3.14249 real, 3.15828 virtual	securezip, PKWARE SecureZIP:Only one salt
Ratio:	3.05745 real, 3.04270 virtual	cryptoSafe:Raw
Ratio:	3.04781 real, 3.03258 virtual	securezip, PKWARE SecureZIP:Many salts
Ratio:	2.29021 real, 2.29021 virtual	PuTTY, Private Key (RSA/DSA/ECDSA/ED25519):Raw
Ratio:	1.90971 real, 1.90102 virtual	bitshares, BitShares Wallet:Many salts
Ratio:	1.89716 real, 1.89716 virtual	bitshares, BitShares Wallet:Only one salt
Ratio:	1.87988 real, 1.87988 virtual	o10glogon, Oracle 10g (...) rotocol:Only one salt
Ratio:	1.77551 real, 1.78440 virtual	o10glogon, Oracle 10g-logon protocol:Many salts
Ratio:	1.66070 real, 1.66070 virtual	openssl-enc, OpenSSL  (...) -128, MD5):Many salts
Ratio:	1.64436 real, 1.64436 virtual	openssl-enc, OpenSSL  (...) 8, MD5):Only one salt
Ratio:	1.48109 real, 1.46728 virtual	o5logon, Oracle O5LOGON protocol:Many salts
Ratio:	1.46603 real, 1.48024 virtual	o5logon, Oracle O5LOGON protocol:Only one salt
(... 349 lines skipped ...)
Ratio:	0.97869 real, 0.99372 virtual	net-ah, IPsec AH HMAC-MD5-96:Many salts
Ratio:	0.97845 real, 0.97845 virtual	leet:Many salts
Ratio:	0.97807 real, 0.97807 virtual	netntlmv2, NTLMv2 C/R:Only one salt
Ratio:	0.97558 real, 0.97558 virtual	Raw-MD4:Raw
Ratio:	0.97416 real, 0.97909 virtual	EPI, EPiServer SID:Only one salt
Ratio:	0.97292 real, 0.96804 virtual	EPI, EPiServer SID:Many salts
Ratio:	0.96640 real, 0.97123 virtual	nsec3, DNSSEC NSEC3:Raw
Ratio:	0.96181 real, 0.96181 virtual	Signal, Signal Android:Raw
Ratio:	0.96100 real, 0.95610 virtual	chap, iSCSI CHAP auth (...) EAP-MD5:Only one salt
Ratio:	0.96031 real, 0.96031 virtual	chap, iSCSI CHAP auth (...)  / EAP-MD5:Many salts
Minimum:			0.96031 real, 0.95610 virtual
Number of benchmarks:		372
Geometric standard deviation:	1.22175 real, 1.22137 virtual
Median absolute deviation:	0.00320 real, 0.00474 virtual
Geometric mean:			1.03813 real, 1.03793 virtual

@magnumripper (Member Author) commented Nov 29, 2024

BTW the default KeePass benchmark (as in the relbench test above) includes old, low-iteration test vectors. To see the difference better, use -cost=60000; to see modern, real-life cost, edit keepass_fmt_plug.c, change the define of KEEPASS_REAL_COST_TEST_VECTORS to 1, rebuild, and benchmark with -cost=30000000.
Old

Speed for cost 1 (t (rounds)) of 31250000, cost 2 (m) of 0, cost 3 (p) of 0, cost 4 (KDF [0=Argon2d 2=Argon2id 3=AES]) of 3
Raw:	0.14 c/s real, 0.14 c/s virtual

New

Speed for cost 1 (t (rounds)) of 31250000, cost 2 (m) of 0, cost 3 (p) of 0, cost 4 (KDF [0=Argon2d 2=Argon2id 3=AES]) of 3
Raw:	1.93 c/s real, 1.93 c/s virtual

That's a 13.78x boost (Intel MacBook, single core), but the speed is of course prohibitively low anyway and shows KeePass is extremely hard to crack even without Argon2. The KDF-AES at this cost will encrypt exactly 1 GB with AES-256 for each candidate (31,250,000 rounds over 32 bytes of key material) - great for testing AES performance - so we're doing 1.93 GB/s per core now, and from some googling I believe that's on par? At least in the right ballpark.

The same test vector edit can be made to opencl_keepass_fmt_plug.c for testing our shared OpenCL AES, which comes in two flavors. Using the "bitsliced AES" (the default for GPU - I kind of doubt "bitsliced" is a correct description, but it does do some things in parallel), Super's 1080 does 20.5 GB/s, which I believe could be improved by at least 10x (and forcing it to run our older, even worse, vanilla AES, it drops to 1.15 GB/s, which is essentially useless for anything but decrypting a short verifier).

Edit: The faster GPU code is at https://www.github.com/cihangirtezcan/CUDA_AES - kudos to them for sharing. I should have a look at it.

@claudioandre-br (Member) left a comment

Tests are passing:

BSD:

Build: freebsd13.4 64-bit x86_64 AVX2 AC OMP
SIMD: AVX2, interleaving: MD4:4 MD5:5 SHA1:2 SHA256:1 SHA512:1
[...]
clang version: 18.1.6 (https://github.com/llvm/llvm-project.git llvmorg-18.1.6-0-g1118c2e05e67) (gcc 4.2.1 compatibility)
[...]
Parsed terminal locale: UTF-8
AES hardware acceleration: AES-NI

Windows:

Build: cygwin 64-bit x86_64 AVX2 AC OMP
SIMD: AVX2, interleaving: MD4:3 MD5:3 SHA1:1 SHA256:1 SHA512:1
[...]
gcc version: 12.4.0
[....]
AES hardware acceleration: AES-NI
Cygwin version: 3.5.4-1.x86_64, 2024-08-25 16:52 UTC

Red Hat 8:

Build: linux-gnu 64-bit x86_64 AVX512BW AC OMP
SIMD: AVX512BW, interleaving: MD4:3 MD5:3 SHA1:1 SHA256:1 SHA512:1
[...]
gcc version: 8.5.0
[...]
Parsed terminal locale: UNDEF
AES hardware acceleration: AES-NI

ARM:

Build: linux-gnu 64-bit aarch64 ASIMD AC OMP
SIMD: ASIMD, interleaving: MD4:2 MD5:2 SHA1:1 SHA256:1 SHA512:1
[...]
gcc version: 13.2.0
[...]
Parsed terminal locale: UTF-8
AES hardware acceleration: AES-CE

But some tests are only possible after merging.

@claudioandre-br (Member)

In the examples below, it's not clear to me whether we should say that we're doing nothing or simply keep quiet and omit the extra line.

Build: linux-gnu 64-bit s390x  AC
[...]
AES hardware acceleration: no
Build: linux-gnu 64-bit powerpc64le Altivec AC
SIMD: AltiVec, interleaving: MD4:1 MD5:1 SHA1:1 SHA256:1 SHA512:1
[...]
Parsed terminal locale: UTF-8
AES hardware acceleration: no

@magnumripper (Member Author) commented Nov 29, 2024

In the examples below, it's not clear to me whether we should say that we're doing nothing or simply keep quiet and omit the extra line.

Build: linux-gnu 64-bit s390x  AC
[...]
AES hardware acceleration: no
Build: linux-gnu 64-bit powerpc64le Altivec AC
SIMD: AltiVec, interleaving: MD4:1 MD5:1 SHA1:1 SHA256:1 SHA512:1
[...]
AES hardware acceleration: no

I think it should be kept as-is. Any platform "can" have AES hardware acceleration of some sort, and if it doesn't, OR we do not utilize it, we simply say "no", which makes it very clear: it's all done in software.

BTW apparently mbedTLS also supported something called VIA Padlock. I never heard of it but I briefly tried to resurrect it (scratching my head) and then realized they had dropped the support in June (but some comments were left in the code).

@solardiz (Member) left a comment

It's a pity we had to drop const from a few places, but I guess otherwise we'd have to modify the mbedTLS code more (adding const to files from there)?

Review threads (resolved):
src/configure.ac
src/enpass_fmt_plug.c
src/mbedtls/Makefile.in
src/mbedtls/Makefile.in (outdated)
src/mbedtls/Makefile.legacy (outdated)
src/mbedtls/mbedtls_config.h
@solardiz (Member)

VIA Padlock. I never heard of it

I did hear of it - it's one of the earlier (possibly the earliest, like 20 years by now) implementations of hardware crypto primitives acceleration on x86, in VIA's energy-efficient single-core CPUs. So you could have an SBC (industrial or automotive single-board computer) with only passive cooling yet have non-awful cryptographic performance (for its time). There were also some small form factor or blade servers built on those CPUs, e.g. with multi-server chassis offered by Dell and once one of the cheapest rented dedicated server options at OVH, so it was conceivable someone would run John there like 10 years ago. But this is probably irrelevant enough now that it's too late for us to bother introducing support.

realized they had dropped the support in June (but some comments were left in the code).

You could help them drop the remnants (e.g. do it here, then send them a pull request for the changes).

@solardiz (Member)

I notice you update Makefile.legacy - that's great. Have you tested it? Our bots currently don't.

the default KeePass benchmark (like in the relbench test above) includes old low iteration test vectors.

Not only low, but also two different iteration counts. We should avoid that - e.g., maybe see what hashcat is using and standardize on that?

The faster GPU code is at https://www.github.com/cihangirtezcan/CUDA_AES - kudos to them for sharing. I should have a look at it.

You could want to open a separate issue for that.

@magnumripper (Member Author)

It's a pity we had to drop const from a few places, but I guess otherwise we'd have to modify the mbedTLS code more (adding const to files from there)?

I thought it didn't make sense at all - it wasn't the "cipher key" that was const, but the "AES_CTX". I have yet to investigate how that could ever be const with no warnings.

@solardiz (Member)

I notice there are pieces of inline asm code in mbedTLS, which use non-VEX SSE instructions. Hopefully this works OK, but there's risk of it being slow (or of VEX-encoded code being slow afterwards) without vzeroupper on transitions (which would also be slow, just without the risk of being an order of magnitude slower). I don't suggest changing this yet, just writing down this note.
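For context, the mitigation referred to is an explicit vzeroupper when leaving VEX/AVX code - a minimal sketch using the standard intrinsic, illustrative only, not something the tree currently does:

#include <immintrin.h>  /* _mm256_zeroupper(); requires an AVX-enabled translation unit */

/* Clear the upper halves of the YMM registers before entering legacy-SSE
   (non-VEX) code, avoiding the SSE/AVX transition penalty described above. */
static inline void leave_avx_region(void)
{
    _mm256_zeroupper();   /* emits the vzeroupper instruction */
}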

@solardiz (Member)

I thought it didn't make sense at all - it wasn't the "cipher key" that was const, but the "AES_CTX". I have yet to investigate how that could ever be const with no warnings.

Oh, you could be right. Or could it be specific uses where the context is supposed to remain unchanged?

@magnumripper (Member Author)

I notice you update Makefile.legacy - that's great. Have you tested it? Our bots currently don't.

Oh I thought they did, so I did not 😎 I'll try it out.

@magnumripper (Member Author)

I notice there are pieces of inline asm code in mbedTLS, which use non-VEX SSE instructions. Hopefully this works OK, but there's risk of it being slow (or of VEX-encoded code being slow afterwards) without vzeroupper on transitions (which would also be slow, just without the risk of being an order of magnitude slower). I don't suggest changing this yet, just writing down this note.

I haven't delved much into it but I think it "prefers" the intrinsics (perhaps as long as the compiler supports them? What other cause would there be for having both?) and they are planning to drop the assembler.

@magnumripper magnumripper force-pushed the AES-NI branch 3 times, most recently from c18eac4 to 54b8fa6 Compare November 29, 2024 23:05
Next commit will amend them and add Makefiles, replacing our older AES-NI
aware code that didn't "just work" with all and any format.
This one supports AES-NI (Intel) and AES-CE (ARM, including Apple Silicon)
and does not depend on yasm as it's primarily written in C with intrinsics.
Unlike the old code that was only used for o5logon, this code kicks in for
any format using AES.  Great boosts seen with AES-heavy formats.

The AES-CBC function was modified so it accepts sizes not a multiple of the
block size, and does what OpenSSL and others do: treat the last block as a
full one, possibly writing past the end of the output buffer.

Closes openwall#4314
@magnumripper (Member Author) commented Nov 30, 2024

 * \note AESNI is only supported with certain compilers and target options:
 * - Visual Studio: supported
 * - GCC, x86-64, target not explicitly supporting AESNI:
 *   requires MBEDTLS_HAVE_ASM.
 * - GCC, x86-32, target not explicitly supporting AESNI:
 *   not supported.
 * - GCC, x86-64 or x86-32, target supporting AESNI: supported.
 *   For this assembly-less implementation, you must currently compile
 *   `library/aesni.c` and `library/aes.c` with machine options to enable
 *   SSE2 and AESNI instructions: `gcc -msse2 -maes -mpclmul` or
 *   `clang -maes -mpclmul`.
 * - Non-x86 targets: this option is silently ignored.
 * - Other compilers: this option is silently ignored.
 *
 * \note
 * Above, "GCC" includes compatible compilers such as Clang.
 * The limitations on target support are likely to be relaxed in the future.

Perhaps we do need some tweak to ensure intrinsics and not asm, but I did just now manually build with -mavx2 -maes -mpclmul per above, and that resulted in a 62% larger aes.a and definitely worse performance (not a lot, but worse).

I think I'll merge now and take care of this later.

Disregarding the performance drop, we do have @CC_CPU@ from configure.ac to put in Makefile.in (it will add -mavx2 for my laptop), but -maes -mpclmul would need to be added too. I assume those two can be added even for machines not supporting the instructions (because of the cpuid checking), but it would need testing, and they obviously can't be blindly added - the machine could be a Sparc and/or the compiler could be one that hasn't got a clue what those options are.
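A compile-time guard along these lines is one possible shape for such a tweak - a sketch only, with a hypothetical macro name; __SSE2__, __AES__ and __PCLMUL__ are the macros GCC/Clang predefine when the corresponding -m options (or an -march implying them) are in effect:

/* Select the intrinsics path only when the compiler was actually given
   the required machine options; otherwise fall back (asm or plain C). */
#if defined(__SSE2__) && defined(__AES__) && defined(__PCLMUL__)
#  define JTR_HAVE_AESNI_INTRINSICS 1   /* hypothetical name */
#else
#  define JTR_HAVE_AESNI_INTRINSICS 0
#endif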

EDIT: issue #5593 opened for this, do not reply here

@magnumripper magnumripper merged commit b40a1a2 into openwall:bleeding-jumbo Nov 30, 2024
35 of 36 checks passed
@magnumripper magnumripper deleted the AES-NI branch November 30, 2024 01:05
* just like we do here. - magnum
*/
length = (length + 15) / 16 * 16;
}
Review comment (Member):

Looks like the indentation in this "chug along" block is inconsistent with that of the original file. Sorry I didn't notice this before you inserted the pristine files commit - in fact, I didn't even notice the existence of this block of code (although after my initial review, you mentioned you added it).

Reply (Member Author):

Oh right, I meant to fix that but then forgot about it as it was hard to see. I'm used to emacs adopting whatever style is already there but I think I lost that functionality somehow.

printf("AES-NI: not built\n");
#else
printf("AES-NI: not applicable\n");
printf("AES hardware acceleration: %s\n",
Review comment (Member):

I think we should move the AES hardware acceleration line to be near the SIMD line at the beginning, not at the end of the list.

@solardiz (Member)

FWIW, testing the merged changes on Intel Tiger Lake, I am seeing a ~15% slowdown for o5logon compared to our old yasm code. Interrupting the old code in gdb, I see it ran this:

(gdb) disass
Dump of assembler code for function iDecExpandKey192:
   0x000000000075dac0 <+0>:	mov    %rdi,%rcx
   0x000000000075dac3 <+3>:	mov    %rsi,%rdx
   0x000000000075dac6 <+6>:	push   %rcx
   0x000000000075dac7 <+7>:	push   %rdx
   0x000000000075dac8 <+8>:	sub    $0x18,%rsp
   0x000000000075dacc <+12>:	call   0x75d7c0 <iEncExpandKey192>
   0x000000000075dad1 <+17>:	add    $0x18,%rsp
   0x000000000075dad5 <+21>:	pop    %rdx
   0x000000000075dad6 <+22>:	pop    %rcx
   0x000000000075dad7 <+23>:	movdqu 0x10(%rdx),%xmm1
   0x000000000075dadc <+28>:	aesimc %xmm1,%xmm0
   0x000000000075dae1 <+33>:	movdqu %xmm0,0x10(%rdx)
   0x000000000075dae6 <+38>:	movdqu 0x20(%rdx),%xmm1
   0x000000000075daeb <+43>:	aesimc %xmm1,%xmm0
   0x000000000075daf0 <+48>:	movdqu %xmm0,0x20(%rdx)
   0x000000000075daf5 <+53>:	movdqu 0x30(%rdx),%xmm1
   0x000000000075dafa <+58>:	aesimc %xmm1,%xmm0
   0x000000000075daff <+63>:	movdqu %xmm0,0x30(%rdx)
   0x000000000075db04 <+68>:	movdqu 0x40(%rdx),%xmm1
   0x000000000075db09 <+73>:	aesimc %xmm1,%xmm0
=> 0x000000000075db0e <+78>:	movdqu %xmm0,0x40(%rdx)
   0x000000000075db13 <+83>:	movdqu 0x50(%rdx),%xmm1
   0x000000000075db18 <+88>:	aesimc %xmm1,%xmm0

For the new, I see this:

(gdb) disass
Dump of assembler code for function mbedtls_aesni_crypt_ecb:
   0x000000000075a5d0 <+0>:	mov    %rdx,%r8
   0x000000000075a5d3 <+3>:	mov    0x8(%rdi),%rdx
   0x000000000075a5d7 <+7>:	mov    (%rdi),%eax
   0x000000000075a5d9 <+9>:	lea    0x10(%rdi,%rdx,4),%rdx
   0x000000000075a5de <+14>:	movdqu (%r8),%xmm0
   0x000000000075a5e3 <+19>:	movdqu (%rdx),%xmm1
   0x000000000075a5e7 <+23>:	pxor   %xmm1,%xmm0
   0x000000000075a5eb <+27>:	add    $0x10,%rdx
   0x000000000075a5ef <+31>:	sub    $0x1,%eax
   0x000000000075a5f2 <+34>:	test   %esi,%esi
   0x000000000075a5f4 <+36>:	je     0x75a613 <mbedtls_aesni_crypt_ecb+67>
   0x000000000075a5f6 <+38>:	movdqu (%rdx),%xmm1
   0x000000000075a5fa <+42>:	aesenc %xmm1,%xmm0
   0x000000000075a5ff <+47>:	add    $0x10,%rdx
   0x000000000075a603 <+51>:	sub    $0x1,%eax
   0x000000000075a606 <+54>:	jne    0x75a5f6 <mbedtls_aesni_crypt_ecb+38>
   0x000000000075a608 <+56>:	movdqu (%rdx),%xmm1
   0x000000000075a60c <+60>:	aesenclast %xmm1,%xmm0
   0x000000000075a611 <+65>:	jmp    0x75a62e <mbedtls_aesni_crypt_ecb+94>
   0x000000000075a613 <+67>:	movdqu (%rdx),%xmm1
   0x000000000075a617 <+71>:	aesdec %xmm1,%xmm0
=> 0x000000000075a61c <+76>:	add    $0x10,%rdx
   0x000000000075a620 <+80>:	sub    $0x1,%eax
   0x000000000075a623 <+83>:	jne    0x75a613 <mbedtls_aesni_crypt_ecb+67>
   0x000000000075a625 <+85>:	movdqu (%rdx),%xmm1
   0x000000000075a629 <+89>:	aesdeclast %xmm1,%xmm0
   0x000000000075a62e <+94>:	movdqu %xmm0,(%rcx)
   0x000000000075a632 <+98>:	xor    %eax,%eax
   0x000000000075a634 <+100>:	ret

or this:

623	        asm ("movdqu (%0), %%xmm0       \n\t"
(gdb) disass
Dump of assembler code for function mbedtls_aesni_inverse_key:
   0x000000000075a7b0 <+0>:	mov    %edx,%ecx
   0x000000000075a7b2 <+2>:	add    $0x10,%rdi
   0x000000000075a7b6 <+6>:	shl    $0x4,%ecx
   0x000000000075a7b9 <+9>:	movslq %ecx,%rcx
   0x000000000075a7bc <+12>:	add    %rsi,%rcx
   0x000000000075a7bf <+15>:	movdqu (%rcx),%xmm0
   0x000000000075a7c3 <+19>:	lea    -0x10(%rcx),%r8
   0x000000000075a7c7 <+23>:	movups %xmm0,-0x10(%rdi)
   0x000000000075a7cb <+27>:	cmp    %r8,%rsi
   0x000000000075a7ce <+30>:	jae    0x75a80e <mbedtls_aesni_inverse_key+94>
   0x000000000075a7d0 <+32>:	mov    %r8,%rax
   0x000000000075a7d3 <+35>:	mov    %rdi,%rdx
   0x000000000075a7d6 <+38>:	cs nopw 0x0(%rax,%rax,1)
   0x000000000075a7e0 <+48>:	movdqu (%rax),%xmm0
   0x000000000075a7e4 <+52>:	aesimc %xmm0,%xmm0
=> 0x000000000075a7e9 <+57>:	movdqu %xmm0,(%rdx)
   0x000000000075a7ed <+61>:	sub    $0x10,%rax
   0x000000000075a7f1 <+65>:	add    $0x10,%rdx
   0x000000000075a7f5 <+69>:	cmp    %rax,%rsi
   0x000000000075a7f8 <+72>:	jb     0x75a7e0 <mbedtls_aesni_inverse_key+48>
   0x000000000075a7fa <+74>:	sub    %rsi,%rcx
   0x000000000075a7fd <+77>:	lea    -0x11(%rcx),%rax
   0x000000000075a801 <+81>:	not    %rax
   0x000000000075a804 <+84>:	and    $0xfffffffffffffff0,%rax
   0x000000000075a808 <+88>:	add    %rax,%r8
   0x000000000075a80b <+91>:	sub    %rax,%rdi
   0x000000000075a80e <+94>:	movdqu (%r8),%xmm1
   0x000000000075a813 <+99>:	movups %xmm1,(%rdi)
   0x000000000075a816 <+102>:	ret    

or this:

692	    asm ("movdqu (%1), %%xmm0   \n\t" // copy original round key
(gdb) disass
Dump of assembler code for function mbedtls_aesni_setkey_enc:
   0x000000000075a820 <+0>:	cmp    $0xc0,%rdx
   0x000000000075a827 <+7>:	je     0x75a9d0 <mbedtls_aesni_setkey_enc+432>
   0x000000000075a82d <+13>:	cmp    $0x100,%rdx
   0x000000000075a834 <+20>:	je     0x75a900 <mbedtls_aesni_setkey_enc+224>
   0x000000000075a83a <+26>:	mov    $0xffffffe0,%eax
   0x000000000075a83f <+31>:	cmp    $0x80,%rdx
   0x000000000075a846 <+38>:	je     0x75a850 <mbedtls_aesni_setkey_enc+48>
   0x000000000075a848 <+40>:	ret    
   0x000000000075a849 <+41>:	nopl   0x0(%rax)
   0x000000000075a850 <+48>:	movdqu (%rsi),%xmm0
   0x000000000075a854 <+52>:	movdqu %xmm0,(%rdi)
   0x000000000075a858 <+56>:	jmp    0x75a887 <mbedtls_aesni_setkey_enc+103>
   0x000000000075a85a <+58>:	pshufd $0xff,%xmm1,%xmm1
   0x000000000075a85f <+63>:	pxor   %xmm0,%xmm1
   0x000000000075a863 <+67>:	pslldq $0x4,%xmm0
   0x000000000075a868 <+72>:	pxor   %xmm0,%xmm1
   0x000000000075a86c <+76>:	pslldq $0x4,%xmm0
   0x000000000075a871 <+81>:	pxor   %xmm0,%xmm1
   0x000000000075a875 <+85>:	pslldq $0x4,%xmm0
   0x000000000075a87a <+90>:	pxor   %xmm1,%xmm0
   0x000000000075a87e <+94>:	add    $0x10,%rdi
   0x000000000075a882 <+98>:	movdqu %xmm0,(%rdi)
   0x000000000075a886 <+102>:	ret    
   0x000000000075a887 <+103>:	aeskeygenassist $0x1,%xmm0,%xmm1
   0x000000000075a88d <+109>:	call   0x75a85a <mbedtls_aesni_setkey_enc+58>
   0x000000000075a892 <+114>:	aeskeygenassist $0x2,%xmm0,%xmm1
   0x000000000075a898 <+120>:	call   0x75a85a <mbedtls_aesni_setkey_enc+58>
   0x000000000075a89d <+125>:	aeskeygenassist $0x4,%xmm0,%xmm1
   0x000000000075a8a3 <+131>:	call   0x75a85a <mbedtls_aesni_setkey_enc+58>
   0x000000000075a8a8 <+136>:	aeskeygenassist $0x8,%xmm0,%xmm1
   0x000000000075a8ae <+142>:	call   0x75a85a <mbedtls_aesni_setkey_enc+58>
   0x000000000075a8b3 <+147>:	aeskeygenassist $0x10,%xmm0,%xmm1
   0x000000000075a8b9 <+153>:	call   0x75a85a <mbedtls_aesni_setkey_enc+58>
   0x000000000075a8be <+158>:	aeskeygenassist $0x20,%xmm0,%xmm1
   0x000000000075a8c4 <+164>:	call   0x75a85a <mbedtls_aesni_setkey_enc+58>
   0x000000000075a8c9 <+169>:	aeskeygenassist $0x40,%xmm0,%xmm1
   0x000000000075a8cf <+175>:	call   0x75a85a <mbedtls_aesni_setkey_enc+58>
   0x000000000075a8d4 <+180>:	aeskeygenassist $0x80,%xmm0,%xmm1
   0x000000000075a8da <+186>:	call   0x75a85a <mbedtls_aesni_setkey_enc+58>
   0x000000000075a8df <+191>:	aeskeygenassist $0x1b,%xmm0,%xmm1
   0x000000000075a8e5 <+197>:	call   0x75a85a <mbedtls_aesni_setkey_enc+58>
   0x000000000075a8ea <+202>:	aeskeygenassist $0x36,%xmm0,%xmm1
   0x000000000075a8f0 <+208>:	call   0x75a85a <mbedtls_aesni_setkey_enc+58>
   0x000000000075a8f5 <+213>:	xor    %eax,%eax
   0x000000000075a8f7 <+215>:	ret    

These new pieces of code even look like they have relatively high overhead - the density of AES-NI instructions is lower.

@solardiz (Member)

I am seeing a ~15% slowdown for o5logon compared to our old yasm code

Looking at o5logon_fmt_plug.c, I see it does indeed set up a new AES key for every tiny AES_cbc_encrypt (not as tiny as a single block, but few blocks). This is probably suboptimal for either implementation, so hopefully we don't have such slowdown for most other formats using AES much.
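Schematically, the pattern being described looks like this - a sketch, not the actual o5logon code, using the OpenSSL-style AES API; the 24-byte (AES-192) key size is illustrative:

#include <openssl/aes.h>
#include <string.h>

/* One candidate = one full key schedule followed by a tiny CBC decryption,
   so the key setup cost is never amortized over much data. */
static void try_candidate(const unsigned char key[24], const unsigned char iv_in[16],
                          const unsigned char *ct, size_t len, unsigned char *pt)
{
    AES_KEY akey;
    unsigned char iv[16];

    memcpy(iv, iv_in, sizeof(iv));          /* CBC updates the IV buffer */
    AES_set_decrypt_key(key, 192, &akey);   /* per-candidate key setup */
    AES_cbc_encrypt(ct, pt, len, &akey, iv, AES_DECRYPT);   /* only a few blocks */
}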

@solardiz (Member)

Also, none of the asm directives in mbedtls/aesni.c specify any outputs - instead, they specify that they clobber all memory. I don't know if whoever wrote this was simply ignorant or felt this was somehow safer, but anyhow this is suboptimal since it doesn't let the compiler cache any other variables in registers across the asm blocks. In contrast, mbedtls/aesce.c does specify outputs, including in one place in a pretty advanced way (and I can see how it could be error-prone when an array is output).
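A minimal illustration of the difference (a toy x86 example, not mbedTLS code): when the asm statement names its output and avoids the blanket "memory" clobber, the compiler remains free to keep unrelated variables in registers across it.

#include <stdint.h>

static inline uint32_t inc_asm(uint32_t x)
{
    asm ("addl $1, %0"
         : "+r" (x)    /* x is declared as read and written */
         :             /* no separate inputs */
         : "cc");      /* only the flags are clobbered - no "memory" */
    return x;
}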

@solardiz (Member)

Also, editing the source files under mbedtls and running make at the top-level does not rebuild aes.a. Looks like we only build it if it's missing. Perhaps same for other .a files we build/use?

@solardiz (Member)

none of the asm directives in mbedtls/aesni.c specify any outputs

Except one, for cpuid.

Here's an attempt at correcting this for one asm block, but it results in exactly the same code generated as without these changes, perhaps because this is the entirety of the non-inline function anyway:

+++ b/src/mbedtls/aesni.c
@@ -456,40 +456,40 @@ int mbedtls_aesni_crypt_ecb(mbedtls_aes_context *ctx,
                             const unsigned char input[16],
                             unsigned char output[16])
 {
-    asm ("movdqu    (%3), %%xmm0    \n\t" // load input
-         "movdqu    (%1), %%xmm1    \n\t" // load round key 0
+    asm ("movdqu    (%4), %%xmm0    \n\t" // load input
+         "movdqu    (%2), %%xmm1    \n\t" // load round key 0
          "pxor      %%xmm1, %%xmm0  \n\t" // round 0
-         "add       $16, %1         \n\t" // point to next round key
-         "subl      $1, %0          \n\t" // normal rounds = nr - 1
-         "test      %2, %2          \n\t" // mode?
+         "add       $16, %2         \n\t" // point to next round key
+         "subl      $1, %1          \n\t" // normal rounds = nr - 1
+         "test      %3, %3          \n\t" // mode?
          "jz        2f              \n\t" // 0 = decrypt
 
          "1:                        \n\t" // encryption loop
-         "movdqu    (%1), %%xmm1    \n\t" // load round key
+         "movdqu    (%2), %%xmm1    \n\t" // load round key
          AESENC(xmm1_xmm0)                // do round
-         "add       $16, %1         \n\t" // point to next round key
-         "subl      $1, %0          \n\t" // loop
+         "add       $16, %2         \n\t" // point to next round key
+         "subl      $1, %1          \n\t" // loop
          "jnz       1b              \n\t"
-         "movdqu    (%1), %%xmm1    \n\t" // load round key
+         "movdqu    (%2), %%xmm1    \n\t" // load round key
          AESENCLAST(xmm1_xmm0)            // last round
 #if !defined(MBEDTLS_BLOCK_CIPHER_NO_DECRYPT)
          "jmp       3f              \n\t"
 
          "2:                        \n\t" // decryption loop
-         "movdqu    (%1), %%xmm1    \n\t"
+         "movdqu    (%2), %%xmm1    \n\t"
          AESDEC(xmm1_xmm0)                // do round
-         "add       $16, %1         \n\t"
-         "subl      $1, %0          \n\t"
+         "add       $16, %2         \n\t"
+         "subl      $1, %1          \n\t"
          "jnz       2b              \n\t"
-         "movdqu    (%1), %%xmm1    \n\t" // load round key
+         "movdqu    (%2), %%xmm1    \n\t" // load round key
          AESDECLAST(xmm1_xmm0)            // last round
 #endif
 
          "3:                        \n\t"
-         "movdqu    %%xmm0, (%4)    \n\t" // export output
-         :
-         : "r" (ctx->nr), "r" (ctx->buf + ctx->rk_offset), "r" (mode), "r" (input), "r" (output)
-         : "memory", "cc", "xmm0", "xmm1");
+         "movdqu    %%xmm0, %0      \n\t" // export output
+         : "+m" (*(uint8_t(*)[16]) output)
+         : "r" (ctx->nr), "r" (ctx->buf + ctx->rk_offset), "r" (mode), "r" (input)
+         : "cc", "xmm0", "xmm1");
 
 
     return 0;

Other things I noticed while doing this experiment:

  1. The code clobbers input registers. That's a bug - those should have been specified as input/output for that. I guess the only reason it didn't show up so far is because this is a separate non-inline function. I didn't correct this above.
  2. The usage of sub instead of dec is rather old-fashioned - could briefly make sense on the original Pentium, but perhaps not on any CPU with AES-NI.
  3. The usage of unaligned-supporting movdqu instructions should have no performance impact on modern CPUs when there's proper alignment anyway, but still maybe we could ensure alignment and use movdqa (and be alerted when there's no alignment, so we can fix rather than silently have performance impact).

This is for just one asm block, but I think others are similar.

@solardiz (Member)

The code clobbers input registers. That's a bug - those should have been specified as input/output for that.

Here's an attempt at fixing this for that same first large asm block, the resulting machine code is unchanged for me:

+++ b/src/mbedtls/aesni.c
@@ -456,7 +456,8 @@ int mbedtls_aesni_crypt_ecb(mbedtls_aes_context *ctx,
                             const unsigned char input[16],
                             unsigned char output[16])
 {
-    asm ("movdqu    (%3), %%xmm0    \n\t" // load input
+    uint32_t n = ctx->nr, *p = ctx->buf + ctx->rk_offset;
+    asm ("movdqu    (%4), %%xmm0    \n\t" // load input
          "movdqu    (%1), %%xmm1    \n\t" // load round key 0
          "pxor      %%xmm1, %%xmm0  \n\t" // round 0
          "add       $16, %1         \n\t" // point to next round key
@@ -486,10 +487,10 @@ int mbedtls_aesni_crypt_ecb(mbedtls_aes_context *ctx,
 #endif
 
          "3:                        \n\t"
-         "movdqu    %%xmm0, (%4)    \n\t" // export output
-         :
-         : "r" (ctx->nr), "r" (ctx->buf + ctx->rk_offset), "r" (mode), "r" (input), "r" (output)
-         : "memory", "cc", "xmm0", "xmm1");
+         "movdqu    %%xmm0, %3      \n\t" // export output
+         : "+r" (n), "+r" (p), "+r" (mode), "=m" (*(uint8_t(*)[16]) output)
+         : "r" (input)
+         : "cc", "xmm0", "xmm1");
 
 
     return 0;

I don't know if we want to be fixing these issues or maybe look into avoiding the asm blocks entirely.

Since I think these are upstream code bugs (unless such clobbering is somehow allowed and I'm unaware? unlikely) we could want to notify upstream and maybe submit fixes there.

@solardiz (Member)

This is for just one asm block, but I think others are similar.

Upon a closer look, all others specify "0" as clobbered when they do clobber it. So I think a proper least-invasive fix would be to add similar specifiers to this one asm block and that's it, until they drop asm for good.

@solardiz (Member)

all others specify "0" as clobbered when they do clobber it. So I think a proper least-invasive fix would be to add similar specifiers to this one asm block and that's it

Sent a PR with this one-liner fix upstream.

@solardiz (Member)

on Intel Tiger Lake, I am seeing a ~15% slowdown for o5logon compared to our old yasm code. Interrupting the old code in gdb, I see it ran this

I included only the AES pieces in my comment above, but FWIW I also frequently saw it run SHA-1 SHA-NI code from OpenSSL.

@solardiz (Member) commented Dec 1, 2024

The usage of sub instead of dec is rather old-fashioned - could briefly make sense on the original Pentium, but perhaps not on any CPU with AES-NI.

Tested: dec works at least as well or better in that code on Intel Tiger Lake. However, going even further and using the loop instruction makes it a lot slower - apparently, the performance of that one has never been repaired since they broke it back in the Pentium days.

@claudioandre-br (Member) commented Dec 1, 2024

There is a measurable impact on performance.

$ sudo snap revert john-the-ripper --revision=686 # <== YASM
john-the-ripper reverted to 1.9J1+7df682c

$ john --test --format=o5logon
Will run 8 OpenMP threads
Benchmarking: o5logon, Oracle O5LOGON protocol [SHA1 AES 32/64]... (8xOMP) DONE
Many salts:	27901K c/s real, 3500K c/s virtual
Only one salt:	16433K c/s real, 2060K c/s virtual

$ john --test --format=o5logon
Will run 8 OpenMP threads
Benchmarking: o5logon, Oracle O5LOGON protocol [SHA1 AES 32/64]... (8xOMP) DONE
Many salts:	26853K c/s real, 3373K c/s virtual
Only one salt:	16015K c/s real, 2008K c/s virtual

$ sudo snap revert john-the-ripper --revision=687 # <== mbed lib
john-the-ripper reverted to 1.9J1+b3bd5ea

$ john --test --format=o5logon
Will run 8 OpenMP threads
Benchmarking: o5logon, Oracle O5LOGON protocol [SHA1 AES 32/64]... (8xOMP) DONE
Many salts:	17735K c/s real, 2233K c/s virtual
Only one salt:	12124K c/s real, 1517K c/s virtual

$ john --test --format=o5logon
Will run 8 OpenMP threads
Benchmarking: o5logon, Oracle O5LOGON protocol [SHA1 AES 32/64]... (8xOMP) DONE
Many salts:	17260K c/s real, 2177K c/s virtual
Only one salt:	12263K c/s real, 1535K c/s virtual

[EDITED]

$ john | head -1
John the Ripper 1.9.0-jumbo-1+bleeding-f9fedd238b 2024-04-01 13:35:37 +0200 OMP [linux-gnu 64-bit x86_64 AVX2 AC]

$ john --test --format=o5logon
Will run 8 OpenMP threads
Benchmarking: o5logon, Oracle O5LOGON protocol [SHA1 AES 32/64]... (8xOMP) DONE
Many salts:	7143K c/s real, 1039K c/s virtual
Only one salt:	5840K c/s real, 828495 c/s virtual

$ john | head -1
John the Ripper 1.9.0-jumbo-1+bleeding-7df682c6da 2024-11-27 02:27:42 +0100 OMP [linux-gnu 64-bit x86_64 AVX2 AC]

$ john --test --format=o5logon
Will run 8 OpenMP threads
Benchmarking: o5logon, Oracle O5LOGON protocol [SHA1 AES 32/64]... (8xOMP) DONE
Many salts:	24673K c/s real, 3142K c/s virtual
Only one salt:	14630K c/s real, 1863K c/s virtual

@magnumripper (Member Author)

There is a measurable impact on performance.

First, you need to state what hardware was used. I have seen a slight regression on some, and a slight boost on others.
Second, those figures really look like you did not get AES-NI at all in the mbedtls build - for one reason or another.

@claudioandre-br (Member) commented Dec 1, 2024

I'm afraid that with no support for AES-NI, the values would be around 8M and 6M:

$ john --test --format=o5logon
Will run 8 OpenMP threads
Benchmarking: o5logon, Oracle O5LOGON protocol [SHA1 AES 32/64]... (8xOMP) DONE
Many salts:	7140K c/s real, 1018K c/s virtual
Only one salt:	5873K c/s real, 827859 c/s virtual

So that's a 2x gain in the worst-case scenario.


In any case, I note that there is room for improvement (which will be made if/when possible).

I'm not the right person for testing; I don't have the hardware. Anyway, what I have is AMD.

@solardiz (Member) commented Dec 1, 2024

I think @claudioandre-br is right. I did observe a ~15% regression for o5logon on Intel Tiger Lake (one core, as I can't easily make this system otherwise-idle). It is conceivable that it would be ~35% on some other CPU (and at many threads).

@solardiz (Member) commented Dec 1, 2024

In any case, I note that there is room for improvement (which will be made if/when possible).

Right. I think there's actually a lot of room for improvement relative to the yasm speeds as well if we're ever willing to interleave multiple instances of the computation (different API).
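For illustration, "interleave multiple instances" would mean something like encrypting blocks from two (or more) independent candidates per loop iteration, so one stream's aesenc latency hides behind the other's - a rough sketch, not a proposed API:

#include <immintrin.h>  /* AES-NI intrinsics; needs -maes */

/* Two independent AES-128 block encryptions in flight at once;
   rka[]/rkb[] are the two streams' expanded round-key schedules. */
static void aes128_encrypt_x2(const __m128i rka[11], const __m128i rkb[11],
                              const unsigned char ina[16], const unsigned char inb[16],
                              unsigned char outa[16], unsigned char outb[16])
{
    __m128i a = _mm_xor_si128(_mm_loadu_si128((const __m128i *) ina), rka[0]);
    __m128i b = _mm_xor_si128(_mm_loadu_si128((const __m128i *) inb), rkb[0]);

    for (int i = 1; i < 10; i++) {
        a = _mm_aesenc_si128(a, rka[i]);    /* the two streams alternate, */
        b = _mm_aesenc_si128(b, rkb[i]);    /* hiding each other's latency */
    }
    _mm_storeu_si128((__m128i *) outa, _mm_aesenclast_si128(a, rka[10]));
    _mm_storeu_si128((__m128i *) outb, _mm_aesenclast_si128(b, rkb[10]));
}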

@claudioandre-br (Member)

I can reproduce:

  • YASM: more than 20 million, easily.
  • now, less than 20 million, for sure.

@magnumripper (Member Author)

Also, editing the source files under mbedtls and running make at the top-level does not rebuild aes.a. Looks like we only build it if it's missing. Perhaps same for other .a files we build/use?

In the top Makefile, john depends on aes.a.
In mbedtls' Makefile, aes.a depends on aesce.o aesni.o aes.o.
They in turn each have complete dependency lists.

Should that not suffice? I'm a bit rusty.

@solardiz (Member) commented Dec 2, 2024

Should that not suffice? I'm a bit rusty.

Yes, I figured the same. I don't mind keeping this as-is, although it would have been nice for everything to be rebuilt when necessary by a simple top-level make, for which I guess we'd need to run the sub-makes unconditionally.

@magnumripper (Member Author)

I thought it didn't make sense at all - it wasn't the "cipher key" that was const, but the "AES_CTX". I have yet to investigate how that could ever be const with no warnings.

Oh, you could be right. Or could it be specific uses where the context is supposed to remain unchanged?

Out of curiosity I had a look at this now and, lo and behold, AES encryption/decryption never writes to the ctx. MbedTLS doesn't declare it const, though.
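If we ever want the const back at our call sites, a thin wrapper could keep it there and only cast at the mbedTLS boundary - a sketch with a hypothetical function name, safe only as long as mbedTLS really never writes to the context in these calls:

#include "mbedtls/aes.h"   /* adjust the path for the in-tree copy */

static int jtr_aes_ecb_encrypt(const mbedtls_aes_context *ctx,
                               const unsigned char in[16],
                               unsigned char out[16])
{
    /* Cast away const only here; mbedtls_aes_crypt_ecb takes a non-const
       context even though encryption does not modify it. */
    return mbedtls_aes_crypt_ecb((mbedtls_aes_context *) ctx,
                                 MBEDTLS_AES_ENCRYPT, in, out);
}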

@solardiz (Member)

Testing on our old "bull" FX-8120, o5logon became about 30% slower with our current Mbed-TLS AES-NI code (with my loop unrolling) than it was with our previous AES-NI code (requiring yasm, which we have installed there). I guess our new key setup is still slower. At the same time, keepass became several times faster since it wasn't using AES-NI before.

@solardiz (Member)

on our old "bull" FX-8120, o5logon became about 30% slower with our current Mbed-TLS AES-NI code

aesni_setkey_enc_192 and aesni_set_rk_256 were not getting inlined with the older gcc we have there, while aesni_set_rk_128 was inlined anyway. I've tried adding explicit inline to all of these and to aesni_setkey_enc_* and verified that everything got inlined. However, the speed remained almost the same - still 30% slower than old code's - I don't know why.

@solardiz (Member)

on our old "bull" FX-8120, o5logon became about 30% slower with our current Mbed-TLS AES-NI code

I've tried switching from intrinsics to asm (removed the -m* flags from mbedtls/Makefile) - this made o5logon significantly faster (speed between the old yasm code's and the new intrinsics code's) and keepass twice as fast (compared to the intrinsics build, even though the disassembled code looked OK to me). I suspect this older CPU may be slower at VAES than at original AES-NI.

@solardiz (Member)

I suspect this older CPU may be slower at VAES than at original AES-NI.

Also tried reverting my loop unrolling commits while keeping the intrinsics and VAES - got the same poor speeds at o5logon, but 20% better speeds at keepass (still a lot slower than the maximum seen with the asm code). So apparently both VAES and unrolling hurt on this CPU.

... but then I tried removing just -mxop (keeping intrinsics) and confirmed the code is old-style AES-NI, but it does not run faster (unlike the asm code, even though the code compiled this way looks similar). I'm puzzled.

Could be an alignment thing. Maybe unlike on newer CPUs, here movdqu isn't free when unaligned?

@solardiz (Member)

Could be an alignment thing. Maybe unlike on newer CPUs, here movdqu isn't free when unaligned?

No, that's not it. In fact, the generated code uses movdqa there. I don't know why the compiler dares, and whether it means it may crash in an unlucky build/run.

@solardiz (Member)

No, that's not it. In fact, the generated code uses movdqa there. I don't know why the compiler dares

Oh, I see now. It first does two 64-bit reads, saves them to a 128-bit-aligned stack location, and then reads that back with movdqa. And this may be (at least part of) why it's slower - so when building with intrinsics with this older gcc, the memcpy isn't free (not a single instruction), unlike with newer gcc.

@claudioandre-br (Member)

on our old "bull" FX-8120, o5logon became about 30% slower with our current Mbed-TLS AES-NI code

It is not hard to detect this performance regression [1]. So someone who needs to deal with o5logon might be interested in using an old commit and getting maximum performance with YASM. All other users, on the other hand, have gained something.

[1] https://github.com/openwall/john-packages/wiki/Oracle-O5LOGON-notes#using-cloud-servers

@solardiz (Member) commented Dec 21, 2024

With the loadu/storeu intrinsics, the keepass speeds on "bull" are now better than ever, and reverting the loop unrolling hurts them. So the unrolling is a good thing, after all.

Warning: OpenMP is disabled; a non-OpenMP build may be faster
Benchmarking: KeePass [AES/Argon2 128/128 XOP]... DONE
Speed for cost 1 (t (rounds)) of 24569, cost 2 (m) of 0, cost 3 (p) of 0, cost 4 (KDF [0=Argon2d 2=Argon2id 3=AES]) of 3
Raw:    892 c/s real, 892 c/s virtual

Will run 8 OpenMP threads
Benchmarking: KeePass [AES/Argon2 128/128 XOP]... (8xOMP) DONE
Speed for cost 1 (t (rounds)) of 24569, cost 2 (m) of 0, cost 3 (p) of 0, cost 4 (KDF [0=Argon2d 2=Argon2id 3=AES]) of 3
Raw:    5504 c/s real, 687 c/s virtual

With unrolling reverted:

Warning: OpenMP is disabled; a non-OpenMP build may be faster
Benchmarking: KeePass [AES/Argon2 128/128 XOP]... DONE
Speed for cost 1 (t (rounds)) of 24569, cost 2 (m) of 0, cost 3 (p) of 0, cost 4 (KDF [0=Argon2d 2=Argon2id 3=AES]) of 3
Raw:    680 c/s real, 680 c/s virtual

Will run 8 OpenMP threads
Benchmarking: KeePass [AES/Argon2 128/128 XOP]... (8xOMP) DONE
Speed for cost 1 (t (rounds)) of 24569, cost 2 (m) of 0, cost 3 (p) of 0, cost 4 (KDF [0=Argon2d 2=Argon2id 3=AES]) of 3
Raw:    4220 c/s real, 527 c/s virtual

o5logon still has the known performance regression compared to yasm, and I don't know exactly why we have it - the key setup code doesn't look that bad at first glance, but then I didn't compare it against yasm's closely.


Successfully merging this pull request may close these issues: Fix and enhance our AES-NI support
3 participants