ARMv8 Crypto Extensions

Basics

SoC vendors who license ARMv8 cores (usually 64-bit capable) can choose among certain optional features, for example the cryptographic acceleration called 'ARMv8 Cryptography Extensions'.

Most SoC vendors do license them. The only known exceptions are early Cortex-A53 SoCs like Qualcomm's Snapdragon 410, Amlogic's very first 64-bit SoC, the S905 (used only on ODROID-C2 and NanoPi K2), and Broadcom's SoCs powering all 64-bit capable Raspberry Pis. All of these lack any crypto acceleration and perform far worse in this area than all other 64-bit ARM SoCs.

If the kernel has been built correctly, the availability of accelerated cryptography functions can be checked by querying /proc/cpuinfo: the 'Features' entry will additionally show aes pmull sha1 sha2.
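The check described above can be sketched in a few lines of Python. This is only an illustration: the function name `missing_crypto_flags` and the sample 'Features' lines are made up for this example, and only the four flag names come from the text above.

```python
# The four flags the text says show up in the 'Features' entry.
CRYPTO_FLAGS = {"aes", "pmull", "sha1", "sha2"}

def missing_crypto_flags(cpuinfo_text):
    """Return the crypto flags missing from any 'Features' line.

    If the text contains no 'Features' line at all, every flag is
    reported as missing.
    """
    missing = set()
    found_features = False
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            found_features = True
            missing |= CRYPTO_FLAGS - set(value.split())
    return missing if found_features else set(CRYPTO_FLAGS)

# Illustrative input lines (hypothetical, not taken from a real board).
with_crypto = "Features\t: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid"
without_crypto = "Features\t: fp asimd evtstrm crc32 cpuid"

print(sorted(missing_crypto_flags(with_crypto)))
print(sorted(missing_crypto_flags(without_crypto)))
```

On a real system the function would be fed the contents of /proc/cpuinfo, for example via `open("/proc/cpuinfo").read()`. An empty result means all four flags are present on every core listed.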

sbc-bench's use of OpenSSL

sbc-bench uses OpenSSL's built-in AES benchmark to detect crypto acceleration, testing single-threaded with AES-128, AES-192 and AES-256. For the latter, a benchmark run looks like this:

openssl speed -elapsed -evp aes-256-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 63579690 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 34729604 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 11848770 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 3221240 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 419117 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 209578 aes-256-cbc's in 3.00s
...
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc     339091.68k   740898.22k  1011095.04k  1099516.59k  1144468.82k  1144575.32k

The results are in '1000s of bytes per second processed'. From now on we focus only on the rightmost column, since it is not affected by initialization overhead (16K chunk size with a score of 1144575.32k in the example above).

ARMv8 Crypto Extensions are not a classic 'crypto engine' running at a fixed clock (like Marvell's CESA, for example); their performance scales linearly with CPU clockspeed. With the openssl benchmark, DRAM configuration and performance don't matter either, since the whole benchmark runs inside the CPU caches. And although OpenSSL uses userspace crypto, the scores are identical regardless of whether the userland is armhf or arm64 (see the Samsung/Nexell S5P6818 numbers below). The distro used and the OpenSSL version don't seem to matter either.
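As a rough illustration of where the 16K score comes from, the following Python sketch recomputes it from the 'Doing aes-256-cbc for 3s on 16384 size blocks' line and also pulls the rightmost column out of the summary line. The function names are hypothetical helpers; the input numbers are copied from the sample run above.

```python
def score_from_counts(iterations, block_bytes, seconds):
    """Throughput in 1000s of bytes per second, the unit openssl prints."""
    return iterations * block_bytes / seconds / 1000.0

def rightmost_score(summary_line):
    """Return the last column of a summary line like 'aes-256-cbc ...k ...k'."""
    last_column = summary_line.split()[-1]
    return float(last_column.rstrip("k"))

# Values taken from the sample run: 209578 operations on 16384-byte blocks in 3.00s.
derived = score_from_counts(209578, 16384, 3.00)

summary = ("aes-256-cbc     339091.68k   740898.22k  1011095.04k  "
           "1099516.59k  1144468.82k  1144575.32k")
reported = rightmost_score(summary)

print(f"derived:  {derived:.2f}k")
print(f"reported: {reported:.2f}k")
```

Both values agree to two decimal places for this run, which is simply operations times block size divided by elapsed time.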

Scores predictable based on CPU core and clockspeed

It all boils down to the type of ARM core and the CPU clockspeed, since the ratio between openssl score and CPU clockspeed is fixed as follows (based on the sbc-bench result collection, which unfortunately lacks all ARM cores more modern than the A73 and A76):

Cortex-A35: ~217, an A35 running at 1000 MHz will produce a ~217000k aes-256-cbc score (or ~434000k at 2000 MHz)
Cortex-A57: ~359, an A57 running at 1000 MHz will produce a ~359000k aes-256-cbc score (or ~718000k at 2000 MHz)
Cortex-A53/A55: ~467, an A53/A55 running at 1000 MHz will produce a ~467000k aes-256-cbc score (or ~934000k at 2000 MHz)
Cortex-A72/A73/A76: ~570, an A72/A73/A76 running at 1000 MHz will produce a ~570000k aes-256-cbc score (or ~1140000k at 2000 MHz)
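The list above amounts to a simple multiplication, which can be written down as a minimal Python sketch. The ratios are the approximate values from the list; the dictionary and function names are chosen for this example only.

```python
# Approximate aes-256-cbc score (in k) per MHz, as listed above.
SCORE_PER_MHZ = {
    "Cortex-A35": 217,
    "Cortex-A57": 359,
    "Cortex-A53": 467,
    "Cortex-A55": 467,
    "Cortex-A72": 570,
    "Cortex-A73": 570,
    "Cortex-A76": 570,
}

def expected_score_k(core, mhz):
    """Expected 16K aes-256-cbc score in 1000s of bytes per second."""
    return SCORE_PER_MHZ[core] * mhz

for core, mhz in [("Cortex-A35", 1000), ("Cortex-A53", 1400), ("Cortex-A72", 2000)]:
    print(f"{core} @ {mhz} MHz: ~{expected_score_k(core, mhz)}k")
```

Since the ratios are rounded, the result is an estimate; measured scores typically deviate from it by a fraction of a percent.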

Amazon's Graviton/Graviton2 ARM CPUs score identically to A72/A73/A76, and the custom FTC663 core inside the Feiteng (Phytium) D2000 CPU performs identically to an A57. NVIDIA's Carmel core performs marginally better than the Cortex-A57 (~374, see the Jetson Xavier NX numbers below). Qualcomm's Kryo 4XX Silver cores are based on the A55 and perform exactly the same here.

Implications

Encryption/decryption performance in real-world tasks is an entirely different matter from the results of a synthetic benchmark that runs completely inside the CPU cores/caches. Real performance in real use cases (e.g. full disk encryption or operation as a VPN gateway) might look very different.

The openssl speed -elapsed -evp aes-256-cbc test is still more of a check for whether crypto acceleration is available than a benchmark of real-world crypto performance. But if, and only if, ARMv8 Crypto Extensions have been licensed by the SoC vendor, simple conclusions can be drawn, since there is a fixed correlation between core type, clockspeed and aes-256-cbc score. So if we know that a new SoC features e.g. A55 cores, that the vendor cheats with reported clockspeeds, and that we're not able to measure clockspeeds ourselves, we can use the openssl benchmark to estimate the real CPU clockspeed. The reverse (inferring the core type from score and clockspeed) should work too, but it's better to look up the CPU ID instead.
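Estimating a clockspeed from a score is just the ratio turned around. The sketch below assumes the A55 ratio of ~467 given earlier and uses a score from the results collection as input; the function names and the 3% tolerance are arbitrary choices for illustration, not something sbc-bench prescribes.

```python
def estimate_mhz(score_k, score_per_mhz):
    """Estimate the real clockspeed in MHz from a 16K aes-256-cbc score."""
    return score_k / score_per_mhz

def clock_looks_inflated(reported_mhz, score_k, score_per_mhz, tolerance=0.03):
    """True if the estimate falls clearly below the reported clockspeed.

    The default tolerance of 3% is an assumption made for this example.
    """
    return estimate_mhz(score_k, score_per_mhz) < reported_mhz * (1 - tolerance)

A55_RATIO = 467    # approximate score per MHz for Cortex-A53/A55
score = 830640     # a Cortex-A55 score in k

print(f"estimated clockspeed: ~{estimate_mhz(score, A55_RATIO):.0f} MHz")
print("2000 MHz claim inflated:", clock_looks_inflated(2000, score, A55_RATIO))
print("1780 MHz claim inflated:", clock_looks_inflated(1780, score, A55_RATIO))
```

A score of 830640k on an A55 therefore points to roughly 1780 MHz, so a reported 2000 MHz would be suspicious while a reported 1780 MHz would not.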

All of this applies only to ARM SoCs with ARMv8 Crypto Extensions licensed, since otherwise the scores reported by openssl depend heavily on compiler version/settings and even on different code paths. Check out the ODROID-C2 and RPi 4 'AES-256 (16 KB)' scores in the official results list: on the C2, a 'modern OS' outweighs a higher CPU clock, and on the RPi 4 the comparison between armhf userland (32-bit) and arm64 (64-bit) is even more telling. There openssl reports less than 50% of the 'AES performance' when running 64-bit compared to 32-bit because of different code paths: generic C with 64-bit vs. optimized assembler routines with 32-bit.

Numbers the aforementioned conclusions are based on

Crawling through the sbc-bench results collection and comparing more than 30 different SoCs/CPUs from various vendors at various clockspeeds, using OpenSSL versions 1.1.0f (25 May 2017) through 3.0.2 (15 Mar 2022), always shows the same relation between openssl score and clockspeed for those four core families (the right column is OpenSSL's aes-256-cbc score divided by the clockspeed in MHz):

ARM core	MHz	aes-256-cbc	score/mhz
Cortex-A35
RK3308	1300	282290	217
Apple Firestorm
M1 Pro	3030	1064110	351
Cortex-A57
Jetson Nano	1430	513700	359
Nintendo Switch	1780	642670	361
Jetson Nano	2000	717500	358
Nintendo Switch	2090	746680	357
FTC663
Phytium D2000	2300	828520	360
Carmel
Jetson Xavier NX	1890	706280	374
Apple Icestorm
M1 Pro	2060	784430	381
Cortex-A53
Armada 3700LP	790	368330	466
S912	1000	466780	466
Allwinner A64	1050	491590	468
RK3328	1290	601200	466
Allwinner H5	1370	637980	465
RK3328	1380	644200	467
S5P6818 (64-bit)	1400	653770	466
S5P6818 (32-bit)	1400	651000	465
RTD1395	1400	651460	465
S905X	1410	659460	467
S912	1420	659603	464
i.MX8M Quad	1500	695540	463
RK3399	1510	695265	460
S905Y2	1800	838360	465
i.MX8M Quad	1800	839321	466
RK3399	1800	839360	466
Allwinner H6	1800	839870	466
A311D	2010	940425	467
A311D2	2010	941040	468
Cortex-A55
RK3588	915	427750	467
RK3588s	1780	830640	467
QRB5165	1780	831950	467
RK3566	1800	845490	469
RK3588s	1815	846760	467
S905X3	1908	890730	466
RK3568	1930	898610	465
RK3568	1950	911730	467
S905X3	2010	941590	468
S905X3	2100	981940	467
Cortex-A72
RK3399	1800	1023600	568
LX2160A	1900	1079480	568
RK3399	2010	1144950	569
RK3399	2088	1184306	567
LX2160A	2200	1251710	569
Amazon a1.xlarge	2300	1297960	564
Cortex-A73
S922X	1800	1024680	569
S922X	1900	1085350	571
A311D2	2200	1252070	569
A311D	2400	1365900	569
Neoverse-N1
Amazon m6g.8xlarge	2500	1424770	570
Cortex-A76
RK3588	985	560200	569
RK3588s	2330	1325370	569
Cortex-A77
QRB5165	2415	1345230	557
QRB5165	2830	1581487	559
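To illustrate how tightly the ratios cluster per core type, the following Python sketch recomputes score/MHz for a handful of rows copied from the table above and reports the range per core. Only a subset of rows is used, and the function name is made up for this example.

```python
from collections import defaultdict

# (core, MHz, aes-256-cbc score in k), a subset of the table above.
ROWS = [
    ("Cortex-A57", 1430, 513700),
    ("Cortex-A57", 1780, 642670),
    ("Cortex-A57", 2000, 717500),
    ("Cortex-A57", 2090, 746680),
    ("Cortex-A72", 1800, 1023600),
    ("Cortex-A72", 1900, 1079480),
    ("Cortex-A72", 2010, 1144950),
]

def ratio_range_by_core(rows):
    """Map each core type to the (min, max) of its score/MHz ratios."""
    ratios = defaultdict(list)
    for core, mhz, score_k in rows:
        ratios[core].append(score_k / mhz)
    return {core: (min(values), max(values)) for core, values in ratios.items()}

for core, (low, high) in ratio_range_by_core(ROWS).items():
    print(f"{core}: {low:.1f} to {high:.1f}")
```

The MHz column appears to hold rounded clockspeeds, so recomputed ratios can differ from the table's last column by about one unit.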
