SoC vendors who license ARMv8 cores (usually 64-bit capable) can decide whether to include certain optional features, for example the cryptographic acceleration called 'ARMv8 Cryptography Extensions'.
Usually SoC vendors do. The only known exceptions are early Cortex-A53 SoCs like Qualcomm's Snapdragon 410, Amlogic's very first 64-bit SoC S905 (used only on ODROID-C2 and NanoPi K2) and Broadcom's SoCs powering all 64-bit capable Raspberry Pis: all lack any crypto acceleration and perform far worse than all other 64-bit ARM SoCs in this area.
If the kernel has been built correctly, availability of accelerated cryptography functions can be checked by querying `/proc/cpuinfo`: the 'Features' entry will additionally show `aes pmull sha1 sha2`.
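The check described above can be sketched in a few lines of Python. This is only an illustration: the function name `has_crypto_extensions` and the two sample strings are assumptions made up for this example, and the code looks for nothing beyond the four flags named above.

```python
# Flags that the 'Features' entry shows when ARMv8 Crypto Extensions are usable
REQUIRED_FLAGS = {"aes", "pmull", "sha1", "sha2"}


def has_crypto_extensions(cpuinfo_text):
    """Return True if any 'Features' line lists all four crypto flags."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            if REQUIRED_FLAGS <= set(value.split()):
                return True
    return False


# Hypothetical sample lines for illustration, not copied from a real board
SAMPLE_WITH = "processor\t: 0\nFeatures\t: fp asimd evtstrm aes pmull sha1 sha2 crc32\n"
SAMPLE_WITHOUT = "processor\t: 0\nFeatures\t: fp asimd evtstrm crc32 cpuid\n"

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(has_crypto_extensions(f.read()))
    except OSError:
        print("cannot read /proc/cpuinfo")
```

A `False` result only means the flags are not reported; as noted above, that can also happen when the kernel has not been built correctly.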
`sbc-bench` uses OpenSSL's internal AES benchmark to detect crypto acceleration, testing single-threaded with AES-128, AES-192 and AES-256. For the latter a benchmark run looks like this:
```
openssl speed -elapsed -evp aes-256-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 63579690 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 34729604 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 11848770 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 3221240 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 419117 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 209578 aes-256-cbc's in 3.00s
...
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc     339091.68k   740898.22k  1011095.04k  1099516.59k  1144468.82k  1144575.32k
```
The results are in 1000s of bytes per second processed. From now on we focus only on the rightmost column since it is not affected by initialization overhead (16K chunk size, with a score of `1144575.32k` in the example above).
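To make the unit concrete, the summary figures can be recomputed from the 'Doing ...' lines of the example run: block count times block size, divided by elapsed seconds and then by 1000. The helper name `throughput_k` is made up for this sketch; the input numbers are the ones from the run above.

```python
def throughput_k(blocks, block_size_bytes, seconds):
    """Bytes per second expressed in units of 1000 bytes."""
    return blocks * block_size_bytes / seconds / 1000


# 16384-byte blocks: 209578 operations in 3.00 s
print(f"{throughput_k(209578, 16384, 3.00):.2f}k")   # 1144575.32k

# 16-byte blocks: 63579690 operations in 3.00 s
print(f"{throughput_k(63579690, 16, 3.00):.2f}k")    # 339091.68k
```

Both values match the corresponding columns of the summary line.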
ARMv8 Crypto Extensions are not a classic 'crypto engine' running at a fixed clock (like Marvell's CESA for example) but scale linearly with CPU clockspeed. With the openssl benchmark it also doesn't matter what DRAM configuration/performance looks like since the whole benchmark runs inside the CPU caches, and while OpenSSL uses userspace crypto the scores are identical regardless of whether the userland is `armhf` or `arm64` (see the Samsung/Nexell S5P6818 numbers below). The distro used and the OpenSSL version don't seem to matter either.
It all boils down to the type of ARM core and the CPU clockspeed, since the ratio between openssl score and CPU clockspeed is fixed in the following way (using the sbc-bench results collection as a base, which unfortunately contains hardly any ARM cores more modern than A73 and A76):
- Cortex-A35: ~217, an A35 running at 1000 MHz will produce an ~217000k aes-256-cbc score (or ~434000k at 2000 MHz)
- Cortex-A57: ~359, an A57 running at 1000 MHz will produce an ~359000k aes-256-cbc score (or ~718000k at 2000 MHz)
- Cortex-A53/A55: ~467, an A53/A55 running at 1000 MHz will produce an ~467000k aes-256-cbc score (or ~934000k at 2000 MHz)
- Cortex-A72/A73/A76: ~570, an A72/A73/A76 running at 1000 MHz will produce an ~570000k aes-256-cbc score (or ~1140000k at 2000 MHz)
Amazon's Graviton/Graviton2 ARM CPUs score identically to A72/A73/A76, and the custom FTC663 core inside the Phytium (Feiteng) D2000 CPU performs identically to an A57. NVIDIA's Carmel core performs marginally better than Cortex-A57 (~374, see the Jetson Xavier NX numbers below). Qualcomm's Kryo 4XX Silver cores are based on A55 and perform exactly the same here.
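The fixed ratios listed above amount to a simple multiplication, which the following sketch spells out. The dictionary and function names are illustrative assumptions; the ratios are the approximate values given above, so the result is an estimate and not a measurement.

```python
# Approximate aes-256-cbc score (in 1000s of bytes/s) per MHz, as listed above
SCORE_PER_MHZ = {
    "Cortex-A35": 217,
    "Cortex-A57": 359,
    "Carmel": 374,
    "Cortex-A53": 467,
    "Cortex-A55": 467,
    "Cortex-A72": 570,
    "Cortex-A73": 570,
    "Cortex-A76": 570,
}


def expected_score_k(core, mhz):
    """Estimate the 16 KB aes-256-cbc score for a core type at a given clock."""
    return SCORE_PER_MHZ[core] * mhz


print(expected_score_k("Cortex-A35", 1000))  # 217000
print(expected_score_k("Cortex-A72", 2000))  # 1140000
```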
Encryption/decryption performance with real-world tasks is an entirely different matter from the results of a synthetic benchmark that runs completely inside the CPU cores/caches. Real performance with real use cases (e.g. full disk encryption or operation as a VPN gateway) might look very different.
The `openssl speed -elapsed -evp aes-256-cbc` test is still more of a check whether crypto acceleration is available than a benchmark for real-world crypto performance. But if and only if ARMv8 Crypto Extensions have been licensed by an ARM SoC vendor, simple conclusions can be drawn since there is a fixed correlation between core type, clockspeed and `aes-256-cbc` score. So if we know that a new SoC features e.g. A55 cores but cheats with reported clockspeeds and we're not able to measure clockspeeds, then we can use the openssl benchmark to estimate the real CPU clockspeed. The reverse (inferring the core type from the score at a known clockspeed) should work too, but it's better to look up the CPU ID instead.
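A rough sketch of that clockspeed estimate, assuming the core type and its score/MHz ratio are already known. The function names and the 'reported' 2000 MHz figure are hypothetical; the score is the RK3588s Cortex-A55 value from the results table in this document.

```python
def estimate_mhz(score_k, score_per_mhz):
    """Estimate the real clockspeed from a 16 KB aes-256-cbc score."""
    return score_k / score_per_mhz


def deviation_percent(reported_mhz, estimated_mhz):
    """How far the estimate is from the reported clock, in percent."""
    return (estimated_mhz - reported_mhz) / reported_mhz * 100


A55_RATIO = 467          # approximate score/MHz for Cortex-A53/A55
score = 830640           # RK3588s Cortex-A55 score from the results table
reported = 2000          # hypothetical clockspeed claimed by the kernel

estimated = estimate_mhz(score, A55_RATIO)
print(f"estimated {estimated:.0f} MHz, "
      f"{deviation_percent(reported, estimated):+.1f}% vs. reported")
```

The estimate is only as good as the assumed ratio, so a deviation of a percent or so says nothing; a double-digit deviation suggests the reported clockspeed is off.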
All of this only applies to ARM SoCs with ARMv8 Crypto Extensions licensed, since otherwise the scores reported by `openssl` depend heavily on compiler version/settings and even on different code paths. Check out the ODROID-C2 and RPi 4 'AES-256 (16 KB)' scores in the official results list: with the C2, a 'modern OS' outperforms a higher CPU clock, and with the RPi 4, comparing armhf userland (32-bit) and arm64 (64-bit) is even more telling since `openssl` reports less than 50% of the 'AES performance' when running 64-bit compared to 32-bit due to different code paths: generic C with 64-bit vs. optimized assembler routines with 32-bit.
Crawling through the sbc-bench results collection and comparing more than 30 different SoCs/CPUs from various vendors at various clockspeeds, using OpenSSL versions 1.1.0f (25 May 2017) through 3.0.2 (15 Mar 2022), always shows the same relation between openssl score and clockspeed for those four core families (the right column is OpenSSL's aes-256-cbc score divided by clockspeed in MHz):
ARM core / SoC | MHz | aes-256-cbc | score/MHz |
---|---|---|---|
Cortex-A35 | |||
RK3308 | 1300 | 282290 | 217 |
Apple Firestorm | |||
M1 Pro | 3030 | 1064110 | 351 |
Cortex-A57 | |||
Jetson Nano | 1430 | 513700 | 359 |
Nintendo Switch | 1780 | 642670 | 361 |
Jetson Nano | 2000 | 717500 | 358 |
Nintendo Switch | 2090 | 746680 | 357 |
FTC663 | |||
Phytium D2000 | 2300 | 828520 | 360 |
Carmel | |||
Jetson Xavier NX | 1890 | 706280 | 374 |
Apple Icestorm | |||
M1 Pro | 2060 | 784430 | 381 |
Cortex-A53 | |||
Armada 3700LP | 790 | 368330 | 466 |
S912 | 1000 | 466780 | 466 |
Allwinner A64 | 1050 | 491590 | 468 |
RK3328 | 1290 | 601200 | 466 |
Allwinner H5 | 1370 | 637980 | 465 |
RK3328 | 1380 | 644200 | 467 |
S5P6818 (64-bit) | 1400 | 653770 | 466 |
S5P6818 (32-bit) | 1400 | 651000 | 465 |
RTD1395 | 1400 | 651460 | 465 |
S905X | 1410 | 659460 | 467 |
S912 | 1420 | 659603 | 464 |
i.MX8M Quad | 1500 | 695540 | 463 |
RK3399 | 1510 | 695265 | 460 |
S905Y2 | 1800 | 838360 | 465 |
i.MX8M Quad | 1800 | 839321 | 466 |
RK3399 | 1800 | 839360 | 466 |
Allwinner H6 | 1800 | 839870 | 466 |
A311D | 2010 | 940425 | 467 |
A311D2 | 2010 | 941040 | 468 |
Cortex-A55 | |||
RK3588 | 915 | 427750 | 467 |
RK3588s | 1780 | 830640 | 467 |
QRB5165 | 1780 | 831950 | 467 |
RK3566 | 1800 | 845490 | 469 |
RK3588s | 1815 | 846760 | 467 |
S905X3 | 1908 | 890730 | 466 |
RK3568 | 1930 | 898610 | 465 |
RK3568 | 1950 | 911730 | 467 |
S905X3 | 2010 | 941590 | 468 |
S905X3 | 2100 | 981940 | 467 |
Cortex-A72 | |||
RK3399 | 1800 | 1023600 | 568 |
LX2160A | 1900 | 1079480 | 568 |
RK3399 | 2010 | 1144950 | 569 |
RK3399 | 2088 | 1184306 | 567 |
LX2160A | 2200 | 1251710 | 569 |
Amazon a1.xlarge | 2300 | 1297960 | 564 |
Cortex-A73 | |||
S922X | 1800 | 1024680 | 569 |
S922X | 1900 | 1085350 | 571 |
A311D2 | 2200 | 1252070 | 569 |
A311D | 2400 | 1365900 | 569 |
Neoverse-N1 | |||
Amazon m6g.8xlarge | 2500 | 1424770 | 570 |
Cortex-A76 | |||
RK3588 | 985 | 560200 | 569 |
RK3588s | 2330 | 1325370 | 569 |
Cortex-A77 | |||
QRB5165 | 2415 | 1345230 | 557 |
QRB5165 | 2830 | 1581487 | 559 |
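As a rough illustration of the reverse direction mentioned earlier, a measured score and clockspeed can be mapped to the nearest of the four core families. This is a sketch under assumptions: the family ratios are the approximate values from the list above, the names `FAMILY_RATIOS` and `nearest_family` are made up, and cores outside these families (or SoCs without Crypto Extensions) will be misclassified, which is why looking up the CPU ID remains the better approach.

```python
# Approximate score/MHz for the four core families discussed in the text
FAMILY_RATIOS = {
    "Cortex-A35": 217,
    "Cortex-A57": 359,
    "Cortex-A53/A55": 467,
    "Cortex-A72/A73/A76": 570,
}


def nearest_family(score_k, mhz):
    """Return the family whose ratio is closest to score_k / mhz."""
    ratio = score_k / mhz
    return min(FAMILY_RATIOS, key=lambda name: abs(FAMILY_RATIOS[name] - ratio))


# Rows taken from the table above: (SoC, MHz, aes-256-cbc score)
for soc, mhz, score in [
    ("RK3308", 1300, 282290),
    ("Jetson Nano", 1430, 513700),
    ("Allwinner H6", 1800, 839870),
    ("RK3399 (A72 cluster)", 1800, 1023600),
]:
    print(f"{soc}: {score / mhz:.1f} -> {nearest_family(score, mhz)}")
```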