This section describes optimization suggestions to try on Graviton-based instances to attain higher performance for your service. Each sub-section lists recommendations that can help improve performance if you see a particular signature after measuring performance using the previous checklists.
- On C/C++ applications, `-flto`, `-Os`, and Feedback Directed Optimization can help with code layout using GCC.
- On Java, `-XX:-TieredCompilation`, `-XX:ReservedCodeCacheSize`, and `-XX:InitialCodeCacheSize` can be tuned to reduce the pressure the JIT places on the instruction footprint. The JDK sets up a 256MB region for the code-cache by default, which over time can fill, become fragmented, and leave live code sparse.
  - We recommend setting the code cache initially to `-XX:-TieredCompilation -XX:ReservedCodeCacheSize=64M -XX:InitialCodeCacheSize=64M` and then tuning the size up or down as required (see the example below).
  - Experiment with setting `-XX:+TieredCompilation` to gain faster start-up time and better optimized code.
  - When tuning the JVM code cache, watch for `code cache full` error messages in the logs indicating that the cache has been set too small. A full code cache can lead to worse performance.
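As a starting point, a JVM invocation with the recommended initial code-cache settings might look like the sketch below; `my-service.jar` is a placeholder for your own application:

```
# Start with a fixed 64MB, non-tiered code cache, re-measure, then adjust
# ReservedCodeCacheSize up or down based on "code cache full" messages and
# observed performance. my-service.jar is a placeholder application.
java -XX:-TieredCompilation \
     -XX:ReservedCodeCacheSize=64M \
     -XX:InitialCodeCacheSize=64M \
     -jar my-service.jar
```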
A TLB (translation lookaside buffer) is a cache that holds recent virtual-to-physical address translations for the CPU to use. Reducing misses in this cache can improve application performance.
- Enable Transparent Huge Pages (THP):
  - `echo always > /sys/kernel/mm/transparent_hugepage/enabled` or
  - `echo madvise > /sys/kernel/mm/transparent_hugepage/enabled`
- On Linux kernels >= 6.9, Transparent Huge Pages (THP) have been extended with folios that provide 16kB and 64kB huge pages in addition to 2MB pages. This allows the Linux kernel to use huge pages in more places to increase performance by reducing TLB pressure. Each folio size can be set to `inherit` to follow the top-level THP setting, or set independently to `never`, `always`, or `madvise`.
  - To use 16kB pages: `echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled`
  - To use 64kB pages: `echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled`
  - To use 2MB pages: `echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled`
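A quick way to confirm the settings took effect and that your application is actually getting THP-backed memory is sketched below; the per-size folio paths only exist on kernels with folio (mTHP) support:

```
# Show the active top-level THP policy ([always], [madvise] or [never]).
cat /sys/kernel/mm/transparent_hugepage/enabled
# On >= 6.9 kernels, show the per-size folio policy (64kB shown as an example).
cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
# Check how much anonymous memory is currently backed by transparent huge pages.
grep AnonHugePages /proc/meminfo
```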
- If your application can use pinned hugepages because it uses `mmap` directly, try reserving huge pages directly via the OS. This can be done by one of two methods:
  - At runtime: `sysctl -w vm.nr_hugepages=X`
  - At boot time, by specifying on the kernel command line in `/etc/default/grub`: `hugepagesz=2M hugepages=512`
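For example, reserving 512 2MB pages at runtime and verifying the reservation might look like this sketch:

```
# Reserve 512 x 2MB pinned huge pages (1GB total) at runtime.
sudo sysctl -w vm.nr_hugepages=512
# Verify the pool: HugePages_Total should report 512 and Hugepagesize 2048 kB.
grep -i hugepages /proc/meminfo
```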
- For Java, hugepages can be used for both the code-heap and data-heap by adding the below flags to your JVM command line:
  - `-XX:+UseTransparentHugePages` when THP is set to at least `madvise`
  - `-XX:+UseLargePages` if you have pre-allocated huge pages through `sysctl` or the kernel command line
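A sketch of both variants is below; `my-service.jar` is again a placeholder for your application:

```
# Use transparent huge pages (requires THP set to madvise or always).
java -XX:+UseTransparentHugePages -jar my-service.jar
# Use pre-allocated (pinned) huge pages reserved via vm.nr_hugepages or the kernel command line.
java -XX:+UseLargePages -jar my-service.jar
```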
Using huge pages should generally improve performance on all EC2 instance types, but there can be cases where exclusively using huge pages leads to performance degradation. Therefore, always fully test your application after enabling and/or allocating huge pages.
- If you need to port an optimized routine that uses x86 vector instruction intrinsics to Graviton’s vector instructions (called NEON instructions), you can use the SSE2NEON library to assist in the porting (see the sketch below). While SSE2NEON won’t produce optimal code, it generally gets close enough to reduce the performance penalty of not using the vector intrinsics.
- For additional information on the vector instructions used on Graviton
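As a rough sketch of what the port looks like in practice: include `sse2neon.h` in place of the x86 intrinsic headers such as `emmintrin.h`, then compile for the Graviton target. The repository URL and the `distance.c` file name below are illustrative assumptions:

```
# Fetch the single-header sse2neon library.
git clone https://github.com/DLTcollab/sse2neon.git
# In the source file, swap `#include <emmintrin.h>` for `#include "sse2neon.h"`,
# then build for Graviton2 (distance.c is a placeholder file name).
gcc -O2 -march=armv8.2-a -I ./sse2neon -c -o distance.o distance.c
```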
- Look for specialized back-off routines for custom locks tuned using the x86 `PAUSE` instruction or the equivalent x86 `rep; nop` sequence. Graviton2 should use a single `ISB` instruction as a drop-in replacement; for an example and explanation, see the recent commit to the WiredTiger storage layer.
- If a locking routine tries to acquire a lock in a fast path before forcing the thread to sleep via the OS to wait, try experimenting with attempting the fast path a few additional times before executing the slow path. See the example from the Finagle code-base, where on Graviton2 we spin longer for a lock before sleeping.
- If you do not intend to run your application on Graviton1, try compiling your code with GCC using `-march=armv8.2-a` instead of `-moutline-atomics` to reduce the overhead of the synchronization builtins (see the example below).
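For example, on GCC the difference is just the target flag; the source and output names below are placeholders:

```
# Outline atomics (the default on newer GCC): safe on Graviton1, but adds call overhead.
gcc -O2 -moutline-atomics -o myapp myapp.c
# Graviton2-and-newer only: emit LSE atomics inline instead of the outlined helpers.
gcc -O2 -march=armv8.2-a -o myapp myapp.c
```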
- Check ENA device tunings with `ethtool -c ethN`, where `N` is the device number, and check the `Adaptive RX` setting. By default on instances without extra ENIs the device will be `eth0`.
  - Set `ethtool -C ethN adaptive-rx off` for a latency-sensitive workload.
  - ENA tunings made via `ethtool` can be made permanent by editing the `/etc/sysconfig/network-scripts/ifcfg-ethN` files (see the example below).
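A sketch of checking and disabling adaptive RX coalescing on `eth0`, plus one common way to persist it via `ETHTOOL_OPTS`, is shown below; this assumes your AMI uses `/etc/sysconfig/network-scripts` for interface configuration:

```
# Inspect current interrupt coalescing settings, including "Adaptive RX".
ethtool -c eth0
# Disable adaptive RX coalescing for a latency-sensitive workload.
sudo ethtool -C eth0 adaptive-rx off
# Persist the setting across restarts on network-scripts based distributions.
echo 'ETHTOOL_OPTS="-C eth0 adaptive-rx off"' | sudo tee -a /etc/sysconfig/network-scripts/ifcfg-eth0
```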
- Disable `irqbalance` from dynamically moving IRQ processing between vCPUs, and set dedicated cores to process each IRQ. Example script below:
```
# Assign eth0 ENA interrupts to the first N-1 cores
systemctl stop irqbalance
irqs=$(grep "eth0-Tx-Rx" /proc/interrupts | awk -F':' '{print $1}')
cpu=0
for i in $irqs; do
    echo $cpu > /proc/irq/$i/smp_affinity_list
    let cpu=${cpu}+1
done
```
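To confirm the affinities were applied, a small check like the following sketch can be run afterwards:

```
# Print the CPU affinity now assigned to each eth0 Tx-Rx interrupt.
for i in $(grep "eth0-Tx-Rx" /proc/interrupts | awk -F':' '{print $1}'); do
    echo "IRQ $i -> CPU $(cat /proc/irq/$i/smp_affinity_list)"
done
```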
- Disable Receive Packet Steering (RPS) to avoid contention and extra IPIs. Check `cat /sys/class/net/ethN/queues/rx-N/rps_cpus` and verify the masks are set to `0`. In general, RPS is not needed on Graviton2 and newer.
  - You can try using RPS if your situation is unique. Read the documentation on RPS to understand further how it might help. Also refer to Optimizing network intensive workloads on Amazon EC2 A1 Instances for concrete examples.
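A sketch that checks every receive queue on `eth0` and clears any non-zero RPS mask:

```
# Inspect and clear the RPS CPU mask for each eth0 receive queue.
for q in /sys/class/net/eth0/queues/rx-*/rps_cpus; do
    echo "$q: $(cat $q)"
    echo 0 | sudo tee "$q" > /dev/null   # 0 disables RPS for this queue
done
```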
- If on Graviton2 and newer metal instances, try disabling the System MMU (Memory Management Unit) to speed up IO handling:

```
%> cd ~/aws-graviton-getting-started/perfrunbook/utilities
# Configure the SMMU to be off on metal, which is the default on x86.
# Leave the SMMU on if you require the additional security protections it offers.
# Virtualized instances do not expose an SMMU to instances.
%> sudo ./configure_graviton_metal_iommu.sh off
%> sudo shutdown now -r
```