
System-specific tuning and issues


LUMI

xpmem

Building DLA-Future on LUMI may end up linking it against xpmem (indirectly, e.g. through hwloc). This can have a large detrimental impact on DLA-Future performance (up to 50% slower), because Cray MPICH defaults to using xpmem for on-node messages whenever it is linked into the application. To explicitly opt out of using xpmem, set the environment variable MPICH_SMP_SINGLE_COPY_MODE=CMA.
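
A minimal sketch, assuming a Slurm batch script (the application name below is only a placeholder):

export MPICH_SMP_SINGLE_COPY_MODE=CMA  # use CMA instead of xpmem for on-node messages
srun ./dlaf_application                # placeholder for a DLA-Future executable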

MPI GPU support

If using DLAF_WITH_MPI_GPU_SUPPORT=ON on LUMI, the environment variable MPICH_GPU_SUPPORT_ENABLED=1 must be set. In addition, if the application is built without Cray's compiler wrappers, you must ensure that it links against libmpi_gtl_hsa.so. If this is not done at link time, you can instead preload the library with the environment variable LD_PRELOAD=/opt/cray/pe/lib64/libmpi_gtl_hsa.so.0. This assumes the system installations of HIP and MPICH.
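
As a sketch of a job script fragment, assuming the system installation paths mentioned above:

export MPICH_GPU_SUPPORT_ENABLED=1
# Only needed if libmpi_gtl_hsa.so was not linked at build time:
export LD_PRELOAD=/opt/cray/pe/lib64/libmpi_gtl_hsa.so.0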

clariden

MPI GPU support

When using stackinator to build a HIP environment, the required HIP libraries are loaded dynamically from the environment by Cray MPICH. Unless the HIP library paths are added explicitly to LD_LIBRARY_PATH, GPU-aware MPI is likely to hang, in particular when using multiple nodes; intra-node communication may work even without setting the path. The path to add to LD_LIBRARY_PATH is the library directory of the hsa-rocr-dev package, e.g. with export LD_LIBRARY_PATH=$(spack location -i hsa-rocr-dev)/lib:$LD_LIBRARY_PATH. Note that some versions of HIP place the libraries under lib64 instead of lib; see the sketch below.
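
A sketch that covers both possible library directories (lib and lib64), assuming spack is available in the environment; ROCR_PREFIX is just a helper variable introduced here:

ROCR_PREFIX=$(spack location -i hsa-rocr-dev)
# Prepend both lib and lib64; a non-existent directory in LD_LIBRARY_PATH is harmless
export LD_LIBRARY_PATH=${ROCR_PREFIX}/lib64:${ROCR_PREFIX}/lib:$LD_LIBRARY_PATH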

Memory access fault by GPU node-N

When using GPU-aware MPI, communication may fail inside a GPU kernel with Memory access fault by GPU node-N. According to HPE this is likely a bug in MPICH, and the chance of hitting it can be reduced by increasing the initial size of the Umpire memory pools in DLA-Future, e.g. to 16 GiB (the default is 1 GiB). This can be done with

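# 1 << 34 bytes = 16 GiB for both the host and device pools (the default is 1 GiB)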
export DLAF_UMPIRE_HOST_MEMORY_POOL_INITIAL_BYTES=$((1 << 34))
export DLAF_UMPIRE_DEVICE_MEMORY_POOL_INITIAL_BYTES=$((1 << 34))

Hangs during shutdown

Setting export FI_MR_CACHE_MAX_COUNT=0 may avoid hangs during shutdown.

Bad performance with HIP versions newer than 5.2.3

With HIP versions newer than 5.2.3, not setting export FI_MR_CACHE_MAX_COUNT=0 may significantly degrade performance compared to older HIP versions. Setting it restores performance to values similar to those with 5.2.3.

Hangs during algorithms

MPICH may deadlock on larger input matrices, either silently or with the following warning:

PE 31: MPICH WARNING: OFI is failing to make progress on posting a send. MPICH suspects a hang due to rendezvous message resource exhaustion. If running Slingshot 2.1 or later, setting environment variable FI_CXI_DEFAULT_TX_SIZE large enough to handle the maximum number of outstanding rendezvous messages per rank should prevent this scenario. [ If running on a Slingshot release prior to version 2.1, setting environment variable FI_CXI_RDZV_THRESHOLD to a larger value may circumvent this scenario by sending more messages via the eager path.]  OFI retry continuing...

In that case, setting export FI_CXI_RDZV_THRESHOLD=131072 or higher may help avoid hangs.
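
For example (the value is the starting point suggested above; the warning also points to FI_CXI_DEFAULT_TX_SIZE as the relevant knob on Slingshot 2.1 or later):

export FI_CXI_RDZV_THRESHOLD=131072  # or higher; routes more messages via the eager path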