Benjamin Zaitlen edited this page Dec 16, 2021 · 4 revisions

Welcome to the ucx-py wiki!

Env Vars

DEBUG

UCX_PY_LOG_LEVEL=DEBUG # TRACE
UCX_LOG_LEVEL=DEBUG # TRACE
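The same two variables can be set from Python instead of the shell. The following is a minimal sketch, assuming the variables need to be in the environment before ucx-py is imported; `enable_debug_logging` is a hypothetical helper and not part of ucx-py, and it only accepts the two levels shown above.

```python
import os


def enable_debug_logging(level="DEBUG", environ=None):
    """Set both log-level variables to the same value.

    Hypothetical helper for illustration. `environ` defaults to the real
    process environment; pass a plain dict to experiment safely.
    """
    if level not in ("DEBUG", "TRACE"):
        raise ValueError("this sketch only accepts 'DEBUG' or 'TRACE'")
    target = os.environ if environ is None else environ
    target["UCX_PY_LOG_LEVEL"] = level  # log level of the ucx-py layer
    target["UCX_LOG_LEVEL"] = level     # log level of the UCX library
    return target


# Demonstrate on a scratch dict so the real environment is left alone.
scratch = enable_debug_logging("TRACE", environ={})
for name, value in sorted(scratch.items()):
    print(f"{name}={value}")

# In a real script the call would come first, followed by the import:
# enable_debug_logging("DEBUG")
# import ucp
```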

UCX_MEMTYPE_CACHE

The UCX memory type cache is an optimization with known issues. UCX-Py regularly sets this variable to n. It toggles whether the UCX library intercepts cu*alloc* calls.

UCX_MEMTYPE_CACHE=n

UCX_RNDV_SCHEME

UCX_RNDV_SCHEME=put_zcopy

UCX_TLS (simplified):

  • rc = ibv_post_send, ibv_post_recv, ibv_poll_cq
  • cuda_copy = cuMemHostRegister, cuMemcpyAsync
  • cuda_ipc = cuIpcCloseMemHandle, cuIpcOpenMemHandle, cuMemcpyAsync
  • sockcm = connection management over sockets
  • tcp = communication over TCP
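To read an existing UCX_TLS value against the list above, a small lookup can help. This is only an illustrative sketch: the table repeats the simplified descriptions from this page, and `describe_tls` is a hypothetical name, not a UCX or ucx-py function.

```python
# Descriptions copied from the simplified list above.
TRANSPORTS = {
    "rc": "ibv_post_send, ibv_post_recv, ibv_poll_cq",
    "cuda_copy": "cuMemHostRegister, cuMemcpyAsync",
    "cuda_ipc": "cuIpcCloseMemHandle, cuIpcOpenMemHandle, cuMemcpyAsync",
    "sockcm": "connection management over sockets",
    "tcp": "communication over TCP",
}


def describe_tls(ucx_tls):
    """Split a UCX_TLS value and pair each entry with its description."""
    names = [name.strip() for name in ucx_tls.split(",") if name.strip()]
    return [(name, TRANSPORTS.get(name, "not covered by this list"))
            for name in names]


for name, what in describe_tls("tcp,cuda_copy,sockcm"):
    print(f"{name:10s} {what}")
```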

Example Usage

IB -- Yes NVLINK

UCX_RNDV_SCHEME=put_zcopy UCX_MEMTYPE_CACHE=n UCX_TLS=rc,cuda_copy,cuda_ipc <SCRIPT>

TLS/Socket -- No NVLINK

UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,cuda_copy,sockcm UCX_SOCKADDR_TLS_PRIORITY=sockcm <SCRIPT>

TLS/Socket -- Yes NVLINK

UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm UCX_SOCKADDR_TLS_PRIORITY=sockcm <SCRIPT>
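The three combinations above can be generated from two switches. The sketch below is one possible way to do that; `ucx_env` is a hypothetical helper, and the case of IB without NVLINK is not listed on this page, so dropping cuda_ipc there is an assumption.

```python
def ucx_env(infiniband=False, nvlink=False):
    """Return the UCX variables for one of the setups listed above.

    Hypothetical helper. infiniband=True with nvlink=False is an
    extrapolation that this page does not cover.
    """
    env = {"UCX_MEMTYPE_CACHE": "n"}
    if infiniband:
        env["UCX_RNDV_SCHEME"] = "put_zcopy"
        tls = ["rc", "cuda_copy"]
    else:
        tls = ["tcp", "cuda_copy"]
    if nvlink:
        tls.append("cuda_ipc")  # GPU-to-GPU transfers via CUDA IPC
    if not infiniband:
        tls.append("sockcm")    # connection management over sockets
        env["UCX_SOCKADDR_TLS_PRIORITY"] = "sockcm"
    env["UCX_TLS"] = ",".join(tls)
    return env


# Print each setup as a shell-style prefix for <SCRIPT>.
for ib, nv in [(True, True), (False, False), (False, True)]:
    prefix = " ".join(f"{k}={v}" for k, v in ucx_env(ib, nv).items())
    print(prefix, "<SCRIPT>")
```

To launch a script directly, the returned dict can be merged into a copy of `os.environ` and passed as the `env` argument of `subprocess.run`.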

Benchmarking

Benchmark send receive on one machine (UCX < 1.10):

UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm python \
    send-recv-core.py --server-dev 2 --client-dev 1 \
    --object_type rmm --reuse-alloc --n-bytes 1GB

Benchmark send receive on one machine (UCX >= 1.10):

UCX_TLS=tcp,cuda_copy,cuda_ipc python send-recv-core.py \
    --server-dev 2 --client-dev 1 --object_type rmm \
    --reuse-alloc --n-bytes 1GB
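The only difference between the two single-machine commands is the environment: before UCX 1.10, sockcm is added to UCX_TLS and given priority. A minimal sketch of choosing between them follows; `parse_version` and `single_node_env` are hypothetical names, and the version string is assumed to look like "1.9.0".

```python
def parse_version(text):
    """Turn '1.10.1' into (1, 10) so versions compare numerically."""
    return tuple(int(part) for part in text.split(".")[:2])


def single_node_env(ucx_version):
    """Pick the single-machine benchmark variables for a UCX version.

    `ucx_version` is a (major, minor) tuple. Hypothetical helper that
    mirrors the two command lines above.
    """
    if tuple(ucx_version[:2]) < (1, 10):
        return {
            "UCX_TLS": "tcp,sockcm,cuda_copy,cuda_ipc",
            "UCX_SOCKADDR_TLS_PRIORITY": "sockcm",
        }
    return {"UCX_TLS": "tcp,cuda_copy,cuda_ipc"}


for version in ("1.9.0", "1.10.1"):
    print(version, single_node_env(parse_version(version)))
```

Comparing tuples of integers rather than strings matters here, because as plain strings "1.9" would sort after "1.10".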

Benchmark send receive on two machines (IB testing, UCX < 1.10):

# server process
UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=tcp,sockcm,cuda_copy,rc \
    UCX_SOCKADDR_TLS_PRIORITY=sockcm python send-recv-core.py \
    --server-dev 0 --client-dev 5 --object_type rmm --reuse-alloc \
    --n-bytes 1GB --server-only --port 13337 --n-iter 100
# client process
UCX_NET_DEVICES=mlx5_2:1 UCX_TLS=tcp,sockcm,cuda_copy,rc \
    UCX_SOCKADDR_TLS_PRIORITY=sockcm python send-recv-core.py \
    --server-dev 0 --client-dev 5 --object_type rmm --reuse-alloc \
    --n-bytes 1GB --client-only --server-address SERVER_IP --port 13337 \
    --n-iter 100

Benchmark send receive on two machines (IB testing, UCX >= 1.10):

# server process
UCX_MAX_RNDV_RAILS=1 UCX_TLS=tcp,cuda_copy,rc python send-recv-core.py \
    --server-dev 0 --client-dev 5 --object_type rmm --reuse-alloc \
    --n-bytes 1GB --server-only --port 13337 --n-iter 100
# client process
UCX_MAX_RNDV_RAILS=1 UCX_TLS=tcp,cuda_copy,rc python send-recv-core.py \
    --server-dev 0 --client-dev 5 --object_type rmm --reuse-alloc \
    --n-bytes 1GB --client-only --server-address SERVER_IP --port 13337 \
    --n-iter 100
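
In the two-machine runs, the server and client commands share every flag except the role-specific ones. The sketch below builds both argument lists from one place so they cannot drift apart; `send_recv_args` is a hypothetical helper, the flags are taken from the commands above, and SERVER_IP remains a placeholder.

```python
import shlex

# Environment from the UCX >= 1.10 two-machine example above.
IB_ENV = {"UCX_MAX_RNDV_RAILS": "1", "UCX_TLS": "tcp,cuda_copy,rc"}


def send_recv_args(role, server_address=None, port=13337, n_iter=100):
    """Build the send-recv-core.py command line for one side.

    Hypothetical helper; `role` is "server" or "client".
    """
    args = ["python", "send-recv-core.py",
            "--server-dev", "0", "--client-dev", "5",
            "--object_type", "rmm", "--reuse-alloc",
            "--n-bytes", "1GB"]
    if role == "server":
        args.append("--server-only")
    elif role == "client":
        if server_address is None:
            raise ValueError("the client needs a server address")
        args += ["--client-only", "--server-address", server_address]
    else:
        raise ValueError("role must be 'server' or 'client'")
    # Both sides must agree on the port and the iteration count.
    args += ["--port", str(port), "--n-iter", str(n_iter)]
    return args


prefix = " ".join(f"{k}={v}" for k, v in IB_ENV.items())
print(prefix, shlex.join(send_recv_args("server")))
print(prefix, shlex.join(send_recv_args("client", "SERVER_IP")))
```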