Mixed usage with TL/UCP and TL/MLX5:create tl_mlx5 ctx failed #1009

Closed
yanminglai opened this issue Aug 16, 2024 · 2 comments

@yanminglai

I am trying to mix TL/UCP and TL/MLX5: use TL/MLX5 for alltoall and TL/UCP for all other collective operations.

How I configure UCC:
"${UCC_SRC_DIR}/configure" --with-ucx="${UCX_HOME}" \
    --prefix="${UCC_INSTALL_DIR}" \
    --with-mpi \
    --with-ibverbs \
    --with-rdmacm \
    --with-tls=self,shm,ucp,mlx5

Run command:
mpirun -x UCC_CLS=basic -x UCC_CL_BASIC_TLS=ucp,mlx5 -x UCC_TL_UCP_TUNE=alltoall:0 -x UCC_TL_MLX5_NET_DEVICES=mlx5_2:1 -np 4 ./ucc_test_mpi -c alltoall -o min

image

I also tested using TL/MLX5 only:
mpirun -x UCC_TLS=mlx5 -x UCC_TL_MLX5_NET_DEVICES=mlx5_2:1 -np 2 ./ucc_test_mpi -c alltoall

This run also hits the ctx creation failure:
image

Here are my IB device listing and bandwidth test:
image

Two questions:

  1. Is this the right way to mix TL usage? (I expect that setting UCC_TL_UCP_TUNE=alltoall:0 makes alltoall always use TL/MLX5.)
  2. How can I track down the cause of the TL/MLX5 ctx creation failure?
@samnordmann
Collaborator

samnordmann commented Aug 19, 2024

Hi @yanminglai
Thanks for this report.

  1. First of all, tl/mlx5/a2a has been temporarily disabled in the repo. It is re-enabled by PR #996 (TL/MLX5: Fix segmentation fault in a2a mpi test), which is about to be merged.
  2. In your command that tries to use mlx5 only, please remove UCC_TLS=mlx5. It may seem counterintuitive, but TL/MLX5 uses TL/UCP for service collectives.
  3. I was able to run tl/mlx5 successfully on upstream/master + PR #996 (TL/MLX5: Fix segmentation fault in a2a mpi test) with the command line:
mpirun -x UCC_COLL_TRACE=info -x UCC_TL_MLX5_NET_DEVICES=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_1:1 -x UCC_TL_MLX5_TUNE=inf --mca coll_ucc_enable 0  --map-by ppr:2:node -np 4 test/mpi/ucc_test_mpi -c alltoall -t world -d uint8 -O 0 -v -m 1:128
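
Point 2 can be expressed as a small pre-flight check. The sketch below is a hypothetical helper, `check_ucc_tls` (not part of UCC), that takes a UCC_TLS-style comma-separated list and flags the case where mlx5 is listed without ucp. It assumes that an empty or unset UCC_TLS is acceptable, as suggested above; whether listing ucp next to mlx5 is sufficient is not confirmed in this thread.

```shell
# Hypothetical helper, not part of UCC.
# Argument: the value of UCC_TLS (may be empty, meaning "unset").
# Returns 0 when the list looks fine, 1 when mlx5 appears without ucp.
check_ucc_tls() {
    _tls="$1"

    # Unset/empty UCC_TLS is what the reply above recommends.
    if [ -z "$_tls" ]; then
        return 0
    fi

    # Wrap the list in commas so each name can be matched as a whole token.
    case ",$_tls," in
        *,mlx5,*) _has_mlx5=1 ;;
        *)        _has_mlx5=0 ;;
    esac
    case ",$_tls," in
        *,ucp,*) _has_ucp=1 ;;
        *)       _has_ucp=0 ;;
    esac

    if [ "$_has_mlx5" -eq 1 ] && [ "$_has_ucp" -eq 0 ]; then
        echo "warning: UCC_TLS=$_tls lists mlx5 without ucp;" \
             "TL/MLX5 uses TL/UCP for service collectives" >&2
        return 1
    fi
    return 0
}
```

It could be called as `check_ucc_tls "${UCC_TLS:-}"` before launching mpirun; a non-zero status means the setting matches the failing mlx5-only command from the original report.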

Other remarks:

  • TL/MLX5/a2a only supports msgsize <= 128
  • TL/MLX5/a2a only supports setups with at least two nodes, and at least 2 processes per node
  • add --mca coll_ucc_enable 0 to your mpirun command. This prevents Open-MPI from initializing a second instance of TL/MLX5 which could preempt the entirety of the device memory.
  • To also keep UCX from preempting the device's memory, set -x UCX_NET_DEVICES=<another_device> to a device other than the one used for TL/MLX5 or, alternatively, add -x UCX_RC_MLX5_DM_COUNT=0 -x UCX_DC_MLX5_DM_COUNT=0
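
The remarks above can be collected into a dry-run sketch. The two functions below are hypothetical helpers (not part of UCC or Open MPI): `a2a_preflight` checks the node count, processes per node, and largest message size against the limits quoted above, and `a2a_print_cmd` only prints an mpirun line built from flags that appear in this thread. Passing an empty second device falls back to the two DM_COUNT settings. The device names and the test binary path are taken from the command above and will differ on other systems.

```shell
# Hypothetical helpers, not part of UCC or Open MPI.

# Arguments: nodes, processes per node, largest message size.
# Returns 0 if the limits quoted in the remarks are met, 1 otherwise.
a2a_preflight() {
    _nodes="$1"; _ppn="$2"; _max_msg="$3"
    _status=0
    if [ "$_nodes" -lt 2 ]; then
        echo "TL/MLX5/a2a needs at least two nodes (got $_nodes)" >&2
        _status=1
    fi
    if [ "$_ppn" -lt 2 ]; then
        echo "TL/MLX5/a2a needs at least 2 processes per node (got $_ppn)" >&2
        _status=1
    fi
    if [ "$_max_msg" -gt 128 ]; then
        echo "TL/MLX5/a2a only supports msgsize <= 128 (got $_max_msg)" >&2
        _status=1
    fi
    return "$_status"
}

# Arguments: TL/MLX5 device, UCX device (empty to use DM_COUNT=0), nodes, ppn.
# Prints the command instead of running it.
a2a_print_cmd() {
    _mlx5_dev="$1"; _ucx_dev="$2"; _nodes="$3"; _ppn="$4"
    if [ -n "$_ucx_dev" ]; then
        # Keep UCX on a different device than TL/MLX5.
        _ucx_opts="-x UCX_NET_DEVICES=$_ucx_dev"
    else
        # Alternative from the remarks: disable UCX device-memory usage.
        _ucx_opts="-x UCX_RC_MLX5_DM_COUNT=0 -x UCX_DC_MLX5_DM_COUNT=0"
    fi
    echo "mpirun -x UCC_TL_MLX5_NET_DEVICES=$_mlx5_dev $_ucx_opts" \
         "-x UCC_TL_MLX5_TUNE=inf --mca coll_ucc_enable 0" \
         "--map-by ppr:$_ppn:node -np $((_nodes * _ppn))" \
         "test/mpi/ucc_test_mpi -c alltoall -m 1:128"
}
```

For example, `a2a_preflight 2 2 128 && a2a_print_cmd mlx5_0:1 mlx5_1:1 2 2` prints a command with the same device split and process layout as the one shown earlier in this reply.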

I hope this helps. Let me know if you run into further issues.

@yanminglai
Author

Thank you very much, it answers all my questions.
Gonna go ahead and close the issue.
