Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TL/MLX5: Fix segmentation fault in a2a mpi test #996

Merged
merged 1 commit into from
Aug 20, 2024

Conversation

x41lakazam
Copy link
Collaborator

@x41lakazam x41lakazam commented Jul 3, 2024

Related bug: https://redmine.mellanox.com/issues/3706049

What

Set rcache alignment back from ucc_get_page_size() to UCS_PGT_ADDR_ALIGN
Re-activate tl/mlx5 alltoall

Why ?

This bug reproduces only when using ucx anterior to openucx/ucx@85d2d9d0f, which introduced dynamic rcache alignment.
#877 (specifically b13b87d) sets the alignment to ucc_get_page_size() whereas it was UCS_PGT_ADDR_ALIGN before.
Setting the alignment back to UCS_PGT_ADDR_ALIGN solves the bug.

The reason is yet to be found out.

Performance tests

Below is a comparison of the performances with tl/ucp and hcoll

TL/UCP:

$ mpirun \
--mca coll_ucc_cts alltoall \
--mca coll_ucc_enable 1 \
--mca coll_hcoll_enable 0 \
-x UCC_COLL_TRACE=info \
-x LD_PRELOAD=$PWD/install/lib/libucc.so \
-x UCX_NET_DEVICES=mlx5_2:1 \
-x UCC_TL_UCP_TUNE=inf \
-np 736 \
--map-by ppr:32:node \
$HPCX_OSU_DIR/osu_alltoall -m 1:128 -f

# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 8758.46 7628.46 9876.10 1000
2 8769.55 7674.58 9880.20 1000
4 8789.50 7658.54 9896.35 1000
8 8781.75 7664.65 9883.95 1000
16 8799.08 7647.83 9899.20 1000
32 8921.46 7744.85 10028.75 1000
64 8585.21 7491.80 9682.50 1000
128 8458.66 7740.09 9893.40 1000
 
 
TL/MLX5:

$ mpirun \
--mca coll_ucc_cts alltoall \
--mca coll_ucc_enable 1 \
--mca coll_hcoll_enable 0 \
-x UCC_COLL_TRACE=info \
-x LD_PRELOAD=$PWD/install/lib/libucc.so \
-x UCC_TL_MLXS_NET_DEVICES=mlx5_0:1 \
-x UCX_NET_DEVICES=mlx5_2:1 \
-x UCC_TL_MLX5_TUNE=inf \
-x UCC_TL_SHARP_TUNE=0 \
-x UCC_RC_MLX5_DM_COUNT=0 \
-x UCC_DC_MLX5_DM_COUNT=0 \
-np 736 \
--map-by ppr:32:node \
$HPCX_OSU_DIR/osu_alltoall -m 1:128 -f
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 95.85 60.23 114.01 1000
2 95.76 60.02 114.46 1000
4 100.96 68.69 119.43 1000
8 112.16 76.55 130.74 1000
16 178.82 132.88 210.77 1000
32 201.71 156.84 233.80 1000
64 336.13 270.47 431.30 1000
128 643.67 505.20 762.88 1000
 
HCOLL:

$ mpirun \
  --mca coll_ucc_enable 0 \
  --mca coll_hcoll_enable 1 \
  -x UCC_COLL_TRACE=info \
  -np $np \
  --map-by ppr:$ppn:node \
  $HPCX_OSU_DIR/osu_alltoall -m 1:128 -f


# OSU MPI All-to-All Personalized Exchange Latency Test v7.2
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
     1                     112.15             73.49            148.97        1000
     2                     118.68             75.98            167.90        1000
     4                     139.34             79.86            189.09        1000
     8                     301.91            266.75            329.19        1000
     16                    119.70             88.27            145.37        1000
     32                    235.14            183.34            277.67        1000
     64                    379.98            301.66            454.02        1000
     128                   578.52            438.08            713.28        1000

@x41lakazam x41lakazam changed the title Fix bug #3706049: tl/mlx5 alltoall segmentation fault in ucc_test_mpi TL/MLX5 Fix bug #3706049: alltoall segmentation fault in ucc_test_mpi Jul 3, 2024
@x41lakazam x41lakazam marked this pull request as ready for review July 10, 2024 08:07
@x41lakazam x41lakazam marked this pull request as draft July 10, 2024 08:10
@x41lakazam x41lakazam marked this pull request as ready for review July 10, 2024 08:12
@x41lakazam x41lakazam changed the title TL/MLX5 Fix bug #3706049: alltoall segmentation fault in ucc_test_mpi TL/MLX5: Fix segmentation fault in a2a mpi test Jul 11, 2024
@samnordmann samnordmann self-requested a review July 31, 2024 08:37
@x41lakazam x41lakazam force-pushed the fix_mlx5_alltoall branch 3 times, most recently from 8e804ec to 894c71a Compare August 19, 2024 08:54
@Sergei-Lebedev Sergei-Lebedev merged commit 777df69 into openucx:master Aug 20, 2024
9 of 11 checks passed
MamziB pushed a commit to MamziB/ucc-forked that referenced this pull request Aug 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants