Hang in TL/CUDA for triggered reduce_scatterv #638

Open
samnordmann opened this issue Sep 29, 2022 · 1 comment
samnordmann commented Sep 29, 2022

When building from upstream/master, the following ucc_test_mpi command hangs on a DGX node:

mpirun -x UCC_TL_CUDA_TUNE=inf -x UCC_TL_SHARP_TUNE=0 --mca coll ^hcoll -np 8 /.autodirect/mtrsysgwork/snordmann/ucc/build/test/mpi/ucc_test_mpi -d float32 -M cuda -v --triggered 1 -o sum -t world -r single:0 -c reduce_scatterv -m 512

Here is part of the output with UCC_LOG_LEVEL=info:

===== UCC MPI TEST INFO =======
seed : 8205

[1667375117.094710] [swx-dgx02:35853:0] tl_cuda_team.c:324 TL_CUDA INFO initialized tl team: 0xc58f540
[1667375117.094904] [swx-dgx02:35853:0] tl_ucp_team.c:35 TL_UCP INFO posted tl team: 0xb358710
[1667375117.094909] [swx-dgx02:35853:0] tl_ucp_team.c:124 TL_UCP INFO initialized tl team: 0xb358710
[1667375117.094912] [swx-dgx02:35853:0] cl_basic_team.c:122 CL_BASIC INFO initialized tl cuda team
[1667375117.094915] [swx-dgx02:35853:0] cl_basic_team.c:126 CL_BASIC INFO failed to create tl self team: (-1)
[1667375117.094918] [swx-dgx02:35853:0] cl_basic_team.c:122 CL_BASIC INFO initialized tl shm team
[1667375117.094920] [swx-dgx02:35853:0] cl_basic_team.c:122 CL_BASIC INFO initialized tl ucp team
[1667375117.101207] [swx-dgx02:35903:0] ucc_ee.c:42 UCC INFO ee is created: 0xaa83fb0 ee_context: 0xbc9e910
[1667375117.101301] [swx-dgx02:35903:0] ucc_ee.c:112 UCC INFO EE Event Set. ee:0xaa83fb0, queue:0xaa83fd0 ev_type:COLL_POST
[1667375117.101309] [swx-dgx02:35903:0] ucc_ee.c:76 UCC INFO EE Event Get. ee:0xaa83fb0, queue:0xaa83fd0 ev_type:COLL_POST
Triggered tc=Reduce_scatterv team=world msgsize=512 inplace=0 persistent=0 dt=float32 op=sum
[1667375117.101183] [swx-dgx02:35848:0] ucc_ee.c:42 UCC INFO ee is created: 0x95e6fb0 ee_context: 0xb8f9560
[1667375117.101275] [swx-dgx02:35848:0] ucc_ee.c:112 UCC INFO EE Event Set. ee:0x95e6fb0, queue:0x95e6fd0 ev_type:COLL_POST
[1667375117.101282] [swx-dgx02:35848:0] ucc_ee.c:76 UCC INFO EE Event Get. ee:0x95e6fb0, queue:0x95e6fd0 ev_type:COLL_POST
[1667375117.101053] [swx-dgx02:35853:0] ucc_ee.c:42 UCC INFO ee is created: 0xc4befb0 ee_context: 0xb643910
[1667375117.101140] [swx-dgx02:35853:0] ucc_ee.c:112 UCC INFO EE Event Set. ee:0xc4befb0, queue:0xc4befd0 ev_type:COLL_POST
[1667375117.101147] [swx-dgx02:35853:0] ucc_ee.c:76 UCC INFO EE Event Get. ee:0xc4befb0, queue:0xc4befd0 ev_type:COLL_POST
[1667375117.104166] [swx-dgx02:35849:0] ucc_ee.c:42 UCC INFO ee is created: 0xa66cfb0 ee_context: 0xb6c59b0
[1667375117.104261] [swx-dgx02:35849:0] ucc_ee.c:112 UCC INFO EE Event Set. ee:0xa66cfb0, queue:0xa66cfd0 ev_type:COLL_POST
[1667375117.104268] [swx-dgx02:35849:0] ucc_ee.c:76 UCC INFO EE Event Get. ee:0xa66cfb0, queue:0xa66cfd0 ev_type:COLL_POST
[1667375117.104300] [swx-dgx02:35859:0] ucc_ee.c:42 UCC INFO ee is created: 0xbee8fb0 ee_context: 0xaf77190
[1667375117.104376] [swx-dgx02:35859:0] ucc_ee.c:112 UCC INFO EE Event Set. ee:0xbee8fb0, queue:0xbee8fd0 ev_type:COLL_POST
[1667375117.104383] [swx-dgx02:35859:0] ucc_ee.c:76 UCC INFO EE Event Get. ee:0xbee8fb0, queue:0xbee8fd0 ev_type:COLL_POST
[1667375117.104341] [swx-dgx02:35867:0] ucc_ee.c:42 UCC INFO ee is created: 0x876cfb0 ee_context: 0xa4578f0
[1667375117.104410] [swx-dgx02:35867:0] ucc_ee.c:112 UCC INFO EE Event Set. ee:0x876cfb0, queue:0x876cfd0 ev_type:COLL_POST
[1667375117.104417] [swx-dgx02:35867:0] ucc_ee.c:76 UCC INFO EE Event Get. ee:0x876cfb0, queue:0x876cfd0 ev_type:COLL_POST
[1667375117.103829] [swx-dgx02:35879:0] ucc_ee.c:42 UCC INFO ee is created: 0xbc5dfb0 ee_context: 0xaff3190
[1667375117.103898] [swx-dgx02:35879:0] ucc_ee.c:112 UCC INFO EE Event Set. ee:0xbc5dfb0, queue:0xbc5dfd0 ev_type:COLL_POST
[1667375117.103905] [swx-dgx02:35879:0] ucc_ee.c:76 UCC INFO EE Event Get. ee:0xbc5dfb0, queue:0xbc5dfd0 ev_type:COLL_POST
[1667375117.104383] [swx-dgx02:35887:0] ucc_ee.c:42 UCC INFO ee is created: 0x8a7efb0 ee_context: 0xbb17fd0
[1667375117.104451] [swx-dgx02:35887:0] ucc_ee.c:112 UCC INFO EE Event Set. ee:0x8a7efb0, queue:0x8a7efd0 ev_type:COLL_POST
[1667375117.104458] [swx-dgx02:35887:0] ucc_ee.c:76 UCC INFO EE Event Get. ee:0x8a7efb0, queue:0x8a7efd0 ev_type:COLL_POST

Then, the program hangs.
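
To see where each process stops, the log can be grouped by PID. The following is a minimal sketch, not part of UCC: the regular expression is an assumption based only on the line format in the excerpt above, and `last_message_per_process` is a hypothetical helper name. It reports the last logged line for each process.

```python
import re

# Assumed line format, taken from the excerpt in this report:
# [timestamp] [host:pid:tid] source.c:line COMPONENT LEVEL message
LOG_LINE = re.compile(
    r"^\[(?P<ts>\d+\.\d+)\]\s+"
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+):(?P<tid>\d+)\]\s+"
    r"(?P<src>\S+:\d+)\s+(?P<component>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$"
)


def last_message_per_process(lines):
    """Return {pid: (timestamp, source, message)} for the latest line of each PID."""
    latest = {}
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m is None:
            # Lines that are not UCC log lines (e.g. the test banner) are skipped.
            continue
        pid = int(m.group("pid"))
        ts = float(m.group("ts"))
        # Compare timestamps, since lines from different ranks are interleaved.
        if pid not in latest or ts >= latest[pid][0]:
            latest[pid] = (ts, m.group("src"), m.group("msg"))
    return latest
```

Applied to the excerpt above, this should show the `EE Event Get` line at `ucc_ee.c:76` as the last entry for every PID, which suggests that all eight processes stop after the COLL_POST event is retrieved.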

@samnordmann samnordmann changed the title Hang in TL/CUDA for triggered reduce_scatterv Hang in TL/UCP EC/CUDA for triggered reduce_scatterv Oct 26, 2022
@samnordmann samnordmann changed the title Hang in TL/UCP EC/CUDA for triggered reduce_scatterv Hang in TL/CUDA for triggered reduce_scatterv Nov 2, 2022
samnordmann commented

Using TL/UCP (i.e., removing -x UCC_TL_CUDA_TUNE=inf from the previous command line) leads to the same hang.
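
Both variants can be checked with a timeout so that the hang does not block a script. This is a rough sketch under assumptions: `build_command` and `hangs` are hypothetical helper names, the arguments are copied from the command line in this report, and the timeout value is arbitrary. A timeout only shows that the run did not finish in time; it cannot tell a real hang from a slow run.

```python
import subprocess


def build_command(test_binary, force_tl_cuda=True):
    """Build the mpirun argument list used in this report.

    With force_tl_cuda=False, UCC_TL_CUDA_TUNE=inf is omitted (the TL/UCP variant).
    """
    cmd = ["mpirun"]
    if force_tl_cuda:
        cmd += ["-x", "UCC_TL_CUDA_TUNE=inf"]
    cmd += [
        "-x", "UCC_TL_SHARP_TUNE=0",
        "--mca", "coll", "^hcoll",
        "-np", "8",
        test_binary,
        "-d", "float32", "-M", "cuda", "-v",
        "--triggered", "1",
        "-o", "sum", "-t", "world", "-r", "single:0",
        "-c", "reduce_scatterv", "-m", "512",
    ]
    return cmd


def hangs(cmd, timeout_s):
    """Return True if cmd has not finished after timeout_s seconds."""
    try:
        # On timeout, subprocess.run kills the direct child (here mpirun);
        # ranks it spawned may need separate cleanup.
        subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return True
    return False
```

For example, `hangs(build_command(path_to_ucc_test_mpi, force_tl_cuda=False), 120)` would run the TL/UCP variant, where `path_to_ucc_test_mpi` stands for the local path of the test binary and 120 seconds is an assumed limit.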
