Hang in MPI_tests for triggered reduce operations #668

Open
samnordmann opened this issue Nov 2, 2022 · 0 comments

(Possibly related to issue #638.)
The following command hangs (probably a deadlock) on a DGX machine (swx-dgx02 from hpchead):
mpirun -x UCC_TL_CUDA_TUNE=inf -x UCC_TL_SHARP_TUNE=0 --mca coll ^hcoll -np 8 /.autodirect/mtrsysgwork/snordmann/ucc/build/test/mpi/ucc_test_mpi -d float32 -M cuda -v --triggered 1 -o sum -t world -r single:0 -c allreduce,reduce
Using TL_UCP (i.e. removing the flag UCC_TL_CUDA_TUNE=inf) leads to the same hang. However, leaving only "reduce" or only "allreduce" in the command line makes the bug disappear.
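
For anyone trying to reproduce this, here is a rough sketch of a script that runs the variants described above (both collectives together, each one alone, with and without UCC_TL_CUDA_TUNE=inf) and flags a run as a hang when it exceeds a timeout. The binary path and flags are copied from the command above. The helper names (`build_command`, `run_with_timeout`, `run_matrix`) and the 300-second limit are my own assumptions, not part of UCC or the test suite.

```python
import subprocess

# Path copied from the report; adjust for your own build.
TEST_BINARY = "/.autodirect/mtrsysgwork/snordmann/ucc/build/test/mpi/ucc_test_mpi"


def build_command(colls, tl_cuda_tune=True, nprocs=8):
    """Return the mpirun argument list for the given list of collectives."""
    cmd = ["mpirun"]
    if tl_cuda_tune:
        cmd += ["-x", "UCC_TL_CUDA_TUNE=inf"]
    cmd += ["-x", "UCC_TL_SHARP_TUNE=0", "--mca", "coll", "^hcoll",
            "-np", str(nprocs), TEST_BINARY,
            "-d", "float32", "-M", "cuda", "-v", "--triggered", "1",
            "-o", "sum", "-t", "world", "-r", "single:0",
            "-c", ",".join(colls)]
    return cmd


def run_with_timeout(cmd, timeout_s=300):
    """Run cmd and classify the outcome.

    The timeout is an arbitrary limit (assumption): a run that exceeds it is
    reported as "hang". Only the launched process is killed on timeout, so
    leftover MPI ranks may need to be cleaned up by hand.
    """
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hang"
    except FileNotFoundError:
        return "not found"
    return "ok" if result.returncode == 0 else "failed"


def run_matrix(timeout_s=300):
    """Run every variant mentioned in the report and collect the outcomes."""
    outcomes = {}
    for colls in (["allreduce", "reduce"], ["allreduce"], ["reduce"]):
        for tune in (True, False):
            cmd = build_command(colls, tl_cuda_tune=tune)
            outcomes[(",".join(colls), tune)] = run_with_timeout(cmd, timeout_s)
    return outcomes
```

Calling `run_matrix()` on the affected machine should, if the behaviour described above holds, report "hang" only for the `allreduce,reduce` rows.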

Here are the different backtraces of the processes:

#0  uct_rc_mlx5_iface_poll_tx (poll_flags=2, iface=0x2de0030) at rc/accel/rc_mlx5_iface.c:153
#1  uct_rc_mlx5_iface_progress (flags=2, arg=0x2de0030) at rc/accel/rc_mlx5_iface.c:190
#2  uct_rc_mlx5_iface_progress_cyclic (arg=0x2de0030) at rc/accel/rc_mlx5_iface.c:195
#3  0x00007f6e3c16995a in ucs_callbackq_dispatch (cbq=<optimized out>) at /build-result/src/hpcx-gcc-redhat7/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucs/datastruct/callbackq.h:211
#4  uct_worker_progress (worker=<optimized out>) at /build-result/src/hpcx-gcc-redhat7/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/uct/api/uct.h:2768
#5  ucp_worker_progress (worker=0x2a2d440) at core/ucp_worker.c:2807
#6  0x00007f6e405767ac in opal_progress () at runtime/opal_progress.c:231
#7  0x00007f6e4180a933 in ompi_request_default_test (rptr=0x7ffc316acc70, completed=0x7ffc316acc7c, status=0x0) at request/req_test.c:88
#8  0x00007f6e418303e5 in PMPI_Test (request=0x7ffc316acc70, completed=0x7ffc316acc7c, status=<optimized out>) at ptest.c:65
#9  0x000000000043872c in TestReduce::check (this=0xc0f94e0) at ../../../test/mpi/test_reduce.cc:84
#10 0x0000000000406943 in UccTestMpi::exec_tests (this=0x335a410, tcs=..., triggered=true, persistent=false) at ../../../test/mpi/test_mpi.cc:499
#11 0x00000000004075ab in UccTestMpi::run_all_at_team (this=0x335a410, team=..., rst=...) at ../../../test/mpi/test_mpi.cc:613
#12 0x0000000000407c46 in UccTestMpi::run_all (this=0x335a410, is_onesided=false) at ../../../test/mpi/test_mpi.cc:664
#13 0x0000000000417c3a in main (argc=16, argv=0x7ffc316ad5a8) at ../../../test/mpi/main.cc:576


#0  ucs_callbackq_dispatch (cbq=<optimized out>) at /build-result/src/hpcx-gcc-redhat7/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucs/datastruct/callbackq.h:211
#1  uct_worker_progress (worker=<optimized out>) at /build-result/src/hpcx-gcc-redhat7/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/uct/api/uct.h:2768
#2  ucp_worker_progress (worker=0x1de9420) at core/ucp_worker.c:2807
#3  0x00007f3a85ab97ac in opal_progress () at runtime/opal_progress.c:231
#4  0x00007f3a86d4d933 in ompi_request_default_test (rptr=0x7ffc66f4e290, completed=0x7ffc66f4e29c, status=0x0) at request/req_test.c:88
#5  0x00007f3a86d733e5 in PMPI_Test (request=0x7ffc66f4e290, completed=0x7ffc66f4e29c, status=<optimized out>) at ptest.c:65
#6  0x000000000043872c in TestReduce::check (this=0xbf3ccd0) at ../../../test/mpi/test_reduce.cc:84
#7  0x0000000000406943 in UccTestMpi::exec_tests (this=0x3300120, tcs=..., triggered=true, persistent=false) at ../../../test/mpi/test_mpi.cc:499
#8  0x00000000004075ab in UccTestMpi::run_all_at_team (this=0x3300120, team=..., rst=...) at ../../../test/mpi/test_mpi.cc:613
#9  0x0000000000407c46 in UccTestMpi::run_all (this=0x3300120, is_onesided=false) at ../../../test/mpi/test_mpi.cc:664
#10 0x0000000000417c3a in main (argc=16, argv=0x7ffc66f4ebc8) at ../../../test/mpi/main.cc:576

#0  0x00007fff4f7ec6c2 in clock_gettime ()
#1  0x00007f4a08525c6d in clock_gettime () from /lib64/libc.so.6
#2  0x00007f4a09ae80ef in ?? () from /lib64/libcuda.so.1
#3  0x00007f4a099e136b in ?? () from /lib64/libcuda.so.1
#4  0x00007f4a09d26977 in ?? () from /lib64/libcuda.so.1
#5  0x00007f4a09988ba0 in ?? () from /lib64/libcuda.so.1
#6  0x00007f4a09b071d8 in ?? () from /lib64/libcuda.so.1
#7  0x00007f4a0b772a81 in uct_cuda_ipc_map_memhandle (key=key@entry=0xc0a1870, mapped_addr=mapped_addr@entry=0x7fff4f5fdbf0) at cuda_ipc/cuda_ipc_cache.c:272
#8  0x00007f4a0b771216 in uct_cuda_ipc_post_cuda_async_copy (iov=0x7fff4f5fdc68, iov=0x7fff4f5fdc68, direction=1, comp=<optimized out>, rkey=<optimized out>, remote_addr=<optimized out>, tl_ep=<optimized out>) at cuda_ipc/cuda_ipc_ep.c:70
#9  uct_cuda_ipc_ep_get_zcopy (tl_ep=<optimized out>, iov=0x7fff4f5fdc68, iovcnt=<optimized out>, remote_addr=140647904313344, rkey=201988208, comp=0xb911350) at cuda_ipc/cuda_ipc_ep.c:146
#10 0x00007f49f674dee4 in uct_ep_get_zcopy (comp=0xb911350, rkey=201988208, remote_addr=<optimized out>, iovcnt=1, iov=0x7fff4f5fdc68, ep=0xaa987d0) at /build-result/src/hpcx-gcc-redhat7/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/uct/api/uct.h:2960
#11 ucp_rndv_progress_rma_zcopy_common (proto=2, uct_rkey=201988208, lane=4 '\004', req=0xb9112c0) at rndv/rndv.c:582
#12 ucp_rndv_progress_rma_get_zcopy (self=0xb911398) at rndv/rndv.c:2271
#13 0x00007f49f675235a in ucp_request_try_send (req=0xb9112c0) at /build-result/src/hpcx-gcc-redhat7/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucp/core/ucp_request.inl:334
#14 ucp_request_send (req=<optimized out>) at /build-result/src/hpcx-gcc-redhat7/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucp/core/ucp_request.inl:357
#15 ucp_rndv_req_send_rma_get (rkey_buf=<optimized out>, rndv_rts_hdr=0x7f49a8935d00, rreq=0xb910b40, rndv_req=0xb9112c0) at rndv/rndv.c:950
#16 ucp_rndv_receive (worker=worker@entry=0x848fcc0, rreq=rreq@entry=0xb910b40, rndv_rts_hdr=rndv_rts_hdr@entry=0x7f49a8935d00, rkey_buf=rkey_buf@entry=0x7f49a8935d29) at rndv/rndv.c:1730
#17 0x00007f49f6763991 in ucp_rndv_receive_start (rkey_length=<optimized out>, rkey_buf=0x7f49a8935d29, rndv_rts_hdr=0x7f49a8935d00, rreq=0xb910b40, worker=0x848fcc0) at /build-result/src/hpcx-gcc-redhat7/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucp/rndv/rndv.inl:35
#18 ucp_tag_rndv_matched (worker=worker@entry=0x848fcc0, rreq=rreq@entry=0xb910b40, rts_hdr=rts_hdr@entry=0x7f49a8935d00, hdr_length=<optimized out>) at tag/tag_rndv.c:27
#19 0x00007f49f6765a7f in ucp_tag_recv_common (debug_name=<synthetic pointer>, param=0x7fff4f5fdfd0, rdesc=0x7f49a8935cd0, req=0xb910b40, tag_mask=18446744073709551615, tag=985162418749441, datatype=<optimized out>, count=<optimized out>, buffer=<optimized out>, worker=0x848fcc0) at tag/tag_recv.c:175
#20 ucp_tag_recv_nbx (worker=0x848fcc0, buffer=buffer@entry=0x7f4974800000, count=count@entry=1, tag=985162418749441, tag_mask=tag_mask@entry=18446744073709551615, param=0x7fff4f5fdfd0) at tag/tag_recv.c:249
#21 0x00007f49aa2a61c7 in ucc_tl_ucp_recv_common (cb=<optimized out>, task=0xb8070c0, team=0x9c34250, dest_group_rank=4, mtype=UCC_MEMORY_TYPE_CUDA, msglen=2097152, buffer=0x7f4974800000) at ./tl_ucp_sendrecv.h:155
#22 ucc_tl_ucp_recv_nb (task=0xb8070c0, team=0x9c34250, dest_group_rank=4, mtype=UCC_MEMORY_TYPE_CUDA, msglen=2097152, buffer=0x7f4974800000) at ./tl_ucp_sendrecv.h:166
#23 ucc_tl_ucp_reduce_knomial_progress (coll_task=<optimized out>) at reduce/reduce_knomial.c:78
#24 0x00007f4a0b526bc1 in ucc_pq_st_progress (pq=0x8cf4610) at core/ucc_progress_queue_st.c:31
#25 0x00007f4a0b52197e in ucc_progress_queue (pq=<optimized out>) at core/ucc_progress_queue.h:46
#26 ucc_context_progress (context=0x33f21d0) at core/ucc_context.c:934
#27 0x00000000004245c2 in TestCase::tc_progress_ctx (this=0xc03a1d0) at ../../../test/mpi/test_case.cc:160
#28 0x0000000000406868 in UccTestMpi::exec_tests (this=0x295d460, tcs=..., triggered=true, persistent=false) at ../../../test/mpi/test_mpi.cc:495
#29 0x00000000004075ab in UccTestMpi::run_all_at_team (this=0x295d460, team=..., rst=...) at ../../../test/mpi/test_mpi.cc:613
#30 0x0000000000407c46 in UccTestMpi::run_all (this=0x295d460, is_onesided=false) at ../../../test/mpi/test_mpi.cc:664
#31 0x0000000000417c3a in main (argc=16, argv=0x7fff4f5fe9d8) at ../../../test/mpi/main.cc:576

#0  0x00007febb40a9901 in ucp_worker_progress (worker=0x957cce0) at core/ucp_worker.c:2803
#1  0x00007feb5af55e49 in ucc_tl_ucp_test (task=0xc8f40c0) at ./tl_ucp_coll.h:300
#2  ucc_tl_ucp_reduce_knomial_progress (coll_task=<optimized out>) at reduce/reduce_knomial.c:57
#3  0x00007febbb159bc1 in ucc_pq_st_progress (pq=0x9de1650) at core/ucc_progress_queue_st.c:31
#4  0x00007febbb15497e in ucc_progress_queue (pq=<optimized out>) at core/ucc_progress_queue.h:46
#5  ucc_context_progress (context=0x44def80) at core/ucc_context.c:934
#6  0x00000000004245c2 in TestCase::tc_progress_ctx (this=0xbe84a50) at ../../../test/mpi/test_case.cc:160
#7  0x0000000000406868 in UccTestMpi::exec_tests (this=0x3a4aa50, tcs=..., triggered=true, persistent=false) at ../../../test/mpi/test_mpi.cc:495
#8  0x00000000004075ab in UccTestMpi::run_all_at_team (this=0x3a4aa50, team=..., rst=...) at ../../../test/mpi/test_mpi.cc:613
#9  0x0000000000407c46 in UccTestMpi::run_all (this=0x3a4aa50, is_onesided=false) at ../../../test/mpi/test_mpi.cc:664
#10 0x0000000000417c3a in main (argc=16, argv=0x7ffc1ab09748) at ../../../test/mpi/main.cc:576

#0  0x00007fccee9a72e8 in ompi_coll_libnbc_progress () at coll_libnbc_component.c:427
#1  0x00007fcd1f53c7ac in opal_progress () at runtime/opal_progress.c:231
#2  0x00007fcd207d0933 in ompi_request_default_test (rptr=0x7ffc4cbed090, completed=0x7ffc4cbed09c, status=0x0) at request/req_test.c:88
#3  0x00007fcd207f63e5 in PMPI_Test (request=0x7ffc4cbed090, completed=0x7ffc4cbed09c, status=<optimized out>) at ptest.c:65
#4  0x000000000043872c in TestReduce::check (this=0xb0ac9c0) at ../../../test/mpi/test_reduce.cc:84
#5  0x0000000000406943 in UccTestMpi::exec_tests (this=0x2edc3a0, tcs=..., triggered=true, persistent=false) at ../../../test/mpi/test_mpi.cc:499
#6  0x00000000004075ab in UccTestMpi::run_all_at_team (this=0x2edc3a0, team=..., rst=...) at ../../../test/mpi/test_mpi.cc:613
#7  0x0000000000407c46 in UccTestMpi::run_all (this=0x2edc3a0, is_onesided=false) at ../../../test/mpi/test_mpi.cc:664
#8  0x0000000000417c3a in main (argc=16, argv=0x7ffc4cbed9c8) at ../../../test/mpi/main.cc:576

#0  opal_sys_timer_get_cycles () at ../../../../opal/include/opal/sys/x86_64/timer.h:42
#1  opal_timer_linux_get_cycles_sys_timer () at timer_linux_component.c:232
#2  0x00007fa248bb88c9 in opal_progress_events () at runtime/opal_progress.c:183
#3  opal_progress () at runtime/opal_progress.c:245
#4  0x00007fa249e4c933 in ompi_request_default_test (rptr=0x7fff1a594ad0, completed=0x7fff1a594adc, status=0x0) at request/req_test.c:88
#5  0x00007fa249e723e5 in PMPI_Test (request=0x7fff1a594ad0, completed=0x7fff1a594adc, status=<optimized out>) at ptest.c:65
#6  0x000000000043872c in TestReduce::check (this=0xbec94e0) at ../../../test/mpi/test_reduce.cc:84
#7  0x0000000000406943 in UccTestMpi::exec_tests (this=0x3125560, tcs=..., triggered=true, persistent=false) at ../../../test/mpi/test_mpi.cc:499
#8  0x00000000004075ab in UccTestMpi::run_all_at_team (this=0x3125560, team=..., rst=...) at ../../../test/mpi/test_mpi.cc:613
#9  0x0000000000407c46 in UccTestMpi::run_all (this=0x3125560, is_onesided=false) at ../../../test/mpi/test_mpi.cc:664
#10 0x0000000000417c3a in main (argc=16, argv=0x7fff1a595408) at ../../../test/mpi/main.cc:576
