ch4/hcoll: fix call hcoll_do_progress #7047

Open: wants to merge 1 commit into main

hzhou (Contributor) commented Jun 30, 2024

Pull Request Description

Previously we added a vci parameter to progress hooks. We neglected to update one of the two calls to hcoll_do_progress.

[skip warnings]
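
As a rough illustration of the kind of fix described above, here is a minimal sketch, assuming a progress function that gained a vci parameter and has two call sites. All names (`toy_do_progress`, `toy_progress_hook`, `toy_rte_progress`, `TOY_NUM_VCIS`) and the signatures are hypothetical stand-ins, not the MPICH definitions; the choice of vci 0 in the second call site is likewise an assumption.

```c
#include <assert.h>

#define TOY_NUM_VCIS 2

/* Hypothetical counters: how many times each vci was progressed. */
static int toy_progress_count[TOY_NUM_VCIS];

/* Hypothetical stand-in for a progress function after a vci
 * parameter was added to its signature. */
static int toy_do_progress(int vci, int *made_progress)
{
    assert(vci >= 0 && vci < TOY_NUM_VCIS);
    toy_progress_count[vci]++;
    *made_progress = 1;
    return 0;
}

/* Call site 1: a progress hook that already receives vci and
 * simply forwards it. */
static int toy_progress_hook(int vci, int *made_progress)
{
    return toy_do_progress(vci, made_progress);
}

/* Call site 2: a void(void) callback handed to an external library.
 * No vci is in scope here, so one must be chosen explicitly
 * (0 is an assumption made for this sketch). */
static void toy_rte_progress(void)
{
    int made_progress = 0;
    (void) toy_do_progress(0, &made_progress);
}
```

The second call site is the easy one to miss when a signature changes, because the callback's own prototype is dictated by the external library and does not change with it.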

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

Previously we added a vci parameter to progress hooks. We neglected
to update one of the two calls to hcoll_do_progress.
@hzhou hzhou marked this pull request as ready for review June 30, 2024 20:15
hzhou (Contributor, Author) commented Aug 9, 2024

test:mpich/custom
netmod: ch4:ucx
config: hcoll

Hangs during init:

Thread 1 "cpi" received signal SIGINT, Interrupt.
0x00007ffff5d13e93 in progress () at src/mpid/common/hcoll/hcoll_rte.c:61
61  }
#0  0x00007ffff5d13e93 in progress () at src/mpid/common/hcoll/hcoll_rte.c:61
#1  0x00007ffff5658161 in wait_completion ()
   from /nfs/gce/projects/pmrs/opt/hpcx-v2.17.1-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/hcoll/lib/libhcoll.so.1
#2  0x00007ffff55c713b in comm_allreduce_hcolrte_generic ()
   from /nfs/gce/projects/pmrs/opt/hpcx-v2.17.1-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/hcoll/lib/libhcoll.so.1
#3  0x00007ffff55c7a14 in comm_allreduce_hcolrte ()
   from /nfs/gce/projects/pmrs/opt/hpcx-v2.17.1-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/hcoll/lib/libhcoll.so.1
#4  0x00007ffff565bc7a in hcoll_get_context_from_cache ()
   from /nfs/gce/projects/pmrs/opt/hpcx-v2.17.1-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/hcoll/lib/libhcoll.so.1
#5  0x00007ffff56582f5 in hcoll_create_context ()
   from /nfs/gce/projects/pmrs/opt/hpcx-v2.17.1-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/hcoll/lib/libhcoll.so.1
#6  0x00007ffff5d0f166 in hcoll_comm_create (
    comm_ptr=comm_ptr@entry=0x7ffff7f5ec40 <MPIR_Comm_builtin>,
    param=param@entry=0x0) at src/mpid/common/hcoll/hcoll_init.c:158
#7  0x00007ffff5cc5a09 in MPIDI_UCX_mpi_comm_commit_pre_hook (
    comm=comm@entry=0x7ffff7f5ec40 <MPIR_Comm_builtin>)
    at src/mpid/ch4/netmod/ucx/ucx_comm.c:19
#8  0x00007ffff5ccb6ea in MPID_Comm_commit_pre_hook (
    comm=comm@entry=0x7ffff7f5ec40 <MPIR_Comm_builtin>)
    at src/mpid/ch4/src/ch4_comm.c:197
#9  0x00007ffff5c4f63d in MPIR_Comm_commit_internal (
    comm=comm@entry=0x7ffff7f5ec40 <MPIR_Comm_builtin>)
    at src/mpi/comm/commutil.c:584
#10 0x00007ffff5c55ea8 in MPIR_Comm_commit (
    comm=0x7ffff7f5ec40 <MPIR_Comm_builtin>) at src/mpi/comm/commutil.c:799
#11 0x00007ffff5c45cf5 in MPIR_init_comm_world ()
    at src/mpi/comm/builtin_comms.c:33
#12 0x00007ffff5c84ca5 in MPII_Init_thread (argc=argc@entry=0x7fffffffda3c,
    argv=argv@entry=0x7fffffffda30, user_required=<optimized out>,
    provided=provided@entry=0x7fffffffd9cc,
    p_session_ptr=p_session_ptr@entry=0x0) at src/mpi/init/mpir_init.c:267
#13 0x00007ffff5c8538a in MPIR_Init_impl (argc=argc@entry=0x7fffffffda3c,
    argv=argv@entry=0x7fffffffda30) at src/mpi/init/mpir_init.c:136
#14 0x00007ffff5b0411c in internal_Init (argv=0x7fffffffda30,
    argc=0x7fffffffda3c) at src/binding/c/c_binding.c:49972
#15 PMPI_Init (argc=0x7fffffffda3c, argv=0x7fffffffda30)
    at src/binding/c/c_binding.c:50023
#16 0x0000555555555347 in main ()
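
In the trace above, the library's wait_completion() is sitting in the progress() callback from hcoll_rte.c. The internals of wait_completion() are not visible here, so the following is only a generic sketch of a callback-driven wait loop, with hypothetical names throughout (`toy_wait_completion`, `toy_progress`, `toy_pending`, `toy_callback_works`). It shows why such a loop never returns if the callback does not complete the outstanding work; the sketch adds a poll limit so the failure is reported instead of hanging.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical callback type: the library polls it with no arguments. */
typedef void (*toy_progress_fn)(void);

static int toy_pending = 3;             /* operations still outstanding */
static bool toy_callback_works = true;  /* flip to model a broken callback */

/* Hypothetical progress callback: completes one operation per call,
 * but only when it is actually driving the right progress engine. */
static void toy_progress(void)
{
    if (toy_callback_works && toy_pending > 0)
        toy_pending--;
}

/* Poll the callback until everything completes or max_polls is reached.
 * Returns true on completion, false if the limit was hit. Without the
 * limit, a callback that makes no progress would spin here forever. */
static bool toy_wait_completion(toy_progress_fn progress, int max_polls)
{
    for (int i = 0; i < max_polls && toy_pending > 0; i++)
        progress();
    return toy_pending == 0;
}
```

With `toy_callback_works` set to true the wait returns after three polls; with it set to false the pending count never drops and only the poll limit ends the loop.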
