coll: add coll_group to collective interfaces #7103

Open
wants to merge 26 commits into
base: main

Commits on Sep 14, 2024

  1. comm: store num_local and num_external in MPIR_Comm

    Store num_local and num_external in MPIR_Comm. Along with
    internode_table, they help construct internode subgroups.
    hzhou committed Sep 14, 2024
    0499415
  2. comm: remove node_count

    This is the same as num_external.
    hzhou committed Sep 14, 2024
    b7d6412
  3. comm/csel: remove reference to subcomms in csel prune_tree

    As the title.
    hzhou committed Sep 14, 2024
    cae3828
  4. coll: remove coll.pof2 field

    It does not take many instructions to calculate pof2 on the fly. Using a
    hard-coded pof2 prevents collective algorithms from being used with a
    non-trivial coll_group.
    hzhou committed Sep 14, 2024
    438e7b8
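As an illustration of the "calculate pof2 on the fly" point in commit 438e7b8, here is a minimal sketch of such a computation. The helper name `sketch_pof2_le` is hypothetical and is not taken from the PR; it only shows that the largest power of two not exceeding a group size takes a handful of instructions.

```c
#include <assert.h>

/* Hypothetical helper (not from the PR): return the largest power of two
 * that is less than or equal to size. Requires size >= 1. */
static int sketch_pof2_le(int size)
{
    assert(size >= 1);
    int pof2 = 1;
    /* Compare against size / 2 so that pof2 * 2 can never overflow. */
    while (pof2 <= size / 2) {
        pof2 *= 2;
    }
    return pof2;
}
```

Because the value is derived from whatever size the caller passes in, the same routine works for a full communicator and for a smaller subgroup.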
  5. comm: add MPIR_Subgroup

    Lightweight struct to describe subgroups of a communicator. They are
    intended to replace the subcomms.
    
    Preset a set of reserved subgroups to simplify common usages such as the
    intranode group and the crossnode group. Since we expect only a limited
    number of dynamic subgroups, and they should always be pushed and popped
    within the same scope, we don't need many dynamic slots.
    hzhou committed Sep 14, 2024
    e7c88bd
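The reserved-plus-dynamic slot scheme described in commit e7c88bd might be sketched as below. Every name here (`sketch_subgroup`, `sketch_comm`, the slot counts, the field layout) is an assumption made for illustration; the actual MPIR_Subgroup definition is not shown on this page.

```c
#include <assert.h>

/* Hypothetical slot counts: a few reserved slots (for example "none",
 * intranode, crossnode) followed by a small number of dynamic slots. */
#define SKETCH_NUM_RESERVED 3
#define SKETCH_NUM_DYNAMIC 2
#define SKETCH_MAX_SUBGROUPS (SKETCH_NUM_RESERVED + SKETCH_NUM_DYNAMIC)

struct sketch_subgroup {
    int size;               /* number of processes in the subgroup */
    int rank;               /* this process's rank within the subgroup */
    const int *proc_table;  /* subgroup rank -> parent comm rank */
};

struct sketch_comm {
    int num_subgroups;      /* reserved slots plus dynamic slots in use */
    struct sketch_subgroup subgroups[SKETCH_MAX_SUBGROUPS];
};

/* Push a dynamic subgroup; returns its index, or -1 if no slot is free. */
static int sketch_subgroup_push(struct sketch_comm *comm, int size, int rank,
                                const int *proc_table)
{
    if (comm->num_subgroups >= SKETCH_MAX_SUBGROUPS) {
        return -1;
    }
    int idx = comm->num_subgroups++;
    comm->subgroups[idx].size = size;
    comm->subgroups[idx].rank = rank;
    comm->subgroups[idx].proc_table = proc_table;
    return idx;
}

/* Pop the most recently pushed dynamic subgroup; reserved slots stay. */
static void sketch_subgroup_pop(struct sketch_comm *comm)
{
    assert(comm->num_subgroups > SKETCH_NUM_RESERVED);
    comm->num_subgroups--;
}
```

Since pushes and pops are expected to pair up within one scope, a stack with a couple of dynamic slots is enough in this sketch; nesting deeper than that simply fails the push.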
  6. coll: add macros to get rank/size with coll_group

    Group collectives will have a non-trivial coll_group that alters the rank
    and size of the communicator. These macros and functions will
    facilitate that.
    hzhou committed Sep 14, 2024
    2b88398
  7. coll: add coll_group argument to coll interfaces

    Add coll_group, an index into comm->subgroups[], to all collectives except
    neighborhood collectives.
    hzhou committed Sep 14, 2024
    e804cbc
  8. e3969cc
  9. 024377a
  10. 6338e01
  11. ch4: fallback to mpir if coll_group is non-zero

    Assuming the device layer collectives are not able to handle a non-trivial
    coll_group, always fall back when coll_group != MPIR_SUBGROUP_NONE, for
    now.
    
    Also normalize the code style to use the fallback label. We should
    always fall back to the mpir impl routines rather than the netmod routines
    (composition_beta). The composition_beta may itself fall back in the
    future when the netmod collectives become fancier, resulting in an
    infinite loop.
    hzhou committed Sep 14, 2024
    049fab4
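The "fallback label" style mentioned in commit 049fab4 could look roughly like the following. This is a sketch under assumptions: the function names, the `device_supported` flag, and the value `0` for the "no subgroup" constant are invented for illustration, and the stub routines only report which path ran.

```c
#include <assert.h>

enum { SKETCH_SUBGROUP_NONE = 0 };   /* assumed value, for illustration */
enum sketch_path { SKETCH_PATH_DEVICE, SKETCH_PATH_MPIR };

/* Stand-ins for the generic (mpir) and device-level collectives. */
static enum sketch_path sketch_mpir_bcast_impl(void)
{
    return SKETCH_PATH_MPIR;
}

static enum sketch_path sketch_device_bcast(void)
{
    return SKETCH_PATH_DEVICE;
}

/* Device-layer entry point: every bail-out jumps to one fallback label,
 * and that label calls the generic routine directly rather than another
 * device-level composition that might fall back again. */
static enum sketch_path sketch_ch4_bcast(int coll_group, int device_supported)
{
    enum sketch_path path;

    assert(coll_group >= 0);    /* coll_group is an index, never negative */

    if (coll_group != SKETCH_SUBGROUP_NONE) {
        goto fallback;          /* device code is not subgroup aware */
    }
    if (!device_supported) {
        goto fallback;
    }
    path = sketch_device_bcast();
    goto fn_exit;

  fallback:
    path = sketch_mpir_bcast_impl();

  fn_exit:
    return path;
}
```

Routing every exit through the same label keeps the fallback target in one place, which is what prevents two layers from falling back to each other.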
  12. coll: add coll_group to csel signature

    Make csel coll_group aware.
    hzhou committed Sep 14, 2024
    98fc2fc
  13. coll: threadcomm coll to use MPIR_SUBGROUP_THREADCOMM

    Use coll_group=MPIR_SUBGROUP_THREADCOMM for threadcomm collectives. This
    allows compositional collectives under threadcomm.
    hzhou committed Sep 14, 2024
    20b3244
  14. coll: check coll_group in MPIR_Comm_is_parent_comm

    We call MPIR_Comm_is_parent_comm to prevent recursively entering
    compositional algorithms such as the _smp algorithms. Check coll_group
    as well, since we will switch to using subgroups rather than subcomms.
    Also check num_external directly for a trivial comm. Subcomms and
    comm->hierarchy_kind will be removed in the future.
    hzhou committed Sep 14, 2024
    a831867
  15. coll: make non-compositional algorithm coll_group aware

    Use MPIR_COLL_RANK_SIZE if the algorithm is topology neutral.
    
    Use MPIR_COLL_RANK_SIZE_NO_GROUP if the algorithm is topology dependent.
    It adds an assertion that coll_group == MPIR_SUBGROUP_NONE, since
    coll_group may alter the topology assumptions.
    
    Intercomm does not work with non-zero coll_group.
    hzhou committed Sep 14, 2024
    d2a6412
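The distinction commit d2a6412 draws between the two rank/size macros can be sketched as follows. The `SKETCH_` macros and the struct layout are assumptions for illustration only; the real MPIR_COLL_RANK_SIZE and MPIR_COLL_RANK_SIZE_NO_GROUP definitions are not shown on this page.

```c
#include <assert.h>

enum { SKETCH_SUBGROUP_NONE = 0 };   /* assumed value, for illustration */

struct sketch_subgroup {
    int rank;
    int size;
};

struct sketch_comm {
    int rank;
    int size;
    struct sketch_subgroup subgroups[4];
};

/* Topology-neutral algorithms: take rank/size from the subgroup when one
 * is given, otherwise from the communicator itself. */
#define SKETCH_COLL_RANK_SIZE(comm, coll_group, rank_, size_)          \
    do {                                                               \
        if ((coll_group) == SKETCH_SUBGROUP_NONE) {                    \
            (rank_) = (comm)->rank;                                    \
            (size_) = (comm)->size;                                    \
        } else {                                                       \
            (rank_) = (comm)->subgroups[(coll_group)].rank;            \
            (size_) = (comm)->subgroups[(coll_group)].size;            \
        }                                                              \
    } while (0)

/* Topology-dependent algorithms: refuse to run on a subgroup, because
 * the subgroup may invalidate the topology assumptions. */
#define SKETCH_COLL_RANK_SIZE_NO_GROUP(comm, coll_group, rank_, size_) \
    do {                                                               \
        assert((coll_group) == SKETCH_SUBGROUP_NONE);                  \
        (rank_) = (comm)->rank;                                        \
        (size_) = (comm)->size;                                        \
    } while (0)
```

An algorithm picks one of the two at its top, so the rest of its body can use plain `rank` and `size` variables without knowing whether it runs on the whole communicator or on a subgroup.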
  16. coll: modify bcast_intra_smp to use subgroups

    Replace the usage of subcomms with subgroups.
    hzhou committed Sep 14, 2024
    a2f92c4
  17. coll: avoid extra intra bcast in bcast_intra_smp

    When root is not local rank 0, instead of adding an extra intra-node
    send/recv or bcast, construct an inter-node group that includes the root
    process.
    hzhou committed Sep 14, 2024
    370661e
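One way to picture the inter-node group from commit 370661e: pick one representative per node, and on the root's node let that representative be the root itself. The sketch below assumes a plain `node_of_rank` array and a hypothetical helper name; how the PR actually builds the group is not shown here.

```c
#include <assert.h>

/* Hypothetical helper: fill leaders[n] with the rank that represents node n
 * in the inter-node group. node_of_rank[r] gives the node id of rank r.
 * Each node is represented by its lowest rank, except the root's node,
 * which is represented by the root. */
static void sketch_build_inter_group(const int *node_of_rank, int size,
                                     int num_nodes, int root, int *leaders)
{
    assert(root >= 0 && root < size);

    for (int n = 0; n < num_nodes; n++) {
        leaders[n] = -1;
    }
    /* The first rank seen on each node is that node's lowest rank. */
    for (int r = 0; r < size; r++) {
        int n = node_of_rank[r];
        if (leaders[n] == -1) {
            leaders[n] = r;
        }
    }
    /* Override the root's node so the root is already in the group. */
    leaders[node_of_rank[root]] = root;
}
```

With the root inside the inter-node group, the data can go straight from the root to the other node representatives, and each node then needs only one intra-node broadcast rooted at its representative.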
  18. ae6fe4e
  19. mpir: replace subcomm usage with subgroups

    Directly use information from MPIR_Process rather than from nodecomm in
    MPIR_Process.
    
    One step toward removing subcomms.
    hzhou committed Sep 14, 2024
    0ba1a80
  20. temp: fix csel

    hzhou committed Sep 14, 2024
    3543718
  21. coll: refactor caching tree in the comm struct

    Use a single "cached_tree" rather than 3 different fields for each tree
    type.
    hzhou committed Sep 14, 2024
    7ea94c7
  22. coll: add coll_group to treealgo routines

    The topology-aware tree utilities need to check coll_group for correct
    world ranks.
    hzhou committed Sep 14, 2024
    ce2274d
  23. coll: add nogroup restriction to certain algorithms

    Some algorithms, e.g. Allgather recexch, cache comm size-related info in
    the communicator and thus won't work with a non-trivial coll_group. Add a
    restriction so they will fall back when coll_group != MPIR_SUBGROUP_NONE.
    hzhou committed Sep 14, 2024
    2bf4890
  24. coll: check coll_group in MPIR_Sched_next_tag

    All subgroup collectives should use the same tag within the parent
    collective. This is because all processes in the communicator have to
    agree on the tag to use, but group collectives may not involve all
    processes. It is okay to use the same tag as long as the group
    collectives are always issued in order. This is the case since all group
    collectives are spawned under a parent collective, which has to obey the
    non-overlapping rule.
    hzhou committed Sep 14, 2024
    757066a
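The tag rule in commit 757066a can be sketched as a tag counter that only advances for whole-communicator collectives. The struct, the function name, and the assumption that the parent collective always requests its tag before spawning any subgroup collective are all illustrative; the real MPIR_Sched_next_tag also has to handle details such as tag wrap-around, which are omitted here.

```c
#include <assert.h>

enum { SKETCH_SUBGROUP_NONE = 0 };   /* assumed value, for illustration */

struct sketch_comm {
    int next_sched_tag;   /* tag handed to the most recent parent collective */
};

/* Hypothetical tag allocator: a parent collective (no subgroup) advances
 * the counter on every process of the communicator; a subgroup collective
 * reuses the parent's tag and leaves the counter untouched, so processes
 * outside the subgroup never get out of step. */
static int sketch_sched_next_tag(struct sketch_comm *comm, int coll_group)
{
    assert(coll_group >= 0);
    if (coll_group == SKETCH_SUBGROUP_NONE) {
        comm->next_sched_tag++;
    }
    return comm->next_sched_tag;
}
```

Because only some processes take part in a given subgroup collective, advancing the counter there would leave the processes of one communicator with different counters; reusing the parent's tag avoids that.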
  25. coll: refactor barrier_intra_k_dissemination

    Because the compiler can't figure out the arithmetic, it emits the warning:
        ‘MPIC_Waitall’ accessing 8 bytes in a region of size 0
    [-Wstringop-overflow=]
    
    Refactor to suppress the warning and improve readability.
    hzhou committed Sep 14, 2024
    513991d
  26. coll/allreduce: remove a leftover empty branch

    Commit ba1b4dd left an empty branch
    that should be removed.
    hzhou committed Sep 14, 2024
    10adb96