coll: add coll_group to collective interfaces #7103

hzhou · 2024-08-16T23:01:59Z

Pull Request Description

Make all (most) collective algorithms able to work within a subgroup.

Replace subcomm with lightweight MPIR_Subgroup
MPIR_Subgroup differs from MPIR_Group as the latter does not live inside a communicator, thus overly complex and inefficient to use.
Communicator is a very complex object that takes on all kinds of tricks such as attributes, hints, cache, context id ... Maintaining sub-communicator compounds these complexities and are expensive to maintain. The goal of this PR is to eventually remove sub-communicators.
This PR will provide a few examples.
Use MPIR_SUBGROUP_NONE in place of coll_group argument will provide backward collective semantics, i.e. the whole communicator collective.

typedef struct MPIR_Subgroup {
    int size;
    int rank;
    int *proc_table;
} MPIR_Subgroup;

mpi_errno = MPIR_Bcast_impl(buffer, count, datatype, root, comm, coll_group, errflag);
                                                                 ^^^^^^^^^^

Replacing subcomm usage with subgroup

MPIR_Bcast(buf, count, datatype, root, node_comm, MPIR_ERR_NONE);

become

MPIR_Bcast(buf, count, datatype, root, comm, MPIR_SUBGROUP_NODE, MPIR_ERR_NONE);

One of the goals of this PR is to make all mpir- layer intra-collectives coll_group aware.

inter-collectives only work with MPIR_SUBGROUP_NONE (no group inter collectives)
compositional algorithms (i.e. _smp) only work with MPIR_SUBGROUP_NONE, algo selection need make sure not to create recursive compositional situation
All non-compositional intra algorithms should work with all coll_group. So it should not directly access (size,rank)-related fields in the communicator structure.
[skip warnings]

Author Checklist

Provide Description
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits Follow Good Practice
Commits are self-contained and do not do two things at once.
Commit message is of the form: module: short description
Commit message explains what's in the commit.
Passes All Tests
Whitespace checker. Warnings test. Additional tests via comments.
Contribution Agreement
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your companies PR approval manager.

hzhou · 2024-08-23T03:38:22Z

test:mpich/ch4/most
test:mpich/ch3/most

Only ch4-ofi-shm 42 failures in allgather2

hzhou · 2024-08-24T22:22:32Z

test:mpich/ch4/most
test:mpich/ch3/most

hzhou · 2024-08-25T01:38:21Z

test:mpich/ch4/most
test:mpich/ch3/most

1 failure - ch4-ucx-external - [coll.01376 - ./coll/reduce 10 MPIR_CVAR_REDUCE_POSIX_INTRA_ALGORITHM=release_gather MPIR_CVAR_COLL_SHM_LIMIT_PER_NODE=131072 MPIR_CVAR_REDUCE_INTRANODE_BUFFER_TOTAL_SIZE=32768 MPIR_CVAR_REDUCE_INTRANODE_NUM_CELLS=4 MPIR_CVAR_REDUCE_INTRANODE_TREE_KVAL=8 MPIR_CVAR_REDUCE_INTRANODE_TREE_TYPE=knomial_2](https://jenkins-pmrs.cels.anl.gov/job/mpich-review-ch4-ucx/3401/jenkins_configure=external,label=ubuntu22.04_review/testReport/junit/(root)/coll/01376_____coll_reduce_10__MPIR_CVAR_REDUCE_POSIX_INTRA_ALGORITHM_release_gather_MPIR_CVAR_COLL_SHM_LIMIT_PER_NODE_131072_MPIR_CVAR_REDUCE_INTRANODE_BUFFER_TOTAL_SIZE_32768_MPIR_CVAR_REDUCE_INTRANODE_NUM_CELLS_4_MPIR_CVAR_REDUCE_INTRANODE_TREE_KVAL_8_MPIR_CVAR_REDUCE_INTRANODE_TREE_TYPE_knomial_2/)

hzhou · 2024-08-26T15:14:56Z

Try get a clean test:

test:mpich/ch4/ucx

src/mpi/comm/commutil.c

raffenet · 2024-09-06T13:39:03Z

MPIR_Subgroup differs from MPIR_Group as the latter does not live inside a communicator, thus overly complex and inefficient to use.

What is meant by MPIR_Group lives inside a communicator? Groups are independent objects, no? I am starting to wonder why we are adding the complexity of a whole new object and allocation scheme vs using what we have in MPI[R]_Group.

hzhou · 2024-09-06T14:04:13Z

MPIR_Subgroup differs from MPIR_Group as the latter does not live inside a communicator, thus overly complex and inefficient to use.

What is meant by MPIR_Group lives inside a communicator? Groups are independent objects, no? I am starting to wonder why we are adding the complexity of a whole new object and allocation scheme vs using what we have in MPI[R]_Group.

MPIR_Group is not associated with any communicator. Unlike MPIR_Subgroup, it's inside a communicator.

When we use MPIR_Group for communication -

we need to find a communicator for its communication context
we need translate the lpid used in MPIR_Group into the ranks in that communicator
So it's more cumbersome to use, and easily can confuse developers who are not familiar.
Also, MPIR_Group represents and is confined by MPI_Group. Thus it's less flexible to adapt to our internal usages.

src/mpi/coll/allreduce/allreduce_intra_k_reduce_scatter_allgather.c

src/mpid/common/sched/mpidu_sched.c

hzhou · 2024-09-12T23:05:27Z

test:mpich/ch4/most
test:mpich/ch3/most

Only 2 timouts in ch4-ofi-default due to congestions:

datatype.01767 - ./datatype/large_type_sendrec 2 33 
coll.00127 - ./coll/gather_big 8

Store num_local and num_external in MPIR_Comm. Along with internode_table, they help construct internode subgroups.

This is the same as num_external.

As the title.

It does not take many instructions to calculate pof2 on the fly. Use of hard coded pof2 prevents collective algorithms to be used for non-trivial coll_group.

Lightweight struct to describe sub-groups of a communicator. They intend to replace the subcomms. Preset a set of reserved subgroups to simplify common usages such as intranode group and crossnode group. Since we only expect limited number of dynamic subgroups and they should always be push/pop'ed within the scope, we don't need many dynamic slots.

Group collectives will have non-trivial coll_group that alter the rank and size of the communicator. Thease macros and functions will facilitate it.

Add coll_group, index to comm->subgroups[], to all collectives except neighborhood collectives.

Assuming the device layer collectives are not able to handle non-trivial coll_group, always fallback when coll_group != MPIR_SUBGROUP_NONE, for now. Also normalize the code style to use the fallback label. We should always fallback to mpir impl routines rather than the netmod routines (composition_beta). The composition_beta may fallback in the future when netmod coll become fancy, resulting in deadloop.

Make csel coll_group aware.

Use coll_group=MPIR_SUBGROUP_THREADCOMM for threadcomm collectives. This allows compositional collectives under threadcomm.

We call MPIR_Comm_is_parent_comm to prevent recursively entering compositional algorithms such as the _smp algorithms. Check coll_group as well as we will switch to use subgroup rather than subcomms. Also check num_external directly for trivial comm. Subcomms and comm->hierarchy_kind will be removed in the future.

Use MPIR_COLL_RANK_SIZE if the algorithm is topology neutral. Use MPIR_COLL_RANK_SIZE_NO_GROUP if the algorithm is topology dependent. It adds an assertion on coll_group == MPIR_SUBGROUPS_NONE since coll_group may alter the topology assumptions. Intercomm does not work with non-zero coll_group.

Replace the usage of subcomms with subgroups.

When root is not local rank 0, instead of adding a extra intra-node send/recv or bcast, construct an inter group that includes the root process.

Directly use information from MPIR_Process rather than from nodecomm in MPIR_Process. One step toward removing subcomms.

Use a single "cached_tree" rather than 3 different fields for each tree type.

The topology-aware tree utilities need check coll_group for correct world ranks.

Some algorithm, e.g. Allgather recexch, caches comm size-related info in communicator, thus won't work with none trivial coll_group. Add a restriction so it will fallback when coll_group != MPIR_SUBGROUP_NONE.

All subgroup collectives should use the same tag within the parent collectives. This is because all processes in the communicator has to agree on the tag to use, but group collectives may not involve all processes. It is okay to use the same tag as long as the group collectives are always issued in order. This is the case since all group collectives are spawned under a parent collective, which has to obey the non-overlapping rule.

Because the compiler can't figure out the arithmetic, it is warning: ‘MPIC_Waitall’ accessing 8 bytes in a region of size 0 [-Wstringop-overflow=] Refactor to suppress warning and for better readability.

Commit ba1b4dd left an empty branch that should be removed.

hzhou · 2024-09-14T14:55:11Z

test:mpich/pmi
test:mpich/ch4/xpmem

raffenet

I don't have any strong objections to these changes at this stage. What would be good to know is the scope of additional changes coming for 4.3 based on the new interface. Comparisons vs. the previous implementation would also help, especially for review. Please add a commit summarizing the work to CHANGES.

raffenet · 2024-09-16T16:12:21Z

src/mpi/coll/bcast/bcast_intra_smp.c

+ if (local_rank == local_root) {
 mpi_errno = MPIR_Bcast(buffer, count, datatype, inter_root,
 comm_ptr, node_roots_group, errflag);
 MPIR_ERR_CHECK(mpi_errno);
 }

- /* perform the intranode broadcast on all except for the root's node */
+ /* perform the intranode broadcast */
 if (local_size > 1) {
 mpi_errno = MPIR_Bcast(buffer, count, datatype, 0, comm_ptr, node_group, errflag);
 MPIR_ERR_CHECK(mpi_errno);
 }


I definitely appreciate the simplicity of this approach vs the extra messaging. Can we capture any data to confirm this is not any slower vs the previous code with non-zero root?

raffenet · 2024-09-16T16:17:47Z

src/mpi/coll/src/csel.c

@@ -643,14 +643,14 @@ static csel_node_s *prune_tree(csel_node_s * root, MPIR_Comm * comm_ptr)
 else
 node = node->success;
 break;
-
+/*


out of curiousity what is the eventual change to this block?

Since we will no longer have subcomm, we will only remove hierarchy kind, and there is no point to do hierarchy kind check. We need check the current usages and rename the check. If it is checking for whether it's subgroup, we can remove the prune here; it has to be checked dynamically (on whether coll_group is 0). If it is checking for topology, we should replace the check with something more specific.

hzhou force-pushed the 2408_coll_group branch 10 times, most recently from 295e2e0 to 736c1d1 Compare August 23, 2024 03:19

hzhou force-pushed the 2408_coll_group branch 3 times, most recently from cefbfd9 to d8d297a Compare August 24, 2024 22:12

hzhou force-pushed the 2408_coll_group branch from d8d297a to 6cb5afa Compare August 25, 2024 01:38

hzhou marked this pull request as ready for review August 25, 2024 01:38

hzhou requested a review from raffenet August 25, 2024 01:38

hzhou mentioned this pull request Aug 29, 2024

coll: add coll_attr and comm subgroups #6590

Closed

4 tasks

raffenet reviewed Sep 3, 2024

View reviewed changes

src/mpi/comm/commutil.c Show resolved Hide resolved

raffenet reviewed Sep 10, 2024

View reviewed changes

src/mpi/coll/allreduce/allreduce_intra_k_reduce_scatter_allgather.c Show resolved Hide resolved

raffenet reviewed Sep 10, 2024

View reviewed changes

src/mpid/common/sched/mpidu_sched.c Outdated Show resolved Hide resolved

hzhou force-pushed the 2408_coll_group branch 2 times, most recently from b3477a3 to b6cecd0 Compare September 12, 2024 22:56

comm: store num_local and num_external in MPIR_Comm

0499415

Store num_local and num_external in MPIR_Comm. Along with internode_table, they help construct internode subgroups.

hzhou added 25 commits September 14, 2024 09:51

comm: remove node_count

b7d6412

This is the same as num_external.

comm/csel: remove reference to subcomms in csel prune_tree

cae3828

As the title.

coll: remove coll.pof2 field

438e7b8

It does not take many instructions to calculate pof2 on the fly. Use of hard coded pof2 prevents collective algorithms to be used for non-trivial coll_group.

coll: add macros to get rank/size with coll_group

2b88398

Group collectives will have non-trivial coll_group that alter the rank and size of the communicator. Thease macros and functions will facilitate it.

coll: add coll_group argument to coll interfaces

e804cbc

Add coll_group, index to comm->subgroups[], to all collectives except neighborhood collectives.

continue: add coll_group to collective interfaces

e3969cc

coll: add coll_group argument to MPIC/sched/TSP routines

024377a

continue: add coll_group in MPIC/sched/TSP routines

6338e01

coll: add coll_group to csel signature

98fc2fc

Make csel coll_group aware.

coll: threadcomm coll to use MPIR_SUBGROUP_THREADCOMM

20b3244

Use coll_group=MPIR_SUBGROUP_THREADCOMM for threadcomm collectives. This allows compositional collectives under threadcomm.

coll: modify bcast_intra_smp to use subgroups

a2f92c4

Replace the usage of subcomms with subgroups.

coll: avoid extra intra bcast in bcast_intra_smp

370661e

When root is not local rank 0, instead of adding a extra intra-node send/recv or bcast, construct an inter group that includes the root process.

coll: modify smp algorithms to use MPIR_Subgroup

ae6fe4e

mpir: replace subcomm usage with subgroups

0ba1a80

Directly use information from MPIR_Process rather than from nodecomm in MPIR_Process. One step toward removing subcomms.

temp: fix csel

3543718

coll: refactor caching tree in the comm struct

7ea94c7

Use a single "cached_tree" rather than 3 different fields for each tree type.

coll: add coll_group to treealgo routines

ce2274d

The topology-aware tree utilities need check coll_group for correct world ranks.

coll: add nogroup restriction to certain algorithms

2bf4890

Some algorithm, e.g. Allgather recexch, caches comm size-related info in communicator, thus won't work with none trivial coll_group. Add a restriction so it will fallback when coll_group != MPIR_SUBGROUP_NONE.

coll: refactor barrier_intra_k_dissemination

513991d

Because the compiler can't figure out the arithmetic, it is warning: ‘MPIC_Waitall’ accessing 8 bytes in a region of size 0 [-Wstringop-overflow=] Refactor to suppress warning and for better readability.

coll/allreduce: remove a leftover empty branch

10adb96

Commit ba1b4dd left an empty branch that should be removed.

hzhou force-pushed the 2408_coll_group branch from 331eb14 to 10adb96 Compare September 14, 2024 14:54

raffenet reviewed Sep 16, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

coll: add coll_group to collective interfaces #7103

coll: add coll_group to collective interfaces #7103

hzhou commented Aug 16, 2024 •

edited

Loading

hzhou commented Aug 23, 2024 •

edited

Loading

hzhou commented Aug 24, 2024

hzhou commented Aug 25, 2024 •

edited

Loading

hzhou commented Aug 26, 2024

raffenet commented Sep 6, 2024

hzhou commented Sep 6, 2024

hzhou commented Sep 12, 2024 •

edited

Loading

hzhou commented Sep 14, 2024

raffenet left a comment

raffenet Sep 16, 2024

raffenet Sep 16, 2024

hzhou Sep 17, 2024

coll: add coll_group to collective interfaces #7103

Are you sure you want to change the base?

coll: add coll_group to collective interfaces #7103

Conversation

hzhou commented Aug 16, 2024 • edited Loading

Pull Request Description

Author Checklist

hzhou commented Aug 23, 2024 • edited Loading

hzhou commented Aug 24, 2024

hzhou commented Aug 25, 2024 • edited Loading

hzhou commented Aug 26, 2024

raffenet commented Sep 6, 2024

hzhou commented Sep 6, 2024

hzhou commented Sep 12, 2024 • edited Loading

hzhou commented Sep 14, 2024

raffenet left a comment

Choose a reason for hiding this comment

raffenet Sep 16, 2024

Choose a reason for hiding this comment

raffenet Sep 16, 2024

Choose a reason for hiding this comment

hzhou Sep 17, 2024

Choose a reason for hiding this comment

hzhou commented Aug 16, 2024 •

edited

Loading

hzhou commented Aug 23, 2024 •

edited

Loading

hzhou commented Aug 25, 2024 •

edited

Loading

hzhou commented Sep 12, 2024 •

edited

Loading