Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

coll: add coll_group to collective interfaces #7103

Open
wants to merge 26 commits into
base: main
Choose a base branch
from

Conversation

hzhou
Copy link
Contributor

@hzhou hzhou commented Aug 16, 2024

Pull Request Description

Make all (most) collective algorithms able to work within a subgroup.

  • Replace subcomm with lightweight MPIR_Subgroup
  • MPIR_Subgroup differs from MPIR_Group as the latter does not live inside a communicator, thus overly complex and inefficient to use.
  • Communicator is a very complex object that takes on all kinds of tricks such as attributes, hints, cache, context id ... Maintaining sub-communicator compounds these complexities and are expensive to maintain. The goal of this PR is to eventually remove sub-communicators.
  • This PR will provide a few examples.
  • Use MPIR_SUBGROUP_NONE in place of coll_group argument will provide backward collective semantics, i.e. the whole communicator collective.
typedef struct MPIR_Subgroup {
    int size;
    int rank;
    int *proc_table;
} MPIR_Subgroup; 
mpi_errno = MPIR_Bcast_impl(buffer, count, datatype, root, comm, coll_group, errflag);
                                                                 ^^^^^^^^^^
  • Replacing subcomm usage with subgroup
MPIR_Bcast(buf, count, datatype, root, node_comm, MPIR_ERR_NONE);

become

MPIR_Bcast(buf, count, datatype, root, comm, MPIR_SUBGROUP_NODE, MPIR_ERR_NONE);

One of the goals of this PR is to make all mpir- layer intra-collectives coll_group aware.

  • inter-collectives only work with MPIR_SUBGROUP_NONE (no group inter collectives)
  • compositional algorithms (i.e. _smp) only work with MPIR_SUBGROUP_NONE, algo selection need make sure not to create recursive compositional situation
  • All non-compositional intra algorithms should work with all coll_group. So it should not directly access (size,rank)-related fields in the communicator structure.
    [skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your companies PR approval manager.

@hzhou hzhou force-pushed the 2408_coll_group branch 10 times, most recently from 295e2e0 to 736c1d1 Compare August 23, 2024 03:19
@hzhou
Copy link
Contributor Author

hzhou commented Aug 23, 2024

test:mpich/ch4/most
test:mpich/ch3/most

Only ch4-ofi-shm 42 failures in allgather2

@hzhou hzhou force-pushed the 2408_coll_group branch 3 times, most recently from cefbfd9 to d8d297a Compare August 24, 2024 22:12
@hzhou
Copy link
Contributor Author

hzhou commented Aug 24, 2024

test:mpich/ch4/most
test:mpich/ch3/most

@hzhou
Copy link
Contributor Author

hzhou commented Aug 25, 2024

test:mpich/ch4/most
test:mpich/ch3/most

1 failure - ch4-ucx-external - [coll.01376 - ./coll/reduce 10 MPIR_CVAR_REDUCE_POSIX_INTRA_ALGORITHM=release_gather MPIR_CVAR_COLL_SHM_LIMIT_PER_NODE=131072 MPIR_CVAR_REDUCE_INTRANODE_BUFFER_TOTAL_SIZE=32768 MPIR_CVAR_REDUCE_INTRANODE_NUM_CELLS=4 MPIR_CVAR_REDUCE_INTRANODE_TREE_KVAL=8 MPIR_CVAR_REDUCE_INTRANODE_TREE_TYPE=knomial_2](https://jenkins-pmrs.cels.anl.gov/job/mpich-review-ch4-ucx/3401/jenkins_configure=external,label=ubuntu22.04_review/testReport/junit/(root)/coll/01376_____coll_reduce_10__MPIR_CVAR_REDUCE_POSIX_INTRA_ALGORITHM_release_gather_MPIR_CVAR_COLL_SHM_LIMIT_PER_NODE_131072_MPIR_CVAR_REDUCE_INTRANODE_BUFFER_TOTAL_SIZE_32768_MPIR_CVAR_REDUCE_INTRANODE_NUM_CELLS_4_MPIR_CVAR_REDUCE_INTRANODE_TREE_KVAL_8_MPIR_CVAR_REDUCE_INTRANODE_TREE_TYPE_knomial_2/)

@hzhou hzhou marked this pull request as ready for review August 25, 2024 01:38
@hzhou hzhou requested a review from raffenet August 25, 2024 01:38
@hzhou
Copy link
Contributor Author

hzhou commented Aug 26, 2024

Try get a clean test:

test:mpich/ch4/ucx

@raffenet
Copy link
Contributor

raffenet commented Sep 6, 2024

  • MPIR_Subgroup differs from MPIR_Group as the latter does not live inside a communicator, thus overly complex and inefficient to use.

What is meant by MPIR_Group lives inside a communicator? Groups are independent objects, no? I am starting to wonder why we are adding the complexity of a whole new object and allocation scheme vs using what we have in MPI[R]_Group.

@hzhou
Copy link
Contributor Author

hzhou commented Sep 6, 2024

  • MPIR_Subgroup differs from MPIR_Group as the latter does not live inside a communicator, thus overly complex and inefficient to use.

What is meant by MPIR_Group lives inside a communicator? Groups are independent objects, no? I am starting to wonder why we are adding the complexity of a whole new object and allocation scheme vs using what we have in MPI[R]_Group.

MPIR_Group is not associated with any communicator. Unlike MPIR_Subgroup, it's inside a communicator.

When we use MPIR_Group for communication -

  1. we need to find a communicator for its communication context
  2. we need translate the lpid used in MPIR_Group into the ranks in that communicator
  3. So it's more cumbersome to use, and easily can confuse developers who are not familiar.
  4. Also, MPIR_Group represents and is confined by MPI_Group. Thus it's less flexible to adapt to our internal usages.

@hzhou hzhou force-pushed the 2408_coll_group branch 2 times, most recently from b3477a3 to b6cecd0 Compare September 12, 2024 22:56
@hzhou
Copy link
Contributor Author

hzhou commented Sep 12, 2024

test:mpich/ch4/most
test:mpich/ch3/most

Only 2 timouts in ch4-ofi-default due to congestions:

datatype.01767 - ./datatype/large_type_sendrec 2 33 
coll.00127 - ./coll/gather_big 8

Store num_local and num_external in MPIR_Comm. Along with
internode_table, they help construct internode subgroups.
This is the same as num_external.
It does not take many instructions to calculate pof2 on the fly. Use of
hard coded pof2 prevents collective algorithms to be used for
non-trivial coll_group.
Lightweight struct to describe sub-groups of a communicator. They intend
to replace the subcomms.

Preset a set of reserved subgroups to simplify common usages such as
intranode group and crossnode group. Since we only expect limited number
of dynamic subgroups and they should always be push/pop'ed within the
scope, we don't need many dynamic slots.
Group collectives will have non-trivial coll_group that alter the rank
and size of the communicator. Thease macros and functions will
facilitate it.
Add coll_group, index to comm->subgroups[], to all collectives except
neighborhood collectives.
Assuming the device layer collectives are not able to handle non-trivial
coll_group, always fallback when coll_group != MPIR_SUBGROUP_NONE, for
now.

Also normalize the code style to use the fallback label. We should
always fallback to mpir impl routines rather than the netmod routines
(composition_beta). The composition_beta may fallback in the
future when netmod coll become fancy, resulting in deadloop.
Make csel coll_group aware.
Use coll_group=MPIR_SUBGROUP_THREADCOMM for threadcomm collectives. This
allows compositional collectives under threadcomm.
We call MPIR_Comm_is_parent_comm to prevent recursively entering
compositional algorithms such as the _smp algorithms. Check coll_group
as well as we will switch to use subgroup rather than subcomms.
Also check num_external directly for trivial comm. Subcomms and
comm->hierarchy_kind will be removed in the future.
Use MPIR_COLL_RANK_SIZE if the algorithm is topology neutral.

Use MPIR_COLL_RANK_SIZE_NO_GROUP if the algorithm is topology dependent.
It adds an assertion on coll_group == MPIR_SUBGROUPS_NONE since
coll_group may alter the topology assumptions.

Intercomm does not work with non-zero coll_group.
Replace the usage of subcomms with subgroups.
When root is not local rank 0, instead of adding a extra intra-node
send/recv or bcast, construct an inter group that includes the root
process.
Directly use information from MPIR_Process rather than from nodecomm in
MPIR_Process.

One step toward removing subcomms.
Use a single "cached_tree" rather than 3 different fields for each tree
type.
The topology-aware tree utilities need check coll_group for correct
world ranks.
Some algorithm, e.g. Allgather recexch, caches comm size-related info in
communicator, thus won't work with none trivial coll_group. Add a
restriction so it will fallback when coll_group != MPIR_SUBGROUP_NONE.
All subgroup collectives should use the same tag within the parent
collectives. This is because all processes in the communicator has to
agree on the tag to use, but group collectives may not involve all
processes. It is okay to use the same tag as long as the group
collectives are always issued in order. This is the case since all group
collectives are spawned under a parent collective, which has to obey the
non-overlapping rule.
Because the compiler can't figure out the arithmetic, it is warning:
    ‘MPIC_Waitall’ accessing 8 bytes in a region of size 0
[-Wstringop-overflow=]

Refactor to suppress warning and for better readability.
Commit ba1b4dd left an empty branch
that should be removed.
@hzhou
Copy link
Contributor Author

hzhou commented Sep 14, 2024

test:mpich/pmi
test:mpich/ch4/xpmem

Copy link
Contributor

@raffenet raffenet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any strong objections to these changes at this stage. What would be good to know is the scope of additional changes coming for 4.3 based on the new interface. Comparisons vs. the previous implementation would also help, especially for review. Please add a commit summarizing the work to CHANGES.

Comment on lines 79 to 89
if (local_rank == local_root) {
mpi_errno = MPIR_Bcast(buffer, count, datatype, inter_root,
comm_ptr, node_roots_group, errflag);
MPIR_ERR_CHECK(mpi_errno);
}

/* perform the intranode broadcast on all except for the root's node */
/* perform the intranode broadcast */
if (local_size > 1) {
mpi_errno = MPIR_Bcast(buffer, count, datatype, 0, comm_ptr, node_group, errflag);
MPIR_ERR_CHECK(mpi_errno);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely appreciate the simplicity of this approach vs the extra messaging. Can we capture any data to confirm this is not any slower vs the previous code with non-zero root?

@@ -643,14 +643,14 @@ static csel_node_s *prune_tree(csel_node_s * root, MPIR_Comm * comm_ptr)
else
node = node->success;
break;

/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of curiousity what is the eventual change to this block?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we will no longer have subcomm, we will only remove hierarchy kind, and there is no point to do hierarchy kind check. We need check the current usages and rename the check. If it is checking for whether it's subgroup, we can remove the prune here; it has to be checked dynamically (on whether coll_group is 0). If it is checking for topology, we should replace the check with something more specific.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants