stream: add stream synchronize to non-stream operations #7023
base: main
Conversation
Useful for GPU stream based MPI extensions.
Extend the functionality of MPIX streams by allowing a stream communicator with a local GPU stream to be passed to regular MPI functions. The semantics are to run an implicit GPU stream synchronize before the MPI operation. This serializes the MPI operation with the GPU stream, albeit in a heavy-handed way.
If a stream communicator is backed by a GPU stream and we call regular pt2pt and collective functions instead of the enqueue functions, we should run a stream synchronize to ensure the buffers are ready for the pt2pt and collective operations. There is no need to stream synchronize for completion functions, i.e. Test and Wait, since buffer safety is asserted by the nonblocking semantics, and offloading calls issued after the completion function are safe to use the buffer. Amend: add RMA operations too. For the same reason, we don't need a stream synchronize for Win_fence, Win_lock, etc.; the RMA synchronization calls are essentially the host-side counterpart to stream synchronize.
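For concreteness, a minimal usage sketch of the behavior described above. The info keys ("type"/"value") and MPIX_Info_set_hex follow MPICH's existing GPU-stream examples; the buffer, count, dest, and tag parameters are illustrative, and the implicit synchronize is the behavior this PR proposes, not something the caller writes.

```c
/* Sketch (assumptions: CUDA backend; stream-comm creation as in MPICH's
 * GPU-stream examples).  With this change, the regular MPI_Send below
 * implicitly does cudaStreamSynchronize on the communicator's local
 * stream before touching the buffer. */
#include <mpi.h>
#include <cuda_runtime.h>

void send_from_gpu(float *d_sendbuf, int count, int dest, int tag)
{
    cudaStream_t custream;
    cudaStreamCreate(&custream);

    /* Attach the CUDA stream to an MPIX stream. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "type", "cudaStream_t");
    MPIX_Info_set_hex(info, "value", &custream, sizeof(custream));

    MPIX_Stream mpi_stream;
    MPIX_Stream_create(info, &mpi_stream);
    MPI_Info_free(&info);

    /* Derive a stream communicator backed by that GPU stream. */
    MPI_Comm stream_comm;
    MPIX_Stream_comm_create(MPI_COMM_WORLD, mpi_stream, &stream_comm);

    /* Asynchronously produce the send buffer on the GPU stream... */
    cudaMemsetAsync(d_sendbuf, 0, count * sizeof(float), custream);

    /* ...then call a regular (non-_enqueue) operation.  The implicit
     * stream synchronize ensures the asynchronous work has finished
     * writing d_sendbuf before the send starts. */
    MPI_Send(d_sendbuf, count, MPI_FLOAT, dest, tag, stream_comm);

    MPI_Comm_free(&stream_comm);
    MPIX_Stream_free(&mpi_stream);
    cudaStreamDestroy(custream);
}
```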
Is the goal to preserve the MPI semantics of the existing MPI calls, such that they guarantee local or remote completion on return to the calling context (which is, I think, the correct goal)? In standard MPI semantics, if a call to …
Yes.
So I think upon "return" of …
If you pass a GPU-stream-backed stream communicator to a legacy library, say PETSc, PETSc won't be aware of the stream semantics and thus won't fulfill the stream-sync burden. If you place the burden on the caller of PETSc, that may work; it should work even without this PR.
That is correct for the … We do not desire to change the semantics of existing APIs such as …
I think in the case you state with PETSc, it's still on the programmer; handing PETSc a communicator with pending Isends and Irecvs on it is not much different from handing it one with enqueued MPI operations on it. I completely agree we don't want to change the semantics of …
That was one of the initial possibilities, but I think it is now evident that it would create too much confusion. Before our conversation, I was leaning toward banning the use of GPU-stream communicators with traditional MPI operations, but you convinced me there is merit in making it work, similar to how we made …
To the user, I think it is semantically equivalent whether the implementation enqueues the operation and then stream synchronizes, or stream synchronizes and then carries out the operation in the host context; thus it is an implementation detail. Because there will be a stream synchronization either way, I don't think there is much performance difference. Of course, the latter is currently much easier to implement, as you can see in this PR.
Aha, I see. We were thinking of two different things. I was considering … Under the latter framing, synchronizing before performing the operation in the host context is certainly semantically necessary, but is it sufficient, and how does it interact with MPI thread semantics? Could another thread call MPI_Send_enqueue to the GPU context at the same time, or does that violate the semantics of the underlying MPIX_Stream?
The basic semantic of an MPIX stream is a serial execution context, so users are required to ensure there are no concurrent calls from multiple threads for operations on the same stream. The stream synchronize at the beginning of … The assumption is that during the call of e.g. …
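To make the serial-execution-context point concrete, here is a small sketch. It assumes stream_comm and custream were created as in the earlier sketch; the buffer name is illustrative.

```c
/* Assumes stream_comm was created from custream as in the sketch above. */
void ordered_use(float *d_buf, int n, cudaStream_t custream, MPI_Comm stream_comm)
{
    /* Same thread: the asynchronous GPU work and the regular MPI call are
     * ordered, because the MPI call first performs the implicit stream
     * synchronize on custream. */
    cudaMemsetAsync(d_buf, 0, n * sizeof(float), custream);
    MPI_Allreduce(MPI_IN_PLACE, d_buf, n, MPI_FLOAT, MPI_SUM, stream_comm);

    /* Not allowed: another thread concurrently issuing operations on the
     * same stream (e.g. MPI_Send_enqueue on stream_comm) while the call
     * above is in flight.  An MPIX stream is a serial execution context,
     * so avoiding such concurrency is the user's responsibility. */
}
```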
Cool, that's what I hoped (and one of the reasons I like the MPIX_Stream concept).
Pull Request Description
In the original proposal, if a stream communicator is backed by a local GPU stream, we can only issue _enqueue operations, such as MPI_Send_enqueue, MPI_Isend_enqueue, MPI_Wait_enqueue, etc. However, as pointed out by others, it is convenient and useful to allow GPU-backed stream communicators to be used with regular MPI operations, so that they can be readily used with legacy libraries. To be semantically correct, we need to insert stream synchronization calls, i.e. cudaStreamSynchronize, to ensure buffer safety.
We only need to call stream synchronize before the start of an MPI operation, to ensure the buffers are cleared from the GPU side. There is no need for a stream synchronize after MPI completion (e.g. MPI_Wait), since the offloading operations issued after the MPI completion are safe to access the buffers. The MPI synchronization calls, e.g. MPI_Test, MPI_Wait, MPI_Win_fence, are essentially the host-side equivalent of a GPU-side stream synchronize.
[skip warnings]
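Conceptually, the effect of the change is as if each affected operation were guarded as in the sketch below. The helper name comm_local_gpu_stream is hypothetical and does not correspond to MPICH internals; this only illustrates the "synchronize, then run in the host context" semantics described above.

```c
/* Hypothetical illustration only -- comm_local_gpu_stream() is a made-up
 * query standing in for "this communicator's MPIX stream carries a local
 * GPU stream". */
#include <stdbool.h>
#include <mpi.h>
#include <cuda_runtime.h>

extern bool comm_local_gpu_stream(MPI_Comm comm, cudaStream_t *stream_out);

int send_with_implicit_sync(const void *buf, int count, MPI_Datatype dt,
                            int dest, int tag, MPI_Comm comm)
{
    cudaStream_t custream;
    if (comm_local_gpu_stream(comm, &custream)) {
        /* Serialize the host-side operation after all work previously
         * enqueued on the GPU stream (kernels, _enqueue operations). */
        cudaStreamSynchronize(custream);
    }
    return MPI_Send(buf, count, dt, dest, tag, comm);
    /* Completion calls (MPI_Test/MPI_Wait) and RMA synchronization
     * (MPI_Win_fence, MPI_Win_lock, ...) need no such guard: they are the
     * host-side counterparts of a stream synchronize. */
}
```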
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form: module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.