
stream: add stream synchronize to non-stream operations #7023

Open · wants to merge 3 commits into base: main
Conversation

@hzhou (Contributor) commented Jun 5, 2024

Pull Request Description

In the original proposal, if a stream communicator is backed by a local GPU stream, we can only issue _enqueue operations, such as MPI_Send_enqueue, MPI_Isend_enqueue, MPI_Wait_enqueue, etc. However, as pointed out by others, it is convenient and useful to allow GPU-backed stream communicators to be used with regular MPI operations so that they can be readily used with legacy libraries. To be semantically correct, we need to insert stream synchronization calls, i.e. cudaStreamSynchronize, to ensure buffer safety.

We only need to call stream synchronize before the start of an MPI operation to ensure the buffers are cleared from the GPU side. There is no need for a stream synchronize after the MPI completion (e.g. MPI_Wait), since offloading operations issued after the MPI completion are safe to access the buffers.

The MPI synchronization calls, e.g. MPI_Test, MPI_Wait, MPI_Win_fence, are essentially the host-side equivalent of a GPU-side stream synchronize.
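For illustration, here is a minimal sketch of what this enables: a regular MPI_Send/MPI_Recv on a stream communicator whose MPIX stream is backed by a CUDA stream. The info keys ("type"/"value" via MPIX_Info_set_hex) follow the MPICH stream examples and may need adjusting for your MPICH version; buffer sizes and values are arbitrary.

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(void)
{
    MPI_Init(NULL, NULL);

    /* Create a CUDA stream and attach it to an MPIX stream.  The info keys
     * here follow the MPICH stream examples; check your MPICH version. */
    cudaStream_t cu_stream;
    cudaStreamCreate(&cu_stream);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "type", "cudaStream_t");
    MPIX_Info_set_hex(info, "value", &cu_stream, sizeof(cu_stream));

    MPIX_Stream mpi_stream;
    MPIX_Stream_create(info, &mpi_stream);
    MPI_Info_free(&info);

    MPI_Comm stream_comm;
    MPIX_Stream_comm_create(MPI_COMM_WORLD, mpi_stream, &stream_comm);

    int rank;
    MPI_Comm_rank(stream_comm, &rank);

    float *d_buf;
    cudaMalloc((void **) &d_buf, 4 * sizeof(float));

    if (rank == 0) {
        float src[4] = {1, 2, 3, 4};
        /* Stage the payload on the GPU stream... */
        cudaMemcpyAsync(d_buf, src, sizeof(src), cudaMemcpyHostToDevice, cu_stream);
        /* ...then call a *regular* MPI_Send on the stream communicator.
         * With this PR, MPICH issues an implicit cudaStreamSynchronize on
         * cu_stream first, so the copy above has finished before the send
         * buffer is read. */
        MPI_Send(d_buf, 4, MPI_FLOAT, 1, 0, stream_comm);
    } else if (rank == 1) {
        MPI_Recv(d_buf, 4, MPI_FLOAT, 0, 0, stream_comm, MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Comm_free(&stream_comm);
    MPIX_Stream_free(&mpi_stream);
    cudaStreamDestroy(cu_stream);
    MPI_Finalize();
    return 0;
}
```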


[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

hzhou added 3 commits June 5, 2024 09:59
Useful for GPU stream based MPI extensions.
Extend the functionality of MPIX stream by allowing passing a stream
communicator with a local gpu stream to regular MPI functions. The
semantics are to run an implicit gpu stream synchronize before the MPI
operation. This semantically serializes the MPI operation with the GPU
stream, albeit in a heavy-handed way.
If a stream communicator is backed by a gpu stream and we call regular
pt2pt and collective functions instead of the enqueue functions, we
should run stream synchronize to ensure the buffer is cleared for the
pt2pt and coll operations.

There is no need to stream synchronize for completion functions, i.e.
Test and Wait, since the buffer safety is asserted by the nonblocking
semantics and offloading calls issued after the completion function are
safe to use the buffer.

Amend: add rma operations too. Following the same reasoning, we don't
need stream synchronize for Win_fence, Win_lock, etc. The RMA
synchronization calls are essentially the host-side counterpart of
stream synchronize.
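A sketch of the RMA case described in the amended commit, assuming a window created over a GPU-stream-backed stream communicator (win, d_origin, h_src, n, and target are placeholders):

```c
/* Pending GPU work that writes the origin buffer. */
cudaMemcpyAsync(d_origin, h_src, n * sizeof(float),
                cudaMemcpyHostToDevice, cu_stream);

MPI_Win_fence(0, win);   /* RMA synchronization call: no stream synchronize added */

/* The RMA operation itself gets the implicit cudaStreamSynchronize, so the
 * asynchronous copy above has drained before MPI reads d_origin. */
MPI_Put(d_origin, n, MPI_FLOAT, target, 0, n, MPI_FLOAT, win);

MPI_Win_fence(0, win);   /* host-side completion; again no stream synchronize */
```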
@patrickb314

Is the goal to preserve the MPI semantics of the existing MPI calls such that they guarantee local or remote completion on return to the calling context (which is, I think, the correct goal)? In standard MPI semantics, if a call to MPI_Recv returns successfully, the calling context (not just subsequently called MPI operations) has guarantees about the state of the buffer on return from the call; I don't think this provides that.

@hzhou (Contributor, Author) commented Jun 5, 2024

Is the goal to preserve the MPI semantics of the existing MPI calls such that they guarantee local or remote completion on return to the calling context (which is, I think, the correct goal)?

Yes.

In standard MPI semantics, if a call to MPI_Recv returns successfully, the calling context (not just subsequently called MPI operations) has guarantees about the state of the buffer on return from the call; I don't think this provides that.

MPI_Recv is a blocking call. It may be better to split it into MPI_Irecv + MPI_Wait for the sake of discussion.
Because MPI_Irecv starts an MPI operation that will access the message buffer, a stream synchronize is needed to clear all previously issued (GPU-stream) operations that may access the buffer.
Between MPI_Irecv and MPI_Wait, users are not allowed to issue operations that may access the buffer (doing so is incorrect). This includes issuing GPU offloading operations.
MPI_Wait does not change the state of buffer-access ownership, so it doesn't need another stream synchronization. On return, the buffer is released back to the user. Because there shouldn't be any pending GPU operations that depend on the buffer, there is no need to stream synchronize.
After MPI_Wait, users can issue operations that access the recv buffer, including offloading operations.

So I think upon "return" of MPI_Wait, it does provide the guarantee about the state of the buffer, as much as in the traditional case, that is, assuming the user didn't violate the nonblocking requirement not to access the buffer between the MPI starting call and the corresponding completion call.
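As a sketch of that timeline (produce_kernel, consume_kernel, d_recv_buf, and the other names are hypothetical placeholders, not part of this PR):

```c
/* GPU work issued earlier on the attached stream may still touch d_recv_buf. */
produce_kernel<<<grid, block, 0, cu_stream>>>(d_recv_buf);

/* With this PR, MPI_Irecv first runs cudaStreamSynchronize(cu_stream), so all
 * previously enqueued stream work that might touch the buffer has drained
 * before MPI takes ownership of d_recv_buf. */
MPI_Request req;
MPI_Irecv(d_recv_buf, count, MPI_FLOAT, src, tag, stream_comm, &req);

/* Between Irecv and Wait the user must not access d_recv_buf, neither from
 * the host nor by enqueuing GPU work on it (the usual nonblocking rule). */

/* MPI_Wait progresses and completes in the host context.  No extra stream
 * synchronize is needed: no pending GPU operation may depend on the buffer. */
MPI_Wait(&req, MPI_STATUS_IGNORE);

/* After Wait, ownership returns to the user; offloading work that consumes
 * the received data may be enqueued again. */
consume_kernel<<<grid, block, 0, cu_stream>>>(d_recv_buf);
```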

@patrickb314

  1. The caller makes guarantees about the state of a buffer before calling MPI with that buffer; in the case of Irecv, that the buffer is ready for MPI to work on. If a stream sync is necessary here, IMO traditional MPI semantics place the burden of that on the programmer, not MPI, right?
  2. My understanding is that MPI_Wait_enqueue (or MPI_Wait on a GPU stream communicator) enqueues the wait on the other execution stream and returns immediately to the calling context. In particular, the MPIX_Stream paper notes "For example, MPIX_Send_enqueue, as with all enqueuing APIs, returns immediately after registering the operation." Is that incorrect?

@hzhou (Contributor, Author) commented Jun 5, 2024

  1. The caller makes guarantees about the state of a buffer before calling MPI with that buffer; in the case of Irecv, that the buffer is ready for MPI to work on. If a stream sync is necessary here, IMO traditional MPI semantics place the burden of that on the programmer, not MPI, right?

If you pass a gpu-stream backed stream communicator to a legacy library, say PETSc, PETSc won't be aware of the stream semantics and thus won't fulfill the stream sync burden. If you place the burden on the caller of PETSc, that may work. It should work without this PR.

  2. My understanding is that MPI_Wait_enqueue (or MPI_Wait on a GPU stream communicator) enqueues the wait on the other execution stream and returns immediately to the calling context. In particular, the MPIX_Stream paper notes "For example, MPIX_Send_enqueue, as with all enqueuing APIs, returns immediately after registering the operation." Is that incorrect?

That is correct for the _enqueue operations, but not for MPI_Wait. MPI_Wait does not have the stream "enqueue" semantics; it will progress and wait within the host execution context. Thus, on its return, the buffer is ready.

We do not want to change the semantics of existing APIs such as MPI_Wait, e.g. by adding _enqueue semantics. Or did I misunderstand?

@patrickb314

I think in the case you state with PETSc, it's still on the programmer; handing PETSc a communicator with pending Isends and Irecvs on it is not much different from handing it one with enqueued MPI operations on it.

I completely agree we don't want to change the semantics of MPI_Wait. My understanding was that Send, Recv, and Wait called on stream communicators with an associated GPU stream (even when not called through the _enqueue interface) resulted in enqueuing operations; perhaps I misunderstood.

@hzhou (Contributor, Author) commented Jun 6, 2024

My understanding was that Send, Recv, and Wait called on stream communicators with an associated GPU stream (even when not called through the _enqueue interface) resulted in enqueuing operations; perhaps I misunderstood.

That was one of the initial possibilities, but I think it is now evident that it would create too much confusion. Before our conversation, I was leaning toward banning the use of GPU stream communicators with traditional MPI operations, but you convinced me there is merit in making it work, similar to how we made MPI_THREAD_MULTIPLE work.

To the user, I think it is semantically equivalent whether the implementation enqueues the operation and then stream synchronizes, or stream synchronizes and then carries out the operation in the host context. Thus it is an implementation detail. Because there will be a stream synchronization either way, I don't think there is much performance difference. Of course, the latter is currently much easier to implement, as you can see in this PR.
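For concreteness, a sketch of the two variants from the caller's point of view; the helper names are hypothetical and neither function is the actual MPICH internal implementation:

```c
/* (a) What this PR does: drain the GPU stream, then perform the operation in
 * the host execution context. */
static int send_sync_then_host(const void *buf, int count, MPI_Datatype dt,
                               int dst, int tag, MPI_Comm comm, cudaStream_t s)
{
    cudaStreamSynchronize(s);                 /* clear prior GPU-stream work */
    return MPI_Send(buf, count, dt, dst, tag, comm);
}

/* (b) The alternative: enqueue the send behind the pending GPU work, then
 * wait for the stream (and with it the enqueued send) to complete, which
 * approximates blocking-send behavior for the caller. */
static int send_enqueue_then_sync(const void *buf, int count, MPI_Datatype dt,
                                  int dst, int tag, MPI_Comm comm, cudaStream_t s)
{
    int err = MPIX_Send_enqueue(buf, count, dt, dst, tag, comm);
    cudaStreamSynchronize(s);                 /* block until it has executed */
    return err;
}
```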

@patrickb314

Aha, I see. We were thinking of two different things. I was thinking of MPI_Recv on GPU stream communicators as enqueuing the operation and needing to be modified to preserve traditional semantics. You were thinking of it as a separate operation that always waits for completion before returning but needs to synchronize with the stream-triggered operations already enqueued in order to preserve ordering.

Under the latter framing, synchronizing before performing the operation in the host context is certainly semantically necessary, but is it sufficient, and how does it interact with MPI thread semantics? Could another thread MPI_Send_enqueue to the GPU context at the same time, or does that violate the semantics of the underlying MPIX_Stream?

@hzhou (Contributor, Author) commented Jun 10, 2024

how does it interact with MPI thread semantics? Could another thread MPI_Send_enqueue to the GPU context at the same time or does that violate the semantics of the underlying MPIX_Stream?

The basic semantic of an MPIX stream is a serial execution context, so users are required to ensure there are no concurrent calls from multiple threads for operations on the same stream. The stream synchronize at the beginning of MPI_Send (etc.) drains the GPU stream and thus effectively serializes the GPU stream with the calling thread.

The assumption is that during the call to e.g. MPI_Send (before it returns), no additional GPU stream operations will be issued. This assumption is part of the MPIX_Stream semantics: no two threads may concurrently operate on the same stream.
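A hypothetical illustration of that rule (d_buf, d_other, n, peer, stream_comm, and cu_stream are placeholders):

```c
/* Thread A (owns the stream for the duration of the call): the implicit
 * cudaStreamSynchronize serializes the GPU stream with this thread, then the
 * send runs in the host execution context. */
MPI_Send(d_buf, n, MPI_FLOAT, peer, 0, stream_comm);

/* Thread B, at the same time: NOT allowed on the same stream. */
/* MPIX_Send_enqueue(d_other, n, MPI_FLOAT, peer, 1, stream_comm);   violates the
 *     rule that no two threads operate on the same MPIX stream concurrently */
/* my_kernel<<<grid, block, 0, cu_stream>>>(d_other);                likewise */

/* Thread B may, however, operate concurrently on a different MPIX stream /
 * stream communicator, or on an ordinary communicator such as MPI_COMM_WORLD. */
```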

@patrickb314

The assumption is that during the call to e.g. MPI_Send (before it returns), no additional GPU stream operations will be issued. This assumption is part of the MPIX_Stream semantics: no two threads may concurrently operate on the same stream.

Cool, that's what I hoped (and one of the reasons I like the MPIX_Stream concept).
