
stream: add stream synchronize to non-stream operations #7023

Open · wants to merge 3 commits into base: main
Conversation

@hzhou (Contributor) commented Jun 5, 2024

Pull Request Description

In the original proposal, if a stream communicator is backed by a local GPU stream, we can only issue _enqueue operations, such as MPI_Send_enqueue, MPI_Isend_enqueue, MPI_Wait_enqueue, etc. However, as pointed out by others, it is convenient and useful to allow GPU-backed stream communicators to be used with regular MPI operations so that they can be readily used with legacy libraries. To be semantically correct, we need to insert stream synchronization calls, i.e. cudaStreamSynchronize, to ensure buffer safety.

We only need to call stream synchronize before the start of an MPI operation to ensure the buffers are cleared from the GPU side. There is no need for a stream synchronize after the MPI completion (e.g. MPI_Wait), since offloading operations issued after the MPI completion are safe to access the buffers.

The MPI synchronization calls, e.g. MPI_Test, MPI_Wait, MPI_Win_fence, are essentially the host-side equivalent of a GPU-side stream synchronize.
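For illustration, here is a minimal sketch of what this enables: a regular MPI_Send/MPI_Recv on a stream communicator whose MPIX stream is backed by a CUDA stream. The info keys ("type"/"value" via MPIX_Info_set_hex) follow the MPICH stream examples and may need adjusting for your MPICH version; buffer sizes and values are arbitrary.

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(void)
{
    MPI_Init(NULL, NULL);

    /* Create a CUDA stream and attach it to an MPIX stream.  The info keys
     * here follow the MPICH stream examples; check your MPICH version. */
    cudaStream_t cu_stream;
    cudaStreamCreate(&cu_stream);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "type", "cudaStream_t");
    MPIX_Info_set_hex(info, "value", &cu_stream, sizeof(cu_stream));

    MPIX_Stream mpi_stream;
    MPIX_Stream_create(info, &mpi_stream);
    MPI_Info_free(&info);

    MPI_Comm stream_comm;
    MPIX_Stream_comm_create(MPI_COMM_WORLD, mpi_stream, &stream_comm);

    int rank;
    MPI_Comm_rank(stream_comm, &rank);

    float *d_buf;
    cudaMalloc((void **) &d_buf, 4 * sizeof(float));

    if (rank == 0) {
        float src[4] = {1, 2, 3, 4};
        /* Stage the payload on the GPU stream... */
        cudaMemcpyAsync(d_buf, src, sizeof(src), cudaMemcpyHostToDevice, cu_stream);
        /* ...then call a *regular* MPI_Send on the stream communicator.
         * With this PR, MPICH issues an implicit cudaStreamSynchronize on
         * cu_stream first, so the copy above has finished before the send
         * buffer is read. */
        MPI_Send(d_buf, 4, MPI_FLOAT, 1, 0, stream_comm);
    } else if (rank == 1) {
        MPI_Recv(d_buf, 4, MPI_FLOAT, 0, 0, stream_comm, MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Comm_free(&stream_comm);
    MPIX_Stream_free(&mpi_stream);
    cudaStreamDestroy(cu_stream);
    MPI_Finalize();
    return 0;
}
```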


[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

hzhou added 3 commits June 5, 2024 09:59
Useful for GPU stream based MPI extensions.
Extend the functionality of MPIX stream by allowing passing a stream
communicator with a local gpu stream to regular MPI functions. The
semantics are to run an implicit gpu stream synchronize before the MPI
operation. This semantically serializes the MPI operation with the GPU
stream, albeit in a heavy-handed way.
If a stream communicator is backed by a gpu stream and we call regular
pt2pt and collective functions instead of the enqueue functions, we
should run stream synchronize to ensure the buffer is cleared for the
pt2pt and coll operations.

There is no need to stream synchronize for completion functions, i.e.
Test and Wait, since the buffer safety is asserted by the nonblocking
semantics and offloading calls issued after the completion function are
safe to use the buffer.

Amend: add rma operations too. Following the same reasoning, we don't
need stream synchronize for Win_fence, Win_lock, etc. The RMA
synchronization calls are essentially the host-side counterpart of
stream synchronize.
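A sketch of the RMA case described in the amended commit, assuming a window created over a GPU-stream-backed stream communicator (win, d_origin, h_src, n, and target are placeholders):

```c
/* Pending GPU work that writes the origin buffer. */
cudaMemcpyAsync(d_origin, h_src, n * sizeof(float),
                cudaMemcpyHostToDevice, cu_stream);

MPI_Win_fence(0, win);   /* RMA synchronization call: no stream synchronize added */

/* The RMA operation itself gets the implicit cudaStreamSynchronize, so the
 * asynchronous copy above has drained before MPI reads d_origin. */
MPI_Put(d_origin, n, MPI_FLOAT, target, 0, n, MPI_FLOAT, win);

MPI_Win_fence(0, win);   /* host-side completion; again no stream synchronize */
```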
@patrickb314

Is the goal to preserve the MPI semantics of the existing MPI calls such that they guarantee local or remote completion on return to the calling context (which is, I think, the correct goal)? In standard MPI semantics, if a call to MPI_Recv returns successfully, the calling context (not just subsequently called MPI operations) has guarantees about the state of the buffer on return from the call; I don't think this provides that.

@hzhou (Contributor, Author) commented Jun 5, 2024

Is the goal to preserve the MPI semantics of the existing MPI calls such that they guarantee local or remote completion on return to the calling context (which is, I think, the correct goal)?

Yes.

In standard MPI semantics, if a call to MPI_Recv returns successfully, the calling context (not just subsequently called MPI operations) has guarantees about the state of the buffer on return from the call; I don't think this provides that.

MPI_Recv is a blocking call. It may be better to split it into MPI_Irecv + MPI_Wait for the sake of discussion.
Because MPI_Irecv starts an MPI operation that will access the message buffer, a stream synchronize is needed to clear all previously issued (GPU-stream) operations that may access the buffer.
Between MPI_Irecv and MPI_Wait, users are not allowed to issue operations that may access the buffer (doing so is incorrect). This includes issuing GPU offloading operations.
MPI_Wait does not change the state of buffer-access ownership, so it doesn't need another stream synchronization. On return, the buffer is released back to the user. Because there shouldn't be any pending GPU operations that depend on the buffer, there is no need to stream synchronize.
After MPI_Wait, users can issue operations that access the recv buffer, including offloading operations.

So I think upon "return" of MPI_Wait, it does provide the guarantee about the state of the buffer, as much as in the traditional case, that is, assuming the user didn't violate the nonblocking requirement not to access the buffer between the MPI starting call and the corresponding completion call.
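As a sketch of that timeline (produce_kernel, consume_kernel, d_recv_buf, and the other names are hypothetical placeholders, not part of this PR):

```c
/* GPU work issued earlier on the attached stream may still touch d_recv_buf. */
produce_kernel<<<grid, block, 0, cu_stream>>>(d_recv_buf);

/* With this PR, MPI_Irecv first runs cudaStreamSynchronize(cu_stream), so all
 * previously enqueued stream work that might touch the buffer has drained
 * before MPI takes ownership of d_recv_buf. */
MPI_Request req;
MPI_Irecv(d_recv_buf, count, MPI_FLOAT, src, tag, stream_comm, &req);

/* Between Irecv and Wait the user must not access d_recv_buf, neither from
 * the host nor by enqueuing GPU work on it (the usual nonblocking rule). */

/* MPI_Wait progresses and completes in the host context.  No extra stream
 * synchronize is needed: no pending GPU operation may depend on the buffer. */
MPI_Wait(&req, MPI_STATUS_IGNORE);

/* After Wait, ownership returns to the user; offloading work that consumes
 * the received data may be enqueued again. */
consume_kernel<<<grid, block, 0, cu_stream>>>(d_recv_buf);
```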

@patrickb314

  1. The caller makes guarantees about the state of a buffer before calling MPI with that buffer; in the case of Irecv, that the buffer is ready for MPI to work on. If a stream sync is necessary here, IMO traditional MPI semantics place the burden of that on the programmer, not MPI, right?
  2. My understanding is that MPI_Wait_enqueue (or MPI_Wait on a GPU stream communicator) enqueues the wait on the other execution stream and returns immediately to the calling context. In particular, the MPIX_Stream paper notes "For example, MPIX_Send_enqueue, as with all enqueuing APIs, returns immediately after registering the operation." Is that incorrect?

@hzhou (Contributor, Author) commented Jun 5, 2024

  1. The caller makes guarantees about the state of a buffer before calling MPI with that buffer; in the case of Irecv, that the buffer is ready for MPI to work on. If a stream sync is necessary here, IMO traditional MPI semantics place the burden of that on the programmer, not MPI, right?

If you pass a gpu-stream backed stream communicator to a legacy library, say PETSc, PETSc won't be aware of the stream semantics and thus won't fulfill the stream sync burden. If you place the burden on the caller of PETSc, that may work. It should work without this PR.

  2. My understanding is that MPI_Wait_enqueue (or MPI_Wait on a GPU stream communicator) enqueues the wait on the other execution stream and returns immediately to the calling context. In particular, the MPIX_Stream paper notes "For example, MPIX_Send_enqueue, as with all enqueuing APIs, returns immediately after registering the operation." Is that incorrect?

That is correct for the _enqueue operations, but not for MPI_Wait. MPI_Wait does not have the stream "enqueue" semantics; it will progress and wait within the host execution context. Thus, on its return, the buffer is ready.

We do not want to change the semantics of existing APIs such as MPI_Wait, e.g. by adding _enqueue semantics. Or did I misunderstand?

@patrickb314

I think in the case you state with PETSc, it's still on the programmer; handing PETSc a communicator with pending Isends and Irecvs on it is not much different from handing it one with enqueued MPI operations on it.

I completely agree we don't want to change the semantics of MPI_Wait. My understanding was that Send, Recv, and Wait called on stream communicators with an associated GPU stream (even when not called through the _enqueue interface) resulted in enqueuing operations; perhaps I misunderstood.

@hzhou (Contributor, Author) commented Jun 6, 2024

My understanding was that Send, Recv, and Wait called on stream communicators with an associated GPU stream (even when not called through the _enqueue interface) resulted in enqueuing operations; perhaps I misunderstood.

That was one of the initial possibilities, but I think it is now evident that it would create too much confusion. Before our conversation, I was leaning toward banning the use of GPU stream communicators with traditional MPI operations, but you convinced me there is merit in making it work, similar to how we made MPI_THREAD_MULTIPLE work.

To the user, I think it is semantically equivalent whether the implementation enqueues the operation and then stream synchronizes, or stream synchronizes and then carries out the operation in the host context. Thus it is an implementation detail. Because there will be a stream synchronization either way, I don't think there is much performance difference. Of course, the latter is currently much easier to implement, as you can see in this PR.
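For concreteness, a sketch of the two variants from the caller's point of view; the helper names are hypothetical and neither function is the actual MPICH internal implementation:

```c
/* (a) What this PR does: drain the GPU stream, then perform the operation in
 * the host execution context. */
static int send_sync_then_host(const void *buf, int count, MPI_Datatype dt,
                               int dst, int tag, MPI_Comm comm, cudaStream_t s)
{
    cudaStreamSynchronize(s);                 /* clear prior GPU-stream work */
    return MPI_Send(buf, count, dt, dst, tag, comm);
}

/* (b) The alternative: enqueue the send behind the pending GPU work, then
 * wait for the stream (and with it the enqueued send) to complete, which
 * approximates blocking-send behavior for the caller. */
static int send_enqueue_then_sync(const void *buf, int count, MPI_Datatype dt,
                                  int dst, int tag, MPI_Comm comm, cudaStream_t s)
{
    int err = MPIX_Send_enqueue(buf, count, dt, dst, tag, comm);
    cudaStreamSynchronize(s);                 /* block until it has executed */
    return err;
}
```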

@patrickb314

Aha, I see. We were thinking of two different things. I was thinking of MPI_Recv on GPU stream communicators as enqueuing the operation and needing to be modified to preserve traditional semantics. You were thinking of it as a separate operation that always waits for completion before returning but needs to synchronize with the stream-triggered operations already enqueued in order to preserve ordering.

Under the latter framing, synchronizing before performing the operation in the host context is certainly semantically necessary, but is it sufficient, and how does it interact with MPI thread semantics? Could another thread MPI_Send_enqueue to the GPU context at the same time, or does that violate the semantics of the underlying MPIX_Stream?

@hzhou (Contributor, Author) commented Jun 10, 2024

how does it interact with MPI thread semantics? Could another thread MPI_Send_enqueue to the GPU context at the same time or does that violate the semantics of the underlying MPIX_Stream?

The basic semantic of an MPIX stream is a serial execution context, so users are required to ensure there are no concurrent calls from multiple threads for operations on the same stream. The stream synchronize at the beginning of MPI_Send (etc.) drains the GPU stream and thus effectively serializes the GPU stream with the calling thread.

The assumption is that during the call to e.g. MPI_Send (before it returns), no additional GPU stream operations will be issued. This assumption is part of the MPIX_Stream semantics: no two threads may concurrently operate on the same stream.
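A hypothetical illustration of that rule (d_buf, d_other, n, peer, stream_comm, and cu_stream are placeholders):

```c
/* Thread A (owns the stream for the duration of the call): the implicit
 * cudaStreamSynchronize serializes the GPU stream with this thread, then the
 * send runs in the host execution context. */
MPI_Send(d_buf, n, MPI_FLOAT, peer, 0, stream_comm);

/* Thread B, at the same time: NOT allowed on the same stream. */
/* MPIX_Send_enqueue(d_other, n, MPI_FLOAT, peer, 1, stream_comm);   violates the
 *     rule that no two threads operate on the same MPIX stream concurrently */
/* my_kernel<<<grid, block, 0, cu_stream>>>(d_other);                likewise */

/* Thread B may, however, operate concurrently on a different MPIX stream /
 * stream communicator, or on an ordinary communicator such as MPI_COMM_WORLD. */
```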

@patrickb314

The assumption is that during the call to e.g. MPI_Send (before it returns), no additional GPU stream operations will be issued. This assumption is part of the MPIX_Stream semantics: no two threads may concurrently operate on the same stream.

Cool, that's what I hoped (and one of the reasons I like the MPIX_Stream concept).
