On some systems (e.g., Frontier) GPU memory is directly attached to the NIC, and moving data to the host before transfer (and receiving into host memory then moving it to the GPU) is very expensive. Thus we want to enable data transfers to:
initiate from the GPU manager, feeding GPU data buffers directly to MPI
allocate GPU memory at the receiver from the communication system, and pass these data buffers to the recv (RGETs)
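For reference, a minimal sketch of what feeding GPU data buffers directly to MPI means, assuming a CUDA-aware MPI library; the function and buffer names are illustrative and this is not PaRSEC code:

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Send a GPU-resident buffer without staging it through host memory.
 * With a CUDA-aware MPI, the library recognizes the device pointer and
 * can drive the NIC directly from GPU memory (GPUDirect RDMA). */
void send_gpu_buffer(int peer, int count)
{
    double *d_buf = NULL;
    cudaMalloc((void **)&d_buf, (size_t)count * sizeof(double));

    MPI_Request req;
    MPI_Isend(d_buf, count, MPI_DOUBLE, peer, /*tag*/ 0, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
}
```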
Describe the challenges
GPU manager side
The GPU manager being 'off' at times makes deferring actions from the comm thread to the GPU manager very complex; we don't want the comm thread to act as a GPU manager, as that would presumably cause performance problems
change the code so that the GPU manager is always active on one thread
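A minimal sketch of what "always active on one thread" could look like, assuming a simple command queue between the comm thread and the GPU manager; all names are hypothetical and not the PaRSEC API:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct gpu_cmd_s {
    struct gpu_cmd_s *next;
    void (*execute)(void *);
    void *arg;
} gpu_cmd_t;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    gpu_cmd_t      *head;      /* pending deferred actions (LIFO for brevity) */
    bool            shutdown;
} gpu_manager_t;

/* Comm thread side: defer an action instead of acting as a GPU manager itself. */
void gpu_manager_enqueue(gpu_manager_t *mgr, gpu_cmd_t *cmd)
{
    pthread_mutex_lock(&mgr->lock);
    cmd->next = mgr->head;
    mgr->head = cmd;
    pthread_cond_signal(&mgr->cond);
    pthread_mutex_unlock(&mgr->lock);
}

/* GPU manager thread: never 'off', so deferred actions are always picked up. */
void *gpu_manager_main(void *arg)
{
    gpu_manager_t *mgr = (gpu_manager_t *)arg;
    pthread_mutex_lock(&mgr->lock);
    while (!mgr->shutdown) {
        while (mgr->head == NULL && !mgr->shutdown)
            pthread_cond_wait(&mgr->cond, &mgr->lock);
        gpu_cmd_t *cmd = mgr->head;
        mgr->head = NULL;
        pthread_mutex_unlock(&mgr->lock);
        while (cmd != NULL) {              /* execute the deferred actions */
            gpu_cmd_t *next = cmd->next;
            cmd->execute(cmd->arg);
            cmd = next;
        }
        pthread_mutex_lock(&mgr->lock);
    }
    pthread_mutex_unlock(&mgr->lock);
    return NULL;
}
```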
Send side
The GPU manager can call send_activate, and that will issue the send of the Ctl messages (we may want to delegate the send_activate to the MPI threads)
The PUT already automatically uses the data_out, so if we don't do the PUSHOUT, we believe that it will behave correctly, except:
remove the PUSHOUT and see how it blows up (hopefully it won't)
removing the PUSHOUT will not prevent successor tasks from reusing the data_out locally and potentially modifying it while we are reading it; PUSHOUT today inserts a kernel_pop event in the stream and that would not be done anymore, so we would need to replicate that behavior to prevent WAR accesses. @abouteiller asks: it is not clear why this is needed when the PTG GPU code never had that problem before?
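For context, a sketch of the protection that PUSHOUT plus the kernel_pop event gives today; the buffer and stream names are illustrative. Once the reader is MPI instead of a stream-ordered copy, an equivalent hold on successors would be needed until the send/RGET is reported complete:

```c
#include <cuda_runtime.h>
#include <stddef.h>

/* Today: the PUSHOUT D2H copy is the reader of data_out, the kernel_pop-style
 * event marks when that read is finished, and a successor that may overwrite
 * data_out waits on the event, which prevents the WAR hazard. */
void pushout_with_war_protection(void *data_out_dev, void *host_staging, size_t bytes,
                                 cudaStream_t out_stream, cudaStream_t succ_stream)
{
    cudaEvent_t pop_done;
    cudaEventCreateWithFlags(&pop_done, cudaEventDisableTiming);

    cudaMemcpyAsync(host_staging, data_out_dev, bytes,
                    cudaMemcpyDeviceToHost, out_stream);   /* the read of data_out */
    cudaEventRecord(pop_done, out_stream);                 /* kernel_pop-style marker */

    cudaStreamWaitEvent(succ_stream, pop_done, 0);         /* successors wait: no WAR */
    cudaMemsetAsync(data_out_dev, 0, bytes, succ_stream);  /* stand-in for a successor
                                                              that overwrites data_out */
    cudaEventDestroy(pop_done);
}
```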
DATA_COPY_RELEASE (as seen in remote_dep_complete_and_cleanup): this is valid only for CPU data copies; we may want to investigate whether we can have a specialized destructor for GPU data copies (that returns them to the LRU etc., or maybe defers this to the GPU manager)
provide specialized GPU data copies with destructors that decrement (or defer the decrement of) the readers count in a thread-safe way and push the copies back to the LRUs
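A sketch of such a specialized destructor, assuming a copy structure with an atomic readers count; the types and the LRU/defer hooks are hypothetical, not the existing PaRSEC API:

```c
#include <stdatomic.h>

typedef struct gpu_data_copy_s {
    atomic_int readers;       /* outstanding readers of this copy */
    int        device;        /* owning GPU */
    void      *device_ptr;
} gpu_data_copy_t;

/* hypothetical hooks */
void lru_push(int device, gpu_data_copy_t *copy);
void gpu_manager_defer_release(int device, gpu_data_copy_t *copy);

void gpu_data_copy_release(gpu_data_copy_t *copy, int on_gpu_manager_thread)
{
    /* thread-safe decrement: this may be called from the comm thread */
    if (atomic_fetch_sub(&copy->readers, 1) != 1)
        return;                                        /* other readers still active */

    if (on_gpu_manager_thread)
        lru_push(copy->device, copy);                  /* safe to touch the LRU directly */
    else
        gpu_manager_defer_release(copy->device, copy); /* defer to the GPU manager */
}
```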
Recv side
GPU allocator that can be executed from the recv comm thread
Problem 1: allocation is an async operation (GPU memory may be full; we cannot allocate at this time)
Problem 2: we cannot partially allocate the inputs of tasks without running the risk of live-locking tasks, each one with some data ready and some data not allocated
only when all the inputs of a task are ready do we try to allocate the inputs and schedule the GET orders (PULL model vs. PUSH model; we may want to support both and have task decorators tell us which is best; see the sketch below)
Running evaluate on the progress thread is OK, but calling get_best_device early, before the task is ready, may be problematic (this may be fixed by the item above)
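A sketch of the PULL-model guard described above, with all names hypothetical: allocation is all-or-nothing and is only attempted once the last input becomes ready, so a task never holds a partial set of GPU buffers.

```c
/* Assumes on_input_ready() is called from a single progress thread. */
typedef struct task_s {
    int    nb_inputs;
    int    nb_inputs_ready;       /* inputs whose remote metadata has arrived */
    void **gpu_buffers;           /* allocated only in the all-ready transition */
} task_t;

int  gpu_try_allocate_all(task_t *t);      /* all-or-nothing allocation */
void schedule_rget(task_t *t, int input);  /* issue one RGET into GPU memory */
void requeue_for_allocation(task_t *t);    /* retry later, holding no memory */

void on_input_ready(task_t *t)
{
    if (++t->nb_inputs_ready < t->nb_inputs)
        return;                            /* PULL model: wait for the last input */

    if (!gpu_try_allocate_all(t)) {        /* never hold a partial set of buffers */
        requeue_for_allocation(t);
        return;
    }
    for (int i = 0; i < t->nb_inputs; i++)
        schedule_rget(t, i);               /* now schedule the GET orders */
}
```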
How do we notify the GPU manager of transfer completion?
(bad?) idea: repurpose stage_in to execute the MPI_Irecv and generate completion events on the GPU streams. The problem is that linking recv completion with stream events is hard/impossible with the current state of MPI :(
other idea: use a CUDA graph?
other other idea: split the GPU task lifecycle: { alloc, stage_in, mpi_recv, exec, stage_out, finalize }
this looks like the approach closest to the existing code at the moment: we would have a supplementary stage for the non-local stage-in that simulates an extra stream (triggered by mpi_test outcomes)
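A sketch of that supplementary comm-in "stream", assuming it is polled from the GPU manager progress loop like any other stream; the list type and gpu_task_advance_to_stage_in are hypothetical:

```c
#include <mpi.h>

typedef struct comm_in_s {
    MPI_Request       req;       /* the pending MPI_Irecv into GPU memory */
    struct gpu_task  *task;      /* task waiting on this input */
    struct comm_in_s *next;
} comm_in_t;

void gpu_task_advance_to_stage_in(struct gpu_task *task);  /* hypothetical */

/* Called from the GPU manager progress loop; MPI_Test outcomes play the role
 * that stream events play for real GPU streams. Returns the updated list head. */
comm_in_t *progress_comm_in_stream(comm_in_t *pending)
{
    comm_in_t **prev = &pending;
    for (comm_in_t *c = pending; c != NULL; ) {
        int done = 0;
        MPI_Test(&c->req, &done, MPI_STATUS_IGNORE);
        comm_in_t *next = c->next;
        if (done) {
            *prev = next;                          /* unlink the completed recv */
            gpu_task_advance_to_stage_in(c->task); /* comm-in finished: next stage */
        } else {
            prev = &c->next;
        }
        c = next;
    }
    return pending;
}
```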
problem: two GPU managers ask for the same remote copy (this is analogous to the NVLink decision)
idea1: the stage_in_comm may bounce back to stage_in to trigger an NVLink copy from the completed recv
idea2: the stage_in_comm always executes before the stage_in; a stage_in that happens while the comm-in is active for any GPU on that data copy returns HOOK_AGAIN, so that we do the NVLink copy later
idea3 (easiest to code): the comm-in follows the existing PUSH model; the only difference is that we allocate on the GPU and mpi_recv into it, and we don't schedule any stage_in before all the comm-ins are finished. The allocation will be per-data (not per task inputs), on a GPU (potentially arbitrary, if we can't decide better); the normal stage_in will move data with D2D copies as needed later.
try this model
upgrade so that OOM on the GPU spills over into CPU memory
fast-path memory allocation: if zone_malloc or the read-LRU can give us a data copy NOW, we can start ASAP, with no bouncing of events for allocation completion (see the sketch below)
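A sketch combining the two items above, assuming non-blocking stand-ins zone_malloc_try and lru_pop_readonly (hypothetical names): take the fast path when GPU memory is available now, and spill over to pinned host memory on GPU OOM.

```c
#include <cuda_runtime.h>
#include <stddef.h>

void *zone_malloc_try(int device, size_t bytes);   /* non-blocking GPU allocator */
void *lru_pop_readonly(int device, size_t bytes);  /* evict a read-only LRU copy */

void *alloc_recv_buffer(int device, size_t bytes, int *on_gpu)
{
    /* fast path: start the recv ASAP, no allocation-completion event needed */
    void *ptr = zone_malloc_try(device, bytes);
    if (ptr == NULL)
        ptr = lru_pop_readonly(device, bytes);
    if (ptr != NULL) { *on_gpu = 1; return ptr; }

    /* GPU OOM: spill over into pinned CPU memory; data is staged in later */
    *on_gpu = 0;
    if (cudaMallocHost(&ptr, bytes) != cudaSuccess)
        return NULL;                               /* caller must defer / retry */
    return ptr;
}
```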