Direct GPU-GPU inter-node data transfer #661

Open · abouteiller opened this issue Jun 11, 2024 · 0 comments
Labels: enhancement (New feature or request)

Description

On some systems (e.g., Frontier) GPU memory is directly attached to the NIC, and staging data to the host before transfer (and receiving into host memory before moving it to the GPU) is very expensive. Thus we want to enable data transfers that

  1. initiate from the GPU manager and feed GPU data buffers to MPI, and
  2. allocate GPU memory at the receiver from the communication system, and pass these data buffers to the recv (RGETs); a sketch of both follows below.
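
A minimal illustration of the target behavior, assuming a CUDA-aware MPI implementation; this is a generic sketch, not PaRSEC code:

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Both buffers handed to MPI live in GPU memory, so no host staging copy is
 * needed on either side: the sender feeds a device buffer to MPI (item 1) and
 * the receiver posts the recv directly into device memory (item 2). */
void exchange_gpu_buffers(size_t bytes, int peer, MPI_Comm comm)
{
    void *d_send = NULL, *d_recv = NULL;
    cudaMalloc(&d_send, bytes);
    cudaMalloc(&d_recv, bytes);

    MPI_Request reqs[2];
    MPI_Irecv(d_recv, (int)bytes, MPI_BYTE, peer, 0, comm, &reqs[0]);
    MPI_Isend(d_send, (int)bytes, MPI_BYTE, peer, 0, comm, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    cudaFree(d_send);
    cudaFree(d_recv);
}
```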

Describe the challenges

GPU manager side

  • The GPU manager being 'off' at times makes deferring actions from the comm thread to the GPU manager very complex; we do not want the comm thread to act as a GPU manager itself, as that would presumably cause performance problems.
    • Change the code so that the GPU manager is always active on one thread (a sketch follows below).
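
A minimal sketch of what an always-on GPU manager thread could look like, assuming a pthread-based design; every name here (gpu_manager_t, gpu_manager_post, gpu_manager_main, ...) is a hypothetical placeholder, not an existing PaRSEC symbol:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* A deferred action posted by the comm thread (e.g., a send-activate or a
 * stage-in request) for the GPU manager to execute. */
typedef struct gpu_action_s {
    struct gpu_action_s *next;
    void (*execute)(void *arg);
    void *arg;
} gpu_action_t;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    gpu_action_t   *head;
    bool            shutdown;
} gpu_manager_t;

/* Comm-thread side: defer an action to the always-active GPU manager. */
static void gpu_manager_post(gpu_manager_t *mgr, gpu_action_t *act)
{
    pthread_mutex_lock(&mgr->lock);
    act->next = mgr->head;
    mgr->head = act;
    pthread_cond_signal(&mgr->wake);
    pthread_mutex_unlock(&mgr->lock);
}

/* GPU-manager thread: it never turns 'off', it sleeps on the condition
 * variable and drains deferred actions as soon as they are posted. */
static void *gpu_manager_main(void *closure)
{
    gpu_manager_t *mgr = (gpu_manager_t *)closure;
    pthread_mutex_lock(&mgr->lock);
    while (!mgr->shutdown) {
        while (NULL == mgr->head && !mgr->shutdown)
            pthread_cond_wait(&mgr->wake, &mgr->lock);
        gpu_action_t *todo = mgr->head;
        mgr->head = NULL;
        pthread_mutex_unlock(&mgr->lock);
        while (NULL != todo) {                 /* execute outside the lock */
            gpu_action_t *next = todo->next;
            todo->execute(todo->arg);
            todo = next;
        }
        pthread_mutex_lock(&mgr->lock);
    }
    pthread_mutex_unlock(&mgr->lock);
    return NULL;
}
```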

Send side

  • The GPU manager can call send_activate, and that will issue the send of the Ctl messages (we may want to delegate the send-activate to the MPI threads).
  • The PUT already automatically uses the data_out, so if we don't do the PUSHOUT, we believe it will behave correctly, except:
    • remove the PUSHOUT and see whether it blows up (hopefully not)
  • Removing the PUSHOUT will not prevent successor tasks from reusing the data_out locally and potentially modifying it while we are still reading it; today the PUSHOUT inserts a kernel_pop event in the stream, and since that would no longer happen we would need to replicate that behavior to prevent WAR accesses (see the sketch after this list). @abouteiller asks: it is not clear why this is needed when the PTG GPU code never had that problem before.
  • DATA_COPY_RELEASE (as seen in remote_dep_complete_and_cleanup): this is valid only for CPU data copies; we may want to investigate whether we can have a specialized destructor for GPU data copies (one that returns the copy to the LRU, etc., or maybe defers it to the GPU manager).
    • Provide specialized GPU data copies with destructors that decrement the readers count in a thread-safe way (or defer the decrement) and push the copies back to the LRUs.
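
A minimal sketch of how the send side could keep WAR protection without the PUSHOUT, assuming the transfer is tracked through the copy's readers count rather than a stream event; gpu_copy_t, readers, return_to_lru and friends are illustrative placeholders, not PaRSEC API:

```c
#include <mpi.h>
#include <stdatomic.h>

typedef struct {
    void        *device_ptr;   /* lives in GPU memory, handed directly to MPI */
    atomic_int   readers;      /* > 0 while kernels or the network read it    */
} gpu_copy_t;

/* Issued from the GPU manager (or delegated to the MPI thread): the copy gains
 * a reader for the duration of the transfer, which is what keeps successor
 * writers away, i.e. the WAR protection the kernel_pop event used to provide. */
void start_direct_send(gpu_copy_t *copy, int count, MPI_Datatype dt,
                       int dest, int tag, MPI_Comm comm, MPI_Request *req)
{
    atomic_fetch_add(&copy->readers, 1);
    MPI_Isend(copy->device_ptr, count, dt, dest, tag, comm, req);  /* CUDA-aware MPI */
}

/* Polled from the comm thread: on completion the reader is dropped in a
 * thread-safe way, and the last reader hands the copy back to the LRU
 * (or defers that to the GPU manager, as discussed above). */
void progress_direct_send(gpu_copy_t *copy, MPI_Request *req,
                          void (*return_to_lru)(gpu_copy_t *))
{
    int done = 0;
    MPI_Test(req, &done, MPI_STATUS_IGNORE);
    if (done && 1 == atomic_fetch_sub(&copy->readers, 1))
        return_to_lru(copy);
}
```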

Recv side

  • We need a GPU allocator that can be executed from the recv comm thread.
    • Problem 1: that is an async operation (GPU memory may be full, so we may not be able to allocate at this time).
    • Problem 2: we cannot partially allocate the inputs of tasks without running the risk of live-locking, with each task holding some data ready and some data not yet allocated.
    • Only when all the inputs of a task are ready do we try to allocate them and schedule the GET orders (PULL model vs. PUSH model; we may want to support both and have task decorators tell us which is best).
  • Running evaluate on the progress thread is OK, but calling get_best_device early, before the task is ready, may be problematic (this may be fixed by the item above).
  • How to notify transfer completion to the GPU manager?
    • (bad?) idea: repurpose stage_in to execute the MPI_Irecv and generate completion events on the GPU streams. The problem is that linking recv completion with stream events is hard/impossible with the current state of MPI :(
    • other idea: use a CUDA graph?
    • other other idea: split the GPU task lifecycle: { alloc, stagein, mpirecv, exec, stageout, finalize }
      • this looks like the closest to the existing code at the moment: we would have a supplementary stage for non-local stage-in that simulates an extra stream (triggered by mpi_test outcomes).
  • Problem: two GPU managers ask for the same remote copy (this is analogous to the NVLink decision).
    • idea 1: the stage_in_comm may bounce back to stage_in to trigger an NVLink copy from the completed recv.
    • idea 2: the stage_in_comm always executes before the stage_in; a stage_in that happens while the comm-in is active for any GPU on that data copy returns HOOK_AGAIN, so that we do the NVLink copy later.
    • idea 3 (easiest to code): the comm-in follows the existing PUSH model; the only difference is that we allocate on the GPU and mpi_recv into it, and we don't schedule any stage-in before all comm-ins are finished. The allocation will be per-data (not per task inputs), on a GPU (potentially arbitrary, if we can't decide better); the normal stage-in will move data with D2D as needed later (see the sketch after this list).
      • try this model
      • upgrade so that OOM on the GPU spills over into CPU memory
      • fast-path memory allocation: if zone_malloc or the read-LRU can give us data NOW, we can start ASAP, with no bouncing events for allocation completion.
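
A minimal sketch of idea 3's comm-in path, treating MPI_Test outcomes as the completion events of an extra pseudo-stream; everything here (comm_in_t, gpu_zone_malloc, notify_gpu_manager, ...) is a hypothetical placeholder, not an existing PaRSEC symbol:

```c
#include <mpi.h>
#include <stddef.h>

typedef struct {
    MPI_Request  req;
    void        *device_ptr;   /* allocated on some (possibly arbitrary) GPU   */
    void        *task;         /* the task/data copy waiting on this receive   */
} comm_in_t;

/* Called from the comm thread when a remote input is activated.  Returns 0 on
 * the fast path (zone_malloc or the read-LRU gave us memory NOW), nonzero if
 * the allocation must be deferred (GPU OOM) or spilled over to CPU memory. */
int post_comm_in(comm_in_t *ci, size_t bytes, int src, int tag, MPI_Comm comm,
                 void *(*gpu_zone_malloc)(size_t))
{
    ci->device_ptr = gpu_zone_malloc(bytes);
    if (NULL == ci->device_ptr)
        return -1;   /* defer, or fall back to a host buffer (OOM spill-over) */
    MPI_Irecv(ci->device_ptr, (int)bytes, MPI_BYTE, src, tag, comm, &ci->req);
    return 0;
}

/* Polled from the comm thread's progress loop: the MPI_Test outcome plays the
 * role of the stream event.  Once all comm-ins of a data copy are complete the
 * GPU manager is notified, and the normal stage_in can move it with D2D later. */
void progress_comm_in(comm_in_t *ci, void (*notify_gpu_manager)(void *task))
{
    int done = 0;
    MPI_Test(&ci->req, &done, MPI_STATUS_IGNORE);
    if (done)
        notify_gpu_manager(ci->task);
}
```

With this PUSH-style comm-in, no stage_in is scheduled before the receive completes, so the existing stage_in/D2D logic does not need to know whether the copy arrived over the network or was already resident.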