Direct GPU-GPU inter-node data transfer #661

Open · abouteiller opened this issue Jun 11, 2024 · 0 comments
Labels: enhancement (New feature or request)

Description

On some systems (e.g., Frontier) GPU memory is directly attached to the NIC, and staging data to the host before transfer (and receiving into host memory before moving it to the GPU) is very expensive. Thus we want to enable data transfers that

  1. initiate from the GPU manager and feed GPU data buffers to MPI, and
  2. allocate GPU memory at the receiver from the communication system, and pass these data buffers to the recv (RGETs); a sketch of both follows below.
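
A minimal illustration of the target behavior, assuming a CUDA-aware MPI implementation; this is a generic sketch, not PaRSEC code:

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Both buffers handed to MPI live in GPU memory, so no host staging copy is
 * needed on either side: the sender feeds a device buffer to MPI (item 1) and
 * the receiver posts the recv directly into device memory (item 2). */
void exchange_gpu_buffers(size_t bytes, int peer, MPI_Comm comm)
{
    void *d_send = NULL, *d_recv = NULL;
    cudaMalloc(&d_send, bytes);
    cudaMalloc(&d_recv, bytes);

    MPI_Request reqs[2];
    MPI_Irecv(d_recv, (int)bytes, MPI_BYTE, peer, 0, comm, &reqs[0]);
    MPI_Isend(d_send, (int)bytes, MPI_BYTE, peer, 0, comm, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    cudaFree(d_send);
    cudaFree(d_recv);
}
```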

Describe the challenges

GPU manager side

  • The GPU manager being 'off' at times makes deferring actions from the comm thread to the GPU manager very complex; we do not want the comm thread to act as a GPU manager itself, as that would presumably cause performance problems.
    • Change the code so that the GPU manager is always active on one thread (a sketch follows below).
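
A minimal sketch of what an always-on GPU manager thread could look like, assuming a pthread-based design; every name here (gpu_manager_t, gpu_manager_post, gpu_manager_main, ...) is a hypothetical placeholder, not an existing PaRSEC symbol:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* A deferred action posted by the comm thread (e.g., a send-activate or a
 * stage-in request) for the GPU manager to execute. */
typedef struct gpu_action_s {
    struct gpu_action_s *next;
    void (*execute)(void *arg);
    void *arg;
} gpu_action_t;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    gpu_action_t   *head;
    bool            shutdown;
} gpu_manager_t;

/* Comm-thread side: defer an action to the always-active GPU manager. */
static void gpu_manager_post(gpu_manager_t *mgr, gpu_action_t *act)
{
    pthread_mutex_lock(&mgr->lock);
    act->next = mgr->head;
    mgr->head = act;
    pthread_cond_signal(&mgr->wake);
    pthread_mutex_unlock(&mgr->lock);
}

/* GPU-manager thread: it never turns 'off', it sleeps on the condition
 * variable and drains deferred actions as soon as they are posted. */
static void *gpu_manager_main(void *closure)
{
    gpu_manager_t *mgr = (gpu_manager_t *)closure;
    pthread_mutex_lock(&mgr->lock);
    while (!mgr->shutdown) {
        while (NULL == mgr->head && !mgr->shutdown)
            pthread_cond_wait(&mgr->wake, &mgr->lock);
        gpu_action_t *todo = mgr->head;
        mgr->head = NULL;
        pthread_mutex_unlock(&mgr->lock);
        while (NULL != todo) {                 /* execute outside the lock */
            gpu_action_t *next = todo->next;
            todo->execute(todo->arg);
            todo = next;
        }
        pthread_mutex_lock(&mgr->lock);
    }
    pthread_mutex_unlock(&mgr->lock);
    return NULL;
}
```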

Send side

  • The GPU manager can call send_activate, and that will issue the send of the Ctl messages (we may want to delegate the send-activate to the MPI threads).
  • The PUT already automatically uses the data_out, so if we don't do the PUSHOUT, we believe it will behave correctly, except:
    • remove the PUSHOUT and see whether it blows up (hopefully not)
  • Removing the PUSHOUT will not prevent successor tasks from reusing the data_out locally and potentially modifying it while we are still reading it; today the PUSHOUT inserts a kernel_pop event in the stream, and since that would no longer happen we would need to replicate that behavior to prevent WAR accesses (see the sketch after this list). @abouteiller asks: it is not clear why this is needed when the PTG GPU code never had that problem before.
  • DATA_COPY_RELEASE (as seen in remote_dep_complete_and_cleanup): this is valid only for CPU data copies; we may want to investigate whether we can have a specialized destructor for GPU data copies (one that returns the copy to the LRU, etc., or maybe defers it to the GPU manager).
    • Provide specialized GPU data copies with destructors that decrement the readers count in a thread-safe way (or defer the decrement) and push the copies back to the LRUs.
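
A minimal sketch of how the send side could keep WAR protection without the PUSHOUT, assuming the transfer is tracked through the copy's readers count rather than a stream event; gpu_copy_t, readers, return_to_lru and friends are illustrative placeholders, not PaRSEC API:

```c
#include <mpi.h>
#include <stdatomic.h>

typedef struct {
    void        *device_ptr;   /* lives in GPU memory, handed directly to MPI */
    atomic_int   readers;      /* > 0 while kernels or the network read it    */
} gpu_copy_t;

/* Issued from the GPU manager (or delegated to the MPI thread): the copy gains
 * a reader for the duration of the transfer, which is what keeps successor
 * writers away, i.e. the WAR protection the kernel_pop event used to provide. */
void start_direct_send(gpu_copy_t *copy, int count, MPI_Datatype dt,
                       int dest, int tag, MPI_Comm comm, MPI_Request *req)
{
    atomic_fetch_add(&copy->readers, 1);
    MPI_Isend(copy->device_ptr, count, dt, dest, tag, comm, req);  /* CUDA-aware MPI */
}

/* Polled from the comm thread: on completion the reader is dropped in a
 * thread-safe way, and the last reader hands the copy back to the LRU
 * (or defers that to the GPU manager, as discussed above). */
void progress_direct_send(gpu_copy_t *copy, MPI_Request *req,
                          void (*return_to_lru)(gpu_copy_t *))
{
    int done = 0;
    MPI_Test(req, &done, MPI_STATUS_IGNORE);
    if (done && 1 == atomic_fetch_sub(&copy->readers, 1))
        return_to_lru(copy);
}
```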

Recv side

  • We need a GPU allocator that can be executed from the recv comm thread.
    • Problem 1: that is an async operation (GPU memory may be full, so we may not be able to allocate at this time).
    • Problem 2: we cannot partially allocate the inputs of tasks without running the risk of live-locking, with each task holding some data ready and some data not yet allocated.
    • Only when all the inputs of a task are ready do we try to allocate them and schedule the GET orders (PULL model vs. PUSH model; we may want to support both and have task decorators tell us which is best).
  • Running evaluate on the progress thread is OK, but calling get_best_device early, before the task is ready, may be problematic (this may be fixed by the item above).
  • How to notify transfer completion to the GPU manager?
    • (bad?) idea: repurpose stage_in to execute the MPI_Irecv and generate completion events on the GPU streams. The problem is that linking recv completion with stream events is hard/impossible with the current state of MPI :(
    • other idea: use a CUDA graph?
    • other other idea: split the GPU task lifecycle: { alloc, stagein, mpirecv, exec, stageout, finalize }
      • this looks like the closest to the existing code at the moment: we would have a supplementary stage for non-local stage-in that simulates an extra stream (triggered by mpi_test outcomes).
  • Problem: two GPU managers ask for the same remote copy (this is analogous to the NVLink decision).
    • idea 1: the stage_in_comm may bounce back to stage_in to trigger an NVLink copy from the completed recv.
    • idea 2: the stage_in_comm always executes before the stage_in; a stage_in that happens while the comm-in is active for any GPU on that data copy returns HOOK_AGAIN, so that we do the NVLink copy later.
    • idea 3 (easiest to code): the comm-in follows the existing PUSH model; the only difference is that we allocate on the GPU and mpi_recv into it, and we don't schedule any stage-in before all comm-ins are finished. The allocation will be per-data (not per task inputs), on a GPU (potentially arbitrary, if we can't decide better); the normal stage-in will move data with D2D as needed later (see the sketch after this list).
      • try this model
      • upgrade so that OOM on the GPU spills over into CPU memory
      • fast-path memory allocation: if zone_malloc or the read-LRU can give us data NOW, we can start ASAP, with no bouncing events for allocation completion.
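
A minimal sketch of idea 3's comm-in path, treating MPI_Test outcomes as the completion events of an extra pseudo-stream; everything here (comm_in_t, gpu_zone_malloc, notify_gpu_manager, ...) is a hypothetical placeholder, not an existing PaRSEC symbol:

```c
#include <mpi.h>
#include <stddef.h>

typedef struct {
    MPI_Request  req;
    void        *device_ptr;   /* allocated on some (possibly arbitrary) GPU   */
    void        *task;         /* the task/data copy waiting on this receive   */
} comm_in_t;

/* Called from the comm thread when a remote input is activated.  Returns 0 on
 * the fast path (zone_malloc or the read-LRU gave us memory NOW), nonzero if
 * the allocation must be deferred (GPU OOM) or spilled over to CPU memory. */
int post_comm_in(comm_in_t *ci, size_t bytes, int src, int tag, MPI_Comm comm,
                 void *(*gpu_zone_malloc)(size_t))
{
    ci->device_ptr = gpu_zone_malloc(bytes);
    if (NULL == ci->device_ptr)
        return -1;   /* defer, or fall back to a host buffer (OOM spill-over) */
    MPI_Irecv(ci->device_ptr, (int)bytes, MPI_BYTE, src, tag, comm, &ci->req);
    return 0;
}

/* Polled from the comm thread's progress loop: the MPI_Test outcome plays the
 * role of the stream event.  Once all comm-ins of a data copy are complete the
 * GPU manager is notified, and the normal stage_in can move it with D2D later. */
void progress_comm_in(comm_in_t *ci, void (*notify_gpu_manager)(void *task))
{
    int done = 0;
    MPI_Test(&ci->req, &done, MPI_STATUS_IGNORE);
    if (done)
        notify_gpu_manager(ci->task);
}
```

With this PUSH-style comm-in, no stage_in is scheduled before the receive completes, so the existing stage_in/D2D logic does not need to know whether the copy arrived over the network or was already resident.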