RFC.1

GPU and mixed MPI/OpenMP units

see also https://github.com/radical-cybertools/radical.pilot/issues/1365

Proposal

Units that require GPUs and units that mix MPI and OpenMP turn out to place similar requirements on the agent scheduler and the unit launch methods. Specifically, the scheduler needs to allocate resources on which no processes are spawned, and the launch method needs to spawn processes on only a subset of the allocated resources while still spanning nodes.

The current compute unit description has insufficient attributes to specify the respective required layout. This RFC proposes a change to the compute unit description to remedy that.

The proposal results in the following CU description (Option 1):

  {
    'cpu_processes'           : 6,
    'threads_per_cpu_process' : 3,
    'gpu_processes'           : 2,
    'mpi'                     : True,
    'open_mp'                 : True
  }

Alternative (AM) (Option 2):

  {
    'cpu_processes'           : '6:3',    # 6 MPI processes with   3 OpenMP threads each
    'gpu_processes'           : '2:128',  # 2 MPI processes with 128 CUDA   threads each
    'cpu_process_type'        : 'MPI',    # MPI    | OS (default)
    'cpu_thread_type'         : 'OpenMP', # OpenMP | OS (default)
    'gpu_process_type'        : 'OS',     # CUDA   | OS (default)
    'gpu_thread_type'         : 'OS'      # OpenCL | OS (default)
  }
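
The combined 'processes:threads' strings of Option 2 need to be split before the scheduler can use them. A minimal sketch of such a split follows; the function name parse_layout is hypothetical, and treating a value without ':' as one thread per process is an assumption, not part of the proposal.

```python
def parse_layout(value):
    """Split an Option 2 style 'processes:threads' string into two ints."""
    procs, sep, threads = str(value).partition(':')
    # Assumption: a missing ':threads' part means one thread per process.
    n_threads = int(threads) if sep else 1
    return int(procs), n_threads


cpu_procs, cpu_threads = parse_layout('6:3')
gpu_procs, gpu_threads = parse_layout('2:128')
print(cpu_procs, cpu_threads, gpu_procs, gpu_threads)
```

Option 3 avoids this parsing step by keeping processes and threads in separate keys.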

Alternative (MT) (Option 3):

  {
    'cpu_processes'           : '6',      # 6   MPI processes
    'cpu_threads'             : '3',      # 3   OpenMP threads for each process
    'gpu_processes'           : '2',      # 2   MPI processes
    'gpu_threads'             : '128',    # 128 CUDA threads for each process
    'cpu_process_type'        : 'MPI',    # MPI    | OS (default)
    'cpu_thread_type'         : 'OpenMP', # OpenMP | OS (default)
    'gpu_process_type'        : 'OS',     # CUDA   | OS (default)
    'gpu_thread_type'         : 'OS'      # OpenCL | OS (default)
  }

Discussion (Option 1?)

no Cores, GPUs

The CUD no longer specifies a number of cores or GPUs. That number is implicit (processes * threads_per_process).
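
As an illustration of the implicit count, the following sketch derives the resources from an Option 1 description. The function name implied_resources is hypothetical. Applying the same product to GPUs is a literal reading of the formula above and an assumption, as is the use of 1 for a missing threads value and 0 for a missing gpu_processes value.

```python
def implied_resources(cud):
    """Derive core and GPU counts from an Option 1 description (sketch)."""
    # Assumption: missing thread counts default to 1, missing GPU processes to 0.
    cpu_threads = cud.get('threads_per_cpu_process', 1)
    gpu_threads = cud.get('threads_per_gpu_process', 1)
    return {
        'cores': cud['cpu_processes'] * cpu_threads,
        # Assumption: GPUs follow the same processes * threads product.
        'gpus': cud.get('gpu_processes', 0) * gpu_threads,
    }


option_1 = {
    'cpu_processes'           : 6,
    'threads_per_cpu_process' : 3,
    'gpu_processes'           : 2,
    'mpi'                     : True,
    'open_mp'                 : True,
}
print(implied_resources(option_1))  # {'cores': 18, 'gpus': 2}
```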

Resource Independence

The main objective is to avoid reasoning in terms of the node, core, and CPU configuration of the target machine, and instead to describe the CU in terms of processes, threads, and their relation (multi-node, single-node, etc.). We allow for some resource-dependent parameters, specifically:

$CPN: cores per node
$GPN: GPUs per node

which can also be used in simple expressions like

{
    'cpu_processes'           : '12',
    'threads_per_cpu_process' : '$CPN / 3',
    'mpi'                     : False,
    'open_mp'                 : True
}

This CU would run 12 processes, and each process would get one third of a node to fill with threads. Independent of the node layout, this results in 4 nodes, with OMP_NUM_THREADS set to the matching value: 4 on a 12-core node, 8 on a 24-core node, and 10 on a 32-core node (where 2 cores per node are wasted).
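
The arithmetic of this example can be sketched as follows. The helpers expand and layout are hypothetical names, not part of any existing API. Rounding the expression result down is an assumption inferred from the 32-core case, and the small evaluator accepts only numbers and the operators + - * /.

```python
import ast
import math
import operator
from string import Template

# Arithmetic operators accepted in a description value.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _eval(node):
    """Evaluate a parsed expression made only of numbers and + - * /."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError('unsupported expression')


def expand(value, cpn, gpn=0):
    """Substitute $CPN and $GPN, then evaluate and round down (assumption)."""
    text = Template(str(value)).substitute(CPN=cpn, GPN=gpn)
    return math.floor(_eval(ast.parse(text.strip(), mode='eval').body))


def layout(cud, cpn):
    """Compute threads per process, node count, and idle cores per node."""
    procs = expand(cud['cpu_processes'], cpn)
    threads = expand(cud['threads_per_cpu_process'], cpn)
    per_node = cpn // threads          # processes that fit on one node
    if per_node == 0:
        raise ValueError('a process does not fit on a single node')
    return {'omp_num_threads': threads,
            'nodes': math.ceil(procs / per_node),
            'wasted_cores_per_node': cpn - per_node * threads}


cud = {'cpu_processes': '12', 'threads_per_cpu_process': '$CPN / 3'}
for cpn in (12, 24, 32):
    print(cpn, layout(cud, cpn))
```

Under these assumptions, the three node sizes give 4, 8, and 10 threads on 4 nodes each, with 2 idle cores per node in the 32-core case.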

Naming

"There are three hard problems in CS: naming things, and off-by-one-errors."

cpu_processes etc. are unwieldy, so we allow processes and threads_per_process as aliases (for CPU only).
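
A minimal sketch of how the aliases could be resolved before scheduling. The names ALIASES and resolve_aliases are hypothetical, and rejecting a description that uses both an alias and its full name is an assumption.

```python
# Short names accepted for the CPU keys only.
ALIASES = {
    'processes'           : 'cpu_processes',
    'threads_per_process' : 'threads_per_cpu_process',
}


def resolve_aliases(cud):
    """Return a copy of the description with aliases replaced by full names."""
    resolved = {}
    for key, value in cud.items():
        name = ALIASES.get(key, key)
        if name in resolved:
            # Assumption: giving both spellings of one key is an error.
            raise ValueError('duplicate setting for %s' % name)
        resolved[name] = value
    return resolved


print(resolve_aliases({'processes': 6, 'threads_per_process': 3}))
```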

Backward Compatibility

We interpret the current cores as cpu_processes, and use the following defaults:

{
  'open_mp'                 : False,
  'gpu_processes'           : 0,
  'threads_per_cpu_process' : 1,
  'threads_per_gpu_process' : 1
}
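
The mapping from an existing description could look like the sketch below. The function name upgrade is hypothetical. Letting an explicit cpu_processes take precedence over cores, and passing other keys such as mpi through unchanged, are assumptions.

```python
# Defaults listed in this section.
DEFAULTS = {
    'open_mp'                 : False,
    'gpu_processes'           : 0,
    'threads_per_cpu_process' : 1,
    'threads_per_gpu_process' : 1,
}


def upgrade(old_cud):
    """Translate a 'cores' based description into the proposed keys."""
    new_cud = {**DEFAULTS, **old_cud}
    if 'cores' in new_cud:
        cores = new_cud.pop('cores')
        # Assumption: an explicit cpu_processes wins over the legacy cores.
        new_cud.setdefault('cpu_processes', cores)
    return new_cud


print(upgrade({'cores': 16, 'mpi': True}))
```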

Implementation constraints

We will not be able to support all layout options with all launch method / executor / scheduler configurations -- but that is actually not different from what we have now.
