RFC.1
Supporting units which require GPUs or a mix of MPI and OpenMP turns out to put similar requirements on the agent scheduler and unit launch methods. Specifically, the scheduler needs to allocate resources on which no processes are spawned, and the launch method needs to spawn processes on a subset of allocated resources, but at the same time needs to span nodes.
The current compute unit description has insufficient attributes to specify the respective required layout. This RFC proposes a change to the compute unit description to remedy that.
The proposal results in the following CU description (Option 1):
{
'cpu_processes' : 6,
'threads_per_cpu_process' : 3,
'gpu_processes' : 2,
'mpi' : True,
'open_mp' : True
}
Option 2:
{
'cpu_processes' : '6:3', # 6 MPI processes with 3 OpenMP threads each
'gpu_processes' : '2:128', # 2 MPI processes with 128 CUDA threads each
'cpu_process_type' : 'MPI', # MPI | OS (default)
'cpu_thread_type' : 'OpenMP', # OpenMP | OS (default)
'gpu_process_type' : 'OS', # CUDA | OS (default)
'gpu_thread_type' : 'OS' # OpenCL | OS (default)
}
Option 3:
{
'cpu_processes' : '6', # 6 MPI processes
'cpu_threads' : '3', # 3 OpenMP threads for each process
'gpu_processes' : '2', # 2 MPI processes
'gpu_threads' : '128', # 128 CUDA threads for each process
'cpu_process_type' : 'MPI', # MPI | OS (default)
'cpu_thread_type' : 'OpenMP', # OpenMP | OS (default)
'gpu_process_type' : 'OS', # CUDA | OS (default)
'gpu_thread_type' : 'OS' # OpenCL | OS (default)
}
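The compact 'processes:threads' strings such as '6:3' carry the same information as separate process and thread keys. A minimal sketch of splitting such a value follows; the helper name `split_layout` is hypothetical, and defaulting to one thread when no ':' is present is an assumption, not something the proposal states.

```python
def split_layout(value):
    """Split a 'processes:threads' string into two integers.

    Assumption: a value without ':' means one thread per process.
    """
    procs, sep, threads = str(value).partition(':')
    n_threads = int(threads) if sep else 1
    return int(procs), n_threads


cpu_procs, cpu_threads = split_layout('6:3')    # 6 processes, 3 threads each
gpu_procs, gpu_threads = split_layout('2:128')  # 2 processes, 128 threads each
print(cpu_procs, cpu_threads, gpu_procs, gpu_threads)
```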
The CUD no longer specifies a number of cores or GPUs. That number is implicit (processes * threads_per_process).
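As a small illustration of the implicit core count, the sketch below multiplies processes by threads per process for a description that uses the Option 1 key names. The function name `implied_cores` is hypothetical, and the sketch covers CPU cores only, since the text does not spell out how GPU threads map to GPUs.

```python
def implied_cores(cud):
    """Return the number of CPU cores implied by a CU description.

    Uses the Option 1 keys; a missing thread count is assumed to be 1.
    """
    procs = int(cud.get('cpu_processes', 0))
    threads = int(cud.get('threads_per_cpu_process', 1))
    return procs * threads


cud = {
    'cpu_processes': 6,
    'threads_per_cpu_process': 3,
    'gpu_processes': 2,
    'mpi': True,
    'open_mp': True,
}
print(implied_cores(cud))  # 6 processes * 3 threads = 18 cores
```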
The main objective is to avoid reasoning in terms of the node, core, and CPU configuration of the target machine, and instead to describe the CU in terms of processes, threads, and their relation (multi-node, single-node, etc.). We allow for some resource-dependent parameters, specifically:
- $CPN: cores per node
- $GPN: GPUs per node
which can also be used in simple expressions like:
{
'cpu_processes' : '12',
'threads_per_cpu_process' : '$CPN / 3',
'mpi' : False,
'open_mp' : True
}
This CU would run 12 processes, and each process would get one third of a node to fill with threads. Independent of the node layout, this results in 4 nodes, with OMP_NUM_THREADS set to the correct value: 4 on a 12-core node, 8 on a 24-core node, and 10 on a 32-core node (where 2 cores per node are wasted).
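The arithmetic above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `resolve` and `layout` are hypothetical, '/' is treated as integer division (which matches the 32-core case), and only the four basic arithmetic operators are accepted.

```python
import ast
import operator
from string import Template

# Assumption: '/' means integer division, so '$CPN / 3' yields 10 for CPN=32.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.floordiv}


def _eval(node):
    """Evaluate a tiny arithmetic AST of integers and + - * / only."""
    if isinstance(node, ast.Constant) and isinstance(node.value, int):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError('unsupported expression')


def resolve(expr, cpn, gpn=0):
    """Substitute $CPN / $GPN into an expression and evaluate it."""
    text = Template(str(expr)).substitute(CPN=cpn, GPN=gpn)
    return _eval(ast.parse(text, mode='eval').body)


def layout(cud, cpn):
    """Derive node count, OMP_NUM_THREADS and wasted cores per node."""
    procs = resolve(cud['cpu_processes'], cpn)
    threads = resolve(cud['threads_per_cpu_process'], cpn)
    procs_per_node = cpn // threads
    nodes = -(-procs // procs_per_node)  # ceiling division
    wasted = cpn - procs_per_node * threads
    return {'nodes': nodes, 'OMP_NUM_THREADS': threads,
            'wasted_cores_per_node': wasted}


cud = {'cpu_processes': '12', 'threads_per_cpu_process': '$CPN / 3'}
for cpn in (12, 24, 32):
    print(cpn, layout(cud, cpn))
```

Restricting evaluation to a whitelisted AST avoids calling `eval` on user-supplied strings.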
"There are three hard problems in CS: naming things, and off-by-one-errors."
cpu_processes etc. are unwieldy, so we allow processes and threads_per_process as aliases (for cpu only).
We interpret the current cores as cpu_processes, and use the following defaults:
{
'open_mp' : False,
'gpu_processes' : 0,
'threads_per_cpu_process' : 1,
'threads_per_gpu_process' : 1
}
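A minimal sketch of how the aliases, the interpretation of cores, and the defaults might be applied to an incoming description. The function name `normalize` is hypothetical, and rejecting a description that sets both an alias and its canonical key is an assumption, not part of the proposal.

```python
DEFAULTS = {
    'open_mp': False,
    'gpu_processes': 0,
    'threads_per_cpu_process': 1,
    'threads_per_gpu_process': 1,
}

# Legacy and short names mapped to the canonical CPU keys.
ALIASES = {
    'cores': 'cpu_processes',
    'processes': 'cpu_processes',
    'threads_per_process': 'threads_per_cpu_process',
}


def normalize(cud):
    """Map aliases to canonical keys, then fill in defaults."""
    given = {}
    for key, value in cud.items():
        canonical = ALIASES.get(key, key)
        if canonical in given:
            # Assumption: conflicting spellings of one key are an error.
            raise ValueError('duplicate setting for %s' % canonical)
        given[canonical] = value
    result = dict(DEFAULTS)
    result.update(given)
    return result


print(normalize({'cores': 16}))
```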
We will not be able to support all layout options with all launch method / executor / scheduler configurations -- but that is actually not different from what we have now.