RFC.7
Supporting tasks which require a mix of MPI and OpenMP, and a mix of CPU cores and GPUs, is currently difficult or impossible to express in RP. There are two reasons for that:
- the naming of `gpu_processes` is outdated, as that field is in fact interpreted as `gpus_per_process`;
- RP assumes homogeneous tasks, where all processes (ranks) acquire the same resources.
This RFC proposes a change to the compute unit description to remedy that.
The proposal results in the following CU description:
```python
[
    {
        'processes'         : 6,
        'cores_per_process' : 2,
        'gpus_per_process'  : 1,
        'process_type'      : rp.MPI,
        'thread_type'       : rp.OpenMP,
        'gpu_type'          : rp.CUDA
    },
    {
        'processes'    : 1,
        'nodes'        : 0.5,
        'process_type' : rp.MPI
    }
]
```
The proposal supports tasks in which individual processes have different resource requirements. In the given example, one process will own half a node (regardless of how many cores or GPUs a node provides), and the other six processes will each own 2 cores and one GPU.
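To make the resource arithmetic concrete, here is a minimal sketch of how the aggregate footprint of such a description could be computed. This is not RP code: the node geometry is an assumption, and whether a fractional node share includes the node's GPUs is left open by the proposal (the sketch assumes it does).

```python
# Hypothetical sketch: aggregate footprint of a heterogeneous CU
# description.  Node geometry is an assumption for illustration only.
CORES_PER_NODE = 32
GPUS_PER_NODE  = 4

def footprint(cud):
    cores = 0.0
    gpus  = 0.0
    for group in cud:
        n = group.get('processes', 1)
        if 'nodes' in group:
            # a (fractional) node share translates into cores and GPUs
            cores += n * group['nodes'] * CORES_PER_NODE
            gpus  += n * group['nodes'] * GPUS_PER_NODE
        else:
            cores += n * group.get('cores_per_process', 1)
            gpus  += n * group.get('gpus_per_process',  0)
    return cores, gpus

# for the example above: 6*2 + 0.5*32 = 28 cores, 6*1 + 0.5*4 = 8 GPUs
print(footprint([{'processes': 6, 'cores_per_process': 2,
                  'gpus_per_process': 1},
                 {'processes': 1, 'nodes': 0.5}]))
```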
The CUD does not specify the number of threads the task intends to create, but rather how many cores it requires to run. The task can then use the assigned cores to spawn threads or processes.
By supporting the notion of `nodes`, the application granularity can in many cases adapt to different resources without changing the task description.
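For example, a description along the following lines (a sketch; the exact form is an assumption based on the fields introduced above) could request 12 MPI processes, each owning a third of a node:

```python
[
    {
        'processes'    : 12,
        'nodes'        : 1/3,
        'process_type' : rp.MPI,
        'thread_type'  : rp.OpenMP
    }
]
```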
This CU would run 12 processes, and each process would get one third of a node to fill with threads. Independent of the node layout, this results in an allocation of 4 nodes, with `OMP_NUM_THREADS` set to the correct value (4 on a 12-core node, 8 on a 24-core node, 10 on a 32-core node, where 2 cores per node are wasted).
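The thread count derivation is plain arithmetic; an illustrative sketch (RP would do this internally, this is not its actual code):

```python
# Illustrative only: threads for a process owning one third of a node,
# rounding down; surplus cores stay idle.
def omp_num_threads(cores_per_node, node_share=1/3):
    return int(cores_per_node * node_share)

for cpn in (12, 24, 32):
    print(cpn, '->', omp_num_threads(cpn))   # 4, 8, 10
```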
"There are two hard problems in CS: naming things, cache invalidation, and off-by-one-errors."
The changes are a major shift in the API. It might be a good opportunity to pair that change with a shift to `TaskDescription` instead of `ComputeUnitDescription`. That would also make it easy to provide backward compatibility: the original CUD can be kept and internally translated into the new `TaskDescription`.
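A minimal sketch of such a translation shim follows. Note that the old-style field names other than `gpu_processes` are assumptions made here for illustration, not taken from this RFC:

```python
# Sketch of a possible backward compatibility shim: translate a legacy
# CUD into the new, list-based TaskDescription.
def cud_to_task_description(cud):
    return [{
        'processes'         : cud.get('cpu_processes', 1),  # assumed name
        'cores_per_process' : cud.get('cpu_threads',   1),  # assumed name
        'gpus_per_process'  : cud.get('gpu_processes', 0),  # renamed field
        'process_type'      : cud.get('cpu_process_type'),  # assumed name
        'thread_type'       : cud.get('cpu_thread_type'),   # assumed name
    }]
```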
We will not be able to support all layout options with all launch method / executor / scheduler configurations -- but that is no different from what we have now.
The scheduler has already shifted to allocating individual slots per process, and is thus well prepared to allocate heterogeneous tasks. We may need to further relax the notion of core contiguity though, as heterogeneous tasks are less likely to map to contiguous sets of resources.
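As a toy illustration of what relaxed contiguity could look like (a sketch, not RP's actual scheduler code):

```python
# Toy illustration: allocate per-process slots from a pool of free cores
# without requiring the cores of a process to be contiguous.
def allocate(free_cores, n_procs, cores_per_proc):
    if len(free_cores) < n_procs * cores_per_proc:
        return None                               # insufficient resources
    return [[free_cores.pop() for _ in range(cores_per_proc)]
            for _ in range(n_procs)]

# e.g. cores 0..15 with a few already taken elsewhere
free = [c for c in range(16) if c not in (3, 7, 11)]
print(allocate(free, 4, 3))
```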