Andre Merzky edited this page Apr 22, 2020 · 3 revisions

Heterogeneous Tasks

Proposal

Tasks which require a mix of MPI and OpenMP, or a mix of CPU cores and GPUs, are currently difficult or impossible to express in RP. That has two reasons:

  • the name gpu_processes is outdated, as that field is actually interpreted as gpus_per_process
  • RP assumes homogeneous tasks where all processes or ranks acquire the same resources

This RFC proposes a change to the compute unit description to remedy that.

The proposal results in the following CU description:

  [
     {
       'processes'           : 6,
       'cores_per_process'   : 2,
       'gpus_per_process'    : 1,
       'process_type'        : rp.MPI,
       'thread_type'         : rp.OpenMP,
       'gpu_type'            : rp.CUDA
     },
     {
       'processes'           : 1,
       'nodes'               : 0.5,
       'process_type'        : rp.MPI
     }
  ]

Discussion

Heterogeneity

The proposal supports tasks in which individual processes have different resource requirements. In the given example, one process will own half a node (regardless of how many cores or GPUs that node has), and the other six processes will each own 2 cores and one GPU.
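To make the footprint of such a description concrete, here is a minimal sketch that sums up the resources the example above would claim. The node geometry (32 cores, 4 GPUs per node) and the `footprint` helper are assumptions for illustration, not part of RP's API:

```python
# Hypothetical helper: aggregate the resource footprint of a list-of-dicts
# task description.  Node geometry is an assumption (32 cores, 4 GPUs).
NODE_CORES = 32
NODE_GPUS  = 4

description = [
    {'processes': 6, 'cores_per_process': 2, 'gpus_per_process': 1},
    {'processes': 1, 'nodes': 0.5},
]

def footprint(entries, node_cores, node_gpus):
    cores = gpus = 0.0
    for e in entries:
        procs = e.get('processes', 1)
        if 'nodes' in e:
            # node-relative request: scale by the node geometry
            cores += procs * e['nodes'] * node_cores
            gpus  += procs * e['nodes'] * node_gpus
        else:
            # absolute per-process request
            cores += procs * e.get('cores_per_process', 1)
            gpus  += procs * e.get('gpus_per_process', 0)
    return cores, gpus

print(footprint(description, NODE_CORES, NODE_GPUS))  # (28.0, 8.0)
```

On the assumed 32-core / 4-GPU node, the six MPI processes claim 12 cores and 6 GPUs, and the half-node process adds 16 cores and 2 GPUs.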

No Thread Numbers

The CUD does not specify the number of threads the task intends to create, but how many cores it requires to run. The task can then use the assigned cores to spawn threads or processes.
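In practice, a task would size itself at runtime to the cores it was given. A minimal sketch, assuming the launcher communicates the core count via OMP_NUM_THREADS (a common convention, not mandated by the source):

```python
import os
from concurrent.futures import ThreadPoolExecutor

# The task declares no thread count up front; it derives one from the
# environment set by the launcher, defaulting to a single worker.
n_workers = int(os.environ.get('OMP_NUM_THREADS', '1'))

# Use the derived worker count to fan work out over the assigned cores.
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    results = list(pool.map(lambda x: x * x, range(8)))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The same binary then runs unchanged whether it is granted 2 cores or 20.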

Resource Independence

By supporting the notion of nodes, the application granularity can in many cases adapt to different resources without changing the task description.

For example, a CU that requests 12 processes with nodes set to 1/3 would give each process one third of a node to fill with threads. Independent of the node layout, this results in 4 nodes, with OMP_NUM_THREADS set to the correct value (4 on a 12-core node, 8 on a 24-core node, 10 on a 32-core node, where 2 cores per node are wasted).
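The arithmetic behind that example can be spelled out directly:

```python
# 12 processes, each owning one third of a node.
processes  = 12
node_share = 1 / 3

# Node count is independent of the node size: 12 * 1/3 = 4 nodes.
nodes_needed = processes * node_share

def omp_threads(cores_per_node):
    """Threads per process: the whole cores that fit into the node share."""
    return int(cores_per_node * node_share)

print(nodes_needed)      # 4.0
print(omp_threads(12))   # 4
print(omp_threads(24))   # 8
print(omp_threads(32))   # 10  (32 - 3 * 10 = 2 cores wasted per node)
```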

Naming, Backward Compatibility

"There are two hard problems in CS: naming things, cache invalidation, and off-by-one-errors."

The changes are a major shift in the API. It might be a good opportunity to pair that change with a shift to TaskDescription instead of ComputeUnitDescription. That would also make it easy to provide backward compatibility: the original CUD can be kept and internally be translated into the new TaskDescription.
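Such a translation shim could be quite small. A hedged sketch, where the function name and the exact old field names (`cpu_processes`, `cpu_threads`, `gpu_processes`) are assumptions about the legacy CUD layout rather than a confirmed mapping:

```python
# Hypothetical backward-compatibility shim: translate a flat, old-style
# CUD dict into the proposed list-of-dicts TaskDescription.
def cud_to_task_description(cud):
    return [{
        'processes'        : cud.get('cpu_processes', 1),
        'cores_per_process': cud.get('cpu_threads', 1),
        # the old gpu_processes field was already interpreted as
        # gpus-per-process, so it maps onto the new name directly
        'gpus_per_process' : cud.get('gpu_processes', 0),
        'process_type'     : cud.get('cpu_process_type'),
        'thread_type'      : cud.get('cpu_thread_type'),
    }]

old = {'cpu_processes': 6, 'cpu_threads': 2, 'gpu_processes': 1}
print(cud_to_task_description(old))
```

A homogeneous legacy CUD always maps to a single-entry list; only the new API can express more than one entry.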

Implementation Constraints

We will not be able to support all layout options with all launch method / executor / scheduler configurations -- but that is actually not different from what we have now.

The scheduler has already shifted to allocating individual slots per process, and is thus well prepared to allocate heterogeneous tasks. We may need to further relax the notion of core contiguity though, as heterogeneous tasks are less likely to map to contiguous sets of resources.
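The relaxed allocation can be illustrated with a toy allocator (this is a sketch, not RP's scheduler): it picks any free cores rather than insisting on a contiguous block.

```python
# Toy slot allocator without a contiguity requirement: any n free cores
# satisfy the request, even if earlier tasks left holes in the core map.
def allocate(free_cores, n):
    picked = sorted(free_cores)[:n]
    if len(picked) < n:
        return None                 # not enough cores left
    for c in picked:
        free_cores.discard(c)       # mark the chosen cores as busy
    return picked

free = set(range(8)) - {2, 5}       # cores 2 and 5 already busy
print(allocate(free, 4))            # [0, 1, 3, 4] -- a non-contiguous set
```

A contiguity-requiring allocator would have to skip past the hole at core 2 and could fail here even though enough cores are free in total.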
