Andre Merzky edited this page Jul 30, 2017 · 1 revision

Combining Launch Methods

Problem

The agent configuration contains lines like this (here for xsede.stampede_ssh):

    "agent_spawner"               : "POPEN",
    "agent_launch_method"         : "SSH",
    "task_launch_method"          : "SSH",
    "mpi_launch_method"           : "MPIRUN_RSH",

which means:

  • MPI units are launched via mpirun_rsh
  • all other units are launched via ssh
  • sub-agents are started via ssh, too
  • the mpirun_rsh and ssh processes are spawned via Python's Popen
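
The fixed mapping described above can be sketched roughly as follows. The configuration keys are taken from the snippet above; the function name and the `is_mpi` flag are hypothetical and only illustrate that each unit ends up with exactly one launch method.

```python
# Hypothetical sketch of the current, fixed selection logic.
# The keys mirror the agent configuration shown above; the helper
# function and the `is_mpi` flag are assumptions for illustration.
AGENT_CONFIG = {
    "agent_spawner":       "POPEN",
    "agent_launch_method": "SSH",
    "task_launch_method":  "SSH",
    "mpi_launch_method":   "MPIRUN_RSH",
}


def select_launch_method(config, is_mpi):
    """Return the single launch method name used for a unit."""
    if is_mpi:
        return config["mpi_launch_method"]
    return config["task_launch_method"]


print(select_launch_method(AGENT_CONFIG, is_mpi=True))   # MPIRUN_RSH
print(select_launch_method(AGENT_CONFIG, is_mpi=False))  # SSH
```

With this structure a unit is either "MPI" or "not MPI"; there is no place to express a second, orthogonal property such as threading.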

Now, with RFC.1 implemented, we no longer have a clear separation between MPI and non-MPI units. Specifically, we promise to set OMP_NUM_THREADS for the applications, to inform them how many cores each process may use. The current fixed structure would require us either to always set that environment variable centrally, or to set it separately in each launch method, i.e., to spread that logic over several source files.
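
As a minimal sketch of what "setting OMP_NUM_THREADS" amounts to, assuming the thread count of a unit is already known: the helper name `build_unit_env` is hypothetical and not part of the code base.

```python
import os


def build_unit_env(cpu_threads, base_env=None):
    """Return a copy of the environment with OMP_NUM_THREADS set.

    `cpu_threads` is the number of threads (cores) each process may use.
    The caller's environment is copied, never modified in place.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(cpu_threads)
    return env


env = build_unit_env(3, base_env={"PATH": "/usr/bin"})
print(env["OMP_NUM_THREADS"])  # 3
```

The resulting dictionary can be passed as the `env` argument of `subprocess.Popen`. The open question in this section is where this one step should live, not how to write it.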

Proposal

Change the configuration to:

   "launch_methods" : ["ssh", "open_mp", "mpirun_rsh"]

and decide dynamically, for each unit, which launch methods (LMs) should be invoked:

  cud.cpu_processes    = 2
  cud.cpu_threads      = 3
  cud.cpu_process_type = rp.MPI
  cud.cpu_thread_type  = rp.OpenMP

Each LM would register for one or more thread and process types to handle, and the agent would invoke all LMs which apply to a given unit (here MPI and OpenMP). The OpenMP LM would really only set the environment variable, and could thus easily be combined with the MPI LM, but could just as easily be used with a non-MPI LM like ssh.
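
One possible shape for this registration scheme is sketched below. All class and function names, the `POSIX` type for non-MPI processes, and the placeholder command prefixes are assumptions for illustration. In particular, the real ssh and mpirun_rsh command lines need further arguments (hosts, process counts), which are omitted here.

```python
from dataclasses import dataclass

# Placeholder type constants; POSIX as the non-MPI default is an assumption.
MPI, OPENMP, POSIX = "MPI", "OpenMP", "POSIX"


@dataclass
class Unit:
    cpu_processes: int = 1
    cpu_threads: int = 1
    cpu_process_type: str = POSIX
    cpu_thread_type: str = POSIX


class LaunchMethod:
    name = "base"
    process_types = frozenset()   # process types this LM registers for
    thread_types = frozenset()    # thread types this LM registers for

    def applies_to(self, unit):
        return (unit.cpu_process_type in self.process_types
                or unit.cpu_thread_type in self.thread_types)

    def apply(self, unit, command, env):
        return command, env


class Ssh(LaunchMethod):
    name = "ssh"
    process_types = frozenset([POSIX])

    def apply(self, unit, command, env):
        # Placeholder prefix only: target host and options are omitted.
        return [self.name] + command, env


class MpirunRsh(LaunchMethod):
    name = "mpirun_rsh"
    process_types = frozenset([MPI])

    def apply(self, unit, command, env):
        # Placeholder prefix only: process count and host list are omitted.
        return [self.name] + command, env


class OpenMPEnv(LaunchMethod):
    name = "open_mp"
    thread_types = frozenset([OPENMP])

    def apply(self, unit, command, env):
        # This LM only touches the environment, never the command line.
        env = dict(env)
        env["OMP_NUM_THREADS"] = str(unit.cpu_threads)
        return command, env


REGISTRY = {lm.name: lm for lm in (Ssh(), OpenMPEnv(), MpirunRsh())}


def prepare(unit, command, configured=("ssh", "open_mp", "mpirun_rsh")):
    """Apply every configured LM that registered for the unit's types."""
    env, used = {}, []
    for name in configured:
        lm = REGISTRY[name]
        if lm.applies_to(unit):
            command, env = lm.apply(unit, command, env)
            used.append(name)
    return used, command, env


unit = Unit(cpu_processes=2, cpu_threads=3,
            cpu_process_type=MPI, cpu_thread_type=OPENMP)
used, command, env = prepare(unit, ["./a.out"])
print(used)     # ['open_mp', 'mpirun_rsh']
print(command)  # ['mpirun_rsh', './a.out']
print(env)      # {'OMP_NUM_THREADS': '3'}
```

In this sketch the environment-only LM composes with either process-level LM because it never modifies the command, so the order in which the matching LMs run does not matter.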

Impact

This is technically not difficult to implement, as the core of the launch methods would remain essentially unchanged. Introducing the abstraction is somewhat tedious: each LM needs to register what it is able to handle, and the executor needs to select the matching LMs based on the unit description. This is not really difficult though, and I do not expect it to significantly affect performance.

As always when introducing a new abstraction, anybody looking at the code needs to understand what the abstraction is about and why it was introduced, so code complexity increases in this respect. On the other hand, code clarity ultimately improves, for the reasons discussed in the problem statement.
