How to handle clusters that are node-exclusive rather than core-exclusive? #3143

SteVwonder · 2020-08-14T22:39:45Z

SteVwonder
Aug 14, 2020
Maintainer

In LC, we typically have two kinds of systems: node-exclusive and core-exclusive. The difference can be summarized with the answer to "what happens when a user asks for a single task and a single core?". In a node-exclusive cluster, the user gets the whole node; in a core-exclusive cluster, the user gets a single core on a single node that they share with other users.

Right now, with the current combination of jobspec V1, flux mini * commands, job-ingest validator, Fluxion scheduler, and job-shell, there is no way to enforce node-level/node-exclusive scheduling. Users can "opt-in" to node-level scheduling with a carefully crafted jobspec, but there is no way currently to guarantee node-exclusive allocations "out-of-the-box".

One solution that was discussed during coffee hour was to add a new configuration key/value that would turn on node-exclusive allocations. When that configuration is set:

flux mini would always include a node resource as well as adding exclusive: true under the node resource in the jobspec
job-ingest validator would ensure that the node resource is present and set to be exclusively allocation

This solution requires no modifications to the Fluxion scheduler. One open question would be what the default count value for the node resource should be when -N is not provided explicitly.

Since there are potentially many different ways to handle this problem, we decided to start with a discussion until we coalesce around a solution, then we can either open an issue or convert this discussion to one (if that's allowed).

Answered by grondo

Mar 13, 2022

We have a solution for Fluxion provided by flux-framework/flux-sched#900. The key parameter is to configure the [sched-fluxion-resource] table with

[sched-fluxion-resource]
match-policy = "lonodex"

See flux-config-sched-fluxion-resource(5) for more details.

The one remaining question was posed by @ryanday36 above:

The folks who are working on the CTS-2 procurement are wondering if we'll be able to have node scheduled and core scheduled queues / partitions on the same cluster. Is there anything about the approach that was discussed here that would get in the way of that?

Since the match-policy for the resource module is configured as a whole, I don't think there is a way to do this with …

View full answer

grondo · 2020-08-14T23:18:30Z

grondo
Aug 14, 2020
Maintainer

On the call we also discussed that it would be nice if jobspec could be modified after user submission. Since the jobspec is signed by the user this idea was discarded out of hand.

However, it occurs to me that there is nothing in the security architecture of Flux that requires the scheduler to use the unmodified resources section of user-submitted jobspec when finding a matching resource set. The scheduler could therefore internally modify the request as above when configuration dictates, and use the modified jobspec in its matching policy.

A problem with this approach is that the job shell currently simply takes R and the tasks section of jobspec to determine the number of tasks to launch. On a node-exclusive cluster, a jobspec generated with flux mini run -n1 -c1 which then is instead assigned a node with multiple cores, would then run multiple tasks.

In general though, the job shell should be smarter than this. A scheduler only has to match at a minimum the requested resources, and assuming that the user wanted to run exactly as many tasks as there are slots in the final resource set assignment will be error prone.

1 reply

SteVwonder Aug 15, 2020
Maintainer Author

However, it occurs to me that there is nothing in the security architecture of Flux that requires the scheduler to use the unmodified resources section of user-submitted jobspec when finding a matching resource set. The scheduler could therefore internally modify the request as above when configuration dictates, and use the modified jobspec in its matching policy.

I like that this would make these issues transparent to the user, but I'm also worried that it could be surprising. In particular, I think it would be confusing if users are feeding in their "blessed" campaign/workflow jobspec and getting different resource allocations depending on the system. I guess ultimately there is a trade-off between minimal user intervention and consistent semantics.

grondo · 2020-08-15T14:41:31Z

grondo
Aug 15, 2020
Maintainer

Great point! However, any scheduler may allocate resources at a minimum, so users should expect different behavior depending on the loaded scheduler and it's configuration. The nice thing about Flux is that you can at least have a hope of consistent behavior by launching your own instance and using a scheduler of your choice with known configuration. Most workflows would be submitted to a single user instance anyway, not the system instance right? I'm not necessarily saying one approach is better than the other, but just something to think about. I don't like the idea of giving an enclosing instance an ability to modify jobspec right before it is signed by the user. That gives me pause.

…

On Fri, Aug 14, 2020, 8:41 PM Stephen Herbein ***@***.***> wrote: However, it occurs to me that there is nothing in the security architecture of Flux that requires the scheduler to use the unmodified resources section of user-submitted jobspec when finding a matching resource set. The scheduler could therefore internally modify the request as above when configuration dictates, and use the modified jobspec in its matching policy. I like that this would make these issues transparent to the user, but I'm also worried that it could be surprising. In particular, I think it would be confusing if users are feeding in their "blessed" campaign/workflow jobspec and getting different resource allocations depending on the system. I guess ultimately there is a trade-off between minimal user intervention and consistent semantics. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#3143 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAFVEUX5T7ZWHN2HKZ4P72LSAX7XDANCNFSM4P75MRSA> .

2 replies

ryanday36 Feb 3, 2021

I'm playing around on fluke right now, and thinking about this issue. I like the discussion here, especially @grondo's comment about the user's job spec needing to be seen as a minimum resource set. It seems like it should be possible for the scheduler in the multi-user environment to see one set of resources as being reserved by the job while the job shell sees the resource set that the user wants to actually run on.

In the original discussion, @SteVwonder mentions that:

One open question would be what the default count value for the node resource should be when -N is not provided explicitly.

I would argue that the node count should be the minimum required to satisfy the rest of the user's request. This might be another discussion, but I think that should always be the default, especially in a multi-user environment. i.e. right now, if I request an allocation with 4 tasks, I get one task on each of 4 nodes:

[day36@fluke108:~]$ flux mini alloc -n4
[day36@fluke16:~]$ flux resource list -v
     STATE NNODES   NCORES    NGPUS LIST
      free      4        4        0 rank[0-3]/core0
 allocated      0        0        0 
      down      0        0        0 
[day36@fluke16:~]$ exit

I think that the behavior, especially in a multi-user environment, should be that flux mini alloc -n4 on a system with at least 4 cores per node would act like flux mini alloc -N1 -n4.

SteVwonder Feb 4, 2021
Maintainer Author

I would argue that the node count should be the minimum required to satisfy the rest of the user's request.

Yeah, I agree. If we assume the cluster is homogenous and that we are using jobspec V1, then calculating that at submit time is pretty straightforward. We could just insert a node resource into the jobspec with count == max (num cores / cores per node, num gpus / gpus per node).

I'm not sure what to do in the case of a heterogeneous cluster. If we had a scheduler and jobspec version that support unbounded counts, we could do something like:

version: 999
resources:
  - type: node
    count: {min: 1}
    label: default
    exclusive: true
    with:
      - type: slot
        count: 1
        with:
            - type: core
               count: 4
tasks:
  - command: [ "hostname" ]
    slot: default
    count:
      per_slot: 1
attributes: {}

Maybe we should just not worry about doing node-exclusive scheduling on heterogeneous clusters for now?

ryanday36 · 2021-09-30T20:27:12Z

ryanday36
Sep 30, 2021

The folks who are working on the CTS-2 procurement are wondering if we'll be able to have node scheduled and core scheduled queues / partitions on the same cluster. Is there anything about the approach that was discussed here that would get in the way of that?

1 reply

dongahn Oct 25, 2021
Maintainer

@grondo and I are discussing the use of a Jobspec min/max approach to enable node-exclusive scheduling on a system instance. In a nutshell, we want a new batch interface that generates a Jobspec with min/max to schedule whole nodes. We still need to way to "enforce" node exclusivity and doing this with a job validator plugin is an approach we will be testing. I believe if we add Fluxion queue support into this plugin, we can easily have queues with different exclusivity level.

grondo · 2022-03-13T16:13:58Z

grondo
Mar 13, 2022
Maintainer

We have a solution for Fluxion provided by flux-framework/flux-sched#900. The key parameter is to configure the [sched-fluxion-resource] table with

[sched-fluxion-resource]
match-policy = "lonodex"

See flux-config-sched-fluxion-resource(5) for more details.

The one remaining question was posed by @ryanday36 above:

The folks who are working on the CTS-2 procurement are wondering if we'll be able to have node scheduled and core scheduled queues / partitions on the same cluster. Is there anything about the approach that was discussed here that would get in the way of that?

Since the match-policy for the resource module is configured as a whole, I don't think there is a way to do this with the current solution. Fluxion would need to be extended to allow the qmanager or resource to select the match-policy based on queue or some other parameter. Perhaps properties could be used for this purpose (#4143).

cc: @dongahn

3 replies

dongahn Mar 13, 2022
Maintainer

Indeed, adding multi-queue support is our next short-term roadmap item.

Three things to note:

We already have multi-queue (MTQ) support within qmanager so that we can apply different queueing policies for different queues (one with EASY backfilling and another with HTC optimized FCFS and the other with convervative backfilling etc.). But we didn't expose that capability into the man page yet because the specification may change a bit to make it more user friendly.
MTQ needs to be added to fluxion-resource. As @grondo mentioned, we are using use case: heterogeneous clusters #4143's properties support prototype is being used to build up our internal tests.
@cmoussa1 and I are discussing ways to validate queue information (e.g., names) as seen by fluxion and flux-accounting.

grondo Mar 13, 2022
Maintainer

Thanks @dongahn. Shall we consider this question answered by this comment, and open a separate issue on a way to schedule some resources as node exclusive and others not?

dongahn Mar 13, 2022
Maintainer

Sounds good to me.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to handle clusters that are node-exclusive rather than core-exclusive? #3143

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 4 comments 7 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

How to handle clusters that are node-exclusive rather than core-exclusive? #3143

SteVwonder Aug 14, 2020 Maintainer

Replies: 4 comments · 7 replies

grondo Aug 14, 2020 Maintainer

SteVwonder Aug 15, 2020 Maintainer Author

grondo Aug 15, 2020 Maintainer

ryanday36 Feb 3, 2021

SteVwonder Feb 4, 2021 Maintainer Author

ryanday36 Sep 30, 2021

dongahn Oct 25, 2021 Maintainer

grondo Mar 13, 2022 Maintainer

dongahn Mar 13, 2022 Maintainer

grondo Mar 13, 2022 Maintainer

dongahn Mar 13, 2022 Maintainer

SteVwonder
Aug 14, 2020
Maintainer

Replies: 4 comments 7 replies

grondo
Aug 14, 2020
Maintainer

SteVwonder Aug 15, 2020
Maintainer Author

grondo
Aug 15, 2020
Maintainer

SteVwonder Feb 4, 2021
Maintainer Author

ryanday36
Sep 30, 2021

dongahn Oct 25, 2021
Maintainer

grondo
Mar 13, 2022
Maintainer

dongahn Mar 13, 2022
Maintainer

grondo Mar 13, 2022
Maintainer

dongahn Mar 13, 2022
Maintainer