Running on Summit in a killable queue #2364

igory1999 · 2020-12-02T19:54:49Z

Requested resources:
time: 12h
CPU cores: 572
GPUs: 131
queue: 'killable'

Problem:
After about 2h the job was partially killed, presumably preempted by the Summit scheduler.
Some processes kept running, probably for another hour or so, before Radical decided that the job was over.
After that, the scheduler never restarted the job (as it should in the killable queue), and the job disappeared from the queue according to bjobs, which probably means that the scheduler considers it completed.

The logs are on Summit in /gpfs/alpine/csc299/world-shared/iyakushin/ticket.tar.gz

mturilli · 2020-12-03T13:42:08Z

Maintainer

@andre-merzky this is something to discuss at the RP level. We need to study how the killable queue works based on Igor's feedback, and see whether there is a (very) short-term solution and a longer-term solution to support that kind of queue behavior.

@lee212 can you confirm that you observed the same behavior as Igor with this queue on Summit?

andre-merzky · 2020-12-04T10:45:23Z

Maintainer

It is unlikely that the pilot will survive a kill and restart. Even if the restart actually works, the pilot will not pick up any tasks which were interrupted, or in fact any tasks which had arrived in the pilot before termination...

andre-merzky · 2022-04-14T08:28:42Z

Maintainer

This has been discussed again in a different context. A possible approach would be the following:

- start a pilot (potentially on a single node) in a non-killable queue (master pilot)
- start additional pilots in killable queues
- have the killable ones register their resources on the master pilot
- master pilot distributes tasks to the killable pilots

Implementing this scheme would require three major elements:

- scheduler needs to be able to register / unregister resources from additional pilots
- scheduler must not schedule tasks across pilot boundaries
- tasks running while a killable pilot dies need to be restarted

This is likely easier to implement once we have partitions - at that point, any additional pilot would behave like an external partition.
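
To make the three elements above concrete, here is a minimal in-memory sketch in plain Python. It is not RADICAL-Pilot code and does not use any RP API: `Task`, `PilotSlot`, `MasterScheduler` and all of their methods are hypothetical names invented for illustration, and resources are reduced to a single core count per pilot.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    uid: str
    cores: int


@dataclass
class PilotSlot:
    """Resources registered by one killable pilot (hypothetical model)."""
    uid: str
    cores: int
    running: dict = field(default_factory=dict)  # task uid -> Task

    @property
    def free_cores(self):
        return self.cores - sum(t.cores for t in self.running.values())


class MasterScheduler:
    """Toy stand-in for the scheduler of the non-killable master pilot."""

    def __init__(self):
        self.pilots = {}        # pilot uid -> PilotSlot
        self.waiting = deque()  # tasks not yet placed

    def register(self, uid, cores):
        # Element 1: a killable pilot announces its resources.
        self.pilots[uid] = PilotSlot(uid, cores)
        self.schedule()

    def unregister(self, uid):
        # Element 3: tasks that ran on a dead pilot go back to the
        # front of the queue so that they are restarted first.
        slot = self.pilots.pop(uid)
        self.waiting.extendleft(reversed(list(slot.running.values())))
        self.schedule()

    def submit(self, task):
        self.waiting.append(task)
        self.schedule()

    def schedule(self):
        still_waiting = deque()
        while self.waiting:
            task = self.waiting.popleft()
            # Element 2: a task must fit entirely inside one pilot;
            # free cores of different pilots are never combined.
            slot = next((p for p in self.pilots.values()
                         if p.free_cores >= task.cores), None)
            if slot is None:
                still_waiting.append(task)
            else:
                slot.running[task.uid] = task
        self.waiting = still_waiting

    def placement(self):
        return {p.uid: sorted(p.running) for p in self.pilots.values()}
```

In this sketch a preempted pilot is modelled by an explicit `unregister()` call. A real implementation would additionally have to detect that a killable pilot has died (for example via missed heartbeats, which is again an assumption), since the batch system gives the pilot no chance to unregister itself.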
