Replies: 3 comments
-
@andre-merzky this is something to discuss at the RP level. We need to study how the killable queue works, based on Igor's feedback, and see whether there is a (very) short-term fix as well as a longer-term solution to support that kind of queue behavior. @lee212, can you confirm that you observed the same behavior as Igor with this queue on Summit?
-
It is unlikely that the pilot will survive a kill and restart. Even if the restart itself works, the pilot will not pick up any tasks that were interrupted, or in fact any tasks that had arrived at the pilot before termination...
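To illustrate what a restarted pilot would need in order to pick up interrupted tasks, here is a minimal sketch of a task-state journal. This is not RP code: the state names, the journal file format, and the functions `record_state` and `tasks_to_recover` are all hypothetical, and it assumes the journal lives on a filesystem that survives the kill.

```python
import json
from pathlib import Path

# Hypothetical state names for illustration only; not RADICAL-Pilot's task states.
FINAL_STATES = {"DONE", "FAILED", "CANCELED"}


def record_state(journal: Path, task_id: str, state: str) -> None:
    """Append one state transition to the journal as a JSON line."""
    with journal.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"task": task_id, "state": state}) + "\n")
        fh.flush()


def tasks_to_recover(journal: Path) -> list[str]:
    """Return the IDs of tasks whose last recorded state is not final."""
    if not journal.exists():
        return []
    last_state: dict[str, str] = {}
    with journal.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                # A kill can truncate the final line; skip what cannot be parsed.
                continue
            last_state[entry["task"]] = entry["state"]
    return [tid for tid, state in last_state.items() if state not in FINAL_STATES]
```

On restart, the pilot would call `tasks_to_recover` and re-queue whatever it returns. Whether re-running an interrupted task is safe depends on the task itself, which this sketch does not address.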
-
This has been discussed in a different context again. A possible approach would be the following:
Implementing this scheme would require three major elements:
This is likely easier to implement once we have partitions - at that point, any additional pilot would behave like an external partition.
-
Requested resources:
time: 12h,
CPU cores: 572,
GPUs: 131,
queue 'killable'
Problem:
After about 2h the job was partially killed, presumably preempted by the Summit scheduler.
Some processes kept running, probably for another hour or so, before Radical decided that the job was over.
After that, the scheduler never restarted the job (as it should in the killable queue), and the job disappeared from the queue according to bjobs, which probably means that the scheduler considers it completed.
The logs are on Summit in /gpfs/alpine/csc299/world-shared/iyakushin/ticket.tar.gz