You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
According to https://github.com/basnijholt/adaptive-scheduler/blob/master/adaptive_scheduler/scheduler.py#L594
the automatic requeing by slurm is disabled in adaptive-scheduled jobs. I ran into an issue, where the node that was hosting the job faltered and the job hung in preparation state for a while (50 min). I was able to fix it by requeing the job (one can override --no-requeue with scontrol later), and adaptive-scheduler happily picked up the job and showed it as running.
So, I was wondering what's the reason behind forced --no-requeue?
The text was updated successfully, but these errors were encountered:
I think without that command, a node will crash and get requeued while adaptive-scheduler will already start a new job. That requeued job will not correctly receive a new learner.
According to https://github.com/basnijholt/adaptive-scheduler/blob/master/adaptive_scheduler/scheduler.py#L594
the automatic requeing by slurm is disabled in adaptive-scheduled jobs. I ran into an issue, where the node that was hosting the job faltered and the job hung in preparation state for a while (50 min). I was able to fix it by requeing the job (one can override --no-requeue with scontrol later), and adaptive-scheduler happily picked up the job and showed it as running.
So, I was wondering what's the reason behind forced
--no-requeue
?The text was updated successfully, but these errors were encountered: