forced --no-requeue #56

aeantipov · 2020-12-11T07:01:37Z

According to https://github.com/basnijholt/adaptive-scheduler/blob/master/adaptive_scheduler/scheduler.py#L594
the automatic requeuing by Slurm is disabled in jobs submitted by adaptive-scheduler. I ran into an issue where the node that was hosting the job faltered and the job hung in the preparation state for a while (50 min). I was able to fix it by requeuing the job (one can override --no-requeue with scontrol later), and adaptive-scheduler happily picked up the job and showed it as running.
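For reference, the manual workaround can be sketched roughly as follows. This is a minimal sketch, not part of adaptive-scheduler: the helper names `requeue_commands` and `requeue` are hypothetical, and it assumes the override is done with `scontrol update JobId=<id> Requeue=1` followed by `scontrol requeue <id>`. The exact commands used on a given cluster may differ.

```python
import shlex
import subprocess


def requeue_commands(job_id):
    """Build the scontrol calls for one job (hypothetical helper).

    Assumption: the job was submitted with --no-requeue, so requeuing is
    re-enabled first and the job is requeued afterwards.
    """
    return [
        # Re-enable requeuing for this job, overriding --no-requeue.
        ["scontrol", "update", f"JobId={job_id}", "Requeue=1"],
        # Ask Slurm to put the job back into the queue.
        ["scontrol", "requeue", str(job_id)],
    ]


def requeue(job_id, dry_run=True):
    """Print the commands (dry run) or execute them against Slurm."""
    for cmd in requeue_commands(job_id):
        if dry_run:
            print(shlex.join(cmd))
        else:
            # check=True raises CalledProcessError if scontrol fails.
            subprocess.run(cmd, check=True)
```

Calling `requeue(job_id)` with the default `dry_run=True` only prints the commands, which makes it easy to check them before running anything against the cluster.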

So I was wondering: what is the reason behind the forced --no-requeue?

basnijholt · 2020-12-15T17:26:32Z

To be honest, I don't remember exactly.

I think that without that flag, when a node crashes the job gets requeued while adaptive-scheduler has already started a new job. That requeued job will then not correctly receive a new learner.

basnijholt added the "question" label on Dec 15, 2020