WIP: Node mark reboot helper #65

martijnkruiten · 2018-10-01T14:01:16Z

I added a helper script to mark nodes for reboot. It's based on node-mark-offline, but executes scontrol reboot ASAP <node> instead. To use it, set OFFLINE_NODE to $HELPERDIR/node-mark-reboot. This is useful for checks whose failure should trigger a reboot. It's compatible with Slurm only.

martijnkruiten · 2018-10-09T08:03:36Z

This can already be done with SLURM_SC_OFFLINE_ARGS, so I'm closing this pull request.

martijnkruiten · 2018-10-09T08:23:23Z

I closed it too soon. node-mark-offline is currently incompatible with reboot ASAP, because it expects fewer arguments:

SLURM_SC_OFFLINE_ARGS="update State=DRAIN"
exec $SLURM_SCONTROL $SLURM_SC_OFFLINE_ARGS NodeName=$HOSTNAME Reason="$LEADER $NOTE"

Versus:

SLURM_SC_OFFLINE_ARGS="reboot ASAP"
exec $SLURM_SCONTROL $SLURM_SC_OFFLINE_ARGS $HOSTNAME

martijnkruiten · 2018-10-10T13:24:18Z

I'm working on an improved version with Slurm 18.08 support (NextState and Reason arguments), handling of existing notes (similar to node-mark-offline) and renamed variables (SLURM_SC_OFFLINE_ARGS becomes SLURM_SC_REBOOT_ARGS). This is done in a private repository, but eventually I will push it to this branch.

martijnkruiten · 2020-05-22T13:49:20Z

I've got an internal version that we use. I'm going to push it to this branch.

martijnkruiten · 2020-05-22T15:50:52Z

OK, the difference between node-mark-offline and node-mark-reboot is only a few lines, so they can easily be merged into one helper. The main issue is that scontrol reboot expects the node names in a different format, so either the helper should be aware of the value of SLURM_SC_REBOOT_ARGS, or there should be another environment variable that switches it to reboot. I guess the latter is a lot cleaner.
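
The separate-variable approach could look roughly like the sketch below. MARK_MODE and build_mark_cmd are made-up names for illustration, not something NHC or this branch defines, and the function only prints the scontrol command line instead of running it, so the difference in argument layout is visible.

```shell
#!/bin/bash
# Hypothetical sketch: one helper, two modes. MARK_MODE is an assumed
# variable name, not an existing NHC setting.
SLURM_SCONTROL="${SLURM_SCONTROL:-scontrol}"

# build_mark_cmd MODE NODE REASON
# Prints the scontrol command line for the requested mode.
build_mark_cmd() {
    local mode="$1" node="$2" reason="$3"
    case "$mode" in
        reboot)
            # scontrol reboot takes the node list as a bare argument.
            printf '%s reboot ASAP %s\n' "$SLURM_SCONTROL" "$node"
            ;;
        offline)
            # scontrol update takes NodeName= and Reason= key/value pairs.
            printf '%s update State=DRAIN NodeName=%s Reason="%s"\n' \
                "$SLURM_SCONTROL" "$node" "$reason"
            ;;
        *)
            echo "unknown mode: $mode" >&2
            return 1
            ;;
    esac
}

build_mark_cmd "${MARK_MODE:-offline}" "${HOSTNAME:-node001}" "NHC: check failed"
```

A real helper would exec the command rather than print it, and would still need the handling of existing notes that node-mark-offline does.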

The node-mark-online helper can cancel pending reboots (if the node is healthy again) or mark nodes online after a reboot. We've opted to reboot them with NextState=DOWN to avoid boot loops. We run NHC during the boot sequence, and at that point the nodes are either left in a drained state or resumed. If NHC is run only in the prolog and/or epilog, a different approach would be to set NextState=RESUME and do something inside the helper to avoid boot loops.
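
With Slurm 18.08 or later, the reboot request described here might be built along these lines. This is a sketch under assumptions, not the helper from this branch: reboot_cmd and the reason text are invented for illustration, and the command is printed rather than executed.

```shell
#!/bin/bash
# Sketch only: print the reboot request instead of executing it.
SLURM_SCONTROL="${SLURM_SCONTROL:-scontrol}"

# reboot_cmd NODE NEXTSTATE REASON
# NEXTSTATE is DOWN (node stays out of service until something resumes it)
# or RESUME (node returns to service right after the reboot).
reboot_cmd() {
    local node="$1" nextstate="$2" reason="$3"
    case "$nextstate" in
        DOWN|RESUME) ;;
        *) echo "nextstate must be DOWN or RESUME" >&2; return 1 ;;
    esac
    printf '%s reboot ASAP nextstate=%s reason="%s" %s\n' \
        "$SLURM_SCONTROL" "$nextstate" "$reason" "$node"
}

reboot_cmd "${HOSTNAME:-node001}" DOWN "NHC: reboot requested"
```

With DOWN, something has to resume the node after the reboot, for example NHC run during boot. With RESUME, the helper itself would need a guard against repeated reboots.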

martijnkruiten · 2022-03-22T13:14:21Z

For anyone looking to use this helper: this would work perfectly with something like this (I'm referring to the service file). That's because we've opted to let the node return in a drained state, so it will only be resumed if NHC is run during the boot process (or manually). That's by design, because we don't want to trigger a boot loop during the prologue, and we also like to avoid scheduling a job on a node before we know for sure that it's in a good state after the reboot.

We use it like this in nhc.conf:

<target> || export OFFLINE_NODE="$HELPERDIR/node-mark-reboot"
<target> || <test that should trigger reboot>
<target> || export OFFLINE_NODE="$HELPERDIR/node-mark-offline"
<target> || <test that should trigger drain>

Alternatively, there is this pull request that tries to handle it differently.

martijnkruiten closed this Oct 9, 2018

martijnkruiten deleted the node-mark-reboot-helper branch October 9, 2018 08:03

martijnkruiten restored the node-mark-reboot-helper branch October 9, 2018 08:19

martijnkruiten reopened this Oct 9, 2018

martijnkruiten changed the title ~~Node mark reboot helper~~ WIP: Node mark reboot helper Oct 10, 2018

mej added the enhancement label Jan 1, 2019

mej self-requested a review January 1, 2019 05:50

mej added this to the 1.4.4 Release milestone Jan 1, 2019

hintron mentioned this pull request Apr 2, 2019

Handle Changes to Slurm's REBOOT workflow #81

Closed

martijnkruiten force-pushed the node-mark-reboot-helper branch from 57329fa to 7486112 Compare May 22, 2020 13:59

mej mentioned this pull request Aug 17, 2020

Allow NHC to work with recent changes to Slurm reboot #84

Open

mej added the bug label Aug 17, 2020

mej self-assigned this Aug 17, 2020

mej modified the milestones: 1.4.4 Release, 1.4.3 Release Aug 17, 2020

mej removed this from the 1.4.3 Release milestone Apr 18, 2021

mej added this to the 1.5 Release milestone Apr 18, 2021

Martijn Kruiten and others added 3 commits March 22, 2022 13:31

Node mark reboot helper

751a6b3

Update to newest version that we use internally

8de1ea7

Rebased on upstream master branch

eebeed7

martijnkruiten force-pushed the node-mark-reboot-helper branch from 3df3e9c to eebeed7 Compare March 22, 2022 13:01
