WIP: Node mark reboot helper #65

Open
wants to merge 3 commits into
base: master

Conversation

martijnkruiten
Contributor

I added a helper script to mark nodes for reboot. It's based on node-mark-offline, but executes scontrol reboot ASAP <node> instead. To use it, set OFFLINE_NODE to $HELPERDIR/node-mark-reboot. This is useful for checks whose failure should trigger a reboot. It's compatible with Slurm only.

@martijnkruiten
Contributor Author

This can already be done with SLURM_SC_OFFLINE_ARGS, so I'm closing this pull request.

@martijnkruiten martijnkruiten deleted the node-mark-reboot-helper branch October 9, 2018 08:03
@martijnkruiten martijnkruiten restored the node-mark-reboot-helper branch October 9, 2018 08:19
@martijnkruiten
Contributor Author

I closed it too soon. node-mark-offline is currently incompatible with reboot ASAP, because reboot ASAP expects fewer arguments than the helper passes:

SLURM_SC_OFFLINE_ARGS="update State=DRAIN"
exec $SLURM_SCONTROL $SLURM_SC_OFFLINE_ARGS NodeName=$HOSTNAME Reason="$LEADER $NOTE"

Versus:

SLURM_SC_OFFLINE_ARGS="reboot ASAP"
exec $SLURM_SCONTROL $SLURM_SC_OFFLINE_ARGS $HOSTNAME

@martijnkruiten martijnkruiten reopened this Oct 9, 2018
@martijnkruiten
Contributor Author

martijnkruiten commented Oct 10, 2018

I'm working on an improved version with Slurm 18.08 support (NextState and Reason arguments), handling of existing notes (similar to node-mark-offline) and renamed variables (SLURM_SC_OFFLINE_ARGS becomes SLURM_SC_REBOOT_ARGS). This is done in a private repository, but eventually I will push it to this branch.
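As a rough illustration of what the Slurm 18.08 form could look like, here is a minimal sketch that only prints the command instead of running it. SLURM_SCONTROL, SLURM_SC_REBOOT_ARGS, LEADER and NOTE follow the names used in this thread; build_reboot_cmd and the example node name are hypothetical, and the actual helper in the private repository may be structured differently.

```shell
#!/bin/bash
# Sketch only: prints the scontrol command instead of executing it.
# build_reboot_cmd is a hypothetical name, not part of NHC.
SLURM_SCONTROL="scontrol"
SLURM_SC_REBOOT_ARGS="reboot ASAP"

build_reboot_cmd() {
    local host="$1" leader="$2" note="$3"
    # Slurm 18.08+ accepts nextstate= and reason= before the bare node list.
    printf '%s %s nextstate=DOWN reason="%s %s" %s\n' \
        "$SLURM_SCONTROL" "$SLURM_SC_REBOOT_ARGS" "$leader" "$note" "$host"
}

# Example call with made-up values for the leader and the note.
build_reboot_cmd node01 "NHC:" "check_fs_mount failed"
```

Printing the command first makes it easy to check the argument order before replacing printf with an exec of the real scontrol.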

@martijnkruiten martijnkruiten changed the title Node mark reboot helper WIP: Node mark reboot helper Oct 10, 2018
@mej mej added the enhancement label Jan 1, 2019
@mej mej self-requested a review January 1, 2019 05:50
@mej mej added this to the 1.4.4 Release milestone Jan 1, 2019
@martijnkruiten
Contributor Author

I've got an internal version that we use. I'm going to push it to this branch.

@martijnkruiten
Contributor Author

martijnkruiten commented May 22, 2020

OK, the difference between node-mark-offline and node-mark-reboot is only a few lines, so they can easily be merged into one helper. The main issue is that scontrol reboot expects the node names in a different format, so either the helper should be aware of the value of SLURM_SC_REBOOT_ARGS, or there should be another environment variable that switches it to reboot mode. I guess the latter is a lot cleaner.
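A minimal sketch of the second option, assuming a separate switch selects the mode. NHC_OFFLINE_MODE and build_offline_cmd are hypothetical names that do not exist in NHC; the sketch prints the command rather than running scontrol, and it leaves out the handling of existing notes.

```shell
#!/bin/bash
# Sketch only: echoes the command rather than exec-ing scontrol.
# NHC_OFFLINE_MODE is a hypothetical variable, not an existing NHC setting.
SLURM_SCONTROL="scontrol"
SLURM_SC_OFFLINE_ARGS="update State=DRAIN"
SLURM_SC_REBOOT_ARGS="reboot ASAP nextstate=DOWN"

build_offline_cmd() {
    local mode="$1" host="$2" reason="$3"
    case "$mode" in
        reboot)
            # scontrol reboot takes a bare node list as its last argument.
            printf '%s %s reason="%s" %s\n' \
                "$SLURM_SCONTROL" "$SLURM_SC_REBOOT_ARGS" "$reason" "$host"
            ;;
        drain)
            # scontrol update expects NodeName=<host> instead.
            printf '%s %s NodeName=%s Reason="%s"\n' \
                "$SLURM_SCONTROL" "$SLURM_SC_OFFLINE_ARGS" "$host" "$reason"
            ;;
        *)
            echo "unknown mode: $mode" >&2
            return 1
            ;;
    esac
}

# Default to draining when the hypothetical switch is unset.
build_offline_cmd "${NHC_OFFLINE_MODE:-drain}" node01 "NHC: test failed"
```

Keeping the node-name format inside the case branches means SLURM_SC_OFFLINE_ARGS and SLURM_SC_REBOOT_ARGS stay plain argument strings, and the helper does not have to parse them to guess which subcommand is in use.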

The node-mark-online helper can cancel pending reboots (if the node is healthy again) or mark nodes online after a reboot. We've opted to reboot them with NextState=DOWN to avoid boot loops. We run NHC during the boot sequence, and at that point the nodes are either left in a drained state or resumed. If NHC is run only in the prolog and/or epilog, a different approach would be to set NextState=RESUME and do something inside the helper to avoid boot loops.

@mej mej added the bug label Aug 17, 2020
@mej mej self-assigned this Aug 17, 2020
@mej mej modified the milestones: 1.4.4 Release, 1.4.3 Release Aug 17, 2020
@mej mej removed this from the 1.4.3 Release milestone Apr 18, 2021
@mej mej added this to the 1.5 Release milestone Apr 18, 2021
@martijnkruiten
Contributor Author

For anyone looking to use this helper: this would work perfectly with something like this (I'm referring to the service file). That's because we've opted to let the node return in a drained state, so it will only be resumed if NHC is run during the boot process (or manually). That's by design, because we don't want to trigger a boot loop during the prolog, and we also want to avoid scheduling a job on a node before we know for sure that it's in a good state after the reboot.

We use it like this in nhc.conf:

<target> || export OFFLINE_NODE="$HELPERDIR/node-mark-reboot"
<target> || <test that should trigger reboot>
<target> || export OFFLINE_NODE="$HELPERDIR/node-mark-offline"
<target> || <test that should trigger drain>

Alternatively, there is this pull request that tries to handle it differently.
