Allow NHC to work with recent changes to Slurm reboot #84

Open. 1 commit to be merged into base: dev
34 changes: 32 additions & 2 deletions helpers/node-mark-offline
@@ -63,7 +63,8 @@ elif [[ "$NHC_RM" == "slurm" ]]; then
     OLD_NOTE_LEADER="${LINE[1]}"
     OLD_NOTE="${LINE[*]:2}"
     case "$STATUS" in
-        alloc*|comp*|drain*|drng*|fail*|idle*|maint*|mix*|resume*|resv*|undrain*)
+        # Node states: src/common/slurm_protocol_defs.c --> node_state_string()
+        alloc*|comp*|drain*|drng*|fail*|idle*|maint*|mix*|resume*|resv*|undrain*|boot*)
             case "$STATUS" in
                 drain*|drng*|fail*|maint*)
                     # If the node is already offline, and there is no old note, and
@@ -73,9 +74,38 @@ elif [[ "$NHC_RM" == "slurm" ]]; then
                         exit 0
                     fi
                     ;;
+                boot*)
+                    # Offline node after reboot if vanilla `scontrol reboot` was
+                    # called, so jobs can't run until NHC onlines the node.
+                    # Note: This won't happen while node is waiting to boot,
+                    # because $STATUS would show MIX@ or ALLOC@, not BOOT.
+                    # See src/common/slurm_protocol_defs.c-->node_state_string()
+                    SHOW_NODE_OUTPUT="$($SLURM_SCONTROL show node $HOSTNAME)"
+                    if [[ $SHOW_NODE_OUTPUT == *"State=REBOOT "* ]]; then
Review comment (Contributor):

I had to change this to "State=REBOOT" without the trailing space. While a node was booting this is the state I saw: State=REBOOT*+DRAIN.
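
To illustrate the difference the comment describes, here is a minimal sketch comparing the two glob patterns against the reported state string. The helper name `state_is_reboot` and the `ThreadsPerCore=1` field in the sample are illustrative assumptions, not part of the PR; only `State=REBOOT*+DRAIN` comes from the comment above.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of NHC): succeeds when the given
# `scontrol show node` text contains a State value beginning with REBOOT,
# whatever flags or suffixes follow it.
state_is_reboot() {
    local output="$1"
    [[ $output == *"State=REBOOT"* ]]
}

# Sample text: the State value is the one reported in the review comment;
# the neighbouring field is only there to make the line look realistic.
sample='State=REBOOT*+DRAIN ThreadsPerCore=1'

# Pattern as written in the diff (with a trailing space after REBOOT).
if [[ $sample == *"State=REBOOT "* ]]; then
    echo "trailing-space pattern: match"
else
    echo "trailing-space pattern: no match"
fi

# Pattern without the trailing space, as suggested in the comment.
if state_is_reboot "$sample"; then
    echo "prefix pattern: match"
else
    echo "prefix pattern: no match"
fi
```

The prefix form is looser: it accepts any state string that starts with `REBOOT`, so it would be worth checking against the states a given Slurm version can actually report.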

+                        MSG="Temporarily offlining $HOSTNAME after reboot until NHC can online it"
+                        echo "$0: $MSG"
+                        $SLURM_SCONTROL update State=DRAIN NodeName=$HOSTNAME Reason="$LEADER $MSG"
+                        exit 0
+                    fi
+
+                    # If `Reboot ASAP` has been cleared, then the node is
+                    # already set to stay in DRAIN until NHC onlines it, so exit
+                    if [[ "$OLD_NOTE_LEADER" != "Reboot" && "$OLD_NOTE" != "ASAP" ]]; then
+                        echo "$0: $HOSTNAME already set to remain offline after reboot until NHC onlines it"
+                        exit 0
+                    fi
+                    ;;
             esac
+            # `scontrol reboot asap` will set the node state to REBOOT+DRAIN and
+            # reason to `Reboot ASAP`. Then, after boot, and after NHC runs
+            # once, Slurm will set the node base state to IDLE. If reason ==
+            # `Reboot ASAP`, Slurm will also clear the DRAIN flag. We want
+            # NHC to clear the DRAIN flag, not Slurm, so delete the
+            # `Reboot ASAP` reason by not preserving it below.
+            # See https://slurm.schedmd.com/scontrol.html --> reboot
+
             # If there's an old note that wasn't set by NHC, preserve it.
-            if [[ "$OLD_NOTE_LEADER" != "none" && "$OLD_NOTE_LEADER" != "$LEADER" ]]; then
+            if [[ "$OLD_NOTE_LEADER" != "none" && "$OLD_NOTE_LEADER" != "$LEADER" && "$OLD_NOTE_LEADER" != "Reboot" && "$OLD_NOTE" != "ASAP" ]]; then
                 LEADER="$OLD_NOTE_LEADER"
                 NOTE="$OLD_NOTE"
             fi
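
A possible point for review: the checks on `Reboot ASAP` test the two words separately (`!= "Reboot" && != "ASAP"`), which is not quite the same as "the reason is not exactly `Reboot ASAP`". The sketch below compares the two forms. The function names and the sample reasons are hypothetical, and splitting the reason with `read` only approximates how the script fills `OLD_NOTE_LEADER` and `OLD_NOTE`.

```shell
#!/usr/bin/env bash

# Hypothetical helper: true only when the reason is exactly "Reboot ASAP".
is_reboot_asap() {
    local leader="$1" note="$2"
    [[ "$leader" == "Reboot" && "$note" == "ASAP" ]]
}

# The form used in the diff: true only when BOTH words differ.
both_words_differ() {
    local leader="$1" note="$2"
    [[ "$leader" != "Reboot" && "$note" != "ASAP" ]]
}

# Illustrative reasons (assumptions, not taken from a real cluster).
for reason in "Reboot ASAP" "Reboot scheduled" "Kernel ASAP" "Admin: bad DIMM"; do
    # First word becomes the leader, the remainder becomes the note.
    read -r leader note <<< "$reason"

    if both_words_differ "$leader" "$note"; then a=yes; else a=no; fi
    if ! is_reboot_asap "$leader" "$note"; then b=yes; else b=no; fi

    printf '%-18s both_words_differ=%-3s not_reboot_asap=%s\n' "$reason" "$a" "$b"
done
```

The two forms agree for `Reboot ASAP` itself and for reasons that share neither word, and they differ only for reasons that share exactly one of the two words. Whether that matters depends on which reasons actually occur at a site; the stricter form may well be intentional.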
7 changes: 6 additions & 1 deletion helpers/node-mark-online
@@ -60,7 +60,12 @@ elif [[ "$NHC_RM" == "slurm" ]]; then
     # Slurm does not run the HealthCheckProgram on nodes in the DOWN state,
     # but if someone runs NHC by hand, we want to be able to do the right thing.
     case "$STATUS" in
-        down*|drain*|drng*|fail*|maint*)
+        *@)
+            # Onlining a node will cancel a pending reboot, so prevent this
+            echo "$0: Not onlining $HOSTNAME: Reboot is pending."
+            exit 0
+            ;;
+        down*|drain*|drng*|fail*|maint*|boot*)
             # If there is no old note, and we've not been told to ignore that, do not online the node.
             if [[ "$OLD_NOTE_LEADER" == "none" && "$IGNORE_EMPTY_NOTE" != "1" ]]; then
                 echo "$0: Not onlining $HOSTNAME: No note set."
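
The ordering of the `case` arms matters here: bash takes the first matching pattern, so `*@)` has to come before the prefix patterns to catch a pending reboot. A minimal sketch of that behaviour follows. `classify_status` is a hypothetical function, and `drng@` is an assumed input used only to show the ordering effect; the comments in the diff mention only `MIX@` and `ALLOC@`.

```shell
#!/usr/bin/env bash

# Hypothetical classifier mirroring the arm order in the diff.
classify_status() {
    case "$1" in
        *@)
            # Any state ending in "@" is treated as "reboot pending".
            echo "reboot-pending"
            ;;
        down*|drain*|drng*|fail*|maint*|boot*)
            echo "offline"
            ;;
        *)
            echo "other"
            ;;
    esac
}

# "drng@" would also match drng*, but the *@ arm is tested first.
for status in 'mix@' 'alloc@' 'drng@' 'drain' 'boot' 'idle'; do
    printf '%-7s -> %s\n' "$status" "$(classify_status "$status")"
done
```

If the `*@)` arm were placed after the prefix patterns, a state such as the assumed `drng@` would fall into the offline branch instead.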