NHC must understand the Slurm node state "resv" (Reserved) #82
Comments
I agree with this. This is similar to what we are doing in #81 - expanding the Slurm states understood by NHC. I think you'll also want to add One final thought: maybe
src/common/slurm_protocol_defs.c ->
Hi Michael, thanks a lot for your thorough work! I hope this will make it into the next release of NHC.
We're installing some new nodes in our Slurm cluster and their fabric cables are not yet in place, so the Node Health Check (NHC) gives an error as expected:
[root@b001 ~]# nhc
ERROR: nhc: Health check failed: check_hw_ib: No IB port is ACTIVE (LinkUp 100 Gb/sec).
However, because we have temporarily set the Slurm state of these nodes to "resv" (Reserved), some warning messages are printed in /var/log/nhc.log:
ERROR: nhc: Health check failed: check_hw_ib: No IB port is ACTIVE (LinkUp 100 Gb/sec).
20190409 13:20:33 /usr/libexec/nhc/node-mark-offline b001 check_hw_ib: No IB port is ACTIVE (LinkUp 100 Gb/sec).
/usr/libexec/nhc/node-mark-offline: Not sure how to handle node state "resv" on b001
I would like to request the addition of Slurm state "resv" to the /usr/libexec/nhc/node-mark-offline script as in this diff:
--- /usr/libexec/nhc/node-mark-offline.orig 2015-11-11 22:46:52.000000000 +0100
+++ /usr/libexec/nhc/node-mark-offline 2019-04-09 13:29:48.587902690 +0200
@@ -63,7 +63,7 @@
OLD_NOTE_LEADER="${LINE[1]}"
OLD_NOTE="${LINE[*]:2}"
case "$STATUS" in
With this change I do get the expected behavior of NHC, and the nhc.log shows:
ERROR: nhc: Health check failed: check_hw_ib: No IB port is ACTIVE (LinkUp 100 Gb/sec).
20190409 13:29:51 /usr/libexec/nhc/node-mark-offline b001 check_hw_ib: No IB port is ACTIVE (LinkUp 100 Gb/sec).
/usr/libexec/nhc/node-mark-offline: Marking resv b001 offline: NHC: check_hw_ib: No IB port is ACTIVE (LinkUp 100 Gb/sec).
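For illustration, the kind of state matching involved can be sketched as a standalone Bash function. This is only a sketch under assumptions: `decide_action` is a hypothetical name, the state patterns other than `resv*` are chosen for the example, and it neither reproduces the real node-mark-offline script nor calls scontrol.

```shell
#!/bin/bash
# Hypothetical helper, NOT the actual node-mark-offline code.
# It maps a Slurm node state string to an action keyword.
# All patterns except "resv*" are assumptions made for this example.
decide_action() {
    local status="$1"
    case "$status" in
        down*)
            # Node is already out of service; leave it alone.
            echo "skip"
            ;;
        alloc*|comp*|idle*|mix*|resv*)
            # Node could run or accept jobs, so it should go offline.
            # "resv*" is the pattern this issue asks for; the trailing
            # glob also matches flagged states such as "resv*".
            echo "offline"
            ;;
        *)
            # Unrecognized state: report it rather than guess.
            echo "unknown"
            ;;
    esac
}

# Small demonstration over a few state strings.
for state in idle resv 'resv*' bogus; do
    printf '%s -> %s\n' "$state" "$(decide_action "$state")"
done
```

Without the `resv*` alternative, a reserved node falls through to the catch-all branch, which corresponds to the "Not sure how to handle node state" message shown earlier.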
See also this Slurm bug report: https://bugs.schedmd.com/show_bug.cgi?id=6816
Thanks,
Ole