Helper scripts are not called when the node fails the health check with Slurm #147

szhengac · 2023-11-08T19:04:03Z

Hi,

I am testing nhc with Slurm to automatically drain nodes that report uncorrectable ECC errors. The nhc log shows that the health check fails on the problematic node, but no helper script is executed to put the node into the drain state. If I call the helper script manually, e.g. sudo NHC_RM=slurm bash /usr/libexec/nhc/node-mark-offline ib-vm-25, the node is put into the drain state. How can I get Slurm and nhc to call the helper scripts automatically when a node fails the health check? Thanks!

/var/log/nhc.log:

Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1

OleHolmNielsen · 2023-11-08T19:09:45Z

Did you configure slurm.conf to call NHC? We use the line:
HealthCheckProgram=/usr/sbin/nhc

szhengac · 2023-11-08T19:11:34Z

Yes, this was configured. I can see that nhc was called by Slurm, since slurmd.log has the following lines:

[2023-11-08T08:14:23.445] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1

[2023-11-08T08:19:23.705] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1

[2023-11-08T08:24:23.953] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1

[2023-11-08T08:29:24.274] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1

[2023-11-08T08:34:23.492] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1

OleHolmNielsen · 2023-11-08T19:20:11Z

I've never seen a check named "check_xid_errors". Where did it come from? Did you define it in your nhc.conf file?

szhengac · 2023-11-08T19:25:49Z

Yes, it is from https://github.com/NVIDIA/deepops. I added the line ib-vm-25 || check_xid_errors to nhc.conf. The check is defined as:

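check_xid_errors() {
        # XID codes to ignore (used as a grep pattern below)
        excluded_xid='94'
        # Collect the XID numbers from "NVRM: Xid" journal lines logged in the
        # last hour of the current boot, as a sorted, comma-separated list
        xid_list=$(journalctl -b 0  --since "1 hour ago" --no-pager 2> /dev/null | grep "NVRM: Xid" | sed 's/^.*\] \(.*\)/\1/' | awk '{print $9}' | sed 's/,//' | sort -n | uniq | grep -v -E "${excluded_xid}" | paste -s -d,)
        # A non-empty list means at least one non-excluded XID was found
        if [ x"$xid_list" != x"" ]; then
                echo "Found XID errors: $xid_list"
                return 1
        fi
        return 0
}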
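
One detail of the function above that may be worth checking: grep -v -E "${excluded_xid}" drops every line that contains the pattern, so excluding 94 would also hide any other XID whose digits include 94. Below is a minimal sketch of the tail of that pipeline using whole-line matching (grep -x). filter_xids is a hypothetical helper name, not part of nhc or deepops, and it assumes one XID number per input line.

```shell
# Hypothetical helper, not part of nhc or deepops.
# Reads one XID per line on stdin and prints the unique, numerically sorted
# XIDs as a comma-separated list, skipping those that match $1 exactly.
# $1 is an extended regex such as '94' or '94|43'.
filter_xids() {
    excluded="$1"
    # -x makes grep match whole lines only, so '94' does not also drop '194'
    sort -n | uniq | grep -v -x -E "$excluded" | paste -s -d, -
}

# Example: 94 is excluded, while 63 and the made-up value 194 are kept
printf '63\n94\n63\n194\n' | filter_xids 94   # prints 63,194
```

Whether this matters in practice depends on which XID codes the driver can emit; the sketch only shows the difference in matching behavior.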

OleHolmNielsen · 2023-11-08T19:53:20Z

I don't know about this check. You could try to configure a "fake" check in nhc.conf on the node, like adding a check of check_hw_physmem for values that are definitely wrong. This should cause slurmd to mark the node offline next time it calls NHC. Make sure to configure all the NHC parameters in slurm.conf, for example:

HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=3600
HealthCheckNodeState=ANY

The default value of HealthCheckInterval is 0, which disables periodic execution of NHC!
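
To rule out the interval problem quickly, a small check like the following could be run against the configuration. This is only a sketch: health_check_status is a hypothetical helper, not part of Slurm or NHC, and it assumes plain Key=Value lines as in the example above.

```shell
# Hypothetical helper, not part of Slurm or NHC.
# Reads slurm.conf-style "Key=Value" lines on stdin and reports whether
# periodic health checks are configured. Exit status is 0 only if they are.
# Keys are compared case-insensitively.
health_check_status() {
    awk -F= '
        /^[[:space:]]*#/ { next }                  # skip comment lines
        { key = tolower($1); gsub(/[[:space:]]/, "", key) }
        key == "healthcheckprogram"  { program = $2 }
        key == "healthcheckinterval" { interval = $2 + 0 }
        END {
            if (program == "") {
                print "HealthCheckProgram is not set"
                exit 1
            }
            if (interval == 0) {
                print "HealthCheckInterval is 0 or unset, so the program is never run"
                exit 1
            }
            print "health checks enabled: " program " every " interval "s"
        }'
}

# Example with the values mentioned in this thread
health_check_status <<'EOF'
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=300
HealthCheckNodeState=IDLE
EOF
```

On a live system, scontrol show config lists the values the controller is actually using, which is the more reliable thing to inspect; the location of slurm.conf varies by installation.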

BTW, which version of Slurm do you run?

szhengac · 2023-11-08T20:39:01Z

I am using HealthCheckNodeState=IDLE and HealthCheckInterval=300 in my slurm.conf. Do I need to use ANY?

I am using Slurm 23.02.4.

szhengac · 2023-11-08T20:55:36Z

I tried the standard check check_hw_cpuinfo in nhc but still had no luck. The helper scripts are still not called.

[2023-11-08T20:44:23.753] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  check_hw_cpuinfo:  Actual CPU thread count (176) does not match expected (1760).

szhengac · 2023-11-09T21:47:05Z

@OleHolmNielsen I think the helper script should be run by nhc rather than Slurm? Based on the log, nhc is definitely executed by Slurm.

szhengac · 2023-11-10T22:13:19Z

I can now confirm that this is a bug in nhc 1.4.3. Reinstalling 1.4.3 does not help, but installing 1.4.2 instead fixes the problem.

KasperSkytte · 2024-01-05T11:31:22Z

I can confirm 1.4.3 doesn't run the /usr/libexec/nhc/node-mark-offline script to drain a node with failing checks. I downgraded to 1.4.2 as @szhengac suggested, and it works fine. I tried debugging a bit, but didn't find the cause.

KasperSkytte · 2024-01-05T11:32:51Z

Addendum: the node-mark-offline script itself works just fine.

jbd · 2024-02-06T16:22:33Z

Hello,

FWIW, I had this problem with 1.4.3 because scontrol was not in the PATH, so the auto-detection in nhcmain_find_rm didn't work. Setting NHC_RM in /etc/sysconfig/nhc worked for me.
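
A quick way to test for this situation is to check whether scontrol resolves under the PATH that nhc runs with. The sketch below is only an illustration: on_path_with is a hypothetical helper, and the PATH value in the example is an assumption about a restricted environment, not something nhc or slurmd is known to use.

```shell
# Hypothetical helper, not part of nhc.
# Succeeds if command $2 can be found when PATH is set to $1.
on_path_with() {
    # The subshell keeps the PATH change from leaking into the caller
    ( PATH="$1"; command -v "$2" >/dev/null 2>&1 )
}

# Assumed minimal PATH, for illustration only
restricted_path="/usr/bin:/bin"

if on_path_with "$restricted_path" scontrol; then
    echo "scontrol found on this PATH"
else
    echo "scontrol not found on this PATH; consider NHC_RM=slurm in /etc/sysconfig/nhc"
fi
```

If scontrol is installed under a prefix that is missing from that PATH, setting NHC_RM explicitly as described above sidesteps the detection step.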

Zoidmania · 2024-02-28T21:37:32Z

I believe we're being affected by this issue as well. Any movement on this? I'm experiencing exactly the same behavior as @szhengac, and I'm at my wit's end.
