Helper scripts are not called when the node fails the health check with Slurm #147

Open
szhengac opened this issue Nov 8, 2023 · 13 comments

@szhengac

szhengac commented Nov 8, 2023

Hi,

I am testing nhc with Slurm to automatically drain nodes that report uncorrectable ECC errors. The nhc log shows that the health check fails on the problematic node, but no helper script is executed to put the node into the drain state. If I call the helper script manually, e.g. sudo NHC_RM=slurm bash /usr/libexec/nhc/node-mark-offline ib-vm-25, the node is drained as expected. How can I get Slurm and nhc to call the helper scripts automatically when a node fails the health check? Thanks!

/var/log/nhc.log:

Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
@OleHolmNielsen

Did you configure slurm.conf to call NHC? We use the line:
HealthCheckProgram=/usr/sbin/nhc

@szhengac
Author

szhengac commented Nov 8, 2023

Yes, this was configured. I can see that nhc was called by Slurm, since slurmd.log has the following lines:

[2023-11-08T08:14:23.445] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1

[2023-11-08T08:19:23.705] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1

[2023-11-08T08:24:23.953] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1

[2023-11-08T08:29:24.274] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1

[2023-11-08T08:34:23.492] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1

@OleHolmNielsen

I've never seen a check named "check_xid_errors". Where did it come from? Did you define it in your nhc.conf file?

@szhengac
Author

szhengac commented Nov 8, 2023

Yes, this is from https://github.com/NVIDIA/deepops. I added the line ib-vm-25 || check_xid_errors to nhc.conf. The check is defined as:

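check_xid_errors() {
        # XID codes to ignore (used as a grep -E pattern, so it also matches substrings)
        excluded_xid='94'
        # Unique, comma-separated XID codes logged by the NVIDIA driver during the last hour of the current boot
        xid_list=$(journalctl -b 0  --since "1 hour ago" --no-pager 2> /dev/null | grep "NVRM: Xid" | sed 's/^.*\] \(.*\)/\1/' | awk '{print $9}' | sed 's/,//' | sort -n | uniq | grep -v -E "${excluded_xid}" | paste -s -d,)
        # A non-empty list means at least one non-excluded XID was logged, so the check fails
        if [ x"$xid_list" != x"" ]; then
                echo "Found XID errors: $xid_list"
                return 1
        fi
        return 0
}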
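
For readers adapting this check, here is a minimal sketch of the extraction step as a standalone, testable function. It is not the deepops code: extract_xids is a hypothetical helper, and it assumes the driver logs lines of the form "NVRM: Xid (PCI:...): <code>, ...". Unlike the unanchored pattern above, it excludes codes by whole-number match.

```shell
#!/usr/bin/env bash
# Sketch only: extract_xids is a hypothetical helper, not part of NHC or deepops.
# Reads kernel log text on stdin and prints the unique Xid codes found,
# comma-separated, skipping any code that matches the exclusion pattern exactly.
extract_xids() {
    # $1: extended-regex alternation of codes to ignore, e.g. '94|43'
    local excluded_regex="${1:-94}"

    # Assumed log format: "... NVRM: Xid (PCI:0000:3b:00): 63, ..."
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p' \
        | sort -n -u \
        | grep -v -x -E "${excluded_regex}" \
        | paste -s -d, -
}
```

It could be fed from the kernel log with something like journalctl -k -b 0 --since "1 hour ago" --no-pager | extract_xids 94. The -x flag makes grep compare the whole line, so excluding 94 no longer hides other codes that merely contain those digits.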

@OleHolmNielsen

I don't know about this check. You could try to configure a "fake" check in nhc.conf on the node, like adding a check of check_hw_physmem for values that are definitely wrong. This should cause slurmd to mark the node offline next time it calls NHC. Make sure to configure all the NHC parameters in slurm.conf, for example:

HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=3600
HealthCheckNodeState=ANY

The default value of HealthCheckInterval is 0, which disables NHC!
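
To confirm which health-check settings the running daemons actually use, the output of scontrol show config can be inspected. The sketch below is a hypothetical helper (health_check_summary is not a Slurm or NHC command) and assumes the output contains lines of the form "Key = value".

```shell
#!/usr/bin/env bash
# Sketch only: health_check_summary is a hypothetical helper.
# Reads `scontrol show config` output on stdin (assumed "Key = value" lines)
# and reports whether periodic health checks are enabled.
health_check_summary() {
    awk -F'=' '
        /^HealthCheck(Interval|NodeState|Program)[ \t]*=/ {
            key = $1; val = $2
            gsub(/[ \t]/, "", key)          # strip padding around the key
            sub(/^[ \t]+/, "", val)         # trim leading whitespace
            sub(/[ \t]+$/, "", val)         # trim trailing whitespace
            conf[key] = val
        }
        END {
            # "300 sec" + 0 evaluates to 300; a missing key evaluates to 0
            interval = conf["HealthCheckInterval"] + 0
            if (interval == 0)
                print "disabled: HealthCheckInterval is 0 or unset"
            else
                printf "enabled: every %d s on %s nodes via %s\n", \
                    interval, conf["HealthCheckNodeState"], conf["HealthCheckProgram"]
        }'
}
```

Typical usage would be scontrol show config | health_check_summary on the affected node.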

BTW, which version of Slurm do you run?

@szhengac
Author

szhengac commented Nov 8, 2023

I am using HealthCheckNodeState=IDLE and HealthCheckInterval=300 in my slurm.conf. Do I need to use ANY?

I am using slurm 23.02.4

@szhengac
Author

szhengac commented Nov 8, 2023

I tried the standard check check_hw_cpuinfo in nhc but still had no luck. The helper scripts are still not called.

[2023-11-08T20:44:23.753] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  check_hw_cpuinfo:  Actual CPU thread count (176) does not match expected (1760).

@szhengac
Author

szhengac commented Nov 9, 2023

@OleHolmNielsen I think the helper script should be run by nhc rather than Slurm? Based on the log, nhc is definitely executed by Slurm.
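
The division of labour described here can be illustrated with a simplified model. This is not NHC's source code: run_check is a hypothetical function, and the only things borrowed from NHC are the documented variable names MARK_OFFLINE and OFFLINE_NODE. The point is that the checker itself calls the offline helper, while the scheduler only sees the exit code.

```shell
#!/usr/bin/env bash
# Simplified model, not NHC's actual implementation.
# run_check is hypothetical; MARK_OFFLINE and OFFLINE_NODE mirror NHC's
# documented variable names but are used here purely for illustration.
run_check() {
    local node="$1" check="$2"

    "$check"
    local rc=$?

    if [ "$rc" -ne 0 ]; then
        echo "ERROR:  Health check failed:  Check $check returned $rc"
        # The health checker invokes the helper itself; the caller
        # (e.g. slurmd) only receives the non-zero return code.
        if [ "${MARK_OFFLINE:-1}" = "1" ] && [ -n "${OFFLINE_NODE:-}" ]; then
            "$OFFLINE_NODE" "$node" "Check $check returned $rc"
        fi
    fi
    return "$rc"
}
```

Under this model, a slurmd.log entry with rc:1 shows only that the checker ran and failed. It says nothing about whether the helper was invoked.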

@szhengac
Author

I can now confirm that this is a bug in nhc 1.4.3. Reinstalling 1.4.3 does not help, but installing 1.4.2 fixes the problem.

@KasperSkytte

I can confirm 1.4.3 doesn't run the /usr/libexec/nhc/node-mark-offline script to drain a node with failing checks. I downgraded to 1.4.2 as @szhengac suggested, and it works fine. I tried debugging a bit, but didn't find the cause.

@KasperSkytte

Addendum: the node-mark-offline script itself works just fine.

@jbd

jbd commented Feb 6, 2024

Hello,

FWIW, I had this problem in 1.4.3 because scontrol was not in the PATH, so the auto-detection in nhcmain_find_rm didn't work. Setting NHC_RM in /etc/sysconfig/nhc worked for me.
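
A rough sketch of why a missing scontrol in the PATH would matter is given below. It is not the body of nhcmain_find_rm: guess_rm is a hypothetical function that only models "an explicit NHC_RM wins, otherwise look for scontrol in PATH".

```shell
#!/usr/bin/env bash
# Sketch only: guess_rm is hypothetical and merely models PATH-based
# resource-manager detection; it is not NHC's nhcmain_find_rm.
#
# Workaround reported in this thread, in /etc/sysconfig/nhc:
#   NHC_RM=slurm
guess_rm() {
    if [ -n "${NHC_RM:-}" ]; then
        echo "$NHC_RM"                          # explicit setting wins
    elif type -P scontrol >/dev/null 2>&1; then
        echo "slurm"                            # scontrol found in PATH
    else
        return 1                                # nothing detected
    fi
}
```

Running the function with the same PATH that slurmd passes to the health check program (an assumption worth verifying on your system) shows whether detection would succeed without the explicit NHC_RM setting.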

@Zoidmania

I believe we're being affected by this issue as well. Any movement on this? I'm experiencing exactly the same behavior as @szhengac, and I'm at my wit's end.
