Question: Custom Check, How to exit without any changes, i.e. leave node in current state? #139

Open
flakrat opened this issue Aug 15, 2023 · 3 comments
@flakrat

flakrat commented Aug 15, 2023

Howdy, we have a custom check that retrieves a metric value from Prometheus using curl.

Edit: we are using Slurm as our resource manager.

The check works great; however, I need to add code to the check to prevent NHC from changing the state of the node (draining or un-draining it) if the curl command fails. Examples:

  • The Prometheus server is not responding
  • The query doesn't return any metric (could happen if node_exporter died on the node)
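For illustration, both failure modes can be told apart from a real reading before any threshold logic runs. The sketch below assumes the standard Prometheus instant-query endpoint (`/api/v1/query`) and a query that returns a single sample. The function name `prom_query_value`, the `PROM_URL` default, and the "return 2 means no data" convention are hypothetical and are not part of NHC.

```shell
# Hypothetical helper, not part of NHC.
# PROM_URL is an assumed placeholder; point it at your own Prometheus server.
PROM_URL="${PROM_URL:-http://prometheus.example.com:9090}"

# Usage: prom_query_value 'PROMQL'
# Prints the sample value and returns 0 on success.
# Returns 2 when no usable metric was obtained (server unreachable,
# HTTP error, or an empty result set).
prom_query_value() {
    local query="$1" body value

    # --fail turns HTTP errors into a non-zero exit status;
    # --max-time keeps a hung server from stalling the whole NHC run.
    body=$(curl --silent --fail --max-time 10 --get \
        --data-urlencode "query=${query}" \
        "${PROM_URL}/api/v1/query") || return 2

    # Pull the value out of ..."value":[<timestamp>,"<value>"]...
    # An empty result ("result":[]) has no such field, so nothing is printed.
    value=$(printf '%s' "$body" |
        sed -n 's/.*"value":\[[^,]*,"\([^"]*\)"].*/\1/p')

    [ -n "$value" ] || return 2
    printf '%s\n' "$value"
}
```

A caller can then treat a return code of 2 as "could not measure" rather than as a pass or a failure.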

Is there a way to return from the function such that NHC makes no changes to the node's state?

  • return 0 indicates no failure and triggers an un-drain if the node is already drained, so I can't use that
  • return 1 or any number indicates failure and drains the node.

Thanks,

Mike Hanby
UAB IT Research Computing

@flakrat
Author

flakrat commented Aug 16, 2023

I ran a few tests, and it appears that calling nhcmain_finish bypasses the code that drains or un-drains the node. However, I believe this would also skip any checks that come later in the configuration.

I guess putting this particular check at the end of nhc.conf would mitigate this, but it's still hacky.

@mej
Owner

mej commented Aug 29, 2023

So to make sure I understand... You want the check to fail if curl successfully retrieves the metric and its value is above a certain threshold, but you want it to pass if it can't obtain a valid metric to test against, though in this case you don't want the node put back into service either?

At present, NHC doesn't really have a "soft fail" or a concept of a partially (un)healthy node, and that was really by design. You can, however, make changes to existing configuration values from within the code for your check. So if you wanted the check to pass but disallow "undraining" of the node, you can do something like ONLINE_NODE=: and then return 0 from your check. This would still allow subsequent checks to drain the node if they failed but keep an otherwise healthy node from being returned to service. Is that what you're wanting?
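A minimal sketch of that approach might look like the following. Only `ONLINE_NODE` and `die` come from NHC. The names `check_metric_max`, `fetch_metric`, and `METRIC_URL` are hypothetical, and the sketch assumes the metric is a non-negative integer. The placeholder `fetch_metric` stands in for the site's real Prometheus lookup.

```shell
# Fallback so the sketch can be exercised outside NHC; inside NHC the
# real die function is used instead.
command -v die >/dev/null 2>&1 || die() { echo "ERROR: $2" >&2; }

# Hypothetical placeholder: print the metric value and return 0, or
# return non-zero when no value is available. METRIC_URL is assumed to
# be an endpoint that returns a bare number.
fetch_metric() {
    curl --silent --fail --max-time 10 "$METRIC_URL"
}

# Usage in nhc.conf (hypothetical):  * || check_metric_max 90
check_metric_max() {
    local max="$1" value

    # No data at all: pass, but make the "online" step a no-op so that
    # an already-drained node is not returned to service.
    if ! value=$(fetch_metric) || [ -z "$value" ]; then
        ONLINE_NODE=:
        return 0
    fi

    # Anything that is not a plain integer is treated as "no data" too.
    case "$value" in
        *[!0-9]*) ONLINE_NODE=:; return 0 ;;
    esac

    if [ "$value" -gt "$max" ]; then
        die 1 "check_metric_max: value $value exceeds limit $max"
        return 1
    fi
    return 0
}
```

Because the check returns 0 in the no-data case, later checks still run and can still drain the node if they fail.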

Feel free to share the code in question if that might help clarify what you're shooting for here! 😀

@mej mej self-assigned this Aug 29, 2023
@mej mej added the question label Aug 29, 2023
@flakrat
Author

flakrat commented Aug 31, 2023

> So if you wanted the check to pass but disallow "undraining" of the node, you can do something like ONLINE_NODE=: and then return 0 from your check. This would still allow subsequent checks to drain the node if they failed but keep an otherwise healthy node from being returned to service.

This is what I'm after, thanks.

Here's the code: https://gitlab.rc.uab.edu/rc/rc-nhc/-/blob/main/uabrc_hw.nhc
