Question for faster execution: Seeing cpu_info add 10 secs to execution #141

Open
jebbaxley opened this issue Sep 1, 2023 · 4 comments

Strangely, when I add this cpu_info check, the script takes 10 seconds longer to execute.

  * || check_hw_cpuinfo 2 128 255

Am I adding this incorrectly? Also, how can I be sure that NHC is running the checks in parallel for faster execution? I'm attempting to minimize the time spent on health checking.

time with:
real 0m11.548s
user 0m0.246s
sys 0m10.159s

time without:
real 0m0.119s
user 0m0.062s
sys 0m0.018s

mej (Owner) commented Sep 19, 2023

Hey Jeb! Great to hear from you again! 😃

Not sure how I missed seeing this before... Good thing I checked the Pulse page. 😖

What version of NHC is it that you're running? For this specific check, I'd strongly recommend using the NHC 1.5 code currently in the dev branch; while 1.5 hasn't been released yet, the dev branch has a fix for this exact issue -- #121 (commit 7e2a8c6). (At least I think that's what you're seeing.)

Feedback on the fix is definitely welcome!

You might also be able to get away with just dropping in the scripts/lbnl_hw.nhc from the dev branch. I've never tried this myself, exactly, but they should be pretty self-contained. Of course, you'd also need test/test_lbnl_hw.nhc dropped in too if you wanted to run the unit tests for the new module. Feedback on this method is also welcome, if you decide to try it.
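
The drop-in approach could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a tested procedure: the helper name `install_check_file` is hypothetical, and the destination directory in the usage comment (`/etc/nhc/scripts`) is an assumption about where your installation keeps its check scripts. The helper keeps a backup of any existing copy so the change is easy to revert.

```shell
# Hypothetical helper: copy a check file from a dev-branch checkout over the
# installed copy, keeping a backup of the previous version.
install_check_file() {
    local src="$1" dest_dir="$2"
    local name dest
    name="$(basename "$src")"
    dest="$dest_dir/$name"

    # Refuse to continue if the source file is missing.
    [ -f "$src" ] || { echo "missing source: $src" >&2; return 1; }

    # Keep the previous version so the change can be rolled back.
    if [ -f "$dest" ]; then
        cp -p "$dest" "$dest.bak" || return 1
    fi

    cp -p "$src" "$dest"
}

# Example usage (both paths are assumptions; adjust to your site):
#   install_check_file ./nhc/scripts/lbnl_hw.nhc /etc/nhc/scripts
```

To roll back, move the `.bak` file over the new copy.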

Of course, if it would make things easier on you, I'm happy to provide snapshot tarballs and/or RPMs; just let me know!

mej self-assigned this Sep 19, 2023

jebbaxley (Author) commented Sep 19, 2023 via email

mej (Owner) commented Sep 19, 2023

> Thanks for getting back to me! I’m currently trying to incorporate this with a new workload manager. Is there a simple way to provide scripts that drain and undrain?

In the default configuration, the scripts that handle draining/offlining and undraining/onlining nodes are node-mark-offline and node-mark-online, respectively. By default, they get installed into /usr/libexec/nhc/ (or /usr/lib/nhc/ on Debian). Modifying those scripts is one option -- and if you're considering contributing your support for this other WLM to the upstream project, this would definitely be the way to go! -- since the handling of the different RM/WLM products is pretty straightforward. Another option would be to change the values of the OFFLINE_NODE and ONLINE_NODE config variables; those control what commands NHC will use to drain or resume a node.
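
As a rough illustration of the second option, site-specific helpers could translate NHC's drain and resume requests into whatever the new workload manager expects. Everything named here is hypothetical: `mywlm-admin` stands in for your WLM's admin command, its `drain`/`resume` subcommands and `--reason` flag are invented, and so are the `WLM_CMD` variable and the function names. The sketch also assumes the helper receives the hostname first and the note as the remaining arguments; check that against the stock node-mark-offline script before relying on it.

```shell
# Sketch of drain/resume helpers for a WLM that NHC does not support out of
# the box. "mywlm-admin" is a placeholder, not a real command.
WLM_CMD="${WLM_CMD:-mywlm-admin}"

# Assumed calling convention: hostname first, then the free-form note.
drain_node() {
    [ -n "${1:-}" ] || { echo "usage: drain_node HOST [NOTE...]" >&2; return 2; }
    local host="$1"
    shift
    local note="$*"
    # Hypothetical WLM syntax; replace with your WLM's real drain command.
    "$WLM_CMD" drain "$host" --reason "NHC: ${note:-unspecified}"
}

resume_node() {
    [ -n "${1:-}" ] || { echo "usage: resume_node HOST" >&2; return 2; }
    local host="$1"
    # Hypothetical WLM syntax; replace with your WLM's real resume command.
    "$WLM_CMD" resume "$host"
}
```

Each function could be saved as its own small script that ends by calling the function with `"$@"`, and OFFLINE_NODE and ONLINE_NODE could then be set to the paths of those scripts. Setting `WLM_CMD=echo` is a quick way to see which command would be run without touching any nodes.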

jebbaxley (Author) commented Sep 20, 2023 via email
