Question for faster execution: Seeing cpu_info add 10 secs to execution #141

jebbaxley · 2023-09-01T08:41:34Z

Strangely, when I add this cpu_info check, the script takes 10 seconds longer to execute.

|| check_hw_cpuinfo 2 128 255

Am I adding this incorrectly? Also, how can I be sure NHC is running the checks in parallel for faster execution? I'm trying to minimize the time spent on health checking.

time with:
real 0m11.548s
user 0m0.246s
sys 0m10.159s

time without:
real 0m0.119s
user 0m0.062s
sys 0m0.018s

mej · 2023-09-19T01:03:55Z

Hey Jeb! Great to hear from you again! 😃

Not sure how I missed seeing this before... Good thing I checked the Pulse page. 😖

What version of NHC is it that you're running? For this specific check, I'd strongly recommend using the NHC 1.5 code currently in the dev branch; while 1.5 hasn't been released yet, the dev branch has a fix for this exact issue -- #121 (commit 7e2a8c6). (At least I think that's what you're seeing.)

Feedback on the fix is definitely welcome!

You might also be able to get away with just dropping in the scripts/lbnl_hw.nhc from the dev branch. I've never tried this myself, exactly, but they should be pretty self-contained. Of course, you'd also need test/test_lbnl_hw.nhc dropped in too if you wanted to run the unit tests for the new module. Feedback on this method is also welcome, if you decide to try it.

Of course, if it would make things easier on you, I'm happy to provide snapshot tarballs and/or RPMs; just let me know!

jebbaxley · 2023-09-19T10:59:28Z

Thanks for getting back to me! I’m currently trying to incorporate this with a new workload manager. Is there a simple way to provide scripts that drain and undrain?

mej · 2023-09-19T22:46:24Z

> Thanks for getting back to me! I’m currently trying to incorporate this with a new workload manager. Is there a simple way to provide scripts that drain and undrain?

In the default configuration, the scripts that handle draining/offlining and undraining/onlining nodes are node-mark-offline and node-mark-online, respectively. By default, they get installed into /usr/libexec/nhc/ (or /usr/lib/nhc/ on Debian). Modifying those scripts is one option -- and if you're considering contributing your support for this other WLM to the upstream project, this would definitely be the way to go! -- since the handling of the different RM/WLM products is pretty straightforward. Another option would be to change the values of the OFFLINE_NODE and ONLINE_NODE config variables; those control what commands NHC will use to drain or resume a node.
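For the second option, a replacement helper could look roughly like the sketch below. Everything specific to the in-house WLM is made up for illustration: `mywlm-admin` and its `drain`/`resume` subcommands are hypothetical, and the argument order (hostname first, then the failure note) is an assumption that should be checked against the stock node-mark-offline before relying on it. The functions only print the command they would run, so the logic can be tried safely.

```shell
#!/bin/sh
# Sketch of drain/resume helpers for a custom WLM.
# ASSUMPTION: "mywlm-admin" is a hypothetical admin CLI, not a real tool.
WLM_CLI="${WLM_CLI:-mywlm-admin}"

# Print the command that would drain a node.
# ASSUMPTION: $1 is the hostname and any remaining arguments form the note.
drain_command() {
    [ "$#" -ge 1 ] || return 2          # a hostname is required
    local host note
    host="$1"
    shift
    note="${*:-NHC health check failed}" # fall back to a generic reason
    printf '%s drain --node %s --reason "%s"\n' "$WLM_CLI" "$host" "$note"
}

# Print the command that would return a node to service.
resume_command() {
    [ "$#" -ge 1 ] || return 2          # a hostname is required
    printf '%s resume --node %s\n' "$WLM_CLI" "$1"
}
```

A real helper would execute the command instead of printing it. Pointing OFFLINE_NODE at a script built around `drain_command`, and ONLINE_NODE at one built around `resume_command`, would leave the stock helpers untouched.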

jebbaxley · 2023-09-20T06:36:34Z

Thanks! I had found those as well. I’ll ask if the team wants to push it upstream, but doubt they’ll want to as the WLM was built in house for their specific workload. I saw frontier was released, how’s the new cluster doing? And how’s the team? Hope the crazy on call has calmed down.

mej self-assigned this Sep 19, 2023

mej added bug question labels Sep 19, 2023
