Cannot find check_nvsmi_healthmon() in 1.4.3 #113

Open
OleHolmNielsen opened this issue Mar 9, 2022 · 4 comments · Fixed by #114
@OleHolmNielsen

This new feature in 1.4.3:

check_nvsmi_healthmon(): New check from CSC for GPU health monitoring via nvidia-smi

doesn't seem to be present in the release RPM file lbnl_nv.nhc.
How does one use this new check?
Thanks,
Ole

mej self-assigned this Mar 11, 2022
mej added the "need info" label (Additional information required from user or community) Mar 11, 2022
@mej
Owner

mej commented Mar 11, 2022

Hey Ole!

The check_nvsmi_healthmon() check is defined at scripts/csc_nvidia_smi.nhc:27; it's not in lbnl_nv.nhc like the original nVidia checks. Do you see the same thing as that link shows, or do you have something different?

@OleHolmNielsen
Author

The file scripts/csc_nvidia_smi.nhc is absent from the 1.4.3 RPM package.
Could you add it?
Thanks,
Ole

mej added the "bug" and "help wanted" labels and removed the "need info" label Mar 18, 2022
@mej
Owner

mej commented Mar 18, 2022

Hey @OleHolmNielsen!

> The file scripts/csc_nvidia_smi.nhc is absent from the 1.4.3 RPM package. Could you add it?

Well, yes...and no. :-)

I do see exactly what the problem is. When I merged #5, I failed to notice that the PR created the file containing the check in the correct location and with the correct name, but did not also add it to the list of packaged script files in Makefile.am. All that needs to happen is for csc_nvidia_smi.nhc to be added to the nobase_dist_conf_DATA variable there.
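
(Editor's sketch, not part of the original comment.) An omission like this one could be caught with a small check that compares the files in scripts/ against what Makefile.am mentions. The function name `find_unpackaged_checks` is hypothetical, and the sketch assumes a plain fixed-string search of Makefile.am is good enough; it does not parse the nobase_dist_conf_DATA assignment itself.

```shell
#!/bin/sh
# Sketch: print every scripts/*.nhc file under a source tree that is not
# mentioned anywhere in that tree's Makefile.am.
# Assumption: a fixed-string match on "scripts/<name>.nhc" is sufficient;
# this does not verify which Automake variable the file is listed in.
find_unpackaged_checks() {
    srcdir=$1
    for path in "$srcdir"/scripts/*.nhc; do
        # Skip the unexpanded glob when no .nhc files exist.
        [ -e "$path" ] || continue
        rel="scripts/${path##*/}"
        if ! grep -qF -- "$rel" "$srcdir/Makefile.am"; then
            printf '%s\n' "$rel"
        fi
    done
}
```

Run from a checkout as `find_unpackaged_checks .`, this would print one line per check file that Makefile.am does not mention, and nothing when every file is listed.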

Having said that... The Fine Folks ("Feyn Folks?" 🤣 🤦) at the Feynman Center for Innovation are actively reviewing my disclosure filing and my request to open-source and publish our future DOE/LANL/Triad-owned contributions directly here on GitHub. They've never had a situation quite like this before (since most of NHC is already DOE/LBNL/UC-owned and was disclosed to DOE several years ago), so to avoid making it any more complicated than it already is, I've promised not to work on NHC publicly while they're trying to get all this figured out. And while I didn't anticipate this, I'm confident that it will totally be worth it in the long run!

So I can see 2 possible solutions here:

  1. You (or someone else?) can submit a Pull Request that makes the above-described change, and I can merge it.
  2. I can apply the above fix locally myself, but I won't be able to push it to GitHub or Bitbucket until FCI completes their review/approval process.

Thoughts?

@OleHolmNielsen
Author

OleHolmNielsen commented Mar 23, 2022 via email
