Add GPFS health check #71

Open
wants to merge 6 commits into
base: dev

treydock
Contributor

@treydock treydock commented Nov 1, 2018

I have only deployed this on one system, one where I knew there were GPFS network issues with nodes not using the RDMA that was configured:

[root@p0001 ~]# nhc
ERROR:  nhc:  Health check failed:  check_gpfs_health NETWORK: GPFS health for "NETWORK" is FAILED

Configured check:

* || check_gpfs_health NETWORK

One behavior I am not sure about is what to do if the configured component isn't found in the output. Right now, if you run check_gpfs_health FOO, there is no warning of failure.
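One possible way to handle the missing-component case is to return a distinct result when the name never appears, so the caller can decide whether that counts as a failure. The sketch below is only an illustration: the helper name gpfs_component_state is hypothetical and not part of this PR, the simplified "NAME STATUS" input format is an assumption rather than real mmhealth output, and treating HEALTHY as the only passing state is also an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of this PR: look up one component's status
# in "NAME STATUS ..." lines read from stdin (assumed, simplified format).
# Return codes: 0 = HEALTHY, 1 = found but not HEALTHY, 2 = not found.
gpfs_component_state() {
    local want="$1" name status rest
    while read -r name status rest; do
        if [[ "$name" == "$want" ]]; then
            echo "$status"
            [[ "$status" == "HEALTHY" ]] && return 0
            return 1
        fi
    done
    # The component never appeared: report it instead of passing silently.
    echo "MISSING"
    return 2
}

# Made-up sample input, for illustration only.
sample=$'GPFS HEALTHY\nNETWORK FAILED'
gpfs_component_state NETWORK <<< "$sample" || echo "rc=$?"   # prints FAILED, then rc=1
gpfs_component_state FOO <<< "$sample" || echo "rc=$?"       # prints MISSING, then rc=2
```

With a separate return code for "not found", the check could fail on a typo like FOO, or only warn, depending on what behavior is preferred.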

@treydock
Contributor Author

treydock commented Nov 1, 2018

One thing that could probably be improved is making the path to mmhealth configurable instead of hardcoding it.

@treydock
Contributor Author

treydock commented Nov 5, 2018

Made path to mmhealth configurable and updated README.
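For readers who want a picture of what a configurable path can look like in a bash check, here is a minimal sketch. The variable name MMHEALTH_CMD and the helper resolve_mmhealth are assumptions made for illustration and may not match what the PR actually uses; the default path is the one shown elsewhere in this thread.

```bash
#!/usr/bin/env bash
# Hypothetical variable and helper names; the PR's actual setting may differ.
# Use MMHEALTH_CMD if it is set and non-empty, else fall back to the usual path.
resolve_mmhealth() {
    echo "${MMHEALTH_CMD:-/usr/lpp/mmfs/bin/mmhealth}"
}

resolve_mmhealth   # prints the override if MMHEALTH_CMD is set, else the default path
```

The `${VAR:-default}` expansion keeps the old hardcoded value as the default, so existing configurations would keep working unchanged.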

@treydock
Contributor Author

We noticed that something in GPFS can make mmhealth unreliable, but that mmfsadm test verbs status is another way to test that GPFS is actually using RDMA and not falling back to Ethernet. Added a check_gpfs_verbs_status check.

@mej mej self-assigned this Jan 1, 2019
@mej mej added the enhancement label Jan 1, 2019
@mej mej added this to the 1.4.4 Release milestone Jan 1, 2019
@mej mej modified the milestones: 1.4.4 Release, 1.4.4 Release (new), 1.5 Release Apr 17, 2021
@novosirj

We currently run mmfsadm test verbs status as a test for the same reason. Our test looks like this:

# Make sure GPFS RDMA VERBS started
* || check_cmd_output -m "VERBS RDMA status: started" /usr/lpp/mmfs/bin/mmfsadm test verbs status

So very similar. Worth noting is that this was broken for a little while pretty recently (some version of 4.2.3.x it must have been), and in the interim we had to do this instead:

* || check_cmd_output -m '/^\ +VerbsRdmaStarted\ +:\ yes$/' /usr/lpp/mmfs/bin/mmfsadm test verbs config

IBM helped us figure that one out (which I guess is only fair as they broke mmfsadm).
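If a site wanted to tolerate both situations at once, the two matches above could be combined so that either one passes. This is only a rough sketch: the function name verbs_started is hypothetical, it takes already-captured command output as arguments rather than running mmfsadm itself, and the two patterns are copied from the checks above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if either captured output shows RDMA started.
#   $1 = text from "mmfsadm test verbs status"
#   $2 = text from "mmfsadm test verbs config"
verbs_started() {
    local status_out="$1" config_out="$2"
    # Preferred match: the status line used in the first check above.
    if grep -q 'VERBS RDMA status: started' <<< "$status_out"; then
        return 0
    fi
    # Fallback match: the config line used in the interim check above.
    grep -Eq '^ +VerbsRdmaStarted +: yes$' <<< "$config_out"
}

# Example with literal strings standing in for real command output.
verbs_started 'VERBS RDMA status: started' '' && echo "RDMA started"
```

On a real node the arguments would come from command substitution, for example "$(/usr/lpp/mmfs/bin/mmfsadm test verbs status)".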

For mmhealth though, we don't run the monitoring stuff on our compute nodes, so mmhealth node show isn't of any use to us. We've considered it, but I'm not sure it's that commonplace on compute nodes.

[root@hal0003 ~]# /usr/lpp/mmfs/bin/mmhealth node show
The monitoring service is down and does not respond, please restart it with 'mmsysmoncontrol restart'

@mej
Owner

mej commented Apr 18, 2021

This looks awesome, Trey! This will go into nhc/dev as soon as 1.4.3 is out the door. Thanks much!

@mej mej self-requested a review April 18, 2021 16:58