
NHC returns false "OK" when checking for mounted GPFS filesystems #77

Open

novosirj opened this issue Jan 11, 2019 · 2 comments
novosirj commented Jan 11, 2019

To be honest, I'm not exactly sure whether this is because GPFS is doing something non-standard, or whether this would happen with any stale remote filesystem.

[root@node001 ~]# nhc -a

[root@node001 ~]# mount | grep projectsn
projectsn on /projectsn type gpfs (rw,relatime)

[root@node001 ~]# df -h /projectsn
df: '/projectsn': Stale file handle

This makes the filesystem check pretty unreliable, since a stale mount is one of the more likely things to go wrong. Any advice? This is with NHC 1.4.2, but I suspect the behavior is not version dependent.

mej (Owner) commented Apr 18, 2021

Hey Ryan!

Based on what I see here, NHC is reporting -- correctly -- that the filesystem is mounted. :-)

As you know, NHC very intentionally does not call df on each individual filesystem; in fact, check_fs_mount() doesn't even use the df command. Instead, it looks at the current mount namespace directly via /proc/self/mounts. One of the key problems NHC takes great pains to avoid is getting hung up on mounted network filesystems that have gone AWOL (e.g., NFS hard-mounts with a down or lagged server).
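
(For reference, a minimal sketch of the general idea, reading the mount table rather than touching the filesystem; this is an illustration, not NHC's actual check_fs_mount() code, and /projectsn is just an example mount point:)

# Illustrative only -- not NHC's implementation. Reading /proc/self/mounts
# never touches the filesystem itself, so it cannot hang on an unresponsive
# network mount; it only tells you whether the mount entry exists.
is_mounted() {
    local target="$1"
    local dev mntpt fstype rest
    while read -r dev mntpt fstype rest; do
        [[ "$mntpt" == "$target" ]] && return 0
    done < /proc/self/mounts
    return 1
}

is_mounted /projectsn && echo "/projectsn is mounted (per /proc/self/mounts)"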

I haven't touched GPFS in years, and we no longer use it at LANL...but I'm open to suggestions! 😀

By any chance have you looked at @treydock's GPFS check in #71? Would something like that help your use case?

@mej mej self-assigned this Apr 18, 2021
@mej mej added the bug label Apr 18, 2021
@mej mej added this to the 1.5 Release milestone Apr 18, 2021
@mej mej added the usability label and removed the bug label Apr 18, 2021
novosirj (Author) commented
I actually don't know that this is specific to GPFS; if anyone has a tip for how to create a stale file handle (I don't actually know how I would do it on purpose for NFS or GPFS), I could probably experiment some. Personally, I'd rather have NHC hang and report the hang than have it report a filesystem as "technically" mounted when the node is actually unusable. These situations are very bad because a stale file handle on a user filesystem will cause every job that lands on the node to fail, draining the entire job queue.
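
(For what it's worth, one commonly described way to reproduce a stale handle on NFS, untested here and purely a sketch with example paths, is to recreate the exported directory on the server while a client still has it mounted:)

# Hypothetical NFS-only reproduction, run on the NFS server; not verified,
# and not applicable to GPFS. Replacing the directory gives it a new inode,
# so the client's cached file handle goes stale (ESTALE) on the next access.
mv /export/projectsn /export/projectsn.old
mkdir /export/projectsn
exportfs -ra    # re-export; the client should now see "Stale file handle"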

Would stat be safe? I went hunting around on the web a little when you asked this question, and I see someone else is using stat -t to detect a stale file handle. Again, without a way to test, it's hard to know what the behavior would be. I completely understand not wanting to use df or anything else that is likely to hang.
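
(A rough sketch of what that could look like, guarded with a timeout so a hung filesystem can't wedge the check; the 5-second limit and /projectsn are example values, and this is not an NHC built-in:)

# Return 0 if the path stats cleanly, 1 if it is stale or otherwise broken.
# "timeout" (coreutils) kills stat if the filesystem hangs instead of erroring.
check_stale() {
    local mountpoint="$1" output
    output=$(timeout 5 stat -t "$mountpoint" 2>&1)
    if [[ $? -ne 0 ]]; then
        if [[ "$output" == *"Stale file handle"* ]]; then
            echo "STALE: $mountpoint"
        else
            echo "ERROR: $mountpoint ($output)"
        fi
        return 1
    fi
    return 0
}

check_stale /projectsn && echo "OK: /projectsn"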
