
NHC returns false "OK" when checking for mounted GPFS filesystems #77

Open

novosirj opened this issue Jan 11, 2019 · 2 comments
novosirj commented Jan 11, 2019

To be honest, I'm not exactly sure whether this is because GPFS is doing something non-standard, or whether this would happen with any stale remote filesystem.

[root@node001 ~]# nhc -a

[root@node001 ~]# mount | grep projectsn
projectsn on /projectsn type gpfs (rw,relatime)

[root@node001 ~]# df -h /projectsn
df: '/projectsn': Stale file handle

This makes the filesystem check pretty unreliable, since a stale mount is one of the more likely things to go wrong. Any advice? This is with NHC 1.4.2, but I suspect the behavior is not version dependent.

mej (Owner) commented Apr 18, 2021

Hey Ryan!

Based on what I see here, NHC is reporting -- correctly -- that the filesystem is mounted. :-)

As you know, NHC very intentionally does not call df on each individual filesystem; in fact, check_fs_mount() doesn't even use the df command. Instead, it looks at the current mount namespace directly via /proc/self/mounts. One of the key problems NHC takes great pains to avoid is getting hung up on mounted network filesystems that have gone AWOL (e.g., NFS hard-mounts with a down or lagged server).
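
(For reference, a minimal sketch of the general idea, reading the mount table rather than touching the filesystem; this is an illustration, not NHC's actual check_fs_mount() code, and /projectsn is just an example mount point:)

# Illustrative only -- not NHC's implementation. Reading /proc/self/mounts
# never touches the filesystem itself, so it cannot hang on an unresponsive
# network mount; it only tells you whether the mount entry exists.
is_mounted() {
    local target="$1"
    local dev mntpt fstype rest
    while read -r dev mntpt fstype rest; do
        [[ "$mntpt" == "$target" ]] && return 0
    done < /proc/self/mounts
    return 1
}

is_mounted /projectsn && echo "/projectsn is mounted (per /proc/self/mounts)"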

I haven't touched GPFS in years, and we no longer use it at LANL...but I'm open to suggestions! 😀

By any chance have you looked at @treydock's GPFS check in #71? Would something like that help your use case?

@mej mej self-assigned this Apr 18, 2021
@mej mej added the bug label Apr 18, 2021
@mej mej added this to the 1.5 Release milestone Apr 18, 2021
@mej mej added the usability label and removed the bug label Apr 18, 2021
novosirj (Author) commented
I actually don't know that this is specific to GPFS; if anyone has a tip for how to create a stale file handle (I don't actually know how I would do it on purpose for NFS or GPFS), I could probably experiment some. Personally, I'd rather have NHC hang and report the hang than have it report a filesystem as "technically" mounted when the node is actually unusable. These situations are very bad because a stale file handle on a user filesystem will cause every job that lands on the node to fail, draining the entire job queue.
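
(For what it's worth, one commonly described way to reproduce a stale handle on NFS, untested here and purely a sketch with example paths, is to recreate the exported directory on the server while a client still has it mounted:)

# Hypothetical NFS-only reproduction, run on the NFS server; not verified,
# and not applicable to GPFS. Replacing the directory gives it a new inode,
# so the client's cached file handle goes stale (ESTALE) on the next access.
mv /export/projectsn /export/projectsn.old
mkdir /export/projectsn
exportfs -ra    # re-export; the client should now see "Stale file handle"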

Would stat be safe? I went hunting around on the web a little when you asked this question, and I see someone else is using stat -t to detect a stale file handle. Again, without a way to test, it's hard to know what the behavior would be. I completely understand not wanting to use df or anything else that is likely to hang.
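
(A rough sketch of what that could look like, guarded with a timeout so a hung filesystem can't wedge the check; the 5-second limit and /projectsn are example values, and this is not an NHC built-in:)

# Return 0 if the path stats cleanly, 1 if it is stale or otherwise broken.
# "timeout" (coreutils) kills stat if the filesystem hangs instead of erroring.
check_stale() {
    local mountpoint="$1" output
    output=$(timeout 5 stat -t "$mountpoint" 2>&1)
    if [[ $? -ne 0 ]]; then
        if [[ "$output" == *"Stale file handle"* ]]; then
            echo "STALE: $mountpoint"
        else
            echo "ERROR: $mountpoint ($output)"
        fi
        return 1
    fi
    return 0
}

check_stale /projectsn && echo "OK: /projectsn"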
