Help to pinpoint an intermittent runtime singularity error #2128
Unanswered
gmagklaras asked this question in Q&A
Replies: 1 comment · 2 replies
-
To my knowledge, this has only been reported before:
-
Hi,
The Setup:
We have compiled Singularity 3.11.3 with Go 1.20.5 and provide it as a module:

module load singularity/3.11.3
Loading singularity/3.11.3
  Loading requirement: go/1.20.5

Our HPC environment consists of RHEL 8.7 compute nodes:

[root@c6525-compute ~]# uname -a
Linux c6525-compute 4.18.0-425.19.2.el8_7.x86_64 #1 SMP Fri Mar 17 01:52:38 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
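For completeness, the toolchain that the module resolves to can be sanity-checked on a compute node with something like the following; the expected outputs in the comments are assumptions based on the module definitions above:

# confirm the versions the module actually loads
singularity --version   # should report 3.11.3
go version              # should report go1.20.5
# print the compiled-in state directories (LOCALSTATEDIR etc.)
singularity buildcfg | grep -i statedir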
[root@c6525-compute ~]# dnf list installed | grep squashfs
squashfs-tools.x86_64  4.3-20.el8  @anaconda
The SIF images are accessed over an NFS partition which has no networking issues (that we can see); some directories are bind-mounted from a Lustre filesystem, but the SIF image itself is not on Lustre.
The following variables are set:

SINGULARITY_TMPDIR=/tmp
SINGULARITY_SCRATCH=/modules/singularityscratch
LOCALSTATEDIR := /var/containerstate
RUNSTATEDIR := /var/containerstate/run
/tmp is local (/dev/mapper/vg00-lv_root on / type ext4 (rw,relatime,stripe=16))
/modules/singularityscratch is NFS (nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,...,local_lock=none,...))
/var is local (/dev/mapper/vg00-lv_var_log on /var/log type ext4 (rw,relatime,stripe=16))
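A quick way to double-check which filesystem each of these paths actually resolves to is findmnt from util-linux; the last path below is a placeholder for the directory that holds the SIF images:

findmnt -T /tmp
findmnt -T /modules/singularityscratch
findmnt -T /var/containerstate
findmnt -T /nfs/images   # placeholder: directory holding the SIF files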
The Problem:
Most containers run properly most of the time but fail intermittently: for example, if we run a container 20 times with the same parameters, bind arguments, etc., it will execute properly 18 times and fail twice. The failures appear on different nodes, which rules out a hardware issue. When a failure occurs, we see many kernel error messages of the following type on the nodes:
SQUASHFS error: squashfs_read_data failed to read block 0x9bc97ec
At the same time, the container fails with error messages indicating that various shared libraries cannot be read:

/usr/bin/fimex: error while loading shared libraries: /lib/x86_64-linux-gnu/libicudata.so.66: cannot read file data: Input/output error
We know for certain that these libraries exist in the container images, given the many successful executions of the same container before it fails.
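To catch a failing run in the act, a loop along the following lines could be used; the image path and the fimex invocation are placeholders for the real workload:

#!/bin/bash
# Sketch: run the same container repeatedly; on failure, check the library
# inside the image and pull any squashfs errors from the kernel log.
IMAGE=/nfs/images/fimex.sif   # placeholder path
FAILS=0
for i in $(seq 1 20); do
    if ! singularity exec "$IMAGE" fimex --version > /dev/null 2>&1; then
        FAILS=$((FAILS + 1))
        echo "run $i failed at $(date)"
        # is the library readable inside the image right now?
        singularity exec "$IMAGE" ls -l /lib/x86_64-linux-gnu/libicudata.so.66
        # recent squashfs errors from the kernel ring buffer
        dmesg -T | grep -i squashfs | tail -n 5
    fi
done
echo "$FAILS of 20 runs failed"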
The other thing we observe is that after we reboot the compute nodes, it takes a while (from several days to weeks) for the problem to reappear. We have monitored the nodes for signs of memory pressure, but we see no trends indicating that the nodes were starved of RAM when the error occurs.
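Since a reboot temporarily clears the problem, one cheaper experiment, which we have not yet verified, would be to check whether flushing the page cache on an affected node clears it too; if it does, that would point at cached squashfs/NFS data going stale rather than at the image itself:

# snapshot memory/cache state on an affected node (as root)
grep -E 'MemFree|MemAvailable|^Cached|Slab' /proc/meminfo
# flush page cache, dentries and inodes without a reboot
sync
echo 3 > /proc/sys/vm/drop_caches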
We therefore suspect squashfs or another kernel-related issue, and we would like some help, or at least to hear whether someone else has faced the same problem with RHEL 8 and Singularity.
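Because the SIF is read over NFS, and squashfs_read_data can fail when the underlying filesystem returns a short or failed read, it may also be worth correlating the errors with the NFS client counters. The mount point grepped below is the scratch share from above and is only an example; substitute the share that actually holds the SIF images:

# NFS client retransmission/error counters; rising retrans values at
# failure time would implicate the transport rather than squashfs itself
nfsstat -c
# per-mount RPC statistics for the share holding the SIF images
grep -A 25 '/modules/singularityscratch' /proc/self/mountstats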
Many thanks for any helpful pointers.