Help to pinpoint an intermittent runtime singularity error #2128
Unanswered
gmagklaras asked this question in Q&A
Replies: 1 comment · 2 replies
-
To my knowledge, this has only been reported before:
-
Hi,
The Setup:
We have compiled Singularity 3.11.3 with Go 1.20.5 and provide it as a module:

module load singularity/3.11.3
Loading singularity/3.11.3
  Loading requirement: go/1.20.5

Our HPC environment consists of RHEL 8.7 compute nodes:

[root@c6525-compute ~]# uname -a
Linux c6525-compute 4.18.0-425.19.2.el8_7.x86_64 #1 SMP Fri Mar 17 01:52:38 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
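For completeness, the toolchain that the module resolves to can be sanity-checked on a compute node with something like the following; the expected outputs in the comments are assumptions based on the module definitions above:

# confirm the versions the module actually loads
singularity --version   # should report 3.11.3
go version              # should report go1.20.5
# print the compiled-in state directories (LOCALSTATEDIR etc.)
singularity buildcfg | grep -i statedir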
[root@c6525-compute ~]# dnf list installed | grep squashfs
squashfs-tools.x86_64  4.3-20.el8  @anaconda
The SIF images are accessed over an NFS partition which has no networking issues (that we can see); some directories are bind-mounted from a Lustre filesystem, but the SIF image itself is not on Lustre.
The following variables are set:

SINGULARITY_TMPDIR=/tmp
SINGULARITY_SCRATCH=/modules/singularityscratch
LOCALSTATEDIR := /var/containerstate
RUNSTATEDIR := /var/containerstate/run
/tmp is local (/dev/mapper/vg00-lv_root on / type ext4 (rw,relatime,stripe=16))
/modules/singularityscratch is NFS (nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,...,local_lock=none,...))
/var is local (/dev/mapper/vg00-lv_var_log on /var/log type ext4 (rw,relatime,stripe=16))
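A quick way to double-check which filesystem each of these paths actually resolves to is findmnt from util-linux; the last path below is a placeholder for the directory that holds the SIF images:

findmnt -T /tmp
findmnt -T /modules/singularityscratch
findmnt -T /var/containerstate
findmnt -T /nfs/images   # placeholder: directory holding the SIF files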
The Problem:
Most containers run properly most of the time but fail intermittently: for example, if we run a container 20 times with the same parameters, bind arguments, etc., it will execute properly 18 times and fail twice. The failures appear on different nodes, which rules out a hardware issue. When a failure occurs, we see many kernel error messages of the following type on the nodes:
SQUASHFS error: squashfs_read_data failed to read block 0x9bc97ec
At the same time, the container fails with error messages indicating that various shared libraries cannot be read:

/usr/bin/fimex: error while loading shared libraries: /lib/x86_64-linux-gnu/libicudata.so.66: cannot read file data: Input/output error
We know for certain that these libraries exist in the container images, given the many successful executions of the same container before it fails.
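To catch a failing run in the act, a loop along the following lines could be used; the image path and the fimex invocation are placeholders for the real workload:

#!/bin/bash
# Sketch: run the same container repeatedly; on failure, check the library
# inside the image and pull any squashfs errors from the kernel log.
IMAGE=/nfs/images/fimex.sif   # placeholder path
FAILS=0
for i in $(seq 1 20); do
    if ! singularity exec "$IMAGE" fimex --version > /dev/null 2>&1; then
        FAILS=$((FAILS + 1))
        echo "run $i failed at $(date)"
        # is the library readable inside the image right now?
        singularity exec "$IMAGE" ls -l /lib/x86_64-linux-gnu/libicudata.so.66
        # recent squashfs errors from the kernel ring buffer
        dmesg -T | grep -i squashfs | tail -n 5
    fi
done
echo "$FAILS of 20 runs failed"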
The other thing we observe is that after we reboot the compute nodes, it takes a while (from several days to weeks) for the problem to reappear. We have monitored the nodes for signs of memory pressure, but we see no trends indicating that the nodes were starved of RAM when the error occurs.
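Since a reboot temporarily clears the problem, one cheaper experiment, which we have not yet verified, would be to check whether flushing the page cache on an affected node clears it too; if it does, that would point at cached squashfs/NFS data going stale rather than at the image itself:

# snapshot memory/cache state on an affected node (as root)
grep -E 'MemFree|MemAvailable|^Cached|Slab' /proc/meminfo
# flush page cache, dentries and inodes without a reboot
sync
echo 3 > /proc/sys/vm/drop_caches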
We therefore suspect squashfs or another kernel-related issue, and we would like some help, or at least to hear whether someone else has faced the same problem with RHEL 8 and Singularity.
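Because the SIF is read over NFS, and squashfs_read_data can fail when the underlying filesystem returns a short or failed read, it may also be worth correlating the errors with the NFS client counters. The mount point grepped below is the scratch share from above and is only an example; substitute the share that actually holds the SIF images:

# NFS client retransmission/error counters; rising retrans values at
# failure time would implicate the transport rather than squashfs itself
nfsstat -c
# per-mount RPC statistics for the share holding the SIF images
grep -A 25 '/modules/singularityscratch' /proc/self/mountstats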
Many thanks for any helpful pointers.