
Failed to change ownership on socket #318

Open
dippynark opened this issue Jul 24, 2024 · 5 comments
Labels
bug Something isn't working

Comments


dippynark commented Jul 24, 2024

Issue

We have observed the following error when using the GCS FUSE CSI Driver on GKE:

/csi.v1.Node/NodePublishVolume failed with error: rpc error: code = Internal desc = failed to mount volume "[REDACTED]" to target path "/var/lib/kubelet/pods/[REDACTED]/volumes/kubernetes.io~csi/[REDACTED]/mount": failed to change ownership on socket: chown ./socket: no such file or directory

It appears the socket file could not be found after being created: https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/blob/v1.4.1/pkg/csi_mounter/csi_mounter.go#L165-L167

Perhaps there is a race condition when changing directory? https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/blob/v1.4.1/pkg/csi_mounter/csi_mounter.go#L142-L144
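
To make the suspected race concrete, here is a minimal, hypothetical Go sketch (not the driver's actual code; the function names and paths are made up) of how a process-wide working-directory change combined with relative socket paths could go wrong under concurrent mount calls:

```go
// Hypothetical reproduction of the suspected race (illustrative only). The
// working directory is process-wide, so if a second goroutine calls os.Chdir
// between the first goroutine's net.Listen and os.Chown, the relative
// "./socket" path resolves in the wrong directory and the chown fails with
// "no such file or directory".
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"sync"
)

func mountLikeOperation(targetPath string, uid, gid int) error {
	if err := os.Chdir(targetPath); err != nil { // process-wide side effect
		return err
	}
	_ = os.Remove("./socket")                // allow re-running the sketch
	l, err := net.Listen("unix", "./socket") // socket created inside targetPath
	if err != nil {
		return err
	}
	defer l.Close()
	// If another goroutine has chdir'd elsewhere by now, "./socket" no longer
	// points at the file that was just created.
	return os.Chown("./socket", uid, gid)
}

func main() {
	dirs := []string{
		filepath.Join(os.TempDir(), "volA"),
		filepath.Join(os.TempDir(), "volB"),
	}
	var wg sync.WaitGroup
	for _, d := range dirs {
		_ = os.MkdirAll(d, 0o755)
		wg.Add(1)
		go func(dir string) {
			defer wg.Done()
			if err := mountLikeOperation(dir, os.Getuid(), os.Getgid()); err != nil {
				fmt.Println(filepath.Base(dir), "error:", err)
			}
		}(d)
	}
	wg.Wait()
}
```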

Impact

This issue seemed to cause the outcome described in the known issues doc where FUSE mount operations hang. I guess this is because socket creation happens after creating the FUSE mount but before passing the file descriptor to the GCS FUSE CSI Driver sidecar.

This interacted with a known kubelet issue where Pod cleanup hangs due to an unresponsive volume mount: kubernetes/kubernetes#101622

This then led to all Pod actions stalling on the node: https://github.com/kubernetes/kubernetes/blob/v1.27.0/pkg/kubelet/kubelet.go#L148-L151

Confusingly, the node was not marked as unhealthy when this happened; however, this seems to be due to an unrelated GKE node-problem-detector misconfiguration which I won't go into here. Unfortunately, since this occurred in a production environment, we needed to manually delete the node to bring the cluster back to a healthy state, so it is no longer around to verify this theory.

This issue has happened twice now on different nodes in the same cluster over the last week.

Note that the kubelet issue seems to have been fixed now, but not in the version of Kubernetes we are using: kubernetes/kubernetes#119968

Environment

GKE version: v1.27.11-gke.1062004
GCS FUSE version: v1.4.1-gke.0

songjiaxun (Collaborator) commented

Hi @dippynark,

The directory switch operation you mentioned (https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/blob/v1.4.1/pkg/csi_mounter/csi_mounter.go#L142-L144) has a lock to avoid race conditions.
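
A serialized version of that pattern would look roughly like the sketch below (an illustrative paraphrase with made-up names, not the driver's actual implementation): the mutex is held across the chdir, socket creation, and chown so that no concurrent operation can move the working directory mid-sequence.

```go
// Illustrative sketch only: serializing the chdir-based socket creation so the
// relative "./socket" path cannot be re-resolved against a different working
// directory by a concurrent mount operation.
package mounter

import (
	"net"
	"os"
	"sync"
)

var chdirMu sync.Mutex

func createSocketInDir(targetPath string, uid, gid int) (net.Listener, error) {
	chdirMu.Lock()
	defer chdirMu.Unlock()

	origDir, err := os.Getwd()
	if err != nil {
		return nil, err
	}
	if err := os.Chdir(targetPath); err != nil {
		return nil, err
	}
	// Restore the original working directory before releasing the lock.
	defer func() { _ = os.Chdir(origDir) }()

	l, err := net.Listen("unix", "./socket")
	if err != nil {
		return nil, err
	}
	if err := os.Chown("./socket", uid, gid); err != nil {
		l.Close()
		return nil, err
	}
	return l, nil
}
```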

Could you share more details about your Pod scheduling pattern? Specifically, how many Pods are you scheduling to the same node at the same time? Thank you!


dippynark commented Jul 25, 2024

Hi @songjiaxun, thanks for clearing that up.

There are 3 CronJobs, each creating a Job every minute, and each Job runs one Pod that mounts a GCS bucket. All Pods mount the same GCS bucket.

Each Pod does a small amount of processing and then exits, so each Job typically takes between 30 and 40 seconds to run. We're using concurrencyPolicy: Forbid on the CronJobs so we don't get more Jobs running than CronJobs, even if they sometimes take longer than a minute to run.

We are also using the optimize-utilization GKE autoscaling profile, which means the 3 Pods are typically all scheduled to the same node at similar times.

Also, after seeing the socket error, we started seeing lots of errors like the following (which we weren't seeing before the socket error):

/csi.v1.Node/NodePublishVolume failed with error: rpc error: code = Aborted desc = An operation with the given volume key /var/lib/kubelet/pods/[REDACTED]/volumes/kubernetes.io~csi/[REDACTED]/mount already exists
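
For context, this Aborted error matches the common per-volume "in-flight operation" guard pattern used by CSI drivers: while one NodePublishVolume call for a target path is still running (or hung), retries for the same path are rejected. A rough, hypothetical sketch of that pattern (not necessarily how this driver implements it):

```go
// Hypothetical sketch of a per-volume in-flight guard: if the first mount for
// a target path hangs, every subsequent retry for that path returns
// codes.Aborted with an "operation ... already exists" message.
package main

import (
	"fmt"
	"sync"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type inFlight struct {
	mu   sync.Mutex
	keys map[string]struct{}
}

// Insert records the key as in-flight; it returns false if already present.
func (f *inFlight) Insert(key string) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if _, ok := f.keys[key]; ok {
		return false
	}
	f.keys[key] = struct{}{}
	return true
}

func (f *inFlight) Delete(key string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	delete(f.keys, key)
}

func nodePublish(f *inFlight, targetPath string) error {
	if !f.Insert(targetPath) {
		return status.Errorf(codes.Aborted,
			"An operation with the given volume key %s already exists", targetPath)
	}
	defer f.Delete(targetPath)
	// ... perform the actual mount; if this hangs, retries keep hitting Aborted.
	return nil
}

func main() {
	f := &inFlight{keys: map[string]struct{}{}}
	fmt.Println(nodePublish(f, "/var/lib/kubelet/pods/x/volumes/kubernetes.io~csi/vol/mount"))
}
```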

songjiaxun (Collaborator) commented

Thanks @dippynark for reporting this issue. I am trying to reproduce this on my dev env now.

songjiaxun (Collaborator) commented

Also, @dippynark, as we are moving forward to newer k8s versions, is it possible that you could consider upgrading your cluster to 1.29? As you mentioned, the kubelet has a fix for the housekeeping logic, and we will have a better chance of pushing any potential fixes out much faster on newer k8s versions.

dippynark (Author) commented

Hi @songjiaxun, thanks. Yes, we are working on upgrading the cluster to the latest version in the stable channel, which should hopefully stop this issue from recurring.

songjiaxun added the bug label on Aug 19, 2024