
Synchronization Issue between gcsfuse and Kubernetes Pod: 'No Such File or Directory' Error on File Update #44

Open
leylmordor opened this issue Jul 3, 2023 · 17 comments
Labels
question Further information is requested

Comments

@leylmordor

leylmordor commented Jul 3, 2023

I am encountering a synchronization issue between gcsfuse and a pod in a Kubernetes environment. When I update the files in the Google Cloud Storage (GCS) bucket mounted by gcsfuse, the gcsfuse sidecar fails to access the updated files and throws an error.

Error Logs:

fuse: *fuseops.ReadFileOp error: no such file or directory
ReadFile: no such file or directory, fh.reader.ReadAt: startRead: NewReader: storage: object doesn't exist

Configuration:

GKE Pod: Mounted GCS bucket using gcsfuse with the following configuration:
mountOptions: 'uid=101,gid=82'

Kubernetes Job: Updates the contents of the GCS bucket by replacing the existing files. Names are not changed.

Observations and Troubleshooting Steps Taken:

  • Verified that the files are successfully uploaded to the GCS bucket by the Kubernetes Job.
  • Confirmed that the gcsfuse volume is correctly mounted in the pod.
  • Ensured consistent file naming between the updated files and the files accessed by the GKE Pod.
  • Verified file system permissions and confirmed that the uid and gid specified in the mountOptions match the permissions required by the Pod.
  • Tried setting stat-cache-ttl=0, type-cache-ttl=0, implicit-dirs in the mount options

Anything that I'm missing here?
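For reference, the equivalent standalone gcsfuse invocation with those caches disabled might look like this (the bucket name and mount point are placeholders; on GKE these options map to the CSI volume's mountOptions):

```shell
# Sketch only: disable the stat and type caches so remote updates are
# picked up sooner, and set uid/gid to match the pod user as above.
gcsfuse --stat-cache-ttl 0 --type-cache-ttl 0 --implicit-dirs \
  --uid 101 --gid 82 my-bucket /mnt/gcs
```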

@songjiaxun
Collaborator

Could you clarify the workflow a little bit?

  1. Before you mount the bucket, does the file exist in the bucket?
  2. Was the file initially created by gcsfuse?
  3. Is the file in a subfolder or in the root path of the bucket? If in a subfolder, was the subfolder created by gcsfuse?
  4. How large is the file?
  5. Did your Job try to access the updated file immediately after it's updated?

I am trying to collect enough information to reproduce the issue on my end. Thank you!

@songjiaxun songjiaxun added the question Further information is requested label Jul 13, 2023
@leylmordor
Author

@songjiaxun

  1. Yes
  2. No
  3. In the root folder, files were already in the bucket
  4. 70MB
  5. Yes.

So first we download the files to the bucket (which is mounted to a path) using a shell script, and then we try to access the files after the download is complete.

@songjiaxun
Collaborator

Hi @sethiay , could you help take a look at this issue?

Seems like disabling cache and enabling implicit dir using stat-cache-ttl=0, type-cache-ttl=0, implicit-dirs does not help in this case.

@leylmordor
Author

@songjiaxun any luck with this?

@raj-prince

@leylmordor

Just confirming whether you are doing something similar to the steps below:

(a) Opening a file handle from the GCSFuse mount, let's say f1.
(b) Changing the file remotely in GCS.
(c) Reading the file via the previously opened file handle (f1).

If you are doing something different, could you please share the gcsfuse logs for the steps you performed?

Thanks,
Prince.

@skyjacker2005

skyjacker2005 commented Apr 22, 2024

Any solution to this issue? We have the same situation.
In our case, access to the GCSFuse-mounted volume works well if we have only one pod replica.
With 2 replicas, unfortunately, the second replica is not able to immediately see what was updated by the first.

@songjiaxun
Collaborator

Hi @raj-prince, it seems that in @skyjacker2005's case, two gcsfuse instances cannot sync file state immediately. Can you provide some suggestions here? Is this expected?

@raj-prince

raj-prince commented Apr 22, 2024

To describe, let the two gcsfuse-mounted directories be gcs1 and gcs2. Assume both gcs1 and gcs2 are mounted with --stat-cache-ttl 0 and --type-cache-ttl 0, gcs1 is used to read the file sync_test.txt, and gcs2 is used to update the same file (sync_test.txt).

Now, a couple of scenarios:

Case 1: Reading from the same fileHandle forever:

(a) Create a file "sync_test.txt" with content "0" in gcs2 => echo "0" > gcs2/sync_test.txt
(b) Create a file handle in gcs1: fd = os.open("gcs1/sync_test.txt", os.O_RDWR|os.O_DIRECT)
(c) Read via fd: os.read(fd, 1) - output will be "0"
(d) Update the file "sync_test.txt" with content "1" in gcs2 => echo "1" > gcs2/sync_test.txt
(e) Re-position fd to the start of the file: os.lseek(fd, 0, os.SEEK_SET)
(f) Read via fd again: os.read(fd, 1) - you will get an error: FileNotFoundError: [Errno 2] No such file or directory

If you perform the above operations, you will get FileNotFoundError: [Errno 2] No such file or directory (assuming object versioning is not enabled in the mounted bucket).
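The Case 1 sequence can be sketched in Python. The paths below are a local stand-in for the two mount points, so the script is runnable as-is; the failure mode itself only reproduces against a real gcsfuse mount:

```python
import os
import tempfile

# Local stand-in for two gcsfuse mount points of the same bucket.
# On a plain local filesystem, step (f) simply returns the new content;
# on gcsfuse it fails with FileNotFoundError because the stale handle
# still points at the old, now-deleted object generation.
root = tempfile.mkdtemp()
gcs1 = gcs2 = root  # two mounts of one bucket, simulated as one directory
writer = os.path.join(gcs2, "sync_test.txt")
reader = os.path.join(gcs1, "sync_test.txt")

with open(writer, "w") as f:           # (a) create with content "0"
    f.write("0")

fd = os.open(reader, os.O_RDONLY)      # (b) long-lived read handle
first = os.read(fd, 1)                 # (c) -> b"0"

with open(writer, "w") as f:           # (d) replace with content "1"
    f.write("1")                       #     (a new GCS generation)

os.lseek(fd, 0, os.SEEK_SET)           # (e) seek back to the start
stale = os.read(fd, 1)                 # (f) ENOENT on gcsfuse;
os.close(fd)                           #     b"1" on a local filesystem
```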

Case 2: Reading via different handle every time:

(a) Create a file "sync_test.txt" with content "0" in gcs2 => echo "0" > gcs2/sync_test.txt
(b) Create a file handle in gcs1: fd = os.open("gcs1/sync_test.txt", os.O_RDWR|os.O_DIRECT)
(c) Read via fd: os.read(fd, 1) - output will be "0"
(d) Update the file "sync_test.txt" with content "1" in gcs2 => echo "1" > gcs2/sync_test.txt
(e) Create another file handle in gcs1: fd1 = os.open("gcs1/sync_test.txt", os.O_RDWR|os.O_DIRECT)
(f) Read via fd1: os.read(fd1, 1) - output will be "1"

You will always get the latest content when reading via a freshly opened handle.
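Case 2 can be sketched the same way (local stand-in paths, so the script runs anywhere; the assumption is that on gcsfuse each os.open() fetches the latest object generation):

```python
import os
import tempfile

# Local stand-in for a bucket mounted at two gcsfuse mount points.
root = tempfile.mkdtemp()
path = os.path.join(root, "sync_test.txt")

with open(path, "w") as f:             # (a) create with content "0"
    f.write("0")

fd = os.open(path, os.O_RDONLY)        # (b) first handle
first = os.read(fd, 1)                 # (c) -> b"0"
os.close(fd)

with open(path, "w") as f:             # (d) replace with content "1"
    f.write("1")

fd1 = os.open(path, os.O_RDONLY)       # (e) fresh handle after the update
latest = os.read(fd1, 1)               # (f) -> b"1", the latest content
os.close(fd1)
```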

Case 3 [very rare]: Read via a different handle every time, but assume the newly updated object has a smaller generation number (compared lexicographically) than the old one

(a) Create a file "sync_test.txt" with content "0" in gcs2 => echo "0" > gcs2/sync_test.txt (say generation 5)
(b) Create a file handle in gcs1: fd = os.open("gcs1/sync_test.txt", os.O_RDWR|os.O_DIRECT)
(c) Read via fd: os.read(fd, 1) - output will be "0"
(d) Update the file "sync_test.txt" with content "1" in gcs2 => echo "1" > gcs2/sync_test.txt (say generation 4)
(e) Create another file handle in gcs1: fd1 = os.open("gcs1/sync_test.txt", os.O_RDWR|os.O_DIRECT)
(f) Read via fd1: os.read(fd1, 1) - you will get an error: FileNotFoundError: [Errno 2] No such file or directory

This is a bug in GCSFuse, although it is very rare. It happens because we have a generation comparison in which the inode is updated only when the latest generation of the object is greater than the existing generation number. You can refer to the code here: https://github.com/GoogleCloudPlatform/gcsfuse/blob/master/internal/fs/fs.go#L877

So, Case 1 and Case 3 can cause the situation above. The assumption here is that the update to GCS is happening correctly. If you are writing through a gcsfuse-mounted directory, make sure to call Close/Flush to sync the object to GCS.
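The Close/Flush point can be illustrated with a minimal sketch (the path is a local stand-in for a gcsfuse mount; on gcsfuse the new object generation is only committed to GCS when the handle is flushed/closed):

```python
import os
import tempfile

# Hypothetical stand-in for a gcsfuse-mounted directory.
mnt = tempfile.mkdtemp()
path = os.path.join(mnt, "sync_test.txt")

# Writes through gcsfuse are staged locally and only uploaded to GCS
# when the file is flushed/closed, so always close the handle (or use
# a context manager) before readers on other mounts open the file.
with open(path, "w") as f:
    f.write("1")
# leaving the with-block calls flush+close, committing the write

with open(path) as f:
    content = f.read()  # readers that open after the close see "1"
```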

@skyjacker2005

skyjacker2005 commented Apr 23, 2024

Hi @raj-prince,

we don't manage how exactly the code accesses the filesystem files.
We simply use the standard Unpickler load method of Python's "pickle" library for object serialization.
https://github.com/python/cpython/blob/3.12/Lib/pickle.py (line 1179)

It's very strange that GCSFuse doesn't work as expected with a standard library that is used successfully with other ReadWriteMany file systems (we ourselves use the library with NFS versions 3 and 4 and with the CephFS competitor).

Thanks a lot

@raj-prince

@skyjacker2005 Could you please confirm whether or not you are using the same file handle to create the Unpickler object?

In the meantime, I'll discuss this behavior within the team and get back to you.

@skyjacker2005

@raj-prince we use it to load the sessions we store in the filesystem (so that they are available to all replicas).

In our code I see:

f = open(self.get_session_filename(sid), "rb")

Then "f" is passed to a function executing:
unpickler = Unpickler(stream, encoding=encoding)
return unpickler.load()

Where the default encoding is ASCII.

unpickler.load() fails, with FileNotFoundError logged both from our code and from gcsfuse.

Thanks again

@raj-prince

raj-prince commented Apr 23, 2024

Thanks for the information!

Ohh, so the same "f" is passed to the function executing Unpickler(stream, encoding=encoding).
If so, is it possible to pass a different file handle to the function executing the Unpickler logic? That will most likely solve the issue.
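One way to follow this suggestion (a hypothetical helper I am sketching, not a confirmed fix from the gcsfuse team) is to open a fresh handle on every load and retry briefly if the stale-handle FileNotFoundError occurs:

```python
import pickle
import time

def load_session(path, retries=3, delay=0.1):
    """Unpickle `path`, opening a fresh file handle on every attempt so
    a stale gcsfuse handle is never reused; retry briefly in case the
    object was replaced mid-read. (Hypothetical helper; the name and
    parameters are illustrative, not from any library.)"""
    for attempt in range(retries):
        try:
            with open(path, "rb") as stream:
                return pickle.Unpickler(stream).load()
        except FileNotFoundError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)  # brief backoff before re-opening
```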

@skyjacker2005

skyjacker2005 commented Apr 23, 2024

I don't understand why you think the same file handle is passed to Unpickler(stream, encoding=encoding).
"f" is a local variable created on the fly; a new one is used every time.

@raj-prince

raj-prince commented Apr 23, 2024

Got it. This is strange.

There are two options now:
(a) We try to reproduce the same issue on our side - this might take some time, although a small reproducer for the issue would be very helpful.
(b) You provide the bucket name, project number, or cluster ID by creating an internal customer support ticket, so we can check the gcsfuse logs (if the gcsfuse debug flags --debug_fuse or --debug_gcs are enabled) or the GCS logs; that would be quicker. We would also ask you to reproduce the issue after mounting with the gcsfuse debug flags, as the debug logs will make root-cause analysis easier.

Thanks,
Prince Kumar.

@raj-prince

raj-prince commented Apr 29, 2024

A gentle follow-up!

@skyjacker2005, could you please open a support ticket, since we need the bucket name, project number, and cluster ID to access the GCS and gcsfuse logs? Otherwise, it's hard to debug.

@skyjacker2005

@raj-prince we'll do it as soon as possible. Unfortunately, I'm not the person in charge of cloud project administration, so I'm forwarding the question internally.

Thanks again,
I'll update you asap.

@ashmeenkaur

Just to reiterate what was mentioned in #44 (comment):

Mounting gcsfuse with --stat-cache-ttl 0 and --type-cache-ttl 0 should reduce the occurrence of this issue. The tradeoff is an increase in GCS stat/list calls (more details here).

In addition to what was discussed in #44 (comment), this issue can also arise when there are concurrent reads and writes to GCS from different GCSFuse mount points.

Example Scenario:
Let's say there are two mounts:

  • gcs1 (for reading sync_test.txt)
  • gcs2 (for writing sync_test.txt)

sync_test.txt is a file on the bucket. (This issue is easier to reproduce with a large file. I used a 4GB file – but it can happen with small files as well.)

Scenario:

(a) Create a file handle in gcs1: fd = os.open("gcs1/sync_test.txt", os.O_RDWR|os.O_DIRECT)
(b) Read via fd: os.read(fd, 1)
(c) While (b) is in progress, update sync_test.txt via gcs2
(d) The in-progress read on gcs1 fails with the error: `No such file or directory`

Note: Reading via a new file handle works successfully only if the handle is opened after the file has been updated/written by the other mount.
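A defensive pattern on the reader side (again a sketch of my own, not an official gcsfuse recommendation) is to restart the whole read with a fresh handle whenever FileNotFoundError is raised mid-read:

```python
def read_all_with_retry(path, attempts=3):
    """Read the entire file, restarting with a fresh handle if the open
    or read fails with FileNotFoundError because the underlying object
    was replaced mid-read. (Hypothetical helper; name and parameters
    are illustrative.)"""
    last_error = None
    for _ in range(attempts):
        try:
            with open(path, "rb") as f:
                return f.read()
        except FileNotFoundError as e:
            last_error = e  # object replaced; re-open and try again
    raise last_error
```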

Next Steps:
For further debugging, the customer will open a support ticket and share the bucket name, project number, and cluster ID.

Please feel free to reopen this issue if you have any other questions.

Thanks,
Ashmeen
