
Synchronization Issue between gcsfuse and Kubernetes Pod: 'No Such File or Directory' Error on File Update #44

Open
leylmordor opened this issue Jul 3, 2023 · 17 comments
Labels
question Further information is requested

Comments

@leylmordor

leylmordor commented Jul 3, 2023

I am encountering a synchronization issue between gcsfuse and a pod in a Kubernetes environment. When I update the files in the Google Cloud Storage (GCS) bucket mounted by gcsfuse, the gcsfuse sidecar fails to access the updated files and throws an error.

Error Logs:

fuse: *fuseops.ReadFileOp error: no such file or directory
ReadFile: no such file or directory, fh.reader.ReadAt: startRead: NewReader: storage: object doesn't exist

Configuration:

GKE Pod: Mounted GCS bucket using gcsfuse with the following configuration:
mountOptions: 'uid=101,gid=82'

Kubernetes Job: Updates the contents of the GCS bucket by replacing the existing files. Names are not changed.

Observations and Troubleshooting Steps Taken:

  • Verified that the files are successfully uploaded to the GCS bucket by the Kubernetes Job.
  • Confirmed that the gcsfuse volume is correctly mounted in the pod.
  • Ensured consistent file naming between the updated files and the files accessed by the GKE Pod.
  • Verified file system permissions and confirmed that the uid and gid specified in the mountOptions match the permissions required by the Pod.
  • Tried setting stat-cache-ttl=0, type-cache-ttl=0, implicit-dirs in the mount options

Anything that I'm missing here?
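For reference, the equivalent standalone gcsfuse invocation with those caches disabled might look like this (the bucket name and mount point are placeholders; on GKE these options map to the CSI volume's mountOptions):

```shell
# Sketch only: disable the stat and type caches so remote updates are
# picked up sooner, and set uid/gid to match the pod user as above.
gcsfuse --stat-cache-ttl 0 --type-cache-ttl 0 --implicit-dirs \
  --uid 101 --gid 82 my-bucket /mnt/gcs
```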

@songjiaxun
Collaborator

Could you clarify the workflow a little bit?

  1. Before you mount the bucket, does the file exist in the bucket?
  2. Was the file initially created by gcsfuse?
  3. Is the file in a subfolder or in the root path of the bucket? If in a subfolder, was the subfolder created by gcsfuse?
  4. How large is the file?
  5. Did your Job try to access the updated file immediately after it's updated?

I am trying to collect enough information to reproduce the issue on my end. Thank you!

@songjiaxun songjiaxun added the question Further information is requested label Jul 13, 2023
@leylmordor
Author

@songjiaxun

  1. Yes
  2. No
  3. In the root folder, files were already in the bucket
  4. 70MB
  5. Yes.

So first we download the files to the bucket (which is mounted to a path) using a shell script, and then we try to access the files after the download is complete.

@songjiaxun
Collaborator

Hi @sethiay , could you help take a look at this issue?

Seems like disabling cache and enabling implicit dir using stat-cache-ttl=0, type-cache-ttl=0, implicit-dirs does not help in this case.

@leylmordor
Author

@songjiaxun any luck with this?

@raj-prince

@leylmordor

Just confirming whether you are doing something similar to the steps below:

(a) Opening a file handle from the GCSFuse mount, let's say f1.
(b) Changing the file remotely in GCS.
(c) Reading the file via the previously opened file handle (f1).

If you are doing something different, could you please share the gcsfuse logs for the steps you performed?

Thanks,
Prince.

@skyjacker2005

skyjacker2005 commented Apr 22, 2024

Any solution to this issue? We have the same situation.
In our case, access to the GCSFuse-mounted volume works well if we have only one pod replica.
With 2 replicas, unfortunately, the second replica is not able to immediately see what was updated by the first.

@songjiaxun
Collaborator

Hi @raj-prince, it seems that in @skyjacker2005's case, two gcsfuse instances cannot sync file state immediately. Can you provide some suggestions here? Is this expected?

@raj-prince

raj-prince commented Apr 22, 2024

To describe, let the two gcsfuse-mounted directories be gcs1 and gcs2. Assume both gcs1 and gcs2 are mounted with --stat-cache-ttl 0 and --type-cache-ttl 0, gcs1 is used to read the file sync_test.txt, and gcs2 is used to update the same file (sync_test.txt).

Now, a couple of scenarios:

Case 1: Reading from the same fileHandle forever:

(a) Create a file "sync_test.txt" with content "0" in gcs2 => echo "0" > gcs2/sync_test.txt
(b) Create a file handle in gcs1: fd = os.open("gcs1/sync_test.txt", os.O_RDWR|os.O_DIRECT)
(c) Read via fd: os.read(fd, 1) - output will be "0"
(d) Update the file "sync_test.txt" with content "1" in gcs2 => echo "1" > gcs2/sync_test.txt
(e) Re-position fd to the start of the file: os.lseek(fd, 0, os.SEEK_SET)
(f) Read via fd again: os.read(fd, 1) - you will get an error: FileNotFoundError: [Errno 2] No such file or directory

If you perform the above operations, you will get FileNotFoundError: [Errno 2] No such file or directory (assuming object versioning is not enabled in the mounted bucket).
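The Case 1 sequence can be sketched in Python. The paths below are a local stand-in for the two mount points, so the script is runnable as-is; the failure mode itself only reproduces against a real gcsfuse mount:

```python
import os
import tempfile

# Local stand-in for two gcsfuse mount points of the same bucket.
# On a plain local filesystem, step (f) simply returns the new content;
# on gcsfuse it fails with FileNotFoundError because the stale handle
# still points at the old, now-deleted object generation.
root = tempfile.mkdtemp()
gcs1 = gcs2 = root  # two mounts of one bucket, simulated as one directory
writer = os.path.join(gcs2, "sync_test.txt")
reader = os.path.join(gcs1, "sync_test.txt")

with open(writer, "w") as f:           # (a) create with content "0"
    f.write("0")

fd = os.open(reader, os.O_RDONLY)      # (b) long-lived read handle
first = os.read(fd, 1)                 # (c) -> b"0"

with open(writer, "w") as f:           # (d) replace with content "1"
    f.write("1")                       #     (a new GCS generation)

os.lseek(fd, 0, os.SEEK_SET)           # (e) seek back to the start
stale = os.read(fd, 1)                 # (f) ENOENT on gcsfuse;
os.close(fd)                           #     b"1" on a local filesystem
```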

Case 2: Reading via different handle every time:

(a) Create a file "sync_test.txt" with content "0" in gcs2 => echo "0" > gcs2/sync_test.txt
(b) Create a file handle in gcs1: fd = os.open("gcs1/sync_test.txt", os.O_RDWR|os.O_DIRECT)
(c) Read via fd: os.read(fd, 1) - output will be "0"
(d) Update the file "sync_test.txt" with content "1" in gcs2 => echo "1" > gcs2/sync_test.txt
(e) Create another file handle in gcs1: fd1 = os.open("gcs1/sync_test.txt", os.O_RDWR|os.O_DIRECT)
(f) Read via fd1: os.read(fd1, 1) - output will be "1"

You will always get the latest content when reading via a freshly opened handle.
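Case 2 can be sketched the same way (local stand-in paths, so the script runs anywhere; the assumption is that on gcsfuse each os.open() fetches the latest object generation):

```python
import os
import tempfile

# Local stand-in for a bucket mounted at two gcsfuse mount points.
root = tempfile.mkdtemp()
path = os.path.join(root, "sync_test.txt")

with open(path, "w") as f:             # (a) create with content "0"
    f.write("0")

fd = os.open(path, os.O_RDONLY)        # (b) first handle
first = os.read(fd, 1)                 # (c) -> b"0"
os.close(fd)

with open(path, "w") as f:             # (d) replace with content "1"
    f.write("1")

fd1 = os.open(path, os.O_RDONLY)       # (e) fresh handle after the update
latest = os.read(fd1, 1)               # (f) -> b"1", the latest content
os.close(fd1)
```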

Case 3 [very rare]: Read via a different handle every time, but assume the newly updated object has a smaller generation number (compared lexicographically) than the old one

(a) Create a file "sync_test.txt" with content "0" in gcs2 => echo "0" > gcs2/sync_test.txt (say generation 5)
(b) Create a file handle in gcs1: fd = os.open("gcs1/sync_test.txt", os.O_RDWR|os.O_DIRECT)
(c) Read via fd: os.read(fd, 1) - output will be "0"
(d) Update the file "sync_test.txt" with content "1" in gcs2 => echo "1" > gcs2/sync_test.txt (say generation 4)
(e) Create another file handle in gcs1: fd1 = os.open("gcs1/sync_test.txt", os.O_RDWR|os.O_DIRECT)
(f) Read via fd1: os.read(fd1, 1) - you will get an error: FileNotFoundError: [Errno 2] No such file or directory

This is a bug in GCSFuse, although it is very rare. It happens because we have a generation comparison in which the inode is updated only when the latest generation of the object is greater than the existing generation number. You can refer to the code here: https://github.com/GoogleCloudPlatform/gcsfuse/blob/master/internal/fs/fs.go#L877

So, Case 1 and Case 3 can cause the situation above. The assumption here is that the update to GCS is happening correctly. If you are writing through a gcsfuse-mounted directory, make sure to call Close/Flush to sync the object to GCS.
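The Close/Flush point can be illustrated with a minimal sketch (the path is a local stand-in for a gcsfuse mount; on gcsfuse the new object generation is only committed to GCS when the handle is flushed/closed):

```python
import os
import tempfile

# Hypothetical stand-in for a gcsfuse-mounted directory.
mnt = tempfile.mkdtemp()
path = os.path.join(mnt, "sync_test.txt")

# Writes through gcsfuse are staged locally and only uploaded to GCS
# when the file is flushed/closed, so always close the handle (or use
# a context manager) before readers on other mounts open the file.
with open(path, "w") as f:
    f.write("1")
# leaving the with-block calls flush+close, committing the write

with open(path) as f:
    content = f.read()  # readers that open after the close see "1"
```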

@skyjacker2005

skyjacker2005 commented Apr 23, 2024

Hi @raj-prince,

we don't manage how exactly the code accesses the filesystem files.
We simply use the standard Unpickler load method of Python's "pickle" library for object serialization.
https://github.com/python/cpython/blob/3.12/Lib/pickle.py (line 1179)

It's very strange that GCSFuse doesn't work as expected with a standard library that is used successfully with other ReadWriteMany file systems (we ourselves use the library with NFS versions 3 and 4 and with the CephFS competitor).

Thanks a lot

@raj-prince

@skyjacker2005 Could you please confirm whether or not you are using the same file handle to create the Unpickler object?

In the meantime, I'll discuss this behavior within the team and get back to you.

@skyjacker2005

@raj-prince we use it to load the sessions we store in the filesystem (so that they are available to all replicas).

In our code I see:

f = open(self.get_session_filename(sid), "rb")

Then "f" is passed to a function executing:
unpickler = Unpickler(stream, encoding=encoding)
return unpickler.load()

Where the default encoding is ASCII.

unpickler.load() fails, with FileNotFoundError logged both from our code and from gcsfuse.

Thanks again

@raj-prince

raj-prince commented Apr 23, 2024

Thanks for the information!

Ohh, so the same "f" is passed to the function executing Unpickler(stream, encoding=encoding).
If so, is it possible to pass a different file handle to the function executing the Unpickler logic? That will most likely solve the issue.
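One way to follow this suggestion (a hypothetical helper I am sketching, not a confirmed fix from the gcsfuse team) is to open a fresh handle on every load and retry briefly if the stale-handle FileNotFoundError occurs:

```python
import pickle
import time

def load_session(path, retries=3, delay=0.1):
    """Unpickle `path`, opening a fresh file handle on every attempt so
    a stale gcsfuse handle is never reused; retry briefly in case the
    object was replaced mid-read. (Hypothetical helper; the name and
    parameters are illustrative, not from any library.)"""
    for attempt in range(retries):
        try:
            with open(path, "rb") as stream:
                return pickle.Unpickler(stream).load()
        except FileNotFoundError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)  # brief backoff before re-opening
```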

@skyjacker2005

skyjacker2005 commented Apr 23, 2024

I don't understand why you think the same file handle is passed to Unpickler(stream, encoding=encoding).
"f" is a local variable created on the fly; a new one is used every time.

@raj-prince

raj-prince commented Apr 23, 2024

Got it. This is strange.

There are two options now:
(a) We try to reproduce the same issue on our side - this might take some time, although a small reproducer for the issue would be very helpful.
(b) You provide the bucket name, project number, or cluster ID by creating an internal customer support ticket, so we can check the gcsfuse logs (if the gcsfuse debug flags --debug_fuse or --debug_gcs are enabled) or the GCS logs; that would be quicker. We would also ask you to reproduce the issue after mounting with the gcsfuse debug flags, as the debug logs will make root-cause analysis easier.

Thanks,
Prince Kumar.

@raj-prince

raj-prince commented Apr 29, 2024

A gentle follow-up!

@skyjacker2005, could you please open a support ticket, since we need the bucket name, project number, and cluster ID to access the GCS and gcsfuse logs? Otherwise, it's hard to debug.

@skyjacker2005

@raj-prince we'll do it as soon as possible. Unfortunately, I'm not the person in charge of cloud project administration, so I'm forwarding the question internally.

Thanks again,
I'll update you asap.

@ashmeenkaur

Just to reiterate what was mentioned in #44 (comment):

Mounting gcsfuse with --stat-cache-ttl 0 and --type-cache-ttl 0 should reduce the occurrence of this issue. The tradeoff is an increase in GCS stat/list calls (more details here).

In addition to what was discussed in #44 (comment), this issue can also arise when there are concurrent reads and writes to GCS from different GCSFuse mount points.

Example Scenario:
Let's say there are two mounts:

  • gcs1 (for reading sync_test.txt)
  • gcs2 (for writing sync_test.txt)

sync_test.txt is a file on the bucket. (This issue is easier to reproduce with a large file. I used a 4GB file – but it can happen with small files as well.)

Scenario:

(a) Create a file handle in gcs1: fd = os.open("gcs1/sync_test.txt", os.O_RDWR|os.O_DIRECT)
(b) Read via fd: os.read(fd, 1)
(c) While (b) is in progress, update sync_test.txt via gcs2
(d) The in-progress read on gcs1 fails with the error: `No such file or directory`

Note: Reading via a new file handle works successfully only if the handle is opened after the file has been updated/written by the other mount.
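A defensive pattern on the reader side (again a sketch of my own, not an official gcsfuse recommendation) is to restart the whole read with a fresh handle whenever FileNotFoundError is raised mid-read:

```python
def read_all_with_retry(path, attempts=3):
    """Read the entire file, restarting with a fresh handle if the open
    or read fails with FileNotFoundError because the underlying object
    was replaced mid-read. (Hypothetical helper; name and parameters
    are illustrative.)"""
    last_error = None
    for _ in range(attempts):
        try:
            with open(path, "rb") as f:
                return f.read()
        except FileNotFoundError as e:
            last_error = e  # object replaced; re-open and try again
    raise last_error
```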

Next Steps:
For further debugging, the customer will open a support ticket and share the bucket name, project number, and cluster ID.

Please feel free to reopen this issue if you have any other questions.

Thanks,
Ashmeen
