Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make local/temporary file/directory deletion thread-safe #666

Open
cortadocodes opened this issue Jul 11, 2024 · 0 comments
Open

Make local/temporary file/directory deletion thread-safe #666

cortadocodes opened this issue Jul 11, 2024 · 0 comments
Assignees
Labels
bug Unintended behaviour in any area of the app

Comments

@cortadocodes
Copy link
Member

cortadocodes commented Jul 11, 2024

Bug report

What is the current behavior?

The Runner class and service configuration have a delete_local_files argument which deletes any datafiles downloaded and any registered temporary directories at the end of an analysis run. We know (and warn) that this isn't thread-safe. However, we've recently seen the same consequences of thread unsafety in what should be a thread-safe environment (separate containers running separate processes with supposedly separate filesystems in Cloud Run).

To get our containers to run reliably, we've recently had to disable local file deletion (delete_local_files=False is now the default). We now have a potential problem of running out of memory when running our containers in Cloud Run where, by default, local storage is actually in-memory. We need to find a way to clean up files from one analysis before the next is run.

The actual error we saw was a FileNotFoundError which would occur when using functions from the os.path module e.g. when trying to get the relative path of a datafile in a dataset so it could be downloaded/uploaded or when trying to get the absolute path of a metadata file.

We triggered this error when asking 24 questions (6 in parallel at a time) to one of our services with concurrency set to 1 (1 request (question) at a time per container). The first several questions would work ok (between 4 and 8), then the rest would fail. We think what happened was this:

  1. The first few questions would get new containers
  2. These questions would finish and clean up successfully
  3. The rest of the questions would be taken up by the containers that were already running and would attempt to use temporary directories. This would somehow fail because of the previous cleanup.

So, despite concurrency being 1, each container would attempt to answer multiple questions in serial. This should have been fine if each container had its own filesystem, but a new shared filesystem (between containers in a Cloud Run service instance) is being trialled in Cloud Run that we think makes it effectively thread-unsafe to clean up like this. I think it's possible that separate containers are interfering with each others list of registered temporary directories; it's also possible it's a problem within each container but caused by serial analyses.

What is the expected behavior?

We can automatically delete downloaded and temporary files/directories at the end of each analysis run.

Proposed Solution

Some ideas:

  • Set delete_local_files=True and make our file cleanup thread-safe (e.g. apply a lock on accessing and modifying the list of registered temporary directories)
    • e.g. put the registered_temporary_directories variable on the Runner instance so it's not global
  • Keep delete_local_files=False and start providing a non-in-memory filesystem that's cleaned up separately
@cortadocodes cortadocodes added the bug Unintended behaviour in any area of the app label Jul 11, 2024
@cortadocodes cortadocodes self-assigned this Jul 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Unintended behaviour in any area of the app
Projects
None yet
Development

No branches or pull requests

1 participant