[Storage] Investigate rclone mount with VFS caching #3353

Open · Tracked by #3455
concretevitamin opened this issue Mar 22, 2024 · 14 comments
@concretevitamin
Collaborator

https://rclone.org/commands/rclone_mount/#vfs-file-caching

The goal is to (1) improve the performance of dataset reads/writes and checkpoint writes, and (2) support operations like appends, compared to regular bucket mounting via mode: MOUNT. rclone mount with VFS caching appears to use local SSDs for reads/writes while also syncing up to the object storage bucket.

This mode may mimic some CSYNC-like functionality #2336.

@concretevitamin
Collaborator Author

concretevitamin commented Mar 27, 2024

On one GCP node:

file_mounts:
  ~/.config/rclone/rclone.conf: ~/.config/rclone/rclone.conf

setup: |
  set -e
  rclone version || { curl https://rclone.org/install.sh | sudo bash; }
  # rclone requires fuse3
  sudo apt-get update && sudo apt-get install fuse3 -y
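  # Mount the bucket as a background daemon with the full VFS file cache;
  # --rc enables the remote-control API used by `rclone rc vfs/refresh` below.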
  rclone mount gcs:my-bucket /mnt/my-bucket \
    --daemon --daemon-wait 0 \
    --allow-other --rc --vfs-cache-mode full
  rclone rc vfs/refresh

Writing a large file to /mnt/my-bucket/ triggers an immediate sync-up (CSYNC-like behavior), with good speed (probably thanks to caching):

$ bash dd_test.sh /mnt/my-bucket/testfile-rclone-vfs
Starting write test...
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 17.3448 s, 248 MB/s
Starting read test...
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.82129 s, 1.1 GB/s

Writing to a non-mounted path (after waiting for the background cloud sync triggered above to finish) is, for some reason, slower:

$ bash dd_test.sh /mnt/normal-disk
Starting write test...
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 38.8702 s, 110 MB/s
Starting read test...
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 19.2575 s, 223 MB/s

dd_test.sh:

#!/bin/bash
# Check if a test file name was provided
if [ -z "$1" ]; then
	  echo "Usage: $0 <testfile>"
	    exit 1
fi

# Use the provided argument as the test file name
TESTFILE=$1

# Size of the test file
FILESIZE=4294967296 # 4 GB

# Block size
BLOCKSIZE=65536

# Write test
echo "Starting write test..."
dd if=/dev/urandom of=$TESTFILE bs=$BLOCKSIZE count=$(($FILESIZE/$BLOCKSIZE)) conv=fdatasync oflag=direct 2>&1 | grep -E "copied|bytes"

# Read test
echo "Starting read test..."
dd if=$TESTFILE of=/dev/null bs=$BLOCKSIZE 2>&1 | grep -E "copied|bytes"

@shethhriday29
Contributor

shethhriday29 commented May 26, 2024

Some findings after implementing a cache via Rclone:

  • List of initial benchmarks (link)
  • [ignore sections labeled EBS, they are irrelevant in this context because those were direct reads/writes to EBS, not to S3 via an EBS cache]
  • Even early on, I noticed that at some points the advantage of an rclone cache was very thin or non-existent compared to EBS, e.g., at the 1 GB total / 1 MB blocksize AWS/S3 test in the doc (3,258.70 MB/s without cache, 3,633.64 MB/s with cache)
  • However, these benchmarks (both fio and dd) became very inconsistent across repeated runs. In particular, for larger amounts of data, accessing cloud storage directly was a lot faster than going through an rclone cache (for a 5 GB size / 1 MB blocksize write, I was getting only 150 MB/s via the rclone cache and 2000+ MB/s without it)
  • After more investigation, we realized the following:
    • The benchmark tests (dd/fio) were leveraging the buff/cache (i.e., memory) for reads/writes, whereas the same benchmarks for the old method (goofys for AWS / gsutil for GCP) were not (see the sketch at the end of this comment for one way to keep the page cache out of such measurements)
    • As a result, we'd sometimes get blazingly fast access speeds for the rclone cache on benchmarks (>3000 MB/s) when the read/write fit within memory; when it wouldn't (at larger total read/write sizes), it degraded to the baseline read/write speed of the local disk/flash storage itself (150-250 MB/s for EBS)
    • Meanwhile, the original direct-access method (goofys for AWS / gsutil for GCP) did not vary as much in performance (1000-2000 MB/s) as the total read/write size changed, because it is largely limited by the network and the throughput of S3 itself (which are NOT limited by the throughput of local storage, the way the rclone cache is at larger read/write sizes)
    • This was confirmed by a trial run of fine-tuning Vicuna using both methods (first image for direct cloud storage access, second image for the rclone cache):
[Screenshot: direct cloud storage access] [Screenshot: rclone cache]
  • In short, we were blocked for only 4 minutes each time the 100 GB needs to be written when there is no rclone cache, vs. 5.5 minutes each time the same 100 GB needs to be written when there is an rclone cache

  • Despite this, there are still reasons to implement the rclone cache as an option for the user:

  • If the user has frequent reads/writes to files that are not large in size, then the lower latency provided by the rclone cache is advantageous (because we can just access disk directly, not wait for the latency of accessing cloud storage)

  • Additionally, rclone cache is quite important for supporting POSIX compatibility. This is important for a variety of things, including accessing Jupyter notebooks, which I WAS able to do in rclone-cache-mounted cloud storage but wasn't able to do via the old goofys method in AWS. (can find screenshots of this in the "initial benchmarks" doc linked at the top of this comment)

  • However, this is not necessarily an endorsement to make the rclone cache the default mounting method, but rather to support it as an option if the user chooses
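A side note on the buff/cache point above: here is a minimal sketch of how to keep the page cache from inflating such dd-style numbers (this is not what produced the results above; the test path is hypothetical, and some FUSE mounts may reject O_DIRECT):

#!/bin/bash
# Sketch: reduce page-cache effects on dd-style read benchmarks.
TESTFILE=/mnt/my-bucket/testfile   # hypothetical test path

# Flush dirty pages, then drop page cache, dentries, and inodes before measuring reads.
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null

# Direct I/O bypasses the page cache on filesystems that support it
# (some FUSE mounts may not support O_DIRECT; the cache drop above still helps).
dd if="$TESTFILE" of=/dev/null bs=1M iflag=direct 2>&1 | grep -E "copied|bytes"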

@shethhriday29
Contributor

for reference: draft pull request w/ rclone cache implementation #3455

@romilbhardwaj
Collaborator

This is fantastic - thanks for sharing @shethhriday29!

What EBS disk_tier was used for these benchmarks?

Looks like this is a "high latency high throughput" (FUSE) vs "low latency low throughput" (Rclone, and potentially CSYNC?) situation:

  1. For write heavy workloads:
    a. Writing large files (e.g., Vicuna) - FUSE based approaches are faster than rclone.
    b. Writing many small files (e.g., logs, other workloads) - rclone is better. Especially since it can provide POSIX compatibility.

  2. For read heavy workloads:
    a. Reading large files (e.g., datasets stored as parquet files) - FUSE is better (?)
    b. Reading many small files (e.g., datasets stored as individual files/records) - rclone may be faster (?)

cc @landscapepainter - did you have similar observations for csync? any thoughts?

@shethhriday29
Contributor

Thank you, Romil! I can't 100% recall for the fio benchmarks, but at least for training Vicuna, we did have disk_tier: best.

@shethhriday29
Contributor

Just did some of the fio testing on disk_tier: best and got the same general results — inconsistent speeds (sometimes blazingly fast reads/writes, otherwise very slow reads/writes).

@landscapepainter
Collaborator

landscapepainter commented May 28, 2024

@romilbhardwaj I have not specifically benchmarked read workloads, but the rclone VFS mount used in #3455 by @shethhriday29 seemed to perform better than FUSE-based tools (goofys, gcsfuse) when the training script writes checkpoints, and even on par with CSYNC (#2336). So I looked into it more in depth and measured the time each of MOUNT, rclone VFS (#3455), and CSYNC takes to write a checkpoint (i.e., the time the training script is stuck), by overriding the trainer class we use for Vicuna. I'm still organizing the data and will post more detailed results soon; here is the tl;dr:

rclone VFS is almost as fast as CSYNC at writing the checkpoint to the local SSD (CSYNC ~150s vs. rclone VFS ~175s), so it passes in terms of letting the training script continue as quickly as possible. But there are two issues:

  • It takes quite a long time to sync the checkpoint written to the local SSD up to cloud storage (CSYNC ~2 min vs. rclone VFS ~26 min).
  • It does not guarantee that files are uploaded to cloud storage in the order the training script wrote them to the local SSD, which causes the issue @concretevitamin mentioned about DeepSpeed: smaller files generated later can be uploaded before larger files written earlier. This is an issue for CSYNC as well, so we need an additional feature either way.

I'll try to update more details soon, but just wanted to give everyone a heads up.

@landscapepainter
Collaborator

landscapepainter commented May 29, 2024

@romilbhardwaj @shethhriday29 @concretevitamin I'll explain the data I shared in the previous comment with more depth.

To measure the checkpoint-writing time for each storage mode, I overrode the _save_checkpoint() method of the transformers.Trainer class we use in the Vicuna training script. A spot A100-80GB:8 instance with the default medium disk tier was used to train Vicuna 13B. For CSYNC, interval_seconds: 180 was set. No preemption happened while I was measuring.

The time it takes for the training script to write a checkpoint before continuing to train is as follows:
gcsfuse MOUNT: 1245.24s
rclone vfs: 186.39s
CSYNC: 164.16s

And you can also see this visually from W&B plots:
[Screenshot: W&B plots]

The time for the training script to finish writing a checkpoint is similar for CSYNC and rclone VFS. But there are two concerns I discovered with rclone VFS.

1. It takes quite a while to sync the checkpoints from local disk to cloud storage, which increases the chance of the user losing a checkpoint when the spot instance gets preempted. Below is the timeline from when the training script first starts writing the larger portion of the checkpoint (~60 GB) to when the corresponding checkpoint appears on cloud storage:

rclone vfs:
training script starts to write on local ssd: 11:15:39
training script completes to write on local ssd: 11:18:30
all the larger portion of checkpoints appears on cloud storage: 11:44:07

CSYNC:
training script starts to write on local ssd: 10:27:01
training script completes to write on local ssd: 10:29:29
all the larger portion of checkpoints appears on cloud storage: 10:31:39

As you can see, fully syncing from local to cloud storage with rclone VFS takes a while (CSYNC ~2 min vs. rclone VFS ~26 min). One thing to note: CSYNC's ~2 minutes will increase if we implement a feature to upload checkpoint files in the order they were written to the local SSD.
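(For reference, a minimal sketch of how this write-to-visibility lag could be measured: start the loop below when the training script begins writing the checkpoint locally; it polls the bucket until the last-written marker file appears. The remote path and marker name are hypothetical.)

#!/bin/bash
# Hypothetical remote path and marker file; adjust to the actual bucket/checkpoint dir.
REMOTE=gcs:my-bucket/checkpoint-1000
MARKER=complete

start=$(date +%s)
# Poll the bucket until the marker (the last file written locally) becomes visible.
while ! rclone lsf "$REMOTE" | grep -qx "$MARKER"; do
  sleep 10
done
echo "Marker visible in cloud storage after $(( $(date +%s) - start )) seconds"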

2. Some training frameworks require all files of a single checkpoint to be present in the right order for it to be usable, and will crash rather than fail over to another checkpoint otherwise; rclone VFS fails to maintain this order. For DeepSpeed, if some checkpoint files written later by the training script (e.g., a file marking completion of the checkpoint) exist but ones written earlier do not, it will crash rather than fall back to another checkpoint, which is the issue @concretevitamin brought up. Hence, we want the order in which files appear in cloud storage to match the order in which the training script wrote them locally, just like MOUNT mode. rclone VFS fails to address this, just like the current CSYNC. For our Vicuna training script with transformers.Trainer, the last file written as part of the checkpoint is called complete, marking that the entire checkpoint directory has been written; you can see this below from a cloud storage bucket that was used in MOUNT mode.

[Screenshot: bucket listing (MOUNT mode)]

But with rclone VFS, the order is not maintained, just like the current CSYNC: there are files uploaded to cloud storage after the complete file was uploaded.

[Screenshots: bucket listings showing files uploaded after the complete file]

Since rclone VFS is not only slower than CSYNC but also fails to maintain the order, it seems beneficial to complete the one last piece of CSYNC by implementing the feature to maintain the order.

@concretevitamin
Collaborator Author

@landscapepainter @shethhriday29 In your benchmarks, what are the other VFS args (https://rclone.org/commands/rclone_mount/#vfs-file-caching) set to? Those may affect certain behavior like time taken to sync up to cloud storage.
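For reference, a sketch with the cache-related knobs spelled out explicitly, using --vfs-cache-mode full as in the benchmarks and the remaining values at what I believe are rclone's documented defaults (the gcs:my-bucket remote is just the example from earlier in this thread; double-check the defaults against the installed rclone version):

# --vfs-write-back: how long to wait after a file is closed before uploading it
# --vfs-cache-poll-interval: how often the cache is scanned for things to upload/expire
# --vfs-cache-max-age: how long objects stay in the cache
# --transfers: number of parallel transfers to the remote
# (--vfs-cache-max-size, default "off", caps the total cache size.)
rclone mount gcs:my-bucket /mnt/my-bucket \
  --daemon --daemon-wait 0 --allow-other --rc \
  --vfs-cache-mode full \
  --vfs-write-back 5s \
  --vfs-cache-poll-interval 1m \
  --vfs-cache-max-age 1h \
  --transfers 4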

@shethhriday29
Contributor

I was mounting via
rclone mount {bucket_rclone_profile}:{bucket_name} {mount_path} --daemon --daemon-wait 0 --allow-other --rc --vfs-cache-mode full && rclone rc vfs/refresh

@landscapepainter
Collaborator

@concretevitamin @shethhriday29 Let me look into what other options we have to improve that performance.

@landscapepainter
Collaborator

landscapepainter commented May 30, 2024

@concretevitamin @shethhriday29 @romilbhardwaj I did more research on some feasible options to improve performance of the current rclone vfs implementation, and it seems like I found what we wanted.

On top of @shethhriday29's implementation, I added the following two options:

  • --transfers 1: This limits concurrent uploads from the cache to cloud storage to 1. If there are multiple files to upload in the cache, they are uploaded one by one in the order they were created (modified) in the cache. This resolves issue 2 mentioned in the comment above.
  • --vfs-cache-poll-interval 10s: This sets the interval at which rclone checks the cache against cloud storage and triggers a sync. The default is 1 minute, which seems to have been the main cause of the delayed upload (~26 min) mentioned in the comment above.
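For concreteness, a sketch of the mount command from #3455 with the two options folded in (placeholders {bucket_rclone_profile}, {bucket_name}, {mount_path} as in the PR; the --cache-dir value is only illustrative, see the note at the end of this comment):

rclone mount {bucket_rclone_profile}:{bucket_name} {mount_path} \
  --daemon --daemon-wait 0 \
  --allow-other --rc --vfs-cache-mode full \
  --transfers 1 \
  --vfs-cache-poll-interval 10s \
  --cache-dir ~/.sky/rclone_cache && \
rclone rc vfs/refresh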

From the following, you can see that with the newly added options:

  1. it took only ~8 minutes from when the training script started writing the checkpoint files (~61 GB) to when they were completely uploaded to cloud storage. rng_state_1.pth is the very first file written by the script, and complete is the last one.
  2. the order in which checkpoint files were uploaded to cloud storage follows the order in which the training script created them.
  • result of ls -lh --time-style=full-iso at local
[Screenshot: local ls -lh output]
  • mounting gcs with rclone vfs

[Screenshot: bucket listing with the rclone VFS mount]

  • W&B graph
    [Screenshot: W&B graph]

But we do need additional testing on CPU usage, how large the cache can get and its implications, reading checkpoints back when a spot instance gets preempted (especially the read-related options), etc. We should also add options like --cache-dir to specify the cache directory, just like we were doing for CSYNC.

@landscapepainter
Collaborator

landscapepainter commented May 31, 2024

Two additional discoveries on goofys and the latest gcsfuse version, 2.2.0.

  1. The time the training script takes to write the checkpoint with MOUNT mode using goofys on S3 is on par with CSYNC and rclone with the additional options, which explains the high throughput on S3 with goofys that @shethhriday29 discovered.

  2. I measured the same metric for the latest gcsfuse version, 2.2.0. We currently have 1.3.0 running with MOUNT mode, and there was a good amount of performance improvement with the version update. I'll submit a PR for this update.

[Screenshot: benchmark results]

cc @romilbhardwaj @concretevitamin

@shethhriday29
Contributor

Wow, this is awesome @landscapepainter ! Are the additional options just --vfs-cache-poll-interval 10s and --transfers 1? Also, from the graph, it doesn't look like there's a big difference in the performance of rclone vfs with or without the additional options — am I missing something?
