Colossus archive script #5193
Merged
Conversation
Great work 👍 Is the
Thank you.
freakstatic approved these changes Dec 10, 2024
LGTM!
mnaamani approved these changes Dec 12, 2024
Addresses: #5188
Overview
The archive mode is a new operating mode for Colossus.
When running in archive mode, Colossus will continuously download data objects, pack them into archives (optionally using `7zip` or `zstd` for additional compression) and upload them to S3. No external API is exposed in archive mode.
Essential Parameters
Filesystem
- `uploadQueueDir`: Directory for storing:
  - `objects_trackfile` - a file which tracks already downloaded data objects to avoid downloading them again
  - `archives_trackfile.jsonl` - a file which keeps track of all uploaded archives and the data objects they contain. This is the only source of this information, so a copy of it is also periodically uploaded to S3 every `--archiveTrackfileBackupFreqMinutes` minutes.
- `tmpDownloadDir`: Temporary directory for storing downloads in progress.

CLI flags:
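As an illustration of how the `archives_trackfile.jsonl` file can be consumed, here is a small TypeScript sketch. The PR does not specify the exact record schema, so the `ArchiveRecord` shape below is an assumption for illustration only; the JSONL convention (one JSON object per line) is the only part implied by the file extension.

```typescript
// Hypothetical sketch - the record shape is assumed, not taken from Colossus source.
interface ArchiveRecord {
  archiveName: string // name of the uploaded archive (assumed field)
  dataObjectIds: string[] // data objects packed into this archive (assumed field)
}

// Parse the trackfile contents: one JSON object per line (JSONL).
function parseArchivesTrackfile(contents: string): ArchiveRecord[] {
  return contents
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line) as ArchiveRecord)
}

// Find which archive a given data object was packed into, if any.
function findArchive(records: ArchiveRecord[], dataObjectId: string): string | undefined {
  return records.find((r) => r.dataObjectIds.includes(dataObjectId))?.archiveName
}
```

Since this trackfile is the only source of the archive-to-object mapping, a consumer like this is also what makes the periodic S3 backup of the file important.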
S3 bucket config
S3 support was based on #5175
CLI flags:
ENV variables:
Upload triggers
There are 3 parameters that control when the compression and upload flow is triggered:
- `--localCountTriggerThreshold` - objects are packed/compressed and uploaded to S3 if the number of them (in the local directory) reaches this threshold.
- `--localSizeTriggerThresholdMB` - objects are packed/compressed and uploaded to S3 if the total size of them (in the local directory) reaches this threshold.
- `--localAgeTriggerThresholdMinutes` - objects are packed/compressed and uploaded to S3 if the oldest of them was downloaded more than `localAgeTriggerThresholdMinutes` minutes ago.

CLI flags:
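The three triggers act as an OR condition: reaching any one of them starts the compression and upload flow. A minimal TypeScript sketch of that check (illustrative only, not the actual Colossus implementation):

```typescript
// Sketch of the upload-trigger logic described above (not Colossus source code).
interface LocalObject {
  sizeBytes: number
  downloadedAt: number // unix timestamp in milliseconds
}

interface TriggerConfig {
  localCountTriggerThreshold: number
  localSizeTriggerThresholdMB: number
  localAgeTriggerThresholdMinutes: number
}

// Returns true if any of the three thresholds is reached, i.e.
// the local objects should be packed/compressed and uploaded.
function shouldTriggerUpload(objects: LocalObject[], cfg: TriggerConfig, now: number): boolean {
  if (objects.length === 0) return false
  const totalSizeMB = objects.reduce((sum, o) => sum + o.sizeBytes, 0) / (1024 * 1024)
  const oldestAgeMinutes = (now - Math.min(...objects.map((o) => o.downloadedAt))) / 60_000
  return (
    objects.length >= cfg.localCountTriggerThreshold ||
    totalSizeMB >= cfg.localSizeTriggerThresholdMB ||
    oldestAgeMinutes > cfg.localAgeTriggerThresholdMinutes
  )
}
```

The age trigger guarantees that objects are eventually uploaded even on a quiet node that never accumulates enough objects to hit the count or size thresholds.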
Size limits
- `--archiveFileSizeLimitMB` - specifies the desired size limit of the archives. This is a soft limit: the actual archives may be bigger depending on the size of the data objects (for example, if we set the limit to `1GB` but there is a `2GB` data object being packed into the archive, the resulting archive will still exceed the limit). Generally, lowering the limit will result in a larger number of smaller archives, while increasing it will result in a smaller number of larger archives.
- `--uploadQueueDirSizeLimitMB` - specifies the desired limit of the upload directory size. To leave a safe margin of error (for compression etc.), it should be set to ~50% of the available disk space.

CLI flags:
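To illustrate why `--archiveFileSizeLimitMB` is only a soft limit, here is a simple greedy packing sketch. This is an illustration of the behaviour described above, not necessarily the exact algorithm Colossus uses:

```typescript
// Greedy packing sketch (illustrative only). Objects are appended to the
// current archive until adding the next one would exceed the limit; a single
// object larger than the limit still gets its own archive, which therefore
// exceeds the limit - hence "soft" limit.
function packIntoArchives(objectSizesMB: number[], limitMB: number): number[][] {
  const archives: number[][] = []
  let current: number[] = []
  let currentSizeMB = 0
  for (const sizeMB of objectSizesMB) {
    if (current.length > 0 && currentSizeMB + sizeMB > limitMB) {
      archives.push(current)
      current = []
      currentSizeMB = 0
    }
    current.push(sizeMB)
    currentSizeMB += sizeMB
  }
  if (current.length > 0) archives.push(current)
  return archives
}
```

With a 1000 MB limit, a 2000 MB object still ends up in an archive of its own that exceeds the limit, matching the `1GB`/`2GB` example above.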
Performance Tuning
CLI flags:
Compression
Logging
Usage Example
```
# Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY env...
storage-node archive \
  --worker=123 \
  --uploadQueueDir=/data/uploads \
  --tmpDownloadDir=/data/temp \
  --awsS3BucketRegion=us-east-1 \
  --awsS3BucketName=my-bucket \
  --awsStorageClass=DEEP_ARCHIVE \
  --localSizeTriggerThresholdMB=20000 \
  --uploadQueueDirSizeLimitMB=50000 \
  --compressionAlgorithm=none
```
Archive service loop
The loop executed by the archive service consists of the following steps:
1. Scans the `uploadQueueDir` contents and removes any `.tmp.*` archives (they may be left over if the compression failed at some point; in this case the objects will be re-downloaded and compression will be re-attempted later).
2. Downloads data objects which are missing from `objects_trackfile` but exist in the runtime, making sure the directory size stays below `--uploadQueueDirSizeLimitMB`, in order to avoid overflowing the disk space (since downloads may be faster than uploads).
3. Once the downloads are finalized and there are no other pending tasks (like uploads in progress), checks whether `localAgeTriggerThresholdMinutes` is reached and, if it is, triggers the compression & uploads.
4. Waits `--syncInterval` before the next cycle.

Logging
It's recommended to use file logging (`debug` by default) and to set the `COLOSSUS_DEFAULT_LOG_LEVEL=info` env variable.
Configuring with env
Most of the parameters can be provided purely through env variables; this is the config I used for my tests:
Why packing/compression?
`PUT` requests to S3 Glacier Deep Archive have a price of `$0.05 / 1000 requests` (https://aws.amazon.com/s3/pricing/). Currently we have almost 3,000,000 data objects on Joystream mainnet, meaning we'd have to pay $150 just for the requests. By packing ~100 objects per archive, we can reduce this cost x100 (to $1.5).
There are also a few drawbacks: it makes the process more complicated, and it could raise transfer costs in case data objects are often being fetched (since we need to fetch entire archives). Some compression methods can also be computationally demanding.
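The request-cost arithmetic above can be checked directly:

```typescript
// Cost check for the figures quoted above.
const PUT_COST_PER_1000 = 0.05 // $0.05 per 1000 PUT requests (S3 Glacier Deep Archive)
const DATA_OBJECTS = 3_000_000 // approximate data object count on Joystream mainnet

function putRequestCost(requests: number): number {
  return (requests / 1000) * PUT_COST_PER_1000
}

const unpackedCost = putRequestCost(DATA_OBJECTS) // one PUT per object
const packedCost = putRequestCost(DATA_OBJECTS / 100) // ~100 objects per archive

console.log(unpackedCost, packedCost) // 150 1.5
```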