Colossus archive script #5193

Merged · 18 commits · Dec 12, 2024

Conversation

@Lezek123 (Contributor) commented on Oct 25, 2024

Addresses: #5188

Overview

The archive mode is a new operating mode for Colossus.
When running in archive mode Colossus will:

  • sync/download all data objects regardless of which bucket they are assigned to,
  • pack them into archives according to a set of predefined rules (optionally using 7zip or zstd for additional compression),
  • upload them to an S3 bucket of choice,
  • try to limit local storage usage by removing already-uploaded objects and archives and by throttling the download rate according to the specified limits.

No external API is exposed in archive mode.

Essential Parameters

Filesystem

  • uploadQueueDir: Directory for storing:
    • fully downloaded data objects ready to be packed into archives (removed after successful upload)
    • archives and compression artifacts (removed after successful upload)
    • objects_trackfile - a file which tracks already downloaded data objects to avoid downloading them again
    • archives_trackfile.jsonl - a file which keeps track of all uploaded archives and the data objects they contain. Since this is the only source of this information, a copy of it is also periodically uploaded to S3 every --archiveTrackfileBackupFreqMinutes minutes (a hypothetical entry is shown after the flag list below).
  • tmpDownloadDir: Temporary directory for storing downloads in progress.

CLI flags:

--uploadQueueDir=<PATH>
--tmpDownloadDir=<PATH>          # Directory for temporary downloads
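
For illustration, a single line of archives_trackfile.jsonl might look roughly like this (the field names are hypothetical, chosen for the example rather than taken from the actual implementation):

{"archive": "archive_42.tar.zst", "dataObjects": ["1234", "1235", "1236"]}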

S3 bucket config

S3 support was based on #5175

CLI flags:

--awsS3BucketRegion=<REGION>
--awsS3BucketName=<NAME>
--awsStorageClass=<CLASS>

ENV variables:

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
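
For example, the credentials can be exported before launching the node (placeholder values - substitute your own):

export AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx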

Upload triggers

There are 3 parameters that control when the compression and upload flow is triggered:

  • --localCountTriggerThreshold - objects are packed/compressed and uploaded to S3 once their number (in the local directory) reaches this threshold.
  • --localSizeTriggerThresholdMB - objects are packed/compressed and uploaded to S3 once their total size (in the local directory) reaches this threshold.
  • --localAgeTriggerThresholdMinutes - objects are packed/compressed and uploaded to S3 once the oldest of them was downloaded more than localAgeTriggerThresholdMinutes minutes ago.

CLI flags:

--localCountTriggerThreshold=<N>
--localSizeTriggerThresholdMB=<MB>
--localAgeTriggerThresholdMinutes=<MIN>
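
Assuming the three thresholds act independently (whichever is reached first triggers the packing/upload flow), a combined configuration might look like this (values are purely illustrative):

# Pack & upload once 1000 objects, 10 GB, or a 24h-old object accumulates locally
--localCountTriggerThreshold=1000 \
--localSizeTriggerThresholdMB=10000 \
--localAgeTriggerThresholdMinutes=1440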

Size limits

  • --archiveFileSizeLimitMB - specifies the desired size limit of the archives. This is a soft limit: the actual archives may be bigger depending on the size of the data objects (for example, if the limit is 1 GB but a 2 GB data object is being packed, the resulting archive will still exceed the limit). Generally, lowering the limit results in a larger number of smaller archives, and raising it results in a smaller number of larger archives.
  • --uploadQueueDirSizeLimitMB - specifies the desired limit of the upload directory size. To leave a safe margin of error (for compression etc.), it should be set to ~50% of the available disk space.

CLI flags:

--uploadQueueDirSizeLimitMB=<MB>
--archiveFileSizeLimitMB=<MB>
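
As a worked example, on a node with 100 GB of free disk space the recommended ~50% margin gives:

# 100 GB free disk -> ~50 GB queue limit leaves room for compression artifacts
--uploadQueueDirSizeLimitMB=50000
# Soft limit of ~1 GB per archive (illustrative value)
--archiveFileSizeLimitMB=1000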

Performance Tuning

CLI flags:

--uploadWorkersNumber=<N>
--syncWorkersNumber=<N>
--syncInterval=<MIN>
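
For illustration, the values below mirror the env config shown later in this description (8 workers for each pool, a 60-minute sync interval):

--uploadWorkersNumber=8 \
--syncWorkersNumber=8 \
--syncInterval=60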

Compression

--compressionThreads=<N>
--compressionAlgorithm=<ALG>
--compressionLevel=<LVL>
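
For example, to enable multi-threaded zstd compression (note: zstd as the literal flag value is an assumption based on the algorithms mentioned in the overview; only none is demonstrated in this description):

--compressionAlgorithm=zstd \
--compressionThreads=8 \
--compressionLevel=9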

Logging

--statsLoggingInterval=<MIN>

Usage Example

# Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY env...

storage-node archive \
  --worker=123 \
  --uploadQueueDir=/data/uploads \
  --tmpDownloadDir=/data/temp \
  --awsS3BucketRegion=us-east-1 \
  --awsS3BucketName=my-bucket \
  --awsStorageClass=DEEP_ARCHIVE \
  --localSizeTriggerThresholdMB=20000 \
  --uploadQueueDirSizeLimitMB=50000 \
  --compressionAlgorithm=none

Archive service loop

The loop executed by the archive service consists of the following steps:

  1. Data Integrity Check
    • Verifies uploadQueueDir contents and:
      • Removes corrupted data (data objects in undetermined / conflicting state)
      • Removes .tmp.* archives (they may be left over if compression failed at some point; in that case the objects will be re-downloaded and compression will be re-attempted later)
      • Re-schedules valid archives for upload if not already uploaded
      • Removes already uploaded archives if not yet removed for some reason
  2. Sync Stage
    • During this stage all of the not-yet-synced data objects will be fetched
    • The downloads are paused when upload directory size + size of objects in download queue is approaching --uploadQueueDirSizeLimitMB, in order to avoid overflowing the disk space (since downloads may be faster than uploads)
    • Compression and uploads can be triggered during this stage and will happen in parallel to the downloads as soon as some of the upload trigger thresholds are reached.
    • The stage finishes when download attempts for all objects that exist in the runtime but not in objects_trackfile have been finalized and there are no other pending tasks (such as uploads in progress).
  3. Final thresholds check stage
    • Checks the upload thresholds one last time, mostly to verify whether localAgeTriggerThresholdMinutes has been reached, and if it has - triggers the compression & uploads.
  4. Idle Stage
    • Waits for configured --syncInterval before next cycle

Logging

It's recommended to use file logging (debug level by default) and to set the COLOSSUS_DEFAULT_LOG_LEVEL=info env variable.

Configuring with env

Most of the parameters can be provided purely through env variables; this is the config I used for my tests:

COLOSSUS_DEFAULT_LOG_LEVEL=info
WORKER_ID=17

UPLOAD_QUEUE_DIR=/data/upload_dir
TMP_DOWNLOAD_DIR=/data/temp_dir
LOG_FILE_PATH=/logs

LOCAL_SIZE_TRIGGER_THRESHOLD_MB=5000
ARCHIVE_FILE_SIZE_LIMIT_MB=500
UPLOAD_QUEUE_DIR_SIZE_LIMIT=15000

UPLOAD_WORKERS_NUMBER=8
SYNC_WORKERS_NUMBER=8
SYNC_INTERVAL_MINUTES=60
ARCHIVE_TRACKFILE_BACKUP_FREQ_MINUTES=60

COMPRESSION_ALGORITHM=none

LOCALSTACK_ENABLED=false
AWS_REGION=eu-central-1
AWS_BUCKET_NAME=joystream.storage
AWS_ACCESS_KEY_ID={MY_LOCALSTACK_KEY_ID}
AWS_SECRET_ACCESS_KEY={MY_LOCALSTACK_ACCESS_KEY}
AWS_NODEJS_CONNECTION_REUSE_ENABLED=1
AWS_STORAGE_CLASS=STANDARD

STATS_LOGGING_INTERVAL=5

Why packing/compression?

  1. PUT requests to S3 Glacier Deep Archive cost $0.05 per 1,000 requests (https://aws.amazon.com/s3/pricing/). Currently there are almost 3,000,000 data objects on Joystream mainnet, meaning we'd have to pay $150 just for the requests. By packing ~100 objects per archive, we can reduce this cost ~100x, to $1.50 (see the worked arithmetic below this list).
  2. Compression allows us to reduce the size of stored data by a few percent. We have 100 TB of data on Joystream right now. Each saved TB is another $1 / month.
  3. As the number and total size of objects in the storage system will keep growing, the benefits from using packing/compression will be even more pronounced.
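
To make the request-cost arithmetic from point 1 explicit:

3,000,000 objects * ($0.05 / 1,000 PUTs)    = $150.00  (one PUT per object)
3,000,000 objects / 100 objects per archive = 30,000 archives
30,000 PUTs * ($0.05 / 1,000 PUTs)          = $1.50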

There are also a few drawbacks: it makes the process more complicated, and it could raise transfer costs if data objects are fetched often (since entire archives need to be fetched). Some compression methods can also be computationally demanding.

@freakstatic (Contributor) commented:

Great work 👍

Is the DISABLE_BUCKET_AUTH env variable being used? I couldn't find any reference to it in the code...

@Lezek123 (Contributor, Author) replied:

Thank you.
DISABLE_BUCKET_AUTH env is no longer used.
I updated the description.

@freakstatic (Contributor) left a review comment:

LGTM!

@Lezek123 Lezek123 requested review from mnaamani and removed request for mnaamani December 12, 2024 09:22
@mnaamani mnaamani merged commit 9308382 into Joystream:master Dec 12, 2024
23 checks passed
@Lezek123 Lezek123 mentioned this pull request Jan 8, 2025