Colossus archive script #5193

Merged · 18 commits · Dec 12, 2024

Conversation

@Lezek123 (Contributor) commented on Oct 25, 2024

Addresses: #5188

Overview

The archive mode is a new operating mode for Colossus.
When running in archive mode Colossus will:

  • sync/download all data objects regardless of which bucket they are assigned to,
  • pack them into archives according to a set of predefined rules (optionally using 7zip or zstd for additional compression),
  • upload them to an S3 bucket of choice,
  • try to limit local storage usage by removing already-uploaded objects and archives and by throttling the download rate according to the specified limits.

No external API is exposed in archive mode.

Essential Parameters

Filesystem

  • uploadQueueDir: Directory for storing:
    • fully downloaded data objects ready to be packed into archives (removed after successful upload)
    • archives and compression artifacts (removed after successful upload)
    • objects_trackfile - a file which tracks already downloaded data objects to avoid downloading them again
    • archives_trackfile.jsonl - a file which keeps track of all uploaded archives and the data objects they contain. Since this is the only source of this information, a copy of it is also periodically uploaded to S3 every --archiveTrackfileBackupFreqMinutes minutes (a hypothetical entry is shown after the flag list below).
  • tmpDownloadDir: Temporary directory for storing downloads in progress.

CLI flags:

--uploadQueueDir=<PATH>
--tmpDownloadDir=<PATH>          # Directory for temporary downloads
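
For illustration, a single line of archives_trackfile.jsonl might look roughly like this (the field names are hypothetical, chosen for the example rather than taken from the actual implementation):

{"archive": "archive_42.tar.zst", "dataObjects": ["1234", "1235", "1236"]}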

S3 bucket config

S3 support was based on #5175

CLI flags:

--awsS3BucketRegion=<REGION>
--awsS3BucketName=<NAME>
--awsStorageClass=<CLASS>

ENV variables:

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
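
For example, the credentials can be exported before launching the node (placeholder values - substitute your own):

export AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx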

Upload triggers

There are 3 parameters that control when the compression and upload flow is triggered:

  • --localCountTriggerThreshold - objects are packed/compressed and uploaded to S3 once their number (in the local directory) reaches this threshold.
  • --localSizeTriggerThresholdMB - objects are packed/compressed and uploaded to S3 once their total size (in the local directory) reaches this threshold.
  • --localAgeTriggerThresholdMinutes - objects are packed/compressed and uploaded to S3 once the oldest of them was downloaded more than localAgeTriggerThresholdMinutes minutes ago.

CLI flags:

--localCountTriggerThreshold=<N>
--localSizeTriggerThresholdMB=<MB>
--localAgeTriggerThresholdMinutes=<MIN>
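
Assuming the three thresholds act independently (whichever is reached first triggers the packing/upload flow), a combined configuration might look like this (values are purely illustrative):

# Pack & upload once 1000 objects, 10 GB, or a 24h-old object accumulates locally
--localCountTriggerThreshold=1000 \
--localSizeTriggerThresholdMB=10000 \
--localAgeTriggerThresholdMinutes=1440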

Size limits

  • --archiveFileSizeLimitMB - specifies the desired size limit of the archives. This is a soft limit: the actual archives may be bigger depending on the size of the data objects (for example, if the limit is 1 GB but a 2 GB data object is being packed, the resulting archive will still exceed the limit). Generally, lowering the limit results in a larger number of smaller archives, and raising it results in a smaller number of larger archives.
  • --uploadQueueDirSizeLimitMB - specifies the desired limit of the upload directory size. To leave a safe margin of error (for compression etc.), it should be set to ~50% of the available disk space.

CLI flags:

--uploadQueueDirSizeLimitMB=<MB>
--archiveFileSizeLimitMB=<MB>
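
As a worked example, on a node with 100 GB of free disk space the recommended ~50% margin gives:

# 100 GB free disk -> ~50 GB queue limit leaves room for compression artifacts
--uploadQueueDirSizeLimitMB=50000
# Soft limit of ~1 GB per archive (illustrative value)
--archiveFileSizeLimitMB=1000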

Performance Tuning

CLI flags:

--uploadWorkersNumber=<N>
--syncWorkersNumber=<N>
--syncInterval=<MIN>
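
For illustration, the values below mirror the env config shown later in this description (8 workers for each pool, a 60-minute sync interval):

--uploadWorkersNumber=8 \
--syncWorkersNumber=8 \
--syncInterval=60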

Compression

--compressionThreads=<N>
--compressionAlgorithm=<ALG>
--compressionLevel=<LVL>
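
For example, to enable multi-threaded zstd compression (note: zstd as the literal flag value is an assumption based on the algorithms mentioned in the overview; only none is demonstrated in this description):

--compressionAlgorithm=zstd \
--compressionThreads=8 \
--compressionLevel=9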

Logging

--statsLoggingInterval=<MIN>

Usage Example

# Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY env...

storage-node archive \
  --worker=123 \
  --uploadQueueDir=/data/uploads \
  --tmpDownloadDir=/data/temp \
  --awsS3BucketRegion=us-east-1 \
  --awsS3BucketName=my-bucket \
  --awsStorageClass=DEEP_ARCHIVE \
  --localSizeTriggerThresholdMB=20000 \
  --uploadQueueDirSizeLimitMB=50000 \
  --compressionAlgorithm=none

Archive service loop

The loop executed by the archive service consists of the following steps:

  1. Data Integrity Check
    • Verifies uploadQueueDir contents and:
      • Removes corrupted data (data objects in undetermined / conflicting state)
      • Removes .tmp.* archives (they may be left over if compression failed at some point; in that case the objects will be re-downloaded and compression will be re-attempted later)
      • Re-schedules valid archives for upload if not already uploaded
      • Removes already uploaded archives if not yet removed for some reason
  2. Sync Stage
    • During this stage all of the not-yet-synced data objects will be fetched
    • The downloads are paused when upload directory size + size of objects in download queue is approaching --uploadQueueDirSizeLimitMB, in order to avoid overflowing the disk space (since downloads may be faster than uploads)
    • Compression and uploads can be triggered during this stage and will happen in parallel to the downloads as soon as some of the upload trigger thresholds are reached.
    • The stage finishes when download attempts for all objects that exist in the runtime but not in objects_trackfile have been finalized and there are no other pending tasks (such as uploads in progress).
  3. Final thresholds check stage
    • Checks the upload thresholds one last time, mostly to verify whether localAgeTriggerThresholdMinutes has been reached, and if it has - triggers the compression & uploads.
  4. Idle Stage
    • Waits for configured --syncInterval before next cycle

Logging

It's recommended to use file logging (debug level by default) and to set the COLOSSUS_DEFAULT_LOG_LEVEL=info env variable.

Configuring with env

Most of the parameters can be provided purely through env variables; this is the config I used for my tests:

COLOSSUS_DEFAULT_LOG_LEVEL=info
WORKER_ID=17

UPLOAD_QUEUE_DIR=/data/upload_dir
TMP_DOWNLOAD_DIR=/data/temp_dir
LOG_FILE_PATH=/logs

LOCAL_SIZE_TRIGGER_THRESHOLD_MB=5000
ARCHIVE_FILE_SIZE_LIMIT_MB=500
UPLOAD_QUEUE_DIR_SIZE_LIMIT=15000

UPLOAD_WORKERS_NUMBER=8
SYNC_WORKERS_NUMBER=8
SYNC_INTERVAL_MINUTES=60
ARCHIVE_TRACKFILE_BACKUP_FREQ_MINUTES=60

COMPRESSION_ALGORITHM=none

LOCALSTACK_ENABLED=false
AWS_REGION=eu-central-1
AWS_BUCKET_NAME=joystream.storage
AWS_ACCESS_KEY_ID={MY_LOCALSTACK_KEY_ID}
AWS_SECRET_ACCESS_KEY={MY_LOCALSTACK_ACCESS_KEY}
AWS_NODEJS_CONNECTION_REUSE_ENABLED=1
AWS_STORAGE_CLASS=STANDARD

STATS_LOGGING_INTERVAL=5

Why packing/compression?

  1. PUT requests to S3 Glacier Deep Archive cost $0.05 per 1,000 requests (https://aws.amazon.com/s3/pricing/). Currently there are almost 3,000,000 data objects on Joystream mainnet, meaning we'd have to pay $150 just for the requests. By packing ~100 objects per archive, we can reduce this cost ~100x, to $1.50 (see the worked arithmetic below this list).
  2. Compression allows us to reduce the size of stored data by a few percent. We have 100 TB of data on Joystream right now. Each saved TB is another $1 / month.
  3. As the number and total size of objects in the storage system will keep growing, the benefits from using packing/compression will be even more pronounced.
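
To make the request-cost arithmetic from point 1 explicit:

3,000,000 objects * ($0.05 / 1,000 PUTs)    = $150.00  (one PUT per object)
3,000,000 objects / 100 objects per archive = 30,000 archives
30,000 PUTs * ($0.05 / 1,000 PUTs)          = $1.50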

There are also a few drawbacks: it makes the process more complicated, and it could raise transfer costs if data objects are fetched often (since entire archives need to be fetched). Some compression methods can also be computationally demanding.

@freakstatic (Contributor) commented:

Great work 👍

Is the DISABLE_BUCKET_AUTH env variable being used? I couldn't find any reference to it in the code...

@Lezek123 (Contributor, Author) replied:

Thank you.
DISABLE_BUCKET_AUTH env is no longer used.
I updated the description.

@freakstatic (Contributor) left a review comment:

LGTM!

@Lezek123 Lezek123 requested review from mnaamani and removed request for mnaamani December 12, 2024 09:22
@mnaamani mnaamani merged commit 9308382 into Joystream:master Dec 12, 2024
23 checks passed
@Lezek123 Lezek123 mentioned this pull request Jan 8, 2025