Bring artefact archive to production level #590

spbnick opened this issue Oct 23, 2024 · 1 comment
spbnick commented Oct 23, 2024

At the moment we have a prototype artifact archive working:

  • We extract lists of files to (potentially) archive from the incoming I/O data and post them to a message queue.
  • We take the lists of files off the queue, download every 256th file under 5 MB in size, and upload each into a private GCS (S3) bucket, using a hash of the file's URL as its object name (see the first sketch after this list).
  • We have a cloud function that responds to HTTP requests specifying the artifact URL to fetch. If it finds the artifact in the archive, it responds with a redirect to a "signed URL", which is only valid for a few seconds. The execution restrictions put on this function let us control the download rate, and thus our costs, indirectly (see the second sketch after this list).
  • We have a client class for operations on the archive.
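
A minimal sketch of the sampling and upload step described above, assuming the requests and google-cloud-storage libraries; the bucket name, constants, and helper names are illustrative, not taken from the actual code:

```python
import hashlib

import requests
from google.cloud import storage

ARCHIVE_BUCKET = "kcidb-artifact-archive"   # hypothetical bucket name
DIVISOR = 256                               # archive every 256th file
MAX_SIZE = 5 * 1024 * 1024                  # skip files over 5 MB


def url_hash(url):
    """Hash an artifact URL into the object name used in the bucket."""
    return hashlib.sha256(url.encode()).hexdigest()


def archive_url(url, bucket):
    """Archive one artifact URL, if sampled in and small enough.

    Return True if the file was uploaded, False if it was skipped.
    """
    # Deterministic sampling: hashing the URL keeps a file's in/out
    # status stable, and raising the load is just lowering DIVISOR
    if int(url_hash(url), 16) % DIVISOR:
        return False
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    if len(response.content) > MAX_SIZE:
        return False
    bucket.blob(url_hash(url)).upload_from_string(response.content)
    return True


# Usage: process one list of URLs taken off the queue
# bucket = storage.Client().bucket(ARCHIVE_BUCKET)
# for url in file_urls:
#     archive_url(url, bucket)
```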
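A sketch of the redirecting cloud function, assuming Python functions-framework and V4 signed URLs; the request parameter name and bucket name are again made up for illustration:

```python
import hashlib
from datetime import timedelta

import functions_framework
from google.cloud import storage

BUCKET = storage.Client().bucket("kcidb-artifact-archive")  # hypothetical


@functions_framework.http
def fetch_artifact(request):
    """Redirect to a short-lived signed URL for the requested artifact."""
    url = request.args.get("url")   # hypothetical parameter name
    if not url:
        return "Missing 'url' parameter", 400
    blob = BUCKET.blob(hashlib.sha256(url.encode()).hexdigest())
    if not blob.exists():
        return "Artifact not archived", 404
    # Signing requires service-account credentials with signBlob access
    signed_url = blob.generate_signed_url(
        version="v4",
        expiration=timedelta(seconds=10),   # valid for a few seconds
        method="GET",
    )
    return "", 302, {"Location": signed_url}
```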

To get to production grade, we need the following:

  • Decide on an archival policy: do one iteration of research to determine which kinds of artifacts we need to (and can) store, and why. Some preliminary research was done before starting this implementation, which could be useful. Implement the corresponding artifact filtering in the code (a possible filter shape is sketched after this list).
  • Decide on a retention policy: how long to store artifacts (depending on their properties), in which storage class, and when to delete them altogether. Both this and the archival policy should be driven by user expectations, the needs of the triaging system, and costs. Implement it in the code by configuring the storage policy on GCS deployment/update (see the lifecycle sketch after this list).
  • Gradually increase the archival load from every 256th file to every 128th, every 64th, and so on, until we archive every applicable file.
  • Re-engineer the deployment to handle the load as necessary.
  • Have Maestro send direct links (not links to folders or the WebUI) to uncompressed (or transport-compressed) files.
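
For the archival-policy item, the filtering could reduce to a predicate over an artifact's URL and size; a hypothetical shape, with every criterion invented for illustration:

```python
# Hypothetical filter: the actual suffixes, size limit, and criteria
# are exactly what the archival-policy research should decide
ARCHIVED_SUFFIXES = (".log", ".txt", "dmesg", "config")
MAX_SIZE = 5 * 1024 * 1024


def should_archive(url, size):
    """Decide whether an artifact is covered by the archival policy."""
    if size is not None and size > MAX_SIZE:
        return False
    return url.lower().endswith(ARCHIVED_SUFFIXES)
```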
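For the retention item, GCS lifecycle rules can express both the storage-class transition and the eventual deletion, applied at deployment/update time; a sketch using the google-cloud-storage client, with made-up ages:

```python
from google.cloud import storage


def configure_retention(bucket_name):
    """Apply a (hypothetical) retention policy to the archive bucket."""
    bucket = storage.Client().get_bucket(bucket_name)
    # Move objects to cheaper storage after 30 days (age is a placeholder)
    bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=30)
    # Delete objects outright after 180 days (also a placeholder)
    bucket.add_lifecycle_delete_rule(age=180)
    bucket.patch()
```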