
End to end scheduling #32

Merged
merged 19 commits into main from failure-list-2 on Oct 17, 2024

Conversation

@paul-butcher (Contributor) commented Oct 7, 2024

What does this change?

Resolves #27

You can now kick off an end-to-end restore and transfer by placing shoots on the restorer queue.

This will restore the images from Glacier on day one, then spread their transfer across day two, in batches of 60 per day.
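
For context, restoring from Glacier amounts to issuing a bulk S3 restore request for every object in a shoot. A minimal sketch of that step, assuming boto3, with the bucket, prefix and retention period purely illustrative (they are not taken from this PR):

```python
import boto3


def request_restore(bucket: str, prefix: str, days: int = 2) -> None:
    """Ask S3 to temporarily restore every Glacier object under a prefix."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            s3.restore_object(
                Bucket=bucket,
                Key=obj["Key"],
                RestoreRequest={
                    # Keep the restored copy long enough to cover the day-two transfers.
                    "Days": days,
                    "GlacierJobParameters": {"Tier": "Bulk"},
                },
            )
```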

This also adds some Makefile targets to check what is yet to be done, in order to place shoots onto the right list.

How to test

There are currently 31 shoots on the restore_shoots_production queue. This number should go to zero tonight, and across tomorrow, all of them should be run through the transferrer (caveat).
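
To watch the queue drain, one option is to poll its approximate depth with boto3; a small sketch (the helper below is illustrative, not part of this PR):

```python
import boto3


def queue_depth(queue_name: str) -> int:
    """Return the approximate number of messages waiting on an SQS queue."""
    sqs = boto3.client("sqs")
    queue_url = sqs.get_queue_url(QueueName=queue_name)["QueueUrl"]
    attributes = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    return int(attributes["Attributes"]["ApproximateNumberOfMessages"])


# e.g. queue_depth("restore_shoots_production") should fall to zero overnight.
```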

How can we measure success?

Future transfers of editorial photography should be a one-step process, with perhaps a little mopping up of errors afterwards.

Have we considered potential risks?

The point of a lot of this is to mitigate the risk of Archivematica falling over. The two relevant lambdas are run on a schedule so that the shoots are processed at a rate that the target system can cope with.

The model relies on the restorer and transferrer being in step with one another - i.e. that on the evening of day one, Objects are restored and the transferrer queue populated, and across day two, that queue is emptied.

  • If it is not emptied, then the transferrer may start trying to operate on Objects that have gone back to cold storage.
  • If the timings are altered so that the transferrer starts less than 12 hours after the restorer, then it may try to operate on Objects that are yet to be restored.

Currently, the values are not linked in the definitions, partly because of the cron definitions, which are written manually into the TF (i.e. one is "do 60 once" and the other is "do 10, six times, evenly spaced across the available hours").
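
To make the throttling concrete, a scheduled transfer run might look roughly like the sketch below; the batch size of ten matches the rate described above, but the queue handling and the start_transfer stub are illustrative rather than the actual lambda code:

```python
import boto3

BATCH_SIZE = 10  # illustrative: at most this many shoots per scheduled run


def start_transfer(shoot_number: str) -> None:
    """Hypothetical stand-in for handing a shoot to the transferrer."""
    print(f"would start transfer for {shoot_number}")


def run_transfer_batch(queue_url: str) -> None:
    """Drain up to BATCH_SIZE messages from the transfer throttle queue.

    The EventBridge schedule invokes this several times across day two,
    keeping the overall rate within what Archivematica can cope with.
    """
    sqs = boto3.client("sqs")
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=BATCH_SIZE,  # SQS allows at most 10 per call
        WaitTimeSeconds=10,
    )
    for message in response.get("Messages", []):
        start_transfer(message["Body"])
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
```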


def post_messages(session, shoot_numbers):
    sns = session.resource("sns")
    topic = sns.Topic(f"arn:aws:sns:eu-west-1:760097843905:restore_shoots-production")
Contributor
No point in making the env configurable here, if only for testing?

Contributor Author
Could do, but there is no staging transfer throttle, which is the only way in which this matters.

If we want to test things going to staging we can do that in steps using restore.py and start_transfers.py locally.

* shifts them onto the transfer queue

The transferrer then transfers everything on its queue
[mermaid diagram]
Contributor
Ditto

* Notifies the transfer throttle queue.

Restoration takes a nondeterministic amount of time up to 12 hours
[mermaid diagram]
Contributor
Very nice! 👍

@@ -14,7 +14,7 @@ module "input_queue" {

  queue_name = "${var.action_name}-${var.environment}"

- topic_arns = [module.notification_topic.arn]
+ topic_arns = concat(var.extra_topics, [module.notification_topic.arn])
Contributor
It's not super clear why there's a "notification_topic" and also "extra_topics"

Contributor Author
This notification queue module creates an SNS/SQS pair, so the SQS is fed by the SNS (notification topic).

The Restoration->Transfer transition requires something to happen on one account and result in queue messages on the other.

It seemed to be easier and clearer (as well as the Right Thing to Do, semantically) for the source to notify its own topic and for the SQS queue to listen to that topic across the account boundary, rather than for the source to notify across the account boundary.
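
A rough sketch of that cross-account wiring, with resource names and provider aliases purely illustrative (the real module composes this differently, via extra_topics):

```hcl
# Assumes two aliased AWS providers, aws.account_a and aws.account_b,
# are declared at the top level.

# Account A: the restorer notifies its own topic.
resource "aws_sns_topic" "restored_shoots" {
  provider = aws.account_a
  name     = "restored_shoots-production"
}

# Account B: the transfer throttle queue, plus a policy allowing the
# foreign topic to deliver messages to it.
resource "aws_sqs_queue" "transfer_throttle" {
  provider = aws.account_b
  name     = "transfer_shoots-production"
}

resource "aws_sqs_queue_policy" "allow_topic" {
  provider  = aws.account_b
  queue_url = aws_sqs_queue.transfer_throttle.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "sns.amazonaws.com" }
      Action    = "sqs:SendMessage"
      Resource  = aws_sqs_queue.transfer_throttle.arn
      Condition = { ArnEquals = { "aws:SourceArn" = aws_sns_topic.restored_shoots.arn } }
    }]
  })
}

# Subscribe the queue to the topic across the account boundary
# (the topic's access policy must also allow account B to subscribe).
resource "aws_sns_topic_subscription" "restored_to_throttle" {
  provider  = aws.account_b
  topic_arn = aws_sns_topic.restored_shoots.arn
  protocol  = "sqs"
  endpoint  = aws_sqs_queue.transfer_throttle.arn
}
```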


module "transfer_scheduler" {
source = "../lambda_scheduler"
cron = "cron(30 7,9,11,13,15,16 ? * MON-FRI *)"
description = "Restore a batch of shoots in the evening so they are ready to be transferred in the morning"
Contributor
"Moves batches of shoots to the transferrer at a rate Archivematica can handle"?

Contributor Author
whoops CTRL-C CTRL-V

@@ -0,0 +1,7 @@
terraform {
@agnesgaroux (Contributor) Oct 15, 2024
I think the provider can be declared once in the top-level provider.tf

Contributor Author
This is because this whole TF operates over two accounts. This allows us to have both accounts at the top level and pass the right one down into each module.
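
For illustration, the pattern being described is roughly the following, with the aliases and module names made up for the sketch:

```hcl
provider "aws" {
  alias  = "account_a"
  region = "eu-west-1"
  # credentials / assume_role for the account holding the Glacier objects
}

provider "aws" {
  alias  = "account_b"
  region = "eu-west-1"
  # credentials / assume_role for the account running the transfers
}

module "restorer" {
  source = "./restorer"
  providers = {
    aws = aws.account_a
  }
}

module "transferrer" {
  source = "./transferrer"
  providers = {
    aws = aws.account_b
  }
}
```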

@agnesgaroux (Contributor) left a comment
A few nit-picky comments and questions but LGTM 👍 Nice piece of automated optimisation!

@agnesgaroux (Contributor)

I forgot: can you add something in a prominent place to explain how to turn the scheduling on and off again once everything has been transferred? I assume we don't want it to run all year round

@paul-butcher merged commit 8902b09 into main on Oct 17, 2024
4 checks passed
@paul-butcher deleted the failure-list-2 branch on October 17, 2024 at 09:14
Successfully merging this pull request may close these issues.

Trigger restoration and transfer from Eventbridge