
End to end scheduling #32

Merged
merged 19 commits into main from failure-list-2 on Oct 17, 2024

Conversation

@paul-butcher (Contributor) commented Oct 7, 2024

What does this change?

Resolves #27

You can now kick off an end-to-end restore and transfer by placing shoots on the restorer queue.

This will restore the images from Glacier on day one, then spread their transfer across day two, in batches of 60 per day.
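
For context, restoring from Glacier amounts to issuing a bulk S3 restore request for every object in a shoot. A minimal sketch of that step, assuming boto3, with the bucket, prefix and retention period purely illustrative (they are not taken from this PR):

```python
import boto3


def request_restore(bucket: str, prefix: str, days: int = 2) -> None:
    """Ask S3 to temporarily restore every Glacier object under a prefix."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            s3.restore_object(
                Bucket=bucket,
                Key=obj["Key"],
                RestoreRequest={
                    # Keep the restored copy long enough to cover the day-two transfers.
                    "Days": days,
                    "GlacierJobParameters": {"Tier": "Bulk"},
                },
            )
```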

This also adds some Makefile targets to check what is yet to be done, in order to place shoots onto the right list.

How to test

There are currently 31 shoots on the restore_shoots_production queue. This number should go to zero tonight, and across tomorrow, all of them should be run through the transferrer (caveat).
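
To watch the queue drain, one option is to poll its approximate depth with boto3; a small sketch (the helper below is illustrative, not part of this PR):

```python
import boto3


def queue_depth(queue_name: str) -> int:
    """Return the approximate number of messages waiting on an SQS queue."""
    sqs = boto3.client("sqs")
    queue_url = sqs.get_queue_url(QueueName=queue_name)["QueueUrl"]
    attributes = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    return int(attributes["Attributes"]["ApproximateNumberOfMessages"])


# e.g. queue_depth("restore_shoots_production") should fall to zero overnight.
```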

How can we measure success?

Future transfers of editorial photography should be a one-step process, with perhaps a little mopping up of errors afterwards.

Have we considered potential risks?

The point of a lot of this is to mitigate the risk of Archivematica falling over. The two relevant lambdas are run on a schedule so that the shoots are processed at a rate that the target system can cope with.

The model relies on the restorer and transferrer being in step with one another - i.e. that on the evening of day one, Objects are restored and the transferrer queue populated, and across day two, that queue is emptied.

  • If it is not emptied, then the transferrer may start trying to operate on Objects that have gone back to cold storage.
  • If the timings are altered so that the transferrer starts less than 12 hours after the restorer, then it may try to operate on Objects that are yet to be restored.

Currently, the values are not linked in the definitions, partly because of the cron definitions, which are written manually into the TF (i.e. one is "do 60 once" and the other is "do 10, six times, evenly spaced across the available hours").
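
To make the throttling concrete, a scheduled transfer run might look roughly like the sketch below; the batch size of ten matches the rate described above, but the queue handling and the start_transfer stub are illustrative rather than the actual lambda code:

```python
import boto3

BATCH_SIZE = 10  # illustrative: at most this many shoots per scheduled run


def start_transfer(shoot_number: str) -> None:
    """Hypothetical stand-in for handing a shoot to the transferrer."""
    print(f"would start transfer for {shoot_number}")


def run_transfer_batch(queue_url: str) -> None:
    """Drain up to BATCH_SIZE messages from the transfer throttle queue.

    The EventBridge schedule invokes this several times across day two,
    keeping the overall rate within what Archivematica can cope with.
    """
    sqs = boto3.client("sqs")
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=BATCH_SIZE,  # SQS allows at most 10 per call
        WaitTimeSeconds=10,
    )
    for message in response.get("Messages", []):
        start_transfer(message["Body"])
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
```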


def post_messages(session, shoot_numbers):
    sns = session.resource("sns")
    topic = sns.Topic(f"arn:aws:sns:eu-west-1:760097843905:restore_shoots-production")
Contributor
No point in making the env configurable here, if only for testing?

Contributor Author
Could do, but there is no staging transfer throttle, which is the only way in which this matters.

If we want to test things going to staging we can do that in steps using restore.py and start_transfers.py locally.

* shifts them onto the transfer queue

The transferrer then transfers everything on its queue
[mermaid diagram]
Contributor
Ditto

* Notifies the transfer throttle queue.

Restoration takes a nondeterministic amount of time up to 12 hours
[mermaid diagram]
Contributor
Very nice! 👍

@@ -14,7 +14,7 @@ module "input_queue" {

  queue_name = "${var.action_name}-${var.environment}"

- topic_arns = [module.notification_topic.arn]
+ topic_arns = concat(var.extra_topics, [module.notification_topic.arn])
Contributor
It's not super clear why there's a "notification_topic" and also "extra_topics"

Contributor Author
This notification queue module creates an SNS/SQS pair, so the SQS is fed by the SNS (notification topic).

The Restoration->Transfer transition requires something to happen on one account and result in queue messages on the other.

It seemed to be easier and clearer (as well as the Right Thing to Do, semantically) for the source to notify its own topic and for the SQS queue to listen to that topic across the account boundary, rather than for the source to notify across the account boundary.
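
A rough sketch of that cross-account wiring, with resource names and provider aliases purely illustrative (the real module composes this differently, via extra_topics):

```hcl
# Assumes two aliased AWS providers, aws.account_a and aws.account_b,
# are declared at the top level.

# Account A: the restorer notifies its own topic.
resource "aws_sns_topic" "restored_shoots" {
  provider = aws.account_a
  name     = "restored_shoots-production"
}

# Account B: the transfer throttle queue, plus a policy allowing the
# foreign topic to deliver messages to it.
resource "aws_sqs_queue" "transfer_throttle" {
  provider = aws.account_b
  name     = "transfer_shoots-production"
}

resource "aws_sqs_queue_policy" "allow_topic" {
  provider  = aws.account_b
  queue_url = aws_sqs_queue.transfer_throttle.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "sns.amazonaws.com" }
      Action    = "sqs:SendMessage"
      Resource  = aws_sqs_queue.transfer_throttle.arn
      Condition = { ArnEquals = { "aws:SourceArn" = aws_sns_topic.restored_shoots.arn } }
    }]
  })
}

# Subscribe the queue to the topic across the account boundary
# (the topic's access policy must also allow account B to subscribe).
resource "aws_sns_topic_subscription" "restored_to_throttle" {
  provider  = aws.account_b
  topic_arn = aws_sns_topic.restored_shoots.arn
  protocol  = "sqs"
  endpoint  = aws_sqs_queue.transfer_throttle.arn
}
```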


module "transfer_scheduler" {
source = "../lambda_scheduler"
cron = "cron(30 7,9,11,13,15,16 ? * MON-FRI *)"
description = "Restore a batch of shoots in the evening so they are ready to be transferred in the morning"
Contributor
"Moves batches of shoots to the transferrer at a rate Archivematica can handle"?

Contributor Author
whoops CTRL-C CTRL-V

@@ -0,0 +1,7 @@
terraform {
@agnesgaroux (Contributor) Oct 15, 2024
I think the provider can be declared once in the top-level provider.tf

Contributor Author
This is because this whole TF operates over two accounts. This allows us to have both accounts at the top level and pass the right one down into each module.
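
For illustration, the pattern being described is roughly the following, with the aliases and module names made up for the sketch:

```hcl
provider "aws" {
  alias  = "account_a"
  region = "eu-west-1"
  # credentials / assume_role for the account holding the Glacier objects
}

provider "aws" {
  alias  = "account_b"
  region = "eu-west-1"
  # credentials / assume_role for the account running the transfers
}

module "restorer" {
  source = "./restorer"
  providers = {
    aws = aws.account_a
  }
}

module "transferrer" {
  source = "./transferrer"
  providers = {
    aws = aws.account_b
  }
}
```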

@agnesgaroux (Contributor) left a comment
A few nit-picky comments and questions but LGTM 👍 Nice piece of automated optimisation!

@agnesgaroux (Contributor)

I forgot: can you add something in a prominent place to explain how to turn the scheduling on and off again once everything has been transferred? I assume we don't want it to run all year round

@paul-butcher merged commit 8902b09 into main on Oct 17, 2024
4 checks passed
@paul-butcher deleted the failure-list-2 branch on October 17, 2024 at 09:14
Successfully merging this pull request may close these issues.

Trigger restoration and transfer from Eventbridge