
Improve architecture for HUGE shoots #36

Open
paul-butcher opened this issue Oct 24, 2024 · 3 comments

Comments

@paul-butcher
Contributor

Some of the last remaining shoots have been so large that they need to be broken into more than just a few bundles; one consists of at least seven. This cannot be handled in a single Lambda execution, regardless of how much memory, time, and threading we make available.

It also gives a deceptive picture of how much is left to transfer, and undermines the assumption that Archivematica can cope with (roughly) a certain number of shoots per day.

Assuming it can normally cope with receiving 20 shoots, and that anything up to 30 is likely to mostly succeed, we would ask for 20 shoots to be transferred.

If we ask it to transfer 20 shoots, mostly with enough photos to make one or two packages, then we are probably looking at 20-30 bundles going to Archivematica, and that will be fine.

If, however, most of them result in three packages and some of them five or even seven, then it's going to result in over 70 reaching the target system, and most of those will fail.
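The arithmetic above can be sketched as follows. The capacity figure and per-shoot bundle counts are illustrative, taken from the numbers discussed in this comment rather than from the real pipeline:

```python
# Hypothetical sketch: estimating how many bundles a batch of shoots
# will send to Archivematica. The daily capacity and bundle counts
# are the illustrative figures from the discussion above.

ARCHIVEMATICA_DAILY_CAPACITY = 30  # bundles/day that will mostly succeed


def bundles_for_batch(bundles_per_shoot):
    """Total bundles produced by a batch of shoots."""
    return sum(bundles_per_shoot)


# 20 typical shoots, one or two packages each: within capacity.
typical = [1, 2] * 10
print(bundles_for_batch(typical))  # 30

# 20 large shoots, three to seven packages each: well over capacity.
large = [3] * 14 + [5] * 4 + [7] * 2
print(bundles_for_batch(large))  # 76 - most will fail
```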

@paul-butcher
Contributor Author

Currently, the transfer lambda deals with shoots. It should deal with packaged-up sub-shoots. Something upstream of it should deal with chopping up shoots into packages.

This can be done even before restoration, as we only need the S3 folder list and the file sizes reported in ObjectSummary. Both of these are still available when the actual data is squirrelled away in Glacier.
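A minimal sketch of that splitting step, written as a pure function over (key, size) pairs so the logic is independent of S3. In the real pipeline the pairs would come from something like boto3's `bucket.objects.filter(Prefix=...)`, which yields ObjectSummary objects whose `key` and `size` are available even when the storage class is GLACIER. The size threshold here is an assumption, not the real limit:

```python
# Hypothetical splitter: partition a shoot's files into packages using
# only keys and sizes, which S3 reports without restoring the data.
# MAX_PACKAGE_BYTES is an illustrative per-bundle limit.

MAX_PACKAGE_BYTES = 10 * 1024**3


def split_shoot(object_summaries, max_bytes=MAX_PACKAGE_BYTES):
    """Partition (key, size) pairs into packages of at most max_bytes.

    object_summaries: iterable of (key, size) tuples, e.g. derived
    from boto3 ObjectSummary objects for one shoot's prefix.
    """
    package, package_size = [], 0
    for key, size in object_summaries:
        # Start a new package when the next file would overflow this one.
        if package and package_size + size > max_bytes:
            yield package
            package, package_size = [], 0
        package.append(key)
        package_size += size
    if package:
        yield package
```

A greedy first-fit split like this keeps files in their original order, so package names such as CP_1234_001, CP_1234_002 remain contiguous slices of the shoot.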

@paul-butcher
Contributor Author

paul-butcher commented Nov 20, 2024

So the flow should look something like this:

```mermaid
flowchart LR
    Start -- CP_1234 --> Splitter
    Splitter -- CP_1234_001:
                a.tif,b.tif --> Restorer
    Splitter -- CP_1234_002:
                c.tif,d.tif,e.tif --> Restorer
    Splitter -- CP_1234_003:
                f.tif --> Restorer
    Restorer -- CP_1234_001:
                a.tif,b.tif --> Transferrer
    Restorer -- CP_1234_002:
                c.tif,d.tif,e.tif --> Transferrer
    Restorer -- CP_1234_003:
                f.tif --> Transferrer
```
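A sketch of the per-sub-shoot message that would flow along those edges. The field names are hypothetical; the point is that each downstream node receives one self-contained package, never a whole shoot:

```python
# Hypothetical message shape for the Splitter -> Restorer -> Transferrer
# flow above: one JSON message per package, named CP_1234_001 etc.
import json


def splitter_messages(shoot_id, packages):
    """Yield one message per package of keys, suffixed _001, _002, ..."""
    for i, keys in enumerate(packages, start=1):
        yield json.dumps({"sub_shoot": f"{shoot_id}_{i:03d}", "keys": keys})


for msg in splitter_messages(
    "CP_1234",
    [["a.tif", "b.tif"], ["c.tif", "d.tif", "e.tif"], ["f.tif"]],
):
    print(msg)
```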

@paul-butcher
Contributor Author

That would make all the nodes much more predictable. Currently, if a lambda fails because there are too many shoots, it can be retried, because the transferrer will ignore anything that has already been transferred. But this is an inefficient manual process, and it requires the lambda to be scaled to cope with the largest possible shoot.
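The retry behaviour described here can be sketched as a skip-if-already-sent loop. The "ledger" is an illustrative stand-in for however the transferrer actually records completed packages (e.g. a head request against the target bucket):

```python
# Hypothetical sketch of the transferrer's retry behaviour: packages
# already recorded as transferred are skipped, so a failed run can be
# retried without re-sending everything.


def transfer_pending(packages, ledger, send):
    """Send each package not already in the ledger; record successes."""
    sent = []
    for package in packages:
        if package in ledger:
            continue  # delivered on a previous (possibly failed) run
        send(package)
        ledger.add(package)
        sent.append(package)
    return sent
```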
