Parallel downloads #12

paul-butcher · 2024-06-14T13:48:17Z

What does this change?

This adds the ability to pull multiple S3 objects in parallel.

resolves #11

How to test

Using an AWS profile with permission to download from wellcomecollection-editorial-photography...

PYTHONPATH=src python src/transferrer/download.py PBBD_TEST

The log output should show two "downloading" lines, followed by two "downloaded" lines:

INFO:__main__:downloading	ST/PB_BD_TEST/PB_TEST_001.tif
INFO:__main__:downloading	ST/PB_BD_TEST/PB_TEST_002.tif
INFO:__main__:downloaded	ST/PB_BD_TEST/PB_TEST_001.tif
INFO:__main__:downloaded	ST/PB_BD_TEST/PB_TEST_002.tif
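
The ordering above (both "downloading" lines before either "downloaded" line) is what you would expect when the fetches are handed to a thread pool. Below is a minimal sketch of that pattern, not the code in this PR: `download_one`, `download_all` and the injected `fetch` callable are hypothetical names chosen for illustration, and the default of 10 workers is an assumption.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger(__name__)


def download_one(fetch, key):
    """Fetch a single object, logging before and after (hypothetical helper)."""
    logger.info("downloading\t%s", key)
    fetch(key)
    logger.info("downloaded\t%s", key)
    return key


def download_all(fetch, keys, max_workers=10):
    """Run fetch(key) for every key on a bounded thread pool.

    max_workers=10 is an assumed default, not a value taken from the PR.
    Results come back in the same order as the input keys.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda key: download_one(fetch, key), keys))
```

In real use, `fetch` might wrap a boto3 call such as `s3_client.download_file(bucket, key, local_path)`; passing it in as a callable keeps the sketch runnable without AWS credentials.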

How can we measure success?

Downloads of S3 folders should be faster. However, this is a premature optimisation: I don't have a real-world example (i.e. running on Lambda) to compare it against. Running on my computer, fetching EPOPTEST (not currently available), it seemed to shave off a few seconds.

Even so, I expect this to make it possible for us to tune the performance of the process.

Have we considered potential risks?

This adds a little complexity, and running downloads in parallel could cause the process to use significantly more memory and possibly fail. To mitigate this, the thread pool is constrained to a relatively low number (0.01 of the max threads available in Lambda).

See https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/run-parallel-reads-of-s3-objects-by-using-python-in-an-aws-lambda-function.html
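
As a rough illustration of the sizing described above, here is a sketch of how such a limit could be derived and passed to the S3 client's connection pool. It rests on assumptions: the figure of 1,024 is the Lambda quota for processes/threads, the names `worker_count` and `make_s3_client` are hypothetical, and the actual constant used in this PR may differ.

```python
# Assumption: Lambda's quota for execution processes/threads is 1,024.
LAMBDA_MAX_THREADS = 1024
# The fraction quoted in the risk section of this description.
POOL_FRACTION = 0.01


def worker_count(max_threads=LAMBDA_MAX_THREADS, fraction=POOL_FRACTION):
    """Return the thread pool size: a fraction of the limit, never below 1."""
    return max(1, int(max_threads * fraction))


def make_s3_client(pool_size):
    """Build an S3 client whose HTTP connection pool matches the thread pool.

    Hypothetical helper: the imports live inside the function so the sizing
    logic above can be exercised without boto3 installed.
    """
    import boto3
    from botocore.config import Config

    return boto3.client("s3", config=Config(max_pool_connections=pool_size))
```

Under these assumptions `worker_count()` comes out at 10, which happens to equal botocore's default `max_pool_connections`. If the thread count were raised, the connection pool would probably need raising with it so that threads do not wait on connections.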

paul-butcher added 13 commits June 11, 2024 16:37

implement object restoration

265a720

tidy up

4a2a66c

no need for two Tier statements

1b73bc3

add restoration

c73786e

Use a threadpool to fetch the files

62b38b4

See https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/run-parallel-reads-of-s3-objects-by-using-python-in-an-aws-lambda-function.html

Restoration (#10)

24d3b38

* implement object restoration
* tidy up
* no need for two Tier statements
* add restoration
* handle restoration attempts on non-glacier objects
* Remove extraneous makers

implement object restoration

6d81bcf

tidy up

b10ff71

no need for two Tier statements

2609dc2

add restoration

71a15db

Use a threadpool to fetch the files

9bb823c

See https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/run-parallel-reads-of-s3-objects-by-using-python-in-an-aws-lambda-function.html

Merge branch 'main' into parallel-downloads

b49d014

improve connection pooling

4519650

agnesgaroux approved these changes Jun 21, 2024

paul-butcher merged commit ac4132b into main Jun 21, 2024
4 checks passed

paul-butcher deleted the parallel-downloads branch June 21, 2024 09:35
