Parallel downloads #12

Merged
merged 13 commits into from
Jun 21, 2024

paul-butcher
Contributor

What does this change?

This adds the ability to pull multiple S3 objects in parallel.

resolves #11

How to test

Using an AWS profile with permission to download from wellcomecollection-editorial-photography...

PYTHONPATH=src python src/transferrer/download.py PBBD_TEST

The log output should show two "downloading" lines, followed by two "downloaded" lines:

INFO:__main__:downloading	ST/PB_BD_TEST/PB_TEST_001.tif
INFO:__main__:downloading	ST/PB_BD_TEST/PB_TEST_002.tif
INFO:__main__:downloaded	ST/PB_BD_TEST/PB_TEST_001.tif
INFO:__main__:downloaded	ST/PB_BD_TEST/PB_TEST_002.tif

How can we measure success?

Downloads of S3 folders should be faster. However, this is a premature optimisation: I don't have a real-world example (i.e. one running on Lambda) to compare it against. On my computer, fetching EPOPTEST (not currently available) seemed to take a few seconds less.

Even so, I expect this to make it possible for us to tune the performance of the process.

Have we considered potential risks?

This adds a little complexity, and downloading in parallel could make the process use significantly more memory and possibly fail. To mitigate this, the thread pool is capped at a relatively low number (0.01 of the max threads available in Lambda).
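
For anyone who wants a picture of the general approach, here is a minimal sketch of a capped thread pool. It is not the code in this PR: `download_all`, `download_one`, `fetch` and `MAX_WORKERS` are hypothetical names, `fetch` stands in for whatever call actually pulls one object from S3, and the value 10 is only an illustration (roughly 0.01 of a 1,024-thread limit, which is an assumption here rather than something this PR states).

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger(__name__)

# Illustrative cap only (assumption); the real value lives in the PR's code.
MAX_WORKERS = 10


def download_one(fetch, key):
    """Fetch a single key, logging before and after, as in the output above."""
    logger.info("downloading\t%s", key)
    result = fetch(key)  # hypothetical callable that retrieves one object
    logger.info("downloaded\t%s", key)
    return result


def download_all(keys, fetch, max_workers=MAX_WORKERS):
    """Fetch every key on a bounded thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map submits all keys up front, so no more than max_workers
        # fetches are ever in flight at the same time.
        return list(pool.map(lambda key: download_one(fetch, key), keys))
```

Because the pool never runs more than `max_workers` fetches at once, the number of objects in flight at any moment has a fixed upper bound, which is what limits the memory risk described above. Lowering or raising that one number is also the obvious tuning knob.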

@paul-butcher paul-butcher merged commit ac4132b into main Jun 21, 2024
4 checks passed
@paul-butcher paul-butcher deleted the parallel-downloads branch June 21, 2024 09:35
Development

Successfully merging this pull request may close these issues.

Parallelise downloads
2 participants