
GISAID workflow hitting max_session_duration #446

Open
1 of 5 tasks
joverlee521 opened this issue Jun 11, 2024 · 8 comments


joverlee521 commented Jun 11, 2024

Context

Our automated workflows use short-lived AWS credentials for sessions, which are limited by a max_session_duration of 12 hours, the maximum allowed by AWS.

The GISAID workflow hit this max yesterday and ran into errors:

upload failed: - to s3://nextstrain-ncov-private/gisaid.ndjson.xz An error occurred (ExpiredToken) when calling the UploadPart operation: The provided token has expired.
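One way to fail fast instead of hitting an ExpiredToken mid-upload is to check the remaining session time before starting a long transfer. A minimal sketch, not the workflow's actual logic: `enough_time_remaining` and the four-hour estimate are hypothetical, and the expiration timestamp is assumed to be available (e.g. from the `Expiration` field that accompanies temporary AWS credentials):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def enough_time_remaining(expiration: datetime,
                          estimated_duration: timedelta,
                          now: Optional[datetime] = None) -> bool:
    """Return True if the session has more time left than the estimated duration."""
    now = now or datetime.now(timezone.utc)
    return expiration - now > estimated_duration

# Example: a 12h session that started 9h ago has 3h left,
# so a ~4h upload would not fit.
start = datetime(2024, 6, 10, 0, 0, tzinfo=timezone.utc)
expiration = start + timedelta(hours=12)          # max_session_duration
nine_hours_in = start + timedelta(hours=9)
print(enough_time_remaining(expiration, timedelta(hours=4), now=nine_hours_in))
```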

TODOs


Post clean-up

@joverlee521 (Contributor Author)

It may be time to revisit #240

@joverlee521 (Contributor Author)

My comment from related thread on Slack:

In the absence of benchmark files, I'm just scanning the Snakemake log files from the workflows for some general timings:
~1.5h - downloading data from GISAID/S3
~4h - transform-gisaid
~1h - filter fasta for new sequences to run through Nextclade
~0.5h - joining metadata + nextclade
= ~7h of data munging - this is about the same with/without new data

The rest of the workflow is just uploading files to S3!!

Without new data, it still takes ~2h to generate the hash for sequences.fasta for checking against S3.
With new data, it takes ~4h to upload sequences.fasta to S3.
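For context on that ~2h hashing step: one common technique for deciding whether an unchanged file needs re-uploading is to compute an S3-style multipart ETag locally and compare it against the object's ETag. This is a sketch of the general technique, not necessarily the ingest workflow's actual check; `multipart_etag` is a hypothetical helper mirroring S3's documented multipart-ETag behavior (MD5 of the concatenated per-part MD5 digests, suffixed with the part count):

```python
import hashlib

def multipart_etag(data: bytes, part_size: int = 8 * 1024 * 1024) -> str:
    """Compute an S3-style ETag for `data` using the given multipart size."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    if len(parts) <= 1:
        # Single-part uploads get a plain MD5 hex digest as their ETag.
        return hashlib.md5(data).hexdigest()
    # Multipart uploads: MD5 over the concatenated binary part digests,
    # with "-<number of parts>" appended.
    digest = hashlib.md5(b"".join(hashlib.md5(p).digest() for p in parts))
    return f"{digest.hexdigest()}-{len(parts)}"
```

Note that this scheme does not hold for objects encrypted with SSE-KMS, so an ETag comparison is only a valid skip condition for plainly uploaded objects.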

@joverlee521 (Contributor Author)

Ah, this is also not considering the full run that gets triggered when there's a new Nextclade dataset released.

The last full run on April 16, 2024 ran for ~15h.

joverlee521 added a commit that referenced this issue Jun 11, 2024
I've been only bumping the memory but not the CPUs for the
fetch-and-ingest workflows. Might as well use all the compute that we
are paying for. GenBank should be using c5.9xlarge and GISAID should be
using c5.12xlarge, so bumping CPUs to match the instances.¹

Maybe this will magically help #446?

¹ <https://aws.amazon.com/ec2/instance-types/c5/>
@joverlee521 joverlee521 self-assigned this Jun 11, 2024
@joverlee521 (Contributor Author)

Bumping the CPUs in #447 decreased the build time by ~1h, which came from parallelizing the download of data at the beginning of the workflow.

We will still run over the 12h limit for full Nextclade runs, so I'm going to work on nextstrain/ingest#41

@corneliusroemer (Member)

Thanks @joverlee521 for the summary!

How hard is it to parallelize ~4h - transform-gisaid?

As this operates on ndjson lines, it might be parallelizable, or at least some part of it.

The obvious way to do so would be to have a split rule to divide input files into N chunks, run transform, and merge back.
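The split/transform/merge idea can be sketched for a line-oriented NDJSON input, assuming records are independent. In the real workflow the chunks would be separate files handled by N parallel Snakemake jobs and concatenated afterwards; here the bookkeeping is shown in-process, and `transform` is only a stand-in for the actual transform-gisaid logic:

```python
import json

def split_ndjson(lines, n_chunks):
    """Split a list of NDJSON lines into contiguous chunks of roughly equal size."""
    size = -(-len(lines) // n_chunks)  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def transform(line):
    # Stand-in for the real per-record work done by transform-gisaid.
    record = json.loads(line)
    record["processed"] = True
    return json.dumps(record)

# Split, transform each chunk (in the real workflow: N parallel jobs),
# then merge by concatenating the chunks back in their original order.
lines = [json.dumps({"id": i}) for i in range(10)]
chunks = split_ndjson(lines, 4)
merged = [transform(line) for chunk in chunks for line in chunk]
```

Because the chunks are contiguous and merged in order, record order in the output matches the input, which keeps downstream diffs and hashes stable.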

@joverlee521 (Contributor Author)

How hard is it to parallelize ~4h - transform-gisaid?

@corneliusroemer I honestly have no idea...I made #448 to track this separately.

@joverlee521 (Contributor Author)

Speeding up upload-to-s3 is not as straightforward as initially thought...

For now, I'm sidestepping the issue by creating a nextstrain-ncov-ingest IAM user and adding its credentials to the repo secrets, so the workflow will be able to run without any time limits. I added a post clean-up list above to remove those credentials and delete the user once we've resolved this issue.

joverlee521 added a commit that referenced this issue Jun 17, 2024
Adding as part of #240 to help collect more data for tackling #446.

One unexpected behavior that I ran into when testing the `--stats`
option is that Snakemake doesn't generate the stats file if the
workflow exits with an error at any step.

Note that the Snakemake `--stats` option is no longer available starting with
Snakemake v8, so this will need to be removed when we eventually
upgrade Snakemake in our runtimes.
joverlee521 added a commit that referenced this issue Jun 17, 2024
Adding as part of #240 to help collect more data for tackling #446.
@joverlee521 (Contributor Author)

Latest GISAID full run that included complete re-runs of both Nextclade datasets was >21h.

snakemake_stats.json
