Speed up upload-to-s3 #41

Open · Tracked by #446
joverlee521 opened this issue Jun 11, 2024 · 13 comments
@joverlee521
Contributor

Prompted by nextstrain/ncov-ingest#446

Some ideas for speeding up upload-to-s3 proposed in the related Slack thread:

  1. Configure threads for compression, since that is most likely the bottleneck: we are currently compressing with a single thread (see the sketch below).
  2. Update hashing to stop going through Python, or compute the hash of the compressed version instead.
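
For idea 1, a minimal sketch of what multi-threaded compression could look like, assuming the pipeline shells out to xz (the actual compressor, flags, and plumbing in upload-to-s3 may differ):

#!/usr/bin/env python3
# Hypothetical sketch, not the actual upload-to-s3 code: stream stdin
# through xz with --threads=0 so xz compresses blocks on all available
# cores, instead of the current single-threaded compression.
import subprocess
import sys

xz = subprocess.run(
    ["xz", "--threads=0", "--stdout"],
    stdin=sys.stdin.buffer,
    stdout=sys.stdout.buffer,
)
sys.exit(xz.returncode)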
@joverlee521
Contributor Author

> Update hashing to stop going through Python.

This is currently blocked on sha256sum not being available in the Conda runtime.

@tsibley
Member

tsibley commented Jun 11, 2024

Alternatively, the naive sha256sum implementation in Python could exec into the GNU coreutils version, if found, otherwise fall back to the slow Python version.
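
A minimal sketch of that approach, assuming a vendored Python script that hashes stdin (the names and structure here are illustrative, not the vendored script's actual code; the 5 MiB chunk size matches what's mentioned later in this thread):

#!/usr/bin/env python3
import hashlib
import os
import shutil
import sys

# Illustrative read size: 5 MiB, per the chunk size discussed below.
CHUNK_SIZE = 5 * 1024 * 1024

# Prefer the GNU coreutils implementation when it's on PATH...
coreutils = shutil.which("sha256sum")
if coreutils:
    # ...by replacing this process with it entirely.
    os.execv(coreutils, [coreutils, *sys.argv[1:]])

# Otherwise, fall back to hashing stdin in pure Python.
digest = hashlib.sha256()
while chunk := sys.stdin.buffer.read(CHUNK_SIZE):
    digest.update(chunk)
print(digest.hexdigest())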

tsibley added a commit to nextstrain/conda-base that referenced this issue Jun 11, 2024 (adding GNU coreutils to the Conda runtime):
They're provided in our other runtimes (almost by happenstance, as part
of the underlying OS image) and having them available in all runtimes
makes it much easier to write portable programs without having to deal
with GNU vs. BSD differences.

Note that typically GNU coreutils would already be available in the
Conda runtime on Linux (via the host system), but not the Conda runtime
on macOS (unless installed separately, e.g. via Homebrew).  So
explicitly including GNU coreutils here increases consistency,
isolation, and portability of the runtime.

Related-to: <nextstrain/ingest#41>
joverlee521 self-assigned this Jun 12, 2024
@tsibley
Member

tsibley commented Jun 12, 2024

> Update hashing to stop going through Python.
>
> This is currently blocked on sha256sum not being available in the Conda runtime.

sha256sum is now available as of the nextstrain-base 20240612T205814Z Conda package.

Note that we're (I'm) assuming the GNU coreutils sha256sum implementation is faster than our Python one. It likely is! But we don't actually know. Benchmarking might be useful here.

@joverlee521
Contributor Author

> Note that we're (I'm) assuming the GNU coreutils sha256sum implementation is faster than our Python one. It likely is! But we don't actually know. Benchmarking might be useful here.

A simple test with a 1.3 GB FASTA file:

$ time sha256sum ./ingest/data/ncbi_dataset_sequences.fasta
f85d4bfc6c9cfc00d10567aed87723c4bf39498b5dc94f81c94f4b31c98fb806  ./ingest/data/ncbi_dataset_sequences.fasta

real	0m4.452s
user	0m3.952s
sys	0m0.483s
$ time ./ingest/vendored/sha256sum ./ingest/data/ncbi_dataset_sequences.fasta

Still running after 10 minutes... I'll post an update with the final time after it finishes...

@joverlee521
Contributor Author

> Still running after 10 minutes... I'll post an update with the final time after it finishes...

🤦‍♀️ Nope, I was just running the script wrong; it reads from stdin:

$ time ./ingest/vendored/sha256sum < ./ingest/data/ncbi_dataset_sequences.fasta
f85d4bfc6c9cfc00d10567aed87723c4bf39498b5dc94f81c94f4b31c98fb806

real	0m1.401s
user	0m0.539s
sys	0m0.841s

@tsibley
Member

tsibley commented Jun 14, 2024

Wait, is the Python one actually faster? What? I mean, I know the hashlib implementations are in C, as is much of the file I/O, but I'd still expect Python overhead to be significant here.

@tsibley
Member

tsibley commented Jun 14, 2024

Is your coreutils sha256sum x86_64 or aarch64?

$ file $(type -p sha256sum)

@joverlee521
Contributor Author

> Is your coreutils sha256sum x86_64 or aarch64?
> $ file $(type -p sha256sum)

Ah, I should have said: I was running these in the Nextstrain shell using the Docker runtime.

@tsibley
Member

tsibley commented Jun 14, 2024

One thing I noted looking at coreutils sha256sum is that it reads in 32 KiB chunks vs. our 5 MiB chunks.

@joverlee521
Contributor Author

Similar results when running in the macOS terminal:

KX76YWH7NX:mpox joverlee$ time sha256sum ingest/data/ncbi_dataset_sequences.fasta
f85d4bfc6c9cfc00d10567aed87723c4bf39498b5dc94f81c94f4b31c98fb806  ingest/data/ncbi_dataset_sequences.fasta

real	0m5.059s
user	0m4.725s
sys	0m0.209s
KX76YWH7NX:mpox joverlee$ time ./ingest/vendored/sha256sum < ingest/data/ncbi_dataset_sequences.fasta
f85d4bfc6c9cfc00d10567aed87723c4bf39498b5dc94f81c94f4b31c98fb806

real	0m1.439s
user	0m0.632s
sys	0m0.221s
KX76YWH7NX:mpox joverlee$ file $(type -p sha256sum)
/opt/homebrew/bin/sha256sum: Mach-O 64-bit executable arm64

@joverlee521
Contributor Author

Just to make sure this holds for a larger file, I tested with a 70 GB FASTA. Python is much faster than GNU coreutils!

GNU coreutils:

KX76YWH7NX:ncov-ingest joverlee$ time sha256sum data/gisaid/sequences.fasta
3f47c5c48118ec5da9955bffe4346ea6245ad6d6b443c544f85c7f4d377a4b1e  data/gisaid/sequences.fasta

real	4m16.720s
user	3m59.214s
sys	0m8.749s

Python:

KX76YWH7NX:ncov-ingest joverlee$ time ./vendored/sha256sum < data/gisaid/sequences.fasta
3f47c5c48118ec5da9955bffe4346ea6245ad6d6b443c544f85c7f4d377a4b1e

real	0m45.221s
user	0m30.535s
sys	0m6.089s

@tsibley
Member

tsibley commented Jun 14, 2024

Wow. I'd be really curious what the times are if you drop our read size in Python to 32 KiB.

I'd also wonder if aarch64 is coming into play here: is Python taking advantage of it (and coreutils not) in a way it couldn't on the x86_64 hardware we're using on AWS Batch?

On my machine, Python is only slightly faster than coreutils. In fact, the alternative non-cryptographic/insecure hashing algorithms I've tried (a few implementations of MurmurHash3, plain CRC32, plain MD5) all come out very roughly in the same ballpark (within ~20 s of each other on a 3 GB file), which leads me to think I'm bottlenecked on I/O on my machine. So I'd wonder if we hit an I/O bottleneck in Batch too; we're not using fast disks on AWS...
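
A throwaway harness along these lines could answer the chunk-size question (hypothetical, not part of the repo; it takes a file path rather than stdin so every chunk size reads the same file):

#!/usr/bin/env python3
import hashlib
import sys
import time

def sha256_file(path, chunk_size):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    path = sys.argv[1]
    # Compare coreutils' 32 KiB reads against our current 5 MiB reads.
    # Note the OS page cache will favor later runs; alternate the order
    # (or drop caches between runs) for a fair comparison.
    for chunk_size in (32 * 1024, 5 * 1024 * 1024):
        start = time.perf_counter()
        result = sha256_file(path, chunk_size)
        elapsed = time.perf_counter() - start
        print(f"chunk size: {chunk_size}\t{elapsed:.2f}s\t{result}")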

@joverlee521
Copy link
Contributor Author

> I'd be really curious what the times are if you drop our read size in Python to 32 KiB.

It's actually slightly faster when I drop the chunk size:

KX76YWH7NX:ncov-ingest joverlee$ time ./vendored/sha256sum < data/gisaid/sequences.fasta
chunk size: 32768
3f47c5c48118ec5da9955bffe4346ea6245ad6d6b443c544f85c7f4d377a4b1e

real	0m41.200s
user	0m30.406s
sys	0m8.086s
