Speed up upload-to-s3 #41

Open · Tracked by #446
joverlee521 opened this issue Jun 11, 2024 · 13 comments
@joverlee521
Contributor

Prompted by nextstrain/ncov-ingest#446

Some ideas for speeding up upload-to-s3 proposed in the related Slack thread:

  1. Configure threads for compression, since that is most likely the bottleneck: we are currently compressing with a single thread (see the sketch below).
  2. Update hashing to stop going through Python, or compute the hash of the compressed version instead.
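
For idea 1, a minimal sketch of what multi-threaded compression could look like, assuming the pipeline shells out to xz (the actual compressor, flags, and plumbing in upload-to-s3 may differ):

#!/usr/bin/env python3
# Hypothetical sketch, not the actual upload-to-s3 code: stream stdin
# through xz with --threads=0 so xz compresses blocks on all available
# cores, instead of the current single-threaded compression.
import subprocess
import sys

xz = subprocess.run(
    ["xz", "--threads=0", "--stdout"],
    stdin=sys.stdin.buffer,
    stdout=sys.stdout.buffer,
)
sys.exit(xz.returncode)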
@joverlee521
Contributor Author

> Update hashing to stop going through Python.

This is currently blocked on sha256sum not being available in the Conda runtime.

@tsibley
Member

tsibley commented Jun 11, 2024

Alternatively, the naive sha256sum implementation in Python could exec into the GNU coreutils version, if found, otherwise fall back to the slow Python version.
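
A minimal sketch of that approach, assuming a vendored Python script that hashes stdin (the names and structure here are illustrative, not the vendored script's actual code; the 5 MiB chunk size matches what's mentioned later in this thread):

#!/usr/bin/env python3
import hashlib
import os
import shutil
import sys

# Illustrative read size: 5 MiB, per the chunk size discussed below.
CHUNK_SIZE = 5 * 1024 * 1024

# Prefer the GNU coreutils implementation when it's on PATH...
coreutils = shutil.which("sha256sum")
if coreutils:
    # ...by replacing this process with it entirely.
    os.execv(coreutils, [coreutils, *sys.argv[1:]])

# Otherwise, fall back to hashing stdin in pure Python.
digest = hashlib.sha256()
while chunk := sys.stdin.buffer.read(CHUNK_SIZE):
    digest.update(chunk)
print(digest.hexdigest())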

tsibley added a commit to nextstrain/conda-base that referenced this issue Jun 11, 2024 (adding GNU coreutils to the Conda runtime):
They're provided in our other runtimes (almost by happenstance, as part
of the underlying OS image) and having them available in all runtimes
makes it much easier to write portable programs without having to deal
with GNU vs. BSD differences.

Note that typically GNU coreutils would already be available in the
Conda runtime on Linux (via the host system), but not the Conda runtime
on macOS (unless installed separately, e.g. via Homebrew).  So
explicitly including GNU coreutils here increases consistency,
isolation, and portability of the runtime.

Related-to: <nextstrain/ingest#41>
joverlee521 self-assigned this Jun 12, 2024
@tsibley
Member

tsibley commented Jun 12, 2024

> Update hashing to stop going through Python.
>
> This is currently blocked on sha256sum not being available in the Conda runtime.

sha256sum is now available as of the nextstrain-base 20240612T205814Z Conda package.

Note that we're (I'm) assuming the GNU coreutils sha256sum implementation is faster than our Python one. It likely is! But we don't actually know. Benchmarking might be useful here.

@joverlee521
Contributor Author

> Note that we're (I'm) assuming the GNU coreutils sha256sum implementation is faster than our Python one. It likely is! But we don't actually know. Benchmarking might be useful here.

A simple test with a 1.3 GB FASTA file:

$ time sha256sum ./ingest/data/ncbi_dataset_sequences.fasta
f85d4bfc6c9cfc00d10567aed87723c4bf39498b5dc94f81c94f4b31c98fb806  ./ingest/data/ncbi_dataset_sequences.fasta

real	0m4.452s
user	0m3.952s
sys	0m0.483s
$ time ./ingest/vendored/sha256sum ./ingest/data/ncbi_dataset_sequences.fasta

Still running after 10 minutes... I'll post an update with the final time after it finishes...

@joverlee521
Contributor Author

> Still running after 10 minutes... I'll post an update with the final time after it finishes...

🤦‍♀️ Nope, I was just running the script wrong; it reads from stdin:

$ time ./ingest/vendored/sha256sum < ./ingest/data/ncbi_dataset_sequences.fasta
f85d4bfc6c9cfc00d10567aed87723c4bf39498b5dc94f81c94f4b31c98fb806

real	0m1.401s
user	0m0.539s
sys	0m0.841s

@tsibley
Member

tsibley commented Jun 14, 2024

Wait, is the Python one actually faster? What? I mean, I know the hashlib implementations are in C, as is much of the file I/O, but I'd still expect Python overhead to be significant here.

@tsibley
Member

tsibley commented Jun 14, 2024

Is your coreutils sha256sum x86_64 or aarch64?

$ file $(type -p sha256sum)

@joverlee521
Contributor Author

> Is your coreutils sha256sum x86_64 or aarch64?
> $ file $(type -p sha256sum)

Ah, I should have said: I was running these in the Nextstrain shell using the Docker runtime.

@tsibley
Member

tsibley commented Jun 14, 2024

One thing I noted looking at coreutils sha256sum is that it reads in 32 KiB chunks vs. our 5 MiB chunks.

@joverlee521
Contributor Author

Similar results when running in the macOS terminal:

KX76YWH7NX:mpox joverlee$ time sha256sum ingest/data/ncbi_dataset_sequences.fasta
f85d4bfc6c9cfc00d10567aed87723c4bf39498b5dc94f81c94f4b31c98fb806  ingest/data/ncbi_dataset_sequences.fasta

real	0m5.059s
user	0m4.725s
sys	0m0.209s
KX76YWH7NX:mpox joverlee$ time ./ingest/vendored/sha256sum < ingest/data/ncbi_dataset_sequences.fasta
f85d4bfc6c9cfc00d10567aed87723c4bf39498b5dc94f81c94f4b31c98fb806

real	0m1.439s
user	0m0.632s
sys	0m0.221s
KX76YWH7NX:mpox joverlee$ file $(type -p sha256sum)
/opt/homebrew/bin/sha256sum: Mach-O 64-bit executable arm64

@joverlee521
Contributor Author

Just to make sure this holds for a larger file, I tested with a 70 GB FASTA. Python is much faster than GNU coreutils!

GNU coreutils:

KX76YWH7NX:ncov-ingest joverlee$ time sha256sum data/gisaid/sequences.fasta
3f47c5c48118ec5da9955bffe4346ea6245ad6d6b443c544f85c7f4d377a4b1e  data/gisaid/sequences.fasta

real	4m16.720s
user	3m59.214s
sys	0m8.749s

Python:

KX76YWH7NX:ncov-ingest joverlee$ time ./vendored/sha256sum < data/gisaid/sequences.fasta
3f47c5c48118ec5da9955bffe4346ea6245ad6d6b443c544f85c7f4d377a4b1e

real	0m45.221s
user	0m30.535s
sys	0m6.089s

@tsibley
Member

tsibley commented Jun 14, 2024

Wow. I'd be really curious what the times are if you drop our read size in Python to 32 KiB.

I'd also wonder if aarch64 is coming into play here: is Python taking advantage of it (and coreutils not) in a way it couldn't on the x86_64 hardware we're using on AWS Batch?

On my machine, Python is only slightly faster than coreutils. In fact, the alternative non-cryptographic/insecure hashing algorithms I've tried (a few implementations of MurmurHash3, plain CRC32, plain MD5) all come out very roughly in the same ballpark (within ~20 s of each other on a 3 GB file), which leads me to think I'm bottlenecked on I/O on my machine. So I'd wonder if we hit an I/O bottleneck in Batch too; we're not using fast disks on AWS...
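
A throwaway harness along these lines could answer the chunk-size question (hypothetical, not part of the repo; it takes a file path rather than stdin so every chunk size reads the same file):

#!/usr/bin/env python3
import hashlib
import sys
import time

def sha256_file(path, chunk_size):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    path = sys.argv[1]
    # Compare coreutils' 32 KiB reads against our current 5 MiB reads.
    # Note the OS page cache will favor later runs; alternate the order
    # (or drop caches between runs) for a fair comparison.
    for chunk_size in (32 * 1024, 5 * 1024 * 1024):
        start = time.perf_counter()
        result = sha256_file(path, chunk_size)
        elapsed = time.perf_counter() - start
        print(f"chunk size: {chunk_size}\t{elapsed:.2f}s\t{result}")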

@joverlee521
Copy link
Contributor Author

> I'd be really curious what the times are if you drop our read size in Python to 32 KiB.

It's actually slightly faster when I drop the chunk size:

KX76YWH7NX:ncov-ingest joverlee$ time ./vendored/sha256sum < data/gisaid/sequences.fasta
chunk size: 32768
3f47c5c48118ec5da9955bffe4346ea6245ad6d6b443c544f85c7f4d377a4b1e

real	0m41.200s
user	0m30.406s
sys	0m8.086s
