Failing to run cudf_merge benchmark on a node with 4 H100 #1088
Indeed that doesn't look right. I do not have immediate access to a system with H100s, but I can try to do that tomorrow. In the meantime I was able to run this on a DGX-1, and the results I see are much more in line with what we would expect, although I had to reduce the chunk size to 100M due to the amount of memory in the V100s: 2 GPUs
4 GPUs
8 GPUs
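A back-of-envelope estimate of why the chunk size had to drop on V100s (32 GiB vs 80 GiB on H100). The per-row byte count below is an assumption for illustration (roughly two 8-byte columns per row); the benchmark's actual table layout may differ.

```python
# Rough device-memory footprint of one cudf_merge chunk.
# ASSUMPTION: ~two int64 columns (key + payload) per row; the real
# benchmark schema may hold more columns and hence use more memory.
BYTES_PER_ROW = 2 * 8

def chunk_gib(rows: int) -> float:
    """Approximate device memory per chunk, in GiB."""
    return rows * BYTES_PER_ROW / 2**30

# 200M rows (the H100 runs) vs the 100M used to fit on the V100s.
print(f"200M rows: {chunk_gib(200_000_000):.2f} GiB per chunk")
print(f"100M rows: {chunk_gib(100_000_000):.2f} GiB per chunk")
```

Even this lower bound ignores the merge output and any intermediate allocations, so actual peak usage is a multiple of the chunk size.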
Based on the affinity reported by your system as the output of
That is a good question, I'm not sure we have done such testing in the past, but for the 2 GPU case we get 85-90% of expected, with ~19.5 GiB/s, where `ucx_perftest` between two V100s reports:

```
$ ucx_perftest -t tag_bw -m cuda -s 1000000000 -n 1000 & ucx_perftest -t tag_bw -m cuda -s 1000000000 -n 1000 localhost
[1] 1875533
[1729719349.121679] [dgx13:1875534:0]  perftest.c:793  UCX  WARN  CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
[1729719349.121731] [dgx13:1875533:0]  perftest.c:793  UCX  WARN  CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
Waiting for connection...
Accepted connection from 127.0.0.1:33582
+----------------------------------------------------------------------------------------------------------+
| API:          protocol layer                                                                              |
| Test:         tag match bandwidth                                                                         |
| Data layout:  (automatic)                                                                                 |
| Send memory:  cuda                                                                                        |
| Recv memory:  cuda                                                                                        |
| Message size: 1000000000                                                                                  |
| Window size:  32                                                                                          |
+----------------------------------------------------------------------------------------------------------+
+--------------+--------------+------------------------------+---------------------+-----------------------+
|              |              |       overhead (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | 50.0%ile | average | overall |  average |  overall |  average  |  overall  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[thread 0]    57      3.855  18078.963  18078.963   52750.50   52750.50     55     55
[thread 0]    82  41114.179  41216.364  25133.049   23138.24   37945.03     24     40
[thread 0]   107  41118.761  41216.316  28890.821   23138.27   33009.60     24     35
[thread 0]   132  41119.304  41216.402  31225.212   23138.22   30541.80     24     32
[thread 0]   157  41119.606  41216.240  32816.140   23138.31   29061.14     24     30
[thread 0]   182  41119.606  41216.278  33970.005   23138.29   28074.01     24     29
[thread 0]   207  41119.798  41216.326  34845.164   23138.27   27368.91     24     29
[thread 0]   232  41119.923  41216.316  35531.711   23138.27   26840.09     24     28
[thread 0]   257  41120.042  41216.516  36084.708   23138.16   26428.77     24     28
[thread 0]   282  41120.065  41216.278  36539.634   23138.29   26099.72     24     27
[thread 0]   307  41120.199  41216.364  36920.475   23138.24   25830.50     24     27
[thread 0]   332  41120.195  41216.278  37243.954   23138.29   25606.15     24     27
[thread 0]   357  41120.195  41216.478  37522.142   23138.18   25416.31     24     27
[thread 0]   382  41120.199  41216.326  37763.908   23138.27   25253.59     24     26
[thread 0]   407  41120.263  41216.354  37975.975   23138.25   25112.57     24     26
[thread 0]   432  41120.337  41216.364  38163.497   23138.24   24989.17     24     26
[thread 0]   457  41120.364  41216.316  38330.501   23138.27   24880.30     24     26
[thread 0]   482  41120.434  41216.440  38480.186   23138.20   24783.52     24     26
[thread 0]   507  41120.435  41216.288  38615.103   23138.29   24696.93     24     26
[thread 0]   532  41120.435  41216.316  38737.340   23138.27   24618.99     24     26
[thread 0]   557  41120.453  41216.402  38848.609   23138.22   24548.48     24     26
[thread 0]   582  41120.453  41216.316  38950.314   23138.27   24484.38     24     26
[thread 0]   607  41120.510  41216.278  39043.641   23138.29   24425.86     24     26
[thread 0]   632  41120.543  41216.326  39129.585   23138.27   24372.21     24     26
[thread 0]   657  41120.543  41216.402  39208.992   23138.22   24322.85     24     26
[thread 0]   682  41120.550  41216.240  39282.572   23138.31   24277.29     24     25
[thread 0]   707  41120.577  41216.316  39350.950   23138.27   24235.10     24     25
[thread 0]   732  41120.577  41216.478  39414.664   23138.18   24195.93     24     25
[thread 0]   757  41120.568  41216.240  39474.161   23138.31   24159.46     24     25
[thread 0]   782  41120.568  41216.326  39529.857   23138.27   24125.42     24     25
[thread 0]   807  41120.567  41216.316  39582.102   23138.27   24093.57     24     25
[thread 0]   832  41120.553  41216.478  39631.211   23138.18   24063.72     24     25
[thread 0]   857  41120.577  41216.240  39677.449   23138.31   24035.68     24     25
[thread 0]   882  41120.581  41216.402  39721.070   23138.22   24009.28     24     25
[thread 0]   907  41120.581  41216.316  39762.284   23138.27   23984.39     24     25
[thread 0]   932  41120.581  41216.326  39801.288   23138.27   23960.89     24     25
[thread 0]   957  41120.591  41216.316  39838.253   23138.27   23938.66     24     25
[thread 0]   982  41120.599  41216.316  39873.336   23138.27   23917.60     24     25
Final:       1000  41120.606 114493.397  41216.497    8329.51   23138.17      9     24
```

Also note that a DGX-1 doesn't have NVSwitch connecting all GPUs, so when scaling to all of them we expect to be limited by the ConnectX-4 bandwidth.
Thanks @pentschev for the quick feedback. No, I'm using a full node in exclusive mode for these tests. There are 8 NUMA nodes of 8 physical CPU cores each, and GPUs are connected in pairs on the same NUMA node. I can provide the full lstopo output if of interest. Note that we may have an issue on our side with respect to affinity, as nothing actually forces Slurm to allocate CPU cores on the same NUMA node as the GPU when you request a single GPU. But in this case I'm using the full node, and at least the list of CPU cores is not empty, so I assume it works as expected on that side. But to be checked.
2 GPUs
4 GPUs
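One quick way to verify what Slurm actually handed each worker is to inspect the process's CPU affinity mask from inside Python. This is a minimal, Linux-only sketch using only the standard library; mapping the returned cores back to each GPU's preferred NUMA node still requires comparing against the "CPU Affinity" column of `nvidia-smi topo -m`, run separately.

```python
import os

# CPU cores this process is allowed to run on, i.e. what Slurm (or any
# other affinity mechanism) actually granted us. Linux-only API.
cores = sorted(os.sched_getaffinity(0))
preview = cores[:8]
print(f"{len(cores)} allowed cores, first few: {preview}")

# To check GPU locality, compare this set against the "CPU Affinity"
# column of `nvidia-smi topo -m` for the device each worker uses.
```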
Yes, if you can access an H100 node and reproduce the case, that would be a great point of comparison. Thanks again.
I was able to reproduce the variability consistently on an H100 node, both with partial and full node allocations, so presumably this is not related to what I initially thought. I was also able to reproduce some of the errors preventing the run from succeeding with 4 GPUs, and by the looks of it on my end this happened during the establishment of endpoints, where I've previously observed flakiness in Dask clusters when a large number of endpoints are created simultaneously; however, I was unable to observe errors exactly as you've posted. I do not have a lead yet as to what happens on H100s; my first guess is that it's related to suboptimal paths or the lack of proper affinity assignment for each process. I'll see if I can do more testing tomorrow or early next week. To be honest, this is a benchmark that, to my knowledge, is not often used; I haven't touched it myself in probably 2 years. Would you mind briefly describing how you came across it and why you are interested in this one specifically?
Hi @pentschev, many thanks for your investigations.
nvbandwidth gives ~260 GB/s:
UCX can be faster than a single link because it uses multiple rails for communication, meaning transfers may be split among various links (e.g., NVLink + InfiniBand) to achieve higher bandwidth than you would get with a single rail. This is something that can be controlled with `UCX_MAX_RNDV_RAILS`.
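For a single-link measurement, the rail count can be pinned from the environment before UCX initializes. A minimal sketch, assuming (as is standard for UCX tuning variables) that `UCX_MAX_RNDV_RAILS` is read from the environment at initialization time; the commented-out `ucp` import is illustrative only.

```python
import os

# Force rendezvous transfers onto a single rail, so the measured bandwidth
# reflects one link instead of being aggregated across NVLink + InfiniBand.
# This must be set before UCX (and therefore ucx-py) is initialized.
os.environ["UCX_MAX_RNDV_RAILS"] = "1"

# import ucp  # illustrative: the import (and UCX init) must come after
print(os.environ["UCX_MAX_RNDV_RAILS"])
```

The same effect can be had by prefixing the benchmark command with `UCX_MAX_RNDV_RAILS=1` in the shell.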
Here is the test (on fewer iterations, setting `UCX_MAX_RNDV_RAILS` to 1):
Thanks for confirming those results on your end @orliac. After talking to some of the UCX developers I've been informed we have an internal bug report (not publicly viewable) that may be the cause; what happens is
This results in much more consistent results with
Thanks for investigating @pentschev. Replicating on my side, I cannot reach the numbers you obtain (which look consistent with what is expected). I'm below half the expected throughput, but at least the numbers are consistent across iterations.
Actually, what you're seeing is probably correct, given the implementation internals of the "Bidirectional" test. In my case, GPUs have more NVLinks (NV18 vs NV6 on your end), and thus the higher unidirectional bandwidth.
And then in P2P I actually get higher bandwidth than UCX/UCX-Py, which I can't explain at the moment, but I think it might have to do with the way the
Thanks @pentschev for the feedback, and sorry for the slow reply.
Hi there,
I'm facing an issue when trying to run the cudf_merge benchmark locally on a node that hosts 4 H100s:
I can run the benchmark over any pair of GPUs with no issue:
```
python -m ucp.benchmarks.cudf_merge --devs 0,1 --chunk-size 200_000_000 --iter 10
```
But it fails to run over the 4 devices:
```
python -m ucp.benchmarks.cudf_merge --devs 0,1,2,3 --chunk-size 200_000_000 --iter 10
```
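Since every pairwise run succeeds but the 4-device run fails, sweeping all device pairs can help rule out a single bad link or path. A small sketch that only builds the command lines (it does not execute them, since that requires the GPU node); the device list is the one from this report.

```python
from itertools import combinations

devs = [0, 1, 2, 3]  # the four H100s from this report

# One cudf_merge invocation per GPU pair; running each on the node can
# show whether a specific pair (and hence a specific path) misbehaves.
cmds = [
    f"python -m ucp.benchmarks.cudf_merge --devs {a},{b} "
    f"--chunk-size 200_000_000 --iter 10"
    for a, b in combinations(devs, 2)
]
for c in cmds:
    print(c)
```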
My environment:
Any idea?
Also, I'm surprised by the variability of the benchmark over the successive 10 iterations.
And finally, is it expected that the benchmark saturates the available bandwidth between the GPUs?
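One way to frame the saturation question is as a utilization ratio of measured throughput against the nominal link peak. The peak value used below is purely a hypothetical placeholder for illustration; the real figure depends on NVLink generation and link count (e.g. the NV6 vs NV18 topologies discussed in this thread), and per-direction peaks differ from the aggregate numbers tools like nvbandwidth report.

```python
# Fraction of nominal peak achieved. Note the unit mismatch handled here:
# benchmarks often report GiB/s (2**30 bytes) while link specs use GB/s (1e9).
def utilization(measured_gib_s: float, peak_gb_s: float) -> float:
    return measured_gib_s * 2**30 / (peak_gb_s * 1e9)

# Example: ~19.5 GiB/s measured against a HYPOTHETICAL 25 GB/s rail peak.
print(f"{utilization(19.5, 25.0):.0%}")
```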