The special topology causes the NCCL test to fail #273

Open
zh0ngtian opened this issue Dec 6, 2024 · 7 comments

Comments

@zh0ngtian

We have two L40S machines with different topologies. The first passes the NCCL test, but the second does not.

nccl version: 2.23.4+cuda12.4

Test passes:

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     0-89    0               N/A
GPU1    NODE     X      NODE    NODE    SYS     SYS     SYS     SYS     SYS     0-89    0               N/A
GPU2    NODE    NODE     X      NODE    SYS     SYS     SYS     SYS     SYS     0-89    0               N/A
GPU3    NODE    NODE    NODE     X      SYS     SYS     SYS     SYS     SYS     0-89    0               N/A
GPU4    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    SYS     90-179  1               N/A
GPU5    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    SYS     90-179  1               N/A
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE    SYS     90-179  1               N/A
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      SYS     90-179  1               N/A
NIC0    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X

Test fails (hangs):

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     2-11,96,98-107  0               N/A
GPU1    PIX      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     2-11,96,98-107  0               N/A
GPU2    SYS     SYS      X      PIX     SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     24-35,120-131   2               N/A
GPU3    SYS     SYS     PIX      X      SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     24-35,120-131   2               N/A
GPU4    SYS     SYS     SYS     SYS      X      PIX     SYS     SYS     SYS     SYS     SYS     SYS     48-59,144-155   4               N/A
GPU5    SYS     SYS     SYS     SYS     PIX      X      SYS     SYS     SYS     SYS     SYS     SYS     48-59,144-155   4               N/A
GPU6    SYS     SYS     SYS     SYS     SYS     SYS      X      PIX     SYS     SYS     NODE    NODE    72-83,168-179   6               N/A
GPU7    SYS     SYS     SYS     SYS     SYS     SYS     PIX      X      SYS     SYS     NODE    NODE    72-83,168-179   6               N/A
NIC0    SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS      X      PIX     SYS     SYS
NIC1    SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS     PIX      X      SYS     SYS
NIC2    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS      X      PIX
NIC3    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     PIX      X
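
To compare the two matrices without reading them cell by cell, the GPU-to-GPU links can be grouped by connection type. The following is a minimal sketch, assuming the text has the same layout as the matrices above; the function names `parse_topo_matrix` and `gpu_pairs_by_link` are made up for illustration, and the sample is a shortened excerpt of the failing topology (four GPUs, one NIC).

```python
import re

DEVICE_NAME = re.compile(r"^(GPU|NIC)\d+$")


def parse_topo_matrix(text):
    """Parse a topology matrix into {(row_device, col_device): link_type}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Keep only device columns; "CPU Affinity", "NUMA Affinity" and
    # "GPU NUMA ID" header words do not match GPU<n>/NIC<n>.
    devices = [tok for tok in lines[0].split() if DEVICE_NAME.match(tok)]
    links = {}
    for ln in lines[1:]:
        toks = ln.split()
        row = toks[0]
        if row not in devices:
            continue
        # The first len(devices) cells after the row name are link types.
        for col, link in zip(devices, toks[1:1 + len(devices)]):
            if link != "X":  # "X" marks the device itself
                links[(row, col)] = link
    return links


def gpu_pairs_by_link(links):
    """Group each unordered GPU pair once, keyed by link type."""
    groups = {}
    for (a, b), link in links.items():
        if a.startswith("GPU") and b.startswith("GPU") and int(a[3:]) < int(b[3:]):
            groups.setdefault(link, []).append((a, b))
    return {link: sorted(pairs) for link, pairs in groups.items()}


# Shortened excerpt of the failing topology, for illustration only.
SAMPLE = """
        GPU0    GPU1    GPU2    GPU3    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX     SYS     SYS     SYS     2-11,96,98-107  0               N/A
GPU1    PIX      X      SYS     SYS     SYS     2-11,96,98-107  0               N/A
GPU2    SYS     SYS      X      PIX     NODE    24-35,120-131   2               N/A
GPU3    SYS     SYS     PIX      X      NODE    24-35,120-131   2               N/A
NIC0    SYS     SYS     NODE    NODE     X
"""

for link, pairs in gpu_pairs_by_link(parse_topo_matrix(SAMPLE)).items():
    print(link, pairs)
```

On the full matrices, this grouping makes the difference visible: the passing machine has only NODE and SYS links between GPUs, while the failing one has PIX pairs (GPUs behind the same PCIe switch) and SYS everywhere else.
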
NCCL_DEBUG=TRACE ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  22817 on dc05-p14-t100-n041 device  0 [0x18] NVIDIA L40S
#  Rank  1 Group  0 Pid  22817 on dc05-p14-t100-n041 device  1 [0x1a] NVIDIA L40S
#  Rank  2 Group  0 Pid  22817 on dc05-p14-t100-n041 device  2 [0x3c] NVIDIA L40S
#  Rank  3 Group  0 Pid  22817 on dc05-p14-t100-n041 device  3 [0x3e] NVIDIA L40S
#  Rank  4 Group  0 Pid  22817 on dc05-p14-t100-n041 device  4 [0x9a] NVIDIA L40S
#  Rank  5 Group  0 Pid  22817 on dc05-p14-t100-n041 device  5 [0x9c] NVIDIA L40S
#  Rank  6 Group  0 Pid  22817 on dc05-p14-t100-n041 device  6 [0xbc] NVIDIA L40S
#  Rank  7 Group  0 Pid  22817 on dc05-p14-t100-n041 device  7 [0xbe] NVIDIA L40S
dc05-p14-t100-n041:22817:22817 [0] NCCL INFO Bootstrap : Using eth0:fdbd:dc05:14:100::41<0>
dc05-p14-t100-n041:22817:22817 [0] NCCL INFO cudaDriverVersion 12040
dc05-p14-t100-n041:22817:22817 [0] NCCL INFO NCCL version 2.23.4+cuda12.4
dc05-p14-t100-n041:22817:22831 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
dc05-p14-t100-n041:22817:22831 [0] NCCL INFO NET/IB : No device found.
dc05-p14-t100-n041:22817:22831 [0] NCCL INFO NET/Socket : Using [0]eth0:fdbd:dc05:14:100::41<0> [1]eth2:fdbd:dc05:14:102::41<0> [2]carma_br0:fdbd:dc05:14:100:2900::1<0>
dc05-p14-t100-n041:22817:22831 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
dc05-p14-t100-n041:22817:22831 [0] NCCL INFO Using network Socket
dc05-p14-t100-n041:22817:22836 [5] NCCL INFO Using network Socket
dc05-p14-t100-n041:22817:22832 [1] NCCL INFO Using network Socket
dc05-p14-t100-n041:22817:22833 [2] NCCL INFO Using network Socket
dc05-p14-t100-n041:22817:22837 [6] NCCL INFO Using network Socket
dc05-p14-t100-n041:22817:22835 [4] NCCL INFO Using network Socket
dc05-p14-t100-n041:22817:22834 [3] NCCL INFO Using network Socket
dc05-p14-t100-n041:22817:22838 [7] NCCL INFO Using network Socket
dc05-p14-t100-n041:22817:22833 [2] NCCL INFO ncclCommInitAll comm 0x5588193bdc90 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 3c000 commId 0xed1225223327be20 - Init START
dc05-p14-t100-n041:22817:22831 [0] NCCL INFO ncclCommInitAll comm 0x558819334cf0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 18000 commId 0xed1225223327be20 - Init START
dc05-p14-t100-n041:22817:22836 [5] NCCL INFO ncclCommInitAll comm 0x55881948b400 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId 9c000 commId 0xed1225223327be20 - Init START
dc05-p14-t100-n041:22817:22832 [1] NCCL INFO ncclCommInitAll comm 0x5588193794c0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 1a000 commId 0xed1225223327be20 - Init START
dc05-p14-t100-n041:22817:22835 [4] NCCL INFO ncclCommInitAll comm 0x558819446c30 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 9a000 commId 0xed1225223327be20 - Init START
dc05-p14-t100-n041:22817:22837 [6] NCCL INFO ncclCommInitAll comm 0x5588194cfbd0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId bc000 commId 0xed1225223327be20 - Init START
dc05-p14-t100-n041:22817:22834 [3] NCCL INFO ncclCommInitAll comm 0x558819402460 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 3e000 commId 0xed1225223327be20 - Init START
dc05-p14-t100-n041:22817:22838 [7] NCCL INFO ncclCommInitAll comm 0x5588195143a0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId be000 commId 0xed1225223327be20 - Init START
dc05-p14-t100-n041:22817:22833 [2] NCCL INFO Bootstrap timings total 0.000992 (create 0.000038, send 0.000125, recv 0.000522, ring 0.000219, delay 0.000000)
dc05-p14-t100-n041:22817:22832 [1] NCCL INFO Bootstrap timings total 0.000855 (create 0.000026, send 0.000094, recv 0.000222, ring 0.000413, delay 0.000000)
dc05-p14-t100-n041:22817:22836 [5] NCCL INFO Bootstrap timings total 0.000863 (create 0.000024, send 0.000093, recv 0.000372, ring 0.000280, delay 0.000000)
dc05-p14-t100-n041:22817:22838 [7] NCCL INFO Bootstrap timings total 0.000799 (create 0.000025, send 0.000096, recv 0.000516, ring 0.000075, delay 0.000000)
dc05-p14-t100-n041:22817:22837 [6] NCCL INFO Bootstrap timings total 0.000837 (create 0.000028, send 0.000095, recv 0.000509, ring 0.000117, delay 0.000000)
dc05-p14-t100-n041:22817:22834 [3] NCCL INFO Bootstrap timings total 0.000827 (create 0.000027, send 0.000098, recv 0.000437, ring 0.000175, delay 0.000000)
dc05-p14-t100-n041:22817:22831 [0] NCCL INFO Bootstrap timings total 0.000889 (create 0.000024, send 0.000094, recv 0.000200, ring 0.000075, delay 0.000000)
dc05-p14-t100-n041:22817:22835 [4] NCCL INFO Bootstrap timings total 0.000847 (create 0.000026, send 0.000094, recv 0.000270, ring 0.000182, delay 0.000000)
dc05-p14-t100-n041:22817:22837 [6] NCCL INFO Setting affinity for GPU 6 to 0fff00,00000000,00000000,000fff00,00000000,00000000
dc05-p14-t100-n041:22817:22837 [6] NCCL INFO NVLS multicast support is not available on dev 6
dc05-p14-t100-n041:22817:22838 [7] NCCL INFO Setting affinity for GPU 7 to 0fff00,00000000,00000000,000fff00,00000000,00000000
dc05-p14-t100-n041:22817:22838 [7] NCCL INFO NVLS multicast support is not available on dev 7
dc05-p14-t100-n041:22817:22834 [3] NCCL INFO Setting affinity for GPU 3 to 0f,ff000000,00000000,0000000f,ff000000
dc05-p14-t100-n041:22817:22834 [3] NCCL INFO NVLS multicast support is not available on dev 3
dc05-p14-t100-n041:22817:22833 [2] NCCL INFO Setting affinity for GPU 2 to 0f,ff000000,00000000,0000000f,ff000000
dc05-p14-t100-n041:22817:22832 [1] NCCL INFO Setting affinity for GPU 1 to 0ffd,00000000,00000000,00000ffc
dc05-p14-t100-n041:22817:22832 [1] NCCL INFO NVLS multicast support is not available on dev 1
dc05-p14-t100-n041:22817:22833 [2] NCCL INFO NVLS multicast support is not available on dev 2
dc05-p14-t100-n041:22817:22836 [5] NCCL INFO Setting affinity for GPU 5 to 0fff0000,00000000,00000000,0fff0000,00000000
dc05-p14-t100-n041:22817:22835 [4] NCCL INFO Setting affinity for GPU 4 to 0fff0000,00000000,00000000,0fff0000,00000000
dc05-p14-t100-n041:22817:22836 [5] NCCL INFO NVLS multicast support is not available on dev 5
dc05-p14-t100-n041:22817:22835 [4] NCCL INFO NVLS multicast support is not available on dev 4
dc05-p14-t100-n041:22817:22831 [0] NCCL INFO Setting affinity for GPU 0 to 0ffd,00000000,00000000,00000ffc
dc05-p14-t100-n041:22817:22831 [0] NCCL INFO NVLS multicast support is not available on dev 0
dc05-p14-t100-n041:22817:22833 [2] NCCL INFO comm 0x5588193bdc90 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
dc05-p14-t100-n041:22817:22832 [1] NCCL INFO comm 0x5588193794c0 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
dc05-p14-t100-n041:22817:22838 [7] NCCL INFO comm 0x5588195143a0 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
dc05-p14-t100-n041:22817:22835 [4] NCCL INFO comm 0x558819446c30 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
dc05-p14-t100-n041:22817:22831 [0] NCCL INFO comm 0x558819334cf0 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
dc05-p14-t100-n041:22817:22835 [4] NCCL INFO Trees [0] 5/-1/-1->4->7 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->7 [3] 5/-1/-1->4->3
dc05-p14-t100-n041:22817:22835 [4] NCCL INFO P2P Chunksize set to 131072
dc05-p14-t100-n041:22817:22831 [0] NCCL INFO Channel 00/04 : 0 1 2 3 6 7 4 5
dc05-p14-t100-n041:22817:22831 [0] NCCL INFO Channel 01/04 : 0 1 4 5 6 7 2 3
dc05-p14-t100-n041:22817:22831 [0] NCCL INFO Channel 02/04 : 0 1 2 3 6 7 4 5
dc05-p14-t100-n041:22817:22832 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 6/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 6/-1/-1->1->0
dc05-p14-t100-n041:22817:22832 [1] NCCL INFO P2P Chunksize set to 131072
dc05-p14-t100-n041:22817:22838 [7] NCCL INFO Trees [0] 4/-1/-1->7->6 [1] 2/-1/-1->7->6 [2] 4/-1/-1->7->6 [3] 2/-1/-1->7->6
dc05-p14-t100-n041:22817:22838 [7] NCCL INFO P2P Chunksize set to 131072
dc05-p14-t100-n041:22817:22837 [6] NCCL INFO comm 0x5588194cfbd0 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
dc05-p14-t100-n041:22817:22834 [3] NCCL INFO comm 0x558819402460 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
dc05-p14-t100-n041:22817:22836 [5] NCCL INFO comm 0x55881948b400 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
dc05-p14-t100-n041:22817:22831 [0] NCCL INFO Channel 03/04 : 0 1 4 5 6 7 2 3
dc05-p14-t100-n041:22817:22831 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
dc05-p14-t100-n041:22817:22831 [0] NCCL INFO P2P Chunksize set to 131072
dc05-p14-t100-n041:22817:22840 [4] NCCL INFO [Proxy Service] Device 4 CPU core 153
dc05-p14-t100-n041:22817:22834 [3] NCCL INFO Trees [0] 6/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 6/-1/-1->3->2 [3] 4/-1/-1->3->2
dc05-p14-t100-n041:22817:22834 [3] NCCL INFO P2P Chunksize set to 131072
dc05-p14-t100-n041:22817:22836 [5] NCCL INFO Trees [0] -1/-1/-1->5->4 [1] -1/-1/-1->5->4 [2] -1/-1/-1->5->4 [3] -1/-1/-1->5->4
dc05-p14-t100-n041:22817:22836 [5] NCCL INFO P2P Chunksize set to 131072
dc05-p14-t100-n041:22817:22833 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->7 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->7
dc05-p14-t100-n041:22817:22833 [2] NCCL INFO P2P Chunksize set to 131072
dc05-p14-t100-n041:22817:22837 [6] NCCL INFO Trees [0] 7/-1/-1->6->3 [1] 7/-1/-1->6->1 [2] 7/-1/-1->6->3 [3] 7/-1/-1->6->1
dc05-p14-t100-n041:22817:22837 [6] NCCL INFO P2P Chunksize set to 131072
dc05-p14-t100-n041:22817:22841 [1] NCCL INFO [Proxy Service] Device 1 CPU core 99
dc05-p14-t100-n041:22817:22843 [4] NCCL INFO [Proxy Service UDS] Device 4 CPU core 153
dc05-p14-t100-n041:22817:22842 [7] NCCL INFO [Proxy Service] Device 7 CPU core 169
dc05-p14-t100-n041:22817:22844 [0] NCCL INFO [Proxy Service] Device 0 CPU core 104
dc05-p14-t100-n041:22817:22845 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 99
dc05-p14-t100-n041:22817:22846 [3] NCCL INFO [Proxy Service] Device 3 CPU core 121
dc05-p14-t100-n041:22817:22847 [5] NCCL INFO [Proxy Service] Device 5 CPU core 49
dc05-p14-t100-n041:22817:22849 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 122
dc05-p14-t100-n041:22817:22851 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 104
dc05-p14-t100-n041:22817:22850 [7] NCCL INFO [Proxy Service UDS] Device 7 CPU core 169
dc05-p14-t100-n041:22817:22852 [6] NCCL INFO [Proxy Service] Device 6 CPU core 79
dc05-p14-t100-n041:22817:22848 [2] NCCL INFO [Proxy Service] Device 2 CPU core 32
dc05-p14-t100-n041:22817:22853 [5] NCCL INFO [Proxy Service UDS] Device 5 CPU core 147
dc05-p14-t100-n041:22817:22854 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 33
dc05-p14-t100-n041:22817:22855 [6] NCCL INFO [Proxy Service UDS] Device 6 CPU core 79
dc05-p14-t100-n041:22817:22836 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
dc05-p14-t100-n041:22817:22836 [5] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
dc05-p14-t100-n041:22817:22835 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
dc05-p14-t100-n041:22817:22835 [4] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
dc05-p14-t100-n041:22817:22831 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
dc05-p14-t100-n041:22817:22831 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
dc05-p14-t100-n041:22817:22834 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
dc05-p14-t100-n041:22817:22834 [3] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
dc05-p14-t100-n041:22817:22833 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
dc05-p14-t100-n041:22817:22833 [2] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
dc05-p14-t100-n041:22817:22838 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
dc05-p14-t100-n041:22817:22838 [7] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
dc05-p14-t100-n041:22817:22832 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
dc05-p14-t100-n041:22817:22832 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
dc05-p14-t100-n041:22817:22831 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
dc05-p14-t100-n041:22817:22837 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
dc05-p14-t100-n041:22817:22837 [6] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
dc05-p14-t100-n041:22817:22834 [3] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
dc05-p14-t100-n041:22817:22834 [3] NCCL INFO ncclCommInitAll comm 0x558819402460 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 3e000 commId 0xed1225223327be20 - Init COMPLETE
dc05-p14-t100-n041:22817:22834 [3] NCCL INFO Init timings - ncclCommInitAll: rank 3 nranks 8 total 0.60 (kernels 0.44, alloc 0.05, bootstrap 0.00, allgathers 0.00, topo 0.04, graphs 0.04, connections 0.01, rest 0.00)
dc05-p14-t100-n041:22817:22836 [5] NCCL INFO ncclCommInitAll comm 0x55881948b400 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId 9c000 commId 0xed1225223327be20 - Init COMPLETE
dc05-p14-t100-n041:22817:22835 [4] NCCL INFO ncclCommInitAll comm 0x558819446c30 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 9a000 commId 0xed1225223327be20 - Init COMPLETE
dc05-p14-t100-n041:22817:22836 [5] NCCL INFO Init timings - ncclCommInitAll: rank 5 nranks 8 total 0.60 (kernels 0.44, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.05, graphs 0.04, connections 0.01, rest 0.00)
dc05-p14-t100-n041:22817:22838 [7] NCCL INFO ncclCommInitAll comm 0x5588195143a0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId be000 commId 0xed1225223327be20 - Init COMPLETE
dc05-p14-t100-n041:22817:22831 [0] NCCL INFO ncclCommInitAll comm 0x558819334cf0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 18000 commId 0xed1225223327be20 - Init COMPLETE
dc05-p14-t100-n041:22817:22837 [6] NCCL INFO ncclCommInitAll comm 0x5588194cfbd0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId bc000 commId 0xed1225223327be20 - Init COMPLETE
dc05-p14-t100-n041:22817:22831 [0] NCCL INFO Init timings - ncclCommInitAll: rank 0 nranks 8 total 0.60 (kernels 0.44, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.05, graphs 0.04, connections 0.01, rest 0.00)
dc05-p14-t100-n041:22817:22837 [6] NCCL INFO Init timings - ncclCommInitAll: rank 6 nranks 8 total 0.60 (kernels 0.44, alloc 0.05, bootstrap 0.00, allgathers 0.00, topo 0.04, graphs 0.04, connections 0.01, rest 0.00)
dc05-p14-t100-n041:22817:22838 [7] NCCL INFO Init timings - ncclCommInitAll: rank 7 nranks 8 total 0.60 (kernels 0.44, alloc 0.05, bootstrap 0.00, allgathers 0.00, topo 0.04, graphs 0.04, connections 0.01, rest 0.00)
dc05-p14-t100-n041:22817:22832 [1] NCCL INFO ncclCommInitAll comm 0x5588193794c0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 1a000 commId 0xed1225223327be20 - Init COMPLETE
dc05-p14-t100-n041:22817:22835 [4] NCCL INFO Init timings - ncclCommInitAll: rank 4 nranks 8 total 0.60 (kernels 0.44, alloc 0.05, bootstrap 0.00, allgathers 0.00, topo 0.05, graphs 0.04, connections 0.01, rest 0.00)
dc05-p14-t100-n041:22817:22833 [2] NCCL INFO ncclCommInitAll comm 0x5588193bdc90 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 3c000 commId 0xed1225223327be20 - Init COMPLETE
dc05-p14-t100-n041:22817:22832 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 8 total 0.60 (kernels 0.44, alloc 0.05, bootstrap 0.00, allgathers 0.00, topo 0.04, graphs 0.04, connections 0.01, rest 0.00)
dc05-p14-t100-n041:22817:22833 [2] NCCL INFO Init timings - ncclCommInitAll: rank 2 nranks 8 total 0.60 (kernels 0.44, alloc 0.05, bootstrap 0.00, allgathers 0.00, topo 0.04, graphs 0.04, connections 0.01, rest 0.00)
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
dc05-p14-t100-n041:22817:22859 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/direct pointer
dc05-p14-t100-n041:22817:22863 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer
dc05-p14-t100-n041:22817:22859 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/direct pointer
dc05-p14-t100-n041:22817:22863 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/direct pointer
dc05-p14-t100-n041:22817:22859 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/direct pointer
dc05-p14-t100-n041:22817:22863 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/direct pointer
dc05-p14-t100-n041:22817:22859 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/direct pointer
dc05-p14-t100-n041:22817:22863 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/direct pointer
dc05-p14-t100-n041:22817:22862 [1] NCCL INFO Channel 00 : 1[1] -> 2[2] via SHM/direct/direct
dc05-p14-t100-n041:22817:22858 [5] NCCL INFO Channel 01 : 5[5] -> 6[6] via SHM/direct/direct
dc05-p14-t100-n041:22817:22857 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/direct pointer
dc05-p14-t100-n041:22817:22857 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/direct pointer
dc05-p14-t100-n041:22817:22857 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/direct pointer
dc05-p14-t100-n041:22817:22857 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/direct pointer
dc05-p14-t100-n041:22817:22861 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/direct pointer
dc05-p14-t100-n041:22817:22861 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/direct pointer
dc05-p14-t100-n041:22817:22862 [1] NCCL INFO Channel 02 : 1[1] -> 2[2] via SHM/direct/direct
dc05-p14-t100-n041:22817:22861 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/direct pointer
dc05-p14-t100-n041:22817:22861 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/direct pointer
dc05-p14-t100-n041:22817:22858 [5] NCCL INFO Channel 03 : 5[5] -> 6[6] via SHM/direct/direct
dc05-p14-t100-n041:22817:22856 [7] NCCL INFO Channel 01 : 7[7] -> 2[2] via SHM/direct/direct
dc05-p14-t100-n041:22817:22862 [1] NCCL INFO Channel 01 : 1[1] -> 4[4] via SHM/direct/direct
dc05-p14-t100-n041:22817:22860 [3] NCCL INFO Channel 00 : 3[3] -> 6[6] via SHM/direct/direct
dc05-p14-t100-n041:22817:22858 [5] NCCL INFO Channel 00 : 5[5] -> 0[0] via SHM/direct/direct
dc05-p14-t100-n041:22817:22856 [7] NCCL INFO Channel 03 : 7[7] -> 2[2] via SHM/direct/direct
dc05-p14-t100-n041:22817:22862 [1] NCCL INFO Channel 03 : 1[1] -> 4[4] via SHM/direct/direct
dc05-p14-t100-n041:22817:22860 [3] NCCL INFO Channel 02 : 3[3] -> 6[6] via SHM/direct/direct
dc05-p14-t100-n041:22817:22858 [5] NCCL INFO Channel 02 : 5[5] -> 0[0] via SHM/direct/direct
dc05-p14-t100-n041:22817:22856 [7] NCCL INFO Channel 00 : 7[7] -> 4[4] via SHM/direct/direct
dc05-p14-t100-n041:22817:22860 [3] NCCL INFO Channel 01 : 3[3] -> 0[0] via SHM/direct/direct
dc05-p14-t100-n041:22817:22856 [7] NCCL INFO Channel 02 : 7[7] -> 4[4] via SHM/direct/direct
dc05-p14-t100-n041:22817:22860 [3] NCCL INFO Channel 03 : 3[3] -> 0[0] via SHM/direct/direct
dc05-p14-t100-n041:22817:22858 [5] NCCL INFO Connected all rings
dc05-p14-t100-n041:22817:22862 [1] NCCL INFO Connected all rings
dc05-p14-t100-n041:22817:22861 [2] NCCL INFO Connected all rings
dc05-p14-t100-n041:22817:22857 [6] NCCL INFO Connected all rings
dc05-p14-t100-n041:22817:22856 [7] NCCL INFO Connected all rings
dc05-p14-t100-n041:22817:22863 [0] NCCL INFO Connected all rings
dc05-p14-t100-n041:22817:22859 [4] NCCL INFO Connected all rings
dc05-p14-t100-n041:22817:22860 [3] NCCL INFO Connected all rings
(stuck here)
@sjeaugey
Member

sjeaugey commented Dec 9, 2024

Maybe ACS is enabled on the one getting stuck?

@zh0ngtian
Author

> Maybe ACS is enabled on the one getting stuck?

@sjeaugey I ran sudo lspci -vvv | grep ACSCtl to check whether ACS is enabled. The output shows that ACS is disabled.
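
Piping through grep drops the device names, so it is hard to tell which bridge a given ACSCtl line belongs to. The sketch below keeps that association by parsing the full lspci -vvv text. It assumes the usual layout, where each device starts with its bus address in column 0 and the ACSCtl flags are suffixed with + (on) or - (off). The function name `acs_enabled_devices` and the sample text are made up for illustration and are not output from the machines in this issue.

```python
import re

# A device header starts at column 0 with a bus address such as
# "17:00.0" or "0000:17:00.0"; capability lines are indented.
DEVICE_RE = re.compile(r"^((?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s")
FLAG_RE = re.compile(r"(\w+)([+-])")


def acs_enabled_devices(lspci_text):
    """Return {bus_address: [ACSCtl flags that are on]} for devices with any flag on."""
    enabled = {}
    current = None
    for line in lspci_text.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            current = header.group(1)
            continue
        stripped = line.strip()
        if current is not None and stripped.startswith("ACSCtl:"):
            flags = FLAG_RE.findall(stripped[len("ACSCtl:"):])
            on = [name for name, sign in flags if sign == "+"]
            if on:
                enabled[current] = on
    return enabled


# Hypothetical sample: the first bridge has ACS on, the second has it off.
SAMPLE = (
    "17:00.0 PCI bridge: Example Vendor Device 0001\n"
    "\tCapabilities: [220 v1] Access Control Services\n"
    "\t\tACSCap:\tSrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+\n"
    "\t\tACSCtl:\tSrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-\n"
    "3a:00.0 PCI bridge: Example Vendor Device 0001\n"
    "\t\tACSCtl:\tSrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-\n"
)

print(acs_enabled_devices(SAMPLE))
```

An empty result for the real lspci output would match the "ACS is disabled" reading above. The text should be captured with sudo, as in the original command, since lspci may not show capability details to unprivileged users.
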

@sjeaugey
Member

sjeaugey commented Dec 9, 2024

Can you confirm whether it works when setting NCCL_P2P_DISABLE=1?

@zh0ngtian
Author

> Can you confirm whether it works when setting NCCL_P2P_DISABLE=1?

@sjeaugey This works! So will turning off P2P affect performance?

@sjeaugey
Member

sjeaugey commented Dec 9, 2024

Yes, it will. But at least we have confirmed that the hang comes from GPU Direct P2P being broken on your system, i.e. the GPUs can't talk directly to each other over PCI. That typically comes from ACS being enabled, but in your case it may be something else.
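
To see which GPU pairs NCCL actually connected over P2P in a debug log like the one in the first comment, the "Channel ... via ..." lines can be summarized per transport. This is a rough sketch, assuming the line format seen in that log; the function name `transports_by_pair` is made up for illustration, and the sample lines are copied from the log with the host/PID prefix removed.

```python
import re
from collections import defaultdict

# Matches e.g. "Channel 00/0 : 4[4] -> 5[5] via P2P/direct pointer"
# and "Channel 00 : 1[1] -> 2[2] via SHM/direct/direct".
# Ring-order lines such as "Channel 00/04 : 0 1 2 3 ..." do not match.
CONNECT_RE = re.compile(
    r"Channel (\d+)(?:/\d+)? : (\d+)\[\d+\] -> (\d+)\[\d+\] via (\w+)"
)


def transports_by_pair(log_text):
    """Return {transport: sorted list of (src_rank, dst_rank)} from an NCCL log."""
    pairs = defaultdict(set)
    for line in log_text.splitlines():
        m = CONNECT_RE.search(line)
        if m:
            _channel, src, dst, transport = m.groups()
            pairs[transport].add((int(src), int(dst)))
    return {t: sorted(p) for t, p in pairs.items()}


SAMPLE = """\
NCCL INFO Channel 00/04 : 0 1 2 3 6 7 4 5
NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/direct pointer
NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer
NCCL INFO Channel 00 : 1[1] -> 2[2] via SHM/direct/direct
NCCL INFO Channel 00 : 5[5] -> 0[0] via SHM/direct/direct
"""

print(transports_by_pair(SAMPLE))
```

In the log from the hanging machine, the P2P connections are 0->1, 2->3, 4->5 and 6->7, which are the pairs shown as PIX in the topology matrix; every other connection uses SHM. That is consistent with the hang going away once P2P is disabled.
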

@zh0ngtian
Author

NCCL_DEBUG=TRACE gdb ./build/all_reduce_perf -ex "run -b 8 -e 128M -f 2 -g 8" -ex "thread apply all bt" -ex "quit"

# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
[New Thread 0x7fffe742d000 (LWP 7625)]
#  Rank  0 Group  0 Pid   7622 on dc02-p21-t208-n012 device  0 [0x18] NVIDIA L40S
#  Rank  1 Group  0 Pid   7622 on dc02-p21-t208-n012 device  1 [0x19] NVIDIA L40S
#  Rank  2 Group  0 Pid   7622 on dc02-p21-t208-n012 device  2 [0x3b] NVIDIA L40S
#  Rank  3 Group  0 Pid   7622 on dc02-p21-t208-n012 device  3 [0x3c] NVIDIA L40S
#  Rank  4 Group  0 Pid   7622 on dc02-p21-t208-n012 device  4 [0x9a] NVIDIA L40S
#  Rank  5 Group  0 Pid   7622 on dc02-p21-t208-n012 device  5 [0x9b] NVIDIA L40S
#  Rank  6 Group  0 Pid   7622 on dc02-p21-t208-n012 device  6 [0xbb] NVIDIA L40S
#  Rank  7 Group  0 Pid   7622 on dc02-p21-t208-n012 device  7 [0xbc] NVIDIA L40S
dc02-p21-t208-n012:7622:7622 [0] NCCL INFO Bootstrap : Using eth0:fdbd:dc02:21:208::12<0>
[New Thread 0x7fffd3f39000 (LWP 7626)]
[New Thread 0x7fffd3738000 (LWP 7627)]
dc02-p21-t208-n012:7622:7622 [0] NCCL INFO cudaDriverVersion 12040
[New Thread 0x7fffd1d16000 (LWP 7628)]
[New Thread 0x7fff81fff000 (LWP 7629)]
[New Thread 0x7fff49fff000 (LWP 7630)]
[New Thread 0x7fff11fff000 (LWP 7631)]
[New Thread 0x7ffed7fff000 (LWP 7632)]
[New Thread 0x7ffed67dd000 (LWP 7633)]
[New Thread 0x7ffed4fbb000 (LWP 7634)]
[New Thread 0x7ffe9cfde000 (LWP 7635)]
dc02-p21-t208-n012:7622:7622 [7] NCCL INFO NCCL version 2.22.3+cuda12.4
[New Thread 0x7ffe65fff000 (LWP 7636)]
[New Thread 0x7ffe657fe000 (LWP 7637)]
[New Thread 0x7ffe64ffd000 (LWP 7638)]
[New Thread 0x7ffdf1fff000 (LWP 7639)]
[New Thread 0x7ffdf17fe000 (LWP 7640)]
[New Thread 0x7ffdf0ffd000 (LWP 7641)]
[New Thread 0x7ffde8f04000 (LWP 7642)]
[New Thread 0x7ffdd3fff000 (LWP 7643)]
dc02-p21-t208-n012:7622:7636 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
dc02-p21-t208-n012:7622:7636 [0] NCCL INFO NET/IB : No device found.
dc02-p21-t208-n012:7622:7636 [0] NCCL INFO NET/Socket : Using [0]eth0:fdbd:dc02:21:208::12<0> [1]eth2:fdbd:dc02:21:20a::12<0> [2]carma_vxlan0:fe80::9848:c0ff:fe8b:8b83%carma_vxlan0<0> [3]carma_br0:fdbd:fdbd:fdbd:fdbd::1<0>
dc02-p21-t208-n012:7622:7636 [0] NCCL INFO Using network Socket
dc02-p21-t208-n012:7622:7638 [2] NCCL INFO Using network Socket
dc02-p21-t208-n012:7622:7639 [3] NCCL INFO Using network Socket
dc02-p21-t208-n012:7622:7642 [6] NCCL INFO Using network Socket
dc02-p21-t208-n012:7622:7641 [5] NCCL INFO Using network Socket
dc02-p21-t208-n012:7622:7643 [7] NCCL INFO Using network Socket
dc02-p21-t208-n012:7622:7640 [4] NCCL INFO Using network Socket
dc02-p21-t208-n012:7622:7637 [1] NCCL INFO Using network Socket
[Thread 0x7ffe9cfde000 (LWP 7635) exited]
dc02-p21-t208-n012:7622:7643 [7] NCCL INFO ncclCommInitRank comm 0x55555b0f2f50 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId bc000 commId 0xb1a317b2c0f1498c - Init START
dc02-p21-t208-n012:7622:7639 [3] NCCL INFO ncclCommInitRank comm 0x55555b010370 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 3c000 commId 0xb1a317b2c0f1498c - Init START
dc02-p21-t208-n012:7622:7640 [4] NCCL INFO ncclCommInitRank comm 0x55555b048e40 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 9a000 commId 0xb1a317b2c0f1498c - Init START
dc02-p21-t208-n012:7622:7642 [6] NCCL INFO ncclCommInitRank comm 0x55555b0ba430 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId bb000 commId 0xb1a317b2c0f1498c - Init START
dc02-p21-t208-n012:7622:7641 [5] NCCL INFO ncclCommInitRank comm 0x55555b081960 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId 9b000 commId 0xb1a317b2c0f1498c - Init START
dc02-p21-t208-n012:7622:7638 [2] NCCL INFO ncclCommInitRank comm 0x55555afd7850 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 3b000 commId 0xb1a317b2c0f1498c - Init START
dc02-p21-t208-n012:7622:7637 [1] NCCL INFO ncclCommInitRank comm 0x55555af9ecf0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 19000 commId 0xb1a317b2c0f1498c - Init START
dc02-p21-t208-n012:7622:7636 [0] NCCL INFO ncclCommInitRank comm 0x55555af62920 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 18000 commId 0xb1a317b2c0f1498c - Init START
dc02-p21-t208-n012:7622:7640 [4] NCCL INFO Setting affinity for GPU 4 to 0fff0000,00000000,00000000,0fff0000,00000000
dc02-p21-t208-n012:7622:7638 [2] NCCL INFO Setting affinity for GPU 2 to 0f,ff000000,00000000,0000000f,ff000000
dc02-p21-t208-n012:7622:7640 [4] NCCL INFO NVLS multicast support is not available on dev 4
dc02-p21-t208-n012:7622:7638 [2] NCCL INFO NVLS multicast support is not available on dev 2
dc02-p21-t208-n012:7622:7639 [3] NCCL INFO Setting affinity for GPU 3 to 0f,ff000000,00000000,0000000f,ff000000
dc02-p21-t208-n012:7622:7639 [3] NCCL INFO NVLS multicast support is not available on dev 3
dc02-p21-t208-n012:7622:7641 [5] NCCL INFO Setting affinity for GPU 5 to 0fff0000,00000000,00000000,0fff0000,00000000
dc02-p21-t208-n012:7622:7636 [0] NCCL INFO Setting affinity for GPU 0 to 0ffd,00000000,00000000,00000ffc
dc02-p21-t208-n012:7622:7636 [0] NCCL INFO NVLS multicast support is not available on dev 0
dc02-p21-t208-n012:7622:7637 [1] NCCL INFO Setting affinity for GPU 1 to 0ffd,00000000,00000000,00000ffc
dc02-p21-t208-n012:7622:7637 [1] NCCL INFO NVLS multicast support is not available on dev 1
dc02-p21-t208-n012:7622:7641 [5] NCCL INFO NVLS multicast support is not available on dev 5
dc02-p21-t208-n012:7622:7643 [7] NCCL INFO Setting affinity for GPU 7 to 0fff00,00000000,00000000,000fff00,00000000,00000000
dc02-p21-t208-n012:7622:7643 [7] NCCL INFO NVLS multicast support is not available on dev 7
dc02-p21-t208-n012:7622:7642 [6] NCCL INFO Setting affinity for GPU 6 to 0fff00,00000000,00000000,000fff00,00000000,00000000
dc02-p21-t208-n012:7622:7642 [6] NCCL INFO NVLS multicast support is not available on dev 6
dc02-p21-t208-n012:7622:7642 [6] NCCL INFO comm 0x55555b0ba430 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
dc02-p21-t208-n012:7622:7638 [2] NCCL INFO comm 0x55555afd7850 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
dc02-p21-t208-n012:7622:7643 [7] NCCL INFO comm 0x55555b0f2f50 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
dc02-p21-t208-n012:7622:7638 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->7 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->7
dc02-p21-t208-n012:7622:7643 [7] NCCL INFO Trees [0] 4/-1/-1->7->6 [1] 2/-1/-1->7->6 [2] 4/-1/-1->7->6 [3] 2/-1/-1->7->6
dc02-p21-t208-n012:7622:7643 [7] NCCL INFO P2P Chunksize set to 131072
dc02-p21-t208-n012:7622:7637 [1] NCCL INFO comm 0x55555af9ecf0 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
dc02-p21-t208-n012:7622:7636 [0] NCCL INFO comm 0x55555af62920 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
dc02-p21-t208-n012:7622:7637 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 6/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 6/-1/-1->1->0
dc02-p21-t208-n012:7622:7636 [0] NCCL INFO Channel 00/04 :    0   1   2   3   6   7   4   5
dc02-p21-t208-n012:7622:7642 [6] NCCL INFO Trees [0] 7/-1/-1->6->3 [1] 7/-1/-1->6->1 [2] 7/-1/-1->6->3 [3] 7/-1/-1->6->1
dc02-p21-t208-n012:7622:7642 [6] NCCL INFO P2P Chunksize set to 131072
dc02-p21-t208-n012:7622:7638 [2] NCCL INFO P2P Chunksize set to 131072
dc02-p21-t208-n012:7622:7641 [5] NCCL INFO comm 0x55555b081960 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
dc02-p21-t208-n012:7622:7639 [3] NCCL INFO comm 0x55555b010370 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
dc02-p21-t208-n012:7622:7641 [5] NCCL INFO Trees [0] -1/-1/-1->5->4 [1] -1/-1/-1->5->4 [2] -1/-1/-1->5->4 [3] -1/-1/-1->5->4
dc02-p21-t208-n012:7622:7639 [3] NCCL INFO Trees [0] 6/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 6/-1/-1->3->2 [3] 4/-1/-1->3->2
dc02-p21-t208-n012:7622:7639 [3] NCCL INFO P2P Chunksize set to 131072
dc02-p21-t208-n012:7622:7640 [4] NCCL INFO comm 0x55555b048e40 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
dc02-p21-t208-n012:7622:7641 [5] NCCL INFO P2P Chunksize set to 131072
dc02-p21-t208-n012:7622:7640 [4] NCCL INFO Trees [0] 5/-1/-1->4->7 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->7 [3] 5/-1/-1->4->3
dc02-p21-t208-n012:7622:7640 [4] NCCL INFO P2P Chunksize set to 131072
dc02-p21-t208-n012:7622:7636 [0] NCCL INFO Channel 01/04 :    0   1   4   5   6   7   2   3
dc02-p21-t208-n012:7622:7636 [0] NCCL INFO Channel 02/04 :    0   1   2   3   6   7   4   5
dc02-p21-t208-n012:7622:7636 [0] NCCL INFO Channel 03/04 :    0   1   4   5   6   7   2   3
dc02-p21-t208-n012:7622:7637 [1] NCCL INFO P2P Chunksize set to 131072
dc02-p21-t208-n012:7622:7636 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
dc02-p21-t208-n012:7622:7636 [0] NCCL INFO P2P Chunksize set to 131072
[New Thread 0x7ffe9cfde000 (LWP 7644)]
[New Thread 0x7ffdc8fff000 (LWP 7645)]
[New Thread 0x7ffcf2ffd000 (LWP 7648)]
[New Thread 0x7ffcf27fc000 (LWP 7649)]
[New Thread 0x7ffcf37fe000 (LWP 7647)]
[New Thread 0x7ffcf3fff000 (LWP 7646)]
[New Thread 0x7ffcf1ffb000 (LWP 7650)]
[New Thread 0x7ffcf17fa000 (LWP 7651)]
[New Thread 0x7ffcebfff000 (LWP 7652)]
[New Thread 0x7ffceaffd000 (LWP 7653)]
[New Thread 0x7ffceb7fe000 (LWP 7654)]
[New Thread 0x7ffce9ffb000 (LWP 7658)]
[New Thread 0x7ffcea7fc000 (LWP 7655)]
[New Thread 0x7ffce97fa000 (LWP 7657)]
[New Thread 0x7ffce8ff9000 (LWP 7656)]
[New Thread 0x7ffccffff000 (LWP 7659)]
dc02-p21-t208-n012:7622:7642 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
dc02-p21-t208-n012:7622:7642 [6] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
dc02-p21-t208-n012:7622:7640 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
dc02-p21-t208-n012:7622:7640 [4] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
dc02-p21-t208-n012:7622:7641 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
dc02-p21-t208-n012:7622:7641 [5] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
dc02-p21-t208-n012:7622:7643 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
dc02-p21-t208-n012:7622:7643 [7] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
dc02-p21-t208-n012:7622:7638 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
dc02-p21-t208-n012:7622:7638 [2] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
dc02-p21-t208-n012:7622:7637 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
dc02-p21-t208-n012:7622:7637 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
dc02-p21-t208-n012:7622:7636 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
dc02-p21-t208-n012:7622:7636 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
dc02-p21-t208-n012:7622:7636 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
dc02-p21-t208-n012:7622:7639 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
dc02-p21-t208-n012:7622:7639 [3] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
dc02-p21-t208-n012:7622:7638 [2] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
dc02-p21-t208-n012:7622:7638 [2] NCCL INFO ncclCommInitRank comm 0x55555afd7850 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 3b000 commId 0xb1a317b2c0f1498c - Init COMPLETE
dc02-p21-t208-n012:7622:7638 [2] NCCL INFO Init timings: rank 2 nranks 8 total 0.66 (kernels 0.42, bootstrap 0.07, allgathers 0.03, topo 0.06, graphs 0.05, connections 0.02, rest 0.00)
dc02-p21-t208-n012:7622:7642 [6] NCCL INFO ncclCommInitRank comm 0x55555b0ba430 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId bb000 commId 0xb1a317b2c0f1498c - Init COMPLETE
dc02-p21-t208-n012:7622:7640 [4] NCCL INFO ncclCommInitRank comm 0x55555b048e40 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 9a000 commId 0xb1a317b2c0f1498c - Init COMPLETE
dc02-p21-t208-n012:7622:7636 [0] NCCL INFO ncclCommInitRank comm 0x55555af62920 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 18000 commId 0xb1a317b2c0f1498c - Init COMPLETE
dc02-p21-t208-n012:7622:7641 [5] NCCL INFO ncclCommInitRank comm 0x55555b081960 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId 9b000 commId 0xb1a317b2c0f1498c - Init COMPLETE
dc02-p21-t208-n012:7622:7639 [3] NCCL INFO ncclCommInitRank comm 0x55555b010370 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 3c000 commId 0xb1a317b2c0f1498c - Init COMPLETE
dc02-p21-t208-n012:7622:7643 [7] NCCL INFO ncclCommInitRank comm 0x55555b0f2f50 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId bc000 commId 0xb1a317b2c0f1498c - Init COMPLETE
[Thread 0x7ffe64ffd000 (LWP 7638) exited]
dc02-p21-t208-n012:7622:7640 [4] NCCL INFO Init timings: rank 4 nranks 8 total 0.66 (kernels 0.42, bootstrap 0.07, allgathers 0.03, topo 0.06, graphs 0.05, connections 0.02, rest 0.00)
dc02-p21-t208-n012:7622:7637 [1] NCCL INFO ncclCommInitRank comm 0x55555af9ecf0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 19000 commId 0xb1a317b2c0f1498c - Init COMPLETE
dc02-p21-t208-n012:7622:7642 [6] NCCL INFO Init timings: rank 6 nranks 8 total 0.66 (kernels 0.42, bootstrap 0.07, allgathers 0.00, topo 0.06, graphs 0.08, connections 0.02, rest 0.00)
dc02-p21-t208-n012:7622:7636 [0] NCCL INFO Init timings: rank 0 nranks 8 total 0.66 (kernels 0.42, bootstrap 0.08, allgathers 0.03, topo 0.06, graphs 0.05, connections 0.02, rest 0.00)
dc02-p21-t208-n012:7622:7641 [5] NCCL INFO Init timings: rank 5 nranks 8 total 0.66 (kernels 0.42, bootstrap 0.07, allgathers 0.03, topo 0.06, graphs 0.05, connections 0.02, rest 0.00)
dc02-p21-t208-n012:7622:7639 [3] NCCL INFO Init timings: rank 3 nranks 8 total 0.66 (kernels 0.42, bootstrap 0.07, allgathers 0.03, topo 0.06, graphs 0.05, connections 0.02, rest 0.00)
dc02-p21-t208-n012:7622:7643 [7] NCCL INFO Init timings: rank 7 nranks 8 total 0.66 (kernels 0.42, bootstrap 0.07, allgathers 0.00, topo 0.06, graphs 0.08, connections 0.02, rest 0.00)
dc02-p21-t208-n012:7622:7637 [1] NCCL INFO Init timings: rank 1 nranks 8 total 0.66 (kernels 0.42, bootstrap 0.08, allgathers 0.03, topo 0.06, graphs 0.05, connections 0.02, rest 0.00)
[Thread 0x7ffdf17fe000 (LWP 7640) exited]
[Thread 0x7ffde8f04000 (LWP 7642) exited]
[Thread 0x7ffe65fff000 (LWP 7636) exited]
[Thread 0x7ffdd3fff000 (LWP 7643) exited]
[Thread 0x7ffdf0ffd000 (LWP 7641) exited]
[Thread 0x7ffe657fe000 (LWP 7637) exited]
[Thread 0x7ffdf1fff000 (LWP 7639) exited]
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
[New Thread 0x7ffdf1fff000 (LWP 7660)]
[New Thread 0x7ffe657fe000 (LWP 7661)]
[New Thread 0x7ffe65fff000 (LWP 7662)]
[New Thread 0x7ffdd3fff000 (LWP 7663)]
[New Thread 0x7ffe64ffd000 (LWP 7664)]
[New Thread 0x7ffdf17fe000 (LWP 7665)]
[New Thread 0x7ffdf0ffd000 (LWP 7666)]
[New Thread 0x7ffde8f04000 (LWP 7667)]
dc02-p21-t208-n012:7622:7667 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer
dc02-p21-t208-n012:7622:7663 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/direct pointer
dc02-p21-t208-n012:7622:7667 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/direct pointer
dc02-p21-t208-n012:7622:7663 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/direct pointer
dc02-p21-t208-n012:7622:7667 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/direct pointer
dc02-p21-t208-n012:7622:7663 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/direct pointer
dc02-p21-t208-n012:7622:7667 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/direct pointer
dc02-p21-t208-n012:7622:7663 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/direct pointer
dc02-p21-t208-n012:7622:7662 [5] NCCL INFO Channel 01 : 5[5] -> 6[6] via SHM/direct/direct
dc02-p21-t208-n012:7622:7666 [1] NCCL INFO Channel 00 : 1[1] -> 2[2] via SHM/direct/direct
dc02-p21-t208-n012:7622:7665 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/direct pointer
dc02-p21-t208-n012:7622:7665 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/direct pointer
dc02-p21-t208-n012:7622:7665 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/direct pointer
dc02-p21-t208-n012:7622:7665 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/direct pointer
dc02-p21-t208-n012:7622:7661 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/direct pointer
dc02-p21-t208-n012:7622:7661 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/direct pointer
dc02-p21-t208-n012:7622:7661 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/direct pointer
dc02-p21-t208-n012:7622:7661 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/direct pointer
dc02-p21-t208-n012:7622:7662 [5] NCCL INFO Channel 03 : 5[5] -> 6[6] via SHM/direct/direct
dc02-p21-t208-n012:7622:7666 [1] NCCL INFO Channel 02 : 1[1] -> 2[2] via SHM/direct/direct
dc02-p21-t208-n012:7622:7664 [3] NCCL INFO Channel 00 : 3[3] -> 6[6] via SHM/direct/direct
dc02-p21-t208-n012:7622:7660 [7] NCCL INFO Channel 01 : 7[7] -> 2[2] via SHM/direct/direct
dc02-p21-t208-n012:7622:7662 [5] NCCL INFO Channel 00 : 5[5] -> 0[0] via SHM/direct/direct
dc02-p21-t208-n012:7622:7666 [1] NCCL INFO Channel 01 : 1[1] -> 4[4] via SHM/direct/direct
dc02-p21-t208-n012:7622:7664 [3] NCCL INFO Channel 02 : 3[3] -> 6[6] via SHM/direct/direct
dc02-p21-t208-n012:7622:7660 [7] NCCL INFO Channel 03 : 7[7] -> 2[2] via SHM/direct/direct
dc02-p21-t208-n012:7622:7662 [5] NCCL INFO Channel 02 : 5[5] -> 0[0] via SHM/direct/direct
dc02-p21-t208-n012:7622:7666 [1] NCCL INFO Channel 03 : 1[1] -> 4[4] via SHM/direct/direct
dc02-p21-t208-n012:7622:7664 [3] NCCL INFO Channel 01 : 3[3] -> 0[0] via SHM/direct/direct
dc02-p21-t208-n012:7622:7664 [3] NCCL INFO Channel 03 : 3[3] -> 0[0] via SHM/direct/direct
dc02-p21-t208-n012:7622:7660 [7] NCCL INFO Channel 00 : 7[7] -> 4[4] via SHM/direct/direct
dc02-p21-t208-n012:7622:7660 [7] NCCL INFO Channel 02 : 7[7] -> 4[4] via SHM/direct/direct
dc02-p21-t208-n012:7622:7665 [2] NCCL INFO Connected all rings
dc02-p21-t208-n012:7622:7661 [6] NCCL INFO Connected all rings
[Thread 0x7ffdf17fe000 (LWP 7665) exited]
[Thread 0x7ffe657fe000 (LWP 7661) exited]
dc02-p21-t208-n012:7622:7662 [5] NCCL INFO Connected all rings
dc02-p21-t208-n012:7622:7666 [1] NCCL INFO Connected all rings
[Thread 0x7ffe65fff000 (LWP 7662) exited]
dc02-p21-t208-n012:7622:7664 [3] NCCL INFO Connected all rings
[Thread 0x7ffdf0ffd000 (LWP 7666) exited]
dc02-p21-t208-n012:7622:7660 [7] NCCL INFO Connected all rings
dc02-p21-t208-n012:7622:7667 [0] NCCL INFO Connected all rings
dc02-p21-t208-n012:7622:7663 [4] NCCL INFO Connected all rings
[Thread 0x7ffe64ffd000 (LWP 7664) exited]
[Thread 0x7ffdf1fff000 (LWP 7660) exited]
[Thread 0x7ffde8f04000 (LWP 7667) exited]
[Thread 0x7ffdd3fff000 (LWP 7663) exited]
(stuck)
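
To make the log above easier to read, here is a minimal sketch that groups the `Channel ... via ...` lines by rank pair and transport. The function name `summarize_transports` and the regular expression are assumptions written for this log format, not part of NCCL or nccl-tests. On this log it would show P2P used only for the pairs 0->1, 2->3, 4->5 and 6->7, and SHM everywhere else.

```python
import re
from collections import defaultdict

# Matches both "Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer"
# and "Channel 01 : 5[5] -> 6[6] via SHM/direct/direct".
# Ring listings such as "Channel 00/04 :    0   1   2 ..." do not match.
CHANNEL_RE = re.compile(
    r"Channel (\d+)(?:/\d+)? : (\d+)\[\d+\] -> (\d+)\[\d+\] via (\w+)"
)


def summarize_transports(log_lines):
    """Map each (src_rank, dst_rank) pair to the set of transports seen for it."""
    pairs = defaultdict(set)
    for line in log_lines:
        match = CHANNEL_RE.search(line)
        if match:
            _channel, src, dst, transport = match.groups()
            pairs[(int(src), int(dst))].add(transport)
    return dict(pairs)


# A few lines copied from the log above, used as sample input.
SAMPLE = [
    "dc02-p21-t208-n012:7622:7636 [0] NCCL INFO Channel 00/04 :    0   1   2   3   6   7   4   5",
    "dc02-p21-t208-n012:7622:7667 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer",
    "dc02-p21-t208-n012:7622:7666 [1] NCCL INFO Channel 00 : 1[1] -> 2[2] via SHM/direct/direct",
    "dc02-p21-t208-n012:7622:7664 [3] NCCL INFO Channel 01 : 3[3] -> 0[0] via SHM/direct/direct",
]

if __name__ == "__main__":
    for (src, dst), transports in sorted(summarize_transports(SAMPLE).items()):
        print(f"{src} -> {dst}: {', '.join(sorted(transports))}")
```

To run it on a full log, pass the lines of the saved log file instead of `SAMPLE`.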

@sjeaugey
Member

sjeaugey commented Jan 6, 2025

Aside from disabling ACS on the PCI switches, I believe disabling VT-d in the BIOS or trying different IOMMU boot options has worked for some other users. But it is platform-specific, and I don't have direct experience with every platform type.
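
As a rough way to check the two settings mentioned here, the sketch below lists IOMMU-related kernel boot options and finds `ACSCtl` lines in `lspci -vvv` output that still have `SrcValid+`. The helper names `iommu_options` and `acs_enabled_lines`, and the file name `lspci.txt`, are hypothetical. The script assumes a Linux host and that the `lspci` output was saved beforehand, for example with `sudo lspci -vvv > lspci.txt`. It only reports the current state and changes nothing.

```python
from pathlib import Path


def iommu_options(cmdline):
    """Return kernel boot tokens that mention iommu (e.g. intel_iommu=on, iommu=pt)."""
    return [tok for tok in cmdline.split() if "iommu" in tok.lower()]


def acs_enabled_lines(lspci_text):
    """Return ACSCtl lines from `lspci -vvv` output where SrcValid is enabled."""
    return [
        line.strip()
        for line in lspci_text.splitlines()
        if "ACSCtl:" in line and "SrcValid+" in line
    ]


if __name__ == "__main__":
    cmdline_path = Path("/proc/cmdline")  # Linux only
    if cmdline_path.exists():
        opts = iommu_options(cmdline_path.read_text())
        print("IOMMU-related boot options:", opts or "none")

    lspci_path = Path("lspci.txt")  # hypothetical file holding saved lspci output
    if lspci_path.exists():
        enabled = acs_enabled_lines(lspci_path.read_text())
        print(f"{len(enabled)} ACSCtl line(s) with SrcValid+ (ACS likely still enabled)")
```

An empty option list does not prove the IOMMU is off, since the kernel or BIOS default still applies. Treat the output as a starting point for the platform-specific checks described above.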
