The special topology causes the NCCL test to fail #273
Comments
Maybe ACS is enabled on the one getting stuck?
@sjeaugey I used command
Can you confirm whether it works when setting
@sjeaugey This works! So will turning off P2P affect performance?
Yes, it will, but at least we confirmed that the hang comes from GPU Direct P2P being broken on your system, i.e. the GPUs can't talk directly to each other over PCI. That typically comes from ACS being enabled, but maybe in your case it's something else.
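To check whether ACS is in fact enabled on any PCI bridge or switch, one option is to scan the output of `lspci -vvv` for `ACSCtl` lines that show `SrcValid+`. The following is a minimal sketch, not something posted in this thread: the function names `devices_with_acs_enabled` and `main` are hypothetical, and it assumes `lspci` is installed and run with enough privileges to read the ACS capability.

```python
import re
import subprocess

# A device header line in `lspci` output starts with a bus address such as
# "3b:00.0", optionally prefixed by a PCI domain ("0000:3b:00.0").
_DEVICE_RE = re.compile(
    r"^((?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s"
)


def devices_with_acs_enabled(lspci_text):
    """Return the bus addresses whose ACSCtl line contains 'SrcValid+'.

    Hypothetical helper: it only inspects the text of `lspci -vvv`.
    """
    enabled = []
    current = None
    for line in lspci_text.splitlines():
        header = _DEVICE_RE.match(line)
        if header:
            # Start of a new device section.
            current = header.group(1)
            continue
        stripped = line.strip()
        if current and stripped.startswith("ACSCtl:") and "SrcValid+" in stripped:
            enabled.append(current)
    return enabled


def main():
    # Assumption: reading ACS capabilities usually requires root.
    result = subprocess.run(
        ["lspci", "-vvv"], capture_output=True, text=True, check=True
    )
    for address in devices_with_acs_enabled(result.stdout):
        print(f"ACS appears enabled on {address}")
```

Calling `main()` as root prints one line per device that reports `SrcValid+`. An empty result does not prove P2P works; it only rules out this one common cause.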
Aside from disabling ACS on the PCI switches, I believe disabling VT-d in the BIOS or trying different boot options for
We have two L40S topologies. The first passes the test, but the second does not.
NCCL version: 2.23.4+cuda12.4
test pass
test failed