How to run NCCL tests in a VM without NVSwitch passthrough? #260

Open
joydchh opened this issue Oct 31, 2024 · 2 comments

joydchh commented Oct 31, 2024

Hi,
We are trying to run 4 VMs on a host with 8 H100 GPUs, with 2 GPUs per VM.
We found that the NVSwitches can only be passed through to a single VM, so the remaining VMs get none. VMs without an NVSwitch cannot run the NCCL tests. The error is shown below.
image
It then occurred to me that disabling NVLink might make NCCL fall back to a PCIe path, so I tried setting NCCL_P2P_DISABLE=1, but it still does not work.
image
Is there any way to make this work?
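
For anyone reproducing the attempt above, here is a minimal sketch of how such a run could be set up from Python. It is not the exact command used in this report: the binary path `./build/all_reduce_perf`, the message-size flags, and the helper name `build_nccl_test_run` are assumptions for illustration. `NCCL_DEBUG=INFO` is added only so that NCCL logs which transport it selects.

```python
import os
import shlex


def build_nccl_test_run(binary="./build/all_reduce_perf", num_gpus=2,
                        disable_p2p=True, base_env=None):
    """Return (env, argv) for a single-node nccl-tests run.

    Hypothetical helper: the binary path and sizes are assumptions,
    not values taken from the original report.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Ask NCCL to log topology and transport decisions.
    env["NCCL_DEBUG"] = "INFO"
    if disable_p2p:
        # Disable direct GPU-to-GPU (P2P) transport, as tried in the report.
        env["NCCL_P2P_DISABLE"] = "1"
    # -b/-e: min/max message size, -f: size multiplier, -g: GPUs per process.
    argv = [binary, "-b", "8", "-e", "128M", "-f", "2", "-g", str(num_gpus)]
    return env, argv


if __name__ == "__main__":
    env, argv = build_nccl_test_run(base_env={})
    prefix = " ".join(f"{k}={v}" for k, v in sorted(env.items()))
    # Print the equivalent shell command instead of running it.
    print(prefix, shlex.join(argv))
```

The sketch only builds and prints the command. Passing the returned `env` and `argv` to `subprocess.run` would launch the test inside the VM.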


joydchh commented Nov 5, 2024

Any insights on this?

@tryauuum

You have to run nvidia-fabricmanager on the hypervisor itself.

That is, pass only the GPUs through to your VMs. The NVSwitches stay attached to the hypervisor and bound to the NVIDIA driver. Then you can run nvidia-fabricmanager on the hypervisor and configure the partitions:
https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html#shared-nvswitch-virtualization-model
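
As a rough illustration of the first configuration step, the sketch below rewrites the `FABRIC_MODE` entry in the text of a Fabric Manager config file. It assumes, based on the linked guide, that `FABRIC_MODE=1` selects the shared NVSwitch multitenancy mode and that the config normally lives at `/usr/share/nvidia/nvswitch/fabricmanager.cfg`. Check both against your installed version. The helper name `set_fabric_mode` and the sample config text are hypothetical.

```python
import re


def set_fabric_mode(cfg_text: str, mode: int = 1) -> str:
    """Return cfg_text with FABRIC_MODE set to `mode`.

    Hypothetical helper. Assumes mode 1 means shared NVSwitch
    multitenancy, as described in the Fabric Manager user guide.
    """
    line = f"FABRIC_MODE={mode}"
    # Match an existing, uncommented FABRIC_MODE line.
    pattern = re.compile(r"^[ \t]*FABRIC_MODE[ \t]*=.*$", re.MULTILINE)
    if pattern.search(cfg_text):
        return pattern.sub(line, cfg_text, count=1)
    # No such line yet: append one, keeping the file newline-terminated.
    sep = "" if (not cfg_text or cfg_text.endswith("\n")) else "\n"
    return f"{cfg_text}{sep}{line}\n"


if __name__ == "__main__":
    # Illustrative config text, not a copy of a real file.
    sample = "LOG_LEVEL=4\nFABRIC_MODE=0\n"
    print(set_fabric_mode(sample), end="")
```

After writing the updated text back on the hypervisor, the nvidia-fabricmanager service needs a restart to pick up the change. Defining and activating the GPU partitions for each VM is a separate step, done through the Fabric Manager SDK described in the guide, and is not covered by this sketch.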
