How to run NCCL tests in a VM without NVSwitch passthrough? #260

joydchh · 2024-10-31T13:41:53Z

Hi,
We are trying to run 4 VMs on a host with 8 H100s, with 2 GPUs per VM.
We found that the NVSwitches can only be passed through to a single VM, so the remaining VMs get none. The VMs without an NVSwitch cannot run the NCCL tests. The error is shown below.

It then occurred to me that disabling NVLink might make NCCL fall back to a PCIe path, so I tried setting NCCL_P2P_DISABLE=1, but it still does not work.

Is there any way to make this work?
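
For anyone reproducing the NCCL_P2P_DISABLE=1 attempt described above, here is a minimal sketch of how the run could be set up with extra logging. It only builds and prints the command; it does not launch anything. The binary path `./build/all_reduce_perf` (from nccl-tests), the message-size flags, and the addition of `NCCL_DEBUG=INFO` are assumptions for illustration, not what was actually run in this report.

```python
import os
import shlex


def build_nccl_test_env(base_env=None):
    """Return a copy of the environment with P2P disabled and NCCL logging on."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_P2P_DISABLE"] = "1"  # the setting tried in this report
    env["NCCL_DEBUG"] = "INFO"     # assumption: added to see which transport NCCL picks
    return env


def build_nccl_test_cmd(binary="./build/all_reduce_perf", gpus=2):
    """Build an nccl-tests command line for a VM with `gpus` GPUs.

    The binary path and message sizes are illustrative assumptions.
    """
    return [binary, "-b", "8", "-e", "128M", "-f", "2", "-g", str(gpus)]


if __name__ == "__main__":
    env = build_nccl_test_env()
    cmd = build_nccl_test_cmd()
    # Print the equivalent shell invocation instead of running it.
    print("NCCL_P2P_DISABLE=%s NCCL_DEBUG=%s %s"
          % (env["NCCL_P2P_DISABLE"], env["NCCL_DEBUG"], shlex.join(cmd)))
```

To actually launch the test, the same `cmd` and `env` could be passed to `subprocess.run(cmd, env=env)` inside the VM.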

joydchh · 2024-11-05T00:24:33Z

Any insights on this?

tryauuum · 2024-12-18T11:10:31Z

You have to run nvidia-fabricmanager on the hypervisor itself.

So you only pass the GPUs through to your VMs; the NVSwitches stay attached to the hypervisor and bound to the nvidia driver. Then you can run nvidia-fabricmanager and configure the partitions:
https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html#shared-nvswitch-virtualization-model
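
As a rough sketch of the hypervisor-side configuration step, the helper below switches the Fabric Manager config text to the shared NVSwitch mode. It assumes the mode is selected with a `FABRIC_MODE` key and that `1` means shared NVSwitch multitenancy; the default config path mentioned in the comment is also an assumption. Check all three against the guide linked above for your Fabric Manager version. The function name `set_fabric_mode` is hypothetical.

```python
import re


def set_fabric_mode(cfg_text, mode=1):
    """Return cfg_text with FABRIC_MODE set to `mode`.

    Assumption: FABRIC_MODE=1 selects the shared NVSwitch virtualization
    model. Commented-out lines (starting with '#') are left untouched.
    """
    line = "FABRIC_MODE=%d" % mode
    pattern = re.compile(r"^[ \t]*FABRIC_MODE[ \t]*=.*$", re.MULTILINE)
    if pattern.search(cfg_text):
        # Replace the first active FABRIC_MODE line in place.
        return pattern.sub(line, cfg_text, count=1)
    # No active setting found: append one, keeping the file newline-terminated.
    sep = "" if cfg_text.endswith("\n") or not cfg_text else "\n"
    return cfg_text + sep + line + "\n"


if __name__ == "__main__":
    # Assumed default location on the hypervisor:
    #   /usr/share/nvidia/nvswitch/fabricmanager.cfg
    sample = "LOG_LEVEL=4\nFABRIC_MODE=0\n"
    print(set_fabric_mode(sample), end="")
```

After writing the updated text back to the config file, the nvidia-fabricmanager service on the hypervisor would need a restart for the change to take effect. Partitions are then activated through the Fabric Manager interfaces described in the guide before the corresponding VM is started.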
