
Intel Gaudi bootc container is missing InfiniBand #483

Open
tiran opened this issue May 7, 2024 · 7 comments

@tiran
Contributor

tiran commented May 7, 2024

The container file https://github.com/containers/ai-lab-recipes/blob/main/training/intel-bootc/Containerfile does not contain the necessary bits and pieces to set up InfiniBand for Intel Gaudi devices. Without IB, it is not possible to use Intel Gaudi 2 cards for training. I'm not entirely sure whether this affects only servers with multiple Intel Gaudi 2 cards or also servers with a single card. My test systems have 8 Intel Gaudi 2 cards each.

The server with the bootc image did not have the habanalabs_ib module loaded. Manually loading the module doesn't make a difference: rdma dev does not show any hlib (Habana Labs InfiniBand) devices. Without the devices, PyTorch and Habana's PyTorch plugin habana_frameworks fail to initialize them: hcl_ibverbs_t::init failed to find matching Habana IB device (hlib_6)

On the other bare-metal server, with regular RHEL 9 and Habana's packages, rdma dev shows a node for each card and rdma link shows over 160 active connections with LINK_UP.

# rdma dev
0: mlx5_0: node_type ca fw 20.32.2004 node_guid b83f:d203:004a:5242 sys_image_guid b83f:d203:004a:5242 
1: mlx5_1: node_type ca fw 20.32.2004 node_guid b83f:d203:004a:522e sys_image_guid b83f:d203:004a:522e 
18: hlib_4: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000 
19: hlib_6: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000 
20: hlib_2: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000 
21: hlib_7: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000 
22: hlib_1: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000 
23: hlib_3: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000 
24: hlib_5: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000 
25: hlib_0: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000
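A quick way to compare a working and a broken system is to scan the rdma dev output above for Habana devices. A minimal sketch, assuming the line format shown above (device index, name with trailing colon, then attributes); the function name is mine, not part of any tool:

```python
def hlib_devices(rdma_dev_output: str) -> list[str]:
    """Return the names of all hlib_* devices listed by `rdma dev`.

    Lines are expected to look like:
    "18: hlib_4: node_type unspecified fw 49.0.0 node_guid ..."
    """
    devices = []
    for line in rdma_dev_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1].startswith("hlib_"):
            devices.append(parts[1].rstrip(":"))
    return devices
```

On the working RHEL 9 server this returns one entry per Gaudi 2 card (eight here); on the broken bootc image the list is empty.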

The journal shows that the habanalabs kernel module registers a Habana Labs InfiniBand device for each card:

# journalctl -o cat | grep hlib
habanalabs 0000:db:00.0 hlib_4: IB device registered
habanalabs 0000:bb:00.0 hlib_2: IB device registered
habanalabs 0000:19:00.0 hlib_6: IB device registered
habanalabs 0000:5d:00.0 hlib_7: IB device registered
habanalabs 0000:9b:00.0 hlib_1: IB device registered
habanalabs 0000:cb:00.0 hlib_3: IB device registered
habanalabs 0000:4c:00.0 hlib_5: IB device registered
habanalabs 0000:3b:00.0 hlib_0: IB device registered
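To verify that every card got an IB device registered, the journal lines above can be counted and compared against the number of installed cards. A small sketch, assuming the "hlib_N: IB device registered" message format shown above:

```python
def registered_hlib_count(journal: str) -> int:
    """Count 'hlib_N: IB device registered' messages emitted by the
    habanalabs kernel module, one per successfully registered card."""
    return sum(
        1
        for line in journal.splitlines()
        if "hlib_" in line and "IB device registered" in line
    )
```

On this 8-card system the count should be 8; a lower number points at cards whose IB device never registered.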

According to https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html, we also need the habanalabs-thunk and habanalabs-rdma-core packages in the bootc image.
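A possible Containerfile fragment for that; this is only a sketch, the package names come from the Habana bare-metal installation guide linked above, while the availability of a configured Habana repository in the build context is an assumption:

```dockerfile
# Sketch: assumes the Habana Vault dnf repository is already configured
# in the image. Package names per the Habana bare-metal install guide.
RUN dnf install -y habanalabs-thunk habanalabs-rdma-core && \
    dnf clean all
```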

Reproducer:

#!/usr/bin/env python3
import os

# Enable verbose HCL logging on the console; these must be set before
# importing habana_frameworks, which reads them at import/init time.
os.environ["ENABLE_CONSOLE"] = "true"
os.environ["LOG_LEVEL_ALL"] = "1"

from habana_frameworks.torch import hpu

hpu.init()
[19:18:28.773286][HCL       ][info ][tid:55][C:host[f6fa5532e33d] device[0]] hcl::IntermediateBufferContainer::IntermediateBufferContainer Allocated device memory. Address: 0x100160016d800000, Size: 168MB
[19:18:28.774092][HCL_IBV   ][info ][tid:55][C:host[f6fa5532e33d] device[0]] loaded: /opt/habanalabs/rdma-core/src/build/lib/libhlib.so
[19:18:28.774101][HCL_IBV   ][info ][tid:55][C:host[f6fa5532e33d] device[0]] loaded: /opt/habanalabs/rdma-core/src/build/lib/libibverbshl.so.1
[19:18:28.775252][HCL_IBV   ][error][tid:55][C:host[f6fa5532e33d] device[0]] hcl_ibverbs_t::init failed to find matching Habana IB device (hlib_6)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi2/hcl_device.cpp::68(HclDeviceGaudi2): The condition [ hcclSuccess == ret ] failed. ibverb init returned 6[hcclInternalError] 
[19:18:28.775294][HCL       ][critical][tid:55][C:host[f6fa5532e33d] device[0]] HclDeviceGaudi2: The condition [ hcclSuccess == ret ] failed. ibverb init returned 6[hcclInternalError]
[19:18:28.775409][HCL       ][info ][tid:55][C:host[f6fa5532e33d] device[0]] ----------------- interface counters (for 0 interfaces) -------------
[19:18:28.775435][SYN_DEVICE    ][error][tid:55][C:host[f6fa5532e33d] device[0]] _acquire: Failed to initialize HCCL device for device
@rhatdan
Member

rhatdan commented May 21, 2024

Care to open a PR to fix?

@tiran
Contributor Author

tiran commented May 21, 2024

@enriquebelarte solved the issue in ec05b07

Enrique, is there anything left to do here?

@enriquebelarte
Collaborator

Could not test upstream because the build checks failed due to the kernel version in the runners. We need to find a solution for that, but apart from it, the fix adds the packages that made the bootc image work with Gaudi 2 hardware, so I guess this issue can be closed now.

@braultatgithub
Contributor

@tiran, @enriquebelarte, assuming the fix has been identified and proven to work repeatably (adding and loading the habanalabs_ib module as part of the bootc image), should we close the issue?

@enriquebelarte
Collaborator

@braultatgithub The Containerfile has changed significantly since the fix was submitted. I haven't tested the new Containerfile, but it seems to be doing the correct thing: it extracts the InfiniBand module from the RPM and loads it.

@tiran
Contributor Author

tiran commented Jun 26, 2024

We need to test this on a real server. RDMA may also need a custom libfabric built with SynapseAI support, habanalabs-rdma-core, habanalabs-thunk, and Habana's hccl_ofi_wrapper library.

@Feelas

Feelas commented Jun 26, 2024

@tiran @enriquebelarte I stumbled onto this ticket by accident while diagnosing a different habanalabs_ib issue myself. "Manually loading the module doesn't make a difference." rings very close to home, so I decided to share my two cents.

I've seen a scenario where habanalabs, habanalabs_en, and habanalabs_cn were correctly loaded on our system while habanalabs_ib failed to load because ib_uverbs (a dependency) was not loaded automatically when loading habanalabs_ib (for whatever reason). You should be able to observe that in dmesg on the failing system by looking for "habanalabs_ib" lines, specifically "Unknown symbol" lines.
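That dmesg check can be scripted. A minimal sketch, assuming the standard kernel "Unknown symbol" message format when a module's dependency is missing; the helper name and the sample symbol in the usage note are mine:

```python
def missing_symbols(dmesg: str) -> list[str]:
    """Return dmesg lines where habanalabs_ib reports an unknown symbol,
    which typically means a dependency (e.g. ib_uverbs) was not loaded
    before habanalabs_ib."""
    return [
        line
        for line in dmesg.splitlines()
        if "habanalabs_ib" in line and "Unknown symbol" in line
    ]
```

If this returns anything, loading the missing dependency first (e.g. modprobe ib_uverbs before modprobe habanalabs_ib) is the thing to try.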
