[Bug] A bug in the initialize_ub function. #1170

wangzihe1996 · 2024-09-09T07:14:23Z

I see that TransformerEngine now supports tensor-parallel (TP) communication overlap without depending on MPI, so I tried to use TP communication overlap when launching with torchrun. In the process, I found a bug in the initialize_ub function.

I ran my code and got the following error.

[rank1]:   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/base.py", line 149, in initialize_ub
[rank1]:     raise OSError(f"Invalid network interface: {ifname}") from err
[rank1]: OSError: Invalid network interface: eth
[rank7]: Traceback (most recent call last):
[rank7]:   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/base.py", line 144, in initialize_ub
[rank7]:     fcntl.ioctl(
[rank7]: OSError: [Errno 19] No such device

I found the code as follows.

TransformerEngine/transformer_engine/pytorch/module/base.py

Lines 130 to 149 in bdea56f

# Construct an intra-node communicator based on global ranks that share the same hostname
# NOTE: If the user specified a valid network interface for NCCL or GLOO, use the host
# address on that interface instead of the hostname. This can help avoid issues when
# different hosts have the same hostname on Kubernetes clusters.
hostname = socket.gethostname()
ifname = os.getenv(
    "NVTE_UB_SOCKET_IFNAME",
    os.getenv("NCCL_SOCKET_IFNAME", os.getenv("GLOO_SOCKET_IFNAME")),
)
if ifname is not None:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        hostname = socket.inet_ntoa(
            fcntl.ioctl(
                s.fileno(), 0x8915, struct.pack("256s", ifname[:15].encode("UTF-8"))
            )[20:24]
        )
    except OSError as err:
        raise OSError(f"Invalid network interface: {ifname}") from err

If ifname is not None, the function uses the IP address of that interface as the hostname. ifname is taken from the environment variables NVTE_UB_SOCKET_IFNAME, NCCL_SOCKET_IFNAME, and GLOO_SOCKET_IFNAME, in that order of precedence.

In my environment, NCCL_SOCKET_IFNAME is set to eth. The machine has several network interfaces named eth0, eth1, eth2, and so on, but none named exactly eth. When I use ifname = eth0, the code above runs successfully.

I checked the NCCL documentation on environment variables. It says that when NCCL_SOCKET_IFNAME is eth, NCCL uses all interfaces whose names start with eth, e.g. eth0, eth1.

So I think the code needs another method, such as psutil or netifaces, to look up the matching interface names when NCCL_SOCKET_IFNAME is a prefix like eth.
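To make the prefix behaviour concrete, here is a minimal sketch of how an NCCL_SOCKET_IFNAME-style value could be expanded into concrete interface names. The function name match_interfaces is hypothetical and is not part of TransformerEngine or NCCL; the rules it applies (comma-separated prefixes, a leading ^ to exclude, a leading = for exact names) are a simplified reading of the NCCL documentation, so treat it as an illustration rather than NCCL's actual matching code.

```python
def match_interfaces(spec, names):
    """Return the interface names selected by an NCCL_SOCKET_IFNAME-style spec.

    Hypothetical helper: a simplified reading of NCCL's documented rules.
      "eth"        -> names starting with "eth"
      "=eth0"      -> the name "eth0" exactly
      "^docker,lo" -> every name NOT starting with "docker" or "lo"
    """
    exclude = spec.startswith("^")
    if exclude:
        spec = spec[1:]
    exact = spec.startswith("=")
    if exact:
        spec = spec[1:]
    patterns = [p for p in spec.split(",") if p]

    def hit(name):
        if exact:
            return name in patterns
        return any(name.startswith(p) for p in patterns)

    # Keep matching names normally, or non-matching names when excluding.
    return [n for n in names if hit(n) != exclude]
```

On a real machine the names argument could come from the standard library, for example [name for _, name in socket.if_nameindex()], which avoids the extra dependency on psutil or netifaces. Note that a prefix can match several interfaces, and this sketch does not decide which of them to use.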

This is my test code:

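import fcntl
import os
import socket
import struct

# NCCL_SOCKET_IFNAME is "eth" in my environment, which fails below,
# so override it with a concrete interface name for this test.
ifname = os.getenv("NCCL_SOCKET_IFNAME")
ifname = 'eth0'
print('ifname:', ifname)
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
print('fileno:', s.fileno())
# 0x8915 is SIOCGIFADDR: query the IPv4 address bound to the interface
inet = fcntl.ioctl(s.fileno(), 0x8915, struct.pack("256s", ifname[:15].encode("UTF-8")))
print('inet:', inet)
hostname = socket.inet_ntoa(inet[20:24])
print('hostname:', hostname)
s.close()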


ptrendx · 2024-09-10T18:12:30Z

@denera Could you take a look at it?

denera · 2024-09-10T18:47:24Z

@wangzihe1996 It's easy to get a list of network interfaces from socket.if_nameindex(), but cluster configurations vary significantly across the user base and there is no way for us to know which of those interfaces is the right one to use when trying to detect ranks on the same physical node.

That's why we set up the NVTE_UB_SOCKET_IFNAME variable to override both NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME. You can simply set NVTE_UB_SOCKET_IFNAME=eth0 to run your code without modifying anything in TE, and it will use ifname = 'eth0' to fetch hostname information on every rank.

That said, it would still be better if TE checked ifname against socket.if_nameindex() and safely fell back to socket.gethostname() with a useful warning message instead of just erroring out. At least that way we can point the user toward manually setting the network interface via NVTE_UB_SOCKET_IFNAME. I will file a PR for this shortly.
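The check-and-fall-back idea described above could look roughly like the sketch below. This is not the code from the resulting PR; node_identifier, interface_ipv4, and the available parameter are hypothetical names introduced for illustration, and the ioctl call is Linux-only, the same as in the quoted base.py snippet.

```python
import fcntl
import socket
import struct
import warnings

SIOCGIFADDR = 0x8915  # Linux ioctl request: get the IPv4 address of an interface


def interface_ipv4(ifname):
    """Return the IPv4 address bound to ifname (raises OSError if there is none)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        packed = fcntl.ioctl(
            s.fileno(), SIOCGIFADDR, struct.pack("256s", ifname[:15].encode("UTF-8"))
        )
    return socket.inet_ntoa(packed[20:24])


def node_identifier(ifname, available=None):
    """Hypothetical helper: interface address if ifname is usable, else the hostname."""
    if ifname is None:
        return socket.gethostname()
    if available is None:
        available = [name for _, name in socket.if_nameindex()]
    if ifname in available:
        try:
            return interface_ipv4(ifname)
        except OSError:
            pass  # interface exists but has no IPv4 address; fall through
    warnings.warn(
        f"'{ifname}' is not a usable network interface (found: {available}). "
        "Falling back to the hostname; set NVTE_UB_SOCKET_IFNAME to an exact "
        "interface name to override."
    )
    return socket.gethostname()
```

With NCCL_SOCKET_IFNAME=eth and interfaces eth0 and eth1 present, this sketch would warn and return the hostname rather than raise, which keeps the job running while pointing the user at NVTE_UB_SOCKET_IFNAME.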

wangzihe1996 · 2024-09-12T03:10:04Z

@denera I'm glad you have made these changes, which give more information to developers like me. I have another question: which interface should I choose if I have many RDMA network cards? Assume they are named eth0, eth1, ..., eth7 and each has its own IPv4 address.

denera · 2024-09-17T15:36:42Z

@wangzihe1996 The correct network interface is the one that returns the same hostname on processes/ranks that map to GPUs on the same physical node.

We can't make any general assumptions or guesses about this in TE, but in your specific case it looks like each node connects to different groups of nodes via different RDMA network interfaces. If that's true, then any one of the eth interfaces will likely return the correct hostname information and it won't matter which you select.
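One way to sanity-check a chosen interface against the criterion above is to collect the identifier each rank resolves and confirm that ranks group into nodes of the expected size. The sketch below works on a plain list indexed by global rank; both function names and the gpus_per_node parameter are hypothetical, and in a real job the list would have to be gathered across ranks first (for example with a collective in torch.distributed), which is left out here.

```python
from collections import defaultdict


def group_ranks_by_node(identifiers):
    """Map each identifier (hostname or IP) to the global ranks that reported it.

    identifiers[i] is whatever rank i resolved for its chosen interface.
    """
    groups = defaultdict(list)
    for rank, ident in enumerate(identifiers):
        groups[ident].append(rank)
    return dict(groups)


def looks_consistent(identifiers, gpus_per_node):
    """True if every identifier is shared by exactly gpus_per_node ranks."""
    groups = group_ranks_by_node(identifiers)
    return all(len(ranks) == gpus_per_node for ranks in groups.values())
```

If two physical nodes report the same identifier, or ranks on one node report different ones, the group sizes will not match gpus_per_node, which suggests that a different interface (or the plain hostname) should be used.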

wangzihe1996 · 2024-09-18T12:04:07Z

@denera Thank you for your reply. I understand this in more detail now.

ptrendx assigned denera Sep 10, 2024

denera mentioned this issue Sep 10, 2024

[PyTorch] Check network interface name when initializing Userbuffers #1175

Merged

13 tasks

wangzihe1996 closed this as completed Sep 18, 2024
