[Bug] A bug in the initialize_ub function. #1170

Closed
wangzihe1996 opened this issue Sep 9, 2024 · 5 comments · Fixed by #1175

wangzihe1996 commented Sep 9, 2024

I noticed that TransformerEngine now supports tensor-parallel (TP) communication overlap without an MPI dependency, so I tried to use it with torchrun. While doing so, I found a bug in the initialize_ub function.

Running my code produced the following error:

[rank1]:   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/base.py", line 149, in initialize_ub
[rank1]:     raise OSError(f"Invalid network interface: {ifname}") from err
[rank1]: OSError: Invalid network interface: eth
[rank7]: Traceback (most recent call last):
[rank7]:   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/base.py", line 144, in initialize_ub
[rank7]:     fcntl.ioctl(
[rank7]: OSError: [Errno 19] No such device

The relevant code is:

# Construct an intra-node communicator based on global ranks that share the same hostname
# NOTE: If the user specified a valid network interface for NCCL or GLOO, use the host
#       address on that interface instead of the hostname. This can help avoid issues when
#       different hosts have the same hostname on Kubernetes clusters.
hostname = socket.gethostname()
ifname = os.getenv(
    "NVTE_UB_SOCKET_IFNAME",
    os.getenv("NCCL_SOCKET_IFNAME", os.getenv("GLOO_SOCKET_IFNAME")),
)
if ifname is not None:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        hostname = socket.inet_ntoa(
            fcntl.ioctl(
                s.fileno(), 0x8915, struct.pack("256s", ifname[:15].encode("UTF-8"))
            )[20:24]
        )
    except OSError as err:
        raise OSError(f"Invalid network interface: {ifname}") from err

If ifname is not None, the function uses the IPv4 address of that interface in place of the hostname. The value of ifname comes from the environment variables NVTE_UB_SOCKET_IFNAME, NCCL_SOCKET_IFNAME, and GLOO_SOCKET_IFNAME, checked in that order.
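The lookup order of the nested os.getenv calls can be made explicit with a small sketch. The helper name pick_ifname and its env parameter are hypothetical and exist only for illustration; they are not part of TransformerEngine.

```python
import os

# Order in which the quoted snippet consults the environment.
IFNAME_VARS = ("NVTE_UB_SOCKET_IFNAME", "NCCL_SOCKET_IFNAME", "GLOO_SOCKET_IFNAME")


def pick_ifname(env=None):
    """Return (variable, value) for the first variable that is set, else (None, None).

    Hypothetical helper: it mirrors the precedence of the nested os.getenv calls,
    where a variable counts as set even if its value is an empty string.
    """
    env = os.environ if env is None else env
    for var in IFNAME_VARS:
        value = env.get(var)
        if value is not None:
            return var, value
    return None, None


if __name__ == "__main__":
    print(pick_ifname())
```

In the environment described here only NCCL_SOCKET_IFNAME is set, so the value eth is what reaches the ioctl call.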

In my environment, NCCL_SOCKET_IFNAME is set to eth. The machine has several network interfaces named eth0, eth1, eth2, and so on, but none named eth. When I use ifname = 'eth0' instead, the code above runs successfully.

I checked the NCCL documentation on environment variables. It says that when NCCL_SOCKET_IFNAME is eth, NCCL uses all interfaces whose names start with eth, e.g. eth0 and eth1.

So I think the code needs another method, such as psutil or netifaces, to look up the actual interface names when NCCL_SOCKET_IFNAME is a prefix like eth.
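As a rough illustration of the prefix behaviour, the sketch below expands an NCCL-style filter against a list of interface names using only the standard library. The function match_interfaces is hypothetical, and its handling of the `=` (exact match) and `^` (exclude) prefixes is my reading of the NCCL documentation, not NCCL's own implementation.

```python
import socket


def match_interfaces(spec, names):
    """Return the names selected by an NCCL-style filter string (hypothetical helper).

    Assumed semantics: "eth" selects names starting with "eth", "=eth0" selects
    exactly "eth0", a leading "^" inverts the selection, and commas separate
    multiple patterns.
    """
    exclude = spec.startswith("^")
    if exclude:
        spec = spec[1:]
    exact = spec.startswith("=")
    if exact:
        spec = spec[1:]
    patterns = [p for p in spec.split(",") if p]

    def hit(name):
        if exact:
            return name in patterns
        return any(name.startswith(p) for p in patterns)

    # Keep a name when it matches (normal mode) or when it does not (exclude mode).
    return [n for n in names if hit(n) != exclude]


if __name__ == "__main__":
    # socket.if_nameindex() returns (index, name) pairs for the local interfaces.
    local_names = [name for _, name in socket.if_nameindex()]
    print(match_interfaces("eth", local_names))
```

Expanding the prefix only yields candidates. It does not say which of the matching interfaces identifies the physical node.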

This is my test code:

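import fcntl
import os
import socket
import struct

# NCCL_SOCKET_IFNAME is "eth" here, which is a prefix rather than a real interface name.
ifname = os.getenv("NCCL_SOCKET_IFNAME")
# Override with a concrete interface so the ioctl below succeeds.
ifname = 'eth0'
print('ifname:', ifname)
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
print('fileno:', s.fileno())
# 0x8915 is SIOCGIFADDR on Linux: fetch the IPv4 address bound to the interface.
inet = fcntl.ioctl(s.fileno(), 0x8915, struct.pack("256s", ifname[:15].encode("UTF-8")))
print('inet:', inet)
# Bytes 20..23 of the returned ifreq struct hold the packed IPv4 address.
hostname = socket.inet_ntoa(inet[20:24])
print('hostname:', hostname)
s.close()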

ptrendx (Member) commented Sep 10, 2024

@denera Could you take a look at it?

denera (Collaborator) commented Sep 10, 2024

@wangzihe1996 It's easy to get a list of network interfaces from socket.if_nameindex(), but cluster configurations vary significantly across the user base and there is no way for us to know which of those interfaces is the right one to use when trying to detect ranks on the same physical node.

That's why we set up the NVTE_UB_SOCKET_IFNAME variable to override both NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME. You can simply set NVTE_UB_SOCKET_IFNAME=eth0 to run your code without modifying anything in TE, and it will use ifname = 'eth0' to fetch hostname information on every rank.

That said, it would still be better if TE checked ifname against socket.if_nameindex() and safely fell back to socket.gethostname() with a useful warning message instead of just erroring out. At least that way we can point the user toward manually setting the network interface via NVTE_UB_SOCKET_IFNAME. I will file a PR for this shortly.
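A minimal sketch of that fallback idea follows. It is not the actual patch from the linked PR: the function name node_identifier and the warning texts are hypothetical, and the ioctl call is Linux-only.

```python
import fcntl
import socket
import struct
import warnings

SIOCGIFADDR = 0x8915  # Linux ioctl request: IPv4 address of a network interface


def node_identifier(ifname=None):
    """Return the interface's IPv4 address, or the hostname if that is not possible.

    Hypothetical helper illustrating the "validate, warn, fall back" behaviour.
    """
    hostname = socket.gethostname()
    if ifname is None:
        return hostname

    # Validate the name against the interfaces that actually exist on this host.
    known = [name for _, name in socket.if_nameindex()]
    if ifname not in known:
        warnings.warn(
            f"'{ifname}' is not a network interface on this host (found: {known}); "
            "falling back to the hostname. Set NVTE_UB_SOCKET_IFNAME to pick one."
        )
        return hostname

    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        try:
            request = struct.pack("256s", ifname[:15].encode("UTF-8"))
            reply = fcntl.ioctl(s.fileno(), SIOCGIFADDR, request)
        except OSError:
            # The interface exists but has no IPv4 address assigned.
            warnings.warn(f"Could not read an IPv4 address from '{ifname}'; using the hostname.")
            return hostname
    return socket.inet_ntoa(reply[20:24])


if __name__ == "__main__":
    print(node_identifier("eth"))
```

With NCCL_SOCKET_IFNAME=eth this would emit a warning and return the hostname rather than raising OSError.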

wangzihe1996 (Author) commented

@denera Thanks for making these changes, which give more information to developers like me. I have another question: which interface should I choose if I have many RDMA network cards? Assume they are named eth0, eth1, ..., eth7 and each has its own IPv4 address.

denera (Collaborator) commented Sep 17, 2024

@wangzihe1996 The correct network interface is the one that returns the same hostname on processes/ranks that map to GPUs on the same physical node.

We can't make any general assumptions or guesses about this in TE, but in your specific case it looks like each node connects to different groups of nodes via different RDMA network interfaces. If that's true, then any one of the eth interfaces will likely return the correct hostname information, and it won't matter which one you select.
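One way to sanity-check a chosen interface is to group ranks by the identifier each one reports and compare the result with the expected node layout. The sketch below assumes the per-rank identifiers have already been collected into a list indexed by rank (for example with torch.distributed.all_gather_object). The function intra_node_groups is hypothetical.

```python
from collections import defaultdict


def intra_node_groups(ids_by_rank):
    """Group global ranks by the node identifier each rank reported.

    ids_by_rank[r] is the hostname or IP string obtained on rank r.
    """
    groups = defaultdict(list)
    for rank, node_id in enumerate(ids_by_rank):
        groups[node_id].append(rank)
    return sorted(groups.values())


if __name__ == "__main__":
    # Made-up identifiers for two nodes with two ranks each.
    reported = ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.2"]
    print(intra_node_groups(reported))
```

If the number of groups or their sizes do not match the number of nodes and GPUs per node, the selected interface is probably not a good node identifier.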

wangzihe1996 (Author) commented

@denera Thank you for your reply. I understand this in more detail now.
