wrong global rank when trying multi-nodes #6454
-
Hello, I'm trying to set things up like this document but also got a problem, like below. In node 1 I set:

```python
os.environ["MASTER_ADDR"] = master_addr
os.environ["MASTER_PORT"] = master_port
os.environ["WORLD_SIZE"] = "16"
os.environ["NODE_RANK"] = rank
```

same as in node 2, and got stuck with a repeated message like below.

I'd appreciate any help. Thanks in advance.
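For context, a minimal sketch of what such a per-node setup commonly looks like; the concrete address, port, and rank values below are my own placeholders, not taken from the post above:

```python
# Hypothetical per-node setup for 2 nodes x 8 GPUs (WORLD_SIZE = 16); values are placeholders.
import os

os.environ["MASTER_ADDR"] = "10.0.0.1"  # reachable IP of the node chosen as master (same on both nodes)
os.environ["MASTER_PORT"] = "4500"      # a free port, identical on both nodes
os.environ["WORLD_SIZE"] = "16"         # total number of processes across all nodes
os.environ["NODE_RANK"] = "0"           # node index: "0" on node 1, "1" on node 2 (not the global rank)
```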
-
I believe the docs might be outdated; have you tried changing it? cc: @awaelchli. We should provide better error messages for this.
-
Hi @carmocca, I changed it and tried, but it does not work.
-
What have you set for `MASTER_ADDR` and `MASTER_PORT`? These have to reference one of the two machines you are using. For example, if I have two nodes like this:

IP: 512.124.134.4
IP: 512.124.136.8

and I want 512.124.134.4 to be my master node, then for both my machines I'd need to run something like:

```
MASTER_ADDR=512.124.134.4 MASTER_PORT=4500 python train.py
```

Let me know if this helps! On top of this, we should update the doc if this does work :)
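To make the per-machine difference explicit, here is a sketch of the two launch commands; adding `NODE_RANK` follows the usual Lightning multi-node convention and is my assumption, not something stated above:

```sh
# Hypothetical launch, assuming 512.124.134.4 is the master node.
# MASTER_ADDR and MASTER_PORT are identical on both machines; only NODE_RANK differs.

# On node 1 (512.124.134.4):
MASTER_ADDR=512.124.134.4 MASTER_PORT=4500 NODE_RANK=0 python train.py

# On node 2 (512.124.136.8):
MASTER_ADDR=512.124.134.4 MASTER_PORT=4500 NODE_RANK=1 python train.py
```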
-
@SeanNaren @awaelchli I tried defining my own cluster environment:

```python
from pytorch_lightning.plugins.environments.cluster_environment import ClusterEnvironment

class MyCluster(ClusterEnvironment):
    def master_address(self):
        return MY_ADDR

    def master_port(self):
        return MY_PORT

    def world_size(self):
        return MY_WORLD_SIZE
```

and in main:

```python
# case1
cluster = [MyCluster()]
trainer = Trainer(args, cluster_environment='ddp')

# case2
cluster = [MyCluster()]
trainer = Trainer(args, accelators='ddp', plugins=[cluster])
```

but got an error like this. How can I set up my own cluster and use it with ddp? Thanks in advance.
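For what it's worth, a minimal sketch of how a custom environment is usually handed to the Trainer; the parameter names follow recent PyTorch Lightning versions and should be treated as assumptions rather than what the thread's Lightning version accepted:

```python
from pytorch_lightning import Trainer

# Sketch only: the environment instance goes into `plugins`; the distributed mode is chosen
# separately via `strategy` (newer Lightning) or `accelerator='ddp'` (older releases).
# Note that MyCluster would still need every abstract member of ClusterEnvironment implemented.
cluster = MyCluster()
trainer = Trainer(
    num_nodes=2,
    devices=8,
    strategy="ddp",
    plugins=[cluster],  # the plugin object itself, not a string
)
```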
-
Thanks @jungwhank, but I'm getting an init error:

```python
# Initialize a trainer
trainer = pl.Trainer(logger=logger,
                     callbacks=[checkpoint_callback],
                     max_epochs=hparams["epochs"],
                     devices=devices,
                     plugins=CustomPlugin(),
                     accelerator=accelerator,
                     strategy=train_strategy)
```

```
  cluster_environment=CustomEnvironment(),
TypeError: Can't instantiate abstract class CustomEnvironment with abstract methods creates_processes_externally, detect, global_rank, main_address, main_port, set_global_rank, set_world_size
```
-
I've tried to implement the missing abstract methods as specified above:

```python
import logging
import os

from pytorch_lightning.plugins.environments.cluster_environment import ClusterEnvironment

log = logging.getLogger(__name__)


class CustomEnvironment(ClusterEnvironment):
    def __init__(self, num_nodes=2):
        super().__init__()
        self._num_nodes = num_nodes
        self._master_port = None
        self._world_size = None
        self._global_rank = None

    def creates_processes_externally(self):
        # Assuming PyTorch Lightning manages processes internally
        return False

    def detect(self):
        # Implement detection of nodes and processes if necessary
        log.debug("Detect method is called.")

    def global_rank(self):
        if self._global_rank is None:
            self._global_rank = int(os.getenv("RANK", 0))
        log.debug(f"GLOBAL_RANK: {self._global_rank}")
        return self._global_rank

    @property
    def main_address(self):
        return self.master_address()

    @property
    def main_port(self):
        return self.master_port()

    def set_global_rank(self, rank: int):
        self._global_rank = rank
        log.debug(f"Set GLOBAL_RANK: {self._global_rank}")

    def set_world_size(self, world_size: int):
        self._world_size = world_size
        log.debug(f"Set WORLD_SIZE: {self._world_size}")

    def master_address(self):
        MASTER_ADDR = os.getenv("MASTER_ADDR")
        log.debug(f"MASTER_ADDR: {MASTER_ADDR}")
        return MASTER_ADDR

    def master_port(self):
        if self._master_port is None:
            self._master_port = os.getenv("MASTER_PORT")
        log.debug(f"MASTER_PORT: {self._master_port}")
        return int(self._master_port)

    def world_size(self):
        if self._world_size is None:
            log.debug("WORLD_SIZE is not set.")
        return self._world_size

    def node_rank(self):
        MY_RANK = int(os.getenv("NODE_RANK", "0"))
        log.debug(f"NODE_RANK: {MY_RANK}")
        return MY_RANK

    def local_rank(self) -> int:
        LOCAL_RANK = int(os.getenv("LOCAL_RANK", "0"))
        log.debug(f"LOCAL_RANK: {LOCAL_RANK}")
        return LOCAL_RANK
```
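As a quick sanity check, here is a hypothetical smoke test of this environment outside the Trainer; the environment-variable values below are placeholders I assumed for illustration:

```python
# Hypothetical smoke test: populate the variables the environment reads, then inspect it.
import os

os.environ["MASTER_ADDR"] = "512.124.134.4"  # placeholder master address
os.environ["MASTER_PORT"] = "4500"
os.environ["RANK"] = "0"
os.environ["NODE_RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"

env = CustomEnvironment(num_nodes=2)
env.set_world_size(16)
print(env.main_address, env.main_port, env.global_rank(), env.node_rank(), env.local_rank())
```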