For PyTorch elastic synchronous training jobs, the number of workers is typically set to a range between min_nodes and max_nodes. If fewer than min_nodes nodes are available, the training iteration cannot start, and the workers that have already been launched occupy resources while they wait, wasting hardware capacity. Gang Scheduling, by contrast, does not launch the worker Pods until the cluster has at least min_nodes nodes available.
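For context, the min/max node range described above is what torchrun's `--nnodes=MIN:MAX` (or `LaunchConfig` in `torch.distributed.elastic`) expresses. Below is a minimal sketch of such a launch; the entrypoint, rendezvous endpoint, and node counts are placeholders, not part of this issue:

```python
# Sketch of an elastic PyTorch launch with a min/max node range.
# `train`, the rendezvous endpoint, and the node counts are illustrative placeholders.
from torch.distributed.launcher.api import LaunchConfig, elastic_launch


def train():
    # Placeholder training entrypoint; each elastic worker process runs this.
    pass


config = LaunchConfig(
    min_nodes=2,           # below this count the rendezvous blocks and no iteration starts
    max_nodes=4,           # the job can scale up to this many nodes
    nproc_per_node=8,
    rdzv_backend="c10d",
    rdzv_endpoint="master:29400",  # placeholder rendezvous address
    run_id="elastic-job",
    max_restarts=3,
)

if __name__ == "__main__":
    # Workers launched this way join the rendezvous and wait until at least
    # min_nodes nodes are present; until then they hold resources without doing
    # useful work -- the idle waiting that Gang Scheduling avoids by delaying
    # the Pod launch itself.
    elastic_launch(config, train)()
```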