
Support Gang Scheduling for PytorchJob on Kueue #2796

Open
FWCoder opened this issue Aug 7, 2024 · 0 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


FWCoder commented Aug 7, 2024

What happened:
When I submitted a PyTorchJob that requires 8 GPUs on the Master and 8 GPUs on the Worker, it was admitted even though only 8 GPUs are available in the ClusterQueue. Both the Master and Worker pods were created, but only the Master pod could progress to the Init and Running states. The Worker pod was stuck in Pending until the Master pod moved to the Completed state; at that point, the Worker pod got stuck in the Init state waiting for the Master pod to come up (a deadlock scenario).

This happens with waitForPodsReady enabled.
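For reference, a minimal sketch of the corresponding waitForPodsReady block in the Kueue controller manager Configuration (config.kueue.x-k8s.io/v1beta1); the timeout value here is illustrative, not the exact value from my cluster:

apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true
  blockAdmission: true   # default when waitForPodsReady is enabled
  timeout: 5m            # illustrative value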

What you expected to happen:
The Kueue controller manager should evaluate the total requested resources across both the Master and Worker replicas (here, 8 + 8 = 16 GPUs against 8 available) and block the job from being admitted until the ClusterQueue has enough resources for the whole job.

How to reproduce it (as minimally and precisely as possible):

Job Details:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  labels:
    kueue.x-k8s.io/queue-name: <LOCAL_QUEUE_NAME>
  name: hello-world-kueue
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - command:
                - "sleep"
                - "60"
              image: <PYTORCH_IMAGE>
              imagePullPolicy: Always
              name: pytorch
              resources:
                limits:
                  cpu: "86"
                  memory: 1037Gi
                  nvidia.com/gpu: "8"
                requests:
                  cpu: "86"
                  memory: 1037Gi
                  nvidia.com/gpu: "8"
          securityContext:
            runAsUser: 1000
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - command:
                - "sleep"
                - "10"
              image: <PYTORCH_IMAGE>
              imagePullPolicy: Always
              name: pytorch
              resources:
                limits:
                  cpu: "86"
                  memory: 1037Gi
                  nvidia.com/gpu: "8"
                requests:
                  cpu: "86"
                  memory: 1037Gi
                  nvidia.com/gpu: "8"
          securityContext:
            runAsUser: 1000
  runPolicy:
    ttlSecondsAfterFinished: 604800

Create Job:

kubectl create -f hello-world-kueue.yaml
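
For completeness, a sketch of the queue objects the job is submitted to. The names (gpu-cluster-queue, default-flavor) are hypothetical and the CPU/memory quotas are illustrative; only the 8-GPU nominal quota matches the situation described above:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cluster-queue          # hypothetical name
spec:
  namespaceSelector: {}
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-flavor     # hypothetical ResourceFlavor
          resources:
            - name: "cpu"
              nominalQuota: 172    # illustrative
            - name: "memory"
              nominalQuota: 2074Gi # illustrative
            - name: "nvidia.com/gpu"
              nominalQuota: 8      # only 8 GPUs available, as described above
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: <LOCAL_QUEUE_NAME>
  namespace: default
spec:
  clusterQueue: gpu-cluster-queue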

Anything else we need to know?:
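
A quick way to observe the deadlock after submitting the job (the namespace placeholder is to be filled in; the label selector assumes the training operator's usual training.kubeflow.org/job-name label, so adjust it if your operator version labels pods differently):

# Workload created by Kueue for the PyTorchJob; it is admitted despite the quota
kubectl -n <NAMESPACE> get workloads

# Pod states during the deadlock: Master Running, Worker stuck in Pending
kubectl -n <NAMESPACE> get pods -l training.kubeflow.org/job-name=hello-world-kueue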

Environment:

  • Kubernetes version (use kubectl version): 1.28
  • Kueue version (use git describe --tags --dirty --always): 0.6.1
  • Cloud provider or hardware configuration: AWS

@FWCoder FWCoder added the kind/bug Categorizes issue or PR as related to a bug. label Aug 7, 2024
@FWCoder FWCoder changed the title Support Gang Scheduling for Kueue Support Gang Scheduling for PytorchJob on Kueue Aug 7, 2024