
Support Gang Scheduling for PytorchJob on Kueue #2796

Open
FWCoder opened this issue Aug 7, 2024 · 0 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


FWCoder commented Aug 7, 2024

What happened:
When I submitted a PyTorchJob that requires 8 GPUs on the Master and 8 GPUs on the Worker, it was admitted even though only 8 GPUs are available in the ClusterQueue. Both the Master and Worker pods were created, but only the Master pod could progress to the Init and Running states. The Worker pod was stuck in Pending until the Master pod moved to the Completed state; at that point, the Worker pod got stuck in the Init state waiting for the Master pod to come up (a deadlock scenario).

This happens with waitForPodsReady enabled.
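For reference, a minimal sketch of the corresponding waitForPodsReady block in the Kueue controller manager Configuration (config.kueue.x-k8s.io/v1beta1); the timeout value here is illustrative, not the exact value from my cluster:

apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true
  blockAdmission: true   # default when waitForPodsReady is enabled
  timeout: 5m            # illustrative value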

What you expected to happen:
The Kueue controller manager should evaluate the total requested resources across both the Master and Worker replicas (here, 8 + 8 = 16 GPUs against 8 available) and block the job from being admitted until the ClusterQueue has enough resources for the whole job.

How to reproduce it (as minimally and precisely as possible):

Job Details:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  labels:
    kueue.x-k8s.io/queue-name: <LOCAL_QUEUE_NAME>
  name: hello-world-kueue
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - command:
                - "sleep"
                - "60"
              image: <PYTORCH_IMAGE>
              imagePullPolicy: Always
              name: pytorch
              resources:
                limits:
                  cpu: "86"
                  memory: 1037Gi
                  nvidia.com/gpu: "8"
                requests:
                  cpu: "86"
                  memory: 1037Gi
                  nvidia.com/gpu: "8"
          securityContext:
            runAsUser: 1000
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - command:
                - "sleep"
                - "10"
              image: <PYTORCH_IMAGE>
              imagePullPolicy: Always
              name: pytorch
              resources:
                limits:
                  cpu: "86"
                  memory: 1037Gi
                  nvidia.com/gpu: "8"
                requests:
                  cpu: "86"
                  memory: 1037Gi
                  nvidia.com/gpu: "8"
          securityContext:
            runAsUser: 1000
  runPolicy:
    ttlSecondsAfterFinished: 604800

Create Job:

kubectl create -f hello-world-kueue.yaml
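
For completeness, a sketch of the queue objects the job is submitted to. The names (gpu-cluster-queue, default-flavor) are hypothetical and the CPU/memory quotas are illustrative; only the 8-GPU nominal quota matches the situation described above:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cluster-queue          # hypothetical name
spec:
  namespaceSelector: {}
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-flavor     # hypothetical ResourceFlavor
          resources:
            - name: "cpu"
              nominalQuota: 172    # illustrative
            - name: "memory"
              nominalQuota: 2074Gi # illustrative
            - name: "nvidia.com/gpu"
              nominalQuota: 8      # only 8 GPUs available, as described above
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: <LOCAL_QUEUE_NAME>
  namespace: default
spec:
  clusterQueue: gpu-cluster-queue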

Anything else we need to know?:
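
A quick way to observe the deadlock after submitting the job (the namespace placeholder is to be filled in; the label selector assumes the training operator's usual training.kubeflow.org/job-name label, so adjust it if your operator version labels pods differently):

# Workload created by Kueue for the PyTorchJob; it is admitted despite the quota
kubectl -n <NAMESPACE> get workloads

# Pod states during the deadlock: Master Running, Worker stuck in Pending
kubectl -n <NAMESPACE> get pods -l training.kubeflow.org/job-name=hello-world-kueue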

Environment:

  • Kubernetes version (use kubectl version): 1.28
  • Kueue version (use git describe --tags --dirty --always): 0.6.1
  • Cloud provider or hardware configuration: AWS

@FWCoder FWCoder added the kind/bug Categorizes issue or PR as related to a bug. label Aug 7, 2024
@FWCoder FWCoder changed the title Support Gang Scheduling for Kueue Support Gang Scheduling for PytorchJob on Kueue Aug 7, 2024