[k8s] Show hints when requested resources don't fit in Kubernetes cluster #3590

Open: wants to merge 3 commits into base: master

@romilbhardwaj (Collaborator) commented May 23, 2024

This PR adds the plumbing to surface hints from the cloud when no feasible resources are found. This is useful for surfacing errors in Kubernetes.

Closes #3506.

Examples

===== GPU drivers not installed on the node =====
$ sky launch -c test3 --gpus T4:1 --cloud kubernetes
I 05-23 13:39:13 optimizer.py:1263] No resource satisfying Kubernetes({'T4': 1}) on Kubernetes.
I 05-23 13:39:13 optimizer.py:1272] Kubernetes: Could not detect GPU resources (`nvidia.com/gpu`) in Kubernetes cluster. If this cluster contains GPUs, please ensure GPU drivers are installed on the node. Check if the GPUs are setup correctly by running `kubectl describe nodes` and looking for the nvidia.com/gpu resource. Please refer to the documentation on how to set up GPUs.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request:
Task<name=sky-cmd>(run=<empty>)
  resources: Kubernetes({'T4': 1}).

To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.


======= When enough CPU is not available =======
$ sky launch -c test3 --gpus T4:1 --cpus 32 --cloud kubernetes
I 05-23 13:47:27 optimizer.py:1263] No resource satisfying Kubernetes(cpus=32, {'T4': 1}) on Kubernetes.
I 05-23 13:47:27 optimizer.py:1272] Kubernetes: GPU nodes with T4 do not have enough CPU (> 32.0 vCPUs) and/or memory (> 128.0 G). Maximum resources found on a single node: 16.0 CPUs, 58.9G Memory
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request:
Task<name=sky-cmd>(run=<empty>)
  resources: Kubernetes(cpus=32, {'T4': 1}).

To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.
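
For reviewers, here is a minimal sketch of how a hint like the one in the second example could be assembled. It is not the code in this PR: the function name `cpu_memory_hint` and the `(cpus, memory_gb)` node representation are assumptions made for illustration, and only the message wording is taken from the log output above.

```python
from typing import List, Optional, Tuple


def cpu_memory_hint(gpu_name: str, requested_cpus: float,
                    requested_memory_gb: float,
                    nodes: List[Tuple[float, float]]) -> Optional[str]:
    """Returns a hint if no single GPU node fits the request, else None.

    `nodes` holds hypothetical (cpus, memory_gb) pairs, one per node that
    carries the requested GPU.
    """
    if not nodes:
        # No GPU nodes at all is a different failure with its own hint.
        return None
    for cpus, memory_gb in nodes:
        if cpus >= requested_cpus and memory_gb >= requested_memory_gb:
            # At least one node can host the request, so no hint is needed.
            return None
    # Report the largest node (by CPU count, then memory) to guide the user.
    max_cpus, max_memory_gb = max(nodes)
    return (f'GPU nodes with {gpu_name} do not have enough CPU '
            f'(> {requested_cpus} vCPUs) and/or memory '
            f'(> {requested_memory_gb} G). Maximum resources found on a '
            f'single node: {max_cpus} CPUs, {max_memory_gb}G Memory')


if __name__ == '__main__':
    print(cpu_memory_hint('T4', 32.0, 128.0, [(16.0, 58.9), (8.0, 30.0)]))
```

Returning `None` when a node fits keeps the hint optional, which matches the `Optional[str]` element added to the return type in this PR.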

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Manual tests from above issue and other optimizer tests in tests/*

@Michaelvll (Collaborator) left a comment


Thanks for adding this, @romilbhardwaj! This is great. If we specify an exact value such as cpus: 10 without a + at the end, many clouds will not be able to provide such an instance, so a hint for that case would be very useful. Can we add a TODO for adding those hints in the future?
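
To make the `cpus: 10` versus `cpus: 10+` case concrete, here is a rough sketch of the distinction, assuming a trailing `+` means "at least this many" and its absence means an exact match. The helpers `parse_cpus` and `satisfies` are hypothetical names for illustration and are not functions from the codebase.

```python
from typing import Tuple


def parse_cpus(spec: str) -> Tuple[float, bool]:
    """Splits a spec such as '10' or '10+' into (count, is_lower_bound)."""
    spec = spec.strip()
    is_lower_bound = spec.endswith('+')
    if is_lower_bound:
        spec = spec[:-1]
    return float(spec), is_lower_bound


def satisfies(spec: str, instance_cpus: float) -> bool:
    """Checks whether an instance with `instance_cpus` vCPUs meets `spec`."""
    count, is_lower_bound = parse_cpus(spec)
    if is_lower_bound:
        # '10+' accepts any instance with 10 or more vCPUs.
        return instance_cpus >= count
    # '10' only accepts an instance with exactly 10 vCPUs.
    return instance_cpus == count
```

Under these assumptions a 16-vCPU instance satisfies `10+` but not `10`, which is the situation where a hint suggesting the `+` suffix would help.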

@@ -425,7 +425,7 @@ def make_deploy_resources_variables(self,

     def _get_feasible_launchable_resources(
         self, resources: 'resources_lib.Resources'
-    ) -> Tuple[List['resources_lib.Resources'], List[str]]:
+    ) -> Tuple[List['resources_lib.Resources'], List[str], Optional[str]]:

Seems the return value is getting longer. Could we also do a refactoring for it in this PR, e.g., creating a data class for the return type to make it simpler?

@dataclasses.dataclass
class FeasibleResourcesWithHints:
    pass
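
One way the data class could be filled in is sketched below. The field names (`resources_list`, `fuzzy_candidate_list`, `hint`) are assumptions chosen to mirror the three elements of the tuple in the diff, not names taken from this PR, and plain strings stand in for `resources_lib.Resources` so the snippet runs on its own. The `describe` helper is also hypothetical.

```python
import dataclasses
from typing import List, Optional


@dataclasses.dataclass
class FeasibleResourcesWithHints:
    """Hypothetical replacement for the 3-tuple return value."""
    # Launchable resources that satisfy the request (strings stand in for
    # resources_lib.Resources in this sketch).
    resources_list: List[str]
    # Names of near-miss candidates that could be suggested to the user.
    fuzzy_candidate_list: List[str]
    # Optional cloud-specific explanation of why nothing was feasible.
    hint: Optional[str] = None


def describe(result: FeasibleResourcesWithHints) -> str:
    """Formats a result the way a caller might log it."""
    if result.resources_list:
        return 'Feasible: ' + ', '.join(result.resources_list)
    if result.hint is not None:
        return f'No feasible resources. Hint: {result.hint}'
    return 'No feasible resources.'
```

Callers would then read `result.hint` by name instead of unpacking a positional tuple, so adding another field later would not change every call site.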

Development

Successfully merging this pull request may close these issues.

[k8s] [GKE] Fail to request T4 instance
2 participants