map gke h100 megas to 'H100' #3691

Merged 3 commits on Jul 2, 2024

@asaiacai (Contributor) commented Jun 26, 2024

GCP recently introduced a3-mega instances with improved bandwidth. On GKE, these nodes are labeled cloud.google.com/gke-accelerator=nvidia-h100-mega-80gb, so SkyPilot recognized their GPUs as H100-MEGA-80GB and requests for H100 could not be satisfied. This change maps that label to H100.
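
For readers unfamiliar with the label-to-name mapping, here is a minimal sketch of the idea, not the actual code in sky/provision/kubernetes/utils.py. The function name `gke_label_to_accelerator` and its normalization rules are hypothetical and chosen only for illustration; the real implementation may handle more cases.

```python
# Hypothetical helper for illustration only; not SkyPilot's implementation.
def gke_label_to_accelerator(label_value: str) -> str:
    """Map a cloud.google.com/gke-accelerator label value to a GPU name."""
    name = label_value.lower()
    # GKE prefixes accelerator label values with 'nvidia-'.
    if name.startswith('nvidia-'):
        name = name[len('nvidia-'):]
    # Assumption: every H100 variant (h100-80gb, h100-mega-80gb) is
    # treated as plain 'H100' so that `--gpus H100:8` can match it.
    if name.startswith('h100'):
        return 'H100'
    # Other accelerators are simply upper-cased in this sketch.
    return name.upper()


if __name__ == '__main__':
    for value in ('nvidia-h100-mega-80gb', 'nvidia-h100-80gb', 'nvidia-l4'):
        print(value, '->', gke_label_to_accelerator(value))
```

Without the H100 special case, the sketch would upper-case the a3-mega label value into H100-MEGA-80GB, which is the name seen in the "Before" output.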

Before

(sky) Andrews-MacBook-Air:skypilot asai$ sky show-gpus --cloud kubernetes
Kubernetes GPUs
GPU             QTY_PER_NODE            TOTAL_GPUS  TOTAL_FREE_GPUS  
H100-MEGA-80GB  1, 2, 3, 4, 5, 6, 7, 8  16          16               
(sky) Andrews-MacBook-Air:skypilot asai$ sky launch --cloud kubernetes --gpus H100:8
I 06-26 15:58:20 optimizer.py:1264] No resource satisfying Kubernetes({'H100': 8}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request:
Task<name=sky-cmd>(run=<empty>)
  resources: Kubernetes({'H100': 8}).

To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Manual test
(sky) Andrews-MacBook-Air:skypilot asai$  sky show-gpus --cloud kubernetes
Kubernetes GPUs
GPU   QTY_PER_NODE            TOTAL_GPUS  TOTAL_FREE_GPUS  
H100  1, 2, 3, 4, 5, 6, 7, 8  16          16               
(sky) Andrews-MacBook-Air:skypilot asai$ sky launch --cloud kubernetes --gpus H100:8
I 06-26 16:02:26 optimizer.py:695] == Optimizer ==
I 06-26 16:02:26 optimizer.py:718] Estimated cost: $0.0 / hour
I 06-26 16:02:26 optimizer.py:718] 
I 06-26 16:02:26 optimizer.py:843] Considered resources (1 node):
I 06-26 16:02:26 optimizer.py:913] ----------------------------------------------------------------------------------------------------
I 06-26 16:02:26 optimizer.py:913]  CLOUD        INSTANCE           vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 06-26 16:02:26 optimizer.py:913] ----------------------------------------------------------------------------------------------------
I 06-26 16:02:26 optimizer.py:913]  Kubernetes   2CPU--8GB--8H100   2       8         H100:8         kubernetes    0.00          ✔     
I 06-26 16:02:26 optimizer.py:913] ----------------------------------------------------------------------------------------------------
I 06-26 16:02:26 optimizer.py:913] 
Launching a new cluster 'sky-adb3-asai'. Proceed? [Y/n]: n
Aborted!

@Michaelvll (Collaborator) left a comment

Thanks for fixing this @asaiacai! LGTM.

Review thread on sky/provision/kubernetes/utils.py (outdated, resolved)
@Michaelvll merged commit 47d3dc0 into skypilot-org:master on Jul 2, 2024
20 checks passed