[Cudo] Unable to launch instances on Cudo #3710

Closed
romilbhardwaj opened this issue Jul 1, 2024 · 3 comments

@romilbhardwaj
Collaborator

All launches are failing with a "There are no hosts available for your specified virtual machine." error from the Cudo API.

Repro:

# Any GPU type fails
sky launch -c test --cloud cudo --gpus RTXA6000:1

Error:

I 07-01 12:50:23 cloud_vm_ray_backend.py:4420] Creating a new cluster: 'test' [1x Cudo(epyc-rome-rtx-a6000_4x1v2gb, {'RTXA6000': 1})].
I 07-01 12:50:23 cloud_vm_ray_backend.py:4420] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 07-01 12:50:23 cloud_vm_ray_backend.py:1406] To view detailed progress: tail -n100 -f /Users/romilb/sky_logs/sky-2024-07-01-12-50-22-143430/provision.log
D 07-01 12:50:23 backend_utils.py:853] Using ssh_proxy_command: None
D 07-01 12:50:23 provisioner.py:168] SkyPilot version: 1.0.0-dev0; commit: 1a15411aa461b903b0b09218ebb3ac70d5da9f50-dirty
D 07-01 12:50:23 provisioner.py:170]
D 07-01 12:50:23 provisioner.py:170]
D 07-01 12:50:23 provisioner.py:170] ==================== Provisioning ====================
D 07-01 12:50:23 provisioner.py:170]
D 07-01 12:50:23 provisioner.py:171] Provision config:
D 07-01 12:50:23 provisioner.py:171] {
D 07-01 12:50:23 provisioner.py:171]   "provider_config": {
D 07-01 12:50:23 provisioner.py:171]     "type": "external",
D 07-01 12:50:23 provisioner.py:171]     "module": "sky.provision.cudo",
D 07-01 12:50:23 provisioner.py:171]     "region": "no-luster-1",
D 07-01 12:50:23 provisioner.py:171]     "disable_launch_config_check": true
D 07-01 12:50:23 provisioner.py:171]   },
D 07-01 12:50:23 provisioner.py:171]   "authentication_config": {
D 07-01 12:50:23 provisioner.py:171]     "ssh_user": "root",
D 07-01 12:50:23 provisioner.py:171]     "ssh_private_key": "~/.ssh/sky-key"
D 07-01 12:50:23 provisioner.py:171]   },
D 07-01 12:50:23 provisioner.py:171]   "docker_config": {},
D 07-01 12:50:23 provisioner.py:171]   "node_config": {
D 07-01 12:50:23 provisioner.py:171]     "AuthorizedKey": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDDErDY64rh5Jrt8CiFPV0Hda8o7YR0wrShb6KYwtH51lnLvBSRhf2MlmS/olrtF2XI/s0B2iOUuJJPsmlpwdS9c/ILFT4sM+Np+BeBWZ45mV+KksPX4Of+LWIEV/GBM/HGVeI5Pa7AG5aCQrhu+QKco3KaQndOdUMEhq0MjhCKABdmX8Yj0PkBuErYMW0Z2MibEiU2OJJO/5zJ1GU4RjKTAWIMmceuCaYc2EqX14s5VtXCgM6pcRmAi7RiH7IxjNnD81QWRFGGzZ3aRwe4efAXhPsyAxAvnEInjro23if5NsXaP9gTzzktS8TyomLnvV/v9vbpo1IAVEu543Ygy+x9\n",
D 07-01 12:50:23 provisioner.py:171]     "InstanceType": "epyc-rome-rtx-a6000_4x1v2gb",
D 07-01 12:50:23 provisioner.py:171]     "DiskSize": 256
D 07-01 12:50:23 provisioner.py:171]   },
D 07-01 12:50:23 provisioner.py:171]   "count": 1,
D 07-01 12:50:23 provisioner.py:171]   "tags": {},
D 07-01 12:50:23 provisioner.py:171]   "resume_stopped_nodes": true
D 07-01 12:50:23 provisioner.py:171] }
I 07-01 12:50:23 provisioner.py:76] Launching on Cudo no-luster-1 (all zones)
W 07-01 12:50:25 instance.py:99] run_instances error: (400)
W 07-01 12:50:25 instance.py:99] Reason: Bad Request
W 07-01 12:50:25 instance.py:99] HTTP response headers: HTTPHeaderDict({'Date': 'Mon, 01 Jul 2024 19:50:25 GMT', 'Content-Type': 'application/json', 'Content-Length': '102', 'Connection': 'keep-alive', 'vary': 'Origin', 'CF-Cache-Status': 'DYNAMIC', 'Report-To': '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report\\/v4?s=hkmQtvxuYZICjDUpd3LKN%2Fjy3OWUHHhKc%2B%2FoRmU851exeILRJhz4dSd8GB%2FulNEhQBuhuTO2EPRXgR0vGIONuDPSRhuVh8t50MGS0mUPa38e1pmOUthUGOwutkL7lkvVoEO6mRLI3Q%3D%3D"}],"group":"cf-nel","max_age":604800}', 'NEL': '{"success_fraction":0,"report_to":"cf-nel","max_age":604800}', 'Strict-Transport-Security': 'max-age=15552000; includeSubDomains; preload', 'X-Content-Type-Options': 'nosniff', 'Server': 'cloudflare', 'CF-RAY': '89c8ecc7acd9d045-SJC', 'alt-svc': 'h3=":443"; ma=86400'})
W 07-01 12:50:25 instance.py:99] HTTP response body: {"code":3, "message":"There are no hosts available for your specified virtual machine.", "details":[]}
W 07-01 12:50:25 instance.py:99]
D 07-01 12:50:25 provisioner.py:180] Failed to provision 'test' on Cudo (all zones).
D 07-01 12:50:25 provisioner.py:182] bulk_provision for 'test' failed. Stacktrace:
D 07-01 12:50:25 provisioner.py:182] Traceback (most recent call last):
D 07-01 12:50:25 provisioner.py:182]   File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/provisioner.py", line 174, in bulk_provision
D 07-01 12:50:25 provisioner.py:182]     return _bulk_provision(cloud, region, zones, cluster_name,
D 07-01 12:50:25 provisioner.py:182]   File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/provisioner.py", line 98, in _bulk_provision
D 07-01 12:50:25 provisioner.py:182]     provision_record = provision.run_instances(provider_name,
D 07-01 12:50:25 provisioner.py:182]   File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/__init__.py", line 47, in _wrapper
D 07-01 12:50:25 provisioner.py:182]     return impl(*args, **kwargs)
D 07-01 12:50:25 provisioner.py:182]   File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/cudo/instance.py", line 87, in run_instances
D 07-01 12:50:25 provisioner.py:182]     instance_id = cudo_wrapper.launch(
D 07-01 12:50:25 provisioner.py:182]   File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/cudo/cudo_wrapper.py", line 36, in launch
D 07-01 12:50:25 provisioner.py:182]     raise e
D 07-01 12:50:25 provisioner.py:182]   File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/cudo/cudo_wrapper.py", line 33, in launch
D 07-01 12:50:25 provisioner.py:182]     vm = api.create_vm(cudo.cudo.cudo_api.project_id_throwable(), request)
D 07-01 12:50:25 provisioner.py:182]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/api/virtual_machines_api.py", line 487, in create_vm
D 07-01 12:50:25 provisioner.py:182]     (data) = self.create_vm_with_http_info(project_id, create_vm_body, **kwargs)  # noqa: E501
D 07-01 12:50:25 provisioner.py:182]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/api/virtual_machines_api.py", line 557, in create_vm_with_http_info
D 07-01 12:50:25 provisioner.py:182]     return self.api_client.call_api(
D 07-01 12:50:25 provisioner.py:182]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/api_client.py", line 326, in call_api
D 07-01 12:50:25 provisioner.py:182]     return self.__call_api(resource_path, method,
D 07-01 12:50:25 provisioner.py:182]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/api_client.py", line 158, in __call_api
D 07-01 12:50:25 provisioner.py:182]     response_data = self.request(
D 07-01 12:50:25 provisioner.py:182]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/api_client.py", line 368, in request
D 07-01 12:50:25 provisioner.py:182]     return self.rest_client.POST(url,
D 07-01 12:50:25 provisioner.py:182]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/rest.py", line 269, in POST
D 07-01 12:50:25 provisioner.py:182]     return self.request("POST", url,
D 07-01 12:50:25 provisioner.py:182]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/rest.py", line 228, in request
D 07-01 12:50:25 provisioner.py:182]     raise ApiException(http_resp=r)
D 07-01 12:50:25 provisioner.py:182] cudo_compute.rest.ApiException: (400)
D 07-01 12:50:25 provisioner.py:182] Reason: Bad Request
D 07-01 12:50:25 provisioner.py:182] HTTP response headers: HTTPHeaderDict({'Date': 'Mon, 01 Jul 2024 19:50:25 GMT', 'Content-Type': 'application/json', 'Content-Length': '102', 'Connection': 'keep-alive', 'vary': 'Origin', 'CF-Cache-Status': 'DYNAMIC', 'Report-To': '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report\\/v4?s=hkmQtvxuYZICjDUpd3LKN%2Fjy3OWUHHhKc%2B%2FoRmU851exeILRJhz4dSd8GB%2FulNEhQBuhuTO2EPRXgR0vGIONuDPSRhuVh8t50MGS0mUPa38e1pmOUthUGOwutkL7lkvVoEO6mRLI3Q%3D%3D"}],"group":"cf-nel","max_age":604800}', 'NEL': '{"success_fraction":0,"report_to":"cf-nel","max_age":604800}', 'Strict-Transport-Security': 'max-age=15552000; includeSubDomains; preload', 'X-Content-Type-Options': 'nosniff', 'Server': 'cloudflare', 'CF-RAY': '89c8ecc7acd9d045-SJC', 'alt-svc': 'h3=":443"; ma=86400'})
D 07-01 12:50:25 provisioner.py:182] HTTP response body: {"code":3, "message":"There are no hosts available for your specified virtual machine.", "details":[]}
D 07-01 12:50:25 provisioner.py:182]
D 07-01 12:50:25 provisioner.py:182]
D 07-01 12:50:25 provisioner.py:187] Terminating the failed cluster.
D 07-01 12:50:26 common_utils.py:509] Tried to remove /Users/romilb/.sky/generated/ssh/test but failed to find it. Skip.
D 07-01 12:50:26 cloud_vm_ray_backend.py:1187] Got error(s) in Cudo:[cudo_compute.rest.ApiException] (400)
D 07-01 12:50:26 cloud_vm_ray_backend.py:1187] Reason: Bad Request
D 07-01 12:50:26 cloud_vm_ray_backend.py:1187] HTTP response headers: HTTPHeaderDict({'Date': 'Mon, 01 Jul 2024 19:50:25 GMT', 'Content-Type': 'application/json', 'Content-Length': '102', 'Connection': 'keep-alive', 'vary': 'Origin', 'CF-Cache-Status': 'DYNAMIC', 'Report-To': '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report\\/v4?s=hkmQtvxuYZICjDUpd3LKN%2Fjy3OWUHHhKc%2B%2FoRmU851exeILRJhz4dSd8GB%2FulNEhQBuhuTO2EPRXgR0vGIONuDPSRhuVh8t50MGS0mUPa38e1pmOUthUGOwutkL7lkvVoEO6mRLI3Q%3D%3D"}],"group":"cf-nel","max_age":604800}', 'NEL': '{"success_fraction":0,"report_to":"cf-nel","max_age":604800}', 'Strict-Transport-Security': 'max-age=15552000; includeSubDomains; preload', 'X-Content-Type-Options': 'nosniff', 'Server': 'cloudflare', 'CF-RAY': '89c8ecc7acd9d045-SJC', 'alt-svc': 'h3=":443"; ma=86400'})
D 07-01 12:50:26 cloud_vm_ray_backend.py:1187] HTTP response body: {"code":3, "message":"There are no hosts available for your specified virtual machine.", "details":[]}
D 07-01 12:50:26 cloud_vm_ray_backend.py:1187]
W 07-01 12:50:26 cloud_vm_ray_backend.py:2086] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in no-luster-1. Try changing resource requirements or use another region.
W 07-01 12:50:26 cloud_vm_ray_backend.py:2095]
W 07-01 12:50:26 cloud_vm_ray_backend.py:2095] Provision failed for 1x Cudo(epyc-rome-rtx-a6000_4x1v2gb, {'RTXA6000': 1}) in no-luster-1. Trying other locations (if any).
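
For triage, the response body in the log above is a small JSON object with "code", "message" and "details" fields. Below is a minimal sketch, assuming only that body shape, of how one might tell this specific failure apart from other 400 responses. The helper name `is_no_hosts_error` and the marker string are hypothetical; they are not part of SkyPilot or the cudo_compute SDK.

```python
import json

# Hypothetical marker, taken from the message text seen in the log above.
NO_HOSTS_MARKER = "no hosts available"


def is_no_hosts_error(status: int, body: str) -> bool:
    """Return True if an HTTP error looks like the 'no hosts available' failure.

    Assumes the body is JSON shaped like the one in the log:
    {"code": ..., "message": "...", "details": [...]}.
    """
    if status != 400:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # Not JSON, so it cannot be the response we are looking for.
        return False
    if not isinstance(payload, dict):
        return False
    message = payload.get("message", "")
    return isinstance(message, str) and NO_HOSTS_MARKER in message.lower()


# Body copied from the log above.
body = (
    '{"code":3, "message":"There are no hosts available for your '
    'specified virtual machine.", "details":[]}'
)
print(is_no_hosts_error(400, body))
```

A match only identifies the message text; it says nothing about why the API returned it.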

Version & Commit info:

  • sky -c:4821f70b3f4998821dd68c2afcdc7ff61b54ec46
@romilbhardwaj
Collaborator Author

cc @JungleCatSW - would you be able to help us with this?

@JungleCatSW
Contributor

@romilbhardwaj Hi, the fix is in PR #3256.

@Michaelvll
Collaborator

This has been fixed by #3256. Closing now.
