[K8s] Wait until endpoint to be ready for --endpoint call #3634

Merged
merged 9 commits into from
Jul 4, 2024

Conversation

Collaborator

@Michaelvll Michaelvll commented Jun 5, 2024

This PR makes the --endpoint call wait until the endpoint of a newly launched cluster on Kubernetes is ready before fetching it.

The following does not work on master:
sky launch -c test-port --ports 8889 --cloud kubernetes --cpus 2; sky status --endpoint 8889 test-port

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch -c test-port --ports 8889 --cloud kubernetes --cpus 2; sky status --endpoint 8889 test-port
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@Michaelvll Michaelvll changed the title from "[K8s] Wait until endpoint to be ready for load balancer" to "[K8s] Wait until endpoint to be ready for --endpoint call" on Jun 7, 2024
@Michaelvll Michaelvll added this to the v0.6.1 milestone Jun 25, 2024
Comment on lines 237 to 251
start_time = time.time()
retry_cnt = 0
while ip is None and time.time() - start_time < timeout:
    service = core_api.read_namespaced_service(
        service_name, namespace, _request_timeout=kubernetes.API_TIMEOUT)
    if service.status.load_balancer.ingress is not None:
        ip = (service.status.load_balancer.ingress[0].ip or
              service.status.load_balancer.ingress[0].hostname)
    if ip is None:
        retry_cnt += 1
        if retry_cnt % 5 == 0:
            logger.debug('Waiting for load balancer IP to be assigned'
                         '...')
        time.sleep(1)
return ip
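
To try the polling pattern from the snippet above without a cluster, here is a minimal self-contained sketch. `FakeCoreApi` and `wait_for_load_balancer_ip` are hypothetical names made up for illustration and are not part of this PR; the real code calls `core_api.read_namespaced_service` on the Kubernetes client, and the short `interval` default is only there to keep the demo fast.

```python
import time
from types import SimpleNamespace
from typing import Optional


class FakeCoreApi:
    """Hypothetical stand-in for the Kubernetes core API client.

    Reports no ingress for the first `ready_after` reads, then an IP.
    """

    def __init__(self, ready_after: int, ip: str = '10.0.0.1'):
        self.reads = 0
        self.ready_after = ready_after
        self.ip = ip

    def read_namespaced_service(self, name: str, namespace: str):
        self.reads += 1
        ingress = None
        if self.reads > self.ready_after:
            ingress = [SimpleNamespace(ip=self.ip, hostname=None)]
        load_balancer = SimpleNamespace(ingress=ingress)
        return SimpleNamespace(
            status=SimpleNamespace(load_balancer=load_balancer))


def wait_for_load_balancer_ip(core_api, service_name: str, namespace: str,
                              timeout: float,
                              interval: float = 0.01) -> Optional[str]:
    """Poll the service until it has an ingress IP/hostname or time is up."""
    ip = None
    start_time = time.time()
    while ip is None and time.time() - start_time < timeout:
        service = core_api.read_namespaced_service(service_name, namespace)
        ingress = service.status.load_balancer.ingress
        if ingress is not None:
            # Some load balancers report a hostname instead of an IP.
            ip = ingress[0].ip or ingress[0].hostname
        if ip is None:
            time.sleep(interval)
    return ip
```

With `FakeCoreApi(ready_after=2)`, the third read is the first one that reports an ingress entry, so the function returns after three reads.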
Collaborator

Following the end-to-end argument, I'm thinking this retry + timeout functionality is better implemented higher up in the stack (perhaps in backend_utils.get_endpoints when provision_lib.query_ports is called). E.g., it might also be required for other port query methods/clouds in the future.

wdyt?
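
For illustration, a retry placed higher in the stack could look roughly like the sketch below. `retry_until_available` is a hypothetical helper, not an existing SkyPilot function, and the `query` callable merely stands in for whatever port query the caller wraps.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar('T')


def retry_until_available(query: Callable[[], Optional[T]],
                          timeout: float,
                          interval: float = 1.0) -> Optional[T]:
    """Call `query` until it returns a non-None value or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = query()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            # Give up; the caller decides how to report the failure.
            return None
        time.sleep(interval)
```

A generic wrapper like this would apply to every port query method, which also means a query that can never succeed waits for the full timeout before the error surfaces.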

Collaborator Author

@Michaelvll Michaelvll Jul 1, 2024

Does this issue also occur in ingress mode? Adding the retry at the higher level makes sense to me if the issue is general. Otherwise, it may add unnecessary overhead by retrying before erroring out when the ports have actually failed to be exposed. Wdyt?

Collaborator

I see - I'm okay with having this check here then :) Maybe we can add a note to the provision_lib.query_ports docstring that the underlying implementation is responsible for retries and timeouts.

Collaborator Author

Good point! Updated.

Collaborator

@romilbhardwaj romilbhardwaj left a comment

Thanks @Michaelvll - tested on GKE and local clusters. Left a small comment about the timeout=0 condition, otherwise LGTM.
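
The thread containing the timeout=0 comment is not shown here, so the following is only a guess at the concern: when a polling loop is guarded solely by an elapsed-time check, a timeout of 0 skips the loop body and the service is never queried. A minimal sketch of the difference, assuming hypothetical function names and a plain callable in place of the Kubernetes call:

```python
import time
from typing import Callable, Optional


def poll_strict(query: Callable[[], Optional[str]], timeout: float,
                interval: float = 0.01) -> Optional[str]:
    """Loop guarded only by the elapsed-time check."""
    result = None
    start_time = time.monotonic()
    while result is None and time.monotonic() - start_time < timeout:
        result = query()
        if result is None:
            time.sleep(interval)
    return result


def poll_at_least_once(query: Callable[[], Optional[str]], timeout: float,
                       interval: float = 0.01) -> Optional[str]:
    """Same loop, but the first attempt runs regardless of the timeout."""
    result = None
    retry_cnt = 0
    start_time = time.monotonic()
    while result is None and (retry_cnt == 0 or
                              time.monotonic() - start_time < timeout):
        result = query()
        if result is None:
            retry_cnt += 1
            time.sleep(interval)
    return result
```

With `timeout=0`, `poll_strict` returns None without calling `query`, while `poll_at_least_once` still makes a single query and returns its result.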

sky/provision/kubernetes/network_utils.py (outdated, resolved)
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Thanks @Michaelvll!

@Michaelvll
Collaborator Author

Tested:

  • sky serve up --cloud kubernetes --cpus 2 examples/serve/http_server/task.yaml (controller on kubernetes, GKE)
  • sky launch -c test-port --ports 8889 --cloud kubernetes --cpus 2 -d; sky status --endpoint 8889 test-port

@Michaelvll Michaelvll merged commit 92f55a4 into master Jul 4, 2024
20 checks passed
@Michaelvll Michaelvll deleted the wait-until-endpoint-ready branch July 4, 2024 07:50