
[k8s-on-prem] Timeout issue with Traefik deployment replicas more than 1 #7370

Open
Ryan-ZL-Lin opened this issue Jun 25, 2024 · 0 comments

Description
I'm trying to use Triton with Kubernetes on-prem by following this repo.
Here is my setup

With only 1 Traefik Pod, the instances in the Target Group of the Load Balancer on AWS are always 1 healthy and 1 unhealthy, except for port 80.

The healthy instance (worker1) is the node where the Traefik Pod is deployed to.

Triton Information
imageName: nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3

To Reproduce
Steps to reproduce the behavior.

  1. Run helm install example . Here are all the Pods.

  2. Change replicas to 2 in the Traefik Deployment:

  • kubectl edit deployment example-traefik
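The same replica change can be made declaratively. A minimal sketch of the relevant part of the edited Deployment spec, assuming the example-traefik name from the command above:

```yaml
# Sketch of the Traefik Deployment after editing;
# only `replicas` changes (from 1 to 2).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-traefik
spec:
  replicas: 2
```

Alternatively, kubectl scale deployment example-traefik --replicas=2 achieves the same result without opening an editor.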

  3. Monitor the status changes.

  4. Check the connectivity on each node:

  • Although all the targets are healthy, connections (on all 4 ports) sometimes time out (with 30–50% probability). This also happens on the client side making inference requests from the internet to Triton. The timeout issue goes away once I reduce the Traefik replicas to 1.

  • The timeout here refers to the connection from the AWS load balancer Listener to the NodePort (32266, 31647, etc.) on each node. I tried running curl -v localhost:32266 directly on both nodes and got the same timeout result.

  • Note that I changed the Unhealthy Threshold on the AWS Console to 10, so all the instances appear healthy in the Resource Map, but lots of timeouts were actually happening.

  • Because of this timeout issue, I observed that a third Triton Pod is auto-scheduled but always stays in status Pending, since it's not allowed to deploy to the master node.
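The connectivity check above can be sketched as a small probe script. The NodePort 32266 comes from this report and the attempt count is an arbitrary assumption; adjust both for your cluster:

```shell
#!/usr/bin/env bash
# Probe a NodePort repeatedly and count successes vs. timeouts/failures.
# NODE_PORT 32266 is taken from this report; change it for your cluster.
NODE_PORT="${NODE_PORT:-32266}"
ATTEMPTS="${ATTEMPTS:-10}"
ok=0
failed=0
for _ in $(seq 1 "$ATTEMPTS"); do
  # --max-time bounds each attempt, so a hung connection counts as a failure.
  if curl -s -o /dev/null --max-time 5 "http://localhost:${NODE_PORT}"; then
    ok=$((ok + 1))
  else
    failed=$((failed + 1))
  fi
done
echo "ok=${ok} failed=${failed} total=${ATTEMPTS}"
```

With 2 Traefik replicas, a 30–50% failed rate on either node would reproduce the symptom described above.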

Expected behavior
I'd like to have 2 Traefik Pods running properly with the Load Balancer and Triton Pods in a Kubernetes cluster on AWS. There should be no timeouts like this.
