[k8s-on-prem] Timeout issue with Traefik deployment replicas more than 1
Description
I'm trying to use Triton with K8s on-prem by following this repo.
Here is my setup:
With only 1 Traefik Pod, the Target Group of the Load Balancer on AWS always shows 1 healthy and 1 unhealthy instance, except for port 80.
![image](https://private-user-images.githubusercontent.com/33056320/342553776-904c23c9-a43c-41b0-a9e1-0d6920bab232.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjAxMTYyMjIsIm5iZiI6MTcyMDExNTkyMiwicGF0aCI6Ii8zMzA1NjMyMC8zNDI1NTM3NzYtOTA0YzIzYzktYTQzYy00MWIwLWE5ZTEtMGQ2OTIwYmFiMjMyLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MDQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzA0VDE3NTg0MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTc2YjVlODdmMjdhMjZlZGMwMzdjOTY0MGNkYjMzM2NiNWZhY2ZjMjZjMzYwZDhkZTkwNTMwNWVkOThkYTEyNzgmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.rLIpPIzKi3m0NN2BojupTFK0-nqMEUbM06w8IsxrTO8)
The healthy instance (worker1) is the node where the Traefik Pod is deployed.
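For reference, the same target health can also be checked from the AWS CLI instead of the console; `<target-group-arn>` is a placeholder for the target group created for this load balancer:

```bash
# List the target group ARNs behind the load balancer, then inspect health.
# <target-group-arn> is a placeholder; pick the ARN from the first command.
aws elbv2 describe-target-groups --query 'TargetGroups[].TargetGroupArn'
aws elbv2 describe-target-health --target-group-arn <target-group-arn>
```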
Triton Information
imageName: nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
To Reproduce
Steps to reproduce the behavior:
Run `helm install example .`. Here are all the Pods:
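Concretely, this step is just the following (release name `example`, chart in the current directory, as in the repo's instructions; `-o wide` shows which node each Pod lands on):

```bash
# Install the chart in the current directory as release "example"
helm install example .

# List all the Pods and the node each one is scheduled on
kubectl get pods -o wide
```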
![image](https://private-user-images.githubusercontent.com/33056320/342557348-e3991ca0-2bdd-4e85-bb97-7b9e864eff38.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjAxMTYyMjIsIm5iZiI6MTcyMDExNTkyMiwicGF0aCI6Ii8zMzA1NjMyMC8zNDI1NTczNDgtZTM5OTFjYTAtMmJkZC00ZTg1LWJiOTctN2I5ZTg2NGVmZjM4LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MDQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzA0VDE3NTg0MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTg1Y2ZhZGVjNzZlY2E1NWYzMjQ4NmIxYTNkMjcwZjdjNmFiYjhlZjkzNzkyMmNlZmRhMjdhODA2MGU0ZDY4OTkmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.z_xB2qNfjG1QvVHHD2sjKWSLCJlY1Gjd9uxwDY4YevQ)
Change replicas to 2 in the Traefik Deployment with `kubectl edit deployment example-traefik`.
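Equivalently, here's a minimal sketch of the same change without the interactive edit; the label selector in the second command is the usual Traefik chart default and is only an assumption for this release:

```bash
# Scale the Traefik Deployment to 2 replicas (same effect as the manual edit)
kubectl scale deployment example-traefik --replicas=2

# Confirm both Traefik Pods are Running and see which node each one is on.
# NOTE: the label selector below is the typical Traefik chart default;
# adjust it if this release labels its Pods differently.
kubectl get pods -l app.kubernetes.io/name=traefik -o wide
```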
Monitor the status changes:
![image](https://private-user-images.githubusercontent.com/33056320/342554147-5947f529-ba6a-435c-a461-be27225e7fed.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjAxMTYyMjIsIm5iZiI6MTcyMDExNTkyMiwicGF0aCI6Ii8zMzA1NjMyMC8zNDI1NTQxNDctNTk0N2Y1MjktYmE2YS00MzVjLWE0NjEtYmUyNzIyNWU3ZmVkLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MDQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzA0VDE3NTg0MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTg2OGNkNzYwMTMyNDlkMDg4MDVlNGRkZjQ5MmFiYTJiMDk5ODk1OWEzMjU2ZThhYmY4YjU1MTA0Y2IxODhmZWEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.1FdU6hgcpJacftyZAg0Q1WAdOf9BxCF36ESyFIBOddk)
Check the connectivity on each node
Although all the targets are healthy, the connection (on all 4 ports) sometimes times out (roughly 30~50% of attempts). The same thing happens on the client side making inference requests from the internet to Triton. The timeout issue goes away once I reduce the Traefik replicas back to 1.
The timeout here refers to the connection from the AWS load balancer Listener to the NodePorts (32266, 31647, etc.) on each node. I also tried running `curl -v localhost:32266` directly on both nodes and still got the same timeout.
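To show what I mean by the timeout rate, here's a rough sketch of the check I run on each node; 32266 and 31647 are the NodePorts from my setup, and the other two ports would need to be substituted from `kubectl get svc`:

```bash
# Probe the Traefik NodePorts repeatedly to see how often the connection
# times out. "ok" only means the TCP/HTTP exchange completed (even a 404);
# a timeout makes curl exit non-zero.
for port in 32266 31647; do
  for i in $(seq 1 20); do
    if curl -sS -o /dev/null --max-time 5 "http://localhost:${port}"; then
      echo "port ${port} attempt ${i}: ok"
    else
      echo "port ${port} attempt ${i}: timeout/failure"
    fi
  done
done
```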
Note that I raised the Unhealthy Threshold in the AWS Console to 10, so all the instances show as healthy in the Resource Map, but there were actually lots of timeouts happening there.
![image](https://private-user-images.githubusercontent.com/33056320/342555939-310d6998-2627-4989-8a7c-6ec6a3f72e72.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjAxMTYyMjIsIm5iZiI6MTcyMDExNTkyMiwicGF0aCI6Ii8zMzA1NjMyMC8zNDI1NTU5MzktMzEwZDY5OTgtMjYyNy00OTg5LThhN2MtNmVjNmEzZjcyZTcyLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MDQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzA0VDE3NTg0MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWEwZDY1ZmZiOGQ2ZjExM2EwNmQ4MWNlYTc2ZmQzMzdiYmUwZWNkYmNiNTdjNjE1NWRhMjQ0NmM3MDczNjQ1ZGImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.AgBYUcQ3eQfPVvK6sI08KgyXIj1Dm_72VfDior4y8o0)
Because of this timeout issue, I observed that a third Triton Pod gets auto-scheduled but always stays in Pending status, since it isn't allowed to be placed on the master node (see the commands after the screenshots below).
![image](https://private-user-images.githubusercontent.com/33056320/342556265-4fc64331-c650-4d57-bdf7-73ffd6905612.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjAxMTYyMjIsIm5iZiI6MTcyMDExNTkyMiwicGF0aCI6Ii8zMzA1NjMyMC8zNDI1NTYyNjUtNGZjNjQzMzEtYzY1MC00ZDU3LWJkZjctNzNmZmQ2OTA1NjEyLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MDQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzA0VDE3NTg0MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTBhNGMwYTdjMWZkYWVkYTBlMjM3NjdhMzQ2ZDFiNTUzZDVjZjBlYTViYTA5ZGIyMzk3NTI3ZTk2ZWM5N2Q1YWEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.tr7Yfysxxi3QQTvyww_u8yzwoXN-5ezfjOEmEDS_0x4)
![image](https://private-user-images.githubusercontent.com/33056320/342556297-ba5a2db7-cbb3-4885-9581-3543e6a80b60.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjAxMTYyMjIsIm5iZiI6MTcyMDExNTkyMiwicGF0aCI6Ii8zMzA1NjMyMC8zNDI1NTYyOTctYmE1YTJkYjctY2JiMy00ODg1LTk1ODEtMzU0M2U2YTgwYjYwLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MDQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzA0VDE3NTg0MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWQzZTVkZWM0MDNhZTE3ZDM4NWJhOTEyNjg4YzljZDRiMWFkOGNkNDhhZjg5ZjUxM2QxMDVmNDU5ZDE4MTk2NmMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.dXbpDAMFR5hyMOx3IFC1pXzTk31lUK2rHHoZgT94Iw8)
![image](https://private-user-images.githubusercontent.com/33056320/342556347-32aae144-00bf-4d00-b29a-cc0f3c2bacf3.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjAxMTYyMjIsIm5iZiI6MTcyMDExNTkyMiwicGF0aCI6Ii8zMzA1NjMyMC8zNDI1NTYzNDctMzJhYWUxNDQtMDBiZi00ZDAwLWIyOWEtY2MwZjNjMmJhY2YzLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MDQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzA0VDE3NTg0MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTE1ZGE0YWU3NWE0NWJjYmNkZTJhNjYyMmQ3MDc4M2RiODU1N2E1ZTQ5ZmVjNzdhMGFlMTExZmQwYjY1ZGFjZDEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.I9gkoGhO4g0X2p2HPzmdNq3DiyUZuE1kIycyW1wvyGc)
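For completeness, this is roughly how I inspect why that Pod stays Pending; `<pending-triton-pod>` and `<master-node>` are placeholders for the actual names in my cluster:

```bash
# Show the scheduler events explaining why the Pod is Pending
# (<pending-triton-pod> is a placeholder for the actual Pod name)
kubectl describe pod <pending-triton-pod> | tail -n 20

# Check the taint that keeps workloads off the master/control-plane node
kubectl describe node <master-node> | grep -i taints
```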
Expected behavior
I'd like to have 2 Traefik Pods running properly with the Load Balancer and the Triton Pods in a K8s cluster on AWS. There should be no timeouts like this.