[k8s-on-prem] Timeout issue with Traefik deployment replicas more than 1
Description
I'm trying to use Triton with K8s on-prem by following this repo.
Here is my setup:
With only 1 Traefik Pod, the Target Group of the Load Balancer on AWS always shows 1 healthy and 1 unhealthy instance, except for port 80.
![image](https://private-user-images.githubusercontent.com/33056320/342553776-904c23c9-a43c-41b0-a9e1-0d6920bab232.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjAxMTYyMjIsIm5iZiI6MTcyMDExNTkyMiwicGF0aCI6Ii8zMzA1NjMyMC8zNDI1NTM3NzYtOTA0YzIzYzktYTQzYy00MWIwLWE5ZTEtMGQ2OTIwYmFiMjMyLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MDQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzA0VDE3NTg0MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTc2YjVlODdmMjdhMjZlZGMwMzdjOTY0MGNkYjMzM2NiNWZhY2ZjMjZjMzYwZDhkZTkwNTMwNWVkOThkYTEyNzgmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.rLIpPIzKi3m0NN2BojupTFK0-nqMEUbM06w8IsxrTO8)
The healthy instance (worker1) is the node where the Traefik Pod is deployed.
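For reference, the same target health can also be checked from the AWS CLI instead of the console; `<target-group-arn>` is a placeholder for the target group created for this load balancer:

```bash
# List the target group ARNs behind the load balancer, then inspect health.
# <target-group-arn> is a placeholder; pick the ARN from the first command.
aws elbv2 describe-target-groups --query 'TargetGroups[].TargetGroupArn'
aws elbv2 describe-target-health --target-group-arn <target-group-arn>
```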
Triton Information
imageName: nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
To Reproduce
Steps to reproduce the behavior:
Run `helm install example .`. Here are all the Pods:
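Concretely, this step is just the following (release name `example`, chart in the current directory, as in the repo's instructions; `-o wide` shows which node each Pod lands on):

```bash
# Install the chart in the current directory as release "example"
helm install example .

# List all the Pods and the node each one is scheduled on
kubectl get pods -o wide
```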
![image](https://private-user-images.githubusercontent.com/33056320/342557348-e3991ca0-2bdd-4e85-bb97-7b9e864eff38.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjAxMTYyMjIsIm5iZiI6MTcyMDExNTkyMiwicGF0aCI6Ii8zMzA1NjMyMC8zNDI1NTczNDgtZTM5OTFjYTAtMmJkZC00ZTg1LWJiOTctN2I5ZTg2NGVmZjM4LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MDQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzA0VDE3NTg0MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTg1Y2ZhZGVjNzZlY2E1NWYzMjQ4NmIxYTNkMjcwZjdjNmFiYjhlZjkzNzkyMmNlZmRhMjdhODA2MGU0ZDY4OTkmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.z_xB2qNfjG1QvVHHD2sjKWSLCJlY1Gjd9uxwDY4YevQ)
Change replicas to 2 in the Traefik Deployment with `kubectl edit deployment example-traefik`.
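Equivalently, here's a minimal sketch of the same change without the interactive edit; the label selector in the second command is the usual Traefik chart default and is only an assumption for this release:

```bash
# Scale the Traefik Deployment to 2 replicas (same effect as the manual edit)
kubectl scale deployment example-traefik --replicas=2

# Confirm both Traefik Pods are Running and see which node each one is on.
# NOTE: the label selector below is the typical Traefik chart default;
# adjust it if this release labels its Pods differently.
kubectl get pods -l app.kubernetes.io/name=traefik -o wide
```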
Monitor the status changes:
![image](https://private-user-images.githubusercontent.com/33056320/342554147-5947f529-ba6a-435c-a461-be27225e7fed.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjAxMTYyMjIsIm5iZiI6MTcyMDExNTkyMiwicGF0aCI6Ii8zMzA1NjMyMC8zNDI1NTQxNDctNTk0N2Y1MjktYmE2YS00MzVjLWE0NjEtYmUyNzIyNWU3ZmVkLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MDQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzA0VDE3NTg0MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTg2OGNkNzYwMTMyNDlkMDg4MDVlNGRkZjQ5MmFiYTJiMDk5ODk1OWEzMjU2ZThhYmY4YjU1MTA0Y2IxODhmZWEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.1FdU6hgcpJacftyZAg0Q1WAdOf9BxCF36ESyFIBOddk)
Check the connectivity on each node
Although all the targets are healthy, the connection (on all 4 ports) sometimes times out (roughly 30~50% of attempts). The same thing happens on the client side making inference requests from the internet to Triton. The timeout issue goes away once I reduce the Traefik replicas back to 1.
The timeout here refers to the connection from the AWS load balancer Listener to the NodePorts (32266, 31647, etc.) on each node. I also tried running `curl -v localhost:32266` directly on both nodes and still got the same timeout.
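To show what I mean by the timeout rate, here's a rough sketch of the check I run on each node; 32266 and 31647 are the NodePorts from my setup, and the other two ports would need to be substituted from `kubectl get svc`:

```bash
# Probe the Traefik NodePorts repeatedly to see how often the connection
# times out. "ok" only means the TCP/HTTP exchange completed (even a 404);
# a timeout makes curl exit non-zero.
for port in 32266 31647; do
  for i in $(seq 1 20); do
    if curl -sS -o /dev/null --max-time 5 "http://localhost:${port}"; then
      echo "port ${port} attempt ${i}: ok"
    else
      echo "port ${port} attempt ${i}: timeout/failure"
    fi
  done
done
```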
Note that I raised the Unhealthy Threshold in the AWS Console to 10, so all the instances show as healthy in the Resource Map, but there were actually lots of timeouts happening there.
![image](https://private-user-images.githubusercontent.com/33056320/342555939-310d6998-2627-4989-8a7c-6ec6a3f72e72.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjAxMTYyMjIsIm5iZiI6MTcyMDExNTkyMiwicGF0aCI6Ii8zMzA1NjMyMC8zNDI1NTU5MzktMzEwZDY5OTgtMjYyNy00OTg5LThhN2MtNmVjNmEzZjcyZTcyLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MDQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzA0VDE3NTg0MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWEwZDY1ZmZiOGQ2ZjExM2EwNmQ4MWNlYTc2ZmQzMzdiYmUwZWNkYmNiNTdjNjE1NWRhMjQ0NmM3MDczNjQ1ZGImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.AgBYUcQ3eQfPVvK6sI08KgyXIj1Dm_72VfDior4y8o0)
Because of this timeout issue, I observed that a third Triton Pod gets auto-scheduled but always stays in Pending status, since it isn't allowed to be placed on the master node (see the commands after the screenshots below).
![image](https://private-user-images.githubusercontent.com/33056320/342556265-4fc64331-c650-4d57-bdf7-73ffd6905612.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjAxMTYyMjIsIm5iZiI6MTcyMDExNTkyMiwicGF0aCI6Ii8zMzA1NjMyMC8zNDI1NTYyNjUtNGZjNjQzMzEtYzY1MC00ZDU3LWJkZjctNzNmZmQ2OTA1NjEyLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MDQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzA0VDE3NTg0MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTBhNGMwYTdjMWZkYWVkYTBlMjM3NjdhMzQ2ZDFiNTUzZDVjZjBlYTViYTA5ZGIyMzk3NTI3ZTk2ZWM5N2Q1YWEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.tr7Yfysxxi3QQTvyww_u8yzwoXN-5ezfjOEmEDS_0x4)
![image](https://private-user-images.githubusercontent.com/33056320/342556297-ba5a2db7-cbb3-4885-9581-3543e6a80b60.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjAxMTYyMjIsIm5iZiI6MTcyMDExNTkyMiwicGF0aCI6Ii8zMzA1NjMyMC8zNDI1NTYyOTctYmE1YTJkYjctY2JiMy00ODg1LTk1ODEtMzU0M2U2YTgwYjYwLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MDQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzA0VDE3NTg0MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWQzZTVkZWM0MDNhZTE3ZDM4NWJhOTEyNjg4YzljZDRiMWFkOGNkNDhhZjg5ZjUxM2QxMDVmNDU5ZDE4MTk2NmMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.dXbpDAMFR5hyMOx3IFC1pXzTk31lUK2rHHoZgT94Iw8)
![image](https://private-user-images.githubusercontent.com/33056320/342556347-32aae144-00bf-4d00-b29a-cc0f3c2bacf3.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjAxMTYyMjIsIm5iZiI6MTcyMDExNTkyMiwicGF0aCI6Ii8zMzA1NjMyMC8zNDI1NTYzNDctMzJhYWUxNDQtMDBiZi00ZDAwLWIyOWEtY2MwZjNjMmJhY2YzLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MDQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzA0VDE3NTg0MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTE1ZGE0YWU3NWE0NWJjYmNkZTJhNjYyMmQ3MDc4M2RiODU1N2E1ZTQ5ZmVjNzdhMGFlMTExZmQwYjY1ZGFjZDEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.I9gkoGhO4g0X2p2HPzmdNq3DiyUZuE1kIycyW1wvyGc)
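For completeness, this is roughly how I inspect why that Pod stays Pending; `<pending-triton-pod>` and `<master-node>` are placeholders for the actual names in my cluster:

```bash
# Show the scheduler events explaining why the Pod is Pending
# (<pending-triton-pod> is a placeholder for the actual Pod name)
kubectl describe pod <pending-triton-pod> | tail -n 20

# Check the taint that keeps workloads off the master/control-plane node
kubectl describe node <master-node> | grep -i taints
```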
Expected behavior
I'd like to have 2 Traefik Pods running properly with the Load Balancer and the Triton Pods in a K8s cluster on AWS. There should be no timeouts like this.