Latency Issue when using AWS CLI V2 inside my EKS pods #8660
Mohamed-Sharif asked this question in Q&A (unanswered, 0 replies)
Hello,
We are currently encountering a significant latency issue as we transition from an old EKS cluster (k8s 1.24) to a newer version (k8s 1.29). Despite thorough internal investigations and profiling, we have been unable to pinpoint the exact cause of this latency, which seems to involve AWS SDK for PHP and CLI behaviors in our new environment.
To better illustrate the problem and the steps we've taken to analyze it, we've detailed our process below:
Initial Discovery:
Issue Identified: Increased latency when making calls to the AWS SSM agent using AWS SDK for PHP (version 3.173.19).
Latency Metrics: Response times increased from 0.05 seconds in the old cluster to 1.1 seconds in the new cluster.
Connectivity Tests:
We first suspected the internet connection, but a connection speed test showed that the new cluster has better connectivity (2779 Mbit/s) than the old one (2100 Mbit/s), which suggests that internet speed is not the root cause.
AWS CLI:
To debug this further, we used the AWS CLI instead of the SDK to test whether the issue lies in the AWS APIs or in the AWS SDK. We ran this command: time aws sts get-caller-identity
Here is what we have found:
a- With AWS CLI V1 (1.11.13), responses were faster in the new cluster than in the old one (0.5 seconds vs 0.8 seconds).
b- With AWS CLI V2 (2.15.42), latency was much higher in the new cluster than in the old one (2.8 seconds vs 1.1 seconds).
We then ran the same command with the --debug option (time aws --debug sts get-caller-identity) to see what AWS CLI V2 was doing. The debug output showed that the latency comes from the connections to the IMDS: the first call fetches the region, and the second fetches the IAM role attached to the EC2 instance. In the new cluster, every IMDS call makes two attempts to open the HTTP connection: the first attempt consistently fails after exactly one second, and the second attempt succeeds. This pattern appears only in the new cluster, as shown in the attached pics.
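To check the IMDS timing outside the CLI, a probe along these lines could be run from a pod in each cluster. This is only a sketch: it assumes curl is available in the pod and that IMDS is reachable at the standard link-local address 169.254.169.254. IMDS_BASE and imds_probe are illustrative names, and the one-second limit is chosen only to mirror the failure we observed, not taken from the CLI's configuration. The probe requests an IMDSv2 session token, which may or may not be the exact request the CLI attempts first.

```shell
#!/usr/bin/env bash
# Probe the IMDSv2 token endpoint a few times and report timing per attempt.
# IMDS_BASE and imds_probe are illustrative names, not part of any AWS tool.
IMDS_BASE="${IMDS_BASE:-http://169.254.169.254}"

imds_probe() {
  # --max-time 1 mirrors the one-second failure seen in the debug log;
  # it is our choice for this sketch, not a value read from the CLI.
  curl -s -o /dev/null \
    --max-time 1 \
    -X PUT "${IMDS_BASE}/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60" \
    -w 'http=%{http_code} connect=%{time_connect}s total=%{time_total}s\n'
}

for i in 1 2 3; do
  rc=0
  imds_probe || rc=$?
  # A non-zero exit code with total close to 1s would match the failing
  # first attempt; exit code 0 with http=200 means the token was issued.
  echo "attempt ${i}: curl exit code ${rc}"
done
```

If the token request times out in the new cluster but returns immediately in the old one, that would narrow the difference down to how the pods reach IMDS rather than to the CLI itself.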
Also, specifying the --region parameter improved the response time from 2.8 to 1.8 seconds.
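The same effect should be obtainable without changing each command by setting the region in the environment. A minimal sketch, assuming the AWS_DEFAULT_REGION environment variable is honoured by the CLI version in use; the region value below is a placeholder, not our actual region.

```shell
#!/usr/bin/env bash
# Hypothetical region value; replace it with the cluster's actual region.
export AWS_DEFAULT_REGION="eu-west-1"

# With the region supplied by the environment, the CLI should not need to
# ask IMDS for it, leaving only the credentials lookup against IMDS.
if command -v aws >/dev/null 2>&1; then
  time aws sts get-caller-identity || echo "aws call failed"
fi
```

This only removes the region lookup; the remaining one-second delay on the credentials lookup would still be present.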
What we really need to know is:
1- Why do the pods in the new cluster make two HTTP connection attempts for each IMDS call when using AWS CLI V2, whereas pods in the old cluster make just one?
2- Why is this behaviour not present at all with AWS CLI V1?
3- Does the AWS SDK for PHP follow the same path as AWS CLI V2 or V1, or are those behaviours not comparable at all?