
Introduce a nutanix prism client cache #425

Closed
wants to merge 2 commits into release-v1.2 from jira/krbn-8098-1.2

Conversation

Contributor

@thunderboltsid thunderboltsid commented Apr 30, 2024

During a recent incident, we observed that creating a new Nutanix client for each request means every request is authenticated with basic authentication. This places unnecessary stress on the IAM services, which was particularly problematic when those services were already in a degraded state, prolonging recovery efforts. Each basic-auth request is processed through the entire IAM stack, significantly increasing load and impacting performance.

It is recommended that the client use session-auth cookies instead of basic auth for requests to Prism Central where possible. Given how the CAPX controller currently works, a new client is created per reconcile cycle. In #418 we switched from Basic-Auth to Session-Auth. However, the switch alone does not solve the problem of recurring Basic-Auth calls: every time a client is created, i.e. every reconcile cycle, it still makes one Basic-Auth call to /users/me to fetch the session cookie. To alleviate this, we add a cache of clients and reuse the cached client across reconciliation cycles of the same cluster for both the NutanixCluster and NutanixMachine reconciliation. A sketch of the idea is shown below.
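To make the approach concrete, here is a minimal, hypothetical sketch of such a per-cluster client cache, assuming a map keyed by the NutanixCluster's namespace/name and a caller-supplied constructor. The names (`ClientCache`, `GetOrCreate`, `Delete`) are illustrative; the actual change touches `pkg/client/cache.go` and `controllers/helpers.go` (see the Codecov report below) and may differ in details:

```go
// Hypothetical sketch of a per-cluster Prism client cache.
package client

import (
	"sync"

	prismclientv3 "github.com/nutanix-cloud-native/prism-go-client/v3"
)

// ClientCache keeps one Prism v3 client per NutanixCluster so the session
// cookie obtained by the initial Basic-Auth call to /users/me is reused
// across reconcile cycles instead of being re-fetched on every reconcile.
type ClientCache struct {
	mu      sync.RWMutex
	clients map[string]*prismclientv3.Client // key: namespace/name of the NutanixCluster
}

func NewClientCache() *ClientCache {
	return &ClientCache{clients: make(map[string]*prismclientv3.Client)}
}

// GetOrCreate returns the cached client for key, calling create on a miss.
// create is expected to build the client (performing the one-time Basic-Auth
// call) from the cluster's Prism Central endpoint and credentials.
func (c *ClientCache) GetOrCreate(key string, create func() (*prismclientv3.Client, error)) (*prismclientv3.Client, error) {
	c.mu.RLock()
	cached, ok := c.clients[key]
	c.mu.RUnlock()
	if ok {
		return cached, nil
	}

	c.mu.Lock()
	defer c.mu.Unlock()
	if cached, ok := c.clients[key]; ok { // re-check after acquiring the write lock
		return cached, nil
	}
	client, err := create()
	if err != nil {
		return nil, err
	}
	c.clients[key] = client
	return client, nil
}

// Delete evicts the client for a cluster, e.g. when the NutanixCluster is
// deleted or its credentials change.
func (c *ClientCache) Delete(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.clients, key)
}
```

The double-checked locking keeps client creation, and therefore the Basic-Auth call to /users/me, to at most one per cluster even if the NutanixCluster and NutanixMachine reconcilers race on a cold cache.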

[Screenshot 2024-05-03 at 15:16:31]

In a large-scale setup of 40+ clusters with 4 nodes each, we saw a noticeable drop in QPS to the IAM stack for oidc/token calls. Before client caching, a controller restart led to 10+ QPS on the oidc/token endpoint, with a steady state of around 0.5 QPS. After deploying the client-cache changes, we saw a peak of ~3 QPS while the caches warmed up, which then dropped to 0 QPS with only sporadic requests when a session-token refresh was needed. With the changes proposed here, we significantly reduced the number of high-impact calls to IAM.

@nutanix-cn-prow-bot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@thunderboltsid thunderboltsid marked this pull request as ready for review May 2, 2024 18:39
@thunderboltsid thunderboltsid requested review from adiantum and removed request for tuxtof May 2, 2024 18:40

codecov bot commented May 2, 2024

Codecov Report

Attention: Patch coverage is 36.20690%, with 37 lines in your changes missing coverage. Please review.

Project coverage is 8.57%. Comparing base (52f3f8c) to head (88bb02f).

Files Patch % Lines
controllers/helpers.go 47.72% 23 Missing ⚠️
controllers/nutanixcluster_controller.go 0.00% 4 Missing ⚠️
controllers/nutanixmachine_controller.go 0.00% 4 Missing ⚠️
api/v1beta1/nutanixcluster_types.go 0.00% 2 Missing ⚠️
pkg/client/cache.go 0.00% 2 Missing ⚠️
pkg/context/context.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@               Coverage Diff               @@
##           release-v1.2    #425      +/-   ##
===============================================
- Coverage          9.62%   8.57%   -1.05%     
===============================================
  Files                10      22      +12     
  Lines              1340    2017     +677     
===============================================
+ Hits                129     173      +44     
- Misses             1211    1844     +633     


@thunderboltsid thunderboltsid requested a review from dkoshkin May 2, 2024 19:10
@thunderboltsid thunderboltsid force-pushed the jira/krbn-8098-1.2 branch 5 times, most recently from 62f4c24 to 519b8f4 Compare May 2, 2024 20:31
@thunderboltsid
Contributor Author

/retest

The cache stores a prismgoclient.V3 client instance for each NutanixCluster instance.
The cache is shared between nutanixcluster and nutanixmachine controllers.
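
As a rough, self-contained illustration of this sharing (the reconciler types and the buildClient helper below are stand-ins, not the actual CAPX code), both reconcilers hold the same cache instance and key lookups by the NutanixCluster's namespace/name:

```go
package main

import "fmt"

// prismV3Client stands in for the prism-go-client v3 client type.
type prismV3Client struct{ endpoint string }

// clientCache is a reduced version of the cache sketched earlier
// (mutex omitted for brevity).
type clientCache struct {
	clients map[string]*prismV3Client
}

// getOrCreate returns the cached client for key, calling create on a miss.
func (c *clientCache) getOrCreate(key string, create func() (*prismV3Client, error)) (*prismV3Client, error) {
	if cached, ok := c.clients[key]; ok {
		return cached, nil
	}
	created, err := create()
	if err != nil {
		return nil, err
	}
	c.clients[key] = created
	return created, nil
}

// Both reconcilers hold a reference to the same cache.
type nutanixClusterReconciler struct{ cache *clientCache }
type nutanixMachineReconciler struct{ cache *clientCache }

func main() {
	shared := &clientCache{clients: map[string]*prismV3Client{}}
	clusterReconciler := nutanixClusterReconciler{cache: shared}
	machineReconciler := nutanixMachineReconciler{cache: shared}

	// buildClient is where the one-time Basic-Auth call to /users/me would
	// happen to obtain the session cookie.
	buildClient := func() (*prismV3Client, error) {
		fmt.Println("creating client (Basic-Auth happens only here)")
		return &prismV3Client{endpoint: "prism-central.example.com:9440"}, nil
	}

	key := "default/my-cluster" // namespace/name of the NutanixCluster
	c1, _ := clusterReconciler.cache.getOrCreate(key, buildClient) // creates and caches
	c2, _ := machineReconciler.cache.getOrCreate(key, buildClient) // reuses the cached client
	fmt.Println(c1 == c2, c1.endpoint)                             // true prism-central.example.com:9440
}
```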
@thunderboltsid thunderboltsid force-pushed the jira/krbn-8098-1.2 branch 3 times, most recently from 7be47a4 to f3adaa1 Compare May 3, 2024 12:32
@thunderboltsid
Contributor Author

/retest

@thunderboltsid
Contributor Author

/retest

This is to ensure newer versions of interfaces from SharedIndexInformers
don't cause compile failures.

Update go version to v1.22 because cmp.Or is only available in go v1.22 (see the short example after these commit messages)

update prism-go-client
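
For context on the Go 1.22 requirement mentioned in the commit message above: cmp.Or is a standard-library helper added in Go 1.22 that returns the first of its arguments that is not the zero value, e.g.:

```go
package main

import (
	"cmp"
	"fmt"
	"os"
)

func main() {
	// Fall back to a default when the environment variable is unset:
	// cmp.Or returns the first non-zero (here: non-empty) value.
	endpoint := cmp.Or(os.Getenv("PRISM_ENDPOINT"), "prism-central.example.com:9440")
	fmt.Println(endpoint)
}
```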

github-actions bot commented May 3, 2024

✅ None of your dependencies violate policy!

@thunderboltsid
Contributor Author

/retest

@thunderboltsid
Contributor Author

Closed in favor of #430

@thunderboltsid thunderboltsid deleted the jira/krbn-8098-1.2 branch May 8, 2024 21:26