Support dynamic number of proxy-servers #273

Open
jkh52 opened this issue Oct 15, 2021 · 19 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@jkh52
Contributor

jkh52 commented Oct 15, 2021

Feature Request: support clusters with dynamic number of proxy-servers.

Example use case: gracefully add a second control plane node to a cluster that currently has a single control plane node.

Current state:

  • proxy-server takes a static --server-count flag that is fixed for the life of the process.
  • proxy-agent learns the count from a Connect RPC response and caches the first-seen value for the life of the process.

Proxy Agent
We need proxy-agent syncOnce() to stop short-circuiting as aggressively.

Option 1: add a new server RPC GetServerCount(); call it at the top of syncOnce().

This seems logically the cleanest, but a big downside is the additional authenticated call it requires (compared with Connect).

Option 2: have syncOnce() keep retrying at some lower rate, even when already connected to the last-seen server count.
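For illustration only, here is a minimal Go sketch of the Option 2 shape; the clientSet type, field names, and probe intervals are assumptions, not the actual proxy-agent code:

```go
package main

import (
	"log"
	"time"
)

// clientSet is a stand-in for the proxy-agent's connection bookkeeping.
type clientSet struct {
	serverCount int                 // last count learned from a Connect() response
	conns       map[string]struct{} // open connections keyed by server_id
}

// syncOnce dials another proxy-server if we hold fewer connections than the
// last-seen server count, and reports whether we appear fully connected.
func (cs *clientSet) syncOnce() bool {
	if len(cs.conns) >= cs.serverCount {
		return true
	}
	// ... dial, read server_count and server_id from the Connect() response,
	// then update cs.serverCount and cs.conns ...
	return false
}

// syncForever is the Option 2 loop: rather than short-circuiting forever once
// fully connected, keep calling syncOnce() at a slower probe rate so a newly
// added proxy-server is eventually discovered.
func (cs *clientSet) syncForever() {
	const (
		fastProbe = 5 * time.Second // while under-connected
		slowProbe = 5 * time.Minute // while apparently fully connected
	)
	for {
		interval := fastProbe
		if cs.syncOnce() {
			interval = slowProbe
		}
		time.Sleep(interval)
	}
}

func main() {
	cs := &clientSet{serverCount: 1, conns: map[string]struct{}{}}
	log.Println("starting agent sync loop")
	cs.syncForever()
}
```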

Proxy Server

We could keep the --server-count flag but add a mutually exclusive --server-count-file to support a dynamic config value (avoiding a process restart).

Any other suggestions or example patterns for this side?
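For concreteness, a minimal sketch of what the hypothetical --server-count-file flag could look like, assuming the file simply contains an integer that the server re-reads on demand (the flag name and helper are illustrative, not an existing proxy-server API):

```go
package main

import (
	"flag"
	"log"
	"os"
	"strconv"
	"strings"
)

var serverCountFile = flag.String("server-count-file", "", "path to a file containing the current proxy-server count")

// currentServerCount re-reads the count from the file on every call; a caller
// would fall back to the previously known value if the file is momentarily
// missing or malformed.
func currentServerCount() (int, error) {
	raw, err := os.ReadFile(*serverCountFile)
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(strings.TrimSpace(string(raw)))
}

func main() {
	flag.Parse()
	n, err := currentServerCount()
	if err != nil {
		log.Fatalf("reading server count: %v", err)
	}
	log.Printf("current server count: %d", n)
}
```

Whatever writes the file (bootstrap tooling or an operator, for example) could then change the count without restarting the server process.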

Expected Behavior

Add one master node:

  • proxy-agent will notice and establish connections within a predictable / deterministic time.
  • tunnels through existing proxy-server(s) should not be interrupted.

Subtract one master node:

  • we cannot avoid disrupting some tunnels in this case.
  • proxy-server process shutdown is the mechanism that leads proxy-agent to clean up old connections (via keepalive).
  • proxy-agent may briefly see a lower server count value than the number of unique server_ids it has observed. This shouldn't cause it to panic or trigger any cleanup.

STATUS UPDATE (May 2023):

The proxy-agent has implemented Option 2, but there is some discussion at #358 in favor of Option 1.

The proxy-server does not yet support a dynamic count; there is not yet a design consensus. In a recent community meeting there was some discussion of using kubernetes/enhancements#1965, but it was pointed out that kube-apiserver is not necessarily 1:1 with konnectivity-server. However, a similar implementation could be used (summary: introduce konnectivity-server leases with a TTL; a given server can then count the unexpired leases to get a current server count).
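As a rough illustration of the lease-counting idea (not a settled design), each konnectivity-server would hold a coordination.k8s.io/v1 Lease, and any server could derive the current count by listing those leases and keeping only the unexpired ones. The namespace and label selector below are assumptions:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// countLiveServers lists konnectivity-server leases and counts those whose
// TTL has not yet expired.
func countLiveServers(ctx context.Context, client kubernetes.Interface) (int, error) {
	leases, err := client.CoordinationV1().Leases("kube-system").List(ctx, metav1.ListOptions{
		LabelSelector: "k8s-app=konnectivity-server", // assumed label
	})
	if err != nil {
		return 0, err
	}
	now := time.Now()
	count := 0
	for _, l := range leases.Items {
		if l.Spec.RenewTime == nil || l.Spec.LeaseDurationSeconds == nil {
			continue
		}
		expiry := l.Spec.RenewTime.Add(time.Duration(*l.Spec.LeaseDurationSeconds) * time.Second)
		if expiry.After(now) {
			count++ // lease is still within its TTL
		}
	}
	return count, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	n, err := countLiveServers(context.Background(), kubernetes.NewForConfigOrDie(cfg))
	if err != nil {
		panic(err)
	}
	fmt.Println("live konnectivity-server count:", n)
}
```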

@jkh52
Contributor Author

jkh52 commented Oct 15, 2021

Soliciting feedback.

/assign @cheftako
/assign @caesarxuchao

@jtherin

jtherin commented Nov 2, 2021

but add a mutually exclusive --server-count-file to support a dynamic config value (avoid restarting the process).

Why not a ConfigMap?

jkh52 added a commit to jkh52/apiserver-network-proxy that referenced this issue Nov 5, 2021
This is a minimal proxy-agent half of kubernetes-sigs#273.

It is already useful in the scenario where proxy-servers are restarted.
@jkh52
Contributor Author

jkh52 commented Nov 5, 2021

Why not a ConfigMap ?

Assuming you mean: the proxy-server uses a ConfigMap (set by a cluster admin or bootstrapping infra) to learn the accurate ServerCount. This is unfortunate / inconsistent because the value would then be potentially visible to the whole cluster (even the proxy-agent), and we already have the Connect() RPC protocol for the agent to learn the server count.

@zqzten
Member

zqzten commented Jan 11, 2022

+1 for this feature

For the server side, I wonder if we can find a way for servers to communicate with each other to dynamically determine the replica count rather than depending on explicit external config. This would be super helpful for HPA.

@janiskemper

+1 - would be very useful for clusters managed by Cluster API

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jul 6, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Aug 5, 2022
@zqzten
Member

zqzten commented Aug 25, 2022

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label Aug 25, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Nov 23, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Dec 23, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot closed this as not planned Jan 22, 2023
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jkh52
Contributor Author

jkh52 commented Apr 3, 2023

/lifecycle frozen

@jkh52 reopened this Apr 3, 2023
@k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/rotten label Apr 3, 2023
@jkh52
Contributor Author

jkh52 commented May 26, 2023

The current status:

Proxy Agent

Has implemented Option 2. As noted at #358, there are associated server log errors, and some opinions prefer Option 1 instead. (I'm open to revisiting; if we do, my main concerns are: A. backward compatibility for {new agent, old server}, and B. permissions: would GetServerCount also require agent token review, or would it be widely visible?)

Proxy Server

Does not yet support a dynamic count; there is no clear best approach yet.

In a recent community meeting there was some discussion of using kubernetes/enhancements#1965, but it was pointed out that kube-apiserver is not necessarily 1:1 with konnectivity-server. However, a similar implementation could be used (introduce konnectivity-server leases with a TTL).

@jkh52
Contributor Author

jkh52 commented Apr 17, 2024

This was discussed in the cloud-provider + apiserver-network-proxy OSS sync this morning. See the meeting notes for April 17, 2024.

In particular:

  • It would be best to write a design doc (including alternatives) for the remaining work.
  • In the "lease" approach, in case RBAC or similar prevents konnectivity-server from reading/writing leases, it may be prudent to add a "minimum server count" flag (see the sketch after this list).
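A tiny sketch of how that safeguard could combine with a lease-derived count; the function and flag semantics are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
)

// effectiveServerCount applies the "minimum server count" safeguard: if lease
// reads fail (for example, RBAC forbids them) or return an implausibly low
// number, report the configured floor to agents instead.
func effectiveServerCount(leaseCount int, leaseErr error, minServerCount int) int {
	if leaseErr != nil || leaseCount < minServerCount {
		return minServerCount
	}
	return leaseCount
}

func main() {
	fmt.Println(effectiveServerCount(0, errors.New("leases are forbidden"), 3)) // 3
	fmt.Println(effectiveServerCount(5, nil, 3))                                // 5
}
```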

@carreter
Contributor

carreter commented Jun 5, 2024

Hey all! I'm interning at Google this summer under @avrittrohwer. This issue will be my main project.

I'll be drawing up a design doc over the next couple of days! Are there any major considerations other than the ones mentioned so far here and in #358?

@carreter
Contributor

Design doc is ready! Here's the Google Doc.

The general idea is to have each proxy server publish a lease to the k8s apiserver and count the number of valid leases to determine the current server count, which it will then return to the agent via the gRPC Connect() call. A future iteration will have the agent directly read the leases from the apiserver and determine the count that way.
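A sketch of what the publishing side could look like under that design, using client-go to create and periodically renew a Lease named after the server ID. The namespace, label, TTL, and renewal interval below are illustrative assumptions, not the design doc's final values:

```go
package main

import (
	"context"
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/klog/v2"
)

func strPtr(s string) *string { return &s }
func int32Ptr(i int32) *int32 { return &i }

// publishLease creates this server's Lease if it does not exist, then keeps
// bumping RenewTime so other servers count it as live.
func publishLease(ctx context.Context, client kubernetes.Interface, serverID string) {
	leases := client.CoordinationV1().Leases("kube-system")
	for {
		now := metav1.NewMicroTime(time.Now())
		lease, err := leases.Get(ctx, serverID, metav1.GetOptions{})
		switch {
		case apierrors.IsNotFound(err):
			_, err = leases.Create(ctx, &coordinationv1.Lease{
				ObjectMeta: metav1.ObjectMeta{
					Name:   serverID,
					Labels: map[string]string{"k8s-app": "konnectivity-server"}, // assumed label
				},
				Spec: coordinationv1.LeaseSpec{
					HolderIdentity:       strPtr(serverID),
					LeaseDurationSeconds: int32Ptr(30),
					RenewTime:            &now,
				},
			}, metav1.CreateOptions{})
		case err == nil:
			lease.Spec.RenewTime = &now
			_, err = leases.Update(ctx, lease, metav1.UpdateOptions{})
		}
		if err != nil {
			klog.Errorf("publishing lease: %v", err)
		}
		time.Sleep(10 * time.Second) // renew well inside the 30s TTL
	}
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	publishLease(context.Background(), kubernetes.NewForConfigOrDie(cfg), "proxy-server-0")
}
```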

Feel free to drop any comments or suggestions you have on the doc!

@carreter
Contributor

At the most recent KNP meeting (7/10/24), @cheftako brought up that it will likely be necessary for us to roll out a way for KNP servers to manage their own leases regardless of whether we shorten the apiserver lease duration.

Would anyone be able to help with this?

@carreter
Contributor

Fixed by #643!
