RFE: use learner mode for joining etcd members #1793

Open
fabriziopandini opened this issue Sep 17, 2019 · 62 comments
Labels: area/etcd, area/HA, kind/design, kind/feature, priority/important-longterm


@fabriziopandini
Member

fabriziopandini commented Sep 17, 2019

Growing a local etcd cluster is a complex operation, and we have already faced issues with it in the past, e.g. kubernetes-sigs/kind#588

Now that the implementation of the etcd learner mode is progressing, we should start considering whether to use it in kubeadm to make the join --control-plane implementation more robust.

At a high level, what we would like to achieve is:

  • a new etcd member should be created as a learner and become a voting member only after its etcd data is fully in sync.
  • ideally, we should also prevent the API server from reading from a learner node
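
The flow in the two bullets above can be pictured with a small in-memory simulation. This is only a sketch, not kubeadm or etcd code: the `cluster` and `member` types and every method on them are hypothetical stand-ins, and the "fully caught up" promotion rule is a simplifying assumption. In a real cluster the equivalent steps would go through the etcd client (the Go clientv3 package exposes `MemberAddAsLearner` and `MemberPromote` for this).

```go
package main

import (
	"errors"
	"fmt"
)

// errNotInSync stands in for the error a cluster returns when a learner
// is promoted before it has caught up (hypothetical name).
var errNotInSync = errors.New("learner is not in sync with the leader")

// member is a simplified view of one etcd member.
type member struct {
	isLearner bool
	applied   int // index of the last log entry this member has applied
}

// cluster is an in-memory stand-in for an etcd cluster.
type cluster struct {
	leaderIndex int
	members     map[string]*member
}

// addAsLearner registers a new member that replicates data but does not vote.
func (c *cluster) addAsLearner(name string) {
	c.members[name] = &member{isLearner: true}
}

// replicate simulates an existing member applying n more log entries.
func (c *cluster) replicate(name string, n int) {
	m := c.members[name]
	m.applied += n
	if m.applied > c.leaderIndex {
		m.applied = c.leaderIndex
	}
}

// promote turns a learner into a voting member, but only once it has caught up.
func (c *cluster) promote(name string) error {
	m, ok := c.members[name]
	if !ok {
		return fmt.Errorf("unknown member %q", name)
	}
	if m.applied < c.leaderIndex {
		return errNotInSync
	}
	m.isLearner = false
	return nil
}

// voters counts the members that take part in quorum.
func (c *cluster) voters() int {
	n := 0
	for _, m := range c.members {
		if !m.isLearner {
			n++
		}
	}
	return n
}

func main() {
	c := &cluster{leaderIndex: 100, members: map[string]*member{"cp-1": {applied: 100}}}

	c.addAsLearner("cp-2")
	fmt.Println("voters after add:", c.voters())
	fmt.Println("early promote:", c.promote("cp-2"))

	c.replicate("cp-2", 100)
	fmt.Println("late promote:", c.promote("cp-2"))
	fmt.Println("voters after promote:", c.voters())
}
```

The point of the sketch is that quorum size does not change when the learner is added, only when the promotion succeeds.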

Ref docs:


(edit by neolit123)

1.26:

1.27(alpha):

1.29(beta):

1.32(GA):

1.33:

  • TODO: remove the FG from kubeadm code
  • TODO: update the k/website "kubeadm init" page
@fabriziopandini fabriziopandini added area/HA priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. kind/feature Categorizes issue or PR as related to a new feature. area/etcd labels Sep 17, 2019
@fabriziopandini fabriziopandini added this to the v1.17 milestone Sep 17, 2019
@SataQiu
Member

SataQiu commented Sep 18, 2019

/cc

@rosti

rosti commented Sep 18, 2019

We have to be careful, but we certainly need to act on this. The plan is that from etcd 3.5 new members will be joined only as learners.
In etcd 3.5 it will be possible to use a learner node for reading, but the problem with writing remains. And, as LBs are out of scope for kubeadm, things might become a bit difficult.
We probably need to direct API servers to healthy leaders, possibly via an etcd LB. Another possibility is to not expose API servers that have a local learner etcd node through the API LB (not sure if this would actually work though).
In short, we need to experiment a bit with this to find what's viable and easy to use.
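
As a rough illustration of the "keep traffic away from learners" idea, the sketch below filters a member list down to the client URLs of voting members. It is a sketch under assumptions: `etcdMember` and `voterEndpoints` are hypothetical names, the struct only mirrors the few member fields needed here, and the member list is hard-coded rather than fetched from a live cluster.

```go
package main

import "fmt"

// etcdMember mirrors the few member fields this sketch needs (hypothetical type).
type etcdMember struct {
	Name       string
	ClientURLs []string
	IsLearner  bool
}

// voterEndpoints returns the client URLs of all non-learner members,
// i.e. the endpoints that are safe targets for reads and writes.
func voterEndpoints(members []etcdMember) []string {
	var endpoints []string
	for _, m := range members {
		if m.IsLearner {
			continue // skip members that are still catching up
		}
		endpoints = append(endpoints, m.ClientURLs...)
	}
	return endpoints
}

func main() {
	members := []etcdMember{
		{Name: "cp-1", ClientURLs: []string{"https://10.0.0.1:2379"}},
		{Name: "cp-2", ClientURLs: []string{"https://10.0.0.2:2379"}, IsLearner: true},
	}
	fmt.Println(voterEndpoints(members))
}
```

Whether such a filter would sit in an etcd LB, in the API LB health checks, or somewhere else is exactly the open question raised in the comment above.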

@RA489
Contributor

RA489 commented Oct 10, 2019

/assign

@neolit123
Member

neolit123 commented Jan 18, 2020

@prksu

prksu commented Jan 31, 2020

/cc

@ereslibre
Contributor

Some context on this: etcd-io/etcd#11640, we might want to wait for an etcd version that includes this patch.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 30, 2020
@pacoxu pacoxu modified the milestones: Next, v1.29 Aug 29, 2023
@neolit123
Member

great feedback. thank you @tobiasgiese

maybe you could provide some feedback, I think you folks still use it right?

We (Mercedes-Benz) are using it already, yes. We have also backported it to v1.2[4-6] (since kubernetes/kubernetes#115038) and it is working quite well. We have never had any problems with the learner mode, and we have a lot of nightly builds (about 50 periodic nightly builds and 40 Prow trigger builds/jobs).

@pacoxu pacoxu self-assigned this Aug 31, 2023
@neolit123 neolit123 modified the milestones: v1.29, v1.30 Nov 1, 2023
@pacoxu pacoxu modified the milestones: v1.30, Next Dec 25, 2023
@pacoxu
Member

pacoxu commented Dec 25, 2023

I suppose that we should graduate this feature later in v1.31+ and get more feedback before GA.

So no action item for v1.30.

@neolit123
Member

I suppose that we should graduate this feature later in v1.31+ and get more feedback before GA.

So no action item for v1.30.

i got contacted in slack by a person that had feedback about learner mode in kubeadm, but they never sent me the info.
learner mode was broken for them in some way.

i will see if i can message them about this after NY.

@pacoxu
Member

pacoxu commented Dec 25, 2023

One issue I can imagine is that the timeout for waiting until the learner is ready for promotion, or for the promotion itself, may be a problem.

@pacoxu
Member

pacoxu commented Jan 29, 2024

#2997 (comment)
We had a short discussion about whether we need to log the sync progress as a percentage.

  • It needs a change on the etcd side.

#2997 (comment)
Another potential improvement is adding a configurable timeout for waiting until the etcd learner is ready for promotion. There are already many timeout settings in the v1beta4 timeouts struct. (+0 for this, as 2 min should be enough for most scenarios.)

@neolit123
Member

We had a short discussion about whether we need to log the sync progress as a percentage.

ok, i don't think it's GA blocking.

Another potential improvement is adding a configurable timeout for waiting until the etcd learner is ready for promotion. There are already many timeout settings in the v1beta4 timeouts struct. (+0 for this, as 2 min should be enough for most scenarios.)

+0 as well from me.
our 2 minutes timeout will apply to all etcd client calls by default.
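
To make the timeout discussion concrete, here is a minimal sketch of retrying a promotion until it succeeds or a time budget runs out. It is not kubeadm's actual code: `promoteWithin`, `errLearnerNotReady`, and the fake promote function are hypothetical, and the 2-minute budget in `main` simply mirrors the value mentioned above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errLearnerNotReady stands in for the "not in sync yet" error (hypothetical name).
var errLearnerNotReady = errors.New("learner is not ready for promotion")

// promoteWithin calls promote every interval until it succeeds, returns a
// non-retryable error, or the timeout expires. It reports the attempts made.
func promoteWithin(timeout, interval time.Duration, promote func() error) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for attempts := 1; ; attempts++ {
		err := promote()
		if err == nil {
			return attempts, nil
		}
		if !errors.Is(err, errLearnerNotReady) {
			return attempts, err // only "not ready" is worth retrying
		}
		select {
		case <-ctx.Done():
			return attempts, fmt.Errorf("learner not promoted after %d attempts: %w", attempts, ctx.Err())
		case <-ticker.C:
			// wait one interval, then try again
		}
	}
}

func main() {
	calls := 0
	// fakePromote reports "not ready" twice before succeeding.
	fakePromote := func() error {
		calls++
		if calls < 3 {
			return errLearnerNotReady
		}
		return nil
	}

	attempts, err := promoteWithin(2*time.Minute, 10*time.Millisecond, fakePromote)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Making the timeout a parameter, as here, is what a configurable setting would amount to; with a fixed default the caller would just always pass the same value.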

@neolit123
Member

i got contacted in slack by a person that had feedback about learner mode in kubeadm, but they never sent me the info.
learner mode was broken for them in some way.

they did not log an issue...

@pacoxu
Member

pacoxu commented Jan 29, 2024

I updated the beta-related PRs in this issue description.

I think we may wait for at least another 1 or 2 release cycles for feedback before making this GA, as most users are not yet using v1.29, which made it beta and enabled by default.

@pacoxu

This comment was marked as abuse.

@neolit123
Member

40s timeout for waiting to be ready to promote a learner.

should we increase this time to 2 minutes, or more by default?

@pacoxu
Member

pacoxu commented Apr 10, 2024

https://github.com/kubernetes/kubernetes/blob/227c2e7c2b2c05a9c8b2885460e28e4da25cf558/cmd/kubeadm/app/util/etcd/etcd.go#L531-L557

It is already 2m.

I missed the log line saying that the learner was eventually promoted to a voting member successfully. Sorry for the noise.

@neolit123
Member

the flakes on https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kinder-upgrade-addons-before-controlplane-1-29-latest
seem like slow infra problems, 5 minutes should be plenty of time for a few nodes to join and be ready :/

@neolit123
Member

I think we may wait for at least another 1 or 2 release cycles for feedback before making this GA, as most users are not yet using v1.29, which made it beta and enabled by default.

@pacoxu should we GA this in 1.32?

@pacoxu
Member

pacoxu commented Jun 26, 2024

Agree
