v1.16 - HA master join failure - etcdserver: leader changed #1843

rrichardson · 2019-10-18T16:26:12Z

BUG REPORT

Versions

kubeadm version:

kubeadm version: &version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T19:15:39Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}

Environment:

Kubernetes version:

Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T19:18:23Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T19:09:08Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider or hardware configuration:
A VirtualBox VM network with 6 VMs: 3 masters and 3 workers.
OS (e.g. from /etc/os-release):
Ubuntu 16.04
Kernel (e.g. uname -a):
Linux 192-168-123-102 4.15.0-65-generic #74~16.04.1-Ubuntu SMP Wed Sep 18 09:51:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Others:

We use an automated script which uses kubeadm to spin up the first master. It then captures the relevant details to spin up 2 additional masters "simultaneously".

What happened?

Upon attempting to bring up the 3rd of 3 HA masters using kubeadm, the kubeadm join command fails with the error below. It seems pretty self-explanatory: kubeadm doesn't deal well with a leader change, and I'm guessing that the leader changes when the 2nd node joins the cluster.

We can consistently reproduce this, even if we wait a while between spinning up master #2 and master #3.

This has never occurred, to my knowledge, in version 1.14. We have spun up hundreds of clusters in 1.14.

Oct 18 16:03:17 192-168-123-102 kubeadm[6588]: [download-certs] Downloading the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
Oct 18 16:03:25 192-168-123-102 kubeadm[6588]: error execution phase control-plane-prepare/download-certs: error downloading certs: error downloading the secret: rpc error: code = Unavailable desc = etcdserver: leader changed
Oct 18 16:03:25 192-168-123-102 kubeadm[6588]: To see the stack trace of this error execute with --v=5 or higher

What you expected to happen?

I expected kubeadm join to succeed and the current node to join the HA master quorum.

How to reproduce it (as minimally and precisely as possible)?

Create a master node, collect the relevant details (token, cert hash, etc.), then use them to start 2 additional masters, as close to simultaneously as possible.
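The "as close to simultaneously as possible" part of the reproduction can be sketched roughly as below. This is an illustration only, not the script used in this report: `run_simultaneously` is a hypothetical helper, and each task stands in for whatever triggers `kubeadm join` on one of the additional masters.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def run_simultaneously(tasks: Sequence[Callable[[], T]]) -> List[T]:
    """Start all tasks at (nearly) the same instant and return their results in order.

    Hypothetical helper: each task is assumed to trigger the join on one node.
    """
    if not tasks:
        return []

    # The barrier holds every worker until all of them are ready to start.
    barrier = threading.Barrier(len(tasks))

    def start(task: Callable[[], T]) -> T:
        barrier.wait()
        return task()

    # One worker per task, so no task waits in the queue behind the barrier.
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(start, task) for task in tasks]
        return [future.result() for future in futures]
```

Each callable would wrap the join invocation for one node (for example over SSH); how that is done depends on the environment and is not shown here.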

Anything else we need to know?

You people rock. I love kubeadm.


neolit123 · 2019-10-18T22:40:13Z

> We use an automated script which uses kubeadm to spin up the first master. It then captures the relevant details to spin up 2 additional masters "simultaneously".

hi, we are seeing flakes when trying to join control-plane (CP) nodes to the cluster in parallel.
this is problematic, and until etcd releases a new version we won't be able to solve it correctly.

> This has never occurred, to my knowledge, in version 1.14. We have spun up hundreds of clusters in 1.14.

we did claim that kubeadm has this working properly in 1.15/16, but unfortunately it does not work as expected.

i don't see how it would have worked in 1.14, as the etcd member join logic had no retries.

> Oct 18 16:03:25 192-168-123-102 kubeadm[6588]: error execution phase control-plane-prepare/download-certs: error downloading certs: error downloading the secret: rpc error: code = Unavailable desc = etcdserver: leader changed

i actually haven't seen this particular error.
i'm guessing it will not happen if you add the 2 CPs serially?
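As a rough illustration of the serial approach, here is a minimal sketch that joins control-plane nodes one at a time and retries only when the output contains the transient error from this report. The helper names (`join_with_retry`, `join_serially`) and the retry parameters are assumptions, not part of kubeadm. Whether a failed `kubeadm join` can simply be re-run without cleanup is also an assumption and is not established in this thread.

```python
import time
from typing import Callable, Iterable, Sequence, Tuple

# Error text taken from the log in this report; treated here as retryable.
TRANSIENT_MARKERS = ("etcdserver: leader changed",)

# A join attempt returns (exit_code, combined_output). Hypothetical shape.
JoinFn = Callable[[], Tuple[int, str]]


def is_transient(output: str, markers: Sequence[str] = TRANSIENT_MARKERS) -> bool:
    """Return True if the output contains a known transient error."""
    return any(marker in output for marker in markers)


def join_with_retry(
    run_join: JoinFn,
    attempts: int = 5,
    delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run one join, retrying only on transient errors. Return True on success."""
    for attempt in range(1, attempts + 1):
        exit_code, output = run_join()
        if exit_code == 0:
            return True
        # Give up on non-transient errors or when attempts are exhausted.
        if not is_transient(output) or attempt == attempts:
            return False
        sleep(delay_s)
    return False


def join_serially(joins: Iterable[JoinFn], **retry_kwargs) -> int:
    """Join nodes one at a time; stop at the first failure.

    Returns the number of nodes that joined successfully.
    """
    joined = 0
    for run_join in joins:
        if not join_with_retry(run_join, **retry_kwargs):
            break
        joined += 1
    return joined
```

Each `run_join` callable would wrap the actual join command for one node and return its exit code and output. Waiting for one join to finish before starting the next avoids overlapping etcd membership changes, which is the condition suspected above.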

> You people rock. I love kubeadm.

thanks! :)

i wanted to fold this issue into:
#1793

but let's keep it open for visibility.

/kind bug
/triage support

neolit123 · 2019-10-25T22:48:17Z

/close
folding into #1793
which should hopefully solve the concurrent join problems.

k8s-ci-robot · 2019-10-25T22:48:18Z

@neolit123: Closing this issue.


k8s-ci-robot added the kind/bug and kind/support labels Oct 18, 2019

neolit123 added this to the v1.17 milestone Oct 18, 2019

neolit123 added the area/etcd label Oct 18, 2019

fabriziopandini added the triage/needs-information label Oct 23, 2019

k8s-ci-robot closed this as completed Oct 25, 2019
