Concurrent CP join race condition for etcd join #2005
It's worth noting that the following is speculation:
It's also worth noting that we're not sure what the contract is with etcd for the names of the active peers returned from the MemberAdd call.
Finally, it's worth noting that we do not yet have verbose logging for the issue, and that kubeadm's current logging will not allow us to directly confirm this hypothesis, as the actual values returned by etcd are never logged at any level.
thanks for logging the issue.
/assign
@echu23 thanks for the detailed analysis. I see a possible fix for this by changing the loop on resp.Members:
With this change we are going to discard any member with an empty name and a PeerUrl different from the peer URL of the joining node (192.168.0.2:2380 in your example). If the resulting list has two members instead of three, it should not be a problem for the new etcd node, because it will be informed about the third member by the raft protocol after joining. If instead the resulting list has one member, this will be a problem, but according to your analysis this can't happen (and it seems reasonable to me, because join starts only after the first control plane/etcd is up and running). @echu23 @neolit123 opinions? If we agree on the solution I can send a patch so we can check that this does not introduce regressions, but for the real parallel-join test we should rely on @echu23's infrastructure
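The snippet itself did not survive in this copy of the thread; a minimal sketch of what the proposed filter could look like, reusing the Member type and the name/peerAddrs arguments from kubeadm's AddMember (an illustration of the idea, not the actual patch):

// Inside AddMember, after the MemberAdd call:
ret := []Member{}
for _, m := range resp.Members {
	if m.Name == "" {
		// An unnamed member was added but has not started yet. Keep it
		// only if it is the node joining right now; any other unnamed
		// member is a concurrent joiner and is discarded here (raft
		// will inform the new node about it after the join).
		if m.PeerURLs[0] == peerAddrs {
			ret = append(ret, Member{Name: name, PeerURL: peerAddrs})
		}
		continue
	}
	ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURLs[0]})
}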
can also be:
from my understanding @fabriziopandini's proposal can fix this problem. after this change, if the code is still flaky for other reasons, we have the option to fully serialize the etcd member add with a lock managed by kubeadm. the alternative is to just wait until we move to etcd 3.5 and see how stable the learners are.
but something to note about etcd learners is that they are currently limited to a maximum of 1, and it is not clear when multiple learners will be supported:
etcdadm is managing this slightly differently: after the "add" call, it obtains the current member and the other members from the response.
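That is workable because etcd's clientv3 MemberAddResponse reports both the member that was just added (Member) and the full member list (Members), so the joining node can be identified by ID instead of by an empty name. A rough sketch of that style of matching (not etcdadm's actual code):

resp, err := cli.MemberAdd(ctx, []string{peerAddrs})
if err != nil {
	return nil, err
}
ret := []Member{}
for _, m := range resp.Members {
	if m.ID == resp.Member.ID {
		// The member we just added: give it the local name.
		ret = append(ret, Member{Name: name, PeerURL: peerAddrs})
		continue
	}
	// Note: a concurrently added member would still show up here
	// with an empty Name, so it needs separate handling.
	ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURLs[0]})
}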
Just another quick question to confirm the intention of the code. So from this
Does it expect that
I am asking because I am trying to reproduce this issue using kind, and I was not able to reproduce it. My reproduction flow using kind is this:
So 2 questions here:
as you pointed out with the discoveries in the OP, it seems like more than one member without a name can occur in resp.Members.
might be easier with kinder: build kinder using Go >= 1.13
and call:
this will give you a set of commands.
you seem to have a setup to envy.
@neolit123 @echu23 if I got this thread right, we agree on the proposed change? Can I send the patch (with the @neolit123 variant)?
So the proposed change is to discard the member if the member has no Name?
the way etcdadm does the matching with member IDs is overall better: but i don't see when matching
yes.
if the 2nd and 3rd members join at the same time, with the proposed patch the initial cluster on the second CP will end up with only the first and 2nd member. but because this is racy, i think the same can happen for the 3rd, so the etcd bootstrap can look like this:
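The illustration that followed was lost in this copy; a plausible reconstruction from the description above, reusing the peer URLs from the OP's example (values assumed):

# on the 2nd CP (the unnamed, concurrently joining 3rd member gets discarded):
--initial-cluster=etcd-0=https://192.168.0.1:2380,etcd-1=https://192.168.0.2:2380
# on the 3rd CP (the unnamed, not-yet-started 2nd member gets discarded):
--initial-cluster=etcd-0=https://192.168.0.1:2380,etcd-2=https://192.168.0.3:2380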
Ok so this
Will form a healthy cluster, right? I'm totally fine with this.
i think it has side effects. i can try simulating this tomorrow. if the initial cluster does not matter, we can always do this in kubeadm:
yet something is telling me it's not right. also i would like to bring something from the etcd docs:
i.e. concurrent join is by design not supported...
Yeah, we are aware that concurrent etcd join is not supported by design.
as the docs say:
so if the 2nd fails, the 3rd should not be added.
i think it's valid. will double check this tomorrow:
i experimented and observed the following: i sequentially joined two kubeadm CP nodes to an existing CP node, with a modified version of kubeadm. the modified version simulated the case of two members not having names:
it makes it so that CP3 only includes the 1st and 3rd member.
so it has to include all members:
i tried investigating what is considered a valid
thus i ended up with the following change:
this ends up doing the following:
the list of members still respects the
so this is my proposal to work around the problem, but it introduces a minor change in the way kubeadm writes the etcd.yaml. also, i've only tested this with a total of 5 etcd members, 4 joining concurrently. however, on my setup i'm seeing other problems, which happen quite often:
the following patch makes the 4-member concurrent join more reliable:

diff --git a/cmd/kubeadm/app/util/etcd/etcd.go b/cmd/kubeadm/app/util/etcd/etcd.go
index 9d3c6be046b..0e5ad4434e8 100644
--- a/cmd/kubeadm/app/util/etcd/etcd.go
+++ b/cmd/kubeadm/app/util/etcd/etcd.go
@@ -38,11 +38,11 @@ import (
 	"k8s.io/kubernetes/cmd/kubeadm/app/util/config"
 )
 
-const etcdTimeout = 2 * time.Second
+const etcdTimeout = 20 * time.Second
 
 // Exponential backoff for etcd operations
 var etcdBackoff = wait.Backoff{
-	Steps:    9,
+	Steps:    16,
 	Duration: 50 * time.Millisecond,
 	Factor:   2.0,
 	Jitter:   0.1,
@@ -130,7 +130,7 @@ func NewFromCluster(client clientset.Interface, certificatesDir string) (*Client
 // dialTimeout is the timeout for failing to establish a connection.
 // It is set to 20 seconds as times shorter than that will cause TLS connections to fail
 // on heavily loaded arm64 CPUs (issue #64649)
-const dialTimeout = 20 * time.Second
+const dialTimeout = 40 * time.Second
 
 // Sync synchronizes client's endpoints with the known endpoints from the etcd membership.
 func (c *Client) Sync() error {
@@ -303,12 +303,11 @@ func (c *Client) AddMember(name string, peerAddrs string) ([]Member, error) {
 	// Returns the updated list of etcd members
 	ret := []Member{}
 	for _, m := range resp.Members {
-		// fixes the entry for the joining member (that doesn't have a name set in the initialCluster returned by etcd)
-		if m.Name == "" {
-			ret = append(ret, Member{Name: name, PeerURL: m.PeerURLs[0]})
-		} else {
-			ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURLs[0]})
+		if peerAddrs == m.PeerURLs[0] {
+			ret = append(ret, Member{Name: name, PeerURL: peerAddrs})
+			continue
 		}
+		ret = append(ret, Member{Name: strconv.FormatUint(m.ID, 16), PeerURL: m.PeerURLs[0]})
 	}
 
 	// Add the new member client address to the list of endpoints
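Two small notes on the hunks above: the new code calls strconv.FormatUint, so the complete patch presumably also adds strconv to the import block (that hunk is not shown here), and the base-16 formatting produces exactly the hex member-ID form that etcdctl member list prints. A tiny self-contained illustration (the ID value is taken from the error log in this issue):

package main

import (
	"fmt"
	"strconv"
)

func main() {
	// One of the member IDs from the "member count is unequal" log above.
	var id uint64 = 0x6d631ff1c84da117
	// This is the fallback name the patched loop would write into
	// --initial-cluster for a member whose peer URL does not match.
	fmt.Println(strconv.FormatUint(id, 16)) // prints: 6d631ff1c84da117
}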
Yes, this proposal is what we had thought of as well: there only need to be 3 unique Names. We are ok with this change if no regression is introduced. Thanks for the quick turnaround.
@neolit123 I'm ok as well, but if possible I would apply the random name only if the name is empty (so the created manifest remains unchanged for all cases except the race condition discussed in this thread)
@fabriziopandini it's unfortunate that, given the nature of etcd bootstrap, our etcd.yaml is overall non-deterministic WRT the member names in --initial-cluster
/lifecycle active
PR is here kubernetes/kubernetes#87505
This issue has been reported to upstream kubeadm: kubernetes/kubeadm#2005
And this patch implements the suggested workaround.
As mentioned in PR2487240, this issue happens very rarely, in a very small gap where the 3rd member is performing etcd join while the 2nd member had joined but not yet fully started. The symptom is that only the 3rd member's etcd container does not come up, complaining that the member count is unequal.
The root cause is that when kubeadm is performing etcd member join, it calls the etcd client's MemberAdd function with the local peerURL, and this function returns a list of etcd Members. kubeadm expects that there is 1 and ONLY 1 Member in this list without the Name field set, namely the currently-being-added member, and it then inserts the provided local name into the Name field of those Members that don't have a Name.
However, this issue happens because somehow more than 1 Member has no Name. In this case, the kubeadm code incorrectly inserted the local name into all of those unnamed Members. Therefore, the resulting etcd manifest contains an --initial-cluster value with duplicate Names: in a 3-node etcd cluster, the incorrect manifest will have 2 unique Names and 3 unique peerURLs. When the local etcd starts, it looks at --initial-cluster and thinks there are only 2 members, due to the duplicate Names, while there are actually 3 members, hence the "member count is unequal" issue.
The fix here is to assign the etcd ID to Name if the member has no Name and its peerURL does not match.
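A minimal sketch of the loop that the quoted fix describes: keep existing names, give the joining node the provided name, and fall back to the hex member ID only for unnamed members whose peerURL does not match (the Member type and the name/peerAddrs/resp identifiers follow kubeadm's AddMember as discussed above; an illustration, not the verbatim patch):

ret := []Member{}
for _, m := range resp.Members {
	switch {
	case m.PeerURLs[0] == peerAddrs:
		// The joining member itself: use the name kubeadm was given.
		ret = append(ret, Member{Name: name, PeerURL: peerAddrs})
	case m.Name == "":
		// An added-but-unstarted concurrent joiner: use its hex member
		// ID so the --initial-cluster names stay unique.
		ret = append(ret, Member{Name: strconv.FormatUint(m.ID, 16), PeerURL: m.PeerURLs[0]})
	default:
		// A started member keeps the name it registered with.
		ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURLs[0]})
	}
}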
What keywords did you search in kubeadm issues before filing this one?
etcd join race
I did find #1319 #2001 #1793
Is this a BUG REPORT or FEATURE REQUEST?
BUG REPORT
Versions
kubeadm version (use kubeadm version):
kubeadm version: &version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.4-1+d5ee6cddf7e896", GitCommit:"d5ee6cddf7e896fb8556cad24a610df657ecd824", GitTreeState:"clean", BuildDate:"2019-10-03T22:18:19Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Environment:
kubectl version:
Client Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.4-1+d5ee6cddf7e896", GitCommit:"d5ee6cddf7e896fb8556cad24a610df657ecd824", GitTreeState:"clean", BuildDate:"2019-10-03T22:19:15Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.4-1+d5ee6cddf7e896", GitCommit:"d5ee6cddf7e896fb8556cad24a610df657ecd824", GitTreeState:"clean", BuildDate:"2019-10-03T22:16:41Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
NAME="VMware Photon OS"
VERSION="3.0"
ID=photon
VERSION_ID=3.0
PRETTY_NAME="VMware Photon OS/Linux"
uname -a:
Linux 42332700031c0a2cfd7ef78ebc8af387 4.19.84-1.ph3-esx #1-photon SMP Tue Nov 19 00:39:50 UTC 2019 x86_64 GNU/Linux
This should happen in any env if our analysis is valid.
What happened?
For setting up a 3-node (or multi-node) cluster, we understand that etcd learner support is coming, but we can't refactor our code for now to adopt it. And we really need to do this concurrently, so we're wondering if any workaround/suggestion can be offered.
We use kubeadm to bootstrap a 3-CP cluster; however, we need to inject some customization along the way, so we are calling the join phases one by one instead of simply
kubeadm join
This issue happens only when adding the 3rd member.
When we call this command on the 3rd node
kubeadm join phase control-plane-join etcd
Very rarely we observed that the generated etcd manifest (/etc/kubernetes/manifests/etcd.yaml) has an incorrect --initial-cluster value.
Assuming etcd-0 is the first member, etcd-1 the second, and etcd-2 the third, a correct --initial-cluster value for etcd-2 might look like this:
value for etcd-2 might look like this--initial-cluster=etcd-0=https://192.168.0.1:2380,etcd-1=https://192.168.0.2:2380,etcd-2=https://192.168.0.3:2380
However, in this rare case, we are getting something like this:
--initial-cluster=etcd-0=https://192.168.0.1:2380,etcd-2=https://192.168.0.2:2380,etcd-2=https://192.168.0.3:2380
Basically the name of etcd-1 was incorrectly configured as etcd-2; this incorrect manifest results in the etcd container failing to start with this complaint:
etcdmain: error validating peerURLs {"ClusterID":"31c63dd3d7c3da6a","Members":[{"ID":"1b98ed58f9be3e7d","RaftAttributes":{"PeerURLs":["https://192.168.0.2:2380"]},"Attributes":{"Name":"etcd-1","ClientURLs":["https://192.168.0.2:2379"]}},{"ID":"6d631ff1c84da117","RaftAttributes":{"PeerURLs":["https://192.168.0.3:2380"]},"Attributes":{"Name":"","ClientURLs":[]}},{"ID":"f0c11b3401371571","RaftAttributes":{"PeerURLs":["https://192.168.0.1:2380"]},"Attributes":{"Name":"etcd-0","ClientURLs":["https://192.168.0.1:2379"]}}],"RemovedMemberIDs":[]}: member count is unequal\n","stream":"stderr","time":"2020-01-08T17:27:52.63704563Z"}
We think this error message appears because the manifest --initial-cluster has only 2 unique names while the etcd cluster actually has 3 members.
We spent some time tracing the code to see what could be the issue, and we have a theory.
Calling
kubeadm join phase control-plane-join etcd
Then the above command calls this
It then calls etcdClient.AddMember()
func (c *Client) AddMember(name string, peerAddrs string)
Here name is the current master's Name and peerAddrs is the current master's peerURL. Then in
L290: resp, err = cli.MemberAdd(ctx, []string{peerAddrs})
it calls the real MemberAdd, which will return a []Member that includes the currently-being-added one. So the response of this MemberAdd() will have all previous members plus the current member. Once AddMember() receives the response
Here resp is the response from MemberAdd() as described above, and this section inserts the given Name for the members that do not have a Name. We think the expectation is that the currently-being-added member is the only one without a Name: the code loops over resp.Members, finds the member that does not have a Name, and sets name as that member's Name.
But in our case resp.Members returned 3 members (because this happens on the 3rd member), and somehow 2 of them did not have a Name, because the 2nd member had just joined the etcd cluster but its etcd container was still coming up. In this situation, "etcdctl member list" would return something like:
cat ../../../../commands/etcdctl_member-list.txt
1b98ed58f9be3e7d, started, etcd-0, https://20.20.0.37:2380, https://192.168.0.1:2379
6d631ff1c84da117, unstarted, , https://192.168.0.2:2380, <-- this is etcd-1, but not started yet so no Name
In this case, there are 2 out of 3 Members that do not have a Name, hence the above for loop inserted the 3rd Name (etcd-2) into both the 2nd and the 3rd Member.
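To make the failure concrete, here is a small self-contained program (a hypothetical reconstruction, with member data modeled on the etcdctl output above) that mimics the old loop and prints the broken --initial-cluster value with a duplicated name:

package main

import (
	"fmt"
	"strings"
)

type member struct {
	name    string
	peerURL string
}

func main() {
	// MemberAdd response as seen by the 3rd node: the 2nd member has
	// joined but not started, so its Name is still empty, just like
	// the Name of the node being added right now.
	members := []member{
		{"etcd-0", "https://192.168.0.1:2380"},
		{"", "https://192.168.0.2:2380"}, // etcd-1, unstarted
		{"", "https://192.168.0.3:2380"}, // the joining node
	}
	name := "etcd-2" // local name passed in by the 3rd node

	// The old kubeadm loop: every member with an empty name gets the
	// local name, so both unnamed members become "etcd-2".
	parts := []string{}
	for _, m := range members {
		if m.name == "" {
			m.name = name
		}
		parts = append(parts, m.name+"="+m.peerURL)
	}
	// Prints the broken value with "etcd-2" appearing twice:
	// --initial-cluster=etcd-0=...0.1:2380,etcd-2=...0.2:2380,etcd-2=...0.3:2380
	fmt.Println("--initial-cluster=" + strings.Join(parts, ","))
}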
We concluded that this issue only happens if the 3rd member runs MemberAdd while the 2nd Member is not yet started, which is considered racy.
For this ticket, we want to understand:
What you expected to happen?
The generated etcd manifest should have the correct --initial-cluster value.
How to reproduce it (as minimally and precisely as possible)?
Note that this happens really rarely; the frequency is probably 1/1000.
Anything else we need to know?