Antipattern in status checks #186
Comments
There are two pieces of information:
If the cluster is in quorum:
If quorum is absent, but the cluster is known to have been bootstrapped (
Paths to recovery from majority failure are outside the scope of this issue and can be considered in a future release.
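Purely as an illustrative sketch (not the operator's actual code), the two pieces of information above could be gathered roughly like this in Go; the helper name, the cluster-state configmap naming convention, and the probe key are all assumptions:

```go
package reconcile

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	ctrlclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// clusterFacts holds the two pieces of information discussed above.
type clusterFacts struct {
	InQuorum     bool // can the cluster serve a linearizable read right now?
	Bootstrapped bool // is there durable evidence the cluster was initialized before?
}

func observeCluster(ctx context.Context, kube ctrlclient.Client, ns, name string, endpoints []string) (clusterFacts, error) {
	var facts clusterFacts

	// "Bootstrapped" is inferred from the presence of the cluster-state
	// configmap; the "-cluster-state" suffix is a placeholder convention.
	var cm corev1.ConfigMap
	err := kube.Get(ctx, types.NamespacedName{Namespace: ns, Name: name + "-cluster-state"}, &cm)
	switch {
	case err == nil:
		facts.Bootstrapped = true
	case !apierrors.IsNotFound(err):
		return facts, err
	}

	// "In quorum" is probed against etcd itself: a linearizable read can only
	// succeed while a majority of members is reachable.
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		return facts, nil // cannot even build a client; treat as "no quorum"
	}
	defer cli.Close()

	readCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	if _, err := cli.Get(readCtx, "quorum-probe"); err == nil {
		facts.InQuorum = true
	}
	return facts, nil
}
```

A linearizable read is used as the quorum probe deliberately: a serializable read would be answered by a single member even without quorum.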
If there's no configmap, or it indicates that the cluster is new, this is normal create mode. The controller should wait for all pods to start running; quorum shall eventually be acquired. As before, timeouts/deadlines should apply. A failure mode here is that a cluster may actually have existed previously, along with persistent volumes on some members. These members will ignore the
I would suggest that the operator automatically save the cluster state in the resource spec, for example in the options map: spec.options["initial-cluster-state"] = "existing". Since options is a map, this should not pose a problem for the user. It is quite normal in Kubernetes for multiple controllers to work with one type of resource without interfering with each other, thanks to three-way merge. Such a resource can be backed up and restored at any time without fear of losing its status. Accordingly, on the server side this flag should be immutable, meaning it can be assigned but not changed, similar to the
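A minimal sketch of this suggestion, continuing the hypothetical package from the earlier sketch and assuming an EtcdCluster Go type whose Spec carries an Options map[string]string; the key name comes from the comment, while the import path and function name are placeholders:

```go
import (
	"context"

	ctrlclient "sigs.k8s.io/controller-runtime/pkg/client"

	etcdv1alpha1 "example.com/etcd-operator/api/v1alpha1" // placeholder import path
)

const initialClusterStateKey = "initial-cluster-state"

// markClusterAsExisting records, in the spec itself, the fact that the cluster
// has been bootstrapped at least once. Once set, the value is never changed.
func markClusterAsExisting(ctx context.Context, kube ctrlclient.Client, cluster *etcdv1alpha1.EtcdCluster) error {
	if cluster.Spec.Options[initialClusterStateKey] == "existing" {
		return nil // already recorded; nothing to do
	}
	patch := ctrlclient.MergeFrom(cluster.DeepCopy())
	if cluster.Spec.Options == nil {
		cluster.Spec.Options = map[string]string{}
	}
	cluster.Spec.Options[initialClusterStateKey] = "existing"
	// A merge patch touches only this key, so other controllers working with
	// the same resource are not disturbed. Immutability ("assignable but not
	// changeable") would still have to be enforced server-side, e.g. by a
	// validating webhook.
	return kube.Patch(ctx, cluster, patch)
}
```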
Majority failure scenarios
I've run some tests involving a majority failure with data loss. Here are my observations. The etcd launch parameters are:
Scenario 1: Surviving minority raft term greater than new cluster raft term
Steps to reproduce
Observations
Scenario 2: Minority raft term is behind new majority
Steps to reproduce
Observations
Before starting etcd-0:
After starting etcd-0:
This happens regardless of the raft term index. I conjecture that in most cases, this behavior is sufficient for the cluster to recover from a majority failure. If the cluster has existed for some time, its raft term can be expected to be quite large, while the failed-and-then-recovered majority that suffered data loss will have a raft term around 0 or 1. The successful first scenario did not require setting
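To make the raft-term reasoning above concrete, here is a hedged sketch of collecting the term reported by each endpoint through the etcd clientv3 Maintenance API; the function name and the choice to skip unreachable endpoints are assumptions:

```go
import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// raftTerms asks every endpoint for its status and returns the raft term each
// one reports. Unreachable endpoints are simply omitted from the result.
func raftTerms(ctx context.Context, cli *clientv3.Client, endpoints []string) map[string]uint64 {
	terms := make(map[string]uint64, len(endpoints))
	for _, ep := range endpoints {
		resp, err := cli.Status(ctx, ep) // Maintenance.Status reports RaftTerm per endpoint
		if err != nil {
			continue
		}
		terms[ep] = resp.RaftTerm
	}
	return terms
}
```

On a long-lived cluster the surviving minority would report a large term, while a freshly re-bootstrapped majority that lost its data would report a term near 1, which matches the observations above.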
More majority failure scenarios
Scenarios 1 & 2 all start with the same launch parameters. This leads to consistent and repeatable member and cluster IDs, and this is why etcd-1/2 happily merge back with etcd-0. Since the etcd cluster can potentially be created in one configuration and then change over time, as members are added and removed, the initial cluster parameters might not match those at bootstrap.
Scenario 3: majority failure with inconsistent cluster/member IDs
Steps to reproduce
Observations
Scenario 4: as before, but with
WIP: Reconciliation scenarios
It looks like recovery from majority failure is possible, but is probably hard enough to implement that other lower-hanging fruit would be more productive. Instead I will consider some simpler reconciliation scenarios, which occur during normal operations such as creating and updating a healthy cluster. The scenarios, however, will not be "cluster creation" or "cluster scaling": the operator has no way of knowing which kind of event triggered a reconciliation. This is a first shot at working out the reconciliation loop. It is superseded by the flowchart linked below.
Step 1: an
TL;DR
This issue is acknowledged and work is in progress to resolve it:
Please read further to understand the background reasoning.
Summary
The status field of the EtcdCluster is self-referential and is not determined by the status of an actual etcd cluster.
Background
Kubernetes controllers are level-based. This generally means that a controller should be able to determine its status from the state of the surrounding environment. It is generally considered a bad idea to read the status of the root object (in this case, the EtcdCluster CR).
Controllers also do not distinguish between the types of events that trigger reconciliation (was a resource created, updated, deleted, etc.). Trying to fight this restriction may cause complications further down the line. For example, implementing statuses that describe the lifecycle of a resource and then relying on them to select a mode of operation for the controller can lead to new failure modes which need to be handled in different ways, growing the complexity of the codebase.
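To illustrate the level-based point, here is a minimal sketch of what such a reconcile loop could look like with controller-runtime; the EtcdCluster API type, the observeCluster helper sketched earlier in this thread, and the endpoint resolution are all hypothetical, and the actions themselves are elided:

```go
import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	ctrlclient "sigs.k8s.io/controller-runtime/pkg/client"

	etcdv1alpha1 "example.com/etcd-operator/api/v1alpha1" // placeholder import path
)

// EtcdClusterReconciler is stripped down to what the example needs.
type EtcdClusterReconciler struct {
	ctrlclient.Client
}

func (r *EtcdClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var cluster etcdv1alpha1.EtcdCluster
	if err := r.Get(ctx, req.NamespacedName, &cluster); err != nil {
		return ctrl.Result{}, ctrlclient.IgnoreNotFound(err)
	}

	// Observe the environment: the cluster-state configmap and the etcd
	// members themselves. Nothing here consults cluster.Status.
	endpoints := []string{ /* resolved from the headless service; omitted */ }
	facts, err := observeCluster(ctx, r.Client, cluster.Namespace, cluster.Name, endpoints)
	if err != nil {
		return ctrl.Result{}, err
	}

	// Decide what to do purely from the spec plus the observed facts.
	switch {
	case !facts.Bootstrapped:
		// Render the configmap/statefulset for a brand-new cluster (omitted).
		return ctrl.Result{}, nil
	case facts.InQuorum:
		// Steady-state reconciliation of a healthy cluster (omitted).
		return ctrl.Result{}, nil
	default:
		// Quorum lost on a previously bootstrapped cluster: wait and retry;
		// recovery from majority failure is a separate topic.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
}
```

Note that ctrl.Request carries only a namespace and name; there is no event type to branch on, which is exactly why the mode of operation has to be re-derived from observed state on every pass.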
Issue
In its current state, the controller relies on an EtcdCluster's status field to determine the next steps in the reconciliation logic. If the controller reads an incorrect status without actually validating it, the etcd cluster will fail to launch. A contrived example can produce this issue as follows.
First, apply the following YAML:
This object will generate a statefulset which will launch and pass all readiness checks, but will not in fact build a working cluster; instead, three containers will simply run sleep 3600. The cluster-state configmap will set the value of ETCD_INITIAL_CLUSTER_STATE to existing.
Next, attempt to fix the problem by applying a correct manifest:
The statefulset's pod spec will be updated with correct values to launch an etcd cluster, but since the initial cluster state is erroneously set to "existing", a cluster will not be bootstrapped.
Analysis
In my opinion, the root cause of this problem is that no attempt is made to verify the "actual" status of the etcd cluster. Although at the moment the logic of the controller is quite simple and one must craft rather contrived bad inputs to break it, the examples do show that a failed transaction here or there, such as a dropped packet or a crashed node, can leave an EtcdCluster (the custom resource) in an unrecoverable state.
Since implementing communication with the etcd cluster will be necessary in any case, for instance to implement scaling up and down, I think it is prudent to start work on such a feature now and use it to determine the status of the cluster.
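As a sketch of that direction, the observed facts could be written back into the CR's status purely as a report, never read back as an input for the next pass; this continues the hypothetical types from the earlier sketches and assumes a conventional metav1.Condition-based status field:

```go
import (
	"context"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// updateObservedStatus reports what was actually seen on the etcd side.
// The controller never branches on this field during reconciliation.
func (r *EtcdClusterReconciler) updateObservedStatus(ctx context.Context, cluster *etcdv1alpha1.EtcdCluster, facts clusterFacts) error {
	cond := metav1.Condition{
		Type:    "Available",
		Status:  metav1.ConditionFalse,
		Reason:  "QuorumLost",
		Message: "etcd did not answer a linearizable read",
	}
	if facts.InQuorum {
		cond.Status = metav1.ConditionTrue
		cond.Reason = "QuorumReached"
		cond.Message = "etcd answered a linearizable read"
	}
	meta.SetStatusCondition(&cluster.Status.Conditions, cond)
	return r.Status().Update(ctx, cluster)
}
```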