Sabakan is a network boot service. It maintains a registry of the machines in a data center and tracks their status. It would be convenient if CKE could reference the sabakan registry and generate the Kubernetes cluster configuration by itself.
In fact, CKE can be integrated with sabakan to achieve this.
When enabled, CKE periodically queries sabakan to retrieve the available machines in a data center and generates the cluster configuration from a user-supplied template.
Users can specify variables for the query to choose machines. The query will be executed by sabakan GraphQL `searchMachines` API.
Labels and other attributes of sabakan `Machine` will be translated into Kubernetes Node labels.
To keep the Kubernetes and etcd clusters stable, CKE changes the cluster configuration cautiously. Details are described in the following sections.
Sabakan integration generates the cluster definition from a user-defined template. The template syntax is the same as cluster.yml.
The difference is that the template must have at least one control plane node and one non-control plane node.
A minimal template looks like:
```yaml
name: cluster
nodes:
- user: cybozu
  control_plane: true
- user: cybozu
  control_plane: false
service_subnet: 10.68.0.0/16
```
When the configuration template is updated, CKE promptly regenerates the cluster configuration from the new template.
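The template requirement above can be checked with a few lines of code. The following is a minimal sketch, not CKE's own validation: it assumes the template has already been parsed into a Python dict, and the function name `validate_template` is hypothetical.

```python
def validate_template(template):
    """Raise ValueError unless the template has both kinds of node entries.

    Hypothetical helper: the template is assumed to be a dict parsed from YAML.
    """
    nodes = template.get("nodes", [])
    control_plane = [n for n in nodes if n.get("control_plane", False)]
    workers = [n for n in nodes if not n.get("control_plane", False)]
    if not control_plane:
        raise ValueError("template needs at least one control plane node")
    if not workers:
        raise ValueError("template needs at least one non-control plane node")


# The minimal template from this section, written as a dict.
minimal = {
    "name": "cluster",
    "nodes": [
        {"user": "cybozu", "control_plane": True},
        {"user": "cybozu", "control_plane": False},
    ],
    "service_subnet": "10.68.0.0/16",
}
validate_template(minimal)  # passes silently
```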
In general, servers in data centers can be classified into several types. For example, compute servers are typically used to run VM or container workloads while storage servers are used to store large persistent data.
Servers defined in sabakan have a mandatory attribute "role" for this classification.
To support a Kubernetes cluster consisting of multiple roles such as "compute", "storage", and "gpu", sabakan integration of CKE allows users to specify node templates for each server role and its weight (ratio) in the cluster.
The node role is specified with `cke.cybozu.com/role` label value.
The weight (ratio) in the cluster is specified with `cke.cybozu.com/weight` label value.
An example that sets the compute/storage/gpu ratio to 6:3:1 looks like:
```yaml
name: cluster
nodes:
- user: cybozu
  control_plane: true
  labels:
    # Use compute servers for control plane nodes
    cke.cybozu.com/role: "compute"
- user: cybozu
  labels:
    cke.cybozu.com/role: "compute"
    cke.cybozu.com/weight: "6.0"
- user: cybozu
  labels:
    cke.cybozu.com/role: "storage"
    cke.cybozu.com/weight: "3.0"
  taints:
  - key: cke.cybozu.com/role
    value: storage
    effect: NoExecute
- user: cybozu
  labels:
    cke.cybozu.com/role: "gpu"
    cke.cybozu.com/weight: "1.0"
  taints:
  - key: cke.cybozu.com/role
    value: gpu
    effect: PreferNoSchedule
service_subnet: 10.68.0.0/16
```
If a node template lacks the `cke.cybozu.com/role` label, any server can match it.
If there are more than two templates for non-control plane nodes, they must have the `cke.cybozu.com/role` label.
If a non-control plane node template lacks the `cke.cybozu.com/weight` label, its weight becomes "1.0".
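To make the weight semantics concrete, the sketch below turns the weights of the non-control plane templates into an ideal number of workers per role. It is an illustration only: the function name `target_counts` is hypothetical, the templates are assumed to be parsed dicts, and how CKE rounds or otherwise resolves fractional targets is not covered here.

```python
ROLE_LABEL = "cke.cybozu.com/role"
WEIGHT_LABEL = "cke.cybozu.com/weight"


def target_counts(node_templates, total_workers):
    """Return the ideal (possibly fractional) worker count for each role.

    Hypothetical helper: control plane templates are skipped, and a missing
    weight label defaults to "1.0" as described above.
    """
    weights = {}
    for tmpl in node_templates:
        if tmpl.get("control_plane", False):
            continue
        labels = tmpl.get("labels", {})
        role = labels.get(ROLE_LABEL, "")
        weights[role] = float(labels.get(WEIGHT_LABEL, "1.0"))
    total_weight = sum(weights.values())
    return {role: total_workers * w / total_weight for role, w in weights.items()}


example_nodes = [
    {"user": "cybozu", "control_plane": True, "labels": {ROLE_LABEL: "compute"}},
    {"user": "cybozu", "labels": {ROLE_LABEL: "compute", WEIGHT_LABEL: "6.0"}},
    {"user": "cybozu", "labels": {ROLE_LABEL: "storage", WEIGHT_LABEL: "3.0"}},
    {"user": "cybozu", "labels": {ROLE_LABEL: "gpu", WEIGHT_LABEL: "1.0"}},
]
print(target_counts(example_nodes, 20))
# {'compute': 12.0, 'storage': 6.0, 'gpu': 2.0}
```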
CKE uses the following GraphQL query to retrieve machine information from sabakan:
```graphql
query ckeSearch($having: MachineParams = null,
                $notHaving: MachineParams = {
                  roles: ["boot"]
                  states: [RETIRED]
                }) {
  searchMachines(having: $having, notHaving: $notHaving) {
    # snip
  }
}
```
Users may specify `$having` and `$notHaving` to change the search conditions.
They can be specified by a JSON object like this:
```json
{
  "having": {
    "labels": [{"name": "foo", "value": "bar"}],
    "roles": ["worker", "gpu"]
  },
  "notHaving": {
    "states": ["UNINITIALIZED", "RETIRED"]
  }
}
```
`$having` and `$notHaving` are `MachineParams`. Consult GraphQL schema for the definition of `MachineParams`.
CKE generates cluster configuration with the following conditions.
Musts:
- User-specified constraints must be satisfied.
- etcd cluster must not be broken.
- Kubernetes cluster must not be broken.
Shoulds:
- If the template node for control plane has the `cke.cybozu.com/role` label, servers of control plane nodes should be the specified role.
- The number of servers for each role should be proportional to the given weights in the template.
- The servers which have the same role should be distributed evenly over the racks.
- Newer machines should be preferred over older ones.
- Healthy machines should be preferred over non-healthy ones.
- Unreachable machines in the cluster should be tainted with `NoSchedule`.
- Retiring and retired machines should be tainted with `NoExecute`.
- Retired machines should be removed if the machines are kept retired for a while.
- Rebooting machines should be neither removed from the cluster nor tainted.
- Each change of the cluster configuration should be made as small as possible.
- Control plane nodes should be distributed across different racks.
- All control plane nodes should be healthy.
- No control plane node should be tainted. The following taints are tolerated:
  - Transitional taints added by the Kubernetes system such as `node.kubernetes.io/not-ready`
  - Transitional taints added by CKE, i.e. `cke.cybozu.com/state`
  - Taints for control plane nodes added by CKE, i.e. `cke.cybozu.com/master`
  - User-tolerated taints specified in the cluster template
To understand the status and lifecycle of a machine, see sabakan lifecycle.
When a new node needs to be added to the cluster configuration, the algorithm selects a machine as follows:
- Deselect non-healthy machines.
- Deselect machines used in the current cluster configuration.
- Select machines of preferred roles.
  - If the new node will be a control plane and a role is specified, select servers of the same role.
  - Otherwise, choose the node template whose number of servers is smallest relative to its specified weight, and use the role specified for that template.
- Add the following score to each machine:
  - (100 - (number of machines that have the same role and are in the same rack)) * 10
- Add the following scores to each machine:
  - If the machine's lifetime is > 250 days, +1.
  - If the machine's lifetime is > 500 days, +1 (+2 in total).
  - If the machine's lifetime is > 1000 days, +1 (+3 in total).
  - If the machine's lifetime is < -250 days, -1.
  - If the machine's lifetime is < -500 days, -1 (-2 in total).
  - If the machine's lifetime is < -1000 days, -1 (-3 in total).
- Select the highest scored machine.
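The scoring steps above can be sketched as follows. This is an illustration under stated assumptions, not CKE's implementation: the `Machine` class and function names are hypothetical, the preferred role is passed in already resolved, `lifetime_days` is assumed to be the number of days left until the retire date (negative once past it), and the same-rack count is assumed to be taken over machines already in the cluster.

```python
from dataclasses import dataclass


@dataclass
class Machine:  # hypothetical stand-in for a sabakan machine record
    serial: str
    role: str
    rack: int
    healthy: bool
    lifetime_days: int  # assumed: days until retire date, negative if past it


def lifetime_score(days):
    """+1 per threshold exceeded, -1 per negative threshold crossed."""
    score = 0
    for threshold in (250, 500, 1000):
        if days > threshold:
            score += 1
        if days < -threshold:
            score -= 1
    return score


def pick_machine_to_add(candidates, in_use, role):
    """Return the highest scored healthy, unused machine of the given role."""
    used = {m.serial for m in in_use}
    pool = [m for m in candidates
            if m.healthy and m.serial not in used and m.role == role]
    if not pool:
        return None

    def score(m):
        same_rack = sum(1 for u in in_use if u.role == m.role and u.rack == m.rack)
        return (100 - same_rack) * 10 + lifetime_score(m.lifetime_days)

    return max(pool, key=score)


in_use = [Machine("a", "compute", 0, True, 600)]
candidates = in_use + [
    Machine("b", "compute", 0, True, 1200),   # rack 0 already hosts "a"
    Machine("c", "compute", 1, True, 100),    # empty rack wins despite short lifetime
    Machine("d", "compute", 1, False, 1200),  # non-healthy, deselected
    Machine("e", "storage", 2, True, 1200),   # wrong role
]
print(pick_machine_to_add(candidates, in_use, "compute").serial)  # c
```

Because the rack term is multiplied by 10 and the lifetime term is at most 3 in magnitude, rack distribution always dominates lifetime in this scoring.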
When an existing node needs to be removed from the cluster configuration, the algorithm selects one as follows:
- Add the following score to each machine:
  - If the machine's state is healthy, +1000.
- Add the following score to each machine:
  - (100 - (number of machines that have the same role and are in the same rack)) * 10
- Add scores to each machine as follows:
  - If the machine's lifetime is > 250 days, +1.
  - If the machine's lifetime is > 500 days, +1 (+2 in total).
  - If the machine's lifetime is > 1000 days, +1 (+3 in total).
  - If the machine's lifetime is < -250 days, -1.
  - If the machine's lifetime is < -500 days, -1 (-2 in total).
  - If the machine's lifetime is < -1000 days, -1 (-3 in total).
- Select the lowest scored machine.
Note that node selection should be done separately for control plane nodes and non-control plane nodes.
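The removal scoring can be sketched in the same way. Again this is only an illustration: the `Node` class and function names are hypothetical, `lifetime_days` is assumed to be the days left until the retire date, and the same-rack count is assumed to include the machine itself. The caller is expected to pass only control plane nodes or only non-control plane nodes, in line with the note above.

```python
from dataclasses import dataclass


@dataclass
class Node:  # hypothetical stand-in for a machine currently in the cluster
    serial: str
    role: str
    rack: int
    healthy: bool
    lifetime_days: int  # assumed: days until retire date, negative if past it


def lifetime_score(days):
    """+1 per threshold exceeded, -1 per negative threshold crossed."""
    score = 0
    for threshold in (250, 500, 1000):
        if days > threshold:
            score += 1
        if days < -threshold:
            score -= 1
    return score


def removal_score(node, group):
    """Score a node; the node with the lowest score is removed first."""
    same_rack = sum(1 for n in group if n.role == node.role and n.rack == node.rack)
    score = 1000 if node.healthy else 0
    score += (100 - same_rack) * 10
    score += lifetime_score(node.lifetime_days)
    return score


def pick_node_to_remove(group):
    """Return the lowest scored node of one group (workers or control planes)."""
    return min(group, key=lambda n: removal_score(n, group))


workers = [
    Node("w1", "compute", 0, True, 1200),
    Node("w2", "compute", 0, True, 300),
    Node("w3", "compute", 1, False, 1200),
]
print(pick_node_to_remove(workers).serial)  # w3 (non-healthy, so it lacks the +1000)
```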
The first time CKE generates cluster configuration from a template, it works as follows:
- Search sabakan to acquire the list of available machines.
- Select `control-plane-count` nodes for Kubernetes/etcd control plane.
- Select `minimum-workers` nodes.
The algorithm fails when there are not enough available healthy machines to satisfy the constraints.
While CKE is idle, it periodically queries sabakan to check for updates to the available machines. The algorithm tries to minimize changes to the cluster configuration.
First, CKE acquires the list of available machines and their statuses from sabakan, then compares the list with the current cluster configuration.
Then it selects one of the following actions if the condition matches.
If the current cluster contains nodes that no longer exist in the list, they are removed. This should be executed at the beginning because non-existent nodes block CKE.
New nodes may be added to satisfy constraints.
If too many control plane nodes would be removed, this algorithm cannot work because replacing more than half of etcd servers would break the cluster. In this case, administrators need to fix the cluster configuration manually.
When `control-plane-count` constraint is increased, control plane nodes are added. If there are too few unused healthy-and-untainted machines and the number of worker nodes is greater than `minimum-workers` constraint, existing healthy-and-untainted worker nodes are changed to control plane nodes.
When `control-plane-count` constraint is decreased, control plane nodes are changed to non-control-plane nodes.
If the total number of worker nodes exceeds `maximum-workers`, existing worker nodes are removed to satisfy the constraint.
If a control plane node (1) is neither healthy, updating, nor uninitialized, or (2) has intolerable taints, the node is demoted to a worker, and a new machine is added as a control plane node.
When there is no unused healthy-and-untainted machine, a healthy-and-untainted worker node is selected to be promoted to a control plane node.
If the number of healthy worker nodes is less than `minimum-workers` and the total number of worker nodes is less than `maximum-workers`, new worker nodes are added.
If a worker node is kept retired for a while,
- it is removed from the cluster if the number of workers is greater than `minimum-workers`, or
- it is replaced with a new machine if available, otherwise,
- it is left untouched.
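The three-way decision for a long-retired worker can be written as a small function. This is a sketch only: the function name and the returned strings are hypothetical, and how long "a while" lasts is not modeled here.

```python
def handle_retired_worker(worker_count, minimum_workers, replacement_available):
    """Decide what to do with a worker that has been retired for a while.

    Hypothetical helper returning "remove", "replace", or "keep".
    """
    if worker_count > minimum_workers:
        return "remove"
    if replacement_available:
        return "replace"
    return "keep"


print(handle_retired_worker(5, 3, False))  # remove
print(handle_retired_worker(3, 3, True))   # replace
print(handle_retired_worker(3, 3, False))  # keep
```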
CKE adds taints to nodes as follows.
The taint key is `cke.cybozu.com/state`.
Machine state | Taint value | Taint effect |
---|---|---|
Unreachable | `unreachable` | `NoSchedule` |
Retiring | `retiring` | `NoExecute` |
Retired | `retired` | `NoExecute` |
For other machine states, the taint is removed.
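The taint table maps directly onto a lookup. The sketch below assumes machine states arrive as upper-case names such as `RETIRED` (as in the query variables earlier); the function name `state_taint` and the dict shape of the returned taint are hypothetical.

```python
TAINT_KEY = "cke.cybozu.com/state"

# Machine state -> (taint value, taint effect), taken from the table above.
STATE_TAINTS = {
    "UNREACHABLE": ("unreachable", "NoSchedule"),
    "RETIRING": ("retiring", "NoExecute"),
    "RETIRED": ("retired", "NoExecute"),
}


def state_taint(state):
    """Return the taint for a machine state, or None if the taint is removed."""
    entry = STATE_TAINTS.get(state.upper())
    if entry is None:
        return None
    value, effect = entry
    return {"key": TAINT_KEY, "value": value, "effect": effect}


print(state_taint("RETIRING"))
# {'key': 'cke.cybozu.com/state', 'value': 'retiring', 'effect': 'NoExecute'}
print(state_taint("HEALTHY"))  # None
```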
Sabakan Machine `labels` are translated to Kubernetes Node labels.
The label key will be prefixed by `sabakan.cke.cybozu.com/`.
Other Machine fields are also translated to labels as follows.
`topology.kubernetes.io/zone` and `failure-domain.beta.kubernetes.io/zone` (deprecated) are well-known labels.
`node-role.kubernetes.io/<role>` is used by `kubectl` to display the node's role.
Field | Label key | Value |
---|---|---|
`spec.rack` | `cke.cybozu.com/rack` | `spec.rack` converted to string. |
`spec.rack` | `topology.kubernetes.io/zone` | `spec.rack` converted to string with prefix `rack`. |
`spec.rack` | `failure-domain.beta.kubernetes.io/zone` | `spec.rack` converted to string with prefix `rack`. |
`spec.indexInRack` | `cke.cybozu.com/index-in-rack` | `spec.indexInRack` converted to string. |
`spec.role` | `cke.cybozu.com/role` | The same as `spec.role`. |
`spec.role` | `node-role.kubernetes.io/<role>` | `"true"` |
`spec.registerDate` | `cke.cybozu.com/register-month` | `spec.registerDate` in `yyyy-MM` format. |
`spec.retireDate` | `cke.cybozu.com/retire-month` | `spec.retireDate` in `yyyy-MM` format. |
In addition, `node-role.kubernetes.io/master` is set to `"true"` on control plane nodes.
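The label translation can be illustrated with a short function. This is a sketch under assumptions rather than CKE's code: the function name `machine_labels` is hypothetical, the spec is assumed to be a dict whose `labels` entry is a list of name/value pairs and whose dates are `datetime` objects, and the `rack` prefix is assumed to be joined directly to the rack number.

```python
from datetime import datetime, timezone


def machine_labels(spec, control_plane=False):
    """Build Node labels from a sabakan machine spec (hypothetical dict shape)."""
    labels = {
        "sabakan.cke.cybozu.com/" + item["name"]: item["value"]
        for item in spec.get("labels", [])
    }
    rack = str(spec["rack"])
    labels.update({
        "cke.cybozu.com/rack": rack,
        "topology.kubernetes.io/zone": "rack" + rack,
        "failure-domain.beta.kubernetes.io/zone": "rack" + rack,
        "cke.cybozu.com/index-in-rack": str(spec["indexInRack"]),
        "cke.cybozu.com/role": spec["role"],
        "node-role.kubernetes.io/" + spec["role"]: "true",
        "cke.cybozu.com/register-month": spec["registerDate"].strftime("%Y-%m"),
        "cke.cybozu.com/retire-month": spec["retireDate"].strftime("%Y-%m"),
    })
    if control_plane:
        labels["node-role.kubernetes.io/master"] = "true"
    return labels


example_spec = {
    "labels": [{"name": "foo", "value": "bar"}],
    "rack": 0,
    "indexInRack": 3,
    "role": "compute",
    "registerDate": datetime(2020, 1, 15, tzinfo=timezone.utc),
    "retireDate": datetime(2025, 1, 15, tzinfo=timezone.utc),
}
print(machine_labels(example_spec)["topology.kubernetes.io/zone"])  # rack0
```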
The following Machine fields are translated to Node annotations:
Field | Annotation key | Value |
---|---|---|
`spec.serial` | `cke.cybozu.com/serial` | The same as `spec.serial`. |
`spec.registerDate` | `cke.cybozu.com/register-date` | `spec.registerDate` in RFC3339 format. |
`spec.retireDate` | `cke.cybozu.com/retire-date` | `spec.retireDate` in RFC3339 format. |
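The annotation mapping can be sketched the same way. The function name `machine_annotations` and the dict shape of the spec are assumptions for illustration; the dates are assumed to be timezone-aware `datetime` objects, whose `isoformat()` output is a valid RFC3339 timestamp.

```python
from datetime import datetime, timezone


def machine_annotations(spec):
    """Build Node annotations from a sabakan machine spec (hypothetical dict shape)."""
    return {
        "cke.cybozu.com/serial": spec["serial"],
        "cke.cybozu.com/register-date": spec["registerDate"].isoformat(),
        "cke.cybozu.com/retire-date": spec["retireDate"].isoformat(),
    }


annotated_spec = {
    "serial": "abc123",
    "registerDate": datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
    "retireDate": datetime(2025, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
}
print(machine_annotations(annotated_spec)["cke.cybozu.com/register-date"])
# 2020-01-02T03:04:05+00:00
```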