This is a guide on how to troubleshoot issues related to the Cluster API provider for vSphere (CAPV).
- Troubleshooting
- Debugging issues
- Common issues
This section describes how to debug issues tha occur while trying to deploy a new cluster with clusterctl
and CAPV.
The first step to figuring out what went wrong is to increase the logging.
There are three places to adjust the log level when bootstrapping cluster.
The following steps may be used to adjust the CAPI manager's log level:
-
Open the
provider-components.yaml
file, ex../out/management-cluster/provider-components.yaml
-
Search for
cluster-api/cluster-api-controller
-
Modify the pod spec for the CAPI manager to indicate where to send logs and the log level:
spec: containers: - args: - --logtostderr - -v=6 command: - /manager image: registry.k8s.io/cluster-api/cluster-api-controller:v1.6.0 name: manager
A log level of six should provided additional information useful for figuring out most issues.
-
Open the
provider-components.yaml
file, ex../out/management-cluster/provider-components.yaml
-
Search for
cluster-api-provider-vsphere
-
Modify the pod spec for the CAPV manager to indicate the log level:
spec: containers: - args: - --logtostderr - -v=6 command: - /manager env: - name: NODE_NAME valueFrom: fieldRef: fieldPath: spec.nodeName image: registry.k8s.io/cluster-api-vsphere/cluster-api-vsphere-controller:v1.8.7 name: manager
A log level of six should provided additional information useful for figuring out most issues.
The clusterctl
log level may be specified when running clusterctl
:
clusterctl create cluster \
-a ./out/management-cluster/addons.yaml \
-c ./out/management-cluster/cluster.yaml \
-m ./out/management-cluster/machines.yaml \
-p ./out/management-cluster/provider-components.yaml \
--kubeconfig-out ./out/management-cluster/kubeconfig \
--provider vsphere \
--bootstrap-type kind \
-v 6
The last line of the above command, -v 6
, tells clusterctl
to log messages at level six. This should provide additional information that may be used to diagnose issues.
The clusterctl
logs are client-side only. The more interesting information is occurring inside of the bootstrap cluster. This section describes how to access the logs in the bootstrap cluster.
To make the subsequent steps easier, please go ahead and export a KUBECONFIG
environment variable to point to the bootstrap cluster that is or will be running via Kind:
export KUBECONFIG=$(kind get kubeconfig-path --name=clusterapi)
The following command may be used to follow the logs from the CAPI manager:
$ while ! kubectl \
-n cluster-api-system logs cluster-api-controller-manager-0 -f || \
true; do sleep 1; done
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
I0726 18:52:10.267212 1 main.go:65] Registering Components
I0726 18:52:10.269442 1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource" "controller"="machinedeployment-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"selector":{},"template":{"metadata":{"creationTimestamp":null},"spec":{"metadata":{"creationTimestamp":null},"providerSpec":{},"versions":{"kubelet":""}}}},"status":{}}}
I0726 18:52:10.269819 1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource" "controller"="machinedeployment-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"selector":{},"template":{"metadata":{"creationTimestamp":null},"spec":{"metadata":{"creationTimestamp":null},"providerSpec":{},"versions":{"kubelet":""}}}},"status":{"replicas":0}}}
The above command immediately begins trying to follow the CAPI manager log, even before the bootstrap cluster and the CAPI manager pod exist. Once the latter is finally available, the command will start following its log.
To tail the logs from the CAPV manager image, use the following command:
$ while ! kubectl \
-n vsphere-provider-system logs vsphere-provider-controller-manager-0 \
-f || true; do sleep 1; done
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
Error from server (NotFound): namespaces "vsphere-provider-system" not found
Error from server (NotFound): namespaces "vsphere-provider-system" not found
Error from server (NotFound): pods "vsphere-provider-controller-manager-0" not found
Error from server (NotFound): pods "vsphere-provider-controller-manager-0" not found
Error from server (BadRequest): container "manager" in pod "vsphere-provider-controller-manager-0" is waiting to start: ContainerCreating
Error from server (BadRequest): container "manager" in pod "vsphere-provider-controller-manager-0" is waiting to start: ContainerCreating
I0726 18:30:17.988872 1 main.go:92] Cluster-api objects are synchronized every 10m0s
I0726 18:30:17.989473 1 main.go:93] The default requeue period is 20s
I0726 18:30:18.066304 1 round_trippers.go:405] GET https://10.96.0.1:443/api?timeout=32s 200 OK in 76 milliseconds
The above command immediately begins trying to follow the CAPV manager log, even before the bootstrap cluster and the CAPV manager pod exist. Once the latter is finally available, the command will start following its log.
Solving issues may also require accessing the logs from the bootstrap cluster's core components:
kubectl -n kube-system logs kube-apiserver-clusterapi-control-plane -f
kubectl -n kube-system logs kube-controller-manager-clusterapi-control-plane -f
kubectl -n kube-system logs kube-scheduler-clusterapi-control-plane -f
This section contains issues commonly encountered by people using CAPV.
The Getting Started guide lists the prerequisites for deploying clusters with CAPV. Make sure those prerequisites, such as clusterctl
, kubectl
, kind
, etc. are up to date.
If you are using CAPV from a previously tested CAPV on, you may be using an out of date manifest docker image. You can remedy this by removing your existing CAPV manifest image by using docker rmi gcr.io/cluster-api-provider-vsphere/release/manifests:latest
or by updating the command to specify a specific manifest image, for example:
$ docker run --rm \
> -v "$(pwd)":/out \
> -v "$(pwd)/envvars.txt":/envvars.txt:ro \
> gcr.io/cluster-api-provider-vsphere/release/manifests:0.5.2-alpha.1 \
> -c management-cluster
This will ensure that the desired image is being used.
When generating the YAML manifest the following error may occur:
$ docker run --rm \
> -v "$(pwd)":/out \
> -v "$(pwd)/envvars.txt":/envvars.txt:ro \
> gcr.io/cluster-api-provider-vsphere/release/manifests:latest \
> -c management-cluster
/build/hack/generate-yaml.sh: line 90: source: /envvars.txt: is a directory
This means that "$(pwd)/envvars.txt"
does not refer to an existing file on the localhost. So instead of bind mounting a file into the container, Docker created a new directory on the localhost at the path "$(pwd)/envvars.txt"
and bind mounted it into the container.
Make sure the path to the envvars.txt
file is correct before using it to generate the YAML manifests.
When bootstrapping the management cluster, the vSphere manager log may emit errors similar to the following:
E0726 17:12:54.812485 1 actuator.go:217] [cluster-actuator]/cluster.k8s.io/v1alpha1/default/v0.4.0-beta.2 "msg"="target cluster is not ready" "error"="unable to get client for target cluster: failed to retrieve kubeconfig secret for Cluster \"management-cluster\" in namespace \"default\": secret not found"
The above error does not mean there is a problem. Kubernetes components operate in a reconciliation model -- a message loops attempts to reconcile the desired state over and over until it is achieved or a timeout occurs.
The error message simply indicates that the first control plane node for the target cluster has not yet come online and provided the information necessary to generate the kubeconfig
for the target cluster.
It is quite typical to see many errors in Kubernetes service logs, from the API server, to the controller manager, to the kubelet -- the errors are eventually reconciled as the expected configurations are provided and the desired state is reconciled.
When clusterctl
times out waiting for the management cluster to come online, and the vSphere manager log repeats failed to retrieve kubeconfig secret for Cluster
over and over again, it means there was an error bringing the management cluster's first control plane node online. Possible reasons include:
Two common causes for a failed deployment are related to accessing the remote vSphere endpoint:
- The host from which
clusterctl
is executed must have access to the vSphere endpoint to which the management cluster is being deployed. - The provided vSphere credentials are invalid.
A quick way to validate both access and the credentials is using the program govc
or its container image, vmware/govc
:
# Define the vSphere's endpoint and access information.
$ export GOVC_URL="myvcenter.com" GOVC_USERNAME="username" GOVC_PASSWORD="password"
# Use "govc" to list the contents of the vSphere endpoint.
$ docker run --rm \
-e GOVC_URL -e GOVC_USERNAME -e GOVC_PASSWORD \
vmware/govc \
ls -k
/my-datacenter/vm
/my-datacenter/network
/my-datacenter/host
/my-datacenter/datastore
If the above command fails then there is an issue with accessing the vSphere endpoint, and it must be corrected before clusterctl
will succeed.
Deployed VMs get their names from the names of the machines in machines.yaml
and machineset.yaml
. If a VM with the same name already exists in the same location as one of the VMs that would be created by a new cluster, then the new cluster will fail to deploy and the CAPV manager log will include an error similar to the following:
I0726 18:52:48.920975 1 util.go:195] default-logger/cluster.k8s.io/v1alpha1/default/v0.4.0-beta.2/management-cluster-controlplane-1/task-231288 "level"=2 "msg"="task failed" "description-id"="VirtualMachine.clone"
Use the govc
image to check to see if there is a VM with the same name:
$ docker run --rm \
-e GOVC_URL -e GOVC_USERNAME -e GOVC_PASSWORD \
vmware/govc \
vm.info -k management-cluster-controlplane-1
Name: management-cluster-controlplane-1
Path: /my-datacenter/vm/management-cluster-controlplane-1
UUID: 4230a650-c92a-d99a-d9f7-fa2fd770e536
Guest name: Other 3.x or later Linux (64-bit)
Memory: 2048MB
CPU: 2 vCPU(s)
Power state: poweredOn
Boot time: <nil>
IP address: 5.6.7.8
Host: 1.2.3.4
Another common error is to omit the segment length when using a static IP address. For example:
network:
devices:
- networkName: "sddc-cgw-network-6"
gateway4: 192.168.6.1
ipAddrs:
- 192.168.6.20/24
nameservers:
- 1.1.1.1
- 1.0.0.1
The above network configuration defines a static IP address, 192.168.6.20
, but also includes the required segment length. Without this, clusterctl
will timeout waiting for the control plane to come online.
A machine with multiple networks may cause the bootstrap process to fail for various reasons.
A machine that defines two networks may lead to failure if both networks use DHCP and two default routes are defined on the guest. For example:
network:
devices:
- networkName: "sddc-cgw-network-5"
dhcp4: true
- networkName: "sddc-cgw-network-6"
dhcp4: true
The above network configuratoin from a machine definition includes two network devices, both using DHCP. This likely causes two default routes to be defined on the guest, meaning it's not possible to determine the default IPv4 address that should be used by Kubernetes.
Another reason a machine with two networks can lead to failure is because the order in which IP addresses are returned externally from a VM is not guaranteed to be the same order as they are when inspected inside the guest. The solution for this is to define a preferred CIDR -- the network segment that contains the IP that the kubeadm
bootstrap process selected for the API server. For example:
network:
preferredAPIServerCidr: "192.168.5.0/24"
devices:
- networkName: "sddc-cgw-network-6"
ipAddrs:
- 192.168.6.20/24
- networkName: "sddc-cgw-network-5"
dhcp4: true
The above network definition specifies the CIDR to which the IP address belongs that is bound to the Kubernetes API server on the guest.
During the bootstrapping process a CA certificate is transferred to the new VM. This CA has a "not valid until" date associated with it. If the ESXI host does not have NTP properly configured there is a chance you will get an error during the kubeadm bootstrapping process which will output an error similar to this in the /var/log/cloud-init-output.log
log on the VM:
[certs] Using certificateDir folder "/etc/kubernetes/pki"
error execution phase certs/ca: failure loading ca certificate: failed to load certificate: the certificate is not valid yet
The solution for this is to either properly configure NTP in vCenter Configuring Network Time Protocol (NTP) on an ESXi host using the vSphere Web Client (57147) or add a NTP server block to the KubeadmConfig
:
spec:
ntp:
enabled: true
servers:
- 192.168.2.1
This section discusses issues that can cause a Machine object to be stuck in a provisioning state.
kubectl get machine
NAME PROVIDERID PHASE
capi-quickstart-controlplane-0 provisioning
To troubleshoot these type of scenarios capv-controller-manager
logs are a good starting point. These logs can be retrived using kubectl logs capv-controller-manager-88f646758-nj8fs -n capv-system
One of the scenarios where a machine object fails to provision successfully and is stuck in a provisioning state is when the VM folder specified in the manifest does not exist. Below error messages can be seen in the capv-controller-manager
logs:
kubectl logs capv-controller-manager-88f646758-nj8fs -n capv-system
I0219 15:15:28.802698 1 vspheremachine_controller.go:249] capv-controller-manager/vspheremachine-controller/default/capi-quickstart-controlplane-0 "level"=0 "msg"="resource patch was not required" "local-resource-version"="63276" "remote-resource-version"="63276"
E0219 15:15:28.802789 1 controller.go:218] controller-runtime/controller "msg"="Reconciler error" "error"="failed to reconcile VM: unable to get folder for \"infrastructure.cluster.x-k8s.io/v1alpha2, Kind=VSphereCluster default/capi-quickstart/capi-quickstart-controlplane-0\": folder 'clusterapiVM' not found" "controller"="vspheremachine" "request"={"Namespace":"default","Name":"capi-quickstart-controlplane-0"}
To resolve this error create a VM folder with the name as specified in the manifest. This can be done using the vCenter UI or govc
. For example in case of this error, govc folder.create /Datacenter/vm/clusterapiVM
, resolves the issue.