Troubleshooting kOps clusters

The Control Plane

If the above-mentioned command complains about an unavailable API server, it means the control plane isn’t working properly. In order to diagnose further, you need to log into one of the control plane nodes.

Run kops get instances (1.19+) or look in the AWS console to identify a node with the master role. Then ssh into the IP address listed.
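As a concrete sketch, assuming an Ubuntu-based AMI (so the SSH user is ubuntu; adjust for your image) and that the role column shows master or control-plane depending on the kOps version:

  1. kops get instances | grep -iE 'master|control-plane'
  2. ssh ubuntu@<IP address from the output above>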

The logs on the control plane reside in /var/log. Assume the logs are there unless otherwise noted.

Nodeup is the process responsible for the initial provisioning of a node. It is a oneshot systemd service called kops-configuration.service. You can see the logs for this service by running journalctl -u kops-configuration.service.
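If the journal output is inconclusive, the unit's state itself can also be checked; these are plain systemd commands, nothing kOps-specific:

  1. systemctl status kops-configuration.service
  2. journalctl -u kops-configuration.service --no-pager | tail -n 50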

If it succeeded, the end of the log should indicate that provisioning completed successfully.

Note that if the node booted some time ago, the logs for this unit may be empty.

Either way, we would appreciate a GitHub issue for nodeup failures, as we try to avoid clusters running into problems during the nodeup process.

If nodeup succeeds, the core kube containers should have started. Look for the API server logs in /var/log/kube-apiserver.log.
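To get a quick overview of recent problems, the log can be tailed and searched; the path is the conventional kOps location mentioned above:

  1. tail -n 100 /var/log/kube-apiserver.log
  2. grep -iE 'error|fatal' /var/log/kube-apiserver.log | tail -n 20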

Often the issue is obvious, such as incorrect CLI flags being passed to the API server.

After resizing an etcd cluster or restoring a backup, the Kubernetes API can contain too many endpoints. You can confirm this by running kubectl get endpoints -n default kubernetes. This command should list exactly as many IPs as you have control plane nodes.
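As a quick sanity check, the endpoint IPs can be counted and compared against the number of control plane nodes; this is just an illustrative one-liner using kubectl's jsonpath output:

  1. kubectl get endpoints -n default kubernetes -o jsonpath='{.subsets[*].addresses[*].ip}' | wc -w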

Resizing or restoring in this way can cause old apiserver leases to get stuck. In order to recover, you need to remove the leases from etcd directly:

  1. CONTAINER=$(kubectl get pods -n kube-system | grep etcd-manager-main | head -n 1 | awk '{print $1}')
  2. kubectl exec -it -n kube-system $CONTAINER -- sh
  3. cd /opt/etcd-v3.4.13-linux-amd64/
  4. ./etcdctl --cacert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-ca.crt --cert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.crt --key=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.key --endpoints=https://127.0.0.1:4001 del --prefix /registry/masterleases/

The remaining api servers will immediately recreate their own leases.
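To confirm the recovery, re-run the endpoints check from above and verify that it again lists one IP per control plane node:

  1. kubectl get endpoints -n default kubernetes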

One of the more common reasons for the API server not working properly is that etcd is unavailable. If you see connection errors to port 4001 or 4002, it means that the main or events etcd cluster, respectively, is unavailable.
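A quick way to spot such errors is to search the API server log for the etcd client ports; the log path assumes the conventional location used earlier:

  1. grep -E ':(4001|4002)' /var/log/kube-apiserver.log | grep -i connection | tail -n 20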

The etcd clusters are managed by etcd-manager, and most likely something is wrong with the manager rather than with etcd itself. The logs for etcd are passed through etcd-manager, so you will find the logs for both in /var/log/etcd.log and /var/log/etcd-events.log. Since both etcd-manager and etcd are quorum-based clusters, these files may contain misleading errors that suggest etcd is broken, when in fact it is etcd-manager that is.
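If you want to rule out etcd itself, its health can be queried directly using the same etcd-manager container and client certificates as in the lease-removal example above; note that the etcd version in the directory name may differ in your cluster:

  1. ./etcdctl --cacert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-ca.crt --cert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.crt --key=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.key --endpoints=https://127.0.0.1:4001 endpoint health
  2. ./etcdctl --cacert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-ca.crt --cert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.crt --key=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.key --endpoints=https://127.0.0.1:4001 member list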

DNS

Troubleshooting Kubernetes DNS is perhaps worth a whole book. The Kubernetes docs have a dedicated guide on how to debug DNS resolution.

It is worth mentioning that failing DNS is often a symptom of a broken pod network. So you may want to ensure that two pods can talk to each other using IP addresses before starting to troubleshoot DNS.
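A rough way to verify this is to start two throwaway pods and ping one from the other by pod IP (ideally the pods should land on different nodes so cross-node networking is exercised); the image and pod names here are only illustrative:

  1. kubectl run net-test-a --image=busybox:1.36 --restart=Never -- sleep 3600
  2. kubectl run net-test-b --image=busybox:1.36 --restart=Never -- sleep 3600
  3. kubectl get pod net-test-b -o jsonpath='{.status.podIP}'
  4. kubectl exec net-test-a -- ping -c 3 <IP from the previous step>
  5. kubectl delete pod net-test-a net-test-b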

CNI

If the CNI bin directory is completely empty it may be a symptom of nodeup not working properly. See more on troubleshooting nodeup above. In most cases, nodeup will write the most common CNI plugins to that directory so it should rarely be completely empty.
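On most kOps clusters the directory in question is the standard CNI location, /opt/cni/bin/ (assumed here), so a quick look on the node tells you whether the plugins are present:

  1. ls -la /opt/cni/bin/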

If the directory is there, but the CNI plugin and configuration are missing, it means that the process responsible for writing these files is not working properly. In most cases this is a DaemonSet running in kube-system.
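To see whether that DaemonSet exists and has the expected number of ready pods, list the DaemonSets in kube-system; the name of the CNI DaemonSet depends on which CNI you use:

  1. kubectl get daemonsets -n kube-system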

If the API is working, and the CNI is installed through an addon, check that the pods are running. If pods are expected, but absent, it may be an issue with installing the CNI addon. kOps will try to install addons regularly, so run journalctl -f on a control plane node to spot any errors.
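Besides the journal, recent events in kube-system can also surface failing CNI pods or addon installation problems; these are generic Kubernetes checks, not kOps-specific:

  1. kubectl get pods -n kube-system -o wide
  2. kubectl get events -n kube-system --sort-by=.lastTimestamp | tail -n 30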