Troubleshoot Clusters
The first thing to debug in your cluster is if your nodes are all registered correctly.
Run
And verify that all of the nodes you expect to see are present and that they are all in the state.
kubectl cluster-info dump
For now, digging deeper into the cluster requires logging into the relevant machines. Here are the locations of the relevant log files. (note that on systemd-based systems, you may need to use instead)
/var/log/kube-apiserver.log
- API Server, responsible for serving the API- - Scheduler, responsible for making scheduling decisions
/var/log/kube-controller-manager.log
- Controller that manages replication controllers
- - Kubelet, responsible for running containers on the node
/var/log/kube-proxy.log
- Kube Proxy, responsible for service load balancing
This is an incomplete list of things that could go wrong, and how to adjust your cluster setup to mitigate the problems.
- VM(s) shutdown
- Network partition within cluster, or between cluster and users
- Crashes in Kubernetes software
- Data loss or unavailability of persistent storage (e.g. GCE PD or AWS EBS volume)
- Operator error, for example misconfigured Kubernetes software or application software
- Apiserver VM shutdown or apiserver crashing
- Results
- unable to stop, update, or start new pods, services, replication controller
- Results
- Apiserver backing storage lost
- Results
- apiserver should fail to come up
- kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying
- manual recovery or recreation of apiserver state necessary before apiserver is restarted
- Results
- Supporting services (node controller, replication controller manager, scheduler, etc) VM shutdown or crashes
- currently those are colocated with the apiserver, and their unavailability has similar consequences as apiserver
- in future, these will be replicated as well and may not be co-located
- they do not have their own persistent state
- Individual node (VM or physical machine) shuts down
- Results
- pods on that Node stop running
- Results
- Network partition
- Results
- partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down. (Assuming the master VM ends up in partition A.)
- Results
- Kubelet software fault
- Results
- crashing kubelet cannot start new pods on the node
- kubelet might delete the pods or not
- node marked unhealthy
- replication controllers start new pods elsewhere
- Results
- Cluster operator error
Action: Use IaaS provider’s automatic VM restarting feature for IaaS VMs
- Mitigates: Apiserver VM shutdown or apiserver crashing
- Mitigates: Supporting services VM shutdown or crashes
-
- Mitigates: Apiserver backing storage lost
Action: Use high-availability configuration
- Mitigates: Control plane node shutdown or control plane components (scheduler, API server, controller-manager) crashing
- Will tolerate one or more simultaneous node or component failures
- Mitigates: API server backing storage (i.e., etcd’s data directory) lost
- Assumes HA (highly-available) etcd configuration
- Mitigates: Control plane node shutdown or control plane components (scheduler, API server, controller-manager) crashing
Action: Snapshot apiserver PDs/EBS-volumes periodically
- Mitigates: Apiserver backing storage lost
- Mitigates: Some cases of operator error
- Mitigates: Some cases of Kubernetes software fault
Action: use replication controller and services in front of pods
- Mitigates: Node shutdown
- Mitigates: Kubelet software fault
-
- Mitigates: Kubelet software fault