Troubleshooting Clusters

    This document covers cluster troubleshooting; it assumes you have already ruled out your application as the root cause of the problem you are experiencing. See the application troubleshooting guide for tips on application debugging. You may also visit the troubleshooting overview document for more information.

    The first thing to check in your cluster is whether all of your nodes are registered correctly.

    Run the following command:

        kubectl get nodes

    Verify that all of the nodes you expect to see are present and that they are all in the Ready state.

    To get detailed information about the overall health of your cluster, you can run:

        kubectl cluster-info dump
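
    On a cluster with many nodes, scanning the table by eye is error-prone. The following is a minimal sketch, not an official tool: the helper name not_ready_nodes is hypothetical, and it assumes the default table layout of kubectl get nodes, with the node name in the first column and the status in the second.

```shell
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Reads the table printed by `kubectl get nodes` on standard input.
# Note: a cordoned node reports "Ready,SchedulingDisabled" and is
# therefore listed as well, which is usually what you want to notice.
not_ready_nodes() {
    # NR > 1 skips the header row; $1 is NAME, $2 is STATUS.
    awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Usage against a live cluster:
#   kubectl get nodes | not_ready_nodes
```

    An empty result means every registered node reports Ready. It does not tell you whether a node you expected is missing from the list, so still compare the node count against your inventory.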

    Sometimes when debugging it can be useful to look at the status of a node, for example because you’ve noticed strange behavior of a Pod that’s running on the node, or to find out why a Pod won’t schedule onto the node. As with Pods, you can use kubectl describe node and kubectl get node -o yaml to retrieve detailed information about nodes. For example, here’s what you’ll see if a node is down (disconnected from the network, or the kubelet dies and won’t restart, etc.). Notice the events that show the node is NotReady, and also notice that the pods are no longer running (they are evicted after five minutes of NotReady status).

        kubectl get nodes

        NAME                   STATUS     ROLES    AGE   VERSION
        kube-worker-1          NotReady   <none>   1h    v1.23.3
        kubernetes-node-bols   Ready      <none>   1h    v1.23.3
        kubernetes-node-st6x   Ready      <none>   1h    v1.23.3
        kubernetes-node-unaj   Ready      <none>   1h    v1.23.3

        kubectl describe node kube-worker-1

        Name:               kube-worker-1
        Roles:              <none>
        Labels:             beta.kubernetes.io/arch=amd64
                            beta.kubernetes.io/os=linux
                            kubernetes.io/arch=amd64
                            kubernetes.io/hostname=kube-worker-1
                            kubernetes.io/os=linux
        Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /run/containerd/containerd.sock
                            node.alpha.kubernetes.io/ttl: 0
                            volumes.kubernetes.io/controller-managed-attach-detach: true
        CreationTimestamp:  Thu, 17 Feb 2022 16:46:30 -0500
        Taints:             node.kubernetes.io/unreachable:NoExecute
                            node.kubernetes.io/unreachable:NoSchedule
        Unschedulable:      false
        Lease:
          HolderIdentity:  kube-worker-1
          AcquireTime:     <unset>
          RenewTime:       Thu, 17 Feb 2022 17:13:09 -0500
        Conditions:
          Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
          ----                 ------    -----------------                 ------------------                ------              -------
          NetworkUnavailable   False     Thu, 17 Feb 2022 17:09:13 -0500   Thu, 17 Feb 2022 17:09:13 -0500   WeaveIsUp           Weave pod has set this
          MemoryPressure       Unknown   Thu, 17 Feb 2022 17:12:40 -0500   Thu, 17 Feb 2022 17:13:52 -0500   NodeStatusUnknown   Kubelet stopped posting node status.
          DiskPressure         Unknown   Thu, 17 Feb 2022 17:12:40 -0500   Thu, 17 Feb 2022 17:13:52 -0500   NodeStatusUnknown   Kubelet stopped posting node status.
          PIDPressure          Unknown   Thu, 17 Feb 2022 17:12:40 -0500   Thu, 17 Feb 2022 17:13:52 -0500   NodeStatusUnknown   Kubelet stopped posting node status.
          Ready                Unknown   Thu, 17 Feb 2022 17:12:40 -0500   Thu, 17 Feb 2022 17:13:52 -0500   NodeStatusUnknown   Kubelet stopped posting node status.
        Addresses:
          InternalIP:  192.168.0.113
          Hostname:    kube-worker-1
        Capacity:
          cpu:                2
          ephemeral-storage:  15372232Ki
          hugepages-2Mi:      0
          memory:             2025188Ki
          pods:               110
        Allocatable:
          cpu:                2
          ephemeral-storage:  14167048988
          hugepages-2Mi:      0
          memory:             1922788Ki
          pods:               110
        System Info:
          Machine ID:                 9384e2927f544209b5d7b67474bbf92b
          System UUID:                aa829ca9-73d7-064d-9019-df07404ad448
          Boot ID:                    5a295a03-aaca-4340-af20-1327fa5dab5c
          Kernel Version:             5.13.0-28-generic
          OS Image:                   Ubuntu 21.10
          Architecture:               amd64
          Container Runtime Version:  containerd://1.5.9
          Kubelet Version:            v1.23.3
          Kube-Proxy Version:         v1.23.3
        Non-terminated Pods:          (4 in total)
          Namespace     Name                                CPU Requests   CPU Limits   Memory Requests   Memory Limits   Age
          ---------     ----                                ------------   ----------   ---------------   -------------   ---
          default       nginx-deployment-67d4bdd6f5-cx2nz   500m (25%)     500m (25%)   128Mi (6%)        128Mi (6%)      23m
          default       nginx-deployment-67d4bdd6f5-w6kd7   500m (25%)     500m (25%)   128Mi (6%)        128Mi (6%)      23m
          kube-system   kube-proxy-dnxbz                    0 (0%)         0 (0%)       0 (0%)            0 (0%)          28m
          kube-system   weave-net-gjxxp                     100m (5%)      0 (0%)       200Mi (10%)       0 (0%)          28m
        Allocated resources:
          (Total limits may be over 100 percent, i.e., overcommitted.)
          Resource            Requests      Limits
          --------            --------      ------
          cpu                 1100m (55%)   1 (50%)
          memory              456Mi (24%)   256Mi (13%)
          ephemeral-storage   0 (0%)        0 (0%)
          hugepages-2Mi       0 (0%)        0 (0%)
        Events:
        ...

        kubectl get node kube-worker-1 -o yaml

        apiVersion: v1
        kind: Node
        metadata:
          annotations:
            kubeadm.alpha.kubernetes.io/cri-socket: /run/containerd/containerd.sock
            node.alpha.kubernetes.io/ttl: "0"
            volumes.kubernetes.io/controller-managed-attach-detach: "true"
          creationTimestamp: "2022-02-17T21:46:30Z"
          labels:
            beta.kubernetes.io/arch: amd64
            beta.kubernetes.io/os: linux
            kubernetes.io/arch: amd64
            kubernetes.io/hostname: kube-worker-1
            kubernetes.io/os: linux
          name: kube-worker-1
          resourceVersion: "4026"
          uid: 98efe7cb-2978-4a0b-842a-1a7bf12c05f8
        spec: {}
        status:
          addresses:
          - address: 192.168.0.113
            type: InternalIP
          - address: kube-worker-1
            type: Hostname
          allocatable:
            cpu: "2"
            ephemeral-storage: "14167048988"
            hugepages-2Mi: "0"
            memory: 1922788Ki
            pods: "110"
          capacity:
            cpu: "2"
            ephemeral-storage: 15372232Ki
            hugepages-2Mi: "0"
            pods: "110"
          conditions:
          - lastHeartbeatTime: "2022-02-17T22:20:32Z"
            lastTransitionTime: "2022-02-17T22:20:32Z"
            message: Weave pod has set this
            reason: WeaveIsUp
            status: "False"
            type: NetworkUnavailable
          - lastHeartbeatTime: "2022-02-17T22:20:15Z"
            lastTransitionTime: "2022-02-17T22:13:25Z"
            message: kubelet has sufficient memory available
            reason: KubeletHasSufficientMemory
            status: "False"
            type: MemoryPressure
          - lastHeartbeatTime: "2022-02-17T22:20:15Z"
            lastTransitionTime: "2022-02-17T22:13:25Z"
            message: kubelet has no disk pressure
            reason: KubeletHasNoDiskPressure
            status: "False"
            type: DiskPressure
          - lastHeartbeatTime: "2022-02-17T22:20:15Z"
            lastTransitionTime: "2022-02-17T22:13:25Z"
            message: kubelet has sufficient PID available
            reason: KubeletHasSufficientPID
            status: "False"
            type: PIDPressure
          - lastHeartbeatTime: "2022-02-17T22:20:15Z"
            lastTransitionTime: "2022-02-17T22:15:15Z"
            message: kubelet is posting ready status. AppArmor enabled
            reason: KubeletReady
            status: "True"
            type: Ready
          daemonEndpoints:
            kubeletEndpoint:
              Port: 10250
          nodeInfo:
            architecture: amd64
            bootID: 22333234-7a6b-44d4-9ce1-67e31dc7e369
            containerRuntimeVersion: containerd://1.5.9
            kernelVersion: 5.13.0-28-generic
            kubeProxyVersion: v1.23.3
            kubeletVersion: v1.23.3
            machineID: 9384e2927f544209b5d7b67474bbf92b
            operatingSystem: linux
            osImage: Ubuntu 21.10
            systemUUID: aa829ca9-73d7-064d-9019-df07404ad448
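
    The full describe and YAML output is long, and the part that usually matters first is the conditions list. As a rough sketch, the two helpers below print each condition as type=status and then keep only the ones that look unhealthy. The function names are hypothetical; the JSONPath expression relies on the status.conditions field shown in the YAML output, and the filter assumes that Ready should be True while every other condition should be False.

```shell
# Print one "Type=Status" line per node condition, for example "Ready=True".
# $1 is the node name.
node_conditions() {
    kubectl get node "$1" \
        -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# Keep only conditions in an unexpected state: Ready must be "True",
# all other conditions (pressure, network unavailable) must be "False".
# "Unknown" is therefore always reported.
unhealthy_conditions() {
    awk -F= '($1 == "Ready" && $2 != "True") || ($1 != "Ready" && $2 != "False")'
}

# Usage against a live cluster:
#   node_conditions kube-worker-1 | unhealthy_conditions
```

    For the down node in the describe output, this would report the conditions whose status is Unknown, which is the signature of a kubelet that has stopped posting node status.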

    For now, digging deeper into the cluster requires logging into the relevant machines. Here are the locations of the relevant log files. On systemd-based systems, you may need to use journalctl instead of examining log files.

    Control Plane nodes

    • /var/log/kube-apiserver.log - API Server, responsible for serving the API
    • /var/log/kube-scheduler.log - Scheduler, responsible for making scheduling decisions
    • /var/log/kube-controller-manager.log - Controller Manager, a component that runs most Kubernetes built-in controllers, with the notable exception of scheduling (the kube-scheduler handles scheduling)

    Worker Nodes

    • /var/log/kubelet.log - logs from the kubelet, responsible for running containers on the node
    • /var/log/kube-proxy.log - logs from kube-proxy, which is responsible for directing traffic to Service endpoints
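
    Whether a component logs to a file or to the systemd journal depends on how the cluster was installed. The sketch below picks between the two; show_logs is a hypothetical helper, and it assumes that the systemd unit name matches the log file name (for example kubelet and /var/log/kubelet.log), which you should verify on your own machines.

```shell
# Show the most recent log lines for a node component.
# $1: component name, e.g. kubelet (assumed to be both the systemd unit
#     name and the base name of the file under /var/log)
# $2: number of lines to show (optional, defaults to 100)
show_logs() {
    component=$1
    lines=${2:-100}
    if command -v journalctl >/dev/null 2>&1; then
        # systemd-based system: read the unit's journal.
        journalctl -u "$component" -n "$lines" --no-pager
    else
        # Fall back to the plain log file listed above.
        tail -n "$lines" "/var/log/$component.log"
    fi
}

# Usage on the affected machine:
#   show_logs kubelet 200
```

    If journalctl is installed but the component runs as a container rather than a systemd unit, the journal will be empty for that name, and you will need to read the container's logs instead.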

    This is an incomplete list of things that could go wrong, and how to adjust your cluster setup to mitigate the problems.

    Contributing causes

    • VM(s) shutdown
    • Network partition within cluster, or between cluster and users
    • Crashes in Kubernetes software
    • Data loss or unavailability of persistent storage (e.g. GCE PD or AWS EBS volume)
    • Operator error, for example misconfigured Kubernetes software or application software

    Specific scenarios

    • API server VM shutdown or apiserver crashing
      • Results
        • unable to stop, update, or start new pods, services, or replication controllers
        • existing pods and services should continue to work normally, unless they depend on the Kubernetes API
    • API server backing storage lost
      • Results
        • the kube-apiserver component fails to start successfully and become healthy
        • kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying
        • manual recovery or recreation of apiserver state necessary before apiserver is restarted
    • Supporting services (node controller, replication controller manager, scheduler, etc.) VM shutdown or crashes
      • currently these are colocated with the apiserver, and their unavailability has similar consequences to apiserver unavailability
      • in future, these will be replicated as well and may not be colocated
      • they do not have their own persistent state
    • Individual node (VM or physical machine) shuts down
      • Results
        • pods on that Node stop running
    • Network partition
      • Results
        • partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down. (Assuming the master VM ends up in partition A.)
    • Kubelet software fault
      • Results
        • crashing kubelet cannot start new pods on the node
        • the kubelet might or might not delete the pods
        • node marked unhealthy
        • replication controllers start new pods elsewhere
    • Cluster operator error
      • Results
        • loss of pods, services, etc
        • loss of apiserver backing store
        • users unable to read API
        • etc.
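
    For the node-level scenarios above (node shutdown, kubelet fault), a useful first question is which pods were scheduled on the affected node. A minimal sketch, assuming a hypothetical helper name pods_on_node and relying on the spec.nodeName field selector supported for pods:

```shell
# List every pod, in all namespaces, that is bound to the given node.
# $1 is the node name as shown by `kubectl get nodes`.
pods_on_node() {
    kubectl get pods --all-namespaces -o wide \
        --field-selector "spec.nodeName=$1"
}

# Usage against a live cluster:
#   pods_on_node kube-worker-1
```

    This only works while the API server is reachable. In the API server scenarios above, the command itself will fail, which is a signal in its own right.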

    Mitigations

    • Action: Use IaaS provider’s automatic VM restarting feature for IaaS VMs

      • Mitigates: Apiserver VM shutdown or apiserver crashing
      • Mitigates: Supporting services VM shutdown or crashes
    • Action: Use IaaS provider’s reliable storage (e.g. GCE PD or AWS EBS volume) for VMs with apiserver+etcd

      • Mitigates: Apiserver backing storage lost
    • Action: Use high-availability configuration

      • Mitigates: Control plane node shutdown or control plane components (scheduler, API server, controller-manager) crashing
        • Will tolerate one or more simultaneous node or component failures
      • Mitigates: API server backing storage (i.e., etcd’s data directory) lost
        • Assumes HA (highly-available) etcd configuration
    • Action: Snapshot apiserver PDs/EBS-volumes periodically

      • Mitigates: Apiserver backing storage lost
      • Mitigates: Some cases of operator error
      • Mitigates: Some cases of Kubernetes software fault
    • Action: Use replication controllers and services in front of pods

      • Mitigates: Node shutdown
      • Mitigates: Kubelet software fault
    • Action: Design applications (containers) to tolerate unexpected restarts

      • Mitigates: Node shutdown
      • Mitigates: Kubelet software fault
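
    As an illustration of putting replicated pods behind a service, the sketch below uses a Deployment rather than a replication controller; that substitution is ours, since Deployments are the more common way to run replicated pods on current clusters. The helper name replicated_service and the example values are hypothetical.

```shell
# Create a Deployment with several replicas and expose it through a Service,
# so that losing one node or kubelet does not take the application down.
# $1: name, $2: container image, $3: replica count, $4: service port
replicated_service() {
    name=$1
    image=$2
    replicas=$3
    port=$4
    # Only expose the Deployment if it was created successfully.
    kubectl create deployment "$name" --image="$image" --replicas="$replicas" &&
        kubectl expose deployment "$name" --port="$port"
}

# Usage against a live cluster:
#   replicated_service web nginx 3 80
```

    Replicas only help if they land on different nodes, so check where the pods are scheduled after creating them.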