Common Deployment Failures of TiDB in Kubernetes

    After creating a cluster, if the Pod is not created, you can diagnose it using the following commands:
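
    The command listing for this step is not shown here. As a minimal sketch, assuming the cluster is managed through a TidbCluster resource and that TiDB Operator names the PD StatefulSet ${cluster_name}-pd (both are assumptions, not confirmed by this page), the checks could be wrapped in a hypothetical helper named diagnose_cluster:

```shell
# Hypothetical helper: inspect the TidbCluster object and its StatefulSets
# to see why no Pod was created.
# Usage: diagnose_cluster <namespace> <cluster_name>
diagnose_cluster() {
  ns="$1"
  cluster="$2"
  kubectl get tidbclusters -n "$ns"
  kubectl describe tidbclusters -n "$ns" "$cluster"
  kubectl get statefulsets -n "$ns"
  # Assumption: the PD StatefulSet is named <cluster_name>-pd.
  kubectl describe statefulsets -n "$ns" "${cluster}-pd"
}
```

    The Events section at the end of each describe output is usually the most informative place to look.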

    After creating a backup/restore task, if the Pod is not created, you can diagnose it using the following commands:

    kubectl get backups -n ${namespace}
    kubectl get jobs -n ${namespace}
    kubectl describe backups -n ${namespace} ${backup_name}
    kubectl describe jobs -n ${namespace} ${backupjob_name}
    kubectl describe restores -n ${namespace} ${restore_name}

    A Pod usually stays in the Pending state because of insufficient resources, for example:

    • The StorageClass of the PVC used by PD, TiKV, TiFlash, Pump, Monitor, Backup, and Restore Pods does not exist, or there are not enough available PVs.
    • No node in the Kubernetes cluster can satisfy the CPU or memory resources requested by the Pod.
    • The number of TiKV or PD replicas and the number of nodes in the cluster do not satisfy the high availability scheduling policy of tidb-scheduler.

    You can check the specific reason for Pending by using the kubectl describe pod command:

    kubectl describe po -n ${namespace} ${pod_name}

    If the CPU or memory resources are insufficient, you can lower the CPU or memory resources requested by the corresponding component for scheduling, or add a new Kubernetes node.

    StorageClass of the PVC does not exist

    If the StorageClass of the PVC cannot be found, take the following steps:

    1. Get the available StorageClass in the cluster:

    2. Update the configuration file:

      • If you want to start the TiDB cluster, execute kubectl edit tc ${cluster_name} -n ${namespace} to update the cluster.
      • If you want to run a backup/restore task, first execute kubectl delete bk ${backup_name} -n ${namespace} to delete the old backup/restore task, and then execute kubectl apply -f backup.yaml to create a new backup/restore task.
    3. Delete the StatefulSet and the corresponding PVCs:

      kubectl delete pvc -n ${namespace} ${pvc_name} && \
      kubectl delete sts -n ${namespace} ${statefulset_name}
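
    The command for step 1 is not shown above. A minimal sketch using the standard kubectl get storageclass command follows; the helper names list_storageclasses and storageclass_exists are hypothetical:

```shell
# List every StorageClass in the cluster. StorageClass is cluster-scoped,
# so no namespace is needed.
list_storageclasses() {
  kubectl get storageclass
}

# Hypothetical helper: succeed only if the named StorageClass exists.
# Usage: storageclass_exists <storageclass_name>
storageclass_exists() {
  kubectl get storageclass "$1" >/dev/null 2>&1
}
```

    Checking the name with storageclass_exists before updating the configuration in step 2 helps avoid replacing one missing StorageClass with another.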

    Insufficient available PVs

    If a StorageClass exists in the cluster but there are not enough available PVs, you need to add PV resources. For Local PV, you can expand it by referring to Local PV Configuration.
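
    To see how many PVs are still free, the PV list can be filtered by phase. This is a minimal sketch; the helper name available_pvs is hypothetical, and the optional argument is whatever StorageClass name your PVCs request:

```shell
# Hypothetical helper: print the PVs whose phase is Available,
# optionally restricted to one StorageClass.
# Usage: available_pvs [storageclass_name]
available_pvs() {
  kubectl get pv --no-headers \
    -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,CLASS:.spec.storageClassName |
    awk -v sc="${1:-}" '$2 == "Available" && (sc == "" || $3 == sc)'
}
```

    If the output is empty, no PV in that StorageClass can currently be bound by a new PVC.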

    tidb-scheduler applies a high availability scheduling policy to PD and TiKV. For a given TiDB cluster with N replicas of PD or TiKV, at most M=(N-1)/2 PD Pods can be scheduled to each node (if N<3, then M=1), and at most M=ceil(N/3) TiKV Pods can be scheduled to each node (if N<3, then M=1; ceil means rounding up).

    If the Pod’s state becomes Pending because the high availability scheduling policy is not satisfied, you need to add more nodes to the cluster.
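
    The per-node limits described above can be worked out with a few lines of shell arithmetic. This is an illustrative sketch, not tidb-scheduler code: the function names are hypothetical, and (N-1)/2 is assumed to be integer division (rounded down), which this page does not state explicitly.

```shell
# Per-node Pod limits under the high availability policy, for N replicas.
# Assumption: (N-1)/2 is integer division (rounded down).
pd_max_per_node() {
  n="$1"
  if [ "$n" -lt 3 ]; then echo 1; else echo $(( (n - 1) / 2 )); fi
}

tikv_max_per_node() {
  n="$1"
  # ceil(N/3) computed with integer arithmetic as (N + 2) / 3.
  if [ "$n" -lt 3 ]; then echo 1; else echo $(( (n + 2) / 3 )); fi
}

# Hypothetical helper: smallest node count that can host N Pods
# when each node takes at most M of them, that is, ceil(N / M).
# Usage: min_nodes <N> <M>
min_nodes() {
  echo $(( ($1 + $2 - 1) / $2 ))
}
```

    For example, with 3 PD replicas, pd_max_per_node 3 gives 1, so min_nodes 3 1 gives 3: fewer than 3 schedulable nodes leaves at least one PD Pod Pending.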

    A Pod in the CrashLoopBackOff state means that the container in the Pod repeatedly aborts (in a loop of abort, restart by kubelet, abort). CrashLoopBackOff has many potential causes, so start by viewing the log of the current container:

    kubectl -n ${namespace} logs -f ${pod_name}

    View the log when the container was last restarted
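
    The command for this step is not shown here. A minimal sketch, using the -p (--previous) flag of kubectl logs, which prints the log of the previous container instance; the helper name previous_logs is hypothetical:

```shell
# Print the log of the previous container instance in the Pod, that is,
# the run that ended just before the most recent restart.
# Usage: previous_logs <namespace> <pod_name>
previous_logs() {
  kubectl -n "$1" logs -p "$2"
}
```

    This is often more useful than the current log, because the current container may not have reached the failing step yet.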

    After checking the error messages in the log, you can refer to Cannot start tidb-server, , and Cannot start pd-server for further troubleshooting.

    “cluster id mismatch”

    If you confirm that the TiKV instance should join the cluster as a new node and that the data on the PV should be deleted, delete the TiKV Pod and the corresponding PVC. The TiKV Pod is then rebuilt automatically and binds a new PV. When you configure local storage, delete the old data from local storage on the machine so that Kubernetes does not reuse it. In cluster operation and maintenance, manage PVs through the local volume provisioner and do not delete them forcibly. You can manage the lifecycle of a PV by creating and deleting PVCs and by setting the reclaimPolicy of the PV.
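
    Deleting the Pod and its PVC might look like the sketch below. It is destructive and is shown only as an illustration: the helper name reset_tikv_pod is hypothetical, and the use of --wait=false assumes that the PVC stays in the Terminating state while the Pod still mounts it, which is the usual Kubernetes behavior.

```shell
# Destructive: removes the TiKV data bound to this PVC.
# Usage: reset_tikv_pod <namespace> <pod_name> <pvc_name>
reset_tikv_pod() {
  # Assumption: the PVC cannot finish deleting while the Pod uses it,
  # so request the deletion without blocking, then delete the Pod.
  kubectl delete pvc -n "$1" "$3" --wait=false &&
    kubectl delete pod -n "$1" "$2"
}
```

    Confirm the Pod and PVC names with kubectl get pod and kubectl get pvc in the same namespace before running anything like this.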

    TiKV might fail to start when the ulimit values are too low. In this case, modify the /etc/security/limits.conf file on the Kubernetes node to raise the limits:

    root soft nofile 1000000
    root soft core unlimited
    root soft stack 10240
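
    To check whether the open-file limit already meets the value above, the output of ulimit -n can be compared against it. This is a minimal sketch; the helper names nofile_ok and report_nofile are hypothetical, and the default threshold of 1000000 is taken from the nofile line above.

```shell
# Hypothetical helper: succeed if an open-file limit meets the requirement.
# Usage: nofile_ok <current_limit> [required_limit]
nofile_ok() {
  current="$1"
  required="${2:-1000000}"
  [ "$current" = "unlimited" ] || [ "$current" -ge "$required" ]
}

# Report on the limit of the shell this runs in.
report_nofile() {
  if nofile_ok "$(ulimit -n)"; then
    echo "nofile limit OK: $(ulimit -n)"
  else
    echo "nofile limit too low: $(ulimit -n)"
  fi
}
```

    Run report_nofile in the environment whose limit you want to inspect, because the value reported by ulimit -n is specific to the current shell session.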

    PD Pod

    The PD Pod log contains output similar to the following:

    Thu Jan 13 14:55:52 IST 2022
    ;; Got recursion not available from 10.43.0.10, trying next server
    ;; Got recursion not available from 10.43.0.10, trying next server
    ;; Got recursion not available from 10.43.0.10, trying next server
    Server: 10.43.0.10
    Address: 10.43.0.10#53
    ** server can't find basic-pd-0.basic-pd-peer.default.svc: NXDOMAIN
    nslookup domain basic-pd-0.basic-pd-peer.default.svc failed

    This type of failure occurs when the cluster meets both of the following conditions:

    • There are two nameserver entries in /etc/resolv.conf, and the second one is not the IP address of CoreDNS.
    • The version of PD is one of the following:
      • Greater than or equal to v5.0.5.
      • Greater than or equal to v5.1.4.
      • All 5.3 versions.
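
    The first condition can be checked by reading the resolv.conf file directly. This is a minimal sketch with hypothetical helper names; both functions default to /etc/resolv.conf and accept another path as an argument.

```shell
# Count the nameserver entries in a resolv.conf file.
# Usage: count_nameservers [path]
count_nameservers() {
  grep -c '^nameserver' "${1:-/etc/resolv.conf}"
}

# Print the address of the second nameserver entry, if there is one.
# Usage: second_nameserver [path]
second_nameserver() {
  awk '$1 == "nameserver" { n++; if (n == 2) print $2 }' "${1:-/etc/resolv.conf}"
}
```

    If count_nameservers reports 2, compare the address printed by second_nameserver with the CoreDNS service IP of your cluster.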

    To address this failure, add startUpScriptVersion to TidbCluster as:
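
    The manifest for this step is not shown here. As a minimal sketch, the field could be set with kubectl patch. This assumes that startUpScriptVersion belongs under spec.pd of the TidbCluster, which this page does not confirm, so check the field path against the TidbCluster API reference for your TiDB Operator version. The helper name set_pd_startup_script_v1 is hypothetical.

```shell
# Assumption: startUpScriptVersion is set under spec.pd of the TidbCluster.
# Usage: set_pd_startup_script_v1 <namespace> <cluster_name>
set_pd_startup_script_v1() {
  kubectl patch tc "$2" -n "$1" --type merge \
    -p '{"spec":{"pd":{"startUpScriptVersion":"v1"}}}'
}
```

    The same change can be made interactively with kubectl edit tc ${cluster_name} -n ${namespace}.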

    This failure occurs because of a problem with nslookup in the base image (see detail in ). After you set startUpScriptVersion to v1, TiDB Operator uses dig instead of nslookup to check DNS.

    Other causes

    If you cannot confirm the cause from the log and ulimit is also a normal value, troubleshoot the issue by .