Restarting the cluster gracefully

    Even though the cluster is expected to be functional after the restart, the cluster might not recover due to unexpected conditions, for example:

    • etcd data corruption during shutdown

    • Node failure due to hardware

    • Network connectivity issues

    If your cluster fails to recover, follow the steps to restore to a previous cluster state.

    Restarting the cluster

    You can restart your cluster after it has been shut down gracefully.

    Prerequisites

    • You have access to the cluster as a user with the cluster-admin role.

    • This procedure assumes that you gracefully shut down the cluster.

    Procedure

    1. Start all cluster machines.

      Use the appropriate method for your cloud environment to start the machines, for example, from your cloud provider’s web console.

      Wait approximately 10 minutes before continuing to check the status of control plane nodes.
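
      As an alternative to a fixed wait, the check can be polled until it succeeds. The following is a minimal sketch, not part of the documented procedure: wait_until is a hypothetical helper name, and the 600-second timeout and 15-second interval in the usage comment are assumptions chosen to match the roughly 10-minute wait described above.

```shell
# Poll a command until it succeeds or a timeout (in seconds) elapses.
# wait_until is a hypothetical helper, not part of the oc CLI.
# Arguments: <timeout_seconds> <interval_seconds> <command> [args...]
wait_until() {
  wu_timeout=$1
  wu_interval=$2
  shift 2
  wu_deadline=$(( $(date +%s) + wu_timeout ))
  # Run the command quietly; retry until it exits 0 or the deadline passes.
  until "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$wu_deadline" ]; then
      return 1
    fi
    sleep "$wu_interval"
  done
  return 0
}

# Usage (assumed values): wait up to 10 minutes for the API server to answer.
# wait_until 600 15 oc get nodes
```

      A zero exit status only shows that the API server answered; the node status still has to be checked as described in the next step.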

    2. Verify that all control plane nodes are ready.

      $ oc get nodes -l node-role.kubernetes.io/master

      The control plane nodes are ready if the status is Ready, as shown in the following output:

      NAME                           STATUS   ROLES    AGE   VERSION
      ip-10-0-168-251.ec2.internal   Ready    master   75m   v1.26.0
      ip-10-0-170-223.ec2.internal   Ready    master   75m   v1.26.0
      ip-10-0-211-16.ec2.internal    Ready    master   75m   v1.26.0

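
      On a larger cluster it can help to print only the nodes that still need attention. The following is a minimal sketch under stated assumptions: not_ready_nodes is a hypothetical helper name, and it assumes the default column layout shown above, with the node name in the first column and the status in the second.

```shell
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Reads `oc get nodes` output on stdin; the header row is skipped.
# A cordoned node reports "Ready,SchedulingDisabled" and is listed as well.
# not_ready_nodes is a hypothetical helper, not part of the oc CLI.
not_ready_nodes() {
  awk '$1 != "NAME" && NF >= 2 && $2 != "Ready" { print $1 }'
}

# Usage:
# oc get nodes -l node-role.kubernetes.io/master | not_ready_nodes
```

      Empty output means that every listed node reported Ready.
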
    3. If the control plane nodes are not ready, check whether there are any pending certificate signing requests (CSRs) that must be approved.

      1. Get the list of current CSRs:

        $ oc get csr

      2. Review the details of a CSR to verify that it is valid:

        $ oc describe csr <csr_name>

        <csr_name> is the name of a CSR from the list of current CSRs.

      3. Approve each valid CSR:

        $ oc adm certificate approve <csr_name>

    4. After the control plane nodes are ready, verify that all worker nodes are ready.

      $ oc get nodes -l node-role.kubernetes.io/worker

      The worker nodes are ready if the status is Ready, as shown in the following output:

      NAME                           STATUS   ROLES    AGE   VERSION
      ip-10-0-179-95.ec2.internal    Ready    worker   64m   v1.26.0
      ip-10-0-250-100.ec2.internal   Ready    worker   64m   v1.26.0

    5. If the worker nodes are not ready, check whether there are any pending certificate signing requests (CSRs) that must be approved.

      1. Get the list of current CSRs:

        $ oc get csr

      2. Review the details of a CSR to verify that it is valid:

        $ oc describe csr <csr_name>

        <csr_name> is the name of a CSR from the list of current CSRs.

      3. Approve each valid CSR:

        $ oc adm certificate approve <csr_name>

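
      When the CSR list is long, the pending entries can be picked out before reviewing them. The following is a minimal sketch under stated assumptions: pending_csrs is a hypothetical helper name, and it assumes that the condition is the last column of the `oc get csr` output and reads exactly Pending for requests that have not been approved.

```shell
# Print the names of CSRs whose last column (the condition) is "Pending".
# Reads `oc get csr` output on stdin; the header row is skipped.
# pending_csrs is a hypothetical helper, not part of the oc CLI.
pending_csrs() {
  awk '$1 != "NAME" && $NF == "Pending" { print $1 }'
}

# Usage: review every pending CSR before approving any of them.
# for csr in $(oc get csr | pending_csrs); do
#   oc describe csr "$csr"
# done
```

      The sketch deliberately stops at review. Approval stays a separate, manual step so that only CSRs you have verified as valid are approved.
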
    6. Verify that the cluster started properly.

      1. Check that there are no degraded cluster Operators:

        $ oc get clusteroperators

        Check that there are no cluster Operators with the DEGRADED condition set to True.

        NAME                      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
        authentication            4.13.0    True        False         False      59m
        cloud-credential          4.13.0    True        False         False      85m
        cluster-autoscaler        4.13.0    True        False         False      73m
        config-operator           4.13.0    True        False         False      73m
        csi-snapshot-controller   4.13.0    True        False         False      66m
        dns                       4.13.0    True        False         False      76m
        etcd                      4.13.0    True        False         False      76m
        ...

      2. Check that all nodes are in the Ready state:

        $ oc get nodes

        Check that the status for all nodes is Ready.

        NAME                           STATUS   ROLES    AGE   VERSION
        ip-10-0-168-251.ec2.internal   Ready    master   82m   v1.26.0
        ip-10-0-170-223.ec2.internal   Ready    master   82m   v1.26.0
        ip-10-0-179-95.ec2.internal    Ready    worker   70m   v1.26.0
        ip-10-0-182-134.ec2.internal   Ready    worker   70m   v1.26.0
        ip-10-0-211-16.ec2.internal    Ready    master   82m   v1.26.0

    If the cluster did not start properly, you might need to restore your cluster using an etcd backup.
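
    To decide whether the cluster started properly, the cluster Operator check can be reduced to the entries that matter. The following is a minimal sketch under stated assumptions: degraded_operators is a hypothetical helper name, and it assumes the column order shown in the cluster Operator output (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE) with no empty cells in the first five columns.

```shell
# Print the names of cluster Operators whose DEGRADED column is "True".
# Reads `oc get clusteroperators` output on stdin; the header row is skipped.
# Assumes DEGRADED is the fifth whitespace-separated column.
# degraded_operators is a hypothetical helper, not part of the oc CLI.
degraded_operators() {
  awk '$1 != "NAME" && $5 == "True" { print $1 }'
}

# Usage:
# oc get clusteroperators | degraded_operators
```

    Empty output means that no cluster Operator reported DEGRADED as True. Any name that is printed points to an Operator to investigate before considering a restore.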