Deploying machine health checks

    You can only apply a machine health check to control plane machines on clusters that use control plane machine sets.

    To monitor machine health, create a MachineHealthCheck resource that defines the configuration for the controller. Specify a condition to check, such as a node staying in the NotReady status for five minutes or reporting a permanent condition through the node-problem-detector, and a label that selects the set of machines to monitor.

    The controller that observes a MachineHealthCheck resource checks for the defined condition. If a machine fails the health check, the machine is automatically deleted and a new one is created to take its place. When a machine is deleted, you see a machine deleted event.

    To limit the disruptive impact of machine deletion, the controller drains and deletes only one node at a time. If the targeted pool of machines contains more unhealthy machines than the maxUnhealthy threshold allows, remediation stops so that you can intervene manually.

    Consider the timeouts carefully, accounting for workloads and requirements.

    • Long timeouts can result in long periods of downtime for the workload on the unhealthy machine.

    • Timeouts that are too short can result in a remediation loop. For example, the timeout for checking the NotReady status must be long enough to allow the machine to complete the startup process.

    To stop the check, remove the resource.

    There are limitations to consider before deploying a machine health check:

    • Only machines owned by a machine set are remediated by a machine health check.

    • If the node for a machine is removed from the cluster, a machine health check considers the machine to be unhealthy and remediates it immediately.

    • If the corresponding node for a machine does not join the cluster after the nodeStartupTimeout, the machine is remediated.

    • A machine is remediated immediately if the Machine resource phase is Failed.

    The MachineHealthCheck resource for all cloud-based installation types, that is, for any installation type other than bare metal, resembles the following YAML file:

    apiVersion: machine.openshift.io/v1beta1
    kind: MachineHealthCheck
    metadata:
      name: example # (1)
      namespace: openshift-machine-api
    spec:
      selector:
        matchLabels:
          machine.openshift.io/cluster-api-machine-role: <role> # (2)
          machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone> # (3)
      unhealthyConditions:
      - type: "Ready"
        timeout: "300s" # (4)
        status: "False"
      - type: "Ready"
        timeout: "300s" # (4)
        status: "Unknown"
      maxUnhealthy: "40%" # (5)
      nodeStartupTimeout: "10m" # (6)

    (1) Specify the name of the machine health check to deploy.
    (2) Specify a label for the machine pool that you want to check.
    (3) Specify the machine set to track in <cluster_name>-<label>-<zone> format. For example, prod-node-us-east-1a.
    (4) Specify the timeout duration for a node condition. If a condition is met for the duration of the timeout, the machine will be remediated. Long timeouts can result in long periods of downtime for a workload on an unhealthy machine.
    (5) Specify the number of machines allowed to be concurrently remediated in the targeted pool. This can be set as a percentage or an integer. If the number of unhealthy machines exceeds the limit set by maxUnhealthy, remediation is not performed.
    (6) Specify the timeout duration that a machine health check must wait for a node to join the cluster before a machine is determined to be unhealthy.

    Short-circuiting machine health check remediation

    Short-circuiting ensures that machine health checks remediate machines only when the cluster is healthy. Short-circuiting is configured through the maxUnhealthy field in the MachineHealthCheck resource.

    If the user defines a value for the maxUnhealthy field, before remediating any machines, the MachineHealthCheck compares the value of maxUnhealthy with the number of machines within its target pool that it has determined to be unhealthy. Remediation is not performed if the number of unhealthy machines exceeds the maxUnhealthy limit.

    If maxUnhealthy is not set, the value defaults to 100% and the machines are remediated regardless of the state of the cluster.

    The appropriate maxUnhealthy value depends on the scale of the cluster you deploy and how many machines the MachineHealthCheck covers. For example, you can use the maxUnhealthy value to cover multiple compute machine sets across multiple availability zones so that if you lose an entire zone, your maxUnhealthy setting prevents further remediation within the cluster. In global Azure regions that do not have multiple availability zones, you can use availability sets to ensure high availability.
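
As a sketch of the multi-zone case, the following resource selects machines by role label only, so that it covers every compute machine set whose machines carry that label rather than a single zone. The resource name, the worker role value, the three-zone layout, and the 30% threshold are illustrative assumptions, not recommended settings.

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: compute-all-zones # hypothetical name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      # No cluster-api-machineset label here, so the check spans all machine
      # sets whose machines carry this role label (assumed to be "worker").
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - type: "Ready"
    timeout: "300s"
    status: "False"
  - type: "Ready"
    timeout: "300s"
    status: "Unknown"
  # Assumption: three equal-sized zones. Losing a whole zone makes about 33%
  # of the machines unhealthy, which exceeds this limit and stops remediation.
  maxUnhealthy: "30%"
```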

    If you configure a MachineHealthCheck resource for the control plane, set the value of maxUnhealthy to 1.

    This configuration ensures that the machine health check takes no action when multiple control plane machines appear to be unhealthy. Multiple unhealthy control plane machines can indicate that the etcd cluster is degraded or that a scaling operation to replace a failed machine is in progress.

    If the etcd cluster is degraded, manual intervention might be required. If a scaling operation is in progress, the machine health check should allow it to finish.
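
A control plane machine health check that follows this guidance might look like the sketch below. The resource name is hypothetical, and the master role label and the 300s timeouts are assumptions carried over from the sample earlier in this document; verify the labels on your control plane machines before relying on them.

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: control-plane-health # hypothetical name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      # Assumed label value for control plane machines.
      machine.openshift.io/cluster-api-machine-role: master
  unhealthyConditions:
  - type: "Ready"
    timeout: "300s"
    status: "False"
  - type: "Ready"
    timeout: "300s"
    status: "Unknown"
  # With an absolute value of 1, no remediation happens when two or more
  # control plane machines are unhealthy at the same time.
  maxUnhealthy: 1
```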

    Setting maxUnhealthy by using an absolute value

    If maxUnhealthy is set to 2:

    • Remediation will be performed if 2 or fewer nodes are unhealthy

    • Remediation will not be performed if 3 or more nodes are unhealthy

    These values are independent of how many machines are being checked by the machine health check.

    Setting maxUnhealthy by using percentages

    If maxUnhealthy is set to 40% and there are 25 machines being checked:

    • Remediation will be performed if 10 or fewer nodes are unhealthy

    • Remediation will not be performed if 11 or more nodes are unhealthy

    If maxUnhealthy is set to 40% and there are 6 machines being checked:

    • Remediation will be performed if 2 or fewer nodes are unhealthy

    • Remediation will not be performed if 3 or more nodes are unhealthy

    When the maxUnhealthy percentage of the checked machines is not a whole number, the allowed number of unhealthy machines is rounded down. For example, 40% of 6 machines is 2.4, which rounds down to 2.

    You can create a MachineHealthCheck resource for machine sets in your cluster.

    Prerequisites

    • Install the oc command line interface.

    Procedure

    1. Create a healthcheck.yml file that contains the definition of your machine health check.

    2. Apply the healthcheck.yml file to your cluster:

    $ oc apply -f healthcheck.yml

    You can configure and deploy a machine health check to detect and repair unhealthy bare metal nodes.

    In a bare metal cluster, remediation of nodes is critical to ensuring the overall health of the cluster. Physically remediating a cluster can be challenging. Any delay in putting a machine into a safe or operational state increases both the time the cluster remains degraded and the risk that subsequent failures bring the cluster offline. Power-based remediation helps counter such challenges.

    Instead of reprovisioning the nodes, power-based remediation uses a power controller to power off an inoperable node. This type of remediation is also called power fencing.

    OKD uses the MachineHealthCheck controller to detect faulty bare metal nodes. Power-based remediation is fast and reboots faulty nodes instead of removing them from the cluster.

    Power-based remediation provides the following capabilities:

    • Allows the recovery of control plane nodes

    • Reduces the risk of data loss in hyperconverged environments

    • Reduces the downtime associated with recovering physical machines

    Machine deletion on a bare metal cluster triggers reprovisioning of the bare metal host. Bare metal reprovisioning is usually a lengthy process, during which the cluster is missing compute resources and applications might be interrupted.

    There are two ways to change the default remediation process from machine deletion to host power-cycle:

    1. Annotate the MachineHealthCheck resource with the machine.openshift.io/remediation-strategy: external-baremetal annotation.

    2. Create a Metal3RemediationTemplate resource, and refer to it in the spec of the MachineHealthCheck resource.

    Understanding the annotation-based remediation process

    The remediation process operates as follows:

    1. The MachineHealthCheck (MHC) controller detects that a node is unhealthy.

    2. The MHC notifies the bare metal machine controller, which requests that the unhealthy node be powered off.

    3. After the power is off, the node is deleted, which allows the cluster to reschedule the affected workload on other nodes.

    4. The bare metal machine controller requests to power on the node.

    5. After the node is up, the node re-registers itself with the cluster, resulting in the creation of a new node.

    6. After the node is recreated, the bare metal machine controller restores the annotations and labels that existed on the unhealthy node before its deletion.

    If the power operations did not complete, the bare metal machine controller triggers the reprovisioning of the unhealthy node unless this is a control plane node or a node that was provisioned externally.

    Understanding the metal3-based remediation process

    The remediation process operates as follows:

    1. The MachineHealthCheck (MHC) controller detects that a node is unhealthy.

    2. The MHC creates a metal3 remediation custom resource for the metal3 remediation controller, which requests that the unhealthy node be powered off.

    3. After the power is off, the node is deleted, which allows the cluster to reschedule the affected workload on other nodes.

    4. The metal3 remediation controller requests to power on the node.

    5. After the node is up, the node re-registers itself with the cluster, resulting in the creation of a new node.

    6. After the node is recreated, the metal3 remediation controller restores the annotations and labels that existed on the unhealthy node before its deletion.

    If the power operations did not complete, the metal3 remediation controller triggers the reprovisioning of the unhealthy node unless this is a control plane node or a node that was provisioned externally.

    Creating a MachineHealthCheck resource for bare metal

    Prerequisites

    • OKD is installed using installer-provisioned infrastructure (IPI).

    • Access to BMC credentials (or BMC access to each node).

    • Network access to the BMC interface of the unhealthy node.

    Procedure

    1. Create a healthcheck.yaml file that contains the definition of your machine health check.

    2. Apply the healthcheck.yaml file to your cluster using the following command:

    $ oc apply -f healthcheck.yaml

    Sample MachineHealthCheck resource for bare metal, annotation-based remediation
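
A resource consistent with the callouts that follow might resemble the YAML below. It is a sketch assembled from the cloud-based sample earlier in this document plus the external-baremetal annotation; the placeholder values, timeouts, and 40% threshold are carried over from that sample as assumptions. The numbered comments correspond to the callouts.

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example # (1)
  namespace: openshift-machine-api
  annotations:
    machine.openshift.io/remediation-strategy: external-baremetal # (2)
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: <role> # (3)
      machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone> # (4)
  unhealthyConditions:
  - type: "Ready"
    timeout: "300s" # (5)
    status: "False"
  - type: "Ready"
    timeout: "300s" # (5)
    status: "Unknown"
  maxUnhealthy: "40%" # (6)
  nodeStartupTimeout: "10m" # (7)
```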

    (1) Specify the name of the machine health check to deploy.
    (2) For bare metal clusters, you must include the machine.openshift.io/remediation-strategy: external-baremetal annotation in the annotations section to enable power-cycle remediation. With this remediation strategy, unhealthy hosts are rebooted instead of removed from the cluster.
    (3) Specify a label for the machine pool that you want to check.
    (4) Specify the compute machine set to track in <cluster_name>-<label>-<zone> format. For example, prod-node-us-east-1a.
    (5) Specify the timeout duration for the node condition. If the condition is met for the duration of the timeout, the machine will be remediated. Long timeouts can result in long periods of downtime for a workload on an unhealthy machine.
    (6) Specify the number of machines allowed to be concurrently remediated in the targeted pool. This can be set as a percentage or an integer. If the number of unhealthy machines exceeds the limit set by maxUnhealthy, remediation is not performed.
    (7) Specify the timeout duration that a machine health check must wait for a node to join the cluster before a machine is determined to be unhealthy.

    Sample MachineHealthCheck resource for bare metal, metal3-based remediation

    apiVersion: machine.openshift.io/v1beta1
    kind: MachineHealthCheck
    metadata:
      name: example
      namespace: openshift-machine-api
    spec:
      selector:
        matchLabels:
          machine.openshift.io/cluster-api-machine-role: <role>
          machine.openshift.io/cluster-api-machine-type: <role>
          machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone>
      remediationTemplate:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: Metal3RemediationTemplate
        name: metal3-remediation-template
        namespace: openshift-machine-api
      unhealthyConditions:
      - type: "Ready"
        timeout: "300s"

    Sample Metal3RemediationTemplate resource for bare metal, metal3-based remediation
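
A template that the preceding MachineHealthCheck sample could reference might look like the following sketch. The name and namespace match the Metal3RemediationTemplate reference in that sample; the Reboot strategy, retry limit, and timeout values are illustrative assumptions that you should adjust for your hardware.

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3RemediationTemplate
metadata:
  name: metal3-remediation-template
  namespace: openshift-machine-api
spec:
  template:
    spec:
      strategy:
        type: Reboot   # power-cycle the host instead of reprovisioning it
        retryLimit: 1  # assumed number of reboot attempts
        timeout: 5m0s  # assumed wait between retries
```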

    The matchLabels are examples only; you must map your machine groups based on your specific needs. The annotations section does not apply to metal3-based remediation. Annotation-based remediation and metal3-based remediation are mutually exclusive.

    To troubleshoot an issue with power-based remediation, verify the following:

    • BMC is connected to the control plane node that is responsible for running the remediation task.