Deploying machine health checks
Machine health checks automatically repair unhealthy machines in a particular machine pool.
To monitor machine health, create a resource to define the configuration for a controller. Set a condition to check, such as staying in the NotReady
status for five minutes or displaying a permanent condition in the node-problem-detector, and a label for the set of machines to monitor.
You cannot apply a machine health check to a machine with the master role.
The controller that observes a MachineHealthCheck
resource checks for the defined condition. If a machine fails the health check, the machine is automatically deleted and one is created to take its place. When a machine is deleted, you see a machine deleted
event.
To limit the disruptive impact of the machine deletion, the controller drains and deletes only one node at a time. If there are more unhealthy machines than the maxUnhealthy
threshold allows for in the targeted pool of machines, remediation stops and therefore enables manual intervention.
Consider the timeouts carefully, accounting for workloads and requirements.
To stop the check, remove the resource.
For example, you should stop the check during the upgrade process because the nodes in the cluster might become temporarily unavailable. The MachineHealthCheck
might identify such nodes as unhealthy and reboot them. To avoid rebooting such nodes, remove any MachineHealthCheck
resource that you have deployed before updating the cluster. However, a MachineHealthCheck
resource that is deployed by default (such as machine-api-termination-handler
) cannot be removed and will be recreated.
There are limitations to consider before deploying a machine health check:
Control plane machines are not currently supported and are not remediated if they are unhealthy.
If the node for a machine is removed from the cluster, a machine health check considers the machine to be unhealthy and remediates it immediately.
If the corresponding node for a machine does not join the cluster after the nodeStartupTimeout, the machine is remediated.
A machine is remediated immediately if the Machine resource phase is Failed.
Additional resources
For more information about the node conditions you can define in a MachineHealthCheck CR, see About listing all the nodes in a cluster.
For more information about short-circuiting, see Short-circuiting machine health check remediation.
The MachineHealthCheck resource for all cloud-based installation types, other than bare metal, resembles the following YAML file:
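An illustrative sketch that follows the standard MachineHealthCheck schema is shown below; the resource name, role labels, machine set name, and timeout values are placeholders that you adapt to your cluster:

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: <role>
      machine.openshift.io/cluster-api-machine-type: <role>
      machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone>
  unhealthyConditions:
  - type: "Ready"
    status: "False"
    timeout: "300s"    # for example, remediate machines whose node stays NotReady for five minutes
  - type: "Ready"
    status: "Unknown"
    timeout: "300s"
  maxUnhealthy: "40%"
  nodeStartupTimeout: "10m"
```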
Short-circuiting machine health check remediation
Short-circuiting ensures that machine health checks remediate machines only when the cluster is healthy. Short-circuiting is configured through the maxUnhealthy field in the MachineHealthCheck resource.
If the user defines a value for the maxUnhealthy
field, before remediating any machines, the MachineHealthCheck
compares the value of maxUnhealthy
with the number of machines within its target pool that it has determined to be unhealthy. Remediation is not performed if the number of unhealthy machines exceeds the maxUnhealthy
limit.
The appropriate maxUnhealthy
value depends on the scale of the cluster you deploy and how many machines the MachineHealthCheck
covers. For example, you can use the maxUnhealthy
value to cover multiple machine sets across multiple availability zones so that if you lose an entire zone, your maxUnhealthy
setting prevents further remediation within the cluster.
The maxUnhealthy
field can be set as either an integer or percentage. There are different remediation implementations depending on the maxUnhealthy
value.
Setting maxUnhealthy
by using an absolute value
If maxUnhealthy
is set to 2
:
Remediation will be performed if 2 or fewer nodes are unhealthy
Remediation will not be performed if 3 or more nodes are unhealthy
These values are independent of how many machines are being checked by the machine health check.
Setting maxUnhealthy
by using percentages
If maxUnhealthy
is set to 40%
and there are 25 machines being checked:
Remediation will be performed if 10 or fewer nodes are unhealthy
Remediation will not be performed if 11 or more nodes are unhealthy
If maxUnhealthy is set to 40%
and there are 6 machines being checked:
Remediation will be performed if 2 or fewer nodes are unhealthy
Remediation will not be performed if 3 or more nodes are unhealthy
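As a minimal sketch of the two forms, the field is set in the resource spec like this (the values are illustrative only):

```yaml
spec:
  # Integer form: remediation stops once more than 2 machines in the pool are unhealthy.
  maxUnhealthy: 2
```

or

```yaml
spec:
  # Percentage form: evaluated against the number of machines in the targeted pool.
  maxUnhealthy: "40%"
```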
You can create a MachineHealthCheck
resource for all MachineSets
in your cluster. You should not create a MachineHealthCheck
resource that targets control plane machines.
Prerequisites
- Install the oc command line interface.
Procedure
Create a healthcheck.yml file that contains the definition of your machine health check.
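After you save the file, apply it to the cluster. A typical invocation, assuming the file name from the previous step, is:

```console
$ oc apply -f healthcheck.yml
```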
You can configure and deploy a machine health check to detect and repair unhealthy bare metal nodes.
In a bare metal cluster, remediation of nodes is critical to ensuring the overall health of the cluster. Physically remediating a cluster can be challenging and any delay in putting the machine into a safe or an operational state increases the time the cluster remains in a degraded state, and the risk that subsequent failures might bring the cluster offline. Power-based remediation helps counter such challenges.
Instead of reprovisioning the nodes, power-based remediation uses a power controller to power off an inoperable node. This type of remediation is also called power fencing.
OKD uses the MachineHealthCheck
controller to detect faulty bare metal nodes. Power-based remediation is fast and reboots faulty nodes instead of removing them from the cluster.
Power-based remediation provides the following capabilities:
Reduces the risk of data loss in hyperconverged environments
Reduces the downtime associated with recovering physical machines
Machine deletion on a bare metal cluster triggers reprovisioning of a bare metal host. Bare metal reprovisioning is usually a lengthy process, during which the cluster is missing compute resources and applications might be interrupted. To change the default remediation process from machine deletion to host power-cycle, annotate the MachineHealthCheck resource with the machine.openshift.io/remediation-strategy: external-baremetal annotation.
After you set the annotation, unhealthy machines are power-cycled by using BMC credentials.
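As a minimal sketch, the annotation is added to the resource metadata (the name shown is a placeholder; a complete sample appears later in this section):

```yaml
metadata:
  name: example                  # placeholder name
  namespace: openshift-machine-api
  annotations:
    machine.openshift.io/remediation-strategy: external-baremetal
```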
Understanding the remediation process
The remediation process operates as follows:
The MachineHealthCheck (MHC) controller detects that a node is unhealthy.
The MHC notifies the bare metal machine controller, which requests to power off the unhealthy node.
After the power is off, the node is deleted, which allows the cluster to reschedule the affected workload on other nodes.
The bare metal machine controller requests to power on the node.
After the node is up, the node re-registers itself with the cluster, resulting in the creation of a new node.
After the node is recreated, the bare metal machine controller restores the annotations and labels that existed on the unhealthy node before its deletion.
If the power operations did not complete, the bare metal machine controller triggers the reprovisioning of the unhealthy node unless this is a control plane node (also known as the master node) or a node that was provisioned externally.
Prerequisites
- OKD is installed by using installer-provisioned infrastructure (IPI).
- Access to Baseboard Management Controller (BMC) credentials (or BMC access to each node).
- Network access to the BMC interface of the unhealthy node.
Procedure
Create a healthcheck.yaml file that contains the definition of your machine health check.
Apply the healthcheck.yaml file to your cluster using the following command:
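For example, assuming the file name from the previous step:

```console
$ oc apply -f healthcheck.yaml
```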
Sample MachineHealthCheck resource for bare metal
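The sample is shown here as an illustrative sketch that follows the standard MachineHealthCheck schema; the resource name, role labels, machine set name, and timeout values are placeholders, and the numbered comments correspond to the callouts in the table that follows:

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example                  # (1)
  namespace: openshift-machine-api
  annotations:
    machine.openshift.io/remediation-strategy: external-baremetal  # (2)
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: <role>        # (3)
      machine.openshift.io/cluster-api-machine-type: <role>        # (3)
      machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone>  # (4)
  unhealthyConditions:
  - type: "Ready"
    status: "False"
    timeout: "300s"              # (5)
  - type: "Ready"
    status: "Unknown"
    timeout: "300s"              # (5)
  maxUnhealthy: "40%"            # (6)
  nodeStartupTimeout: "10m"      # (7)
```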
1 | Specify the name of the machine health check to deploy. |
2 | For bare metal clusters, you must include the machine.openshift.io/remediation-strategy: external-baremetal annotation in the annotations section to enable power-cycle remediation. With this remediation strategy, unhealthy hosts are rebooted instead of removed from the cluster. |
3 | Specify a label for the machine pool that you want to check. |
4 | Specify the machine set to track in <cluster_name>-<label>-<zone> format. For example, prod-node-us-east-1a . |
5 | Specify the timeout duration for the node condition. If the condition is met for the duration of the timeout, the machine will be remediated. Long timeouts can result in long periods of downtime for a workload on an unhealthy machine. |
6 | Specify the number of machines allowed to be concurrently remediated in the targeted pool. This can be set as a percentage or an integer. If the number of unhealthy machines exceeds the limit set by maxUnhealthy, remediation is not performed. |
7 | Specify the timeout duration that a machine health check must wait for a node to join the cluster before a machine is determined to be unhealthy. |
Troubleshooting issues with power-based remediation
To troubleshoot an issue with power-based remediation, verify the following:
The BMC is connected to the control plane node that is responsible for running the remediation task.