Remediating nodes with the Poison Pill Operator

    The Poison Pill Operator runs on the cluster nodes and reboots nodes that are identified as unhealthy. The Operator uses the MachineHealthCheck controller to detect the health of a node in the cluster. When a node is identified as unhealthy, the MachineHealthCheck resource creates the PoisonPillRemediation custom resource (CR), which triggers the Poison Pill Operator.

    The Poison Pill Operator provides the following capabilities:

    • Minimizes downtime for stateful applications and restores compute capacity if transient failures occur.

    • Works independently of any management interface, such as IPMI or an API to provision a node.

    The Poison Pill Operator creates the PoisonPillConfig CR with the name poison-pill-config in the Poison Pill Operator’s namespace. You can edit this CR. However, you cannot create a new CR for the Poison Pill Operator.

    A change in the PoisonPillConfig CR re-creates the Poison Pill daemon set.

    The PoisonPillConfig CR resembles the following YAML file:
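
    On a running cluster, the live contents of this CR can be printed directly. The following is a minimal sketch, assuming the Operator was installed in the poison-pill namespace used in the installation steps of this document; adjust the namespace if your installation differs.

```shell
# Assumption: the Operator is installed in the poison-pill namespace.
NAMESPACE="poison-pill"
CR_NAME="poison-pill-config"

# Build the command first so that it can be reviewed before it is run.
VIEW_CMD="oc get PoisonPillConfig ${CR_NAME} -n ${NAMESPACE} -o yaml"
echo "${VIEW_CMD}"

# Run the command only when the oc client is available on this machine.
if command -v oc >/dev/null 2>&1; then
  ${VIEW_CMD} || echo "could not read ${CR_NAME} in ${NAMESPACE}" >&2
fi
```

    To change a setting, running oc edit with the same resource name and namespace opens the CR in an editor.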

    Installing the Poison Pill Operator by using the web console

    You can use the OKD web console to install the Poison Pill Operator.

    Prerequisites

    • Log in as a user with cluster-admin privileges.

    Procedure

    1. In the OKD web console, navigate to Operators → OperatorHub.

    2. Search for the Poison Pill Operator from the list of available Operators, and then click Install.

    3. Keep the default selection of Installation mode and namespace to ensure that the Operator is installed to the poison-pill namespace.

    4. Click Install.

    Verification

    To confirm that the installation is successful:

    1. Navigate to the Operators → Installed Operators page.

    2. Check that the Operator is installed in the poison-pill namespace and its status is Succeeded.

    If the Operator is not installed successfully:

    1. Navigate to the Operators → Installed Operators page and inspect the Status column for any errors or failures.

    2. Navigate to the Workloads → Pods page and check the logs in any pods in the poison-pill-controller-manager project that are reporting issues.

    Installing the Poison Pill Operator by using the CLI

    You can use the OpenShift CLI (oc) to install the Poison Pill Operator.

    Prerequisites

    • Install the OpenShift CLI (oc).

    Procedure

    1. Create a Namespace custom resource (CR) for the Poison Pill Operator:

      1. Define the Namespace CR and save the YAML file, for example, poison-pill-namespace.yaml:

        apiVersion: v1
        kind: Namespace
        metadata:
          name: poison-pill

      2. To create the Namespace CR, run the following command:

        $ oc create -f poison-pill-namespace.yaml

    2. Create an OperatorGroup CR:

      1. Define the OperatorGroup CR and save the YAML file, for example, poison-pill-operator-group.yaml:

        apiVersion: operators.coreos.com/v1
        kind: OperatorGroup
        metadata:
          name: poison-pill-manager
        spec:
          targetNamespaces:
          - poison-pill

      2. To create the OperatorGroup CR, run the following command:

        $ oc create -f poison-pill-operator-group.yaml

    3. Create a Subscription CR:

      1. Define the Subscription CR and save the YAML file, for example, poison-pill-subscription.yaml:

        apiVersion: operators.coreos.com/v1alpha1
        kind: Subscription
        metadata:
          name: poison-pill-manager
          namespace: poison-pill
        spec:
          channel: alpha
          name: poison-pill-manager
          source: redhat-operators
          sourceNamespace: openshift-marketplace
          package: poison-pill-manager

      2. To create the Subscription CR, run the following command:

        $ oc create -f poison-pill-subscription.yaml

    Verification

    1. Verify that the installation succeeded by inspecting the CSV resource:

      $ oc get csv -n poison-pill

      Example output

      NAME                 DISPLAY                VERSION   REPLACES   PHASE
      poison-pill.v0.1.4   Poison Pill Operator   0.1.4                Succeeded

    2. Verify that the Poison Pill Operator is up and running:

      $ oc get deploy -n poison-pill

      Example output

      NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
      poison-pill-controller-manager   1/1     1            1           10d

    3. Verify that the Poison Pill Operator created the PoisonPillConfig CR:

      $ oc get PoisonPillConfig -n poison-pill

      Example output

      NAME                 AGE
      poison-pill-config   10d

    4. Verify that each poison pill pod is scheduled and running on each worker node:

      $ oc get daemonset -n poison-pill

      Example output

    Configuring machine health checks to use the Poison Pill Operator

    Use the following procedure to configure the machine health checks to use the Poison Pill Operator as a remediation provider.

    Prerequisites

    • Install the OpenShift CLI (oc).

    • Log in as a user with cluster-admin privileges.

    Procedure

    1. Create a PoisonPillRemediationTemplate CR:

      1. Define the PoisonPillRemediationTemplate CR:

        apiVersion: poison-pill.medik8s.io/v1alpha1
        kind: PoisonPillRemediationTemplate
        metadata:
          namespace: openshift-machine-api
          name: poisonpillremediationtemplate-sample
        spec:
          template:
            spec: {}

      2. To create the PoisonPillRemediationTemplate CR, run the following command:

        $ oc create -f <file-name>.yaml

    2. Create or update the MachineHealthCheck CR to point to the PoisonPillRemediationTemplate CR:

      1. Define or update the MachineHealthCheck CR:

        apiVersion: machine.openshift.io/v1beta1
        kind: MachineHealthCheck
        metadata:
          name: machine-health-check
          namespace: openshift-machine-api
        spec:
          selector:
            matchLabels:
              machine.openshift.io/cluster-api-machine-role: "worker"
              machine.openshift.io/cluster-api-machine-type: "worker"
          unhealthyConditions:
          - type: "Ready"
            timeout: "300s"
            status: "False"
          - type: "Ready"
            timeout: "300s"
            status: "Unknown"
          maxUnhealthy: "40%"
          nodeStartupTimeout: "10m"
          remediationTemplate:
            kind: PoisonPillRemediationTemplate
            apiVersion: poison-pill.medik8s.io/v1alpha1
            name: <poison-pill-remediation-template-sample>

      2. To create a MachineHealthCheck CR, run the following command:

        $ oc create -f <file-name>.yaml

      3. To update a MachineHealthCheck CR, run the following command:

        $ oc apply -f <file-name>.yaml
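
    After the MachineHealthCheck CR is created or updated, it can help to confirm that it references the intended remediation template. The following is a sketch only: the helper name show_template_name is hypothetical, and the resource name and namespace are taken from the example CR in this procedure.

```shell
# Values from the example MachineHealthCheck CR in this procedure.
MHC_NAME="machine-health-check"
MHC_NAMESPACE="openshift-machine-api"

# Hypothetical helper: print the template name that a MachineHealthCheck references.
show_template_name() {
  oc get machinehealthcheck "$1" -n "$2" \
    -o jsonpath='{.spec.remediationTemplate.name}'
}

# Query the cluster only when the oc client is available on this machine.
if command -v oc >/dev/null 2>&1; then
  show_template_name "$MHC_NAME" "$MHC_NAMESPACE" \
    || echo "could not read ${MHC_NAME}" >&2
fi
```

    The printed value should match the metadata.name of the PoisonPillRemediationTemplate CR that you created in step 1.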

    Troubleshooting the Poison Pill Operator

    Issue

    You want to troubleshoot issues with the Poison Pill Operator.

    Resolution

    Check the Operator logs.
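
    One way to retrieve the logs is through the Operator deployment. This sketch assumes the deployment name poison-pill-controller-manager and the poison-pill namespace shown in the verification output earlier in this document; the helper name operator_logs is hypothetical.

```shell
# Names taken from the verification output earlier in this document.
NAMESPACE="poison-pill"
DEPLOYMENT="poison-pill-controller-manager"

# Hypothetical helper: print logs from a pod of the given deployment.
# --all-containers avoids having to guess the container name inside the pod.
operator_logs() {
  oc logs "deployment/$1" -n "$2" --all-containers
}

# Query the cluster only when the oc client is available on this machine.
if command -v oc >/dev/null 2>&1; then
  operator_logs "$DEPLOYMENT" "$NAMESPACE" \
    || echo "could not read logs for ${DEPLOYMENT}" >&2
fi
```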

    Issue

    The Poison Pill Operator is installed but the daemon set is not available.

    Resolution

    Check the Operator logs for errors or warnings.
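
    When the log is long, a simple text filter can narrow it to the relevant lines. The following sketch uses a hypothetical helper, filter_problems, and made-up sample log lines; the real log format of the Operator may differ, so the pattern may need adjusting.

```shell
# Hypothetical helper: keep only lines that mention an error or a warning,
# ignoring case. "|| true" keeps the exit status zero when nothing matches.
filter_problems() {
  grep -iE 'error|warn' || true
}

# Self-contained demonstration on made-up log lines.
SAMPLE_LOG='INFO starting manager
ERROR failed to create daemon set
WARN retrying in 5s'
PROBLEMS=$(printf '%s\n' "$SAMPLE_LOG" | filter_problems)
printf '%s\n' "$PROBLEMS"

# Against a cluster, assuming the deployment name and namespace shown in the
# verification output earlier in this document:
# oc logs deployment/poison-pill-controller-manager -n poison-pill --all-containers | filter_problems
```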

    Issue

    An unhealthy node was not remediated.

    Resolution

    Verify that the PoisonPillRemediation CR was created by running the following command:

    $ oc get ppr -A

    If the MachineHealthCheck controller did not create the PoisonPillRemediation CR when the node turned unhealthy, check the logs of the MachineHealthCheck controller. Additionally, ensure that the MachineHealthCheck CR includes the required specification to use the remediation template.

    If the PoisonPillRemediation CR was created, ensure that its name matches the unhealthy node or the machine object.
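
    The name comparison can be scripted. The following is a sketch under stated assumptions: the helper name_in_list and the node names in the demonstration are hypothetical, and the commented commands show one possible way to gather the real names with jsonpath output.

```shell
# Hypothetical helper: succeed when the first argument appears as a whole
# line in the newline-separated list passed as the second argument.
name_in_list() {
  printf '%s\n' "$2" | grep -Fxq -- "$1"
}

# Demonstration with made-up node names.
NODES='worker-0
worker-1'
if name_in_list "worker-1" "$NODES"; then MATCH=yes; else MATCH=no; fi
echo "match: $MATCH"

# Against a cluster, the inputs could be gathered like this:
# PPR_NAMES=$(oc get ppr -A -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')
# NODES=$(oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')
# MACHINES=$(oc get machines -n openshift-machine-api -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')
```

    A PoisonPillRemediation name that appears in neither the node list nor the machine list is the case to investigate.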

    Issue

    The Poison Pill daemon set exists even after uninstalling the Operator.

    Resolution

    Additional resources

    The Poison Pill Operator is supported in a restricted network environment. For more information, see .