Controlling pod placement using node taints

    A taint allows a node to refuse a pod to be scheduled unless that pod has a matching toleration.

    You apply taints to a node through the Node specification (NodeSpec) and apply tolerations to a pod through the Pod specification (PodSpec). When you apply a taint to a node, the scheduler cannot place a pod on that node unless the pod can tolerate the taint.

    Example taint in a node specification

    spec:
      taints:
      - effect: NoExecute
        key: key1
        value: value1

    Example toleration in a Pod spec

    spec:
      tolerations:
      - key: "key1"
        operator: "Equal"
        value: "value1"
        effect: "NoExecute"
        tolerationSeconds: 3600
    # ...

    Taints and tolerations consist of a key, value, and effect.

    If you add a NoSchedule taint to a control plane node, the node must have the node-role.kubernetes.io/master=:NoSchedule taint, which is added by default.

    For example:

    apiVersion: v1
    kind: Node
    metadata:
      annotations:
        machine.openshift.io/machine: openshift-machine-api/ci-ln-62s7gtb-f76d1-v8jxv-master-0
        machineconfiguration.openshift.io/currentConfig: rendered-master-cdc1ab7da414629332cc4c3926e6e59c
    # ...
    spec:
      taints:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
    # ...

    A toleration matches a taint:

    • If the operator parameter is set to Equal:

      • the key parameters are the same;

      • the value parameters are the same;

      • the effect parameters are the same.

    • If the operator parameter is set to Exists:

      • the key parameters are the same;

      • the effect parameters are the same.
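    The two matching rules above can be sketched in Python. This is an illustrative model only, not the scheduler's code: the Taint and Toleration classes and the tolerates function are hypothetical names, and the sketch covers just the Equal and Exists rules as listed here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str
    effect: str
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    toleration_seconds: Optional[int] = None


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint under the rules above."""
    if toleration.key != taint.key or toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        # Exists ignores the value; key and effect are enough.
        return True
    if toleration.operator == "Equal":
        return toleration.value == taint.value
    raise ValueError(f"unknown operator: {toleration.operator}")


taint = Taint(key="key1", value="value1", effect="NoExecute")
print(tolerates(Toleration("key1", "NoExecute", "Equal", "value1"), taint))  # True
print(tolerates(Toleration("key1", "NoExecute", "Equal", "other"), taint))   # False
print(tolerates(Toleration("key1", "NoExecute", "Exists"), taint))           # True
```

    Note that the effect is compared in both branches, so an Exists toleration for NoSchedule does not match a NoExecute taint with the same key.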

    The following taints are built into OKD:

    • node.kubernetes.io/not-ready: The node is not ready. This corresponds to the node condition Ready=False.

    • node.kubernetes.io/unreachable: The node is unreachable from the node controller. This corresponds to the node condition Ready=Unknown.

    • node.kubernetes.io/memory-pressure: The node has memory pressure issues. This corresponds to the node condition MemoryPressure=True.

    • node.kubernetes.io/disk-pressure: The node has disk pressure issues. This corresponds to the node condition DiskPressure=True.

    • node.kubernetes.io/network-unavailable: The node network is unavailable.

    • node.kubernetes.io/unschedulable: The node is unschedulable.

    • node.cloudprovider.kubernetes.io/uninitialized: When the node controller is started with an external cloud provider, this taint is set on a node to mark it as unusable. After a controller from the cloud-controller-manager initializes this node, the kubelet removes this taint.

    • node.kubernetes.io/pid-pressure: The node has PID pressure. This corresponds to the node condition PIDPressure=True.

      OKD does not set a default pid.available evictionHard.
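    As a rough illustration, the condition-to-taint correspondences stated in the list above can be written as a lookup table. The names CONDITION_TAINTS and taints_for_conditions are hypothetical, and only the taints for which the list names a node condition are included.

```python
# Mapping taken from the list above: (condition type, status) -> taint key.
CONDITION_TAINTS = {
    ("Ready", "False"): "node.kubernetes.io/not-ready",
    ("Ready", "Unknown"): "node.kubernetes.io/unreachable",
    ("MemoryPressure", "True"): "node.kubernetes.io/memory-pressure",
    ("DiskPressure", "True"): "node.kubernetes.io/disk-pressure",
    ("PIDPressure", "True"): "node.kubernetes.io/pid-pressure",
}


def taints_for_conditions(conditions: dict) -> list:
    """Return the built-in taint keys implied by a node's reported conditions."""
    return sorted(
        taint
        for (cond_type, status), taint in CONDITION_TAINTS.items()
        if conditions.get(cond_type) == status
    )


print(taints_for_conditions({"Ready": "Unknown", "DiskPressure": "True"}))
# ['node.kubernetes.io/disk-pressure', 'node.kubernetes.io/unreachable']
```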

    You can specify how long a pod can remain bound to a node before being evicted by specifying the tolerationSeconds parameter in the Pod specification or MachineSet object. If a taint with the NoExecute effect is added to a node, a pod that tolerates the taint and specifies the tolerationSeconds parameter is not evicted until that time period expires.

    Example toleration with tolerationSeconds

    spec:
      tolerations:
      - key: "key1"
        operator: "Equal"
        value: "value1"
        effect: "NoExecute"
        tolerationSeconds: 3600

    Here, if this pod is running and a matching taint is added to the node, the pod stays bound to the node for 3,600 seconds and is then evicted. If the taint is removed before that time, the pod is not evicted.

    Understanding how to use multiple taints

    You can put multiple taints on the same node and multiple tolerations on the same pod. OKD processes multiple taints and tolerations as follows:

    1. Ignore the taints for which the pod has a matching toleration.

    2. The remaining unmatched taints affect the pod as follows:

      • If there is at least one unmatched taint with effect NoSchedule, OKD cannot schedule a pod onto that node.

      • If there is no unmatched taint with effect NoSchedule but there is at least one unmatched taint with effect PreferNoSchedule, OKD tries to not schedule the pod onto the node.

      • If there is at least one unmatched taint with effect NoExecute, OKD evicts the pod from the node if it is already running on the node, or the pod is not scheduled onto the node if it is not yet running on the node.

        • Pods that do not tolerate the taint are evicted immediately.

        • Pods that tolerate the taint without specifying tolerationSeconds in their Pod specification remain bound forever.

        • Pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time.

    For example:

    • Add the following taints to the node:

      $ oc adm taint nodes node1 key1=value1:NoSchedule
      $ oc adm taint nodes node1 key1=value1:NoExecute
      $ oc adm taint nodes node1 key2=value2:NoSchedule

    • The pod has the following tolerations:

      spec:
        tolerations:
        - key: "key1"
          operator: "Equal"
          value: "value1"
          effect: "NoSchedule"
        - key: "key1"
          operator: "Equal"
          value: "value1"
          effect: "NoExecute"

    In this case, the pod cannot be scheduled onto the node, because there is no toleration matching the third taint. The pod continues running if it is already running on the node when the taint is added, because the third taint is the only one of the three that is not tolerated by the pod.
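    The reasoning in this example can be checked with a small Python model. It is a sketch under simplifying assumptions: the Taint and Toleration tuples and the helper functions are hypothetical names, tolerationSeconds is ignored, and the soft PreferNoSchedule preference is not modeled.

```python
from typing import List, NamedTuple


class Taint(NamedTuple):
    key: str
    value: str
    effect: str


class Toleration(NamedTuple):
    key: str
    operator: str
    value: str
    effect: str


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Equal compares key, value, and effect; Exists compares key and effect."""
    if tol.key != taint.key or tol.effect != taint.effect:
        return False
    return tol.operator == "Exists" or tol.value == taint.value


def unmatched_effects(taints: List[Taint], tolerations: List[Toleration]) -> set:
    """Effects of the taints that no toleration matches."""
    return {
        t.effect for t in taints
        if not any(tolerates(tol, t) for tol in tolerations)
    }


def can_schedule(taints, tolerations) -> bool:
    # An unmatched NoSchedule or NoExecute taint blocks new placement.
    return not unmatched_effects(taints, tolerations) & {"NoSchedule", "NoExecute"}


def is_evicted(taints, tolerations) -> bool:
    # Only an unmatched NoExecute taint removes a pod that is already running.
    return "NoExecute" in unmatched_effects(taints, tolerations)


node1 = [
    Taint("key1", "value1", "NoSchedule"),
    Taint("key1", "value1", "NoExecute"),
    Taint("key2", "value2", "NoSchedule"),
]
pod = [
    Toleration("key1", "Equal", "value1", "NoSchedule"),
    Toleration("key1", "Equal", "value1", "NoExecute"),
]
print(can_schedule(node1, pod))  # False: key2=value2:NoSchedule is unmatched
print(is_evicted(node1, pod))    # False: no unmatched NoExecute taint
```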

    Understanding pod scheduling and node conditions (taint node by condition)

    The Taint Nodes By Condition feature, which is enabled by default, automatically taints nodes that report conditions such as memory pressure and disk pressure. If a node reports a condition, a taint is added until the condition clears. The taints have the NoSchedule effect, which means no pod can be scheduled on the node unless the pod has a matching toleration.

    The scheduler checks for these taints on nodes before scheduling pods. If the taint is present, the pod is scheduled on a different node. Because the scheduler checks for taints and not the actual node conditions, you can configure the scheduler to ignore some of these node conditions by adding appropriate pod tolerations.

    To ensure backward compatibility, the daemon set controller automatically adds the following tolerations to all daemon set pods:

    • node.kubernetes.io/memory-pressure

    • node.kubernetes.io/disk-pressure

    • node.kubernetes.io/unschedulable (1.10 or later)

    • node.kubernetes.io/network-unavailable (host network only)

    You can also add arbitrary tolerations to daemon sets.

    The control plane also adds the node.kubernetes.io/memory-pressure toleration on pods that have a QoS class other than BestEffort. This is because Kubernetes treats pods in the Guaranteed or Burstable QoS classes as able to cope with memory pressure. New BestEffort pods are not scheduled onto the affected node.

    The Taint-Based Evictions feature, which is enabled by default, evicts pods from a node that experiences specific conditions, such as not-ready and unreachable. When a node experiences one of these conditions, OKD automatically adds taints to the node, and starts evicting and rescheduling the pods on different nodes.

    Taint-Based Evictions have a NoExecute effect: any pod that does not tolerate the taint is evicted immediately, and any pod that tolerates the taint is never evicted unless the pod uses the tolerationSeconds parameter.

    The tolerationSeconds parameter allows you to specify how long a pod stays bound to a node that has a node condition. If the condition still exists after the tolerationSeconds period, the taint remains on the node and the pods with a matching toleration are evicted. If the condition clears before the tolerationSeconds period, pods with matching tolerations are not removed.

    If you use the tolerationSeconds parameter with no value, pods are never evicted because of the not ready and unreachable node conditions.
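    The three eviction outcomes described above can be summarized in a short Python sketch. The function names are hypothetical and the model is deliberately simplified: it assumes a single NoExecute taint and measures time from the moment the taint is added. Treating zero or negative tolerationSeconds as immediate eviction follows the Kubernetes API description of the field.

```python
from typing import Optional


def eviction_delay(tolerated: bool, toleration_seconds: Optional[int]) -> Optional[int]:
    """Seconds a running pod stays bound after a NoExecute taint appears.

    Returns 0 for immediate eviction and None when the pod is never evicted.
    """
    if not tolerated:
        return 0
    if toleration_seconds is None:
        return None
    # Zero and negative values are treated as 0 (evict immediately).
    return max(toleration_seconds, 0)


def is_evicted(delay: Optional[int], taint_duration: float) -> bool:
    """True if the taint stays on the node at least as long as the delay."""
    return delay is not None and taint_duration >= delay


print(is_evicted(eviction_delay(True, 3600), taint_duration=600))     # False
print(is_evicted(eviction_delay(True, 3600), taint_duration=4000))    # True
print(is_evicted(eviction_delay(True, None), taint_duration=10**9))   # False
print(is_evicted(eviction_delay(False, None), taint_duration=0))      # True
```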

    OKD automatically adds a toleration for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds=300, unless the Pod configuration specifies either toleration.

    spec:
      tolerations:
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 300 # (1)
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 300

    (1) These tolerations ensure that the default pod behavior is to remain bound for five minutes after one of these node condition problems is detected.

    You can configure these tolerations as needed. For example, if you have an application with a lot of local state, you might want to keep the pods bound to the node for a longer time in the event of a network partition, allowing the partition to recover and avoiding pod eviction.
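    The defaulting behavior can be pictured with the following Python sketch. It is an assumption-laden model, not the admission logic itself: the function name is hypothetical, and it assumes each default is added independently when the pod has no NoExecute toleration for that key, which is one reading of "unless the Pod configuration specifies either toleration".

```python
DEFAULT_SECONDS = 300
DEFAULT_KEYS = (
    "node.kubernetes.io/not-ready",
    "node.kubernetes.io/unreachable",
)


def with_default_tolerations(tolerations: list) -> list:
    """Return the pod tolerations plus any missing default NoExecute ones."""
    result = list(tolerations)
    for key in DEFAULT_KEYS:
        # Assumption: a toleration for the key with effect NoExecute
        # (or with no effect set) counts as "already specified".
        already_set = any(
            t.get("key") == key and t.get("effect") in ("NoExecute", None)
            for t in tolerations
        )
        if not already_set:
            result.append({
                "key": key,
                "operator": "Exists",
                "effect": "NoExecute",
                "tolerationSeconds": DEFAULT_SECONDS,
            })
    return result


# A pod with local state that wants to ride out a one-hour network partition.
pod_tolerations = [{
    "key": "node.kubernetes.io/unreachable",
    "operator": "Exists",
    "effect": "NoExecute",
    "tolerationSeconds": 3600,
}]
for t in with_default_tolerations(pod_tolerations):
    print(t["key"], t["tolerationSeconds"])
```

    In this model the pod keeps its own 3600-second toleration for unreachable and receives only the 300-second default for not-ready.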

    Pods spawned by a daemon set are created with NoExecute tolerations for the following taints with no tolerationSeconds:

    • node.kubernetes.io/unreachable

    • node.kubernetes.io/not-ready

    As a result, daemon set pods are never evicted because of these node conditions.

    Tolerating all taints

    You can configure a pod to tolerate all taints by adding an operator: "Exists" toleration with no key and value parameters. Pods with this toleration are not removed from a node that has taints.

    Pod spec for tolerating all taints

    spec:
      tolerations:
      - operator: "Exists"
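    To see why this single toleration matches every taint, consider a minimal Python sketch in which omitted fields act as wildcards. The tolerates function and the dictionary layout are hypothetical; the wildcard handling of an omitted key (with Exists) and an omitted effect reflects Kubernetes matching behavior as understood here, so treat it as a model rather than a specification.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Match a toleration against a taint, treating omitted fields as wildcards."""
    key = toleration.get("key")
    effect = toleration.get("effect")
    operator = toleration.get("operator", "Equal")
    if key is None:
        # A keyless toleration is only meaningful with the Exists operator;
        # it then matches any key and value.
        return operator == "Exists" and effect in (None, taint["effect"])
    if key != taint["key"]:
        return False
    if effect is not None and effect != taint["effect"]:
        return False
    return operator == "Exists" or toleration.get("value", "") == taint.get("value", "")


tolerate_all = {"operator": "Exists"}
print(tolerates(tolerate_all, {"key": "key1", "value": "value1", "effect": "NoExecute"}))           # True
print(tolerates(tolerate_all, {"key": "dedicated", "value": "groupName", "effect": "NoSchedule"}))  # True
```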

    You add tolerations to pods and taints to nodes to allow the node to control which pods should or should not be scheduled on it. For existing pods and nodes, add the toleration to the pod first, then add the taint to the node, so that pods are not removed from the node before you can add the toleration.

    Procedure

    1. Add a toleration to a pod by editing the Pod spec to include a tolerations stanza:

      Sample pod configuration file with an Equal operator

      spec:
        tolerations:
        - key: "key1" # (1)
          value: "value1"
          operator: "Equal"
          effect: "NoExecute"
          tolerationSeconds: 3600 # (2)

      (1) The toleration parameters, as described in the Taint and toleration components table.
      (2) The tolerationSeconds parameter specifies how long a pod can remain bound to a node before being evicted.

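      For example:

      Sample pod configuration file with an Exists operator

      spec:
        tolerations:
        - key: "key1"
          operator: "Exists" # (1)
          effect: "NoExecute"
          tolerationSeconds: 3600

      (1) The Exists operator does not take a value.

      Both sample tolerations match a taint that has key key1, value value1, and taint effect NoExecute.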

    2. Add a taint to a node by using the following command with the parameters described in the Taint and toleration components table:

      $ oc adm taint nodes <node_name> <key>=<value>:<effect>

      For example:

      $ oc adm taint nodes node1 key1=value1:NoExecute

      This command places a taint on node1 that has key key1, value value1, and effect NoExecute.

      If you add a NoSchedule taint to a control plane node, the node must have the node-role.kubernetes.io/master=:NoSchedule taint, which is added by default.

      For example:

      apiVersion: v1
      kind: Node
      metadata:
        annotations:
          machine.openshift.io/machine: openshift-machine-api/ci-ln-62s7gtb-f76d1-v8jxv-master-0
          machineconfiguration.openshift.io/currentConfig: rendered-master-cdc1ab7da414629332cc4c3926e6e59c
      spec:
        taints:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master

      The tolerations on the pod match the taint on the node. A pod with either toleration can be scheduled onto node1.
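    The <key>=<value>:<effect> argument used in this procedure can be broken down with a few lines of Python. This is only a sketch of the syntax as shown in this document, not the parser that oc uses; the TaintArg type and parse_taint_arg function are hypothetical names.

```python
from typing import NamedTuple

EFFECTS = ("NoSchedule", "PreferNoSchedule", "NoExecute")


class TaintArg(NamedTuple):
    key: str
    value: str
    effect: str


def parse_taint_arg(arg: str) -> TaintArg:
    """Split a '<key>=<value>:<effect>' argument into its three parts."""
    spec, sep, effect = arg.rpartition(":")
    if not sep or effect not in EFFECTS:
        raise ValueError(f"invalid taint argument: {arg!r}")
    # The value may be empty, as in node-role.kubernetes.io/master=:NoSchedule.
    key, _, value = spec.partition("=")
    return TaintArg(key=key, value=value, effect=effect)


print(parse_taint_arg("key1=value1:NoExecute"))
print(parse_taint_arg("node-role.kubernetes.io/master=:NoSchedule"))
```

    The second call shows that the default control plane taint has a key and an effect but an empty value.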

    Adding taints and tolerations using a compute machine set

    You can add taints to nodes using a compute machine set. All nodes associated with the MachineSet object are updated with the taint. Tolerations respond to taints added by a compute machine set in the same manner as taints added directly to the nodes.

    Procedure

    1. Add a toleration to a pod by editing the Pod spec to include a tolerations stanza:

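      spec:
        tolerations:
        - key: "key1" # (1)
          value: "value1"
          operator: "Equal"
          effect: "NoExecute"
          tolerationSeconds: 3600 # (2)

      (1) The toleration parameters, as described in the Taint and toleration components table.
      (2) The tolerationSeconds parameter specifies how long a pod can remain bound to a node before being evicted.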

      For example:

      Sample pod configuration file with Exists operator

      spec:
        tolerations:
        - key: "key1"
          operator: "Exists"
          effect: "NoExecute"
          tolerationSeconds: 3600
    2. Add the taint to the MachineSet object:

      1. Edit the MachineSet YAML for the nodes you want to taint, or create a new MachineSet object:

        $ oc edit machineset <machineset>
      2. Add the taint to the spec.template.spec section:

        Example taint in a compute machine set specification

        spec:
        # ...
          template:
        # ...
            spec:
              taints:
              - effect: NoExecute
                key: key1
                value: value1
        # ...

        This example places a taint that has the key key1, value value1, and taint effect NoExecute on the nodes.

      3. Scale down the compute machine set to 0:

        $ oc scale --replicas=0 machineset <machineset> -n openshift-machine-api

        You can alternatively apply the following YAML to scale the compute machine set:

        apiVersion: machine.openshift.io/v1beta1
        kind: MachineSet
        metadata:
          name: <machineset>
          namespace: openshift-machine-api
        spec:
          replicas: 0

        Wait for the machines to be removed.

      4. Scale up the compute machine set as needed:

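        $ oc scale --replicas=2 machineset <machineset> -n openshift-machine-api

        Or:

        apiVersion: machine.openshift.io/v1beta1
        kind: MachineSet
        metadata:
          name: <machineset>
          namespace: openshift-machine-api
        spec:
          replicas: 2

        Wait for the machines to start. The taint is added to the nodes associated with the MachineSet object.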

    If you want to dedicate a set of nodes for exclusive use by a particular set of users, add a toleration to their pods. Then, add a corresponding taint to those nodes. The pods with the tolerations are allowed to use the tainted nodes or any other nodes in the cluster.

    If you want to ensure that the pods are scheduled only onto those tainted nodes, also add a label to the same set of nodes and add a node affinity to the pods so that the pods can only be scheduled onto nodes with that label.
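    The difference between the two setups can be illustrated with a toy Python model. Everything here is hypothetical: the eligible_nodes function, the string encoding of taints, and the dedicated=groupName label are chosen for illustration, and the model treats every taint as a hard scheduling constraint.

```python
def eligible_nodes(nodes: dict, tolerated: set, required_labels: dict) -> list:
    """Names of the nodes a pod may land on.

    nodes maps a node name to {"taints": set of 'key=value:effect' strings,
    "labels": dict}. A node is eligible when every taint is tolerated and
    every required label is present.
    """
    return sorted(
        name
        for name, node in nodes.items()
        if node["taints"] <= tolerated
        and all(node["labels"].get(k) == v for k, v in required_labels.items())
    )


nodes = {
    "node1": {"taints": {"dedicated=groupName:NoSchedule"}, "labels": {"dedicated": "groupName"}},
    "node2": {"taints": set(), "labels": {}},
}
tolerated = {"dedicated=groupName:NoSchedule"}

# Toleration only: the pod may still land on the untainted node2.
print(eligible_nodes(nodes, tolerated, {}))                          # ['node1', 'node2']
# Toleration plus required label: only the dedicated node remains.
print(eligible_nodes(nodes, tolerated, {"dedicated": "groupName"}))  # ['node1']
# No toleration: the tainted node is off limits to other pods.
print(eligible_nodes(nodes, set(), {}))                              # ['node2']
```

    The taint keeps other pods off the dedicated node, while the label and node affinity keep the group's pods off every other node; both are needed for exclusive use in both directions.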

    Procedure

    To configure a node so that users can use only that node:

    1. Add a corresponding taint to those nodes:

      For example:

      $ oc adm taint nodes node1 dedicated=groupName:NoSchedule

      You can alternatively apply the following YAML to add the taint:

      kind: Node
      apiVersion: v1
      metadata:
        name: <node_name>
        labels:
          # ...
      spec:
        taints:
        - key: dedicated
          value: groupName
          effect: NoSchedule
    2. Add a toleration to the pods by writing a custom admission controller.

    Creating a project with a node selector and toleration

    You can create a project that uses a node selector and toleration, which are set as annotations, to control the placement of pods onto specific nodes. Any subsequent resources created in the project are then scheduled on nodes that have a taint matching the toleration.

    Prerequisites

    • A label for node selection has been added to one or more nodes by using a compute machine set or editing the node directly.

    • A taint has been added to one or more nodes by using a compute machine set or editing the node directly.

    Procedure

    1. Create a Project resource definition, specifying a node selector and toleration in the metadata.annotations section:

      Example project.yaml file

      kind: Project
      apiVersion: project.openshift.io/v1
      metadata:
        name: <project_name> # (1)
        annotations:
          openshift.io/node-selector: '<label>' # (2)
          scheduler.alpha.kubernetes.io/defaultTolerations: >- # (3)
            [{"operator": "Exists", "effect": "NoSchedule", "key":
            "<key_name>"}
            ]

      (1) The project name.
      (2) The default node selector label.
      (3) The toleration parameters, as described in the Taint and toleration components table. This example uses the NoSchedule effect, which allows existing pods on the node to remain, and the Exists operator, which does not take a value.
    2. Use the oc apply command to create the project:

      $ oc apply -f project.yaml

    Any subsequent resources created in the <project_name> namespace should now be scheduled on the specified nodes.
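    Because the defaultTolerations annotation value is a JSON array embedded in YAML, it is easy to mistype. One way to avoid that, sketched here in Python, is to generate the string with the standard json module. The helper name is hypothetical, <key_name> is the same placeholder used in project.yaml, and the check for a value on an Exists toleration simply mirrors the callout above.

```python
import json


def default_tolerations_annotation(tolerations: list) -> str:
    """Serialize tolerations for the defaultTolerations project annotation."""
    for t in tolerations:
        if t.get("operator") == "Exists" and "value" in t:
            raise ValueError("an Exists toleration does not take a value")
    return json.dumps(tolerations)


value = default_tolerations_annotation(
    [{"operator": "Exists", "effect": "NoSchedule", "key": "<key_name>"}]
)
print(value)
# [{"operator": "Exists", "effect": "NoSchedule", "key": "<key_name>"}]
```

    The printed string can be pasted as the annotation value in project.yaml.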

    Controlling nodes with special hardware using taints and tolerations

    In a cluster where a small subset of nodes have specialized hardware, you can use taints and tolerations to keep pods that do not need the specialized hardware off of those nodes, leaving the nodes for pods that do need the specialized hardware. You can also require pods that need specialized hardware to use specific nodes.

    You can achieve this by adding a toleration to pods that need the special hardware and tainting the nodes that have the specialized hardware.

    Procedure

    To ensure nodes with specialized hardware are reserved for specific pods:

    1. Add a toleration to pods that need the special hardware.

      For example:

      spec:
        tolerations:
        - key: "disktype"
          value: "ssd"
          operator: "Equal"
          effect: "NoSchedule"
    2. Taint the nodes that have the specialized hardware using one of the following commands:

      $ oc adm taint nodes <node-name> disktype=ssd:NoSchedule

      Or:

      $ oc adm taint nodes <node-name> disktype=ssd:PreferNoSchedule

      You can alternatively apply the following YAML to add the taint:

      kind: Node
      apiVersion: v1
      metadata:
        name: <node_name>
        labels:
          # ...
      spec:
        taints:
        - key: disktype
          value: ssd
          effect: PreferNoSchedule

    You can remove taints from nodes and tolerations from pods as needed. You should add the toleration to the pod first, then add the taint to the node to avoid pods being removed from the node before you can add the toleration.

    Procedure

    To remove taints and tolerations:

    1. To remove a taint from a node:

      $ oc adm taint nodes <node-name> <key>-

      For example:

      $ oc adm taint nodes ip-10-0-132-248.ec2.internal key1-

      Example output

      node/ip-10-0-132-248.ec2.internal untainted

    2. To remove a toleration from a pod, edit the Pod spec to remove the toleration:

      spec:
        tolerations:
        - key: "key2"
          operator: "Exists"
          tolerationSeconds: 3600