Working with nodes

    Understanding how to evacuate pods on nodes

    Evacuating pods allows you to migrate all or selected pods from a given node or nodes.

    You can only evacuate pods backed by a replication controller. The replication controller creates new pods on other nodes and removes the existing pods from the specified node(s).

    Bare pods, meaning those not backed by a replication controller, are unaffected by default. You can evacuate a subset of pods by specifying a pod-selector. Pod selectors are based on labels, so all the pods with the specified label will be evacuated.
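
    For example, assuming the pods of interest carry a hypothetical app=frontend label, the following command evacuates only those pods from the node:

      $ oc adm drain <node1> --pod-selector=app=frontend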

    Procedure

    1. Mark the nodes unschedulable before performing the pod evacuation.

      1. Mark the node as unschedulable:

        $ oc adm cordon <node1>

        Example output

        node/<node1> cordoned

      2. Check that the node status is NotReady,SchedulingDisabled or Ready,SchedulingDisabled:

        $ oc get node <node1>

        Example output

        NAME      STATUS                        ROLES    AGE   VERSION
        <node1>   NotReady,SchedulingDisabled   worker   1d    v1.21.0
    2. Evacuate the pods using one of the following methods:

      • Evacuate all or selected pods on one or more nodes:

        $ oc adm drain <node1> <node2> [--pod-selector=<pod_selector>]
      • Force the deletion of bare pods using the --force option. When set to true, deletion continues even if there are pods not managed by a replication controller, replica set, job, daemon set, or stateful set:

        $ oc adm drain <node1> <node2> --force=true
      • Set a period of time in seconds for each pod to terminate gracefully by using the --grace-period option. If negative, the default value specified in the pod is used:

        $ oc adm drain <node1> <node2> --grace-period=-1
      • Ignore pods managed by daemon sets by setting the --ignore-daemonsets flag to true:

        $ oc adm drain <node1> <node2> --ignore-daemonsets=true
      • Set the length of time to wait before giving up by using the --timeout flag. A value of 0 sets an infinite length of time:

        $ oc adm drain <node1> <node2> --timeout=5s
      • Delete pods even if they use emptyDir volumes by setting the --delete-local-data flag to true. Local data is deleted when the node is drained:

        $ oc adm drain <node1> <node2> --delete-local-data=true
      • List the objects that would be migrated without actually performing the evacuation by setting the --dry-run option to true:

        $ oc adm drain <node1> <node2> --dry-run=true

        Instead of specifying specific node names (for example, <node1> <node2>), you can use the --selector=<node_selector> option to evacuate pods on selected nodes.
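
        For example, assuming the target nodes carry a hypothetical region=east label, the following drains only those nodes:

        $ oc adm drain --selector=region=east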

    3. Mark the node as schedulable when done.

      $ oc adm uncordon <node1>

    Understanding how to update labels on nodes

    You can update any label on a node.

    Node labels are not persisted after a node is deleted even if the node is backed up by a Machine.

    • The following command adds or updates labels on a node:

      $ oc label node <node> <key_1>=<value_1> ... <key_n>=<value_n>

      For example:
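
      The node name and the status=unhealthy label below are arbitrary placeholders, mirroring the pod example that follows:

      $ oc label node node1.example.com status=unhealthy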

    • The following command adds or updates labels on all pods in the namespace:

      $ oc label pods --all <key_1>=<value_1>

      For example:

      $ oc label pods --all status=unhealthy
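
      To confirm that the label was applied, you can list the pods together with their labels (an optional check, not part of the original steps):

      $ oc get pods --show-labels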

    Understanding how to mark nodes as unschedulable or schedulable

    By default, healthy nodes with a Ready status are marked as schedulable, meaning that new pods are allowed for placement on the node. Manually marking a node as unschedulable blocks any new pods from being scheduled on the node. Existing pods on the node are not affected.

    • The following command marks a node or nodes as unschedulable:

      $ oc adm cordon <node>

      For example:

      $ oc adm cordon node1.example.com

      Example output

      node/node1.example.com cordoned
      NAME                LABELS                                       STATUS
      node1.example.com   kubernetes.io/hostname=node1.example.com    Ready,SchedulingDisabled
    • The following command marks a currently unschedulable node or nodes as schedulable:

      $ oc adm uncordon <node1>

      Alternatively, instead of specifying specific node names (for example, <node>), you can use the --selector=<node_selector> option to mark selected nodes as schedulable or unschedulable.
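
      For example, assuming the target nodes carry a hypothetical region=east label, you could mark them all as unschedulable at once:

      $ oc adm cordon --selector=region=east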

    Configuring control plane nodes as schedulable

    You can configure control plane nodes (also known as the master nodes) to be schedulable, meaning that new pods are allowed for placement on the master nodes. By default, control plane nodes are not schedulable.

    You can set the control plane nodes to be schedulable, but you must retain the worker nodes.

    You can deploy OKD with no worker nodes on a bare metal cluster. In this case, the control plane nodes are marked schedulable by default.

    You can allow or disallow control plane nodes to be schedulable by configuring the mastersSchedulable field.

    Procedure

    1. Edit the schedulers.config.openshift.io resource.

      $ oc edit schedulers.config.openshift.io cluster
    2. Configure the mastersSchedulable field.

      apiVersion: config.openshift.io/v1
      kind: Scheduler
      metadata:
        creationTimestamp: "2019-09-10T03:04:05Z"
        generation: 1
        name: cluster
        resourceVersion: "433"
        uid: a636d30a-d377-11e9-88d4-0a60097bee62
      spec:
        mastersSchedulable: false
        policy:
          name: ""
      status: {}

      Set mastersSchedulable to true to allow control plane nodes to be schedulable, or to false to disallow scheduling on them.
    3. Save the file to apply the changes.
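
    If you set mastersSchedulable to true, you can verify the change by listing the nodes; the control plane nodes should then accept new pods and typically also report the worker role (an optional check, not part of the original procedure):

      $ oc get nodes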

    Deleting nodes from a cluster

    When you delete a node using the CLI, the node object is deleted in Kubernetes, but the pods that exist on the node are not deleted. Any bare pods not backed by a replication controller become inaccessible to OKD. Pods backed by replication controllers are rescheduled to other available nodes. You must delete local manifest pods.

    Procedure

    To delete a node from the OKD cluster, edit the appropriate MachineSet object:

    If you are running cluster on bare metal, you cannot delete a node by editing MachineSet objects. Machine sets are only available when a cluster is integrated with a cloud provider. Instead you must unschedule and drain the node before manually deleting it.

    1. View the machine sets that are in the cluster:

      $ oc get machinesets -n openshift-machine-api

      The machine sets are listed in the form of <clusterid>-worker-<aws-region-az>.
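
      Example output (illustrative; the machine set names and counts vary by cluster)

      NAME                            DESIRED   CURRENT   READY   AVAILABLE   AGE
      <clusterid>-worker-us-east-1a   1         1         1       1           55m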

    2. Scale the machine set:

      $ oc scale --replicas=2 machineset <machineset> -n openshift-machine-api
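
      Alternatively, you can apply the same change by editing the MachineSet object directly and adjusting the replicas field:

      $ oc edit machineset <machineset> -n openshift-machine-api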

    For more information on scaling your cluster using a machine set, see Manually scaling a machine set.

    Deleting nodes from a bare metal cluster

    When you delete a node using the CLI, the node object is deleted in Kubernetes, but the pods that exist on the node are not deleted. Any bare pods not backed by a replication controller become inaccessible to OKD. Pods backed by replication controllers are rescheduled to other available nodes. You must delete local manifest pods.

    Procedure

    Delete a node from an OKD cluster running on bare metal by completing the following steps:

    1. Mark the node as unschedulable:

      $ oc adm cordon <node_name>
    2. Drain all pods on the node:

      $ oc adm drain <node_name> --force=true

      This step might fail if the node is offline or unresponsive. Even if the node does not respond, it might still be running a workload that writes to shared storage. To avoid data corruption, power down the physical hardware before you proceed.

    3. Delete the node from the cluster:

      $ oc delete node <node_name>

      Although the node object is now deleted from the cluster, it can still rejoin the cluster after a reboot or if the kubelet service is restarted. To permanently delete the node and all of its data, you must decommission the node.

    4. If you powered down the physical hardware, turn it back on so that the node can rejoin the cluster.

    Setting SELinux booleans

    OKD allows you to enable and disable an SELinux boolean on a Fedora CoreOS (FCOS) node. The following procedure explains how to modify SELinux booleans on nodes using the Machine Config Operator (MCO). This procedure uses container_manage_cgroup as the example boolean. You can modify this value to whichever boolean you need.

    Prerequisites

    • You have installed the OpenShift CLI (oc).

    Procedure

    1. Create a new YAML file with a MachineConfig object, displayed in the following example:

      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        labels:
          machineconfiguration.openshift.io/role: worker
        name: 99-worker-setsebool
      spec:
        config:
          ignition:
            version: 2.2.0
          systemd:
            units:
            - contents: |
                [Unit]
                Description=Set SELinux booleans
                Before=kubelet.service
                [Service]
                Type=oneshot
                ExecStart=/sbin/setsebool container_manage_cgroup=on
                RemainAfterExit=true
                [Install]
                WantedBy=multi-user.target graphical.target
              enabled: true
              name: setsebool.service
    2. Create the new MachineConfig object by running the following command:

      $ oc create -f 99-worker-setsebool.yaml
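
    After the machine config is applied and the affected nodes have rebooted, one way to verify the boolean (an optional check, not part of the original procedure) is to query it from a debug shell on a worker node:

      $ oc debug node/<node_name> -- chroot /host getsebool container_manage_cgroup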

    Adding kernel arguments to nodes

    In some special cases, you might want to add kernel arguments to a set of nodes in your cluster. This should only be done with caution and a clear understanding of the implications of the arguments you set.

    Improper use of kernel arguments can result in your systems becoming unbootable.

    Examples of kernel arguments you could set include:

    • enforcing=0: Configures Security Enhanced Linux (SELinux) to run in permissive mode. In permissive mode, the system acts as if SELinux is enforcing the loaded security policy, including labeling objects and emitting access denial entries in the logs, but it does not actually deny any operations. While not recommended for production systems, permissive mode can be helpful for debugging.

    • nosmt: Disables symmetric multithreading (SMT) in the kernel. Multithreading allows multiple logical threads for each CPU. You could consider nosmt in multi-tenant environments to reduce risks from potential cross-thread attacks. By disabling SMT, you essentially choose security over performance.

    See the Kernel.org kernel parameters documentation for a list and descriptions of kernel arguments.

    In the following procedure, you create a MachineConfig object that identifies:

    • A set of machines to which you want to add the kernel argument. In this case, machines with a worker role.

    • Kernel arguments that are appended to the end of the existing kernel arguments.

    • A label that indicates where in the list of machine configs the change is applied.

    Prerequisites

    • Have administrative privilege to a working OKD cluster.

    Procedure

    1. List existing MachineConfig objects for your OKD cluster to determine how to label your machine config:

      $ oc get MachineConfig

      Example output

      NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
      00-master                                          52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
      00-worker                                          52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
      01-master-kubelet                                  52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
      01-worker-container-runtime                        52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
      01-worker-kubelet                                  52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
      99-master-generated-registries                     52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
      99-worker-generated-registries                     52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
      99-worker-ssh                                                                                 3.2.0             40m
      rendered-master-23e785de7587df95a4b517e0647e5ab7   52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
      rendered-worker-5d596d9293ca3ea80c896a1191735bb1   52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
    2. Create a MachineConfig object file that identifies the kernel argument (for example, 05-worker-kernelarg-selinuxpermissive.yaml):

      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        labels:
          machineconfiguration.openshift.io/role: worker (1)
        name: 05-worker-kernelarg-selinuxpermissive (2)
      spec:
        config:
          ignition:
            version: 3.2.0
        kernelArguments:
          - enforcing=0 (3)

      (1) Applies the new kernel argument only to nodes with the worker role.
      (2) Named to indicate where it fits among the machine configs (05) and what it does (adds a kernel argument to configure SELinux permissive mode).
      (3) The kernel argument that is appended to the existing kernel arguments, in this case enforcing=0.
    3. Create the new machine config:

      $ oc create -f 05-worker-kernelarg-selinuxpermissive.yaml
    4. Check the machine configs to see that the new one was added:

      $ oc get MachineConfig

      Example output

      NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
      00-master                                          52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
      00-worker                                          52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
      01-master-container-runtime                        52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
      01-master-kubelet                                  52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
      01-worker-container-runtime                        52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
      01-worker-kubelet                                  52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
      05-worker-kernelarg-selinuxpermissive                                                         3.2.0             105s
      99-master-generated-registries                     52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
      99-master-ssh                                                                                 3.2.0             40m
      99-worker-generated-registries                     52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
      99-worker-ssh                                                                                 3.2.0             40m
      rendered-master-23e785de7587df95a4b517e0647e5ab7   52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
      rendered-worker-5d596d9293ca3ea80c896a1191735bb1   52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
    5. Check the nodes:

      $ oc get nodes

      Example output

      NAME                           STATUS                     ROLES    AGE   VERSION
      ip-10-0-136-161.ec2.internal   Ready                      worker   28m   v1.21.0
      ip-10-0-136-243.ec2.internal   Ready                      master   34m   v1.21.0
      ip-10-0-141-105.ec2.internal   Ready,SchedulingDisabled   worker   28m   v1.21.0
      ip-10-0-142-249.ec2.internal   Ready                      master   34m   v1.21.0
      ip-10-0-153-11.ec2.internal    Ready                      worker   28m   v1.21.0
      ip-10-0-153-150.ec2.internal   Ready                      master   34m   v1.21.0

      You can see that scheduling on each worker node is disabled as the change is being applied.
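
      If you want to watch the rollout progress (an optional check, not part of the original steps), you can monitor the worker machine config pool until it reports that it is updated:

      $ oc get machineconfigpool worker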

    6. Check that the kernel argument worked by going to one of the worker nodes and listing the kernel command line arguments (in /proc/cmdline on the host):

      $ oc debug node/ip-10-0-141-105.ec2.internal

      Example output

      Starting pod/ip-10-0-141-105ec2internal-debug ...
      To use host binaries, run `chroot /host`
      sh-4.2# cat /host/proc/cmdline
      BOOT_IMAGE=/ostree/rhcos-... console=tty0 console=ttyS0,115200n8
      rootflags=defaults,prjquota rw root=UUID=fd0... ostree=/ostree/boot.0/rhcos/16...
      coreos.oem.id=qemu coreos.oem.id=ec2 ignition.platform.id=ec2 enforcing=0
      sh-4.2# exit

      You should see the argument added to the other kernel arguments.

    Additional resources

    For more information on scaling your cluster by using a machine set, see Manually scaling a machine set.