Using CPU Manager and Topology Manager

    CPU Manager is useful for workloads that have some of these attributes:

    • Require as much CPU time as possible.

    • Are sensitive to processor cache misses.

    • Are low-latency network applications.

    • Coordinate with other processes and benefit from sharing a single processor cache.

    Topology Manager collects hints from the CPU Manager, Device Manager, and other Hint Providers to align pod resources, such as CPU, SR-IOV VFs, and other device resources, for all Quality of Service (QoS) classes on the same non-uniform memory access (NUMA) node.

    Topology Manager uses topology information from the collected hints to decide if a pod can be accepted or rejected on a node, based on the configured Topology Manager policy and pod resources requested.
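    The idea of aligning resources on one NUMA node can be illustrated with a toy model. The sketch below assumes, purely for illustration, that each hint provider reports the set of NUMA node IDs on which it could satisfy a request; the real hint format and merge logic are richer than this, and the provider names and node IDs are hypothetical.

```python
def common_numa_nodes(hints):
    """Return the NUMA nodes on which every provider can satisfy the request.

    `hints` maps a provider name to the set of NUMA node IDs it reports.
    This is a simplified model, not the kubelet's actual hint structure.
    """
    if not hints:
        return set()
    return set.intersection(*(set(nodes) for nodes in hints.values()))


# Hypothetical hints: CPUs are free on both nodes, the device sits on node 1.
hints = {
    "cpu-manager": {0, 1},
    "device-manager": {1},
}
print(sorted(common_numa_nodes(hints)))  # [1]
```

    In this model an empty result means that no single NUMA node can serve all providers, which is the situation in which a strict policy would reject the pod.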

    Topology Manager is useful for workloads that use hardware accelerators to support latency-critical execution and high throughput parallel computation.

    To use Topology Manager, you must configure CPU Manager with the static policy.

    Procedure

    1. Optional: Label a node:

      # oc label node perf-node.example.com cpumanager=true

    2. Edit the MachineConfigPool of the nodes where CPU Manager should be enabled. In this example, all workers have CPU Manager enabled:

      # oc edit machineconfigpool worker
    3. Add a label to the worker machine config pool:

      metadata:
        creationTimestamp: 2020-xx-xxx
        generation: 3
        labels:
          custom-kubelet: cpumanager-enabled
    4. Create a KubeletConfig custom resource (CR) in a file named cpumanager-kubeletconfig.yaml. Reference the label created in the previous step so that the correct nodes are updated with the new kubelet config. See the machineConfigPoolSelector section:

      apiVersion: machineconfiguration.openshift.io/v1
      kind: KubeletConfig
      metadata:
        name: cpumanager-enabled
      spec:
        machineConfigPoolSelector:
          matchLabels:
            custom-kubelet: cpumanager-enabled
        kubeletConfig:
          cpuManagerPolicy: static
          cpuManagerReconcilePeriod: 5s
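    The KubeletConfig only reaches the intended nodes if its machineConfigPoolSelector labels are present on the machine config pool. A minimal sketch of that matching rule, using plain dictionaries that mirror the YAML above (the helper name pool_matches is hypothetical, not part of any OpenShift tooling):

```python
def pool_matches(selector_labels, pool_labels):
    """True if every matchLabels entry is on the pool with the same value."""
    return all(pool_labels.get(key) == value
               for key, value in selector_labels.items())


kubelet_config = {
    "apiVersion": "machineconfiguration.openshift.io/v1",
    "kind": "KubeletConfig",
    "metadata": {"name": "cpumanager-enabled"},
    "spec": {
        "machineConfigPoolSelector": {
            "matchLabels": {"custom-kubelet": "cpumanager-enabled"},
        },
        "kubeletConfig": {
            "cpuManagerPolicy": "static",
            "cpuManagerReconcilePeriod": "5s",
        },
    },
}

# Labels on the worker pool after the label from the previous step is added.
worker_pool_labels = {"custom-kubelet": "cpumanager-enabled"}

selector = kubelet_config["spec"]["machineConfigPoolSelector"]["matchLabels"]
print(pool_matches(selector, worker_pool_labels))  # True
```

    A typo in either the pool label or the selector makes the check return False, which corresponds to a KubeletConfig that is created but never applied to the workers.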
    5. Create the dynamic kubelet config:

      # oc create -f cpumanager-kubeletconfig.yaml

      This adds the CPU Manager feature to the kubelet config and, if needed, the Machine Config Operator (MCO) reboots the node. Enabling CPU Manager does not itself require a reboot.

      Check for the merged kubelet config:

      # oc get machineconfig 99-worker-XXXXXX-XXXXX-XXXX-XXXXX-kubelet -o json | grep ownerReference -A7

      Example output

      "ownerReferences": [
          {
              "apiVersion": "machineconfiguration.openshift.io/v1",
              "kind": "KubeletConfig",
              "name": "cpumanager-enabled",
              "uid": "7ed5616d-6b72-11e9-aae1-021e1ce18878"
          }
      ]
    6. Check the worker for the updated kubelet.conf:

      # oc debug node/perf-node.example.com
      sh-4.2# cat /host/etc/kubernetes/kubelet.conf | grep cpuManager

      Example output

      cpuManagerPolicy: static
      cpuManagerReconcilePeriod: 5s
    7. Create a pod that requests a core or multiple cores. Both limits and requests must have their CPU value set to a whole integer. That is the number of cores that will be dedicated to this pod:

      Example cpumanager-pod.yaml file

      apiVersion: v1
      kind: Pod
      metadata:
        generateName: cpumanager-
      spec:
        containers:
        - name: cpumanager
          resources:
            requests:
              cpu: 1
              memory: "1G"
            limits:
              cpu: 1
              memory: "1G"
        nodeSelector:
          cpumanager: "true"
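    The whole-integer rule can be checked before a pod is created. The following is a minimal sketch, assuming that both requests and limits are written out explicitly and that equal memory quantities are spelled identically; the function names are hypothetical.

```python
from fractions import Fraction


def parse_cpu(quantity):
    """Convert a CPU quantity such as 1, "2", "0.5", or "1500m" to cores."""
    text = str(quantity).strip()
    if text.endswith("m"):
        return Fraction(int(text[:-1]), 1000)
    return Fraction(text)


def gets_dedicated_cores(resources):
    """True if requests equal limits and the CPU value is a whole number >= 1.

    Simplification: memory quantities are compared as strings.
    """
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    for name in ("cpu", "memory"):
        if name not in requests or name not in limits:
            return False
    if str(requests["memory"]) != str(limits["memory"]):
        return False
    cpu_request = parse_cpu(requests["cpu"])
    cpu_limit = parse_cpu(limits["cpu"])
    return (cpu_request == cpu_limit
            and cpu_request >= 1
            and cpu_request.denominator == 1)


pod_resources = {
    "requests": {"cpu": 1, "memory": "1G"},
    "limits": {"cpu": 1, "memory": "1G"},
}
print(gets_dedicated_cores(pod_resources))  # True
```

    A request such as "1500m" fails the check because it is not a whole number of cores, so such a pod would share CPUs with other workloads instead of receiving dedicated ones.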
    8. Create the pod:

      # oc create -f cpumanager-pod.yaml
    9. Verify that the pod is scheduled to the node that you labeled:

      # oc describe pod cpumanager

      Example output

      Name:               cpumanager-6cqz7
      Priority:           0
      PriorityClassName:  <none>
      Node:               perf-node.example.com/xxx.xx.xx.xxx
      ...
       Limits:
        cpu:     1
        memory:  1G
       Requests:
        cpu:     1
        memory:  1G
      ...
      QoS Class:       Guaranteed
      Node-Selectors:  cpumanager=true
    10. Verify that the cgroups are set up correctly. Get the process ID (PID) of the pause process:

      # ├─init.scope
      │ └─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 17
      └─kubepods.slice
        ├─kubepods-pod69c01f8e_6b74_11e9_ac0f_0a2b62178a22.slice
        │ ├─crio-b5437308f1a574c542bdf08563b865c0345c8f8c0b0a655612c.scope
        │ └─32706 /pause

      Pods of quality of service (QoS) tier Guaranteed are placed within the kubepods.slice. Pods of other QoS tiers end up in child cgroups of kubepods:

      # cd /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-pod69c01f8e_6b74_11e9_ac0f_0a2b62178a22.slice/crio-b5437308f1ad1a7db0574c542bdf08563b865c0345c86e9585f8c0b0a655612c.scope
      # for i in cpuset.cpus tasks ; do echo -n "$i "; cat "$i" ; done

      Example output

      cpuset.cpus 1
      tasks 32706
    11. Check the allowed CPU list for the task:

      # grep ^Cpus_allowed_list /proc/32706/status

      Example output

      Cpus_allowed_list:    1
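    Values such as cpuset.cpus and Cpus_allowed_list use the Linux CPU list format, in which entries are separated by commas and ranges are written with a hyphen (for example, 0-3,8). A minimal sketch for parsing such a list and checking that two cpusets do not overlap; the helper name is hypothetical:

```python
def parse_cpu_list(text):
    """Parse a Linux CPU list such as "0-3,8" into a set of CPU IDs."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty list or a trailing comma
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


guaranteed_pod_cpus = parse_cpu_list("1")  # cpuset.cpus of the Guaranteed pod
other_pod_cpus = parse_cpu_list("0")       # a cpuset that contains only CPU 0
print(guaranteed_pod_cpus.isdisjoint(other_pod_cpus))  # True
```

    When the two sets are disjoint, no process in the second cpuset can run on the core that is dedicated to the Guaranteed pod.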

    12. Verify that another pod (in this case, a pod in the BestEffort QoS tier) on the system cannot run on the core allocated for the Guaranteed pod:

      # cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podc494a073_6b77_11e9_98c0_06bba5c387ea.slice/crio-c56982f57b75a2420947f0afc6cafe7534c5734efc34157525fa9abbf99e3849.scope/cpuset.cpus
      0
      # oc describe node perf-node.example.com

      Example output

      ...
      Capacity:
       attachable-volumes-aws-ebs:  39
       cpu:                         2
       ephemeral-storage:           124768236Ki
       hugepages-1Gi:               0
       hugepages-2Mi:               0
       memory:                      8162900Ki
       pods:                        250
      Allocatable:
       attachable-volumes-aws-ebs:  39
       cpu:                         1500m
       ephemeral-storage:           124768236Ki
       hugepages-1Gi:               0
       hugepages-2Mi:               0
       memory:                      7548500Ki
       pods:                        250
      -------   ----              ------------  ----------  ---------------  -------------  ---
      default   cpumanager-6cqz7  1 (66%)       1 (66%)     1G (12%)         1G (12%)       29m
      Allocated resources:
        (Total limits may be over 100 percent, i.e., overcommitted.)
        Resource  Requests     Limits
        --------  --------     ------
        cpu       1440m (96%)  1 (66%)

      This VM has two CPU cores. The system-reserved setting reserves 500 millicores, so half of one core is subtracted from the total capacity of the node to arrive at the Node Allocatable amount, which is 1500 millicores. Because each CPU Manager pod takes one whole core, and a whole core is equivalent to 1000 millicores, only one such pod can run on this node. If you try to schedule a second pod, the system accepts the pod, but it is never scheduled:

      NAME               READY   STATUS    RESTARTS   AGE
      cpumanager-6cqz7   1/1     Running   0          33m
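    The capacity arithmetic above can be written out as a short calculation. This is a sketch only: it ignores CPU already requested by other pods on the node, and the function name is hypothetical.

```python
def whole_core_pods_that_fit(capacity_millicores, reserved_millicores,
                             cores_per_pod=1):
    """Return (allocatable millicores, number of whole-core pods that fit)."""
    allocatable = capacity_millicores - reserved_millicores
    return allocatable, allocatable // (cores_per_pod * 1000)


# Two cores (2000m) with 500m reserved by the system-reserved setting.
allocatable, pods = whole_core_pods_that_fit(2000, 500)
print(allocatable, pods)  # 1500 1
```

    After the first pod takes 1000 millicores, only 500 millicores remain, which is less than the whole core that a second pod would need.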

    Topology Manager aligns Pod resources of all Quality of Service (QoS) classes by collecting topology hints from Hint Providers, such as CPU Manager and Device Manager, and using the collected hints to align the Pod resources.

    Topology Manager supports four allocation policies, which you assign in the cpumanager-enabled custom resource (CR):

    none policy

    This is the default policy and does not perform any topology alignment.

    best-effort policy

    For each container in a pod with the best-effort topology management policy, kubelet calls each Hint Provider to discover its resource availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, Topology Manager stores this result and still admits the pod to the node.

    restricted policy

    For each container in a pod with the restricted topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, Topology Manager rejects this pod from the node, resulting in a pod in a Terminated state with a pod admission failure.

    single-numa-node policy

    For each container in a pod with the single-numa-node topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager determines if a single NUMA Node affinity is possible. If it is, the pod is admitted to the node. If a single NUMA Node affinity is not possible, the Topology Manager rejects the pod from the node. This results in a pod in a Terminated state with a pod admission failure.
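    The admission behavior of the four policies can be summarized as a small decision function. This is a simplified sketch of the descriptions above, not the kubelet implementation; the two boolean inputs stand in for the result of merging the collected hints, and the function name is hypothetical.

```python
def admits(policy, affinity_preferred, single_numa_possible):
    """Return True if a pod is admitted under the given policy.

    affinity_preferred: whether the merged NUMA affinity is preferred.
    single_numa_possible: whether a single NUMA node affinity is possible.
    """
    if policy in ("none", "best-effort"):
        # none performs no alignment; best-effort records the result
        # but admits the pod either way.
        return True
    if policy == "restricted":
        return affinity_preferred
    if policy == "single-numa-node":
        return single_numa_possible
    raise ValueError(f"unknown policy: {policy}")


for policy in ("none", "best-effort", "restricted", "single-numa-node"):
    print(policy, admits(policy, affinity_preferred=False,
                         single_numa_possible=False))
```

    With both inputs set to False, the loop prints True for none and best-effort and False for restricted and single-numa-node, which matches the rejection cases described above.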

    To use Topology Manager, you must configure an allocation policy in the cpumanager-enabled custom resource (CR). This file might exist if you have set up CPU Manager. If the file does not exist, you can create the file.

    Prerequisites

    • Configure the CPU Manager policy to be static.

    Procedure

    To activate Topology Manager:

    1. Configure the Topology Manager allocation policy in the cpumanager-enabled custom resource (CR).

      $ oc edit KubeletConfig cpumanager-enabled

      apiVersion: machineconfiguration.openshift.io/v1
      kind: KubeletConfig
      metadata:
        name: cpumanager-enabled
      spec:
        machineConfigPoolSelector:
          matchLabels:
            custom-kubelet: cpumanager-enabled
        kubeletConfig:
          cpuManagerPolicy: static  # must be static to use Topology Manager
          cpuManagerReconcilePeriod: 5s
          topologyManagerPolicy: single-numa-node  # the selected Topology Manager allocation policy
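    Before applying the change, the kubeletConfig section can be sanity-checked against the prerequisite and the policy names listed in this document. A minimal sketch, with a hypothetical helper name and plain dictionaries in place of the parsed YAML:

```python
TOPOLOGY_POLICIES = ("none", "best-effort", "restricted", "single-numa-node")


def check_kubelet_config(kubelet_config):
    """Return a list of problems found in a kubeletConfig dictionary."""
    problems = []
    policy = kubelet_config.get("topologyManagerPolicy", "none")
    if policy not in TOPOLOGY_POLICIES:
        problems.append(f"unknown topologyManagerPolicy: {policy}")
    # Prerequisite from this procedure: CPU Manager policy must be static.
    if policy != "none" and kubelet_config.get("cpuManagerPolicy") != "static":
        problems.append("cpuManagerPolicy must be static")
    return problems


config = {
    "cpuManagerPolicy": "static",
    "cpuManagerReconcilePeriod": "5s",
    "topologyManagerPolicy": "single-numa-node",
}
print(check_kubelet_config(config))  # []
```

    An empty list means that the fragment satisfies both checks; any other result names the field to fix.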

    The example Pod specs below help illustrate pod interactions with Topology Manager.

    The following pod runs in the BestEffort QoS class because no resource requests or limits are specified.

      spec:
        containers:
        - name: nginx
          image: nginx

    The next pod runs in the Burstable QoS class because requests are less than limits.

      spec:
        containers:
        - name: nginx
          image: nginx
          resources:
            limits:
              memory: "200Mi"
            requests:
              memory: "100Mi"

    If the selected policy is anything other than none, Topology Manager would not consider either of these Pod specifications.

    The last example pod below runs in the Guaranteed QoS class because requests are equal to limits.

      spec:
        containers:
        - name: nginx
          image: nginx
          resources:
            limits:
              memory: "200Mi"
              cpu: "2"
              example.com/device: "1"
            requests:
              memory: "200Mi"
              cpu: "2"

    Topology Manager would consider this pod. The Topology Manager would consult the hint providers, which are CPU Manager and Device Manager, to get topology hints for the pod.
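    The QoS class of each example pod follows from its CPU and memory requests and limits. The sketch below is a simplified classifier, assuming that a missing request defaults to the limit, that equal quantities are spelled identically, and that extended resources such as example.com/device do not affect the class; the function name is hypothetical.

```python
def qos_class(containers):
    """Classify a pod as BestEffort, Burstable, or Guaranteed."""
    any_set = False
    guaranteed = True
    for container in containers:
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        requests = resources.get("requests", {})
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            request = requests.get(name, limit)  # request defaults to limit
            if limit is not None or request is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


best_effort = [{"name": "nginx", "image": "nginx"}]
burstable = [{"name": "nginx", "image": "nginx", "resources": {
    "limits": {"memory": "200Mi"}, "requests": {"memory": "100Mi"}}}]
guaranteed = [{"name": "nginx", "image": "nginx", "resources": {
    "limits": {"memory": "200Mi", "cpu": "2", "example.com/device": "1"},
    "requests": {"memory": "200Mi", "cpu": "2"}}}]

print(qos_class(best_effort), qos_class(burstable), qos_class(guaranteed))
# BestEffort Burstable Guaranteed
```

    Only the last pod is classified as Guaranteed, which is the class that Topology Manager considers in the examples above.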