Validating single-node OpenShift cluster tuning for vDU application workloads


    Use the following table as the basis to configure the cluster host firmware for vDU applications running on OKD 4.13.

    Table 1. Recommended cluster host firmware settings
    Firmware setting | Configuration | Description

    HyperTransport (HT)

    Enabled

    HyperTransport (HT) is a bus technology developed by AMD. HT provides a high-speed link between components in host memory and other system peripherals.

    UEFI

    Enabled

    Enable booting from UEFI for the vDU host.

    CPU Power and Performance Policy

    Performance

    Set CPU Power and Performance Policy to optimize the system for performance over energy efficiency.

    Uncore Frequency Scaling

    Disabled

    Disable Uncore Frequency Scaling to prevent the voltage and frequency of non-core parts of the CPU from being set independently.

    Uncore Frequency

    Maximum

    Sets the non-core parts of the CPU such as cache and memory controller to their maximum possible frequency of operation.

    Performance P-limit

    Disabled

    Disable Performance P-limit to prevent the Uncore frequency coordination of processors.

    Enhanced Intel® SpeedStep Tech

    Enabled

    Enable Enhanced Intel SpeedStep to allow the system to dynamically adjust processor voltage and core frequency, which decreases power consumption and heat production in the host.

    Intel® Turbo Boost Technology

    Enabled

    Enable Turbo Boost Technology for Intel-based CPUs to automatically allow processor cores to run faster than the rated operating frequency if they are operating below power, current, and temperature specification limits.

    Intel Configurable TDP

    Enabled

    Enables configurable Thermal Design Power (TDP) for the CPU.

    Configurable TDP Level

    Level 2

    TDP level sets the CPU power consumption required for a particular performance rating. TDP level 2 sets the CPU to the most stable performance level at the cost of power consumption.

    Energy Efficient Turbo

    Disabled

    Disable Energy Efficient Turbo to prevent the processor from using an energy-efficiency based policy.

    Hardware P-States

    Enabled or Disabled

    Enable OS-controlled P-States to allow power saving configurations. Disable P-States (performance states) to optimize the operating system and CPU for performance over power consumption.

    Package C-State

    C0/C1 state

    Use C0 or C1 states to set the processor to a fully active state (C0) or to stop CPU internal clocks running in software (C1).

    C1E

    Disabled

    CPU Enhanced Halt (C1E) is a power saving feature in Intel chips. Disabling C1E prevents the operating system from sending a halt command to the CPU when inactive.

    Processor C6

    Disabled

    C6 power-saving is a CPU feature that automatically disables idle CPU cores and cache. Disabling C6 improves system performance.

    Sub-NUMA Clustering

    Disabled

    Sub-NUMA clustering divides the processor cores, cache, and memory into multiple NUMA domains. Disabling this option can increase performance for latency-sensitive workloads.

    Enable global SR-IOV and VT-d settings in the firmware for the host. These settings are relevant to bare-metal environments.

    Enable both C-states and OS-controlled P-States to allow per pod power management.

    Clusters running virtualized distributed unit (vDU) applications require a highly tuned and optimized configuration. The following information describes the various elements that you require to support vDU workloads in OKD 4.13 clusters.

    Check that the MachineConfig custom resources (CRs) that you extract from the ztp-site-generate container are applied in the cluster. The CRs can be found in the extracted out/source-crs/extra-manifest/ folder.
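    If you have not yet extracted the reference CRs, the following sketch shows one way to pull them from the ztp-site-generate container and confirm the folder layout. The image tag v4.13 is an assumption; use the tag that matches your release.

    ```shell
    # Extraction sketch: pull the reference CRs out of the ztp-site-generate container.
    # Registry access is assumed; run these commands manually:
    #
    #   mkdir -p ./out
    #   podman run --log-driver=none --rm registry.redhat.io/openshift4/ztp-site-generate-rhel8:v4.13 \
    #     extract /home/ztp --tar | tar x -C ./out
    #
    # Then confirm the extra-manifest folder is present.
    # check_extracted is a hypothetical helper, not part of the product tooling.
    check_extracted() {
      # $1 is the extraction root, for example ./out
      if [ -d "$1/source-crs/extra-manifest" ]; then
        echo "extra-manifest CRs found"
      else
        echo "missing extra-manifest folder"
      fi
    }
    ```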

    MachineConfig CRs from the ztp-site-generate container configure the cluster host; the complete set is in the extracted out/source-crs/extra-manifest/ folder.


    The following Operators are required for clusters running virtualized distributed unit (vDU) applications and are a part of the baseline reference configuration:

    • Node Tuning Operator (NTO). NTO packages functionality that was previously delivered with the Performance Addon Operator, which is now a part of NTO.

    • PTP Operator

    • SR-IOV Network Operator

    • Red Hat OpenShift Logging Operator

    • Local Storage Operator

    Always use the latest supported real-time kernel version in your cluster. Ensure that you apply the following configurations in the cluster:

    1. Ensure that the following additionalKernelArgs are set in the cluster performance profile:
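      For example, the reference performance profile later in this document sets the following arguments (exact values depend on your reference configuration):

      ```yaml
      spec:
        additionalKernelArgs:
        - idle=poll
        - rcupdate.rcu_normal_after_boot=0
        - efi=runtime
      ```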

    2. Ensure that the performance-patch profile in the Tuned CR configures the correct CPU isolation set that matches the isolated CPU set in the related PerformanceProfile CR, for example:

      spec:
        profile:
          - name: performance-patch
            # The 'include' line must match the associated PerformanceProfile name
            # And the cmdline_crash CPU set must match the 'isolated' set in the associated PerformanceProfile
            data: |
              [main]
              summary=Configuration changes profile inherited from performance created tuned
              include=openshift-node-performance-openshift-node-performance-profile
              [bootloader]
              cmdline_crash=nohz_full=2-51,54-103 (1)
              [sysctl]
              kernel.timer_migration=1
              [scheduler]
              group.ice-ptp=0:f:10:*:ice-ptp.*
              [service]
              service.stalld=start,enable
              service.chronyd=stop,disable

      1 Listed CPUs depend on the host hardware configuration, specifically the number of available CPUs in the system and the CPU topology.

    Always use the latest version of the realtime kernel in your OKD clusters. If you are unsure about the kernel version that is in use in the cluster, you can compare the current realtime kernel version to the release version with the following procedure.

    Prerequisites

    • You have installed the OpenShift CLI (oc).

    • You are logged in as a user with cluster-admin privileges.

    • You have installed podman.

    Procedure

    1. Run the following command to get the cluster version:

      $ OCP_VERSION=$(oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}')
    2. Get the release image SHA number:

      $ DTK_IMAGE=$(oc adm release info --image-for=driver-toolkit quay.io/openshift-release-dev/ocp-release:$OCP_VERSION-x86_64)
    3. Run the release image container and extract the kernel version that is packaged with the cluster's current release:

      $ podman run --rm $DTK_IMAGE rpm -qa | grep 'kernel-rt-core-' | sed 's#kernel-rt-core-##'

      Example output

      4.18.0-305.49.1.rt7.121.el8_4.x86_64

      This is the default realtime kernel version that ships with the release.

      The realtime kernel is denoted by the string .rt in the kernel version.

    Verification

    Check that the kernel version listed for the cluster's current release matches the actual realtime kernel that is running in the cluster. Run the following commands to check the running realtime kernel version:

    1. Open a remote shell connection to the cluster node:

      $ oc debug node/<node_name>
    2. Check the realtime kernel version:

      sh-4.4# uname -r

      Example output

      4.18.0-305.49.1.rt7.121.el8_4.x86_64
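    The comparison above can be scripted. The following is a sketch; it assumes you are logged in with oc and podman as in the procedure, and compare_kernels is a hypothetical helper, not a product command:

    ```shell
    # Sketch: compare the release-shipped kernel-rt version with the running kernel.
    # compare_kernels takes the two version strings and reports whether they match.
    compare_kernels() {
      if [ "$1" = "$2" ]; then
        echo "kernel versions match: $1"
      else
        echo "MISMATCH: release=$1 running=$2"
      fi
    }
    # In the cluster, gather both values as in the procedure above, then compare:
    #   RELEASE_KERNEL=$(podman run --rm $DTK_IMAGE rpm -qa | grep 'kernel-rt-core-' | sed 's#kernel-rt-core-##')
    #   RUNNING_KERNEL=$(oc debug node/<node_name> -- chroot /host uname -r 2>/dev/null)
    #   compare_kernels "$RELEASE_KERNEL" "$RUNNING_KERNEL"
    ```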

    You can check that clusters are running the correct configuration. The following procedure describes how to check the various configurations that you require to deploy a DU application in OKD 4.13 clusters.

    Prerequisites

    • You have deployed a cluster and tuned it for vDU workloads.

    • You have installed the OpenShift CLI (oc).

    • You have logged in as a user with cluster-admin privileges.

    Procedure

      Check that the default OperatorHub sources are disabled:

      $ oc get operatorhub cluster -o yaml

      Example output

      spec:
        disableAllDefaultSources: true
    1. Check that all required CatalogSource resources are annotated for workload partitioning (PreferredDuringScheduling) by running the following command:

      $ oc get catalogsource -A -o jsonpath='{range .items[*]}{.metadata.name}{" -- "}{.metadata.annotations.target\.workload\.openshift\.io/management}{"\n"}{end}'

      Example output

      certified-operators -- {"effect": "PreferredDuringScheduling"}
      community-operators -- {"effect": "PreferredDuringScheduling"}
      ran-operators (1)
      redhat-marketplace -- {"effect": "PreferredDuringScheduling"}
      redhat-operators -- {"effect": "PreferredDuringScheduling"}

      1 CatalogSource resources that are not annotated are also returned. In this example, the ran-operators CatalogSource resource is not annotated and does not have the PreferredDuringScheduling annotation.
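      To flag unannotated CatalogSource resources automatically, you can filter the jsonpath output from the previous command. This is a sketch; find_unannotated is a hypothetical helper that reads the "name -- annotation" lines on stdin:

      ```shell
      # Sketch: print only CatalogSources missing the workload-partitioning annotation.
      # Lines with a JSON annotation payload after ' -- ' are filtered out;
      # lines with an empty or missing payload are reported.
      find_unannotated() {
        awk -F' -- ' 'NF < 2 || $2 == "" { print $1 }'
      }
      # Usage in the cluster:
      #   oc get catalogsource -A -o jsonpath='{range .items[*]}{.metadata.name}{" -- "}{.metadata.annotations.target\.workload\.openshift\.io/management}{"\n"}{end}' | find_unannotated
      ```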
    2. Check that all applicable OKD Operator namespaces are annotated for workload partitioning. This includes all Operators installed with core OKD and the set of additional Operators included in the reference DU tuning configuration. Run the following command:

      $ oc get namespaces -A -o jsonpath='{range .items[*]}{.metadata.name}{" -- "}{.metadata.annotations.workload\.openshift\.io/allowed}{"\n"}{end}'

      Example output

      default --
      openshift-apiserver -- management
      openshift-apiserver-operator -- management
      openshift-authentication -- management
      openshift-authentication-operator -- management

      Additional Operators must not be annotated for workload partitioning. In the output from the previous command, additional Operators should be listed without any value on the right side of the separator.

    3. Check that the ClusterLogging configuration is correct. Run the following commands:

      1. Validate that the appropriate input and output logs are configured:

        $ oc get -n openshift-logging ClusterLogForwarder instance -o yaml

        Example output

        apiVersion: logging.openshift.io/v1
        kind: ClusterLogForwarder
        metadata:
          creationTimestamp: "2022-07-19T21:51:41Z"
          generation: 1
          name: instance
          namespace: openshift-logging
          resourceVersion: "1030342"
          uid: 8c1a842d-80c5-447a-9150-40350bdf40f0
        spec:
          inputs:
          - infrastructure: {}
            name: infra-logs
          outputs:
          - name: kafka-open
            type: kafka
            url: tcp://10.46.55.190:9092/test
          pipelines:
          - inputRefs:
            - audit
            name: audit-logs
            outputRefs:
            - kafka-open
          - inputRefs:
            - infrastructure
            name: infrastructure-logs
            outputRefs:
            - kafka-open
        ...
      2. Check that the curation schedule is appropriate for your application:

        $ oc get -n openshift-logging clusterloggings.logging.openshift.io instance -o yaml

        Example output

    4. Check that the web console is disabled (managementState: Removed) by running the following command:

      $ oc get consoles.operator.openshift.io cluster -o jsonpath="{ .spec.managementState }"

      Example output

      Removed
    5. Check that chronyd is disabled on the cluster node by running the following commands:

      $ oc debug node/<node_name>

      Check the status of chronyd on the node:

        sh-4.4# chroot /host
        sh-4.4# systemctl status chronyd

        Example output

        chronyd.service - NTP client/server
          Loaded: loaded (/usr/lib/systemd/system/chronyd.service; disabled; vendor preset: enabled)
          Active: inactive (dead)
            Docs: man:chronyd(8)
                  man:chrony.conf(5)
      1. Check that the PTP interface is successfully synchronized to the primary clock using a remote shell connection to the linuxptp-daemon container and the PTP Management Client (pmc) tool:

        1. Set the $PTP_POD_NAME variable with the name of the linuxptp-daemon pod by running the following command:

          $ PTP_POD_NAME=$(oc get pods -n openshift-ptp -l app=linuxptp-daemon -o name)
        2. Run the following command to check the sync status of the PTP device:

          $ oc -n openshift-ptp rsh -c linuxptp-daemon-container ${PTP_POD_NAME} pmc -u -f /var/run/ptp4l.0.config -b 0 'GET PORT_DATA_SET'

          Example output

          sending: GET PORT_DATA_SET
          3cecef.fffe.7a7020-1 seq 0 RESPONSE MANAGEMENT PORT_DATA_SET
            portIdentity            3cecef.fffe.7a7020-1
            portState               SLAVE
            logMinDelayReqInterval  -4
            peerMeanPathDelay       0
            logAnnounceInterval     1
            announceReceiptTimeout  3
            logSyncInterval         0
            delayMechanism          1
            logMinPdelayReqInterval 0
            versionNumber           2
          3cecef.fffe.7a7020-2 seq 0 RESPONSE MANAGEMENT PORT_DATA_SET
            portIdentity            3cecef.fffe.7a7020-2
            portState               LISTENING
            logMinDelayReqInterval  0
            peerMeanPathDelay       0
            logAnnounceInterval     1
            announceReceiptTimeout  3
            logSyncInterval         0
            delayMechanism          1
            logMinPdelayReqInterval 0
            versionNumber           2
        3. Run the following pmc command to check the PTP clock status:

          $ oc -n openshift-ptp rsh -c linuxptp-daemon-container ${PTP_POD_NAME} pmc -u -f /var/run/ptp4l.0.config -b 0 'GET TIME_STATUS_NP'

          Example output

          sending: GET TIME_STATUS_NP
          3cecef.fffe.7a7020-0 seq 0 RESPONSE MANAGEMENT TIME_STATUS_NP
            master_offset              10 (1)
            ingress_time               1657275432697400530
            cumulativeScaledRateOffset +0.000000000
            scaledLastGmPhaseChange    0
            gmTimeBaseIndicator        0
            lastGmPhaseChange          0x0000'0000000000000000.0000
            gmPresent                  true (2)
            gmIdentity                 3c2c30.ffff.670e00

          1 master_offset should be between -100 and 100 ns.
          2 Indicates that the PTP clock is synchronized to a master, and the local clock is not the grandmaster clock.
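          The range check on master_offset can be automated by parsing the pmc output. This is a sketch; check_offset is a hypothetical helper that reads the TIME_STATUS_NP response on stdin:

          ```shell
          # Sketch: extract master_offset from pmc 'GET TIME_STATUS_NP' output and
          # verify it falls in the expected -100 to 100 ns window.
          check_offset() {
            offset=$(awk '/master_offset/ { print $2; exit }')
            if [ "$offset" -ge -100 ] && [ "$offset" -le 100 ]; then
              echo "PTP offset OK: ${offset} ns"
            else
              echo "PTP offset OUT OF RANGE: ${offset} ns"
            fi
          }
          # Usage in the cluster:
          #   oc -n openshift-ptp rsh -c linuxptp-daemon-container ${PTP_POD_NAME} \
          #     pmc -u -f /var/run/ptp4l.0.config -b 0 'GET TIME_STATUS_NP' | check_offset
          ```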
        4. Check that the expected master offset value corresponding to the value in /var/run/ptp4l.0.config is found in the linuxptp-daemon-container log:

          $ oc logs $PTP_POD_NAME -n openshift-ptp -c linuxptp-daemon-container

          Example output

          phc2sys[56020.341]: [ptp4l.1.config] CLOCK_REALTIME phc offset -1731092 s2 freq -1546242 delay 497
          ptp4l[56020.390]: [ptp4l.1.config] master offset -2 s2 freq -5863 path delay 541
          ptp4l[56020.390]: [ptp4l.0.config] master offset -8 s2 freq -10699 path delay 533
      2. Check that the SR-IOV configuration is correct by running the following commands:

        1. Check that the disableDrain value in the SriovOperatorConfig resource is set to true:

          $ oc get sriovoperatorconfig -n openshift-sriov-network-operator default -o jsonpath="{.spec.disableDrain}{'\n'}"

          Example output

          true
        2. Check that the SriovNetworkNodeState sync status is Succeeded by running the following command:

          $ oc get SriovNetworkNodeStates -n openshift-sriov-network-operator -o jsonpath="{.items[*].status.syncStatus}{'\n'}"

          Example output

          Succeeded
        3. Verify that the expected number and configuration of virtual functions (VFs) under each interface configured for SR-IOV are present and correct in the .status.interfaces field. For example:

          Example output

          apiVersion: v1
          items:
          - apiVersion: sriovnetwork.openshift.io/v1
            kind: SriovNetworkNodeState
            ...
            status:
              interfaces:
              ...
              - Vfs:
                - deviceID: 154c
                  driver: vfio-pci
                  pciAddress: 0000:3b:0a.0
                  vendor: "8086"
                  vfID: 0
                - deviceID: 154c
                  driver: vfio-pci
                  pciAddress: 0000:3b:0a.1
                  vendor: "8086"
                  vfID: 1
                - deviceID: 154c
                  driver: vfio-pci
                  pciAddress: 0000:3b:0a.2
                  vendor: "8086"
                  vfID: 2
                - deviceID: 154c
                  driver: vfio-pci
                  pciAddress: 0000:3b:0a.3
                  vendor: "8086"
                  vfID: 3
                - deviceID: 154c
                  driver: vfio-pci
                  pciAddress: 0000:3b:0a.4
                  vendor: "8086"
                  vfID: 4
                - deviceID: 154c
                  driver: vfio-pci
                  pciAddress: 0000:3b:0a.5
                  vendor: "8086"
                  vfID: 5
                - deviceID: 154c
                  driver: vfio-pci
                  pciAddress: 0000:3b:0a.6
                  vendor: "8086"
                  vfID: 6
                - deviceID: 154c
                  driver: vfio-pci
                  pciAddress: 0000:3b:0a.7
                  vendor: "8086"
                  vfID: 7
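          To spot-check the VF count without reading the full YAML, you can count vfID entries in the SriovNetworkNodeState output. This is a sketch; count_vfs is a hypothetical helper that reads the YAML on stdin:

          ```shell
          # Sketch: count the virtual functions reported in SriovNetworkNodeState YAML
          # by counting the lines that carry a vfID field.
          count_vfs() {
            grep -c 'vfID:'
          }
          # Usage in the cluster (8 VFs are expected per interface in the example above):
          #   oc get sriovnetworknodestates -n openshift-sriov-network-operator -o yaml | count_vfs
          ```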
      3. Check that the cluster performance profile is correct. The cpu and hugepages sections will vary depending on your hardware configuration. Run the following command:

          $ oc get performanceprofile openshift-node-performance-profile -o yaml

          Example output

          apiVersion: performance.openshift.io/v2
          kind: PerformanceProfile
          metadata:
            creationTimestamp: "2022-07-19T21:51:31Z"
            finalizers:
            - foreground-deletion
            generation: 1
            name: openshift-node-performance-profile
            resourceVersion: "33558"
            uid: 217958c0-9122-4c62-9d4d-fdc27c31118c
          spec:
            additionalKernelArgs:
            - idle=poll
            - rcupdate.rcu_normal_after_boot=0
            - efi=runtime
            cpu:
              isolated: 2-51,54-103
              reserved: 0-1,52-53
            hugepages:
              defaultHugepagesSize: 1G
              pages:
              - count: 32
                size: 1G
            machineConfigPoolSelector:
              pools.operator.machineconfiguration.openshift.io/master: ""
            net:
              userLevelNetworking: true
            nodeSelector:
              node-role.kubernetes.io/master: ""
            numa:
              topologyPolicy: restricted
            realTimeKernel:
              enabled: true
          status:
            conditions:
            - lastHeartbeatTime: "2022-07-19T21:51:31Z"
              lastTransitionTime: "2022-07-19T21:51:31Z"
              status: "True"
              type: Available
            - lastHeartbeatTime: "2022-07-19T21:51:31Z"
              lastTransitionTime: "2022-07-19T21:51:31Z"
              status: "True"
              type: Upgradeable
            - lastHeartbeatTime: "2022-07-19T21:51:31Z"
              lastTransitionTime: "2022-07-19T21:51:31Z"
              status: "False"
              type: Progressing
            - lastHeartbeatTime: "2022-07-19T21:51:31Z"
              lastTransitionTime: "2022-07-19T21:51:31Z"
              status: "False"
              type: Degraded
            runtimeClass: performance-openshift-node-performance-profile
            tuned: openshift-cluster-node-tuning-operator/openshift-node-performance-openshift-node-performance-profile

          CPU settings depend on the number of cores available on the server and should align with workload partitioning settings. The hugepages configuration is server and application dependent.

        1. Check that the PerformanceProfile was successfully applied to the cluster by running the following command:

          $ oc get performanceprofile openshift-node-performance-profile -o jsonpath="{range .status.conditions[*]}{ @.type }{' -- '}{@.status}{'\n'}{end}"

          Example output

          Available -- True
          Upgradeable -- True
          Progressing -- False
          Degraded -- False
        2. Check the Tuned performance patch settings by running the following command:

          $ oc get tuneds.tuned.openshift.io -n openshift-cluster-node-tuning-operator performance-patch -o yaml

          Example output

          apiVersion: tuned.openshift.io/v1
          kind: Tuned
          metadata:
            creationTimestamp: "2022-07-18T10:33:52Z"
            generation: 1
            name: performance-patch
            namespace: openshift-cluster-node-tuning-operator
            resourceVersion: "34024"
            uid: f9799811-f744-4179-bf00-32d4436c08fd
          spec:
            profile:
            - data: |
                [main]
                summary=Configuration changes profile inherited from performance created tuned
                include=openshift-node-performance-openshift-node-performance-profile
                [bootloader]
                cmdline_crash=nohz_full=2-23,26-47 (1)
                [sysctl]
                kernel.timer_migration=1
                [scheduler]
                group.ice-ptp=0:f:10:*:ice-ptp.*
                [service]
                service.stalld=start,enable
                service.chronyd=stop,disable
              name: performance-patch
            recommend:
            - machineConfigLabels:
                machineconfiguration.openshift.io/role: master
              priority: 19
              profile: performance-patch

          1 Listed CPUs depend on the host hardware configuration, specifically the number of available CPUs in the system and the CPU topology.
        3. Check that cluster networking diagnostics are disabled by running the following command:

          $ oc get networks.operator.openshift.io cluster -o jsonpath='{.spec.disableNetworkDiagnostics}'

          Example output

          true
        4. Check that the Kubelet housekeeping interval is tuned to a slower rate. This is set in the containerMountNS machine config. Run the following command:

          $ oc describe machineconfig container-mount-namespace-and-kubelet-conf-master | grep OPENSHIFT_MAX_HOUSEKEEPING_INTERVAL_DURATION

          Example output

          Environment="OPENSHIFT_MAX_HOUSEKEEPING_INTERVAL_DURATION=60s"
        5. Check that Grafana and alertManagerMain are disabled and that the Prometheus retention period is set to 24h by running the following command:

          $ oc get configmap cluster-monitoring-config -n openshift-monitoring -o jsonpath="{ .data.config\.yaml }"

          Example output

          grafana:
            enabled: false
          alertmanagerMain:
            enabled: false
          prometheusK8s:
            retention: 24h
          1. Use the following commands to verify that Grafana and alertManagerMain routes are not found in the cluster:

            $ oc get route -n openshift-monitoring alertmanager-main

            $ oc get route -n openshift-monitoring grafana

            Both queries should return Error from server (NotFound) messages.

        6. Check that a minimum of 4 CPUs is allocated as reserved, consistently across the PerformanceProfile, Tuned performance-patch, workload partitioning settings, and kernel command line arguments, for example:

            Example output

            0-3