Troubleshooting Operator issues

    OKD 4.8 includes a default set of Operators that are required for proper functioning of the cluster. These default Operators are managed by the Cluster Version Operator (CVO).

    As a cluster administrator, you can install application Operators from the OperatorHub using the OKD web console or the CLI. You can then subscribe the Operator to one or more namespaces to make it available for developers on your cluster. Application Operators are managed by Operator Lifecycle Manager (OLM).

    If you experience Operator issues, verify Operator subscription status. Check Operator pod health across the cluster and gather Operator logs for diagnosis.

    Subscriptions can report the following condition types:

    Default OKD cluster Operators are managed by the Cluster Version Operator (CVO) and they do not have a Subscription object. Application Operators are managed by Operator Lifecycle Manager (OLM) and they have a Subscription object.

    Viewing Operator subscription status by using the CLI

    You can view Operator subscription status by using the CLI.

    Prerequisites

    • You have access to the cluster as a user with the cluster-admin role.

    • You have installed the OpenShift CLI (oc).

    Procedure

    1. List Operator subscriptions:

      $ oc get subs -n <operator_namespace>
    2. Use the oc describe command to inspect a Subscription resource:

      $ oc describe sub <subscription_name> -n <operator_namespace>
    3. In the command output, find the Conditions section for the status of Operator subscription condition types. In the following example, the CatalogSourcesUnhealthy condition type has a status of False because all available catalog sources are healthy:

      Example output

      Conditions:
        Last Transition Time:  2019-07-29T13:42:57Z
        Message:               all available catalogsources are healthy
        Reason:                AllCatalogSourcesHealthy
        Status:                False
        Type:                  CatalogSourcesUnhealthy
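
    The Conditions section can also be read non-interactively. The following is a minimal sketch rather than part of the documented procedure. It assumes a logged-in oc session, and the helper names sub_conditions and active_conditions are hypothetical. The first helper prints each condition as <type>=<status> by using a JSONPath template, and the second keeps only the conditions whose status is True:

```shell
# Print one "<type>=<status>" line per Subscription condition.
# Arguments: subscription name, namespace. Requires a logged-in oc session.
sub_conditions() {
  oc get sub "$1" -n "$2" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'
}

# Read "<type>=<status>" lines on stdin and print the types that are True,
# for example CatalogSourcesUnhealthy when a catalog source is failing.
active_conditions() {
  awk -F= '$2 == "True" { print $1 }'
}

# Usage:
#   sub_conditions <subscription_name> <operator_namespace> | active_conditions
```

    A subscription whose only condition is CatalogSourcesUnhealthy with a status of False, as in the preceding example output, produces no output.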

    Viewing Operator catalog source status by using the CLI

    You can view the status of an Operator catalog source by using the CLI.

    Prerequisites

    • You have access to the cluster as a user with the cluster-admin role.

    • You have installed the OpenShift CLI (oc).

    Procedure

    1. List the catalog sources in a namespace. For example, you can check the olm namespace, which is used for cluster-wide catalog sources:

      $ oc get catalogsources -n olm

      Example output

      NAME                  DISPLAY               TYPE   PUBLISHER     AGE
      certified-operators   Certified Operators   grpc   Red Hat       55m
      community-operators   Community Operators   grpc   Red Hat       55m
      example-catalog       Example Catalog       grpc   Example Org   2m25s
      redhat-marketplace    Red Hat Marketplace   grpc   Red Hat       55m
      redhat-operators      Red Hat Operators     grpc   Red Hat       55m
    2. Use the oc describe command to get more details and status about a catalog source:

      $ oc describe catalogsource example-catalog -n olm

      Example output

      Name:         example-catalog
      Namespace:    olm
      ...
      Status:
        Connection State:
          Address:              example-catalog.olm.svc:50051
          Last Connect:         2021-09-09T17:07:35Z
          Last Observed State:  TRANSIENT_FAILURE
        Registry Service:
          Created At:         2021-09-09T17:05:45Z
          Port:               50051
          Protocol:           grpc
          Service Name:       example-catalog
          Service Namespace:  olm

      In the preceding example output, the last observed state is TRANSIENT_FAILURE. This state indicates that there is a problem establishing a connection for the catalog source.

    3. List the pods in the namespace where your catalog source was created:

      $ oc get pods -n olm

      Example output

      NAME                                    READY   STATUS             RESTARTS   AGE
      certified-operators-cv9nn               1/1     Running            0          36m
      community-operators-6v8lp               1/1     Running            0          36m
      marketplace-operator-86bfc75f9b-jkgbc   1/1     Running            0          42m
      example-catalog-bwt8z                   0/1     ImagePullBackOff   0          3m55s
      redhat-marketplace-57p8c                1/1     Running            0          36m
      redhat-operators-smxx8                  1/1     Running            0          36m

      When a catalog source is created in a namespace, a pod for the catalog source is created in that namespace. In the preceding example output, the status for the example-catalog-bwt8z pod is ImagePullBackOff. This status indicates that there is an issue pulling the catalog source’s index image.

    4. Use the oc describe command to inspect a pod for more detailed information:

      $ oc describe pod example-catalog-bwt8z -n olm

      Example output

      Name:         example-catalog-bwt8z
      Namespace:    olm
      Priority:     0
      Node:         ci-ln-jyryyg2-f76d1-ggdbq-worker-b-vsxjd/10.0.128.2
      ...
      Events:
        Type     Reason          Age                From     Message
        ----     ------          ----               ----     -------
        Normal   AddedInterface  47s                multus   Add eth0 [10.131.0.40/23] from openshift-sdn
        Warning  Failed          20s (x2 over 46s)  kubelet  Error: ImagePullBackOff
        Normal   Pulling         8s (x3 over 47s)   kubelet  Pulling image "quay.io/example-org/example-catalog:v1"
        Warning  Failed          8s (x3 over 47s)   kubelet  Failed to pull image "quay.io/example-org/example-catalog:v1": rpc error: code = Unknown desc = reading manifest v1 in quay.io/example-org/example-catalog: unauthorized: access to the requested resource is not authorized
        Warning  Failed          8s (x3 over 47s)   kubelet  Error: ErrImagePull

      In the preceding example output, the error messages indicate that the catalog source’s index image is failing to pull successfully because of an authorization issue. For example, the index image might be stored in a registry that requires login credentials.
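
    The checks in this procedure can be scripted. The following sketch makes several assumptions: a logged-in oc session, the default column layout of oc get pods (NAME, READY, STATUS, and so on), and hypothetical helper names (catalog_state, check_catalog_state, and pods_not_running). It reads the last observed connection state of a catalog source and lists the pods in a namespace that are neither Running nor Completed:

```shell
# Print the last observed connection state of a catalog source.
# Arguments: catalog source name, namespace. Requires a logged-in oc session.
catalog_state() {
  oc get catalogsource "$1" -n "$2" \
    -o jsonpath='{.status.connectionState.lastObservedState}'
}

# Succeed only when the given state is READY. Any other value, such as
# TRANSIENT_FAILURE, is reported on stderr and returns a non-zero status.
check_catalog_state() {
  if [ "$1" = "READY" ]; then
    return 0
  fi
  echo "catalog source is not ready: ${1:-unknown}" >&2
  return 1
}

# Read `oc get pods` output on stdin and print the name and status of every
# pod that is neither Running nor Completed. Assumes STATUS is the third column.
pods_not_running() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage:
#   check_catalog_state "$(catalog_state example-catalog olm)"
#   oc get pods -n olm | pods_not_running
```

    Given the example pod listing from step 3, pods_not_running prints example-catalog-bwt8z ImagePullBackOff.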

    Querying Operator pod status

    You can list Operator pods within a cluster and their status. You can also collect a detailed Operator pod summary.

    Prerequisites

    • You have access to the cluster as a user with the cluster-admin role.

    • Your API service is still functional.

    • You have installed the OpenShift CLI (oc).

    Procedure

    1. List Operators running in the cluster. The output includes Operator version, availability, and up-time information:

      $ oc get clusteroperators
    2. List Operator pods running in the Operator’s namespace, plus pod status, restarts, and age:

      $ oc get pod -n <operator_namespace>
    3. Output a detailed Operator pod summary:

      $ oc describe pod <operator_pod_name> -n <operator_namespace>
    4. If an Operator issue is node-specific, query Operator container status on that node.

      1. Start a debug pod for the node:

        $ oc debug node/my-node
      2. Set /host as the root directory within the debug shell. The debug pod mounts the host’s root file system in /host within the pod. By changing the root directory to /host, you can run binaries contained in the host’s executable paths:

        # chroot /host
      3. List details about the node’s containers, including state and associated pod IDs:

        # crictl ps
      4. List information about a specific Operator container on the node. The following example lists information about the network-operator container:

        # crictl ps --name network-operator
      5. Exit from the debug shell.
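
    To narrow down which cluster Operators need attention before inspecting pods, the output of step 1 can be filtered. The following sketch assumes the column order NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE. The helper name unhealthy_clusteroperators is hypothetical, and a row with an empty VERSION value would shift the columns and be misread:

```shell
# Read `oc get clusteroperators` output on stdin and print each Operator that
# is not available or is degraded. Assumes the column order
# NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE.
unhealthy_clusteroperators() {
  awk 'NR > 1 && ($3 != "True" || $5 != "False") { print $1 }'
}

# Usage:
#   oc get clusteroperators | unhealthy_clusteroperators
```

    Each name that is printed can then be examined with the pod-level commands in steps 2 and 3.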

    Gathering Operator logs

    If you experience Operator issues, you can gather detailed diagnostic information from Operator pod logs.

    Prerequisites

    • You have access to the cluster as a user with the cluster-admin role.

    • Your API service is still functional.

    • You have installed the OpenShift CLI (oc).

    • You have the fully qualified domain names of the control plane machines (also known as the master machines).

    Procedure

    1. List the Operator pods that are running in the Operator’s namespace, plus the pod status, restarts, and age:

      $ oc get pods -n <operator_namespace>
    2. Review logs for an Operator pod:

      $ oc logs pod/<pod_name> -n <operator_namespace>

      If an Operator pod has multiple containers, the preceding command will produce an error that includes the name of each container. Query logs from an individual container:

      $ oc logs pod/<operator_pod_name> -c <container_name> -n <operator_namespace>
    3. If the API is not functional, review Operator pod and container logs on each control plane node by using SSH instead. Replace <master-node>.<cluster_name>.<base_domain> with appropriate values.

      1. List pods on each control plane node:

        $ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl pods
      2. For any Operator pods not showing a Ready status, inspect the pod’s status in detail. Replace <operator_pod_id> with the Operator pod’s ID listed in the output of the preceding command:

        $ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl inspectp <operator_pod_id>
      3. List containers related to an Operator pod:

        $ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl ps --pod=<operator_pod_id>
      4. For any Operator container not showing a Ready status, inspect the container’s status in detail. Replace <container_id> with a container ID listed in the output of the preceding command:

        $ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl inspect <container_id>
      5. Review the logs for any Operator containers not showing a Ready status. Replace <container_id> with a container ID listed in the output of the preceding command:

        $ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl logs -f <container_id>
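
    Because the SSH-based steps are repeated on each control plane node, a small loop can save typing. The following sketch is an illustration only: the helper name crictl_on_nodes and the SSH override variable are hypothetical, and the host names passed to it are the fully qualified domain names from the prerequisites. Setting SSH=echo previews the commands without connecting:

```shell
# Run a crictl subcommand on every listed control plane host over SSH.
# Arguments: the crictl subcommand (quoted if it has options), then host names.
# SSH can be overridden, for example SSH=echo, to preview the commands.
crictl_on_nodes() {
  subcommand=$1
  shift
  for host in "$@"; do
    echo "== ${host}"
    # ${subcommand} is intentionally unquoted so "ps --pod=<id>" splits into words.
    ${SSH:-ssh} "core@${host}" sudo crictl ${subcommand}
  done
}

# Usage:
#   crictl_on_nodes pods <master-node>.<cluster_name>.<base_domain>
#   crictl_on_nodes "ps --pod=<operator_pod_id>" <master-node>.<cluster_name>.<base_domain>
```

    Avoid using this loop with crictl logs -f, because following the logs on the first host blocks the loop from reaching the remaining hosts.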

    Disabling the Machine Config Operator from automatically rebooting

    When configuration changes are made by the Machine Config Operator (MCO), Fedora CoreOS (FCOS) must reboot for the changes to take effect. Whether the configuration change is automatic or manual, an FCOS node reboots automatically unless it is paused.

    The following modifications do not trigger a node reboot:

    • When the MCO detects any of the following changes, it applies the update without draining or rebooting the node:

      • Changes to the SSH key in the spec.config.passwd.users.sshAuthorizedKeys parameter of a machine config.

      • Changes to the global pull secret or pull secret in the openshift-config namespace.

      • Automatic rotation of the /etc/kubernetes/kubelet-ca.crt certificate authority (CA) by the Kubernetes API Server Operator.

    • When the MCO detects changes to the /etc/containers/registries.conf file, such as adding or editing an ImageContentSourcePolicy object, it drains the corresponding nodes, applies the changes, and uncordons the nodes.

    To avoid unwanted disruptions, you can modify the machine config pool (MCP) to prevent automatic rebooting after the Operator makes changes to the machine config.

    Pausing an MCP prevents the MCO from applying any configuration changes on the associated nodes. It also prevents automatically rotated certificates from being pushed to those nodes, including the kube-apiserver-to-kubelet-signer CA certificate. If the MCP is paused when the kube-apiserver-to-kubelet-signer CA certificate expires and the MCO attempts to renew the certificate automatically, the new certificate is created but not applied across the nodes in the paused MCP. This causes multiple oc commands to fail, including but not limited to oc debug, oc logs, oc exec, and oc attach. Pause an MCP only for short periods of time, and only after carefully considering when the kube-apiserver-to-kubelet-signer CA certificate expires.

    New CA certificates are generated at 292 days from the installation date and removed at 365 days from that date. To determine the next automatic CA certificate rotation, see "Understand CA cert auto renewal in Red Hat OpenShift 4".

    Disabling the Machine Config Operator from automatically rebooting by using the console

    To avoid unwanted disruptions from changes made by the Machine Config Operator (MCO), you can use the OKD web console to modify the machine config pool (MCP) to prevent the MCO from making any changes to nodes in that pool. This prevents any reboots that would normally be part of the MCO update process.

    Prerequisites

    • You have access to the cluster as a user with the cluster-admin role.

    Procedure

    To pause or unpause automatic MCO update rebooting:

    • Pause the autoreboot process:

      1. Log in to the OKD web console as a user with the cluster-admin role.

      2. Click Compute → Machine Config Pools.

      3. On the Machine Config Pools page, click either master or worker, depending upon which nodes you want to pause rebooting for.

      4. On the master or worker page, click YAML.

      5. In the YAML, update the spec.paused field to true.

        Sample MachineConfigPool object

        apiVersion: machineconfiguration.openshift.io/v1
        kind: MachineConfigPool
        ...
        spec:
          ...
          paused: true
      6. To verify that the MCP is paused, return to the Machine Config Pools page.

        On the Machine Config Pools page, the Paused column reports True for the MCP you modified.

        If the MCP has pending changes while paused, the Updated column is False and Updating is False. When Updated is True and Updating is False, there are no pending changes.

        If there are pending changes (where both the Updated and Updating columns are False), it is recommended to schedule a maintenance window for a reboot as early as possible. Use the following steps for unpausing the autoreboot process to apply the changes that were queued since the last reboot.

    Disabling the Machine Config Operator from automatically rebooting by using the CLI

    To avoid unwanted disruptions from changes made by the Machine Config Operator (MCO), you can modify the machine config pool (MCP) using the OpenShift CLI (oc) to prevent the MCO from making any changes to nodes in that pool. This prevents any reboots that would normally be part of the MCO update process.

    Prerequisites

    • You have access to the cluster as a user with the cluster-admin role.

    • You have installed the OpenShift CLI (oc).

    Procedure

    To pause or unpause automatic MCO update rebooting:

    • Pause the autoreboot process:

      1. Update the MachineConfigPool custom resource to set the spec.paused field to true.

        Control plane (master) nodes

        $ oc patch --type=merge --patch='{"spec":{"paused":true}}' machineconfigpool/master

        Worker nodes

        $ oc patch --type=merge --patch='{"spec":{"paused":true}}' machineconfigpool/worker
      2. Verify that the MCP is paused:

        Control plane (master) nodes

        $ oc get machineconfigpool/master --template='{{.spec.paused}}'

        Worker nodes

        $ oc get machineconfigpool/worker --template='{{.spec.paused}}'

        Example output

        true

        The spec.paused field is true and the MCP is paused.

      3. Determine if the MCP has pending changes:

        $ oc get machineconfigpool

        Example output

        If the UPDATED column is False and UPDATING is False, there are pending changes. When UPDATED is True and UPDATING is False, there are no pending changes. In the previous example, the worker node has pending changes. The control plane node (also known as the master node) does not have any pending changes.

        If there are pending changes (where both the Updated and Updating columns are False), it is recommended to schedule a maintenance window for a reboot as early as possible. Use the following steps for unpausing the autoreboot process to apply the changes that were queued since the last reboot.

    • Unpause the autoreboot process:

      1. Update the MachineConfigPool custom resource to set the spec.paused field to false.

        Control plane (master) nodes

        $ oc patch --type=merge --patch='{"spec":{"paused":false}}' machineconfigpool/master

        Worker nodes

        $ oc patch --type=merge --patch='{"spec":{"paused":false}}' machineconfigpool/worker
      2. Verify that the MCP is unpaused:

        Control plane (master) nodes

        $ oc get machineconfigpool/master --template='{{.spec.paused}}'

        Worker nodes

        $ oc get machineconfigpool/worker --template='{{.spec.paused}}'

        Example output

        false

        The spec.paused field is false and the MCP is unpaused.

      3. Determine if the MCP has pending changes:

        $ oc get machineconfigpool

        Example output

        NAME     CONFIG                                   UPDATED   UPDATING
        master   rendered-master-546383f80705bd5aeaba93   True      False
        worker   rendered-worker-b4c51bb33ccaae6fc4a6a5   False     True

        If the MCP is applying any pending changes, the UPDATED column is False and the UPDATING column is True. When UPDATED is True and UPDATING is False, there are no further changes being made. In the previous example, the MCO is updating the worker node.
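
    The UPDATED and UPDATING combinations described in this procedure can be summarized with a short filter. The following sketch assumes that NAME, CONFIG, UPDATED, and UPDATING are the first four columns of the oc get machineconfigpool output. The helper name mcp_state and the labels up-to-date, updating, and pending are hypothetical shorthand, not terms used by the MCO:

```shell
# Read `oc get machineconfigpool` output on stdin and print one line per pool
# with a label derived from the UPDATED (column 3) and UPDATING (column 4) values.
mcp_state() {
  awk 'NR > 1 {
    if ($3 == "True" && $4 == "False")       state = "up-to-date"
    else if ($3 == "False" && $4 == "True")  state = "updating"
    else if ($3 == "False" && $4 == "False") state = "pending"
    else                                     state = "unknown"
    print $1, state
  }'
}

# Usage:
#   oc get machineconfigpool | mcp_state
```

    For the preceding example output, the filter prints master up-to-date and worker updating.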

    Refreshing failing subscriptions

    In Operator Lifecycle Manager (OLM), if you subscribe to an Operator that references images that are not accessible on your network, you can find jobs in the openshift-marketplace namespace that are failing with the following errors:

    Example output

    ImagePullBackOff for
    Back-off pulling image "example.com/openshift4/ose-elasticsearch-operator-bundle@sha256:6d2587129c846ec28d384540322b40b05833e7e00b25cca584e004af9a1d292e"

    Example output

    rpc error: code = Unknown desc = error pinging docker registry example.com: Get "https://example.com/v2/": dial tcp: lookup example.com on 10.0.0.1:53: no such host

    As a result, the subscription is stuck in this failing state and the Operator is unable to install or upgrade.

    You can refresh a failing subscription by deleting the subscription, cluster service version (CSV), and other related objects. After recreating the subscription, OLM then reinstalls the correct version of the Operator.

    Prerequisites

    • You have a failing subscription that is unable to pull an inaccessible bundle image.

    • You have confirmed that the correct bundle image is accessible.

    Procedure

    1. Get the names of the Subscription and ClusterServiceVersion objects from the namespace where the Operator is installed:

      $ oc get sub,csv -n <namespace>

      Example output

      NAME                                                       PACKAGE                  SOURCE             CHANNEL
      subscription.operators.coreos.com/elasticsearch-operator   elasticsearch-operator   redhat-operators   5.0

      NAME                                                                         DISPLAY                            VERSION    REPLACES   PHASE
      clusterserviceversion.operators.coreos.com/elasticsearch-operator.5.0.0-65   OpenShift Elasticsearch Operator   5.0.0-65              Succeeded
    2. Delete the subscription:

      $ oc delete subscription <subscription_name> -n <namespace>
    3. Delete the cluster service version:

      $ oc delete csv <csv_name> -n <namespace>
    4. Get the names of any failing jobs and related config maps in the openshift-marketplace namespace:

      $ oc get job,configmap -n openshift-marketplace

      Example output

      NAME                                                                        COMPLETIONS   DURATION   AGE
      job.batch/1de9443b6324e629ddf31fed0a853a121275806170e34c926d69e53a7fcbccb   1/1           26s        9m30s

      NAME                                                                        DATA   AGE
      configmap/1de9443b6324e629ddf31fed0a853a121275806170e34c926d69e53a7fcbccb   3      9m30s
    5. Delete the job:

      $ oc delete job <job_name> -n openshift-marketplace

      This ensures pods that try to pull the inaccessible image are not recreated.

    6. Delete the config map:

      $ oc delete configmap <configmap_name> -n openshift-marketplace
    7. Reinstall the Operator using OperatorHub in the web console.

    • Check that the Operator has been reinstalled successfully:

      $ oc get sub,csv,installplan -n <namespace>
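
    The deletion steps in this procedure can be collected into one helper so that they always run in the same order. The following is a sketch under stated assumptions, not a supported tool: the function name cleanup_failing_subscription and the OC override variable are hypothetical, and the object names must first be looked up as described in steps 1 and 4. Setting OC=echo prints the oc arguments instead of deleting anything:

```shell
# Delete the objects that keep a failing subscription stuck, in the order used
# in the procedure. Stops at the first command that fails.
# Arguments: namespace, subscription name, CSV name, job name, config map name.
# Set OC=echo to print the oc arguments instead of running the commands.
cleanup_failing_subscription() {
  namespace=$1
  subscription=$2
  csv=$3
  job=$4
  configmap=$5
  ${OC:-oc} delete subscription "$subscription" -n "$namespace" &&
    ${OC:-oc} delete csv "$csv" -n "$namespace" &&
    ${OC:-oc} delete job "$job" -n openshift-marketplace &&
    ${OC:-oc} delete configmap "$configmap" -n openshift-marketplace
}

# Usage (dry run):
#   OC=echo cleanup_failing_subscription <namespace> <subscription_name> <csv_name> <job_name> <configmap_name>
```

    After the cleanup, reinstall the Operator and check the result with the oc get sub,csv,installplan command shown in this procedure.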