Using the vSphere Problem Detector Operator

    The Operator runs in the openshift-cluster-storage-operator namespace and is started by the Cluster Storage Operator when the Cluster Storage Operator detects that the cluster is deployed on vSphere. The vSphere Problem Detector Operator communicates with the vSphere vCenter Server to determine the virtual machines in the cluster, the default datastore, and other information about the vSphere vCenter Server configuration. The Operator uses the credentials from the Cloud Credential Operator to connect to vSphere.

    The Operator runs the checks according to the following schedule:

    • The checks run every 8 hours.

    • If any check fails, the Operator runs the checks again at intervals of 1 minute, 2 minutes, 4 minutes, 8 minutes, and so on. The Operator doubles the interval after each failed run, up to a maximum interval of 8 hours.

    • When all checks pass, the schedule returns to an 8-hour interval.
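    The retry schedule described above can be sketched as a short loop. This is an illustration only, not the Operator's implementation: the function name retry_intervals is hypothetical, and capping the doubled value at exactly 480 minutes is an assumption about how the 8-hour maximum is applied.

```shell
#!/usr/bin/env bash
# Print the retry intervals, in minutes, that would follow a failed check:
# start at 1 minute, double after each failure, and cap at 8 hours.
# The cap behaviour (clamping to exactly 480) is an assumption.
retry_intervals() {
  local count=${1:-12}          # how many intervals to print
  local max=$(( 8 * 60 ))       # 8 hours expressed in minutes
  local interval=1
  local i
  for (( i = 0; i < count; i++ )); do
    echo "$interval"
    interval=$(( interval * 2 ))
    if (( interval > max )); then
      interval=$max
    fi
  done
}

retry_intervals 12
```

    Under these assumptions the interval reaches the 8-hour cap on the tenth retry, so a persistent failure is rechecked often at first and then settles back to the regular schedule.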

    The Operator increases the frequency of the checks after a failure so that the Operator can report success quickly after the failure condition is remedied. You can run the Operator manually for immediate troubleshooting information.

    Running the vSphere Problem Detector Operator checks

    You can override the schedule for running the vSphere Problem Detector Operator checks and run the checks immediately.

    The vSphere Problem Detector Operator automatically runs the checks every 8 hours. However, when the Operator starts, it runs the checks immediately. The Cluster Storage Operator starts the vSphere Problem Detector Operator when the Cluster Storage Operator itself starts and determines that the cluster is running on vSphere. To run the checks immediately, scale the vSphere Problem Detector Operator to 0 and back to 1 so that it restarts.

    Prerequisites

    • Access to the cluster as a user with the cluster-admin role.

    Procedure

    1. Scale the Operator to 0.

      If the deployment does not scale to zero immediately, you can run the following command to wait for the pods to exit:

      $ oc wait pods -l name=vsphere-problem-detector-operator \
          --for=delete --timeout=5m -n openshift-cluster-storage-operator

    2. Scale the Operator back to 1.

    3. Delete the old leader lock to speed up the new leader election for the Cluster Storage Operator.
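    The scale and lock-deletion steps above could be wrapped in small helper functions such as the following. This is a sketch under stated assumptions: the function names are hypothetical, the deployment name is inferred from the pod names shown in this document, and the lock is assumed to be the vsphere-problem-detector-lock ConfigMap that appears in the example event output. Confirm the names in your cluster before you run anything.

```shell
#!/usr/bin/env bash
# Assumed names: the namespace and the lock ConfigMap appear elsewhere in this
# document; the deployment name is inferred from the pod names and may differ.
NAMESPACE="openshift-cluster-storage-operator"
DEPLOYMENT="vsphere-problem-detector-operator"
LOCK_CONFIGMAP="vsphere-problem-detector-lock"

# Scale the Operator deployment to the given replica count (0 or 1).
scale_detector() {
  oc scale "deployment/${DEPLOYMENT}" --replicas="$1" -n "${NAMESPACE}"
}

# Remove the old leader lock so that the restarted pod can become leader sooner.
delete_leader_lock() {
  oc delete configmap "${LOCK_CONFIGMAP}" -n "${NAMESPACE}"
}
```

    With these functions defined, one possible sequence is scale_detector 0, then the oc wait command from step 1 if needed, then scale_detector 1, and finally delete_leader_lock.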

    Verification

    • View the events or logs that are generated by the vSphere Problem Detector Operator. Confirm that the events or logs have recent timestamps.

    Viewing the events from the vSphere Problem Detector Operator

    After the vSphere Problem Detector Operator runs and performs the configuration checks, it creates events that can be viewed from the command line or from the OKD web console.

    Procedure

    • To view the events by using the command line, run the following command:

      $ oc get event -n openshift-cluster-storage-operator \
          --sort-by={.metadata.creationTimestamp}

      Example output

      16m Normal Started pod/vsphere-problem-detector-operator-xxxxx Started container vsphere-problem-detector
      16m Normal Created pod/vsphere-problem-detector-operator-xxxxx Created container vsphere-problem-detector
      16m Normal LeaderElection configmap/vsphere-problem-detector-lock vsphere-problem-detector-operator-xxxxx became leader

    • To view the events by using the OKD web console, navigate to Home → Events and select openshift-cluster-storage-operator from the Project menu.

    Viewing the logs from the vSphere Problem Detector Operator

    After the vSphere Problem Detector Operator runs and performs the configuration checks, it creates log records that can be viewed from the command line or from the OKD web console.

    Procedure

    • To view the logs by using the command line, run the following command:

      Example output

      I0108 08:32:28.445696 1 operator.go:209] ClusterInfo passed
      I0108 08:32:28.451029 1 datastore.go:57] CheckStorageClasses checked 1 storage classes, 0 problems found
      I0108 08:32:28.451047 1 operator.go:209] CheckStorageClasses passed
      I0108 08:32:28.480648 1 operator.go:271] CheckNodeDiskUUID:<host_name> passed
      I0108 08:32:28.480685 1 operator.go:271] CheckNodeProviderID:<host_name> passed

    • To view the Operator logs with the OKD web console, perform the following steps:

      1. Select openshift-cluster-storage-operator from the Projects menu.

      2. Click the link for the vsphere-problem-detector-operator pod.

      3. Click the Logs tab on the Pod details page to view the logs.
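    For the command-line route described in this procedure, a helper like the following is one way to fetch the logs. It is a sketch, not the documented command: the function name is hypothetical, and the deployment name is an assumption inferred from the pod names shown in the example event output.

```shell
#!/usr/bin/env bash
NAMESPACE="openshift-cluster-storage-operator"
DEPLOYMENT="vsphere-problem-detector-operator"   # assumed, inferred from pod names

# Print the logs of a pod that belongs to the Operator deployment.
detector_logs() {
  oc logs "deployment/${DEPLOYMENT}" -n "${NAMESPACE}"
}
```

    Calling detector_logs after the Operator restarts should show entries with recent timestamps for each check.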

    The following tables identify the configuration checks that the vSphere Problem Detector Operator runs. Some checks verify the configuration of the cluster. Other checks verify the configuration of each node in the cluster.

    About the storage class configuration check

    The names for persistent volumes that use vSphere storage are related to the datastore name and cluster ID.

    When a persistent volume is created, systemd creates a mount unit for the persistent volume. The systemd process has a 255 character limit for the length of the fully qualified path to the VMDK file that is used for the persistent volume.

    The fully qualified path is based on the naming conventions for systemd and vSphere. The naming conventions use the following pattern:

    • The naming conventions require 205 characters of the 255 character limit.

    • The datastore name and the cluster ID are determined from the deployment.

    • The datastore name and cluster ID are substituted into the preceding pattern. Then the path is processed with the systemd-escape command to escape special characters. For example, a hyphen character uses four characters after it is escaped. The escaped value is \x2d.
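    The effect of escaping on the length budget can be estimated with a few lines of shell. This is a rough sketch, not the check that the Operator performs: the function names and the sample datastore name are hypothetical, the escaping rule is simplified to ASCII input, and it assumes that the 205 fixed characters exclude the datastore name and cluster ID, which leaves 255 - 205 = 50 characters for both values after escaping.

```shell
#!/usr/bin/env bash
# Estimate the length of an ASCII string after systemd-style escaping.
# Assumption: characters outside [A-Za-z0-9:_./] each become a four-character
# \xNN sequence, and "/" maps to a single "-". Other edge cases are ignored.
escaped_length() {
  local LC_ALL=C                      # count bytes, not locale-dependent characters
  local input=$1
  local pattern='[^A-Za-z0-9:_./]'    # characters that need a \xNN escape
  local safe=${input//$pattern/}      # what remains after dropping those characters
  local unsafe=$(( ${#input} - ${#safe} ))
  echo $(( ${#safe} + 4 * unsafe ))
}

# Characters left for the escaped datastore name plus cluster ID,
# assuming the 205 fixed characters exclude both substituted values.
BUDGET=$(( 255 - 205 ))

# Succeeds when the two escaped values fit in the remaining budget.
fits_budget() {
  local total=$(( $(escaped_length "$1") + $(escaped_length "$2") ))
  (( total <= BUDGET ))
}

escaped_length "my-datastore"   # hypothetical datastore name
```

    Because each hyphen costs four characters instead of one, names with many hyphens use up the remaining budget much faster than their visible length suggests.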

    The vSphere Problem Detector Operator exposes the following metrics for use by the OKD monitoring stack.

    Additional resources