Investigating monitoring issues
You can follow these procedures if your own metrics are unavailable or if Prometheus is consuming a lot of disk space.
ServiceMonitor resources enable you to determine how to use the metrics exposed by a service in user-defined projects. Follow the steps outlined in this procedure if you have created a ServiceMonitor resource but cannot see any corresponding metrics in the Metrics UI.
Prerequisites
- You have access to the cluster as a user with the cluster-admin role.
- You have installed the OpenShift CLI (oc).
- You have enabled and configured monitoring for user-defined workloads.
- You have created the user-workload-monitoring-config ConfigMap object.
- You have created a ServiceMonitor resource.
Procedure
Check that the corresponding labels match in the service and ServiceMonitor resource configurations.

Obtain the label defined in the service. The following example queries the prometheus-example-app service in the ns1 project:

$ oc -n ns1 get service prometheus-example-app -o yaml

Example output

labels:
  app: prometheus-example-app

Check that the matchLabels app label in the ServiceMonitor resource configuration matches the label output in the preceding step:

$ oc -n ns1 get servicemonitor prometheus-example-monitor -o yaml

Example output

spec:
  endpoints:
  - interval: 30s
    port: web
  selector:
    matchLabels:
      app: prometheus-example-app
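The comparison in this step can also be scripted. The following is a minimal sketch: the two label values are hard-coded from the example output above, and the jsonpath queries shown in the comments are only illustrative of how you might read them from a live cluster.

```shell
# Sketch: verify that the service label and the ServiceMonitor selector agree.
# On a live cluster, the two values could be read with, for example:
#   oc -n ns1 get service prometheus-example-app -o jsonpath='{.metadata.labels.app}'
#   oc -n ns1 get servicemonitor prometheus-example-monitor -o jsonpath='{.spec.selector.matchLabels.app}'
# Here they are hard-coded from the example output above.
service_label="prometheus-example-app"
monitor_label="prometheus-example-app"

if [ "$service_label" = "$monitor_label" ]; then
  echo "labels match"
else
  echo "label mismatch: service=$service_label monitor=$monitor_label"
fi
```

If the two values differ, Prometheus never selects the service, which explains missing metrics without any error in the ServiceMonitor itself.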
Inspect the logs for the Prometheus Operator in the openshift-user-workload-monitoring project.

List the pods in the openshift-user-workload-monitoring project:

$ oc -n openshift-user-workload-monitoring get pods

Obtain the logs from the prometheus-operator container in the prometheus-operator pod. In the following example, the pod is called prometheus-operator-776fcbbd56-2nbfm:

$ oc -n openshift-user-workload-monitoring logs prometheus-operator-776fcbbd56-2nbfm -c prometheus-operator

Example output

level=warn ts=2020-08-10T11:48:20.906739623Z caller=operator.go:1829 component=prometheusoperator msg="skipping servicemonitor" error="it accesses file system via bearer token file which Prometheus specification prohibits" servicemonitor=eagle/eagle namespace=openshift-user-workload-monitoring prometheus=user-workload
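Warning lines like this one are dense. A quick way to see which ServiceMonitor was rejected is to extract the servicemonitor= field. A minimal sketch follows; the log line is the example output above, inlined, and on a live cluster you would pipe the output of the oc logs command through the same filter:

```shell
# Extract the namespace/name of a rejected ServiceMonitor from an operator warning.
# The sample line is copied from the example output above.
log='level=warn ts=2020-08-10T11:48:20.906739623Z caller=operator.go:1829 component=prometheusoperator msg="skipping servicemonitor" error="it accesses file system via bearer token file which Prometheus specification prohibits" servicemonitor=eagle/eagle namespace=openshift-user-workload-monitoring prometheus=user-workload'

# grep -o prints only the matching field; cut strips the "servicemonitor=" prefix.
printf '%s\n' "$log" | grep -o 'servicemonitor=[^ ]*' | cut -d= -f2
```

This prints eagle/eagle, the namespace/name of the ServiceMonitor that the operator skipped.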
Review the target status for your project directly in the Prometheus UI.

Establish port-forwarding to the Prometheus instance in the openshift-user-workload-monitoring project:

$ oc port-forward -n openshift-user-workload-monitoring pod/prometheus-user-workload-0 9090

Open http://localhost:9090/targets in a web browser and review the status of the target for your project. Check for error messages relating to the target.
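If you prefer the command line to the browser, the same target health information is exposed by the Prometheus HTTP API at /api/v1/targets while the port-forward is active. The following sketch parses a response of the kind the endpoint returns; the sample JSON and its values are illustrative, not real cluster output:

```shell
# Parse target health from a Prometheus /api/v1/targets response. On a live
# cluster, replace the sample with:  curl -s http://localhost:9090/api/v1/targets
# The JSON below is an illustrative sample.
response='{"status":"success","data":{"activeTargets":[{"labels":{"job":"prometheus-example-app"},"health":"down","lastError":"connection refused"}]}}'

# Print job name, health, and last scrape error for each active target.
printf '%s' "$response" | python3 -c '
import json, sys
for t in json.load(sys.stdin)["data"]["activeTargets"]:
    print(t["labels"]["job"], t["health"], t.get("lastError", ""))
'
```

A target with health "down" and a non-empty lastError points directly at the scrape failure you would otherwise hunt for in the UI.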
Configure debug level logging for the Prometheus Operator in the openshift-user-workload-monitoring project.

Edit the user-workload-monitoring-config ConfigMap object in the openshift-user-workload-monitoring project:

$ oc -n openshift-user-workload-monitoring edit configmap user-workload-monitoring-config

Add logLevel: debug for prometheusOperator under data/config.yaml to set the log level to debug.

Save the file to apply the changes.
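After the edit, the relevant part of the ConfigMap might look like the following. This is a sketch based on the standard user workload monitoring configuration format; verify the field names against the configuration reference for your cluster version:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    # Assumption: logLevel under prometheusOperator controls the operator log level.
    prometheusOperator:
      logLevel: debug
```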
Confirm that the debug log level has been applied to the prometheus-operator deployment in the openshift-user-workload-monitoring project:

$ oc -n openshift-user-workload-monitoring get deploy prometheus-operator -o yaml | grep "log-level"

Example output

- --log-level=debug

Debug level logging will show all calls made by the Prometheus Operator.
Check that the prometheus-operator pod is running:

$ oc -n openshift-user-workload-monitoring get pods
Additional resources
- See Specifying how a service is monitored for details on how to create a service monitor or pod monitor.
Determining why Prometheus is consuming a lot of disk space
Developers can create labels to define attributes for metrics in the form of key-value pairs. The number of potential key-value pairs corresponds to the number of possible values for an attribute. An attribute that has an unlimited number of potential values is called an unbound attribute. For example, a customer_id attribute is unbound because it has an infinite number of possible values.
Every assigned key-value pair has a unique time series. The use of many unbound attributes in labels can result in an exponential increase in the number of time series created. This can impact Prometheus performance and can consume a lot of disk space.
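The multiplicative effect is easy to see with a little arithmetic. The label counts in this sketch are illustrative:

```shell
# Each distinct combination of label values is its own time series, so the series
# count for a metric is the product of its per-label cardinalities.
# A metric with 3 bounded labels of 5 possible values each:
echo $((5 * 5 * 5))
# Swap one bounded label for an unbound customer_id with 10,000 values seen so far:
echo $((5 * 5 * 10000))
```

The first expression yields 125 series; the second yields 250,000, and an unbound label keeps growing as new values arrive, which is why reducing unbound attributes is the primary lever for reclaiming disk space.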
You can take the following measures when Prometheus consumes a lot of disk space:

- Check the number of scrape samples that are being collected.
- Check the time series database (TSDB) status in the Prometheus UI for more information about which labels are creating the most time series. This requires cluster administrator privileges.
- Reduce the number of unique time series that are created by reducing the number of unbound attributes that are assigned to user-defined metrics.
- Enforce limits on the number of samples that can be scraped across user-defined projects. This requires cluster administrator privileges.
Prerequisites
- You have access to the cluster as a user with the cluster-admin role.
- You have installed the OpenShift CLI (oc).
Procedure
In the Administrator perspective, navigate to Monitoring → Metrics.

Run the following Prometheus Query Language (PromQL) query in the Expression field. This returns the ten metrics that have the highest number of scrape samples:

topk(10,count by (job)({__name__=~".+"}))

Investigate the number of unbound label values assigned to metrics with higher than expected scrape sample counts:

- If the metrics relate to a user-defined project, review the metrics key-value pairs assigned to your workload. These are implemented through Prometheus client libraries at the application level. Try to limit the number of unbound attributes referenced in your labels.
- If the metrics relate to a core OKD project, create a Red Hat support case on the Red Hat Customer Portal.
Check the TSDB status in the Prometheus UI:

- In the Administrator perspective, navigate to Networking → Routes.
- Select the openshift-monitoring project in the Project list.
- Select the URL in the prometheus-k8s row to open the login page for the Prometheus UI.
- Choose Log in with OpenShift to log in using your OKD credentials.
- In the Prometheus UI, navigate to Status → TSDB Status to see which labels and metrics are creating the most time series.