Troubleshooting for Critical Alerts
Troubleshooting
Check the Elasticsearch cluster health and verify that the cluster is red.
List the nodes that have joined the cluster.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/nodes?v
List the Elasticsearch pods and compare them with the nodes in the command output from the previous step.
oc -n openshift-logging get pods -l component=elasticsearch
If some of the Elasticsearch nodes have not joined the cluster, perform the following steps.
Confirm that Elasticsearch has an elected master node.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/master?v
Review the pod logs of the elected master node for issues.
oc logs <elasticsearch_master_pod_name> -c elasticsearch -n openshift-logging
Review the logs of nodes that have not joined the cluster for issues.
oc logs <elasticsearch_node_name> -c elasticsearch -n openshift-logging
If all the nodes have joined the cluster, perform the following steps, check if the cluster is in the process of recovering.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/recovery?active_only=true
If there is no command output, the recovery process might be delayed or stalled by pending tasks.
Check if there are pending tasks.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- health |grep number_of_pending_tasks
If there are pending tasks, monitor their status.
If their status changes and indicates that the cluster is recovering, continue waiting. The recovery time varies according to the size of the cluster and other factors.
Otherwise, if the status of the pending tasks does not change, this indicates that the recovery has stalled.
If it seems like the recovery has stalled, check if
cluster.routing.allocation.enable
is set tonone
.oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cluster/settings?pretty
If
cluster.routing.allocation.enable
is set tonone
, set it toall
.oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cluster/settings?pretty -X PUT -d '{"persistent": {"cluster.routing.allocation.enable":"all"}}'
Check which indices are still red.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/indices?v
If any indices are still red, try to clear them by performing the following steps.
Clear the cache.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name>/_cache/clear?pretty
Increase the max allocation retries.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name>/_settings?pretty -X PUT -d '{"index.allocation.max_retries":10}'
Delete all the scroll items.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_search/scroll/_all -X DELETE
Increase the timeout.
If the preceding steps do not clear the red indices, delete the indices individually.
Identify the red index name.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/indices?v
Delete the red index.
If there are no red indices and the cluster status is red, check for a continuous heavy processing load on a data node.
Check if the Elasticsearch JVM Heap usage is high.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_nodes/stats?pretty
In the command output, review the
node_name.jvm.mem.heap_used_percent
field to determine the JVM Heap usage.Check for high CPU utilization.
Additional resources
- Search for “Free up or increase disk space” in the Elasticsearch topic, Fix a red or yellow cluster status.
Elasticsearch Cluster Health is Yellow
Replica shards for at least one primary shard are not allocated to nodes.
Troubleshooting
- Increase the node count by adjusting
nodeCount
in theClusterLogging
CR.
Additional resources
Search for “Free up or increase disk space” in the Elasticsearch topic, Fix a red or yellow cluster status.
Elasticsearch Node Disk Low Watermark Reached
Elasticsearch does not allocate shards to nodes that reach the low watermark.
Troubleshooting
Identify the node on which Elasticsearch is deployed.
-
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cluster/health?pretty | grep unassigned_shards
If there are unassigned shards, check the disk space on each node.
for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod -- df -h /elasticsearch/persistent; done
Check the
nodes.node_name.fs
field to determine the free disk space on that node.If the used disk percentage is above 85%, the node has exceeded the low watermark, and shards can no longer be allocated to this node.
Try to increase the disk space on all nodes.
If increasing the disk space is not possible, try adding a new data node to the cluster.
If adding a new data node is problematic, decrease the total cluster redundancy policy.
Check the current
redundancyPolicy
.oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}'
oc -n openshift-logging get cl -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
If the cluster
redundancyPolicy
is higher thanSingleRedundancy
, set it toSingleRedundancy
and save this change.
If the preceding steps do not fix the issue, delete the old indices.
Check the status of all indices on Elasticsearch.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- indices
Identify an old index that can be deleted.
Delete the index.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name> -X DELETE
Additional resources
- Search for “redundancyPolicy” in the “Sample
ClusterLogging
custom resource (CR)” in
Elasticsearch attempts to relocate shards away from a node that has reached the high watermark.
Troubleshooting
Identify the node on which Elasticsearch is deployed.
oc -n openshift-logging get po -o wide
Check the disk space on each node.
for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod -- df -h /elasticsearch/persistent; done
Check if the cluster is rebalancing.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cluster/health?pretty | grep relocating_shards
If the command output shows relocating shards, the High Watermark has been exceeded. The default value of the High Watermark is 90%.
The shards relocate to a node with low disk usage that has not crossed any watermark threshold limits.
To allocate shards to a particular node, free up some space.
Try to increase the disk space on all nodes.
If increasing the disk space is not possible, try adding a new data node to the cluster.
If adding a new data node is problematic, decrease the total cluster redundancy policy.
Check the current
redundancyPolicy
.oc -n openshift-logging get cl -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
If the cluster
redundancyPolicy
is higher thanSingleRedundancy
, set it toSingleRedundancy
and save this change.
If the preceding steps do not fix the issue, delete the old indices.
Check the status of all indices on Elasticsearch.
Identify an old index that can be deleted.
Delete the index.
Additional resources
- Search for “redundancyPolicy” in the “Sample
ClusterLogging
custom resource (CR)” in
Elasticsearch Node Disk Flood Watermark Reached
Elasticsearch enforces a read-only index block on every index that has both of these conditions:
One or more shards are allocated to the node.
One or more disks exceed the .
Troubleshooting
Check the disk space of the Elasticsearch node.
for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod -- df -h /elasticsearch/persistent; done
Check the
nodes.node_name.fs
field to determine the free disk space on that node.If the used disk percentage is above 95%, it signifies that the node has crossed the flood watermark. Writing is blocked for shards allocated on this particular node.
Try to increase the disk space on all nodes.
If adding a new data node is problematic, decrease the total cluster redundancy policy.
Check the current
redundancyPolicy
.oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}'
oc -n openshift-logging get cl -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
If the cluster
redundancyPolicy
is higher thanSingleRedundancy
, set it toSingleRedundancy
and save this change.
If the preceding steps do not fix the issue, delete the old indices.
Check the status of all indices on Elasticsearch.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- indices
Identify an old index that can be deleted.
Delete the index.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name> -X DELETE
Continue freeing up and monitoring the disk space until the used disk space drops below 90%. Then, unblock write to this particular node.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_all/_settings?pretty -X PUT -d '{"index.blocks.read_only_allow_delete": null}'
Additional resources
- Search for “redundancyPolicy” in the “Sample
ClusterLogging
custom resource (CR)” in About the Cluster Logging custom resource
Elasticsearch JVM Heap Use is High
The Elasticsearch node JVM Heap memory used is above 75%.
Troubleshooting
Consider increasing the heap size.
System CPU usage on the node is high.
Troubleshooting
Check the CPU of the cluster node. Consider allocating more CPU resources to the node.
Elasticsearch Process CPU is High
Elasticsearch process CPU usage on the node is high.
Troubleshooting
Check the CPU of the cluster node. Consider allocating more CPU resources to the node.
Elasticsearch Disk Space is Running Low
The Elasticsearch Cluster is predicted to be out of disk space within the next 6 hours based on current disk usage.
Troubleshooting
Get the disk space of the Elasticsearch node.
for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod -- df -h /elasticsearch/persistent; done
In the command output, check the
nodes.node_name.fs
field to determine the free disk space on that node.Try to increase the disk space on all nodes.
If increasing the disk space is not possible, try adding a new data node to the cluster.
If adding a new data node is problematic, decrease the total cluster redundancy policy.
Check the current
redundancyPolicy
.oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}'
oc -n openshift-logging get cl -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
If the cluster
redundancyPolicy
is higher thanSingleRedundancy
, set it toSingleRedundancy
and save this change.
If the preceding steps do not fix the issue, delete the old indices.
Check the status of all indices on Elasticsearch.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- indices
Identify an old index that can be deleted.
Delete the index.
Additional resources
Search for “redundancyPolicy” in the “Sample custom resource (CR)” in
Search for “ElasticsearchDiskSpaceRunningLow” in About Elasticsearch alerting rules.
Search for “Free up or increase disk space” in the Elasticsearch topic, .
Based on current usage trends, the predicted number of file descriptors on the node is insufficient.
Troubleshooting
Check and, if needed, configure the value of max_file_descriptors
for each node, as described in the Elasticsearch File descriptors topic.
Additional resources
Search for “File Descriptors In Use” in .