Configuring your logging deployment
Red Hat OpenShift Logging Operator:
- ClusterLogging (CL) - Deploys the collector and forwarder, which are currently both implemented by a daemon set running on each node.
- ClusterLogForwarder (CLF) - Generates collector configuration to forward logs per user configuration.
Loki Operator:
- LokiStack - Controls the Loki cluster as the log store and the web proxy with OpenShift Container Platform authentication integration to enforce multi-tenancy.
OpenShift Elasticsearch Operator:
- ElasticSearch - Configures and deploys an Elasticsearch instance as the default log store.
- Kibana - Configures and deploys a Kibana instance to search, query, and view logs.
The only supported way to configure the logging subsystem for Red Hat OpenShift is to use the options described in this documentation. Do not use other configurations, as they are unsupported. Configuration paradigms might change across OpenShift Container Platform releases, and such changes can only be handled gracefully if all configuration possibilities are controlled. If you use configurations other than those described in this documentation, your changes will be overwritten because the Operators reconcile any differences. By default and by design, the Operators revert everything to the defined state.
If you must perform configurations not described in the OpenShift Container Platform documentation, you must set your Red Hat OpenShift Logging Operator to |
With Logging version 5.6 and higher, you can configure retention policies based on log streams. Rules for these may be set globally, per tenant, or both. If you configure both, tenant rules apply before global rules.
- To enable stream-based retention, create or edit the LokiStack custom resource (CR).
- Refer to the following examples to configure your LokiStack CR.
Example global stream-based retention
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  limits:
    global: # (1)
      retention: # (2)
        days: 20
        streams:
        - days: 4
          priority: 1
          selector: '{kubernetes_namespace_name=~"test.+"}' # (3)
        - days: 1
          priority: 1
          selector: '{log_type="infrastructure"}'
  managementState: Managed
  replicationFactor: 1
  size: 1x.small
  storage:
    schemas:
    - effectiveDate: "2020-10-11"
      version: v11
    secret:
      name: logging-loki-s3
      type: aws
  storageClassName: standard
  tenants:
    mode: openshift-logging
1 | Sets retention policy for all log streams. Note: This field does not impact the retention period for stored logs in object storage. |
2 | Retention is enabled in the cluster when this block is added to the CR. |
3 | Contains the LogQL query used to define the log stream. |
Example per-tenant stream-based retention
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  limits:
    global:
      retention:
        days: 20
    tenants: # (1)
      application:
        retention:
          days: 1
          streams:
          - days: 4
            selector: '{kubernetes_namespace_name=~"test.+"}' # (2)
      infrastructure:
        retention:
          days: 5
          streams:
          - days: 1
            selector: '{kubernetes_namespace_name=~"openshift-cluster.+"}'
  managementState: Managed
  replicationFactor: 1
  size: 1x.small
  storage:
    schemas:
    - effectiveDate: "2020-10-11"
      version: v11
    secret:
      name: logging-loki-s3
      type: aws
  storageClassName: standard
  tenants:
    mode: openshift-logging
- Apply the LokiStack CR:
oc apply -f <file-name>.yaml
Stream-based retention does not manage the retention of logs that are already stored. Global retention periods for stored logs, up to a supported maximum of 30 days, are configured in your object storage.
Enables multi-line error detection of container logs.
Enabling this feature could have performance implications and may require additional computing resources or alternate logging solutions.
Log parsers often incorrectly identify separate lines of the same exception as separate exceptions. This leads to extra log entries and an incomplete or inaccurate view of the traced information.
Example Java exception
- To enable logging to detect multi-line exceptions and reassemble them into a single log entry, ensure that the ClusterLogForwarder custom resource (CR) contains a detectMultilineErrors field with a value of true.
Example ClusterLogForwarder CR
apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  pipelines:
    - name: my-app-logs
      inputRefs:
        - application
      outputRefs:
        - default
      detectMultilineErrors: true
When log messages appear as a consecutive sequence forming an exception stack trace, they are combined into a single, unified log record. The first log message’s content is replaced with the concatenated content of all the message fields in the sequence.
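Because the example above sets detectMultilineErrors inside a pipeline entry, it seems reasonable to enable it only for the pipelines that carry stack traces. The following is a minimal sketch under that assumption: the second pipeline name (my-infra-logs) is hypothetical, and the sketch assumes that a pipeline without the field does not perform multi-line detection.

```yaml
apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  pipelines:
    - name: my-app-logs
      inputRefs:
        - application
      outputRefs:
        - default
      detectMultilineErrors: true   # reassemble exception stack traces for application logs
    - name: my-infra-logs           # hypothetical second pipeline, for illustration only
      inputRefs:
        - infrastructure
      outputRefs:
        - default
      # detectMultilineErrors is omitted here; assumed to leave detection off for this pipeline
```

Limiting the setting to selected pipelines is one way to contain the performance impact noted earlier in this section.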
Troubleshooting
When this feature is enabled, the collector configuration includes a new section with type: detect_exceptions.
Example Vector configuration section
[transforms.detect_exceptions_app-logs]
type = "detect_exceptions"
inputs = ["application"]
languages = ["All"]
group_by = ["kubernetes.namespace_name","kubernetes.pod_name","kubernetes.container_name"]
expire_after_ms = 2000
multiline_flush_interval_ms = 1000
Example Fluentd configuration section
<label @MULTILINE_APP_LOGS>
  <match kubernetes.**>
    @type detect_exceptions
    remove_tag_prefix 'kubernetes'
    message message
    force_line_breaks true
    multiline_flush_interval .2
  </match>
</label>
Loki alerting rules use LogQL and follow the Prometheus alerting rule format. You can set log-based alerts by creating an AlertingRule custom resource (CR). AlertingRule CRs can be created for application, audit, or infrastructure tenants.
Tenant type | Valid namespaces |
---|---|
application | |
audit | |
infrastructure | openshift-*, kube-*, default |
Application, audit, and infrastructure alerts are sent to the Cluster Monitoring Operator (CMO) Alertmanager in the openshift-monitoring namespace by default, unless you have disabled the local Alertmanager instance.
Application alerts are not sent to the CMO Alertmanager in the openshift-user-workload-monitoring namespace by default, unless you have enabled a separate Alertmanager instance.
The AlertingRule CR contains a set of specifications and webhook validation definitions to declare groups of alerting rules for a single LokiStack instance. In addition, the webhook validation definition provides support for rule validation conditions:
- If an AlertingRule CR includes an invalid interval period, it is an invalid alerting rule.
- If an AlertingRule CR includes an invalid for period, it is an invalid alerting rule.
- If an AlertingRule CR includes an invalid LogQL expr, it is an invalid alerting rule.
- If an AlertingRule CR includes two groups with the same name, it is an invalid alerting rule.
- If none of the above applies, an AlertingRule is considered a valid alerting rule.
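To show where the validated fields sit, the following is a minimal sketch of an AlertingRule CR that should satisfy all four conditions. The resource name, namespace, group name, alert name, and annotation text are hypothetical, and the placement of interval at the group level is based on the Loki Operator AlertingRule API as best understood here; verify it against your installed CRD.

```yaml
apiVersion: loki.grafana.com/v1
kind: AlertingRule
metadata:
  name: example-alerts            # hypothetical name
  namespace: app-ns               # hypothetical namespace
spec:
  tenantID: "application"
  groups:
    - name: ExampleGroup          # must not be repeated by another group in this CR
      interval: 1m                # group evaluation interval; must be a valid duration
      rules:
        - alert: ExampleHighErrorRate
          # expr must be valid LogQL
          expr: |
            sum(rate({kubernetes_namespace_name="app-ns"} |= "error" [1m])) by (job) > 0.01
          for: 10s                # must be a valid duration
          labels:
            severity: warning
          annotations:
            summary: Example summary
            description: Example description
```

Changing interval or for to a string that is not a duration, breaking the LogQL in expr, or adding a second group named ExampleGroup would each trip one of the conditions listed above.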
Prerequisites
Logging subsystem for Red Hat OpenShift Operator 5.7 and later
OKD 4.13 and later
Procedure
Create an AlertingRule CR:
Populate your AlertingRule CR using the appropriate example below:
Example infrastructure AlertingRule CR
apiVersion: loki.grafana.com/v1
kind: AlertingRule
metadata:
  name: loki-operator-alerts
  namespace: openshift-operators-redhat # (1)
  labels: # (2)
    openshift.io/cluster-monitoring: "true"
spec:
  tenantID: "infrastructure" # (3)
  groups:
    - name: LokiOperatorHighReconciliationError
      rules:
        - alert: HighPercentageError
          expr: | # (4)
            sum(rate({kubernetes_namespace_name="openshift-operators-redhat", kubernetes_pod_name=~"loki-operator-controller-manager.*"} |= "error" [1m])) by (job)
              /
            sum(rate({kubernetes_namespace_name="openshift-operators-redhat", kubernetes_pod_name=~"loki-operator-controller-manager.*"}[1m])) by (job)
              > 0.01
          for: 10s
          labels:
            severity: critical # (5)
          annotations:
            summary: High Loki Operator Reconciliation Errors # (6)
            description: High Loki Operator Reconciliation Errors # (7)
1 | The namespace where this AlertingRule is created must have a label matching the LokiStack spec.rules.namespaceSelector definition. |
2 | The labels block must match the LokiStack spec.rules.selector definition. |
3 | AlertingRules for infrastructure tenants are only supported in the openshift-*, kube-*, or default namespaces. |
4 | The value for kubernetes_namespace_name must match the value for metadata.namespace. |
5 | Mandatory field. Must be critical, warning, or info. |
6 | Mandatory field. |
7 | Mandatory field. |
Example application AlertingRule CR
apiVersion: loki.grafana.com/v1
kind: AlertingRule
metadata:
  name: app-user-workload
  namespace: app-ns # (1)
  labels: # (2)
    openshift.io/cluster-monitoring: "true"
spec:
  tenantID: "application"
  groups:
    - name: AppUserWorkloadHighError
      rules:
        - alert:
          expr: | # (3)
            sum(rate({kubernetes_namespace_name="app-ns", kubernetes_pod_name=~"podName.*"} |= "error" [1m])) by (job)
          for: 10s
          labels:
            severity: critical # (4)
          annotations:
            summary: # (5)
            description: # (6)
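The callouts for the infrastructure example refer to the LokiStack spec.rules.selector and spec.rules.namespaceSelector definitions, which this section does not show. The fragment below is a sketch of how those selectors might look so that the AlertingRule examples above are picked up. It is a partial LokiStack CR with other required fields omitted, and it assumes that the same openshift.io/cluster-monitoring label is applied both to the AlertingRule CRs and to their namespaces; the enabled field and the label choice are assumptions to verify against your LokiStack CRD.

```yaml
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  # Partial fragment: size, storage, tenants, and other fields are omitted.
  rules:
    enabled: true                               # assumed switch that turns on rule evaluation
    selector:
      matchLabels:
        openshift.io/cluster-monitoring: "true" # matches metadata.labels on the AlertingRule CRs
    namespaceSelector:
      matchLabels:
        openshift.io/cluster-monitoring: "true" # assumed label on namespaces that hold AlertingRule CRs
```

If an AlertingRule CR or its namespace lacks the matching label, the LokiStack instance would not be expected to load that rule.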