Outlier detection

    Detected errors fall into two categories: externally and locally originated errors. Externally generated errors are transaction specific and occur on the upstream server in response to the received request. Examples include an HTTP server returning error code 500 or a redis server returning a payload which cannot be decoded. Those errors are generated on the upstream host after Envoy has connected to it successfully. Locally originated errors are generated by Envoy in response to an event which interrupted or prevented communication with the upstream host. Examples of locally originated errors are timeout, TCP reset, inability to connect to a specified port, etc.

    The type of detected errors depends on the filter type. The http router filter, for example, detects locally originated errors (timeouts, resets, and other errors related to the connection to the upstream host) and, because it also understands the HTTP protocol, it reports errors returned by the HTTP server (externally generated errors). In such a scenario, even when the connection to the upstream HTTP server is successful, the transaction with the server may fail. By contrast, the tcp proxy filter does not understand any protocol above the TCP layer and reports only locally originated errors.

    Under the default configuration (outlier_detection.split_external_local_origin_errors is false) locally originated errors are not distinguished from externally generated (transaction) errors. All errors end up in the same bucket and are compared against the same configuration items, such as outlier_detection.consecutive_gateway_failure. For example, if connection to an upstream HTTP server fails twice because of timeout and then, after successful connection establishment, the server returns error code 500, then the total error count will be 3.

    Outlier detection may also be configured to distinguish locally originated errors from externally originated (transaction) errors. This is done via the outlier_detection.split_external_local_origin_errors configuration item. In that mode locally originated errors are tracked by counters separate from those used for externally originated (transaction) errors, and the outlier detector may be configured to react to locally originated errors and ignore externally originated errors, or vice versa.
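    The difference between the two modes can be illustrated with a minimal sketch. It is not Envoy's implementation: the function name count_errors and the bucket labels are hypothetical and only model the bookkeeping described above, using the example of two timeouts followed by an HTTP 500.

```python
from collections import Counter

LOCAL = "local_origin"   # e.g. timeout, TCP reset, failed connect
EXTERNAL = "external"    # e.g. HTTP 500 returned by the upstream server


def count_errors(events, split_external_local_origin_errors=False):
    """Count errors per bucket for a sequence of error origins.

    Hypothetical helper: in default mode every error lands in one shared
    bucket; in split mode each origin gets its own counter.
    """
    buckets = Counter()
    for origin in events:
        key = origin if split_external_local_origin_errors else "all"
        buckets[key] += 1
    return dict(buckets)


# Two timeouts, then a successful connection that returns HTTP 500.
events = [LOCAL, LOCAL, EXTERNAL]
default_counts = count_errors(events)
split_counts = count_errors(events, split_external_local_origin_errors=True)
```

    In default mode the three errors share one counter, while in split mode the two timeouts and the single 500 are counted separately, so a detector watching only one origin sees only its own share.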

    It is important to understand that a cluster may be shared among several filter chains. If one filter chain ejects a host based on its outlier detection type, other filter chains will also be affected even though their outlier detection type would not have ejected that host.

    Depending on the type of outlier detection, ejection either runs inline (for example in the case of consecutive 5xx) or at a specified interval (for example in the case of periodic success rate). The ejection algorithm works as follows:

    1. A host is determined to be an outlier.

    2. The host is ejected for some number of milliseconds. Ejection means that the host is marked unhealthy and will not be used during load balancing unless the load balancer is in a panic scenario. The number of milliseconds is equal to the outlier_detection.base_ejection_time value multiplied by the number of times the host has been ejected in a row, so hosts that continue to fail are ejected for longer and longer periods. Once the ejection time reaches the configured maximum, it does not increase any further. When the host becomes healthy again, the ejection time multiplier decreases with time: the host’s health is checked at intervals equal to outlier_detection.interval, and each time the host is healthy during that check the multiplier is decremented. Assuming that the host stays healthy, it takes approximately (maximum ejection time / outlier_detection.base_ejection_time) * outlier_detection.interval seconds to bring the ejection time back down to its minimum value, outlier_detection.base_ejection_time.
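    The ejection time arithmetic in step 2 can be sketched as follows. This is an illustration under stated assumptions, not Envoy code: the function names and the max_ejection_time_ms parameter are hypothetical, and the numeric values are chosen only for the example.

```python
def ejection_time_ms(base_ejection_time_ms, consecutive_ejections,
                     max_ejection_time_ms):
    """Ejection duration: base time times the number of ejections in a row,
    capped at an assumed maximum ejection time."""
    return min(base_ejection_time_ms * consecutive_ejections,
               max_ejection_time_ms)


def approx_recovery_seconds(base_ejection_time_ms, max_ejection_time_ms,
                            interval_seconds):
    """Approximate time for a healthy host to return to the minimum ejection
    time, assuming the multiplier drops by one at each healthy check."""
    checks_needed = max_ejection_time_ms / base_ejection_time_ms
    return checks_needed * interval_seconds


# Illustrative values only.
BASE_MS = 30_000
MAX_MS = 300_000
INTERVAL_S = 10

durations = [ejection_time_ms(BASE_MS, n, MAX_MS) for n in (1, 2, 3, 20)]
recovery = approx_recovery_seconds(BASE_MS, MAX_MS, INTERVAL_S)
```

    With these values the ejection time grows linearly with each consecutive ejection until it hits the cap, and a host that has reached the cap needs roughly ten healthy checks to get back to the base ejection time.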

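    Envoy supports the following outlier detection types: consecutive 5xx, consecutive gateway failure, consecutive local origin failure, success rate, and failure percentage.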

    In the default mode (outlier_detection.split_external_local_origin_errors is false) consecutive 5xx detection takes into account all generated errors: locally originated and externally originated (transaction) errors. Errors generated by non-HTTP filters, like tcp proxy or redis proxy, are internally mapped to HTTP 5xx codes and treated as such.

    In split mode (outlier_detection.split_external_local_origin_errors is true) consecutive 5xx detection takes into account only externally originated (transaction) errors, ignoring locally originated errors. If an upstream host is an HTTP server, only 5xx types of error are taken into account (with some exceptions). For redis servers served via redis proxy, only malformed responses from the server are taken into account. Properly formatted responses, even when they carry an operational error (such as index not found or access denied), are not taken into account.

    If an upstream host returns some number of errors which are treated as consecutive 5xx type errors, it will be ejected. The number of consecutive 5xx errors required for ejection is controlled by a configurable threshold.
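    A consecutive-error detector of this kind can be sketched as a simple counter. The class below is a hypothetical illustration, not Envoy's implementation. It assumes that any non-error response resets the counter and that the counter starts over after an ejection.

```python
class ConsecutiveErrorDetector:
    """Hypothetical inline detector: eject after `threshold` errors in a row."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.consecutive = 0

    def on_response(self, is_error):
        """Record one response; return True if the host should be ejected."""
        if not is_error:
            # Assumption: a successful response breaks the streak.
            self.consecutive = 0
            return False
        self.consecutive += 1
        if self.consecutive >= self.threshold:
            # Assumption: counting restarts once the host is ejected.
            self.consecutive = 0
            return True
        return False


detector = ConsecutiveErrorDetector(threshold=3)
# Two errors, one success, then three errors in a row.
results = [detector.on_response(e)
           for e in [True, True, False, True, True, True]]
```

    Only the final response triggers an ejection, because the success in the middle interrupts the first streak. This is why this type of detection can run inline with each response rather than at an interval.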

    In the default mode (outlier_detection.split_external_local_origin_errors is false) consecutive gateway failure detection takes into account a subset of 5xx errors, called “gateway errors” (502, 503 or 504 status code), and local origin failures, such as timeout, TCP reset etc.

    In split mode (outlier_detection.split_external_local_origin_errors is true) this detection type takes into account a subset of 5xx errors, called “gateway errors” (502, 503 or 504 status code), and is supported only by the http router.

    If an upstream host returns some number of consecutive “gateway errors” (502, 503 or 504 status code), it will be ejected. The number of consecutive gateway failures required for ejection is controlled by the outlier_detection.consecutive_gateway_failure value.
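    The relationship between 5xx errors and the "gateway error" subset can be made explicit with a small classifier. The function name classify_status is hypothetical, and the sketch only encodes the status codes listed above.

```python
GATEWAY_ERROR_CODES = frozenset({502, 503, 504})


def classify_status(status):
    """Classify an HTTP status code for the consecutive detection types.

    Hypothetical helper: every gateway error is also a 5xx error, but not
    every 5xx error is a gateway error.
    """
    return {
        "is_5xx": 500 <= status <= 599,
        "is_gateway_error": status in GATEWAY_ERROR_CODES,
    }


internal_error = classify_status(500)
bad_gateway = classify_status(502)
ok = classify_status(200)
```

    A 500 response therefore counts toward consecutive 5xx detection only, while a 502, 503 or 504 counts toward both detection types.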

    Consecutive local origin failure detection is enabled only when outlier_detection.split_external_local_origin_errors is true and takes into account only locally originated errors (timeout, reset, etc). If Envoy repeatedly cannot connect to an upstream host or communication with the upstream host is repeatedly interrupted, the host will be ejected. Various locally originated problems are detected: timeout, TCP reset, ICMP errors, etc. The number of consecutive locally originated failures required for ejection is controlled by a configurable threshold. This detection type is supported by http router, tcp proxy and redis proxy.

    Success Rate based outlier detection aggregates success rate data from every host in a cluster. Then, at given intervals, it ejects hosts based on statistical outlier detection. Success Rate outlier detection will not be calculated for a host if its request volume over the aggregation interval is less than the configured minimum request volume. Moreover, detection will not be performed for a cluster if the number of hosts with the minimum required request volume in an interval is less than the outlier_detection.success_rate_minimum_hosts value.

    In split mode (outlier_detection.split_external_local_origin_errors is true), locally originated errors and externally originated (transaction) errors are counted and treated separately. Most configuration items, such as outlier_detection.success_rate_minimum_hosts and outlier_detection.success_rate_stdev_factor, apply to both types of errors, but enforcement is configured separately: one setting applies to externally originated errors only, and outlier_detection.enforcing_local_origin_success_rate applies to locally originated errors only.
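    The eligibility rules and the statistical comparison can be sketched as below. This is a simplified model, not Envoy's algorithm. The function name is hypothetical, the threshold formula (mean minus a multiple of the population standard deviation) is an assumption, and stdev_factor is used here as a plain multiplier, which may not match how the real configuration value is scaled.

```python
from statistics import mean, pstdev


def success_rate_outliers(hosts, request_volume, minimum_hosts, stdev_factor):
    """Return hosts whose success rate is a statistical outlier.

    hosts maps a host name to (successful_requests, total_requests).
    """
    # Hosts below the minimum request volume are excluded from the calculation.
    rates = {
        name: 100.0 * ok / total
        for name, (ok, total) in hosts.items()
        if total > 0 and total >= request_volume
    }
    # Skip detection for the whole cluster if too few hosts qualify.
    if not rates or len(rates) < minimum_hosts:
        return []
    values = list(rates.values())
    # Assumed threshold: cluster mean minus stdev_factor standard deviations.
    threshold = mean(values) - stdev_factor * pstdev(values)
    return sorted(name for name, rate in rates.items() if rate < threshold)


hosts = {
    "a": (100, 100),
    "b": (99, 100),
    "c": (98, 100),
    "d": (50, 100),
    "e": (5, 10),   # too little traffic to be evaluated
}
outliers = success_rate_outliers(hosts, request_volume=100,
                                 minimum_hosts=4, stdev_factor=1.0)
skipped = success_rate_outliers(hosts, request_volume=100,
                                minimum_hosts=5, stdev_factor=1.0)
```

    In this example host "d" falls well below the other eligible hosts and is flagged, host "e" is ignored because of its low request volume, and raising the minimum host count to five disables detection for the cluster entirely. In split mode, the same calculation would be run once per error origin.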

    Failure Percentage based outlier detection functions similarly to success rate detection, in that it relies on success rate data from each host in a cluster. However, rather than comparing those values to the mean success rate of the cluster as a whole, it compares them to a flat user-configured threshold.

    The other configuration fields for failure percentage based detection are similar to the fields for success rate detection. Failure percentage based detection also obeys outlier_detection.split_external_local_origin_errors; the enforcement percentages for externally and locally originated errors are controlled by separate settings, with outlier_detection.enforcing_failure_percentage_local_origin covering locally originated errors. As with success rate detection, detection will not be performed for a host if its request volume over the aggregation interval is less than the configured minimum request volume. Detection also will not be performed for a cluster if the number of hosts with the minimum required request volume in an interval is less than the outlier_detection.failure_percentage_minimum_hosts value.
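    Failure percentage detection can be sketched in the same style. The function and parameter names are hypothetical, and treating a failure percentage equal to the threshold as an outlier is an assumption of this sketch.

```python
def failure_percentage_outliers(hosts, threshold, request_volume,
                                minimum_hosts):
    """Return hosts whose failure percentage reaches a flat threshold.

    hosts maps a host name to (successful_requests, total_requests).
    """
    # Only hosts with enough traffic in the interval are evaluated.
    failure_pct = {
        name: 100.0 * (total - ok) / total
        for name, (ok, total) in hosts.items()
        if total > 0 and total >= request_volume
    }
    # Skip detection for the whole cluster if too few hosts qualify.
    if len(failure_pct) < minimum_hosts:
        return []
    # Assumption: a host exactly at the threshold counts as an outlier.
    return sorted(name for name, pct in failure_pct.items()
                  if pct >= threshold)


hosts = {
    "a": (100, 100),   # 0% failures
    "b": (80, 100),    # 20% failures
    "c": (10, 100),    # 90% failures
    "d": (0, 10),      # 100% failures, but too little traffic
}
flagged = failure_percentage_outliers(hosts, threshold=85,
                                      request_volume=50, minimum_hosts=3)
```

    Unlike the success rate sketch, no cluster-wide statistics are involved: each eligible host is compared against the same fixed percentage, so a uniformly unhealthy cluster can have every host flagged.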

    For gRPC requests, the outlier detection will use the HTTP status mapped from the response header.

    A log of outlier ejection events can optionally be produced by Envoy. This is extremely useful during daily operations since global stats do not provide enough information on which hosts are being ejected and for what reasons. The log is structured as protobuf-based dumps of OutlierDetectionEvent messages. Ejection event logging is configured in the Cluster manager configuration.