Panic threshold

    Envoy can operate in one of two modes when in a panic state: traffic is either sent to all hosts, or sent to no hosts (and therefore always fails). The mode is set in the cluster configuration. Choosing to fail traffic during panic scenarios can help avoid overwhelming potentially failing upstream services, because it reduces the load on the upstream service before all hosts have been determined to be unhealthy. However, it eliminates the possibility of _some_ requests succeeding even when many or all hosts in a cluster are unhealthy. This may be a good tradeoff if a given service is observed to fail in an all-or-nothing pattern, as it cuts off requests to the cluster more quickly. Conversely, if a cluster typically continues to service _some_ requests successfully even when degraded, failing traffic on panic is probably unhelpful.
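
    The two panic modes can be pictured as a choice of candidate host set. The following is a minimal sketch for illustration only, not Envoy's implementation: the function `hosts_to_balance` and its parameters (including the flag name `fail_traffic_on_panic`) are hypothetical, and it assumes that outside of panic only available hosts are considered.

```python
from typing import List


def hosts_to_balance(all_hosts: List[str], available_hosts: List[str],
                     in_panic: bool, fail_traffic_on_panic: bool) -> List[str]:
    """Pick the candidate host set for one load-balancing decision."""
    if not in_panic:
        # Assumption: outside of panic, only available hosts are considered.
        return available_hosts
    if fail_traffic_on_panic:
        # "No hosts" mode: nothing to pick from, so the request fails.
        return []
    # "All hosts" mode: health status is ignored.
    return all_hosts


hosts = ["a", "b", "c", "d"]
print(hosts_to_balance(hosts, ["a"], in_panic=True, fail_traffic_on_panic=False))
print(hosts_to_balance(hosts, ["a"], in_panic=True, fail_traffic_on_panic=True))
```

    An empty result stands for the "traffic always fails" case described above.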

    Panic thresholds work in conjunction with priorities. If the number of available hosts in a given priority goes down, Envoy will try to shift some traffic to lower priorities. If it succeeds in finding enough available hosts in lower priorities, Envoy will disregard panic thresholds. In mathematical terms, if normalized total availability across all priority levels is 100%, Envoy disregards panic thresholds and continues to distribute traffic load across priorities according to the priority load-distribution algorithm. However, when normalized total availability drops below 100%, Envoy assumes that there are not enough available hosts across all priority levels. It continues to distribute traffic load across priorities, but if a given priority level’s availability is below the panic threshold, traffic will go to all (or no) hosts in that priority level regardless of their availability.
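
    The decision rule in the paragraph above can be sketched as a small function. This is an illustrative sketch rather than Envoy's code: the name `priorities_in_panic` and its parameters are hypothetical, and the normalized total availability is taken as an input instead of being computed. The sample inputs are taken from the table in this section.

```python
from typing import List


def priorities_in_panic(availability_pct: List[float],
                        normalized_total_pct: float,
                        panic_threshold_pct: float = 50.0) -> List[bool]:
    """Return, per priority level, whether that level is treated as in panic."""
    # With 100% normalized total availability, panic thresholds are disregarded.
    if normalized_total_pct >= 100.0:
        return [False] * len(availability_pct)
    # Otherwise each priority level is checked against the panic threshold.
    return [a < panic_threshold_pct for a in availability_pct]


print(priorities_in_panic([50, 60], 100))  # [False, False]
print(priorities_in_panic([25, 25], 70))   # [True, True]
print(priorities_in_panic([5, 65], 98))    # [True, False]
```

    Passing a threshold of 0 makes the comparison false for every priority, which matches the statement that a 0% threshold disables panic mode.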

    The following examples explain the relationship between normalized total availability and panic threshold. It is assumed that the default value of 50% is used for the panic threshold.

    Assume a simple setup with 2 priority levels, where P=1 is 100% healthy. In this scenario normalized total health is always 100%, P=0 never enters panic mode, and Envoy is able to shift as much traffic as necessary to P=1.

    If P=1 becomes unhealthy, the panic threshold continues to be disregarded until the combined health of P=0 and P=1 drops below 100%. At this point Envoy starts checking the panic threshold for each priority.

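    P=0 healthy endpoints | P=1 healthy endpoints | Traffic to P=0 | P=0 in panic | Traffic to P=1 | P=1 in panic | normalized total health
    ----------------------+-----------------------+----------------+--------------+----------------+--------------+------------------------
    72%                   | 72%                   | 100%           | NO           | 0%             | NO           | 100%
    71%                   | 71%                   | 99%            | NO           | 1%             | NO           | 100%
    50%                   | 60%                   | 70%            | NO           | 30%            | NO           | 100%
    25%                   | 100%                  | 35%            | NO           | 65%            | NO           | 100%
    25%                   | 25%                   | 50%            | YES          | 50%            | YES          | 70%
    5%                    | 65%                   | 7%             | YES          | 93%            | NO           | 98%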
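
    The traffic and normalized total health columns can be reproduced with a short calculation. The sketch below assumes that each priority's health is scaled by a factor of 1.4 and capped at 100% before being summed. This section does not state that factor; it is inferred from rows such as 25% + 25% giving 70%. The name `split_traffic` and the integer rounding are likewise assumptions made for illustration, not Envoy's code.

```python
from typing import List, Tuple

# Assumption: scaling factor inferred from the table (25% + 25% -> 70%).
SCALE_PERCENT = 140


def split_traffic(healthy_pct: List[int]) -> Tuple[List[float], int]:
    """Return (traffic share per priority, normalized total health), in percent."""
    # Scale each priority's health and cap it at 100%.
    health = [min(100, h * SCALE_PERCENT // 100) for h in healthy_pct]
    normalized_total = min(100, sum(health))
    if normalized_total == 0:
        # No healthy hosts anywhere: this sketch does not model that case.
        return [0.0] * len(healthy_pct), 0

    loads = []
    remaining = 100.0
    for h in health:
        # Higher priorities are filled first; lower ones get what is left.
        share = min(remaining, h * 100 / normalized_total)
        loads.append(share)
        remaining -= share
    return loads, normalized_total


for row in [(72, 72), (71, 71), (50, 60), (25, 100), (25, 25), (5, 65)]:
    loads, total = split_traffic(list(row))
    print(row, [round(x) for x in loads], total)
```

    Under these assumptions the rounded shares and totals agree with the table rows. The sketch does not model the separate rule that applies once every priority level is in panic.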

    Panic mode can be disabled by setting the panic threshold to 0%.

    Load distribution is calculated as described above as long as at least one priority level is not in panic mode. When all priority levels enter panic mode, the load calculation algorithm changes: each priority level receives traffic in proportion to its share of the hosts across all priority levels. For example, if there are 2 priorities, P=0 and P=1, and each of them consists of 5 hosts, each level will receive 50% of the traffic. If there are 2 hosts in priority P=0 and 8 hosts in priority P=1, priority P=0 will receive 20% of the traffic and priority P=1 will receive 80% of the traffic.
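
    The all-priorities-in-panic case reduces to a proportional split by host count. A minimal sketch follows, with a hypothetical function name `total_panic_load`; it only illustrates the arithmetic from the two examples above.

```python
from typing import List


def total_panic_load(hosts_per_priority: List[int]) -> List[float]:
    """Split 100% of traffic in proportion to each priority's host count."""
    total_hosts = sum(hosts_per_priority)
    if total_hosts == 0:
        # No hosts at all: nothing to distribute.
        return [0.0] * len(hosts_per_priority)
    return [100.0 * n / total_hosts for n in hosts_per_priority]


print(total_panic_load([5, 5]))  # [50.0, 50.0]
print(total_panic_load([2, 8]))  # [20.0, 80.0]
```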

    Note that panic thresholds can be configured per-priority.