Recording rules

    This page documents how to aggregate correctly and suggests a naming convention for recording rules.

    Recording rules should be of the general form level:metric:operations. level represents the aggregation level and labels of the rule output. metric is the metric name and should be unchanged other than stripping _total off counters when using rate() or irate(). operations is a list of operations that were applied to the metric, newest operation first.

    Keeping the metric name unchanged makes it easy to know what a metric is and easy to find in the codebase.

    To keep the operations clean, _sum is omitted if there are other operations, as with sum(): summing a rate5m rule yields rate5m, not sum_rate5m. Associative operations can be merged (for example, min_min is the same as min).

    When aggregating up ratios, aggregate up the numerator and denominator separately and then divide. Do not take the average of a ratio or average of an average as that is not statistically valid.
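    To see why, consider a hypothetical worked example (the numbers are invented purely for illustration): instance A serves 2 requests with 1 failure, and instance B serves 1000 requests with 10 failures.

    ```latex
    \[
    \text{average of ratios} = \frac{1}{2}\left(\frac{1}{2} + \frac{10}{1000}\right) = 0.255,
    \qquad
    \text{ratio of sums} = \frac{1 + 10}{2 + 1000} = \frac{11}{1002} \approx 0.011
    \]
    ```

    Only the second figure reflects the fraction of all requests that failed. The average of ratios gives the nearly idle instance A the same weight as instance B.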

    When aggregating up the _count and _sum of a Summary and dividing to calculate average observation size, treating it as a ratio would be unwieldy. Instead, keep the metric name without the _count or _sum suffix and replace the rate in the operation with mean. This represents the average observation size over that time period.

    Always specify a without clause with the labels you are aggregating away. This is to preserve all the other labels such as job, which will avoid conflicts and give you more useful metrics and alerts.

    Examples

    Note the indentation style with outdented operators on their own line between two vectors. To make this style possible in YAML, block scalars with an indentation indicator (e.g. |2) are used.

    Calculating a request failure ratio and aggregating up to the job-level failure ratio:
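    A minimal sketch of such rules follows. It assumes two hypothetical counters, requests_total and request_failures_total, each carrying instance and path labels and scraped under a hypothetical job="myjob". The fragment is meant to sit under the rules: list of a rule group.

    ```yaml
    # Hypothetical counters: requests_total and request_failures_total.
    # _total is stripped from the metric name because rate() is applied.
    - record: instance_path:requests:rate5m
      expr: rate(requests_total{job="myjob"}[5m])

    - record: instance_path:request_failures:rate5m
      expr: rate(request_failures_total{job="myjob"}[5m])

    # No aggregation here, so the level stays instance_path.
    - record: instance_path:request_failures_per_requests:ratio_rate5m
      expr: |2
          instance_path:request_failures:rate5m{job="myjob"}
        /
          instance_path:requests:rate5m{job="myjob"}

    # Aggregate numerator and denominator separately, then divide.
    # No instrumentation labels remain, so the level becomes job.
    - record: job:requests:rate5m
      expr: sum without (instance, path) (instance_path:requests:rate5m{job="myjob"})

    - record: job:request_failures:rate5m
      expr: sum without (instance, path) (instance_path:request_failures:rate5m{job="myjob"})

    - record: job:request_failures_per_requests:ratio_rate5m
      expr: |2
          job:request_failures:rate5m{job="myjob"}
        /
          job:requests:rate5m{job="myjob"}
    ```

    The job-level ratio is computed from the job-level sums, never by averaging the per-instance ratios.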

    Calculating average latency over a time period from a Summary:
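    One way this could look, assuming a hypothetical Summary named request_latency_seconds with instance and path labels under a hypothetical job="myjob":

    ```yaml
    # Hypothetical Summary: request_latency_seconds (exposes _count and _sum series).
    - record: instance_path:request_latency_seconds_count:rate5m
      expr: rate(request_latency_seconds_count{job="myjob"}[5m])

    - record: instance_path:request_latency_seconds_sum:rate5m
      expr: rate(request_latency_seconds_sum{job="myjob"}[5m])

    # Suffix dropped from the metric name; rate5m becomes mean5m.
    - record: instance_path:request_latency_seconds:mean5m
      expr: |2
          instance_path:request_latency_seconds_sum:rate5m{job="myjob"}
        /
          instance_path:request_latency_seconds_count:rate5m{job="myjob"}

    # Aggregate up numerator and denominator, then divide.
    - record: path:request_latency_seconds_count:rate5m
      expr: sum without (instance) (instance_path:request_latency_seconds_count:rate5m{job="myjob"})

    - record: path:request_latency_seconds_sum:rate5m
      expr: sum without (instance) (instance_path:request_latency_seconds_sum:rate5m{job="myjob"})

    - record: path:request_latency_seconds:mean5m
      expr: |2
          path:request_latency_seconds_sum:rate5m{job="myjob"}
        /
          path:request_latency_seconds_count:rate5m{job="myjob"}
    ```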

    Calculating the average query rate across instances and paths is done using the avg() function:
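    A short sketch using the avg aggregation, assuming a hypothetical input rule instance_path:requests:rate5m (a per-instance, per-path request rate) under a hypothetical job="myjob":

    ```yaml
    # avg is applied after rate, so it comes first in the operations list.
    - record: job:requests:avg_rate5m
      expr: avg without (instance, path) (instance_path:requests:rate5m{job="myjob"})
    ```

    Averaging is sound here because the input is a rate, not a ratio.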

    Notice that when aggregating, the labels in the without clause are removed from the level of the output metric name compared to the input metric names. When there is no aggregation, the levels always match. If this is not the case, a mistake has likely been made in the rules.