Common observability strategies

    Following a consistent strategy lets you build uniform dashboards and scale your observability platform more easily.

    • The USE method tells you how happy your machines are; the RED method tells you how happy your users are.
    • RED reports on user experience, so it is more likely to surface symptoms of problems.
    • Best practice is to alert on symptoms rather than causes, so base your alerts on RED dashboards.

    USE stands for:

    • Utilization - Percentage of time the resource is busy, such as node CPU usage
    • Saturation - Amount of work a resource has to do, often queue length or node load
    • Errors - Count of error events
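As a rough illustration of the three USE measurements, the sketch below aggregates a few samples for a single resource. The `ResourceSample` structure, its field names, and the sample values are hypothetical and exist only for this example; a real system would read these figures from the operating system or a metrics agent.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One sampling interval for a single resource (hypothetical structure)."""
    busy_seconds: float      # time the resource spent doing work
    interval_seconds: float  # length of the sampling interval
    queue_length: int        # work waiting for the resource at sample time
    error_count: int         # error events seen during the interval


def use_summary(samples: list[ResourceSample]) -> dict[str, float]:
    """Aggregate samples into Utilization, Saturation, and Errors."""
    total_busy = sum(s.busy_seconds for s in samples)
    total_time = sum(s.interval_seconds for s in samples)
    total_queue = sum(s.queue_length for s in samples)
    return {
        # Utilization: percentage of time the resource was busy.
        "utilization_percent": 100.0 * total_busy / total_time if total_time else 0.0,
        # Saturation: average amount of queued work per sample.
        "saturation_avg_queue": total_queue / len(samples) if samples else 0.0,
        # Errors: count of error events across all samples.
        "errors": float(sum(s.error_count for s in samples)),
    }


# Illustrative values only.
samples = [
    ResourceSample(busy_seconds=30.0, interval_seconds=60.0, queue_length=2, error_count=0),
    ResourceSample(busy_seconds=54.0, interval_seconds=60.0, queue_length=6, error_count=1),
]
summary = use_summary(samples)
```

With these two samples, the resource was busy for 84 of 120 seconds, so the summary reports 70 percent utilization, an average queue length of 4, and one error.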

    RED stands for:

    • Rate - Number of requests per second
    • Errors - Number of requests that are failing
    • Duration - Amount of time these requests take, expressed as a distribution of latency measurements

    The RED method is most applicable to services, especially in a microservices environment. For each of your services, instrument the code to expose these metrics for each component. RED dashboards are good for alerting and SLAs. A well-designed RED dashboard is a proxy for user experience.
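A minimal sketch of what instrumenting a service for RED might look like, using only the Python standard library. The `RedRecorder` class and its method names are hypothetical; in practice a metrics client library would usually collect these values and expose them to the monitoring system. Duration is summarized here with a nearest-rank 95th percentile, which is one possible choice among several.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RedRecorder:
    """Collects per-request observations for one service (hypothetical helper)."""
    durations: list[float] = field(default_factory=list)  # seconds per request
    errors: int = 0                                        # failed requests

    def observe(self, duration_seconds: float, failed: bool) -> None:
        """Record one finished request."""
        self.durations.append(duration_seconds)
        if failed:
            self.errors += 1

    def summary(self, window_seconds: float) -> dict[str, float]:
        """Return Rate, Errors, and Duration for the observation window."""
        if window_seconds <= 0:
            raise ValueError("window_seconds must be positive")
        total = len(self.durations)
        if total == 0:
            return {"rate_per_second": 0.0, "error_ratio": 0.0, "p95_seconds": 0.0}
        ordered = sorted(self.durations)
        # Nearest-rank percentile: ceil(0.95 * total), done in integer arithmetic.
        rank = (95 * total + 99) // 100
        return {
            "rate_per_second": total / window_seconds,  # Rate
            "error_ratio": self.errors / total,         # Errors
            "p95_seconds": ordered[rank - 1],           # Duration
        }


# Illustrative traffic: 20 requests observed over a 10-second window.
recorder = RedRecorder()
for _ in range(18):
    recorder.observe(0.1, failed=False)
recorder.observe(0.5, failed=False)
recorder.observe(2.0, failed=True)
red = recorder.summary(window_seconds=10.0)
```

For this sample traffic the summary works out to 2 requests per second, an error ratio of 0.05, and a 95th percentile duration of 0.5 seconds.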

    According to the Google SRE handbook, if you can only measure four metrics of your user-facing system, focus on the four golden signals: latency, traffic, errors, and saturation.

    This approach is similar to the RED method, but it also includes saturation.

    • Latency - Time taken to serve a request
    • Traffic - How much demand is placed on your system
    • Errors - Rate of requests that are failing
    • Saturation - How "full" your system is
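One way to put these signals to work is to compare a snapshot of them against symptom-based thresholds. The sketch below assumes a hypothetical `GoldenSignals` record and made-up threshold values; real limits depend on your service and its objectives. Traffic is recorded for context but not thresholded here, since high demand is not by itself a symptom of failure.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenSignals:
    """Snapshot of one user-facing system (hypothetical structure)."""
    latency_p95_seconds: float  # Latency: time taken to serve a request
    traffic_rps: float          # Traffic: demand, in requests per second
    error_ratio: float          # Errors: fraction of requests that fail
    saturation_ratio: float     # Saturation: 0.0 is idle, 1.0 is full


# Illustrative thresholds only; these values are assumptions for the example.
THRESHOLDS = {
    "latency_p95_seconds": 0.5,
    "error_ratio": 0.01,
    "saturation_ratio": 0.9,
}


def breached(signals: GoldenSignals) -> list[str]:
    """Return the names of the signals that exceed their threshold."""
    return [
        name
        for name, limit in THRESHOLDS.items()
        if getattr(signals, name) > limit
    ]


snapshot = GoldenSignals(
    latency_p95_seconds=0.8,
    traffic_rps=120.0,
    error_ratio=0.002,
    saturation_ratio=0.93,
)
alerts = breached(snapshot)
```

In this snapshot, latency and saturation exceed their limits while the error ratio does not, so `alerts` contains `latency_p95_seconds` and `saturation_ratio`.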