Common observability strategies
Following a logical strategy lets you build uniform dashboards and scale your observability platform more easily.
- The USE method tells you how happy your machines are; the RED method tells you how happy your users are.
- RED reports on user experience and is more likely to report symptoms of problems.
- Alerting best practice is to alert on symptoms rather than causes, so base your alerts on RED dashboards.
USE stands for:
- Utilization - Percentage of time the resource is busy, such as node CPU usage
- Saturation - Amount of work a resource has to do, often queue length or node load
- Errors - Count of error events
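The three USE metrics can be derived from periodic samples of a resource. The following is a minimal sketch, not a reference implementation: the `ResourceSample` fields and the `use_summary` helper are hypothetical names chosen for illustration, and average queue length is only one possible way to express saturation.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One hypothetical reading of a resource over a fixed interval."""
    busy_seconds: float      # time the resource spent doing work
    interval_seconds: float  # length of the sampling interval
    queue_length: int        # work waiting for the resource
    error_count: int         # error events seen during the interval


def use_summary(samples: list[ResourceSample]) -> dict[str, float]:
    """Summarize Utilization, Saturation, and Errors across all samples."""
    total_busy = sum(s.busy_seconds for s in samples)
    total_time = sum(s.interval_seconds for s in samples)
    queue_total = sum(s.queue_length for s in samples)
    return {
        # Utilization: percentage of time the resource was busy.
        "utilization_percent": 100.0 * total_busy / total_time if total_time else 0.0,
        # Saturation: average amount of queued work (an assumed definition).
        "saturation_avg_queue": queue_total / len(samples) if samples else 0.0,
        # Errors: plain count of error events.
        "errors": float(sum(s.error_count for s in samples)),
    }


samples = [
    ResourceSample(busy_seconds=45.0, interval_seconds=60.0, queue_length=2, error_count=0),
    ResourceSample(busy_seconds=15.0, interval_seconds=60.0, queue_length=4, error_count=1),
]
summary = use_summary(samples)
print(summary)
```

In practice these values usually come from a metrics system rather than hand-built samples; the sketch only shows how the three numbers relate to the raw readings.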
RED stands for:
- Rate - Number of requests per second
- Errors - Number of requests that are failing
- Duration - Amount of time these requests take, distribution of latency measurements
The RED method is most applicable to services, especially in a microservices environment. For each of your services, instrument the code to expose these metrics for each component. RED dashboards are good for alerting and SLAs. A well-designed RED dashboard is a proxy for user experience.
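As a rough illustration of what a RED summary for one service could look like, the sketch below computes request rate, error count, and latency percentiles from a list of request records. The `Request` type, the `red_summary` helper, and the nearest-rank percentile are assumptions made for this example; a real service would normally expose counters and histograms through its metrics library instead.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """A hypothetical record of one handled request."""
    duration_seconds: float
    failed: bool


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def red_summary(requests: list[Request], window_seconds: float) -> dict:
    """Compute Rate, Errors, and Duration for one observation window."""
    durations = [r.duration_seconds for r in requests]
    return {
        # Rate: requests per second over the window.
        "rate_per_second": len(requests) / window_seconds,
        # Errors: number of failed requests.
        "errors": sum(1 for r in requests if r.failed),
        # Duration: a small view of the latency distribution.
        "duration_p50": percentile(durations, 50) if durations else None,
        "duration_p95": percentile(durations, 95) if durations else None,
    }


durations = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.5, 0.6, 0.9, 2.0]
requests = [Request(d, failed=(d >= 0.9)) for d in durations]
red = red_summary(requests, window_seconds=5.0)
print(red)
```

Reporting percentiles rather than a single average keeps slow outliers visible, which matters when the dashboard is meant to stand in for user experience.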
The four golden signals are similar to the RED method, but they also include saturation. According to the Google SRE handbook, if you can only measure four metrics of your user-facing system, focus on these four:
- Latency - Time taken to serve a request
- Traffic - How much demand is placed on your system
- Errors - Rate of requests that are failing
- Saturation - How "full" your system is
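The four golden signals can be sketched the same way for a single user-facing service. This is an illustration under stated assumptions: `WindowStats` and `golden_signals` are hypothetical names, mean latency is used for brevity, and saturation is modeled as peak concurrent requests divided by an assumed concurrency limit, which is only one of several reasonable definitions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class WindowStats:
    """Hypothetical counters collected for one service over one window."""
    window_seconds: float
    latencies_seconds: list[float]  # one entry per served request
    failed_requests: int
    in_flight_peak: int             # peak concurrent requests in the window
    max_concurrency: int            # assumed capacity limit of the service


def golden_signals(stats: WindowStats) -> dict[str, float]:
    """Compute latency, traffic, errors, and saturation for one window."""
    total = len(stats.latencies_seconds)
    return {
        # Latency: time taken to serve a request (mean, for brevity).
        "latency_mean_seconds": mean(stats.latencies_seconds) if total else 0.0,
        # Traffic: demand placed on the system, in requests per second.
        "traffic_per_second": total / stats.window_seconds,
        # Errors: fraction of requests that failed.
        "error_ratio": stats.failed_requests / total if total else 0.0,
        # Saturation: share of the assumed capacity that was in use.
        "saturation": stats.in_flight_peak / stats.max_concurrency,
    }


stats = WindowStats(
    window_seconds=10.0,
    latencies_seconds=[0.25, 0.25, 0.5, 1.0],
    failed_requests=1,
    in_flight_peak=8,
    max_concurrency=16,
)
signals = golden_signals(stats)
print(signals)
```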