Is your application’s performance degrading? Monitoring the right way!
I have been researching this for a long time to understand how the big tech giants monitor the quality of their services and the health of their immense infrastructure. I was quite sure that monotonous pings were not the answer. So what was it? What different approach do they (Google, for example) take to track the quality of their services and the satisfaction of their customers?
Continuous Monitoring 🔍
Continuous monitoring is an automated way of checking the uptime and health of computing resources: a proactive approach that improves the DevOps lifecycle by highlighting the areas that need special care. It can be divided into two broad categories:
- Black-Box Monitoring checks resources such as disks, VMs and web servers from the outside, as a user would see them, and triggers an alert when a threshold is crossed.
- White-Box Monitoring is based on metrics exposed by the internals of the system: logs, interfaces like the Java Virtual Machine Profiling Interface, or an HTTP handler that emits internal statistics.
Black-Box 👉 THERE is a problem
White-Box 👉 THIS is the problem
Without monitoring in place, an organization is clueless about its computing infrastructure.
Metrics are everywhere! 📈
A metric is a measurement of an attribute of a resource; it can be used to produce an alert when its value crosses a predefined threshold. Some examples:
- used memory in a VM
- number of Java threads in an application
- number of active users (a custom application metric)
- HTTP hit count of a web-server
- a temperature reading of a machine
Metrics are generally stored as time series in time-series databases such as Prometheus.
http_requests_total{service="users-directory", method="GET", endpoint="/user/:id", status="403"} 1150
# http_requests_total is the name of the metric & 1150 is its value
# service, method, endpoint and status are the labels of this metric, which help us query and identify it
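Samples like the one above are typically produced by a client library inside the application. Here is a minimal, illustrative sketch assuming the Python prometheus_client library; the port is arbitrary and the label values simply mirror the sample:

```python
# Minimal sketch: producing the sample above with the Python prometheus_client library.
from prometheus_client import Counter, start_http_server

http_requests_total = Counter(
    "http_requests_total",
    "Total number of HTTP requests handled",
    ["service", "method", "endpoint", "status"],
)

start_http_server(8000)   # exposes the metric at :8000/metrics for Prometheus to scrape

# Every increment adds to the value of exactly one labelled time series.
http_requests_total.labels(
    service="users-directory", method="GET", endpoint="/user/:id", status="403"
).inc()
```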
Metrics are commonly grouped into two sets: USE and RED.
USE metrics describe resources:
- Utilization (% of time the resource was busy)
- Saturation (the amount of work the resource has queued up, often a queue length)
- Errors (the count of error events)
RED metrics describe requests:
- Rate (the number of requests per second)
- Errors (the number of those requests that fail)
- Duration (the amount of time those requests take)
USE metrics 👉 How happy are my SERVERS?
RED metrics 👉 How happy are my CUSTOMERS?
RED metrics are instrumented in application code and later scraped by systems like Prometheus; Istio can also be deployed as a sidecar in Kubernetes to get these metrics out of the box.
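As a rough sketch of hand-rolled instrumentation (again assuming prometheus_client; the handler, endpoint and error simulation are made up for illustration), a counter covers Rate and Errors while a histogram covers Duration:

```python
# Sketch of hand-rolled RED instrumentation with the Python prometheus_client library.
# The handler is a hypothetical stand-in for a real web-framework view.
import random
import time

from prometheus_client import Counter, Histogram

REQUESTS = Counter(
    "http_requests_total", "HTTP requests handled",           # Rate and Errors
    ["method", "endpoint", "status"],
)
DURATION = Histogram(
    "http_request_duration_seconds", "HTTP request latency",  # Duration
    ["method", "endpoint"],
)

def handle_user_lookup(user_id: str) -> int:
    """Hypothetical request handler that returns an HTTP status code."""
    start = time.perf_counter()
    time.sleep(random.uniform(0.01, 0.2))             # simulate doing some work
    status = 200 if random.random() > 0.01 else 500   # simulate a ~1% error rate
    DURATION.labels("GET", "/user/:id").observe(time.perf_counter() - start)
    REQUESTS.labels("GET", "/user/:id", str(status)).inc()
    return status

# Served to Prometheus exactly as in the previous sketch (start_http_server + /metrics).
handle_user_lookup("42")
```

Prometheus can then derive the request rate, the error ratio and latency percentiles from these two series.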
Service Level Terminology
To understand how we can use RED metrics for alerting, we first need a few pieces of terminology.
Service Level Indicator 🟢
An SLI is a carefully defined quantitative measure of some aspect of the level of service that is provided.
SLI = Good Events / Total Events
SLI = Good HTTP Requests / Total HTTP Requests
SLI is 0% 👉 Nothing is working
SLI is 100% 👉 Everything is working
Service Level Objective ✔️
An SLO is a target value for an SLI, measured over a period of time. It is internal to the organization. For example:
- 99.9% of HTTP requests succeed over 28 days for a web service.
- 99.9% of HTTP requests succeed within 500 ms over 28 days for a web service (latency included).
- 99.9% of database writes complete within 1 s and reads within 3 s for a database-as-a-service.
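To make the latency-included variant concrete, here is a toy calculation (pure Python, made-up observations) where a request only counts as good if it both succeeded and finished within 500 ms:

```python
# Toy sketch: a latency-included SLI over a list of (status_code, duration_seconds)
# observations. The sample data is made up purely for illustration.
events = [(200, 0.120), (200, 0.480), (503, 0.030), (200, 0.750), (200, 0.090)]

good = sum(1 for status, duration in events if status < 500 and duration <= 0.5)
sli = good / len(events)

print(f"latency-included SLI: {sli:.1%}")   # 3 of 5 events are good -> 60.0%
```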
Error Budget & Error Burn Rate ❎
Errors are inevitable and no system is 100% reliable!
The Error Budget is the complement of the SLO: it is the amount of failure each application or team is allowed.
Error Budget = (100 - SLO)%
- 0.1% of HTTP requests may fail over 28 days for a web service.
- 0.1% of HTTP requests may fail or respond in more than 500 ms over 28 days for a web service (latency included).
- 0.1% of database writes and reads may take more than 1 s and 3 s respectively for a database-as-a-service.
The Error Burn Rate is the rate at which an application is consuming its Error Budget. On its own the number does not mean much, but it is what alerts are triggered on.
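A back-of-the-envelope sketch with made-up numbers, for a 99.9% SLO over 28 days:

```python
# Illustrative sketch: error budget and burn rate for a 99.9% SLO over 28 days.
SLO = 0.999                       # 99.9% of requests must succeed
WINDOW_HOURS = 28 * 24

error_budget = 1 - SLO            # 0.1% of requests may fail within the window

# Suppose we measured these over the last hour (made-up numbers):
requests_last_hour = 100_000
errors_last_hour = 300

observed_error_rate = errors_last_hour / requests_last_hour   # 0.3%
burn_rate = observed_error_rate / error_budget                # ~3x

print(f"error budget: {error_budget:.3%}, burn rate: {burn_rate:.1f}x")
# A burn rate of 1x means the budget lasts exactly the whole 28-day window;
# at ~3x it would be exhausted in roughly a third of the window (~9.3 days).
print(f"budget exhausted after ~{WINDOW_HOURS / burn_rate:.0f} hours at this rate")
```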
Service Level Agreement 📖
An SLA is an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs it contains. The consequences are most easily recognized when they are financial (a rebate or a penalty), but they can take other forms.
Alerting on SLOs 💥
We turn SLOs into alerting rules so that we can respond to problems before we consume too much of our error budget.
We can reliably detect breaches of configurable SLOs with an appropriate detection time (the quicker the error budget is burned, the sooner the alert will fire) and a sufficiently short reset time. The resulting alerts are strictly symptom-based and therefore always relevant.
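A simple single-window version of such a rule might look like the sketch below; the 10x threshold and the one-hour measurements are illustrative choices, not prescriptions. The idea is to fire when the error rate over the window burns the budget faster than some multiple allows:

```python
# Illustrative burn-rate alert check (threshold and window are example values).
def should_alert(errors: int, total: int, slo: float, burn_rate_threshold: float) -> bool:
    """Fire when the observed error rate burns the budget faster than allowed."""
    if total == 0:
        return False
    error_budget = 1 - slo
    burn_rate = (errors / total) / error_budget
    return burn_rate > burn_rate_threshold

# e.g. page if, over the last hour, we burn the budget more than 10x too fast:
print(should_alert(errors=300, total=100_000, slo=0.999, burn_rate_threshold=10))    # False
print(should_alert(errors=1_500, total=100_000, slo=0.999, burn_rate_threshold=10))  # True
```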
Putting the pieces of the puzzle together ⚙️
Suppose you are the product owner of a Python-based web service and you commit to your organization that 99.99% of the service's responses will be non-5XX HTTP codes over a month. The organization, in turn, commits to only 99.9% (not 99.99%) non-5XX responses in its agreement with the users.
SLI = Non-5XX HTTP responses / Total HTTP responses; measured from Prometheus metrics
SLO = 99.99%; internal to the company
Error Budget = 100% - 99.99% = 0.01%
SLA = 99.9%; agreement with the client, and a breach has consequences
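As a sketch of how these numbers could be pulled out of Prometheus (assuming a server at localhost:9090 scraping the http_requests_total counter from earlier; the endpoint, label names and 28-day window are assumptions):

```python
# Sketch: computing the availability SLI for the scenario above from Prometheus.
# Assumes a Prometheus server at localhost:9090 scraping http_requests_total.
import requests  # third-party HTTP client; any client would do

PROMETHEUS = "http://localhost:9090/api/v1/query"

def instant_query(expr: str) -> float:
    """Run a PromQL instant query and return the first scalar result."""
    result = requests.get(PROMETHEUS, params={"query": expr}).json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

good = instant_query('sum(increase(http_requests_total{status!~"5.."}[28d]))')
total = instant_query('sum(increase(http_requests_total[28d]))')

SLO = 0.9999                              # internal objective: 99.99% non-5XX
sli = good / total if total else 1.0
budget_consumed = (1 - sli) / (1 - SLO)   # fraction of the error budget used so far

print(f"SLI: {sli:.4%}, error budget consumed: {budget_consumed:.0%}")
```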
The monitoring system now has to generate an alert whenever the Error Burn Rate is higher than allowed, i.e. whenever there is a risk that the whole Error Budget will be consumed long before the month ends if errors keep occurring at the current rate.
Think of the Error Budget as pocket money 💰: a child gets $10 for the whole month, i.e. they can spend up to about $0.33 per day. But what if the child spends $2 on the very first day? That is roughly 6x the allowed pace, so there is a high risk that the whole $10 will be gone well before the month ends if the spending continues at this rate, and the child (or their guardians) needs to be notified.
Multiwindow, Multi-Burn-Rate Alerts
This is the most mature, most iterated version of alerting on SLOs, described by Google. Watch the video below to understand how it works; it took me hours to grasp, and every time I rewatch it I learn something new. Start at 23:45 if you want to skip the basics.
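As a rough sketch of the idea, each alert tier pairs a long window with a short one and only fires when both are burning faster than the tier's threshold: the long window proves the budget is really being spent, while the short window lets the alert reset quickly once the problem stops. The window and burn-rate pairs below are the example values from the SRE Workbook chapter linked in the references, not universal constants:

```python
# Sketch of multiwindow, multi-burn-rate alerting. The (long, short, burn-rate)
# tiers are the example values from the SRE Workbook's "Alerting on SLOs" chapter.
TIERS = [
    # (long window, short window, burn-rate threshold, severity)
    ("1h", "5m", 14.4, "page"),
    ("6h", "30m", 6.0, "page"),
    ("3d", "6h", 1.0, "ticket"),
]

def burn_rate(error_rate: float, slo: float) -> float:
    return error_rate / (1 - slo)

def evaluate(error_rates: dict, slo: float) -> list:
    """error_rates maps a window label (e.g. '1h') to its observed error ratio."""
    alerts = []
    for long_w, short_w, threshold, severity in TIERS:
        # Both windows must exceed the threshold before the tier fires.
        if (burn_rate(error_rates[long_w], slo) > threshold
                and burn_rate(error_rates[short_w], slo) > threshold):
            alerts.append((severity, long_w, short_w))
    return alerts

# Made-up measurements for a 99.9% SLO: a sharp error spike in the last hour.
rates = {"5m": 0.02, "1h": 0.016, "30m": 0.01, "6h": 0.004, "3d": 0.0008}
print(evaluate(rates, slo=0.999))   # -> [('page', '1h', '5m')]
```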
References
- http://www.brendangregg.com/usemethod.html
- https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/
- https://landing.google.com/sre/sre-book/chapters/service-level-objectives/
- https://landing.google.com/sre/workbook/chapters/alerting-on-slos/
- https://developers.soundcloud.com/blog/alerting-on-slos
- https://tanzu.vmware.com/content/vmware-tanzu-observability-blog/slo-alerting-with-wavefront
- https://promtools.dev/alerts/errors
- https://github.com/google/slo-generator