- Introduction
- SLI/SLO/SLA Concepts
- SLI Selection Strategy
- SLO Target Setting
- Error Budget Policy
- Burn Rate Alerts
- Prometheus Implementation
- Grafana Dashboard
- Organizational Adoption Strategy
- Failure Cases and Lessons Learned
- Operational Checklist
- References

Introduction
You have set up Prometheus for metric collection and configured Alertmanager for notifications. Yet every time an incident occurs, the question remains: "How much failure is acceptable?" You configured an alert at 80% CPU utilization, but it fires repeatedly on events that have no real impact on user experience. The team suffers from Alert Fatigue.
Google's SRE team established the SLI (Service Level Indicator), SLO (Service Level Objective), and Error Budget framework to solve this problem at its root. The core idea is straightforward: "Measure service quality from the user's perspective, quantify the acceptable failure range, and use data to balance reliability against development velocity."
This guide walks through the entire reliability management pipeline: SLI/SLO/SLA concepts, SLI selection strategy, SLO target setting, Error Budget policies, Burn Rate alerts, Prometheus-based implementation, Grafana dashboards, and organizational adoption strategy -- all with production-ready code examples.
SLI/SLO/SLA Concepts
SLI (Service Level Indicator)
An SLI is a quantitative measure of service quality. The Google SRE books recommend expressing it as the ratio of good events to total events. For example, "the proportion of HTTP requests that responded within 200ms out of all requests" is an SLI. A ratio SLI always falls between 0% and 100%.
SLO (Service Level Objective)
An SLO is a target value for an SLI. "The response time SLI must be at least 99.9% over a 30-day window" is an example of an SLO. SLOs are set by internal engineering teams and represent the balance point between user expectations and engineering cost. A 100% SLO is neither realistic nor desirable.
SLA (Service Level Agreement)
An SLA attaches legal or contractual consequences to an SLO. For instance, "credit refund if availability falls below 99.95%." SLAs include financial compensation or contract termination clauses. An SLA is a business contract; an SLO is an engineering target. SLOs should always be stricter than SLAs. If the SLA is 99.9%, set the SLO to at least 99.95% to maintain a higher internal standard.
Error Budget
Error Budget derives from the SLO. If the SLO is 99.9%, the Error Budget is 0.1%. Over a 30-day window, that translates to roughly 43 minutes of allowable downtime. Error Budget quantifies "how much failure is acceptable" and enables data-driven management of the trade-off between feature development velocity and reliability.
| Concept | Definition | Example | Owner |
|---|---|---|---|
| SLI | Quantitative measure of service level | Success request ratio, latency distribution | Engineering |
| SLO | Target value for an SLI | 30-day availability at least 99.9% | Engineering + Product |
| SLA | SLO + contractual consequences | Credit refund if below 99.95% | Business + Legal |
| Error Budget | 100% - SLO | 0.1% = 43 min downtime per month | Engineering |
SLI Selection Strategy
SLIs by Service Type
Different service types require different SLIs. The Google SRE Workbook categorizes them as follows.
| Service Type | Primary SLIs | Measurement Method |
|---|---|---|
| Request-driven (API, Web) | Availability, Latency | Success response ratio, p50/p99 response time |
| Pipeline (Batch, ETL) | Freshness, Correctness | Data processing delay, result accuracy |
| Storage (DB, Cache) | Durability, Throughput | Data loss rate, reads/writes per second |
| Streaming (Kafka, Pub/Sub) | Latency, Throughput | End-to-end latency, messages per second |
SLI Selection Principles
- Critical User Journey (CUJ) based: Define SLIs for each key user journey such as login, search, and checkout.
- Ease of measurement first: Start with SLIs extractable from existing logs or metrics rather than requiring complex instrumentation.
- Synthetic monitoring in parallel: Server-side metrics alone cannot capture CDN cache hit rates or client rendering times. Combine with synthetic monitoring.
- Multi-grade SLOs: Set multiple thresholds for a single SLI. Track two objectives simultaneously: "90% of requests under 100ms, 99% of requests under 400ms."
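As a sketch of the multi-grade idea, the following hypothetical snippet checks two latency thresholds against a batch of observed request durations (the function names and sample numbers are illustrative, not from any particular library):

```python
def good_ratio(latencies_ms, threshold_ms):
    """Fraction of requests at or under the latency threshold."""
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)

def check_multi_grade(latencies_ms, grades):
    """grades: list of (threshold_ms, target_ratio) pairs."""
    return {
        f"<{threshold}ms": good_ratio(latencies_ms, threshold) >= target
        for threshold, target in grades
    }

# Illustrative sample: 100 requests, mostly fast with a slow tail
sample = [50] * 92 + [150] * 6 + [800] * 2
result = check_multi_grade(sample, [(100, 0.90), (400, 0.99)])
# The fast grade passes (92% under 100ms) while the tail grade fails
# (98% under 400ms, short of the 99% target)
```

Tracking both grades surfaces tail-latency regressions that a single median-oriented threshold would hide.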
SLO Target Setting
Allowed Downtime by SLO Level
As SLO targets increase, the allowed downtime shrinks by roughly an order of magnitude with each additional nine.
| SLO | Monthly Error Budget | Annual Error Budget |
|---|---|---|
| 99% | 7h 18m | 3d 15h |
| 99.5% | 3h 39m | 1d 19h |
| 99.9% | 43m 50s | 8h 46m |
| 99.95% | 21m 55s | 4h 23m |
| 99.99% | 4m 23s | 52m 36s |
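The table values follow from simple arithmetic. A minimal sketch, assuming an average Gregorian month of 30.44 days and year of 365.25 days (which is why 99.9% appears as 43m 50s rather than the 43m 12s a flat 30-day month would give):

```python
def allowed_downtime_minutes(slo_pct, window_days):
    """Minutes of downtime permitted by an SLO over a window."""
    budget_fraction = 1 - slo_pct / 100
    return window_days * 24 * 60 * budget_fraction

# Average month (30.44 days) and year (365.25 days),
# matching the figures in the table above
monthly = allowed_downtime_minutes(99.9, 30.44)   # ~43.8 min -> 43m 50s
yearly = allowed_downtime_minutes(99.9, 365.25)   # ~526 min  -> 8h 46m
```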
SLO Setting Process
- Measure current performance: Collect at least two weeks of real data to understand your service's actual reliability.
- Survey user expectations: Determine what quality level users actually perceive. Internal tools and external services have different expectations.
- Cost-benefit analysis: Evaluate whether the cost of going from 99.9% to 99.99% delivers proportional business value.
- Stakeholder alignment: Set a number that product managers, engineering leads, and executives all agree on.
- Iterative adjustment: Start with conservative SLOs and review quarterly.
Error Budget Policy
An Error Budget policy documents the specific actions to take when the Error Budget is consumed. The Google SRE Workbook recommends a written policy.
Error Budget Policy Example
# error-budget-policy.yaml
error_budget_policy:
  service: payment-api
  slo_window: 30d
  slo_target: 99.9%
  thresholds:
    - level: green
      budget_remaining: '>50%'
      actions:
        - Normal feature development proceeds
        - Weekly SLO review meeting
    - level: yellow
      budget_remaining: '25%-50%'
      actions:
        - Reliability review required before any feature release
        - Technical debt reduction work runs in parallel
        - Daily SLO review meeting
    - level: orange
      budget_remaining: '10%-25%'
      actions:
        - Halt new feature releases
        - Dedicate 50%+ engineering time to reliability work
        - Execute postmortem analysis immediately
    - level: red
      budget_remaining: '<10%'
      actions:
        - Freeze all feature development
        - Focus all engineering resources on reliability recovery
        - Escalate to executive leadership
        - Maintain freeze until root cause of SLO miss is resolved
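A policy like this maps naturally to a small lookup so that automation (reports, chat bots, dashboards) can surface the current level. A hypothetical helper mirroring the four levels:

```python
def policy_level(budget_remaining_pct):
    """Map remaining error budget (%) to a policy level.

    Thresholds mirror the example policy: green above 50%,
    yellow above 25%, orange above 10%, red at or below 10%.
    """
    if budget_remaining_pct > 50:
        return "green"
    if budget_remaining_pct > 25:
        return "yellow"
    if budget_remaining_pct > 10:
        return "orange"
    return "red"
```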
Error Budget Exhaustion Response
How an organization responds to Error Budget exhaustion is a litmus test of its reliability culture. Key principles include:
- Feature Freeze: When the Error Budget is exhausted, reliability recovery takes top priority.
- Mandatory Postmortem: Conduct a blameless postmortem for every Error Budget exhaustion event.
- Automation investment: If manual work (Toil) is the root cause, prioritize automation.
- SLO Review: If Error Budget is repeatedly exhausted, evaluate whether the SLO itself is too aggressive.
Burn Rate Alerts
Burn Rate Concept
Burn Rate is a multiplier indicating how quickly you are consuming your Error Budget. A Burn Rate of 1 means you will exactly exhaust your Error Budget over the SLO window (e.g., 30 days). A Burn Rate of 10 means you will consume the 30-day budget in just 3 days.
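The relationship can be sketched in a few lines (illustrative helper names; the numbers assume a 99.9% SLO over 30 days):

```python
def burn_rate(error_ratio, error_budget_fraction):
    """How many times faster than sustainable the budget is burning.

    A rate of 1 exhausts the budget exactly at the end of the window.
    """
    return error_ratio / error_budget_fraction

def days_to_exhaustion(rate, window_days=30):
    """Days until the budget is gone at a constant burn rate."""
    return window_days / rate

# SLO 99.9% -> budget fraction 0.001; a steady 1% error ratio burns 10x
rate = burn_rate(0.01, 0.001)
days = days_to_exhaustion(rate)   # 30-day budget gone in 3 days
```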
Alerting Approach Comparison
| Alerting Approach | Pros | Cons |
|---|---|---|
| Threshold-based (error rate exceeds 1%) | Simple to configure | Many alerts unrelated to user impact |
| Single-window Burn Rate | Directly tied to SLO | Overreacts to short spikes |
| Multi-Window Multi-Burn-Rate | Precise detection | Complex configuration |
Multi-Window Multi-Burn-Rate Alerts
This is the approach recommended by the Google SRE Workbook. It checks two windows simultaneously (long window + short window). The long window reflects actual trends; the short window prevents alerting on already-recovered issues.
# multi-window-burn-rate-alerts.yaml
# Google SRE Workbook recommended alert configuration
groups:
  - name: slo-burn-rate-alerts
    rules:
      # Page alert: 2% budget consumed in 1 hour (Burn Rate 14.4)
      - alert: SLOBurnRateHigh_Page
        expr: |
          (
            sli:error_ratio:rate1h > (14.4 * 0.001)
            and
            sli:error_ratio:rate5m > (14.4 * 0.001)
          )
        for: 2m
        labels:
          severity: page
        annotations:
          summary: 'Error budget burn rate is very high'
          description: 'Service is consuming error budget 14.4x faster than normal. 2% of 30-day budget will be consumed in 1 hour.'
      # Page alert: 5% budget consumed in 6 hours (Burn Rate 6)
      - alert: SLOBurnRateMedium_Page
        expr: |
          (
            sli:error_ratio:rate6h > (6 * 0.001)
            and
            sli:error_ratio:rate30m > (6 * 0.001)
          )
        for: 5m
        labels:
          severity: page
        annotations:
          summary: 'Error budget burn rate is elevated'
          description: 'Service is consuming error budget 6x faster than normal. 5% of 30-day budget will be consumed in 6 hours.'
      # Ticket alert: 10% budget consumed in 3 days (Burn Rate 1)
      - alert: SLOBurnRateSlow_Ticket
        expr: |
          (
            sli:error_ratio:rate3d > (1 * 0.001)
            and
            sli:error_ratio:rate6h > (1 * 0.001)
          )
        for: 30m
        labels:
          severity: ticket
        annotations:
          summary: 'Error budget burn rate indicates slow degradation'
          description: 'Service is steadily consuming error budget. 10% of 30-day budget will be consumed in 3 days.'
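The 14.4, 6, and 1 multipliers are not arbitrary: each is the burn rate that consumes the chosen budget fraction exactly within the long window. A sketch of the derivation (30-day SLO window = 720 hours):

```python
def burn_rate_threshold(budget_fraction, window_hours, slo_window_hours=30 * 24):
    """Burn rate that consumes `budget_fraction` of the error budget
    in `window_hours`, given the total SLO window length."""
    return budget_fraction * slo_window_hours / window_hours

page_fast = burn_rate_threshold(0.02, 1)    # 2% in 1h  -> 14.4
page_slow = burn_rate_threshold(0.05, 6)    # 5% in 6h  -> 6.0
ticket    = burn_rate_threshold(0.10, 72)   # 10% in 3d -> 1.0
```

With these parameters you can tune the alerts to your own tolerance: decide how much budget loss in how much time warrants waking someone up, and the multiplier falls out.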
Prometheus Implementation
SLI Recording Rules
Recording Rules pre-compute error ratios across multiple time windows so that alert rules can reuse them efficiently.
# sli-recording-rules.yaml
groups:
  - name: sli-recording-rules
    interval: 30s
    rules:
      # Error ratio across multiple windows
      - record: sli:error_ratio:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      - record: sli:error_ratio:rate30m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[30m]))
          /
          sum(rate(http_requests_total[30m]))
      - record: sli:error_ratio:rate1h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1h]))
          /
          sum(rate(http_requests_total[1h]))
      - record: sli:error_ratio:rate6h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[6h]))
          /
          sum(rate(http_requests_total[6h]))
      - record: sli:error_ratio:rate3d
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[3d]))
          /
          sum(rate(http_requests_total[3d]))
      # Full SLO window, referenced by the Grafana dashboard panels
      - record: sli:error_ratio:rate30d
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[30d]))
          /
          sum(rate(http_requests_total[30d]))
      # Latency SLI: ratio of responses within 200ms
      - record: sli:latency_good_ratio:rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count[5m]))
      - record: sli:latency_good_ratio:rate1h
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.2"}[1h]))
          /
          sum(rate(http_request_duration_seconds_count[1h]))
Error Budget Remaining Calculation
# Error Budget remaining ratio (30-day window, SLO 99.9%)
# Result of 1 means 100% remaining, 0 means fully exhausted
(
  1 - (
    sum(increase(http_requests_total{status=~"5.."}[30d]))
    /
    sum(increase(http_requests_total[30d]))
  )
  - 0.999
) / 0.001
This query computes the actual error ratio over the past 30 days, subtracts the SLO target (0.999), and divides by the total Error Budget (0.001). A result of 0.5 means 50% of the Error Budget remains.
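The same arithmetic can be checked with plain numbers. A minimal sketch assuming SLO 99.9% (function name is illustrative):

```python
def budget_remaining(availability, slo=0.999):
    """Mirror of the PromQL above: (availability - slo) / (1 - slo)."""
    return (availability - slo) / (1 - slo)

half = budget_remaining(0.9995)   # halfway through the budget
gone = budget_remaining(0.999)    # exactly at the SLO: budget exhausted
over = budget_remaining(0.998)    # below the SLO: negative, budget overspent
```

A negative result means the SLO has already been violated; dashboards typically clamp it at 0 (as the Grafana gauge panel below does with `clamp_min`).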
SLO Reporting Script
#!/usr/bin/env python3
"""SLO Report Generation Script"""
import requests
from datetime import datetime

PROMETHEUS_URL = "http://prometheus:9090"

SLO_CONFIGS = [
    {
        "name": "payment-api-availability",
        "sli_query": 'sum(rate(http_requests_total{job="payment-api",status!~"5.."}[30d])) / sum(rate(http_requests_total{job="payment-api"}[30d]))',
        "slo_target": 0.999,
        "window_days": 30,
    },
    {
        "name": "payment-api-latency",
        "sli_query": 'sum(rate(http_request_duration_seconds_bucket{job="payment-api",le="0.2"}[30d])) / sum(rate(http_request_duration_seconds_count{job="payment-api"}[30d]))',
        "slo_target": 0.995,
        "window_days": 30,
    },
]

def query_prometheus(query: str) -> float:
    """Execute Prometheus instant query"""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": query},
    )
    result = resp.json()["data"]["result"]
    if not result:
        return 0.0
    return float(result[0]["value"][1])

def generate_report():
    """Generate SLO report"""
    print(f"=== SLO Report ({datetime.now().strftime('%Y-%m-%d %H:%M')}) ===\n")
    for config in SLO_CONFIGS:
        current_sli = query_prometheus(config["sli_query"])
        slo_target = config["slo_target"]
        error_budget_total = 1 - slo_target
        # Budget consumed is the actual error ratio (1 - SLI),
        # not the gap between the SLO target and the SLI
        error_budget_consumed = max(0.0, 1 - current_sli)
        error_budget_remaining = max(
            0.0, (error_budget_total - error_budget_consumed) / error_budget_total
        )
        status = "OK" if current_sli >= slo_target else "VIOLATED"
        budget_pct = error_budget_remaining * 100
        print(f"Service: {config['name']}")
        print(f"  SLI (current): {current_sli:.4%}")
        print(f"  SLO (target):  {slo_target:.4%}")
        print(f"  Status: {status}")
        print(f"  Error Budget: {budget_pct:.1f}% remaining")
        if budget_pct < 25:
            print("  WARNING: Error budget below 25%!")
        print()

if __name__ == "__main__":
    generate_report()
Grafana Dashboard
SLO Dashboard Configuration
A Grafana dashboard should provide an at-a-glance view of SLO status. The key panels are:
- Current SLI Value (Stat Panel): Display the current availability SLI as a large number. Use threshold colors to convey status intuitively.
- Error Budget Remaining (Gauge Panel): Show the remaining Error Budget as a gauge. Turn yellow below 50%, red below 25%.
- Burn Rate Trend (Time Series Panel): Graph the Burn Rate over time with alert thresholds displayed as horizontal lines.
- Error Budget Exhaustion Forecast (Time Series Panel): Show the projected date when the Error Budget will be fully exhausted at the current Burn Rate.
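The exhaustion forecast in the last panel can be derived from the remaining budget and the current Burn Rate. A hypothetical sketch (a real dashboard would more likely use a PromQL `predict_linear`-style expression):

```python
from datetime import datetime, timedelta

def exhaustion_date(budget_remaining_fraction, current_burn_rate,
                    window_days=30, now=None):
    """Project when the error budget runs out at the current burn rate.

    Burn rate 1 consumes the whole budget in `window_days`, so the
    remaining fraction lasts remaining * window_days / rate days.
    """
    now = now or datetime.now()
    if current_burn_rate <= 0:
        return None  # not burning; no exhaustion in sight
    days_left = budget_remaining_fraction * window_days / current_burn_rate
    return now + timedelta(days=days_left)

# 40% of the budget left, burning at 4x -> exhausted in 3 days
eta = exhaustion_date(0.4, 4, now=datetime(2024, 1, 1))
```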
# grafana-slo-dashboard.json (panel configuration summary)
panels:
  - title: 'Current Availability SLI'
    type: stat
    targets:
      - expr: '1 - sli:error_ratio:rate30d'
        legendFormat: 'Availability'
    thresholds:
      - value: 0.999
        color: green
      - value: 0.995
        color: yellow
      - value: 0.99
        color: red
  - title: 'Error Budget Remaining'
    type: gauge
    targets:
      - expr: |
          clamp_min(
            (0.001 - sli:error_ratio:rate30d) / 0.001,
            0
          )
        legendFormat: 'Budget Remaining'
    min: 0
    max: 1
    thresholds:
      - value: 0.5
        color: green
      - value: 0.25
        color: yellow
      - value: 0
        color: red
  - title: 'Burn Rate (1h window)'
    type: timeseries
    targets:
      - expr: 'sli:error_ratio:rate1h / 0.001'
        legendFormat: 'Burn Rate (1h)'
    overrides:
      - matcher: 'Burn Rate Threshold'
        properties:
          - id: custom.lineStyle
            value: dash
Organizational Adoption Strategy
Phased Rollout Roadmap
SLI/SLO adoption cannot succeed through technical implementation alone. It requires organization-wide alignment and cultural change.
Phase 1 - Pilot (1-2 months)
- Select 1-2 of the most critical services.
- Start with simple SLIs extractable from existing metrics.
- Set SLOs slightly below current performance so they are achievable.
Phase 2 - Expansion (3-6 months)
- Extend to major services based on pilot results.
- Establish Error Budget policies and secure stakeholder alignment.
- Deploy Burn Rate alerts to production.
Phase 3 - Maturity (6-12 months)
- Apply SLOs to all user-facing services.
- Integrate SLOs into incident review, capacity planning, and release management.
- Institutionalize quarterly SLO review processes.
Key Success Factors
- Executive sponsorship: SLO-driven decisions can include feature freezes, so executive support is essential.
- Product manager involvement: PMs must participate in SLO target decisions. If SLOs become an engineering-only metric, adoption will fail.
- Incremental approach: Do not aim for perfect SLOs from the start. "Wrong but useful" beats "perfect but never shipped."
- Automation investment: Automate SLO report generation, Error Budget tracking, and Burn Rate alerting to minimize operational burden.
Failure Cases and Lessons Learned
Case 1: Overly Aggressive SLO
A team set a 99.99% SLO for an internal service that had no external SLA. The monthly Error Budget was only 4 minutes and 23 seconds, making normal deployments impossible. Every deployment caused a few seconds of downtime, immediately exhausting the Error Budget and effectively halting feature development. The solution was to lower the SLO to 99.9% and adopt zero-downtime deployment strategies (Blue-Green, Canary).
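The arithmetic behind this failure case is stark. A sketch with illustrative numbers (assuming each deployment causes a brief full outage):

```python
def deploys_per_month(slo_pct, downtime_per_deploy_s, days=30.44):
    """How many deployments fit in the monthly error budget
    if each one causes `downtime_per_deploy_s` of full outage."""
    budget_s = days * 24 * 3600 * (1 - slo_pct / 100)
    return budget_s / downtime_per_deploy_s

# 99.99% SLO leaves ~263 seconds of monthly budget; at 10 seconds
# of downtime per deploy, ~26 deploys exhaust it entirely
n = deploys_per_month(99.99, 10)
```

At 99.9% the same calculation allows ten times as many deploys, which is why relaxing the SLO and moving to zero-downtime rollouts resolved the conflict.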
Case 2: Alert Fatigue from Poor Burn Rate Configuration
A team adopted Burn Rate alerts but only used a short window (5 minutes), causing alerts on every brief error spike. With dozens of alerts per day, on-call engineers began ignoring them. The fix was switching to the Multi-Window approach and separating page alerts from ticket alerts.
Case 3: No Policy for Error Budget Exhaustion
The Error Budget was exhausted, but there was no "feature freeze" policy, so the development team continued releasing new features. This led to repeated incidents and user churn. The problem was resolved only after the team drafted a written Error Budget policy and obtained executive approval to enforce it.
Operational Checklist
A checklist for deploying SLI/SLO/Error Budget in production:
- Are SLIs defined for each Critical User Journey (CUJ)?
- Are SLIs automatically computed via Prometheus Recording Rules?
- Have engineering and product teams agreed on SLO targets?
- Is the Error Budget policy documented in writing?
- Are Multi-Window Multi-Burn-Rate alerts configured?
- Are page alerts (immediate response) separated from ticket alerts (business-hours response)?
- Can SLI, Error Budget remaining, and Burn Rate be viewed in a Grafana dashboard?
- Are SLO reports automatically generated weekly and monthly?
- Has the feature freeze process for Error Budget exhaustion been approved by executive leadership?
- Are quarterly SLO review meetings scheduled?
- Is the postmortem process linked to Error Budget exhaustion events?
References
- Google SRE Book - Service Level Objectives
- Google SRE Workbook - Implementing SLOs
- Google SRE Workbook - Alerting on SLOs
- Google SRE Workbook - Error Budget Policy
- Google Cloud Blog - SRE Fundamentals: SLIs vs SLOs vs SLAs
- Sloth - Prometheus SLO Generator
- SLODLC Handbook - SLO Development Life Cycle
- Grafana Cloud - Introduction to SLO