- Authors
  - Youngju Kim (@fjvbn20031)
- Why SLOs End Up as Dashboard Numbers
- Step 1: Define SLIs -- What to Measure
- Step 2: Set SLO Targets -- How Good Is Good Enough
- Step 3: Burn Rate Alert Design -- When to Respond
- Step 4: Error Budget Policy -- What to Do When Exhausted
- Step 5: Release Gate Integration -- How SLOs Block Deployments
- Step 6: Include SLO Impact in Postmortems
- Troubleshooting
- Quiz
- References

Why SLOs End Up as Dashboard Numbers
In most organizations, SLO (Service Level Objective) adoption fails in the following pattern: they set a number like "99.9% availability" and put it on a Grafana dashboard, but that number has no influence on release decisions, on-call priorities, or technical debt remediation schedules. SLOs are not measurement tools -- they are decision-making frameworks. Without organizational agreement that feature development stops and reliability work takes priority when the error budget is exhausted, SLOs are just nice-looking numbers.
This manual covers the execution procedures for connecting SLO numbers to actual organizational actions. It makes the Google SRE Workbook's error budget policy (sre.google/workbook/error-budget-policy) concrete enough to apply in practice.
Step 1: Define SLIs -- What to Measure
An SLI (Service Level Indicator) is the raw measurement from which an SLO is calculated. You need to define precisely what constitutes a "good request."
SLI Types and Calculation Formulas
| SLI Type | Formula | Suitable Services |
|---|---|---|
| Availability | good_requests / total_requests | API servers, web services |
| Latency | requests_below_threshold / total_requests | User-facing services |
| Throughput | processed_jobs / submitted_jobs | Batch processing, pipelines |
| Correctness | correct_responses / total_responses | ML model serving, search |
| Freshness | fresh_data_reads / total_reads | Cache, data synchronization |
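Every formula in the table has the same shape: good events divided by total events over a window. A minimal sketch of that ratio (the function name and sample counts are illustrative, not from any library):

```python
def sli_ratio(good_events: int, total_events: int) -> float:
    """Generic SLI: fraction of events that met the 'good' criterion."""
    if total_events == 0:
        # No traffic: conventionally treat the SLI as perfect rather than undefined.
        return 1.0
    return good_events / total_events

# Availability: 99,870 non-5xx responses out of 100,000 requests
availability = sli_ratio(99_870, 100_000)  # 0.9987
# Latency: 98,500 requests completed within the 300ms threshold
latency = sli_ratio(98_500, 100_000)       # 0.985
```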
Prometheus-based SLI Collection Setup
```yaml
# prometheus_rules/sli_recording_rules.yaml
groups:
  - name: sli_availability
    interval: 30s
    rules:
      # API server availability SLI
      # "Good request" = all responses except HTTP 5xx
      - record: sli:api_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
  - name: sli_latency
    interval: 30s
    rules:
      # Latency SLI
      # "Good request" = requests completed within 300ms
      - record: sli:api_latency:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (service)
          /
          sum(rate(http_request_duration_seconds_count[5m])) by (service)
```
Common Mistakes When Defining SLIs
```python
# Bad SLI definition examples
# 1. Using server-side health checks as the SLI (unrelated to user experience)
bad_sli_1 = "health_check_success_rate"  # Server is alive, but responses might be slow
# 2. Success rate including internal retries (differs from actual user perception)
bad_sli_2 = "requests_eventually_succeeded / requests_total"  # Includes retries
# 3. Average latency (p50 might be fine while p99 is 10 seconds)
bad_sli_3 = "avg(request_duration)"  # Averages hide tail latency

# Good SLI definitions
good_sli = {
    "availability": "Ratio of first responses received by users that are non-5xx",
    "latency": "Ratio of responses perceived by users within 300ms",
    "correctness": "Ratio of queries where 3+ of the top 5 search results are relevant",
}
```
Step 2: Set SLO Targets -- How Good Is Good Enough
You should never set an SLO to 100%. 100% means "there must never be any failure," which equates to "never deploy any new features."
SLO Target Calculation Process
```python
def calculate_error_budget(slo_target: float, window_days: int = 30) -> dict:
    """Calculate the error budget implied by an SLO target."""
    total_minutes = window_days * 24 * 60
    error_budget_fraction = 1.0 - slo_target
    allowed_bad_minutes = total_minutes * error_budget_fraction

    # Request-based calculation (assuming 1 million requests per day)
    daily_requests = 1_000_000
    total_requests = daily_requests * window_days
    allowed_bad_requests = round(total_requests * error_budget_fraction)

    return {
        "slo_target": f"{slo_target * 100:.2f}%",
        "window_days": window_days,
        "error_budget_fraction": f"{error_budget_fraction * 100:.3f}%",
        "allowed_bad_minutes": round(allowed_bad_minutes, 1),
        "allowed_bad_minutes_per_day": round(allowed_bad_minutes / window_days, 2),
        "allowed_bad_requests_30d": allowed_bad_requests,
    }

# Error budget comparison across SLO targets
for target in [0.999, 0.995, 0.99, 0.9]:
    budget = calculate_error_budget(target)
    print(f"SLO {budget['slo_target']:>7s}: "
          f"Monthly {budget['allowed_bad_minutes']:>7,.1f}min = "
          f"Daily {budget['allowed_bad_minutes_per_day']:>5.2f}min, "
          f"Monthly {budget['allowed_bad_requests_30d']:>8,} bad requests allowed")
```
Output:
```
SLO  99.90%: Monthly    43.2min = Daily  1.44min, Monthly   30,000 bad requests allowed
SLO  99.50%: Monthly   216.0min = Daily  7.20min, Monthly  150,000 bad requests allowed
SLO  99.00%: Monthly   432.0min = Daily 14.40min, Monthly  300,000 bad requests allowed
SLO  90.00%: Monthly 4,320.0min = Daily 144.00min, Monthly 3,000,000 bad requests allowed
```
SLO Guidelines by Service Tier
| Service Tier | Availability SLO | Latency SLO (P95) | Rationale |
|---|---|---|---|
| Tier 1 (Payment, Auth) | 99.95% | 200ms | Directly impacts revenue, immediate business impact on failure |
| Tier 2 (Search, Recommendations) | 99.9% | 500ms | Core user experience, alternative paths exist |
| Tier 3 (Notifications, Logs) | 99.5% | 2s | Delay tolerable, async processing possible |
| Tier 4 (Internal Tools) | 99.0% | 5s | Internal users, maintenance possible outside business hours |
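For review tooling, the tier table can be encoded as a simple lookup; a sketch using the targets above (the `TIER_SLOS` structure and `meets_tier` helper are illustrative, not an existing API):

```python
# Availability SLO and latency SLO (p95, seconds) per tier, from the table above
TIER_SLOS = {
    1: {"availability": 0.9995, "latency_p95_s": 0.2},
    2: {"availability": 0.999,  "latency_p95_s": 0.5},
    3: {"availability": 0.995,  "latency_p95_s": 2.0},
    4: {"availability": 0.99,   "latency_p95_s": 5.0},
}

def meets_tier(tier: int, measured_availability: float, measured_p95_s: float) -> bool:
    """Check whether measured performance satisfies the tier's SLO targets."""
    slo = TIER_SLOS[tier]
    return (measured_availability >= slo["availability"]
            and measured_p95_s <= slo["latency_p95_s"])

print(meets_tier(1, 0.9997, 0.18))  # True: within Tier 1 targets
print(meets_tier(2, 0.9985, 0.4))   # False: availability below 99.9%
```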
Step 3: Burn Rate Alert Design -- When to Respond
This step implements the multi-window, multi-burn-rate alerting recommended by the Google SRE Workbook. The core concept: the burn rate is how many times faster than the sustainable rate the error budget is being consumed — at burn rate N, a budget sized for the full compliance window is exhausted in 1/N of that window.
Burn Rate Concept
```python
def explain_burn_rate():
    """Burn rate reference: recommended window pairs per severity."""
    # burn rate = 1:  exhausts the error budget over exactly 30 days
    # burn rate = 2:  exhausts the error budget in about 15 days
    # burn rate = 14: exhausts the error budget in about 2.1 days
    examples = {
        "burn_rate_14": {
            "meaning": "Exhausting the 30-day budget in 2.1 days",
            "use_case": "Acute incident. Detection needed within 5 minutes",
            "long_window": "1h",
            "short_window": "5m",
        },
        "burn_rate_6": {
            "meaning": "Exhausting the 30-day budget in 5 days",
            "use_case": "Significant performance degradation. Detection within 30 minutes",
            "long_window": "6h",
            "short_window": "30m",
        },
        "burn_rate_2": {
            "meaning": "Exhausting the 30-day budget in 15 days",
            "use_case": "Gradual quality degradation. Detection within hours",
            "long_window": "3d",
            "short_window": "6h",
        },
        "burn_rate_1": {
            "meaning": "Exhausting the 30-day budget in exactly 30 days",
            "use_case": "Review in the weekly meeting. No immediate alert needed",
            "long_window": "N/A",
            "short_window": "N/A",
        },
    }
    return examples
```
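A useful sanity check on these window pairings: at a given burn rate, the long window determines how much budget is already gone by the time the alert can fire. The arithmetic, sketched for a 30-day window:

```python
def burn_rate_math(burn_rate: float, long_window_hours: float, window_days: int = 30) -> dict:
    """Time to budget exhaustion, and budget consumed before the long window fills."""
    window_hours = window_days * 24
    return {
        # Hours until the budget is fully consumed at this burn rate
        "hours_to_exhaustion": window_hours / burn_rate,
        # Fraction of total budget consumed by the time the long window fills up
        "budget_consumed_at_detection": burn_rate * long_window_hours / window_hours,
    }

for br, window_h in [(14, 1), (6, 6), (2, 72)]:
    m = burn_rate_math(br, window_h)
    print(f"burn rate {br:>2}: exhaustion in {m['hours_to_exhaustion'] / 24:.1f} days, "
          f"~{m['budget_consumed_at_detection'] * 100:.1f}% of budget spent at detection")
```

For example, burn rate 14 with a 1h long window exhausts the budget in about 2.1 days but burns only ~2% of it before detection, which is why it pairs with a paging alert.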
Prometheus Alert Rules
```yaml
# prometheus_rules/slo_alerts.yaml
groups:
  - name: slo_burn_rate_alerts
    rules:
      # === Tier 1: Acute Incident Detection (Burn Rate 14, 1h/5m window) ===
      - alert: SLOBurnRateCritical
        # long window: burn rate 14 or higher over 1h
        # short window: confirm still high over 5m (false positive prevention)
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[1h])) by (service)
                 / sum(rate(http_requests_total[1h])) by (service))
          ) > (14 * (1 - 0.999))
          and
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
                 / sum(rate(http_requests_total[5m])) by (service))
          ) > (14 * (1 - 0.999))
        for: 1m
        labels:
          severity: critical
          slo_window: '1h/5m'
          burn_rate: '14'
        annotations:
          summary: '{{ $labels.service }}: SLO burn rate critical (14x)'
          description: |
            Service {{ $labels.service }} is consuming error budget at 14x
            the SLO rate. The entire budget will be exhausted in approximately 2.1 days.
            Investigate immediately.
          runbook: 'https://wiki.internal/runbook/slo-critical'

      # === Tier 2: Significant Performance Degradation (Burn Rate 6, 6h/30m window) ===
      - alert: SLOBurnRateHigh
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[6h])) by (service)
                 / sum(rate(http_requests_total[6h])) by (service))
          ) > (6 * (1 - 0.999))
          and
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[30m])) by (service)
                 / sum(rate(http_requests_total[30m])) by (service))
          ) > (6 * (1 - 0.999))
        for: 5m
        labels:
          severity: warning
          slo_window: '6h/30m'
          burn_rate: '6'
        annotations:
          summary: '{{ $labels.service }}: SLO burn rate high (6x)'
          description: |
            Service {{ $labels.service }} is consuming error budget at 6x
            the SLO rate. The entire budget will be exhausted in approximately 5 days.

      # === Tier 3: Gradual Quality Degradation (Burn Rate 2, 3d/6h window) ===
      - alert: SLOBurnRateLow
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[3d])) by (service)
                 / sum(rate(http_requests_total[3d])) by (service))
          ) > (2 * (1 - 0.999))
          and
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[6h])) by (service)
                 / sum(rate(http_requests_total[6h])) by (service))
          ) > (2 * (1 - 0.999))
        for: 30m
        labels:
          severity: info
          slo_window: '3d/6h'
          burn_rate: '2'
        annotations:
          summary: '{{ $labels.service }}: SLO burn rate elevated (2x)'
```
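The threshold constants in the `expr` blocks above all come from one formula: fire when the measured error rate exceeds burn_rate × (1 − SLO). A quick derivation of the values used:

```python
def alert_threshold(burn_rate: float, slo_target: float) -> float:
    """Error-rate threshold for a burn-rate alert: burn_rate * (1 - SLO)."""
    return burn_rate * (1 - slo_target)

for br in (14, 6, 2):
    # For SLO 99.9% these evaluate to 0.014, 0.006, and 0.002 respectively
    print(f"burn rate {br:>2} at SLO 99.9%: fire when error rate > {alert_threshold(br, 0.999):.3f}")
```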
Why Multi-Window
The problem with single windows:
```
[Single Window: 1h]
10:00 - 10:05  Incident occurs, error rate 50%
10:05 - 10:10  Incident resolved, error rate 0%
...
10:55 - 11:00  Error rate 0%

-> Average error rate over the 1h window: approximately 4.2%
-> Exceeds the burn-rate-14 threshold (0.014) -> alert fires
   But the incident ended 55 minutes ago!
=> False positive. The on-call engineer gets paged for an already-resolved issue.

[Multi-Window: 1h + 5m]
Long window (1h):  error rate 4.2% -> exceeds threshold (yes)
Short window (5m): error rate 0%   -> exceeds threshold (no)
=> The short-window condition is not met, so no alert fires. Correct decision.
```
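The timeline above can be replayed numerically; a toy sketch assuming one error-rate sample per minute:

```python
# Minute-by-minute error rates for the hour 10:00-11:00:
# a 5-minute incident at 50% errors, then fully recovered.
samples = [0.5] * 5 + [0.0] * 55

def window_error_rate(samples: list[float], minutes: int) -> float:
    """Average error rate over the most recent `minutes` samples."""
    return sum(samples[-minutes:]) / minutes

threshold = 14 * (1 - 0.999)                              # 0.014
long_fires = window_error_rate(samples, 60) > threshold   # ~4.2% > 1.4% -> True
short_fires = window_error_rate(samples, 5) > threshold   # 0% -> False
alert = long_fires and short_fires
print(long_fires, short_fires, alert)  # True False False
```

The long window alone would still page at 11:00; requiring both windows suppresses the stale page.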
Step 4: Error Budget Policy -- What to Do When Exhausted
The error budget policy is the core of an SLO practice. Setting SLO numbers without one is meaningless.
Error Budget Policy Template
```yaml
# error_budget_policy.yaml
# This document is agreed upon and signed by engineering leadership, product team, and SRE team.
policy:
  version: '2.0'
  effective_date: '2026-01-15'
  review_cycle: 'quarterly'

  budget_thresholds:
    # Budget >= 50%: Normal operations
    green:
      remaining_budget: '>= 50%'
      actions:
        - 'Feature development to reliability work ratio: 8:2'
        - 'Maintain standard release process'
        - 'Monitor trends in weekly SLO review'
    # Budget 20-50%: Caution
    yellow:
      remaining_budget: '20% ~ 50%'
      actions:
        - 'Feature development to reliability work ratio: 5:5'
        - 'Additional load testing required before releases'
        - 'Add canary stage to all deployments (1% -> 10% -> 50% -> 100%)'
        - 'SLO review twice per week'
    # Budget < 20%: Danger
    red:
      remaining_budget: '< 20%'
      actions:
        - 'Freeze new feature releases'
        - 'All personnel focused on reliability improvement'
        - 'VP approval required for all changes'
        - 'Daily SLO review'
        - 'Write postmortems and track action items'
    # Budget exhausted: Emergency
    exhausted:
      remaining_budget: '<= 0%'
      actions:
        - 'Immediately halt all non-essential deployments'
        - 'Review rollback candidates from changes in the last 30 days'
        - 'Escalate to CTO/VP'
        - 'Daily status reports'
        - 'Maintain feature freeze until window reset'

  exceptions:
    - 'Security patches are deployed immediately regardless of budget status'
    - 'Legal compliance requirements are exempt'
    - 'Emergency fixes to prevent data loss are exempt'

  escalation:
    - level: 'L1 (On-call Engineer)'
      condition: 'Burn rate alert fires'
      response_time: '15 minutes'
    - level: 'L2 (Team Lead)'
      condition: 'Budget < 50%'
      response_time: '1 hour'
    - level: 'L3 (VP Engineering)'
      condition: 'Budget < 20% or exhausted'
      response_time: 'Same day'
```
Error Budget Remaining Calculation and Reporting
```python
import datetime
from dataclasses import dataclass

import requests


@dataclass
class ErrorBudgetReport:
    service: str
    slo_target: float
    window_days: int
    current_sli: float
    budget_remaining_pct: float
    budget_remaining_minutes: float
    estimated_exhaustion_date: str
    policy_status: str  # green, yellow, red, exhausted


def calculate_error_budget_status(
    prometheus_url: str,
    service: str,
    slo_target: float = 0.999,
    window_days: int = 30,
) -> ErrorBudgetReport:
    """Query the current SLI from Prometheus and calculate error budget status."""
    # Query the current SLI over the full compliance window
    query = f'''
    sum(rate(http_requests_total{{service="{service}",status!~"5.."}}[{window_days}d]))
    /
    sum(rate(http_requests_total{{service="{service}"}}[{window_days}d]))
    '''
    resp = requests.get(
        f"{prometheus_url}/api/v1/query",
        params={"query": query},
    )
    result = resp.json()["data"]["result"]
    current_sli = float(result[0]["value"][1]) if result else 0.0

    # Calculate the error budget
    total_budget = 1.0 - slo_target          # e.g. 0.001
    consumed = max(0.0, 1.0 - current_sli)   # actual error rate over the window
    remaining = max(0.0, total_budget - consumed)
    remaining_pct = (remaining / total_budget) * 100 if total_budget > 0 else 0.0

    # Convert to minutes of allowed downtime
    total_minutes = window_days * 24 * 60
    remaining_minutes = total_minutes * (remaining / total_budget) if total_budget > 0 else 0.0

    # Estimate the exhaustion date at the current burn rate
    if consumed > 0 and remaining > 0:
        burn_rate = consumed / total_budget
        days_to_exhaustion = window_days * (remaining / total_budget) / burn_rate
        exhaustion_date = (
            datetime.date.today() + datetime.timedelta(days=days_to_exhaustion)
        ).isoformat()
    elif remaining <= 0:
        exhaustion_date = "EXHAUSTED"
    else:
        exhaustion_date = "N/A (no errors)"

    # Map remaining budget to the policy status
    if remaining_pct >= 50:
        status = "green"
    elif remaining_pct >= 20:
        status = "yellow"
    elif remaining_pct > 0:
        status = "red"
    else:
        status = "exhausted"

    return ErrorBudgetReport(
        service=service,
        slo_target=slo_target,
        window_days=window_days,
        current_sli=round(current_sli, 6),
        budget_remaining_pct=round(remaining_pct, 2),
        budget_remaining_minutes=round(remaining_minutes, 1),
        estimated_exhaustion_date=exhaustion_date,
        policy_status=status,
    )
```
Automated Slack Weekly Report
```python
from slack_sdk import WebClient


def send_weekly_slo_report(
    slack_token: str,
    channel: str,
    services: list[str],
    prometheus_url: str,
) -> None:
    """Send a weekly SLO report to a Slack channel."""
    client = WebClient(token=slack_token)
    reports = [
        calculate_error_budget_status(prometheus_url, service)
        for service in services
    ]

    # Status emoji mapping (for Slack)
    status_emoji = {
        "green": ":large_green_circle:",
        "yellow": ":large_yellow_circle:",
        "red": ":red_circle:",
        "exhausted": ":rotating_light:",
    }

    blocks = [
        {"type": "header", "text": {"type": "plain_text", "text": "Weekly SLO Report"}},
        {"type": "divider"},
    ]
    # Worst budgets first
    for r in sorted(reports, key=lambda x: x.budget_remaining_pct):
        emoji = status_emoji[r.policy_status]
        blocks.append({
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": (
                    f"{emoji} *{r.service}*\n"
                    f"  SLO: {r.slo_target * 100:.2f}% | "
                    f"Current SLI: {r.current_sli * 100:.4f}%\n"
                    f"  Budget remaining: {r.budget_remaining_pct:.1f}% "
                    f"({r.budget_remaining_minutes:.0f} min)\n"
                    f"  Estimated exhaustion: {r.estimated_exhaustion_date}"
                ),
            },
        })

    client.chat_postMessage(channel=channel, blocks=blocks)
```
Step 5: Release Gate Integration -- How SLOs Block Deployments
Automatically block deployments in the CI/CD pipeline based on error budget status.
GitHub Actions Release Gate
```yaml
# .github/workflows/release-gate.yaml
name: SLO Release Gate
on:
  workflow_call:
    inputs:
      service:
        required: true
        type: string

jobs:
  check-error-budget:
    runs-on: ubuntu-latest
    steps:
      - name: Query error budget status
        id: budget
        run: |
          RESULT=$(curl -s "${{ secrets.PROMETHEUS_URL }}/api/v1/query" \
            --data-urlencode "query=slo:error_budget_remaining_pct{service=\"${{ inputs.service }}\"}" \
            | jq -r '.data.result[0].value[1]')
          echo "remaining_pct=$RESULT" >> $GITHUB_OUTPUT

      - name: Evaluate release gate
        id: gate
        run: |
          BUDGET="${{ steps.budget.outputs.remaining_pct }}"
          echo "Error budget remaining: ${BUDGET}%"
          if (( $(echo "$BUDGET < 20" | bc -l) )); then
            echo "::error::ERROR BUDGET CRITICAL (${BUDGET}%). Release blocked."
            echo "Error budget is below 20%. Complete reliability work first."
            exit 1
          elif (( $(echo "$BUDGET < 50" | bc -l) )); then
            echo "::warning::Error budget at ${BUDGET}%. Canary deployment required."
            echo "canary_required=true" >> $GITHUB_OUTPUT
          else
            echo "Error budget healthy at ${BUDGET}%. Proceeding."
          fi

      - name: Notify Slack on block
        if: failure()
        run: |
          curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H 'Content-Type: application/json' \
            -d "{\"text\":\"Release blocked for ${{ inputs.service }}: error budget < 20%\"}"
```
Step 6: Include SLO Impact in Postmortems
Incident postmortems must include "how much did this impact the SLO." This is the clearest way to quantify the business impact of an incident.
Postmortem SLO Impact Section Template
```markdown
## SLO Impact Analysis

### Affected SLO
- Service: payment-api
- SLO Target: 99.95% availability (30-day window)
- Pre-incident SLI: 99.9856%
- Post-incident SLI: 99.9714%

### Error Budget Consumption
- Incident duration: 23 minutes
- Error requests during incident: 12,847
- Budget consumed relative to total window: 28.5%
- Pre-incident budget remaining: 71.2%
- Post-incident budget remaining: 42.7%

### Policy Status Change
- Before incident: GREEN (71.2%)
- After incident: YELLOW (42.7%)
- Action: Add canary stage to releases for the next 2 weeks

### Business Impact
- Failed payment attempts: approximately 3,200
- Estimated revenue loss: approximately $48,000
- Increase in customer inquiries: 127
```
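The "Error Budget Consumption" figures in this template can be computed mechanically; a sketch (the ~90M-request window total is an illustrative assumption, not from the incident above):

```python
def incident_budget_consumption(
    bad_requests: int,
    window_total_requests: int,
    slo_target: float,
) -> float:
    """Percentage of the window's error budget consumed by a single incident."""
    allowed_bad = window_total_requests * (1 - slo_target)
    return 100 * bad_requests / allowed_bad

# Example: 12,847 failed requests against ~90M requests in a 30-day window at 99.95%
pct = incident_budget_consumption(12_847, 90_000_000, 0.9995)
print(f"Budget consumed by incident: {pct:.1f}%")  # prints "Budget consumed by incident: 28.5%"
```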
Troubleshooting
1. SLI Value Stuck at 100%
Cause: Missing metric collection. Error responses are not being recorded as separate metrics.
```shell
# How to check
curl -s http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=http_requests_total{status=~"5.."}' | jq '.data.result | length'
# If this prints 0, 5xx metrics are not being collected.
# Fix: check the status code label in the application's metrics middleware.
```
2. Alerts Firing Too Frequently (10+ times per day)
Diagnostic sequence:
- Check whether the burn rate threshold is too low (alerting at burn rate 1 fires even under normal conditions)
- Check whether the short window is too short (for warning-level alerts, 30m is appropriate; 5m generates too much noise)
- Check whether the `for` duration is too short (minimum 1m for critical, 5m for warning)
- Review whether the SLO target itself is unrealistic compared to current service levels
3. Error Budget Resets on New Window but Same Problem Repeats
Root cause: Postmortem action items were not completed before the window reset.
Fix: Add a condition to the error budget policy: "If the previous window was exhausted, action item completion rate must be 80% or higher to return to GREEN in the next window."
4. SLI Definition Mismatch Between Teams
Case: The backend team uses "server response time" while the frontend team uses "user-perceived loading time" as their SLI. Same SLO of 99.9% but measuring different things.
Fix: Manage SLI definition documents in an organization-wide shared wiki and conduct joint reviews with product/platform/SRE teams quarterly.
Quiz
Q1. What is the 30-day error budget for SLO 99.9% in minutes?
Answer: 43.2 minutes. 30 days x 24 hours x 60 minutes = 43,200 minutes. 43,200 x 0.001 = 43.2
minutes.
Q2. What situation does a burn rate of 14 indicate?
Answer: It means errors are occurring at a rate that would exhaust the 30-day error budget in
approximately 2.1 days (30/14). This is an acute incident situation requiring immediate response.
Q3. What is the role of the short window in Multi-Window alerting?
Answer: It confirms whether the anomaly detected in the long window is still ongoing. It prevents
false positives from past incidents that have already ended. The Google SRE Workbook recommends
setting the short window to 1/12 of the long window.
Q4. Should security patches also be halted when the error budget is exhausted?
Answer: No. Security patches, legal compliance requirements, and emergency fixes to prevent data
loss should be deployed regardless of error budget status. These exceptions must be explicitly
stated in the error budget policy.
Q5. Why should you never set an SLO to 100%?
Answer: An SLO of 100% means an error budget of 0, so no changes can be deployed as long as there
is any possibility of errors. This is effectively declaring "we will never deploy new features,"
completely blocking innovation. Additionally, 100% is realistically unachievable due to inevitable
infrastructure failures (hardware failures, network issues).
Q6. Why should you not use "average response time" as an SLI?
Answer: Averages hide tail latency. A service with P50 of 100ms and P99 of 10 seconds may show an
average of about 200ms, but 1 in 100 users waits 10 seconds. For SLIs, you should use "ratio of
responses within a threshold" (e.g., ratio of responses within 300ms).