SLI/SLO/Error Budget-Based Reliability Engineering: A Practical Guide


Introduction

You have set up Prometheus for metric collection and configured Alertmanager for notifications. Yet every time an incident occurs, the question remains: "How much failure is acceptable?" You configured an alert at 80% CPU utilization, but it fires repeatedly on events that have no real impact on user experience. The team suffers from Alert Fatigue.

Google's SRE team established the SLI (Service Level Indicator), SLO (Service Level Objective), and Error Budget framework to solve this problem at its root. The core idea is straightforward: "Measure service quality from the user's perspective, quantify the acceptable failure range, and use data to balance reliability against development velocity."

This guide walks through the entire reliability management pipeline: SLI/SLO/SLA concepts, SLI selection strategy, SLO target setting, Error Budget policies, Burn Rate alerts, Prometheus-based implementation, Grafana dashboards, and organizational adoption strategy -- all with production-ready code examples.


SLI/SLO/SLA Concepts

SLI (Service Level Indicator)

An SLI is a quantitative measure of service quality. The Google SRE book defines it as the ratio of good events to total events. For example, "the proportion of HTTP requests that responded within 200ms out of all requests" is an SLI. An SLI always falls between 0% and 100%.
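As a minimal sketch of this definition (the request counts are illustrative), an availability SLI is simply good events divided by total events:

```python
def sli(good_events: int, total_events: int) -> float:
    """Availability SLI: ratio of good events to total events (0.0 to 1.0)."""
    if total_events == 0:
        return 1.0  # no traffic: treat as fully available
    return good_events / total_events

# Example: 999,500 of 1,000,000 requests responded within 200ms
print(f"{sli(999_500, 1_000_000):.4%}")  # 99.9500%
```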

SLO (Service Level Objective)

An SLO is a target value for an SLI. "The response time SLI must be at least 99.9% over a 30-day window" is an example of an SLO. SLOs are set by internal engineering teams and represent the balance point between user expectations and engineering cost. A 100% SLO is neither realistic nor desirable.

SLA (Service Level Agreement)

An SLA attaches legal or contractual consequences to an SLO. For instance, "credit refund if availability falls below 99.95%." SLAs include financial compensation or contract termination clauses. An SLA is a business contract; an SLO is an engineering target. SLOs should always be stricter than SLAs. If the SLA is 99.9%, set the SLO to at least 99.95% to maintain a higher internal standard.

Error Budget

Error Budget derives from the SLO. If the SLO is 99.9%, the Error Budget is 0.1%. Over a 30-day window, that translates to roughly 43 minutes of allowable downtime. Error Budget quantifies "how much failure is acceptable" and enables data-driven management of the trade-off between feature development velocity and reliability.
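The 43-minute figure follows directly from the budget arithmetic; a quick check in Python, assuming a flat 30-day window:

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Convert an SLO into allowed downtime (minutes) over the SLO window."""
    error_budget = 1 - slo
    return error_budget * window_days * 24 * 60

print(allowed_downtime_minutes(0.999))  # ~43.2 minutes over 30 days
```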

| Concept      | Definition                            | Example                                     | Owner                 |
|--------------|---------------------------------------|---------------------------------------------|-----------------------|
| SLI          | Quantitative measure of service level | Success request ratio, latency distribution | Engineering           |
| SLO          | Target value for an SLI               | 30-day availability at least 99.9%          | Engineering + Product |
| SLA          | SLO + contractual consequences        | Credit refund if below 99.95%               | Business + Legal      |
| Error Budget | 100% - SLO                            | 0.1% = 43 min downtime per month            | Engineering           |

SLI Selection Strategy

SLIs by Service Type

Different service types require different SLIs. The Google SRE Workbook categorizes them as follows.

| Service Type               | Primary SLIs            | Measurement Method                            |
|----------------------------|-------------------------|-----------------------------------------------|
| Request-driven (API, Web)  | Availability, Latency   | Success response ratio, p50/p99 response time |
| Pipeline (Batch, ETL)      | Freshness, Correctness  | Data processing delay, result accuracy        |
| Storage (DB, Cache)        | Durability, Throughput  | Data loss rate, reads/writes per second       |
| Streaming (Kafka, Pub/Sub) | Latency, Throughput     | End-to-end latency, messages per second       |

SLI Selection Principles

  1. Critical User Journey (CUJ) based: Define SLIs for each key user journey such as login, search, and checkout.
  2. Ease of measurement first: Start with SLIs extractable from existing logs or metrics rather than requiring complex instrumentation.
  3. Synthetic monitoring in parallel: Server-side metrics alone cannot capture CDN cache hit rates or client rendering times. Combine with synthetic monitoring.
  4. Multi-grade SLOs: Set multiple thresholds for a single SLI. Track two objectives simultaneously: "90% of requests under 100ms, 99% of requests under 400ms."
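Principle 4 can be illustrated with a small sketch; the latency samples below are made up for illustration:

```python
# Latency samples in seconds (illustrative data)
latencies = [0.05, 0.06, 0.07, 0.08, 0.09, 0.05, 0.06, 0.07, 0.08, 0.35]

def good_ratio(samples: list[float], threshold: float) -> float:
    """Fraction of requests at or under the latency threshold."""
    return sum(1 for s in samples if s <= threshold) / len(samples)

# Multi-grade SLO: two thresholds tracked on the same latency SLI
fast = good_ratio(latencies, 0.100)  # target: >= 90% under 100ms
ok = good_ratio(latencies, 0.400)    # target: >= 99% under 400ms
print(f"under 100ms: {fast:.0%}, under 400ms: {ok:.0%}")
```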

SLO Target Setting

Allowed Downtime by SLO Level

As SLO targets increase, the allowed downtime decreases exponentially.

| SLO    | Monthly Error Budget | Annual Error Budget |
|--------|----------------------|---------------------|
| 99%    | 7h 18m               | 3d 15h              |
| 99.5%  | 3h 39m               | 1d 19h              |
| 99.9%  | 43m 50s              | 8h 46m              |
| 99.95% | 21m 55s              | 4h 23m              |
| 99.99% | 4m 23s               | 52m 36s             |
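The monthly figures in this table correspond to an average calendar month of 365.25/12 (about 30.44 days) rather than a flat 30; under that assumption, a short script reproduces them:

```python
from datetime import timedelta

MONTH_HOURS = 365.25 / 12 * 24  # average calendar month in hours

for slo in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    budget = 1 - slo
    monthly = timedelta(hours=budget * MONTH_HOURS)
    annual = timedelta(hours=budget * 365.25 * 24)
    print(f"{slo:.2%}: monthly {monthly}, annual {annual}")
```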

SLO Setting Process

  1. Measure current performance: Collect at least two weeks of real data to understand your service's actual reliability.
  2. Survey user expectations: Determine what quality level users actually perceive. Internal tools and external services have different expectations.
  3. Cost-benefit analysis: Evaluate whether the cost of going from 99.9% to 99.99% delivers proportional business value.
  4. Stakeholder alignment: Set a number that product managers, engineering leads, and executives all agree on.
  5. Iterative adjustment: Start with conservative SLOs and review quarterly.

Error Budget Policy

An Error Budget policy documents the specific actions to take when the Error Budget is consumed. The Google SRE Workbook recommends a written policy.

Error Budget Policy Example

# error-budget-policy.yaml
error_budget_policy:
  service: payment-api
  slo_window: 30d
  slo_target: 99.9%

  thresholds:
    - level: green
      budget_remaining: '>50%'
      actions:
        - Normal feature development proceeds
        - Weekly SLO review meeting

    - level: yellow
      budget_remaining: '25%-50%'
      actions:
        - Reliability review required before any feature release
        - Technical debt reduction work runs in parallel
        - Daily SLO review meeting

    - level: orange
      budget_remaining: '10%-25%'
      actions:
        - Halt new feature releases
        - Dedicate 50%+ engineering time to reliability work
        - Execute postmortem analysis immediately

    - level: red
      budget_remaining: '<10%'
      actions:
        - Freeze all feature development
        - Focus all engineering resources on reliability recovery
        - Escalate to executive leadership
        - Maintain freeze until root cause of SLO miss is resolved

Error Budget Exhaustion Response

How an organization responds to Error Budget exhaustion is a litmus test of its reliability culture. Key principles include:

  • Feature Freeze: When the Error Budget is exhausted, reliability recovery takes top priority.
  • Mandatory Postmortem: Conduct a blameless postmortem for every Error Budget exhaustion event.
  • Automation investment: If manual work (Toil) is the root cause, prioritize automation.
  • SLO Review: If Error Budget is repeatedly exhausted, evaluate whether the SLO itself is too aggressive.

Burn Rate Alerts

Burn Rate Concept

Burn Rate is a multiplier indicating how quickly you are consuming your Error Budget. A Burn Rate of 1 means you will exactly exhaust your Error Budget over the SLO window (e.g., 30 days). A Burn Rate of 10 means you will consume the 30-day budget in just 3 days.
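At a constant Burn Rate, time to exhaustion is simply the SLO window divided by the rate; a quick sketch:

```python
def days_to_exhaustion(burn_rate: float, window_days: float = 30) -> float:
    """Days until the error budget is fully consumed at a constant burn rate."""
    return window_days / burn_rate

print(days_to_exhaustion(1))     # 30.0 -> budget lasts exactly the SLO window
print(days_to_exhaustion(10))    # 3.0  -> 30-day budget gone in 3 days
print(days_to_exhaustion(14.4))  # just over 2 days
```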

Alerting Approach Comparison

| Alerting Approach                 | Pros                 | Cons                                 |
|-----------------------------------|----------------------|--------------------------------------|
| Threshold-based (error rate > 1%) | Simple to configure  | Many alerts unrelated to user impact |
| Single-window Burn Rate           | Directly tied to SLO | Overreacts to short spikes           |
| Multi-Window Multi-Burn-Rate      | Precise detection    | Complex configuration                |

Multi-Window Multi-Burn-Rate Alerts

This is the approach recommended by the Google SRE Workbook. It checks two windows simultaneously (long window + short window). The long window reflects actual trends; the short window prevents alerting on already-recovered issues.

# multi-window-burn-rate-alerts.yaml
# Google SRE Workbook recommended alert configuration
groups:
  - name: slo-burn-rate-alerts
    rules:
      # Page alert: 2% budget consumed in 1 hour (Burn Rate 14.4)
      - alert: SLOBurnRateHigh_Page
        expr: |
          (
            sli:error_ratio:rate1h > (14.4 * 0.001)
            and
            sli:error_ratio:rate5m > (14.4 * 0.001)
          )
        for: 2m
        labels:
          severity: page
        annotations:
          summary: 'Error budget burn rate is very high'
          description: 'Service is consuming error budget 14.4x faster than normal. 2% of 30-day budget will be consumed in 1 hour.'

      # Page alert: 5% budget consumed in 6 hours (Burn Rate 6)
      - alert: SLOBurnRateMedium_Page
        expr: |
          (
            sli:error_ratio:rate6h > (6 * 0.001)
            and
            sli:error_ratio:rate30m > (6 * 0.001)
          )
        for: 5m
        labels:
          severity: page
        annotations:
          summary: 'Error budget burn rate is elevated'
          description: 'Service is consuming error budget 6x faster than normal. 5% of 30-day budget will be consumed in 6 hours.'

      # Ticket alert: 10% budget consumed in 3 days (Burn Rate 1)
      - alert: SLOBurnRateSlow_Ticket
        expr: |
          (
            sli:error_ratio:rate3d > (1 * 0.001)
            and
            sli:error_ratio:rate6h > (1 * 0.001)
          )
        for: 30m
        labels:
          severity: ticket
        annotations:
          summary: 'Error budget burn rate indicates slow degradation'
          description: 'Service is steadily consuming error budget. 10% of 30-day budget will be consumed in 3 days.'
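The Burn Rate multipliers in the rules above are not arbitrary: each follows from burn_rate = budget_fraction x slo_window / alert_window. A quick check of the three thresholds:

```python
def burn_rate_for(budget_fraction: float, window_hours: float,
                  slo_window_hours: float = 30 * 24) -> float:
    """Burn rate that consumes budget_fraction of the budget within window_hours."""
    return budget_fraction * slo_window_hours / window_hours

print(burn_rate_for(0.02, 1))   # 14.4 -> page alert, 1h window
print(burn_rate_for(0.05, 6))   # 6.0  -> page alert, 6h window
print(burn_rate_for(0.10, 72))  # 1.0  -> ticket alert, 3d window
```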

Prometheus Implementation

SLI Recording Rules

Recording Rules pre-compute error ratios across multiple time windows so that alert rules can reuse them efficiently.

# sli-recording-rules.yaml
groups:
  - name: sli-recording-rules
    interval: 30s
    rules:
      # Error ratio across multiple windows
      - record: sli:error_ratio:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      - record: sli:error_ratio:rate30m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[30m]))
          /
          sum(rate(http_requests_total[30m]))

      - record: sli:error_ratio:rate1h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1h]))
          /
          sum(rate(http_requests_total[1h]))

      - record: sli:error_ratio:rate6h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[6h]))
          /
          sum(rate(http_requests_total[6h]))

      - record: sli:error_ratio:rate3d
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[3d]))
          /
          sum(rate(http_requests_total[3d]))

      # Latency SLI: ratio of responses within 200ms
      - record: sli:latency_good_ratio:rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count[5m]))

      - record: sli:latency_good_ratio:rate1h
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.2"}[1h]))
          /
          sum(rate(http_request_duration_seconds_count[1h]))

Error Budget Remaining Calculation

# Error Budget remaining ratio (30-day window, SLO 99.9%)
# Result of 1 means 100% remaining, 0 means fully exhausted
(
  1 - (
    sum(increase(http_requests_total{status=~"5.."}[30d]))
    /
    sum(increase(http_requests_total[30d]))
  )
  - 0.999
) / 0.001

This query computes the actual availability over the past 30 days (1 minus the error ratio), subtracts the SLO target (0.999), and divides by the total Error Budget (0.001). A result of 0.5 means 50% of the Error Budget remains.
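The same arithmetic in Python form, with an illustrative error ratio:

```python
def budget_remaining(error_ratio_30d: float, slo: float = 0.999) -> float:
    """Fraction of the error budget remaining; mirrors the PromQL query above."""
    availability = 1 - error_ratio_30d
    return (availability - slo) / (1 - slo)

# 0.05% errors against a 99.9% SLO leaves half the budget
print(budget_remaining(0.0005))  # ~0.5
```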

SLO Reporting Script

#!/usr/bin/env python3
"""SLO Report Generation Script"""

import requests
from datetime import datetime, timedelta

PROMETHEUS_URL = "http://prometheus:9090"
SLO_CONFIGS = [
    {
        "name": "payment-api-availability",
        "sli_query": 'sum(rate(http_requests_total{job="payment-api",status!~"5.."}[30d])) / sum(rate(http_requests_total{job="payment-api"}[30d]))',
        "slo_target": 0.999,
        "window_days": 30,
    },
    {
        "name": "payment-api-latency",
        "sli_query": 'sum(rate(http_request_duration_seconds_bucket{job="payment-api",le="0.2"}[30d])) / sum(rate(http_request_duration_seconds_count{job="payment-api"}[30d]))',
        "slo_target": 0.995,
        "window_days": 30,
    },
]


def query_prometheus(query: str) -> float:
    """Execute a Prometheus instant query and return the scalar result."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": query},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        return 0.0
    return float(result[0]["value"][1])


def generate_report():
    """Generate SLO report"""
    print(f"=== SLO Report ({datetime.now().strftime('%Y-%m-%d %H:%M')}) ===\n")

    for config in SLO_CONFIGS:
        current_sli = query_prometheus(config["sli_query"])
        slo_target = config["slo_target"]
        error_budget_total = 1 - slo_target
        # Budget consumed is the actual error fraction (1 - SLI), not SLO - SLI
        error_budget_consumed = 1 - current_sli
        error_budget_remaining = max(
            0, (error_budget_total - error_budget_consumed) / error_budget_total
        )

        status = "OK" if current_sli >= slo_target else "VIOLATED"
        budget_pct = error_budget_remaining * 100

        print(f"Service: {config['name']}")
        print(f"  SLI (current): {current_sli:.4%}")
        print(f"  SLO (target):  {slo_target:.4%}")
        print(f"  Status:        {status}")
        print(f"  Error Budget:  {budget_pct:.1f}% remaining")

        if budget_pct < 25:
            print(f"  WARNING: Error budget below 25%!")
        print()


if __name__ == "__main__":
    generate_report()

Grafana Dashboard

SLO Dashboard Configuration

A Grafana dashboard should provide an at-a-glance view of SLO status. The key panels are:

  1. Current SLI Value (Stat Panel): Display the current availability SLI as a large number. Use threshold colors to convey status intuitively.
  2. Error Budget Remaining (Gauge Panel): Show the remaining Error Budget as a gauge. Turn yellow below 50%, red below 25%.
  3. Burn Rate Trend (Time Series Panel): Graph the Burn Rate over time with alert thresholds displayed as horizontal lines.
  4. Error Budget Exhaustion Forecast (Time Series Panel): Show the projected date when the Error Budget will be fully exhausted at the current Burn Rate.
# grafana-slo-dashboard.json (panel configuration summary)
# Note: sli:error_ratio:rate30d assumes an additional 30d recording rule,
# defined analogously to the shorter-window rules in the previous section.
panels:
  - title: 'Current Availability SLI'
    type: stat
    targets:
      - expr: '1 - sli:error_ratio:rate30d'
        legendFormat: 'Availability'
    thresholds:
      - value: 0.999
        color: green
      - value: 0.995
        color: yellow
      - value: 0.99
        color: red

  - title: 'Error Budget Remaining'
    type: gauge
    targets:
      - expr: |
          clamp_min(
            (0.001 - sli:error_ratio:rate30d) / 0.001,
            0
          )
        legendFormat: 'Budget Remaining'
    min: 0
    max: 1
    thresholds:
      - value: 0.5
        color: green
      - value: 0.25
        color: yellow
      - value: 0
        color: red

  - title: 'Burn Rate (1h window)'
    type: timeseries
    targets:
      - expr: 'sli:error_ratio:rate1h / 0.001'
        legendFormat: 'Burn Rate (1h)'
    overrides:
      - matcher: 'Burn Rate Threshold'
        properties:
          - id: custom.lineStyle
            value: dash

Organizational Adoption Strategy

Phased Rollout Roadmap

SLI/SLO adoption cannot succeed through technical implementation alone. It requires organization-wide alignment and cultural change.

Phase 1 - Pilot (1-2 months)

  • Select 1-2 of the most critical services.
  • Start with simple SLIs extractable from existing metrics.
  • Set SLOs slightly below current performance so they are achievable.

Phase 2 - Expansion (3-6 months)

  • Extend to major services based on pilot results.
  • Establish Error Budget policies and secure stakeholder alignment.
  • Deploy Burn Rate alerts to production.

Phase 3 - Maturity (6-12 months)

  • Apply SLOs to all user-facing services.
  • Integrate SLOs into incident review, capacity planning, and release management.
  • Institutionalize quarterly SLO review processes.

Key Success Factors

  • Executive sponsorship: SLO-driven decisions can include feature freezes, so executive support is essential.
  • Product manager involvement: PMs must participate in SLO target decisions. If SLOs become an engineering-only metric, adoption will fail.
  • Incremental approach: Do not aim for perfect SLOs from the start. "Wrong but useful" beats "perfect but never shipped."
  • Automation investment: Automate SLO report generation, Error Budget tracking, and Burn Rate alerting to minimize operational burden.

Failure Cases and Lessons Learned

Case 1: Overly Aggressive SLO

A team set a 99.99% SLO for an internal service that had no external SLA. The monthly Error Budget was only 4 minutes and 23 seconds, making normal deployments impossible. Every deployment caused a few seconds of downtime, immediately exhausting the Error Budget and effectively halting feature development. The solution was to lower the SLO to 99.9% and adopt zero-downtime deployment strategies (Blue-Green, Canary).

Case 2: Alert Fatigue from Poor Burn Rate Configuration

A team adopted Burn Rate alerts but only used a short window (5 minutes), causing alerts on every brief error spike. With dozens of alerts per day, on-call engineers began ignoring them. The fix was switching to the Multi-Window approach and separating page alerts from ticket alerts.

Case 3: No Policy for Error Budget Exhaustion

The Error Budget was exhausted, but there was no "feature freeze" policy, so the development team continued releasing new features. This led to repeated incidents and user churn. The problem was resolved only after the team drafted a written Error Budget policy and obtained executive approval to enforce it.


Operational Checklist

A checklist for deploying SLI/SLO/Error Budget in production:

  • Are SLIs defined for each Critical User Journey (CUJ)?
  • Are SLIs automatically computed via Prometheus Recording Rules?
  • Have engineering and product teams agreed on SLO targets?
  • Is the Error Budget policy documented in writing?
  • Are Multi-Window Multi-Burn-Rate alerts configured?
  • Are page alerts (immediate response) separated from ticket alerts (business-hours response)?
  • Can SLI, Error Budget remaining, and Burn Rate be viewed in a Grafana dashboard?
  • Are SLO reports automatically generated weekly and monthly?
  • Has the feature freeze process for Error Budget exhaustion been approved by executive leadership?
  • Are quarterly SLO review meetings scheduled?
  • Is the postmortem process linked to Error Budget exhaustion events?

References