SLO and Error Budget Execution Manual

Why SLOs End Up as Dashboard Numbers

In most organizations, SLO (Service Level Objective) adoption fails in the following pattern: they set a number like "99.9% availability" and put it on a Grafana dashboard, but that number has no influence on release decisions, on-call priorities, or technical debt remediation schedules. SLOs are not measurement tools -- they are decision-making frameworks. Without organizational agreement that feature development stops and reliability work takes priority when the error budget is exhausted, SLOs are just nice-looking numbers.

This manual covers the execution procedures for connecting SLO numbers to actual organizational actions. It makes the Google SRE Workbook's error budget policy (sre.google/workbook/error-budget-policy) concrete enough to apply in practice.

Step 1: Define SLIs -- What to Measure

An SLI (Service Level Indicator) is the raw metric from which an SLO is calculated. You need to define precisely what constitutes a "good request."

SLI Types and Calculation Formulas

| SLI Type | Formula | Suitable Services |
| --- | --- | --- |
| Availability | good_requests / total_requests | API servers, web services |
| Latency | requests_below_threshold / total_requests | User-facing services |
| Throughput | processed_jobs / submitted_jobs | Batch processing, pipelines |
| Correctness | correct_responses / total_responses | ML model serving, search |
| Freshness | fresh_data_reads / total_reads | Cache, data synchronization |

Prometheus-based SLI Collection Setup

# prometheus_rules/sli_recording_rules.yaml
groups:
  - name: sli_availability
    interval: 30s
    rules:
      # API server availability SLI
      # "Good request" = all responses except HTTP 5xx
      - record: sli:api_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)

  - name: sli_latency
    interval: 30s
    rules:
      # Latency SLI
      # "Good request" = requests completed within 300ms
      - record: sli:api_latency:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (service)
          /
          sum(rate(http_request_duration_seconds_count[5m])) by (service)

Common Mistakes When Defining SLIs

# Bad SLI definition examples

# 1. Using server-side health checks as SLI (unrelated to user experience)
bad_sli_1 = "health_check_success_rate"  # Server is alive but responses might be slow

# 2. Success rate including internal retries (differs from actual user perception)
bad_sli_2 = "requests_eventually_succeeded / requests_total"  # Includes retries

# 3. Average latency (p50 might be fine but p99 could be 10 seconds)
bad_sli_3 = "avg(request_duration)"  # Average hides tail latency

# Good SLI definitions
good_sli = {
    "availability": "Ratio of first responses received by users that are non-5xx",
    "latency": "Ratio of responses perceived by users within 300ms",
    "correctness": "Ratio where 3+ of top 5 search results are relevant",
}

Step 2: Set SLO Targets -- How Good Is Good Enough

You should never set an SLO to 100%. 100% means "there must never be any failure," which equates to "never deploy any new features."

SLO Target Calculation Process

def calculate_error_budget(slo_target: float, window_days: int = 30) -> dict:
    """Calculate error budget from SLO target"""
    total_minutes = window_days * 24 * 60
    error_budget_fraction = 1.0 - slo_target
    allowed_bad_minutes = total_minutes * error_budget_fraction

    # Request-based calculation (assuming 1 million requests per day)
    daily_requests = 1_000_000
    total_requests = daily_requests * window_days
    allowed_bad_requests = int(total_requests * error_budget_fraction)

    return {
        "slo_target": f"{slo_target * 100:.2f}%",
        "window_days": window_days,
        "error_budget_fraction": f"{error_budget_fraction * 100:.3f}%",
        "allowed_bad_minutes": round(allowed_bad_minutes, 1),
        "allowed_bad_minutes_per_day": round(allowed_bad_minutes / window_days, 2),
        "allowed_bad_requests_30d": allowed_bad_requests,
    }

# Error budget comparison by SLO
for target in [0.999, 0.995, 0.99, 0.9]:
    budget = calculate_error_budget(target)
    print(f"SLO {budget['slo_target']:>7s}: "
          f"Monthly {budget['allowed_bad_minutes']:>7.1f}min = "
          f"Daily {budget['allowed_bad_minutes_per_day']:>5.2f}min, "
          f"Monthly {budget['allowed_bad_requests_30d']:>8,} bad requests allowed")

Output:

SLO  99.90%: Monthly    43.2min = Daily  1.44min, Monthly   30,000 bad requests allowed
SLO  99.50%: Monthly   216.0min = Daily  7.20min, Monthly  150,000 bad requests allowed
SLO  99.00%: Monthly   432.0min = Daily 14.40min, Monthly  300,000 bad requests allowed
SLO  90.00%: Monthly  4320.0min = Daily 144.00min, Monthly 3,000,000 bad requests allowed

SLO Guidelines by Service Tier

| Service Tier | Availability SLO | Latency SLO (P95) | Rationale |
| --- | --- | --- | --- |
| Tier 1 (Payment, Auth) | 99.95% | 200ms | Directly impacts revenue, immediate business impact on failure |
| Tier 2 (Search, Recommendations) | 99.9% | 500ms | Core user experience, alternative paths exist |
| Tier 3 (Notifications, Logs) | 99.5% | 2s | Delay tolerable, async processing possible |
| Tier 4 (Internal Tools) | 99.0% | 5s | Internal users, maintenance possible outside business hours |
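The tier table above can be encoded as a lookup so that services inherit targets consistently. A minimal sketch -- the tier names and targets are the illustrative values from the table, not a standard taxonomy:

```python
# Illustrative tier-to-SLO lookup based on the table above.
# Tier numbers and targets are examples, not a standard.
TIER_SLOS = {
    1: {"availability": 0.9995, "latency_p95_ms": 200},   # payment, auth
    2: {"availability": 0.999,  "latency_p95_ms": 500},   # search, recommendations
    3: {"availability": 0.995,  "latency_p95_ms": 2000},  # notifications, logs
    4: {"availability": 0.99,   "latency_p95_ms": 5000},  # internal tools
}

def tier_error_budget_minutes(tier: int, window_days: int = 30) -> float:
    """Error budget (in minutes) implied by a tier's availability SLO."""
    slo = TIER_SLOS[tier]["availability"]
    return window_days * 24 * 60 * (1.0 - slo)
```

A Tier 1 service at 99.95% gets 21.6 minutes of budget per 30 days, roughly half of Tier 2's 43.2 minutes.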

Step 3: Burn Rate Alert Design -- When to Respond

We implement the Multi-Window, Multi-Burn-Rate alerting recommended by the Google SRE Workbook. Core concept: the burn rate is how fast the error budget is being consumed relative to the rate that would consume exactly the full budget over the compliance window. A burn rate of 1 exhausts the budget in exactly the window length; a burn rate of 14 exhausts it 14 times faster.

Burn Rate Concept

def explain_burn_rate():
    """Burn rate concept explanation"""
    # burn rate = 1: exhausts the error budget over exactly 30 days
    # burn rate = 2: exhausts the error budget in about 15 days
    # burn rate = 14: exhausts the error budget in about 2.1 days

    examples = {
        "burn_rate_14": {
            "meaning": "Exhausting 30-day budget in 2.1 days",
            "use_case": "Acute incident. Detection needed within 5 minutes",
            "long_window": "1h",
            "short_window": "5m",
        },
        "burn_rate_6": {
            "meaning": "Exhausting 30-day budget in 5 days",
            "use_case": "Significant performance degradation. Detection within 30 minutes",
            "long_window": "6h",
            "short_window": "30m",
        },
        "burn_rate_2": {
            "meaning": "Exhausting 30-day budget in 15 days",
            "use_case": "Gradual quality degradation. Detection within hours",
            "long_window": "3d",
            "short_window": "6h",
        },
        "burn_rate_1": {
            "meaning": "Exhausting 30-day budget in exactly 30 days",
            "use_case": "Review in weekly meeting. No immediate alert needed",
            "long_window": "N/A",
            "short_window": "N/A",
        },
    }
    return examples
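The relationship between an observed error rate, the burn rate, and time to exhaustion can be computed directly. A small sketch under the document's 99.9%/30-day assumptions:

```python
def burn_rate(error_rate: float, slo_target: float = 0.999) -> float:
    """How many times faster than 'exactly on budget' the budget is burning."""
    return error_rate / (1.0 - slo_target)

def days_to_exhaustion(error_rate: float, slo_target: float = 0.999,
                       window_days: int = 30) -> float:
    """Days until a full, untouched budget is gone at this sustained error rate."""
    return window_days / burn_rate(error_rate, slo_target)
```

A sustained 1.4% error rate against a 99.9% SLO is a burn rate of 14, exhausting a fresh 30-day budget in about 2.1 days -- the acute-incident case above.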

Prometheus Alert Rules

# prometheus_rules/slo_alerts.yaml
groups:
  - name: slo_burn_rate_alerts
    rules:
      # === Tier 1: Acute Incident Detection (Burn Rate 14, 1h/5m window) ===
      - alert: SLOBurnRateCritical
        # long window: burn rate 14 or higher over 1h
        # short window: confirm still high over 5m (false positive prevention)
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[1h])) by (service)
                 / sum(rate(http_requests_total[1h])) by (service))
          ) > (14 * (1 - 0.999))
          and
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
                 / sum(rate(http_requests_total[5m])) by (service))
          ) > (14 * (1 - 0.999))
        for: 1m
        labels:
          severity: critical
          slo_window: '1h/5m'
          burn_rate: '14'
        annotations:
          summary: '{{ $labels.service }}: SLO burn rate critical (14x)'
          description: |
            Service {{ $labels.service }} is consuming error budget at 14x
            the SLO rate. The entire budget will be exhausted in approximately 2.1 days.
            Investigate immediately.
          runbook: 'https://wiki.internal/runbook/slo-critical'

      # === Tier 2: Significant Performance Degradation (Burn Rate 6, 6h/30m window) ===
      - alert: SLOBurnRateHigh
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[6h])) by (service)
                 / sum(rate(http_requests_total[6h])) by (service))
          ) > (6 * (1 - 0.999))
          and
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[30m])) by (service)
                 / sum(rate(http_requests_total[30m])) by (service))
          ) > (6 * (1 - 0.999))
        for: 5m
        labels:
          severity: warning
          slo_window: '6h/30m'
          burn_rate: '6'
        annotations:
          summary: '{{ $labels.service }}: SLO burn rate high (6x)'
          description: |
            Service {{ $labels.service }} is consuming error budget at 6x
            the SLO rate. The entire budget will be exhausted in approximately 5 days.

      # === Tier 3: Gradual Quality Degradation (Burn Rate 2, 3d/6h window) ===
      - alert: SLOBurnRateLow
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[3d])) by (service)
                 / sum(rate(http_requests_total[3d])) by (service))
          ) > (2 * (1 - 0.999))
          and
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[6h])) by (service)
                 / sum(rate(http_requests_total[6h])) by (service))
          ) > (2 * (1 - 0.999))
        for: 30m
        labels:
          severity: info
          slo_window: '3d/6h'
          burn_rate: '2'
        annotations:
          summary: '{{ $labels.service }}: SLO burn rate elevated (2x)'

Why Multi-Window

The problem with single windows:

[Single Window: 1h]
10:00 - 10:05  Incident occurs, error rate 50%
10:05 - 10:10  Incident resolved, error rate 0%
...
10:55 - 11:00  Error rate 0%

-> Average error rate over 1h window: approximately 4.2%
-> Exceeds burn rate 14 threshold (0.014) -> Alert fires

But the incident ended 55 minutes ago!
=> False positive. On-call engineer gets paged for an already resolved issue.

[Multi-Window: 1h + 5m]
Long window (1h): Error rate 4.2% -> Exceeds threshold (O)
Short window (5m): Error rate 0% -> Below threshold (X)
=> Both conditions are not met, so no alert fires. Correct decision.
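The scenario above can be checked numerically. A sketch that evaluates both window conditions at 11:00, 55 minutes after the 5-minute, 50%-error incident ended:

```python
def window_error_rate(minutes_bad: float, bad_rate: float,
                      window_minutes: float) -> float:
    """Average error rate over a window containing minutes_bad minutes at bad_rate."""
    return (minutes_bad * bad_rate) / window_minutes

SLO = 0.999
THRESHOLD = 14 * (1 - SLO)  # burn-rate-14 threshold, as in the alert rules

long_rate = window_error_rate(5, 0.50, 60)   # 5 bad minutes still in the 1h window
short_rate = window_error_rate(0, 0.50, 5)   # incident no longer in the 5m window

fires = long_rate > THRESHOLD and short_rate > THRESHOLD
# long_rate ~= 0.042 exceeds the threshold, but short_rate is 0 -> no alert
```

Only the AND of both windows suppresses the stale page while still firing during a live incident (when both rates are elevated).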

Step 4: Error Budget Policy -- What to Do When Exhausted

Error budget policy is the core of SLOs. Setting SLO numbers without this is meaningless.

Error Budget Policy Template

# error_budget_policy.yaml
# This document is agreed upon and signed by engineering leadership, product team, and SRE team.

policy:
  version: '2.0'
  effective_date: '2026-01-15'
  review_cycle: 'quarterly'

  budget_thresholds:
    # Budget >= 50%: Normal operations
    green:
      remaining_budget: '>= 50%'
      actions:
        - 'Feature development to reliability work ratio: 8:2'
        - 'Maintain standard release process'
        - 'Monitor trends in weekly SLO review'

    # Budget 20-50%: Caution
    yellow:
      remaining_budget: '20% ~ 50%'
      actions:
        - 'Feature development to reliability work ratio: 5:5'
        - 'Additional load testing required before releases'
        - 'Add canary stage to all deployments (1% -> 10% -> 50% -> 100%)'
        - 'SLO review twice per week'

    # Budget < 20%: Danger
    red:
      remaining_budget: '< 20%'
      actions:
        - 'Freeze new feature releases'
        - 'All personnel focused on reliability improvement'
        - 'VP approval required for all changes'
        - 'Daily SLO review'
        - 'Write postmortems and track action items'

    # Budget exhausted: Emergency
    exhausted:
      remaining_budget: '<= 0%'
      actions:
        - 'Immediately halt all non-essential deployments'
        - 'Review rollback candidates from changes in the last 30 days'
        - 'Escalate to CTO/VP'
        - 'Daily status reports'
        - 'Maintain feature freeze until window reset'

  exceptions:
    - 'Security patches are deployed immediately regardless of budget status'
    - 'Legal compliance requirements are exempt'
    - 'Emergency fixes to prevent data loss are exempt'

  escalation:
    - level: 'L1 (On-call Engineer)'
      condition: 'Burn rate alert fires'
      response_time: '15 minutes'
    - level: 'L2 (Team Lead)'
      condition: 'Budget < 50%'
      response_time: '1 hour'
    - level: 'L3 (VP Engineering)'
      condition: 'Budget < 20% or exhausted'
      response_time: 'Same day'

Error Budget Remaining Calculation and Reporting

import datetime
import requests
from dataclasses import dataclass

@dataclass
class ErrorBudgetReport:
    service: str
    slo_target: float
    window_days: int
    current_sli: float
    budget_remaining_pct: float
    budget_remaining_minutes: float
    estimated_exhaustion_date: str
    policy_status: str  # green, yellow, red, exhausted

def calculate_error_budget_status(
    prometheus_url: str,
    service: str,
    slo_target: float = 0.999,
    window_days: int = 30,
) -> ErrorBudgetReport:
    """Query current SLI from Prometheus and calculate error budget status"""

    # Query current SLI (30-day window)
    query = f'''
        sum(rate(http_requests_total{{service="{service}",status!~"5.."}}[{window_days}d]))
        /
        sum(rate(http_requests_total{{service="{service}"}}[{window_days}d]))
    '''
    resp = requests.get(
        f"{prometheus_url}/api/v1/query",
        params={"query": query},
        timeout=10,
    )
    resp.raise_for_status()  # fail loudly instead of reporting a bogus SLI
    result = resp.json()["data"]["result"]
    current_sli = float(result[0]["value"][1]) if result else 0.0

    # Calculate error budget
    total_budget = 1.0 - slo_target  # e.g., 0.001
    consumed = max(0.0, 1.0 - current_sli)  # actual error rate over the window
    remaining = max(0.0, total_budget - consumed)
    remaining_pct = (remaining / total_budget) * 100 if total_budget > 0 else 0

    # Convert to time
    total_minutes = window_days * 24 * 60
    remaining_minutes = total_minutes * (remaining / total_budget) if total_budget > 0 else 0

    # Calculate estimated exhaustion date
    if consumed > 0 and remaining > 0:
        burn_rate = consumed / total_budget
        days_to_exhaustion = window_days * (remaining / total_budget) / burn_rate
        exhaustion_date = (
            datetime.date.today() + datetime.timedelta(days=days_to_exhaustion)
        ).isoformat()
    elif remaining <= 0:
        exhaustion_date = "EXHAUSTED"
    else:
        exhaustion_date = "N/A (no errors)"

    # Determine policy status
    if remaining_pct >= 50:
        status = "green"
    elif remaining_pct >= 20:
        status = "yellow"
    elif remaining_pct > 0:
        status = "red"
    else:
        status = "exhausted"

    return ErrorBudgetReport(
        service=service,
        slo_target=slo_target,
        window_days=window_days,
        current_sli=round(current_sli, 6),
        budget_remaining_pct=round(remaining_pct, 2),
        budget_remaining_minutes=round(remaining_minutes, 1),
        estimated_exhaustion_date=exhaustion_date,
        policy_status=status,
    )

Automated Slack Weekly Report

import json
from slack_sdk import WebClient

def send_weekly_slo_report(
    slack_token: str,
    channel: str,
    services: list[str],
    prometheus_url: str,
):
    """Send weekly SLO report to a Slack channel"""
    client = WebClient(token=slack_token)
    reports = []

    for service in services:
        report = calculate_error_budget_status(prometheus_url, service)
        reports.append(report)

    # Status emoji mapping (for Slack)
    status_emoji = {
        "green": ":large_green_circle:",
        "yellow": ":large_yellow_circle:",
        "red": ":red_circle:",
        "exhausted": ":rotating_light:",
    }

    blocks = [
        {"type": "header", "text": {"type": "plain_text", "text": "Weekly SLO Report"}},
        {"type": "divider"},
    ]

    for r in sorted(reports, key=lambda x: x.budget_remaining_pct):
        emoji = status_emoji[r.policy_status]
        blocks.append({
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": (
                    f"{emoji} *{r.service}*\n"
                    f"  SLO: {r.slo_target*100:.2f}% | "
                    f"Current SLI: {r.current_sli*100:.4f}%\n"
                    f"  Budget remaining: {r.budget_remaining_pct:.1f}% "
                    f"({r.budget_remaining_minutes:.0f} min)\n"
                    f"  Estimated exhaustion: {r.estimated_exhaustion_date}"
                ),
            },
        })

    client.chat_postMessage(channel=channel, blocks=blocks)

Step 5: Release Gate Integration -- How SLOs Block Deployments

Automatically block deployments in the CI/CD pipeline based on error budget status.

GitHub Actions Release Gate

# .github/workflows/release-gate.yaml
name: SLO Release Gate
on:
  workflow_call:
    inputs:
      service:
        required: true
        type: string

jobs:
  check-error-budget:
    runs-on: ubuntu-latest
    steps:
      - name: Query error budget status
        id: budget
        run: |
          RESULT=$(curl -s "${{ secrets.PROMETHEUS_URL }}/api/v1/query" \
            --data-urlencode "query=slo:error_budget_remaining_pct{service=\"${{ inputs.service }}\"}" \
            | jq -r '.data.result[0].value[1]')
          echo "remaining_pct=$RESULT" >> $GITHUB_OUTPUT

      - name: Evaluate release gate
        id: gate
        run: |
          BUDGET="${{ steps.budget.outputs.remaining_pct }}"
          echo "Error budget remaining: ${BUDGET}%"

          if (( $(echo "$BUDGET < 20" | bc -l) )); then
            echo "::error::ERROR BUDGET CRITICAL (${BUDGET}%). Release blocked."
            echo "Error budget is below 20%. Complete reliability work first."
            exit 1
          elif (( $(echo "$BUDGET < 50" | bc -l) )); then
            echo "::warning::Error budget at ${BUDGET}%. Canary deployment required."
            echo "canary_required=true" >> $GITHUB_OUTPUT
          else
            echo "Error budget healthy at ${BUDGET}%. Proceeding."
          fi

      - name: Notify Slack on block
        if: failure()
        run: |
          curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H 'Content-Type: application/json' \
            -d "{\"text\":\"Release blocked for ${{ inputs.service }}: error budget < 20%\"}"
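The gate decision itself can be unit-tested outside CI. A Python sketch of the thresholds used in the workflow above (the 20%/50% cutoffs come from the error budget policy; the function name is illustrative):

```python
def release_gate(remaining_pct: float) -> str:
    """Map remaining error budget to a release decision, mirroring the CI gate."""
    if remaining_pct < 20:
        return "block"    # red: freeze releases, reliability work first
    if remaining_pct < 50:
        return "canary"   # yellow: canary stage required
    return "proceed"      # green: standard release process
```

Keeping this logic in one tested function (and calling it from CI) avoids the thresholds silently drifting apart between the workflow and the policy document.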

Step 6: Include SLO Impact in Postmortems

Incident postmortems must include "how much did this impact the SLO." This is the clearest way to quantify the business impact of an incident.

Postmortem SLO Impact Section Template

## SLO Impact Analysis

### Affected SLO

- Service: payment-api
- SLO Target: 99.95% availability (30-day window)
- Pre-incident SLI: 99.97%
- Post-incident SLI: 99.93%

### Error Budget Consumption

- Incident duration: 23 minutes
- Error requests during incident: 12,847
- Budget consumed relative to total window: 28.5%
- Pre-incident budget remaining: 71.2%
- Post-incident budget remaining: 42.7%

### Policy Status Change

- Before incident: GREEN (71.2%)
- After incident: YELLOW (42.7%)
- Action: Add canary stage to releases for the next 2 weeks

### Business Impact

- Failed payment attempts: approximately 3,200
- Estimated revenue loss: approximately $48,000
- Increase in customer inquiries: 127
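The 28.5% figure in the template can be reproduced from the raw numbers. A sketch assuming roughly 90 million requests per 30-day window (about 3M/day -- an illustrative volume, not taken from the incident report):

```python
def incident_budget_consumption_pct(
    bad_requests: int,
    total_window_requests: int,
    slo_target: float,
) -> float:
    """Share of the request-based error budget consumed by a single incident."""
    budget_requests = total_window_requests * (1.0 - slo_target)
    return 100.0 * bad_requests / budget_requests

# 12,847 failed requests vs. a 99.95% SLO over ~90M window requests
pct = incident_budget_consumption_pct(12_847, 90_000_000, 0.9995)
# ~28.5% of the budget gone in one 23-minute incident
```

Stating the consumption this way turns "23 minutes of downtime" into a number the policy can act on directly.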

Troubleshooting

1. SLI Value Stuck at 100%

Cause: Missing metric collection. Error responses are not being recorded as separate metrics.

# How to check
curl -s http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=http_requests_total{status=~"5.."}' | jq '.data.result | length'
# If 0, 5xx metrics are not being collected

# Fix: Check the status code label in the application's metrics middleware

2. Alerts Firing Too Frequently (10+ times per day)

Diagnostic sequence:

  1. Check if the burn rate threshold is too low (setting alerts at burn rate 1 will fire even in normal conditions)
  2. Check if the short window is too short (for a 6h/30m alert, 30m is appropriate; 5m generates too much noise)
  3. Check if the for duration is too short (minimum 1m for critical, 5m for warning)
  4. Review whether the SLO target itself is unrealistic compared to current service levels

3. Error Budget Resets on New Window but Same Problem Repeats

Root cause: Postmortem action items were not completed before the window reset.

Fix: Add a condition to the error budget policy: "If the previous window was exhausted, action item completion rate must be 80% or higher to return to GREEN in the next window."
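The fix above can be expressed as a small policy check. A sketch -- the 80% completion threshold is the one proposed here, and the function and parameter names are hypothetical:

```python
def next_window_status(previous_window_exhausted: bool,
                       action_item_completion: float,
                       remaining_pct: float) -> str:
    """Block the return to GREEN after an exhausted window until postmortem
    action items are at least 80% complete (hypothetical policy rule)."""
    if previous_window_exhausted and action_item_completion < 0.8:
        # Budget may have reset, but the status cap stays in effect.
        return "yellow" if remaining_pct >= 50 else "red"
    if remaining_pct >= 50:
        return "green"
    if remaining_pct >= 20:
        return "yellow"
    return "red" if remaining_pct > 0 else "exhausted"
```

The cap makes the window reset conditional on remediation, so the same incident cannot simply "age out" of the rolling window.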

4. SLI Definition Mismatch Between Teams

Case: The backend team uses "server response time" while the frontend team uses "user-perceived loading time" as their SLI. Same SLO of 99.9% but measuring different things.

Fix: Manage SLI definition documents in an organization-wide shared wiki and conduct joint reviews with product/platform/SRE teams quarterly.

Quiz

Q1. What is the 30-day error budget for SLO 99.9% in minutes? Answer: 43.2 minutes. 30 days x 24 hours x 60 minutes = 43,200 minutes. 43,200 x 0.001 = 43.2 minutes.

Q2. What situation does a burn rate of 14 indicate? Answer: It means errors are occurring at a rate that would exhaust the 30-day error budget in approximately 2.1 days (30/14). This is an acute incident situation requiring immediate response.

Q3. What is the role of the short window in Multi-Window alerting? Answer: It confirms whether the anomaly detected in the long window is still ongoing. It prevents false positives from past incidents that have already ended. The Google SRE Workbook recommends setting the short window to 1/12 of the long window.

Q4. Should security patches also be halted when the error budget is exhausted? Answer: No. Security patches, legal compliance requirements, and emergency fixes to prevent data loss should be deployed regardless of error budget status. These exceptions must be explicitly stated in the error budget policy.

Q5. Why should you never set an SLO to 100%? Answer: An SLO of 100% means an error budget of 0, so no changes can be deployed as long as there is any possibility of errors. This is effectively declaring "we will never deploy new features," completely blocking innovation. Additionally, 100% is realistically unachievable due to inevitable infrastructure failures (hardware failures, network issues).

Q6. Why should you not use "average response time" as an SLI? Answer: Averages hide tail latency. A service with P50 of 100ms and P99 of 10 seconds may show an average of about 200ms, but 1 in 100 users waits 10 seconds. For SLIs, you should use "ratio of responses within a threshold" (e.g., ratio of responses within 300ms).
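The arithmetic behind Q6 is easy to verify. A sketch with 99% of requests at 100ms and 1% at 10 seconds:

```python
# 99 requests at 100ms, 1 request at 10s: P50 is healthy, the tail is not.
latencies_ms = [100] * 99 + [10_000]

mean_ms = sum(latencies_ms) / len(latencies_ms)  # looks healthy at ~199ms
good_ratio = sum(l <= 300 for l in latencies_ms) / len(latencies_ms)  # 0.99
```

The mean of 199ms hides the 10-second tail entirely, while the threshold-based SLI (99% of responses within 300ms) exposes it directly.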
