SLO and Error Budget Execution Manual
- Why SLOs End Up as Dashboard Numbers
- Step 1: Define SLIs -- What to Measure
- Step 2: Set SLO Targets -- How Good Is Good Enough
- Step 3: Burn Rate Alert Design -- When to Respond
- Step 4: Error Budget Policy -- What to Do When Exhausted
- Step 5: Release Gate Integration -- How SLOs Block Deployments
- Step 6: Include SLO Impact in Postmortems
- Troubleshooting
- Quiz
- References

Why SLOs End Up as Dashboard Numbers
In most organizations, SLO (Service Level Objective) adoption fails in the following pattern: they set a number like "99.9% availability" and put it on a Grafana dashboard, but that number has no influence on release decisions, on-call priorities, or technical debt remediation schedules. SLOs are not measurement tools -- they are decision-making frameworks. Without organizational agreement that feature development stops and reliability work takes priority when the error budget is exhausted, SLOs are just nice-looking numbers.
This manual covers the execution procedures for connecting SLO numbers to actual organizational actions. It concretizes the Google SRE Workbook's error budget policy (sre.google/workbook/error-budget-policy) to a level that can be applied in practice.
Step 1: Define SLIs -- What to Measure
SLI (Service Level Indicator) is the raw metric used to calculate SLOs. You need to clearly define "what constitutes a good request."
SLI Types and Calculation Formulas
| SLI Type | Formula | Suitable Services |
|---|---|---|
| Availability | good_requests / total_requests | API servers, web services |
| Latency | requests_below_threshold / total_requests | User-facing services |
| Throughput | processed_jobs / submitted_jobs | Batch processing, pipelines |
| Correctness | correct_responses / total_responses | ML model serving, search |
| Freshness | fresh_data_reads / total_reads | Cache, data synchronization |
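Every row of the table shares the same "good events over total events" shape; a minimal sketch (the function name and counts are illustrative, not from any particular library):

```python
def sli_ratio(good_events: int, total_events: int) -> float:
    """Generic event-based SLI: the fraction of events that were 'good'.

    Any row of the table reduces to this once you define the good event
    (non-5xx response, sub-300ms request, fresh read, ...).
    """
    if total_events == 0:
        # No traffic in the interval; treating it as compliant is one
        # common convention (some teams exclude empty intervals instead).
        return 1.0
    return good_events / total_events

# 999,412 non-5xx responses out of 1,000,000 requests:
print(f"{sli_ratio(999_412, 1_000_000):.6f}")  # -> 0.999412
```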
Prometheus-based SLI Collection Setup
# prometheus_rules/sli_recording_rules.yaml
groups:
  - name: sli_availability
    interval: 30s
    rules:
      # API server availability SLI
      # "Good request" = all responses except HTTP 5xx
      - record: sli:api_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
  - name: sli_latency
    interval: 30s
    rules:
      # Latency SLI
      # "Good request" = requests completed within 300ms
      - record: sli:api_latency:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (service)
          /
          sum(rate(http_request_duration_seconds_count[5m])) by (service)
Common Mistakes When Defining SLIs
# Bad SLI definition examples
# 1. Using server-side health checks as SLI (unrelated to user experience)
bad_sli_1 = "health_check_success_rate"  # Server is alive but responses might be slow

# 2. Success rate including internal retries (differs from actual user perception)
bad_sli_2 = "requests_eventually_succeeded / requests_total"  # Includes retries

# 3. Average latency (p50 might be fine but p99 could be 10 seconds)
bad_sli_3 = "avg(request_duration)"  # Average hides tail latency

# Good SLI definitions
good_sli = {
    "availability": "Ratio of first responses received by users that are non-5xx",
    "latency": "Ratio of responses perceived by users within 300ms",
    "correctness": "Ratio where 3+ of top 5 search results are relevant",
}
Step 2: Set SLO Targets -- How Good Is Good Enough
You should never set an SLO to 100%. 100% means "there must never be any failure," which equates to "never deploy any new features."
SLO Target Calculation Process
def calculate_error_budget(slo_target: float, window_days: int = 30) -> dict:
    """Calculate error budget from SLO target"""
    total_minutes = window_days * 24 * 60
    error_budget_fraction = 1.0 - slo_target
    allowed_bad_minutes = total_minutes * error_budget_fraction

    # Request-based calculation (assuming 1 million requests per day)
    daily_requests = 1_000_000
    total_requests = daily_requests * window_days
    allowed_bad_requests = int(total_requests * error_budget_fraction)

    return {
        "slo_target": f"{slo_target * 100:.2f}%",
        "window_days": window_days,
        "error_budget_fraction": f"{error_budget_fraction * 100:.3f}%",
        "allowed_bad_minutes": round(allowed_bad_minutes, 1),
        "allowed_bad_minutes_per_day": round(allowed_bad_minutes / window_days, 2),
        "allowed_bad_requests_30d": allowed_bad_requests,
    }

# Error budget comparison by SLO
for target in [0.999, 0.995, 0.99, 0.9]:
    budget = calculate_error_budget(target)
    print(f"SLO {budget['slo_target']:>7s}: "
          f"Monthly {budget['allowed_bad_minutes']:>7,.1f}min = "
          f"Daily {budget['allowed_bad_minutes_per_day']:>5.2f}min, "
          f"Monthly {budget['allowed_bad_requests_30d']:>8,} bad requests allowed")
Output:
SLO  99.90%: Monthly    43.2min = Daily  1.44min, Monthly   30,000 bad requests allowed
SLO  99.50%: Monthly   216.0min = Daily  7.20min, Monthly  150,000 bad requests allowed
SLO  99.00%: Monthly   432.0min = Daily 14.40min, Monthly  300,000 bad requests allowed
SLO  90.00%: Monthly 4,320.0min = Daily 144.00min, Monthly 3,000,000 bad requests allowed
SLO Guidelines by Service Tier
| Service Tier | Availability SLO | Latency SLO (P95) | Rationale |
|---|---|---|---|
| Tier 1 (Payment, Auth) | 99.95% | 200ms | Directly impacts revenue, immediate business impact on failure |
| Tier 2 (Search, Recommendations) | 99.9% | 500ms | Core user experience, alternative paths exist |
| Tier 3 (Notifications, Logs) | 99.5% | 2s | Delay tolerable, async processing possible |
| Tier 4 (Internal Tools) | 99.0% | 5s | Internal users, maintenance possible outside business hours |
Step 3: Burn Rate Alert Design -- When to Respond
We implement the Multi-Window, Multi-Burn-Rate alerting recommended by the Google SRE Workbook. Core concept: burn rate represents "if errors continue at the current rate, how many times faster will the error budget be exhausted within the compliance window."
Burn Rate Concept
def explain_burn_rate():
    """Burn rate concept explanation"""
    # burn rate = 1: exhausts error budget over exactly 30 days
    # burn rate = 14: exhausts error budget in about 2.1 days
    # burn rate = 2: exhausts error budget in about 15 days
    examples = {
        "burn_rate_14": {
            "meaning": "Exhausting 30-day budget in 2.1 days",
            "use_case": "Acute incident. Detection needed within 5 minutes",
            "long_window": "1h",
            "short_window": "5m",
        },
        "burn_rate_6": {
            "meaning": "Exhausting 30-day budget in 5 days",
            "use_case": "Significant performance degradation. Detection within 30 minutes",
            "long_window": "6h",
            "short_window": "30m",
        },
        "burn_rate_2": {
            "meaning": "Exhausting 30-day budget in 15 days",
            "use_case": "Gradual quality degradation. Detection within hours",
            "long_window": "3d",
            "short_window": "6h",
        },
        "burn_rate_1": {
            "meaning": "Exhausting 30-day budget in exactly 30 days",
            "use_case": "Review in weekly meeting. No immediate alert needed",
            "long_window": "N/A",
            "short_window": "N/A",
        },
    }
    return examples
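The arithmetic behind these figures is straightforward; a small sketch, assuming the 30-day window used throughout this manual (function names are illustrative):

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than budget-neutral the budget is burning.

    burn rate = observed error rate / allowed error rate (1 - SLO target).
    """
    return observed_error_rate / (1.0 - slo_target)

def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """At a constant burn rate, a full budget lasts window / rate days."""
    return window_days / rate

# A sustained 1.4% error rate against a 99.9% SLO:
r = burn_rate(0.014, 0.999)
print(round(r, 1), round(days_to_exhaustion(r), 1))  # -> 14.0 2.1
```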
Prometheus Alert Rules
# prometheus_rules/slo_alerts.yaml
groups:
  - name: slo_burn_rate_alerts
    rules:
      # === Tier 1: Acute Incident Detection (Burn Rate 14, 1h/5m window) ===
      - alert: SLOBurnRateCritical
        # long window: burn rate 14 or higher over 1h
        # short window: confirm still high over 5m (false positive prevention)
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[1h])) by (service)
                 / sum(rate(http_requests_total[1h])) by (service))
          ) > (14 * (1 - 0.999))
          and
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
                 / sum(rate(http_requests_total[5m])) by (service))
          ) > (14 * (1 - 0.999))
        for: 1m
        labels:
          severity: critical
          slo_window: '1h/5m'
          burn_rate: '14'
        annotations:
          summary: '{{ $labels.service }}: SLO burn rate critical (14x)'
          description: |
            Service {{ $labels.service }} is consuming error budget at 14x
            the SLO rate. The entire budget will be exhausted in approximately 2.1 days.
            Investigate immediately.
          runbook: 'https://wiki.internal/runbook/slo-critical'
      # === Tier 2: Significant Performance Degradation (Burn Rate 6, 6h/30m window) ===
      - alert: SLOBurnRateHigh
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[6h])) by (service)
                 / sum(rate(http_requests_total[6h])) by (service))
          ) > (6 * (1 - 0.999))
          and
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[30m])) by (service)
                 / sum(rate(http_requests_total[30m])) by (service))
          ) > (6 * (1 - 0.999))
        for: 5m
        labels:
          severity: warning
          slo_window: '6h/30m'
          burn_rate: '6'
        annotations:
          summary: '{{ $labels.service }}: SLO burn rate high (6x)'
          description: |
            Service {{ $labels.service }} is consuming error budget at 6x
            the SLO rate. The entire budget will be exhausted in approximately 5 days.
      # === Tier 3: Gradual Quality Degradation (Burn Rate 2, 3d/6h window) ===
      - alert: SLOBurnRateLow
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[3d])) by (service)
                 / sum(rate(http_requests_total[3d])) by (service))
          ) > (2 * (1 - 0.999))
          and
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[6h])) by (service)
                 / sum(rate(http_requests_total[6h])) by (service))
          ) > (2 * (1 - 0.999))
        for: 30m
        labels:
          severity: info
          slo_window: '3d/6h'
          burn_rate: '2'
        annotations:
          summary: '{{ $labels.service }}: SLO burn rate elevated (2x)'
Why Multi-Window
The problem with single windows:
[Single Window: 1h]
10:00 - 10:05 Incident occurs, error rate 50%
10:05 - 10:10 Incident resolved, error rate 0%
...
10:55 - 11:00 Error rate 0%
-> Average error rate over 1h window: approximately 4.2%
-> Exceeds burn rate 14 threshold (0.014) -> Alert fires
But the incident ended 55 minutes ago!
=> False positive. On-call engineer gets paged for an already resolved issue.
[Multi-Window: 1h + 5m]
Long window (1h): error rate 4.2% -> exceeds threshold (condition met)
Short window (5m): error rate 0% -> below threshold (condition not met)
=> The short-window condition fails, so the AND does not hold and no alert fires. Correct decision.
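The two-window decision above can be sketched as a single predicate (names and the 99.9%/14x defaults are illustrative, mirroring the critical alert):

```python
def should_alert(long_window_error_rate: float,
                 short_window_error_rate: float,
                 slo_target: float = 0.999,
                 burn_rate_threshold: float = 14.0) -> bool:
    """Multi-window check: the budget must be burning over the long window
    AND still burning right now (short window)."""
    threshold = burn_rate_threshold * (1.0 - slo_target)  # 0.014 for 14x / 99.9%
    return (long_window_error_rate > threshold
            and short_window_error_rate > threshold)

# Incident already over: long window elevated, short window clean -> no page
print(should_alert(0.042, 0.0))   # -> False
# Incident ongoing: both windows elevated -> page
print(should_alert(0.042, 0.05))  # -> True
```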
Step 4: Error Budget Policy -- What to Do When Exhausted
Error budget policy is the core of SLOs. Setting SLO numbers without this is meaningless.
Error Budget Policy Template
# error_budget_policy.yaml
# This document is agreed upon and signed by engineering leadership, product team, and SRE team.
policy:
  version: '2.0'
  effective_date: '2026-01-15'
  review_cycle: 'quarterly'

  budget_thresholds:
    # Budget >= 50%: Normal operations
    green:
      remaining_budget: '>= 50%'
      actions:
        - 'Feature development to reliability work ratio: 8:2'
        - 'Maintain standard release process'
        - 'Monitor trends in weekly SLO review'
    # Budget 20-50%: Caution
    yellow:
      remaining_budget: '20% ~ 50%'
      actions:
        - 'Feature development to reliability work ratio: 5:5'
        - 'Additional load testing required before releases'
        - 'Add canary stage to all deployments (1% -> 10% -> 50% -> 100%)'
        - 'SLO review twice per week'
    # Budget < 20%: Danger
    red:
      remaining_budget: '< 20%'
      actions:
        - 'Freeze new feature releases'
        - 'All personnel focused on reliability improvement'
        - 'VP approval required for all changes'
        - 'Daily SLO review'
        - 'Write postmortems and track action items'
    # Budget exhausted: Emergency
    exhausted:
      remaining_budget: '<= 0%'
      actions:
        - 'Immediately halt all non-essential deployments'
        - 'Review rollback candidates from changes in the last 30 days'
        - 'Escalate to CTO/VP'
        - 'Daily status reports'
        - 'Maintain feature freeze until window reset'

  exceptions:
    - 'Security patches are deployed immediately regardless of budget status'
    - 'Legal compliance requirements are exempt'
    - 'Emergency fixes to prevent data loss are exempt'

  escalation:
    - level: 'L1 (On-call Engineer)'
      condition: 'Burn rate alert fires'
      response_time: '15 minutes'
    - level: 'L2 (Team Lead)'
      condition: 'Budget < 50%'
      response_time: '1 hour'
    - level: 'L3 (VP Engineering)'
      condition: 'Budget < 20% or exhausted'
      response_time: 'Same day'
Error Budget Remaining Calculation and Reporting
import datetime
import requests
from dataclasses import dataclass


@dataclass
class ErrorBudgetReport:
    service: str
    slo_target: float
    window_days: int
    current_sli: float
    budget_remaining_pct: float
    budget_remaining_minutes: float
    estimated_exhaustion_date: str
    policy_status: str  # green, yellow, red, exhausted


def calculate_error_budget_status(
    prometheus_url: str,
    service: str,
    slo_target: float = 0.999,
    window_days: int = 30,
) -> ErrorBudgetReport:
    """Query current SLI from Prometheus and calculate error budget status"""
    # Query current SLI (30-day window)
    query = f'''
    sum(rate(http_requests_total{{service="{service}",status!~"5.."}}[{window_days}d]))
    /
    sum(rate(http_requests_total{{service="{service}"}}[{window_days}d]))
    '''
    resp = requests.get(
        f"{prometheus_url}/api/v1/query",
        params={"query": query},
        timeout=10,
    )
    result = resp.json()["data"]["result"]
    current_sli = float(result[0]["value"][1]) if result else 0.0

    # Calculate error budget
    total_budget = 1.0 - slo_target  # e.g., 0.001
    consumed = max(0.0, 1.0 - current_sli)  # actual error rate over the window
    remaining = max(0.0, total_budget - consumed)
    remaining_pct = (remaining / total_budget) * 100 if total_budget > 0 else 0

    # Convert to time
    total_minutes = window_days * 24 * 60
    remaining_minutes = total_minutes * (remaining / total_budget) if total_budget > 0 else 0

    # Calculate estimated exhaustion date
    if consumed > 0 and remaining > 0:
        burn_rate = consumed / total_budget  # fraction of budget consumed so far
        days_to_exhaustion = window_days * (remaining / total_budget) / burn_rate
        exhaustion_date = (
            datetime.date.today() + datetime.timedelta(days=days_to_exhaustion)
        ).isoformat()
    elif remaining <= 0:
        exhaustion_date = "EXHAUSTED"
    else:
        exhaustion_date = "N/A (no errors)"

    # Determine policy status
    if remaining_pct >= 50:
        status = "green"
    elif remaining_pct >= 20:
        status = "yellow"
    elif remaining_pct > 0:
        status = "red"
    else:
        status = "exhausted"

    return ErrorBudgetReport(
        service=service,
        slo_target=slo_target,
        window_days=window_days,
        current_sli=round(current_sli, 6),
        budget_remaining_pct=round(remaining_pct, 2),
        budget_remaining_minutes=round(remaining_minutes, 1),
        estimated_exhaustion_date=exhaustion_date,
        policy_status=status,
    )
Automated Slack Weekly Report
from slack_sdk import WebClient


def send_weekly_slo_report(
    slack_token: str,
    channel: str,
    services: list[str],
    prometheus_url: str,
):
    """Send weekly SLO report to a Slack channel"""
    client = WebClient(token=slack_token)
    reports = [
        calculate_error_budget_status(prometheus_url, service)
        for service in services
    ]

    # Status emoji mapping (for Slack)
    status_emoji = {
        "green": ":large_green_circle:",
        "yellow": ":large_yellow_circle:",
        "red": ":red_circle:",
        "exhausted": ":rotating_light:",
    }

    blocks = [
        {"type": "header", "text": {"type": "plain_text", "text": "Weekly SLO Report"}},
        {"type": "divider"},
    ]
    # Worst budget first
    for r in sorted(reports, key=lambda x: x.budget_remaining_pct):
        emoji = status_emoji[r.policy_status]
        blocks.append({
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": (
                    f"{emoji} *{r.service}*\n"
                    f" SLO: {r.slo_target*100:.2f}% | "
                    f"Current SLI: {r.current_sli*100:.4f}%\n"
                    f" Budget remaining: {r.budget_remaining_pct:.1f}% "
                    f"({r.budget_remaining_minutes:.0f} min)\n"
                    f" Estimated exhaustion: {r.estimated_exhaustion_date}"
                ),
            },
        })
    client.chat_postMessage(channel=channel, blocks=blocks)
Step 5: Release Gate Integration -- How SLOs Block Deployments
Automatically block deployments in the CI/CD pipeline based on error budget status.
GitHub Actions Release Gate
# .github/workflows/release-gate.yaml
name: SLO Release Gate

on:
  workflow_call:
    inputs:
      service:
        required: true
        type: string

jobs:
  check-error-budget:
    runs-on: ubuntu-latest
    steps:
      - name: Query error budget status
        id: budget
        run: |
          RESULT=$(curl -s "${{ secrets.PROMETHEUS_URL }}/api/v1/query" \
            --data-urlencode "query=slo:error_budget_remaining_pct{service=\"${{ inputs.service }}\"}" \
            | jq -r '.data.result[0].value[1]')
          echo "remaining_pct=$RESULT" >> $GITHUB_OUTPUT

      - name: Evaluate release gate
        id: gate  # id required so canary_required is addressable as steps.gate.outputs.canary_required
        run: |
          BUDGET="${{ steps.budget.outputs.remaining_pct }}"
          echo "Error budget remaining: ${BUDGET}%"
          if (( $(echo "$BUDGET < 20" | bc -l) )); then
            echo "::error::ERROR BUDGET CRITICAL (${BUDGET}%). Release blocked."
            echo "Error budget is below 20%. Complete reliability work first."
            exit 1
          elif (( $(echo "$BUDGET < 50" | bc -l) )); then
            echo "::warning::Error budget at ${BUDGET}%. Canary deployment required."
            echo "canary_required=true" >> $GITHUB_OUTPUT
          else
            echo "Error budget healthy at ${BUDGET}%. Proceeding."
          fi

      - name: Notify Slack on block
        if: failure()
        run: |
          curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H 'Content-Type: application/json' \
            -d "{\"text\":\"Release blocked for ${{ inputs.service }}: error budget < 20%\"}"
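Since the gate is exposed via `workflow_call`, each service wires it into its own deploy pipeline as a prerequisite job. A hypothetical caller sketch (the file name and the deploy job body are placeholders, not part of the gate itself):

```yaml
# .github/workflows/deploy.yaml -- hypothetical caller
name: Deploy payment-api
on:
  push:
    branches: [main]

jobs:
  slo-gate:
    uses: ./.github/workflows/release-gate.yaml
    with:
      service: payment-api
    secrets: inherit

  deploy:
    needs: slo-gate  # runs only if the gate job succeeded
    runs-on: ubuntu-latest
    steps:
      - run: echo "actual deploy steps go here"
```

If the budget is below 20%, `slo-gate` fails, `deploy` is skipped, and the gate's Slack notification explains why.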
Step 6: Include SLO Impact in Postmortems
Incident postmortems must include "how much did this impact the SLO." This is the clearest way to quantify the business impact of an incident.
Postmortem SLO Impact Section Template
## SLO Impact Analysis
### Affected SLO
- Service: payment-api
- SLO Target: 99.95% availability (30-day window)
- Pre-incident SLI: 99.97%
- Post-incident SLI: 99.93%
### Error Budget Consumption
- Incident duration: 23 minutes
- Error requests during incident: 12,847
- Error budget consumed by this incident: 28.5% of the window's total budget
- Pre-incident budget remaining: 71.2%
- Post-incident budget remaining: 42.7%
### Policy Status Change
- Before incident: GREEN (71.2%)
- After incident: YELLOW (42.7%)
- Action: Add canary stage to releases for the next 2 weeks
### Business Impact
- Failed payment attempts: approximately 3,200
- Estimated revenue loss: approximately $48,000
- Increase in customer inquiries: 127
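The consumption figure in the template follows from simple request arithmetic. A minimal sketch for a request-based budget (the 90M-request window total is an illustrative assumption, not a number from the incident above):

```python
def budget_consumed_pct(slo_target: float, total_requests: int, error_requests: int) -> float:
    """Percent of the window's error budget burned by error_requests.

    The window tolerates total_requests * (1 - slo_target) failures in total.
    """
    allowed_errors = total_requests * (1 - slo_target)
    return 100 * error_requests / allowed_errors

# 99.95% SLO over an assumed 90M-request window -> ~45,000 allowed errors;
# the 12,847-error incident burns ~28.5% of the budget, matching the template.
print(round(budget_consumed_pct(0.9995, 90_000_000, 12_847), 1))  # → 28.5
```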
Troubleshooting
1. SLI Value Stuck at 100%
Cause: Missing metric collection. Error responses are not being recorded as separate metrics.
# How to check
curl -s http://prometheus:9090/api/v1/query \
--data-urlencode 'query=http_requests_total{status=~"5.."}' | jq '.data.result | length'
# If 0, 5xx metrics are not being collected
# Fix: Check the status code label in the application's metrics middleware
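The usual culprit is middleware that counts requests without a status label. A dependency-free sketch of the fix, where a plain dict stands in for a labeled `prometheus_client` Counter:

```python
from collections import Counter

# Stand-in for a labeled Prometheus counter: each response increments a
# series keyed by its status code, so error and success traffic separate.
http_requests_total = Counter()

def record_request(status_code: int) -> None:
    http_requests_total[str(status_code)] += 1  # status code as the label

for code in (200, 200, 503, 404, 500):
    record_request(code)

# Equivalent of the PromQL selector {status=~"5.."}
errors = sum(n for status, n in http_requests_total.items() if status.startswith("5"))
print(errors)  # → 2
```

Without the per-status key, every response lands in one undifferentiated series and the availability SLI can never drop below 100%.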
2. Alerts Firing Too Frequently (10+ times per day)
Diagnostic sequence:
- Check if the burn rate threshold is too low (setting alerts at burn rate 1 will fire even in normal conditions)
- Check if the short window is too long (30m is appropriate; 5m generates too much noise)
- Check if the `for` duration is too short (minimum 1m for critical, 5m for warning)
- Review whether the SLO target itself is unrealistic compared to current service levels
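For reference, a multiwindow rule consistent with this checklist might look like the sketch below. The 6h/30m recording rules are assumed to exist alongside the 5m ones from Step 1, and 99.9% is an assumed SLO target:

```yaml
- alert: ErrorBudgetBurnCritical
  # Long window (6h) detects the burn; the 30m short window (1/12 of 6h)
  # confirms it is still ongoing. Burn rate threshold 6 for a 99.9% SLO.
  expr: |
    (1 - sli:api_availability:ratio_rate6h)  / (1 - 0.999) > 6
    and
    (1 - sli:api_availability:ratio_rate30m) / (1 - 0.999) > 6
  for: 1m  # critical minimum, per the checklist above
  labels:
    severity: critical
```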
3. Error Budget Resets on New Window but Same Problem Repeats
Root cause: Postmortem action items were not completed before the window reset.
Fix: Add a condition to the error budget policy: "If the previous window was exhausted, action item completion rate must be 80% or higher to return to GREEN in the next window."
4. SLI Definition Mismatch Between Teams
Case: The backend team uses "server response time" while the frontend team uses "user-perceived loading time" as their SLI. Same SLO of 99.9% but measuring different things.
Fix: Manage SLI definition documents in an organization-wide shared wiki and conduct joint reviews with product/platform/SRE teams quarterly.
Quiz
Q1. What is the 30-day error budget for SLO 99.9% in minutes?
Answer: 43.2 minutes. 30 days x 24 hours x 60 minutes = 43,200 minutes; 43,200 x 0.001 = 43.2 minutes.
Q2. What situation does a burn rate of 14 indicate?
Answer: It means errors are occurring at a rate that would exhaust the 30-day error budget in
approximately 2.1 days (30/14). This is an acute incident situation requiring immediate response.
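The relation behind Q1 and Q2 generalizes: at burn rate B, a budget sized for a W-day window lasts W/B days. A quick check:

```python
def days_to_exhaustion(burn_rate: float, window_days: float = 30) -> float:
    # Burn rate 1 consumes the budget exactly over the window;
    # burn rate B consumes it B times faster.
    return window_days / burn_rate

print(round(days_to_exhaustion(14), 1))  # → 2.1 (the acute-incident case in Q2)
print(days_to_exhaustion(1))             # → 30.0 (budget lasts the full window)
```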
Q3. What is the role of the short window in Multi-Window alerting?
Answer: It confirms whether the anomaly detected in the long window is still ongoing. It prevents
false positives from past incidents that have already ended. The Google SRE Workbook recommends
setting the short window to 1/12 of the long window.
Q4. Should security patches also be halted when the error budget is exhausted?
Answer: No. Security patches, legal compliance requirements, and emergency fixes to prevent data
loss should be deployed regardless of error budget status. These exceptions must be explicitly
stated in the error budget policy.
Q5. Why should you never set an SLO to 100%?
Answer: An SLO of 100% means an error budget of 0, so no changes can be deployed as long as there
is any possibility of errors. This is effectively declaring "we will never deploy new features,"
completely blocking innovation. Additionally, 100% is realistically unachievable due to inevitable
infrastructure failures (hardware failures, network issues).
Q6. Why should you not use "average response time" as an SLI?
Answer: Averages hide tail latency. A service with P50 of 100ms and P99 of 10 seconds may show an
average of about 200ms, but 1 in 100 users waits 10 seconds. For SLIs, you should use "ratio of
responses within a threshold" (e.g., ratio of responses within 300ms).
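Q6's numbers can be checked directly; assuming 99 requests at 100ms and one at 10s:

```python
latencies_ms = [100] * 99 + [10_000]  # P50 = 100 ms, but a 10 s tail

average = sum(latencies_ms) / len(latencies_ms)
sli = sum(1 for t in latencies_ms if t <= 300) / len(latencies_ms)

print(average)  # → 199.0: the average looks healthy
print(sli)      # → 0.99: the threshold SLI exposes the 1-in-100 slow request
```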