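SRE Practices Guide 2025: Incident Management, Postmortem, Error Budget, On-Call, Toil Elimination

Introduction: What is SRE

Site Reliability Engineering (SRE) is a software engineering approach created by Google. It grew out of the question, "What happens if you hand operations problems to software engineers?"

Ben Treynor Sloss (Google VP of Engineering) gave the well-known definition:

"SRE is what happens when you ask a software engineer to design an operations function."

SRE is not just a job title; it is a culture and a philosophy. Its core principles are listed below.

SRE Core Principles:
1. Apply software engineering to operations work
2. Quantify reliability targets with SLOs
3. Balance innovation and reliability with Error Budgets
4. Reduce Toil and invest in automation
5. Monitor symptoms, not causes
6. Pursue simplicity
7. Learn continuously through blameless postmortems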

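SRE vs DevOps

Relationship between DevOps and SRE:

DevOps = Culture, philosophy, values
  - Collaboration between development and operations
  - Continuous integration/delivery
  - Infrastructure as Code
  - Feedback loops

SRE = Concrete implementation of DevOps
  - "class SRE implements DevOps"
  - Measurable objectives (SLO/SLI)
  - Error Budget as a decision framework
  - Engineering-based approach to operations

The two complement each other rather than compete:
DevOps defines "what to do", SRE defines "how to do it"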

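1. SLO, SLI, SLA

1.1 Concepts

SLA (Service Level Agreement)
= Contract between the service provider and the customer
= Violations carry financial or legal consequences
Example: "99.9% monthly availability guaranteed; 10% service credit if missed"

SLO (Service Level Objective)
= Internal target
= Set stricter than the SLA (to keep a buffer)
Example: "99.95% monthly availability target" (stricter than the 99.9% SLA)

SLI (Service Level Indicator)
= Actual measurement
= Metric used to evaluate the SLO
Example: "Actual availability over the last 30 days: 99.97%"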

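1.2 Choosing Good SLIs

SLI examples by service type:

API services:
- Availability: successful requests / total requests
- Latency: proportion of requests answered within 200ms
- Throughput: requests processed per second

Data pipelines:
- Freshness: proportion of data processed within N minutes
- Completeness: processed records / expected records
- Correctness: proportion of records processed correctly

Storage systems:
- Durability: proportion of data preserved without loss
- Availability: proportion of successful read/write requests
- Latency: p50 read latency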

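1.3 SLO Setting Guide

SLO setting process:

Step 1: Start from the user's perspective
  "What matters most to users?"
  -> Page load time, payment success rate, notification delay

Step 2: Define SLIs
  Availability SLI = successful HTTP responses (2xx, 3xx) / total HTTP responses
  Latency SLI = proportion of responses within 200ms

Step 3: Set the initial SLO
  - Analyze current performance data (last 30 days)
  - Set the initial SLO at a level the service already meets (for example, slightly below measured performance)
  - Do not set it too high!

Step 4: Calculate the Error Budget
  SLO 99.9% -> Error Budget = 0.1%
  Monthly: 30 days x 24 hours x 60 minutes x 0.001 = 43.2 minutes

Step 5: Iterate
  - Review the SLO every 4 weeks
  - Incorporate user feedback
  - Adjust the SLO as needed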

1.4 SLO Dashboard

# Prometheus + Grafana SLO dashboard
# Availability SLO query (2xx and 3xx count as success, matching the SLI definition)
availability_slo:
  target: 99.9
  window: 30d
  query: |
    sum(rate(http_requests_total{status=~"[23].."}[30d]))
    /
    sum(rate(http_requests_total[30d]))
    * 100

# Latency SLO query
latency_slo:
  target: 99.0
  threshold: 200ms
  window: 30d
  query: |
    sum(rate(http_request_duration_seconds_bucket{le="0.2"}[30d]))
    /
    sum(rate(http_request_duration_seconds_count[30d]))
    * 100

# Error budget consumed (%)
# The outer parentheses matter: "/" binds tighter than "-" in PromQL
error_budget_consumption:
  query: |
    (
      1 - (
        sum(rate(http_requests_total{status=~"[23].."}[30d]))
        /
        sum(rate(http_requests_total[30d]))
      )
    )
    /
    (1 - 0.999)
    * 100

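2. Error Budget Policy

2.1 What is Error Budget

Error Budget is the "acceptable unreliability" derived from the SLO.

Error Budget calculations:

SLO 99.9% (Three Nines):
  Annual: 365 days x 24 hours x 60 minutes x 0.001 = 525.6 minutes (about 8h 46m)
  Monthly: 30 days x 24 hours x 60 minutes x 0.001 = 43.2 minutes
  Weekly: 7 days x 24 hours x 60 minutes x 0.001 = 10.08 minutes

SLO 99.95% (Three and a Half Nines):
  Annual: 262.8 minutes (about 4h 23m)
  Monthly: 21.6 minutes
  Weekly: 5.04 minutes

SLO 99.99% (Four Nines):
  Annual: 52.56 minutes
  Monthly: 4.32 minutes
  Weekly: 1.01 minutes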

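2.2 Error Budget Policy Document

Error Budget Policy v2.0
Last modified: 2025-04-14
Approved by: CTO, VP Engineering, SRE Director

1. Purpose
   Error Budget is a quantitative framework for keeping
   reliability and innovation in balance.

2. Response by budget status

   Green (50% or more of the budget remaining):
   - Normal feature development and deployment
   - Chaos experiments permitted
   - Aggressive release schedule possible

   Yellow (20-50% of the budget remaining):
   - Reduce deployment velocity (limit to twice per week)
   - Raise the priority of reliability work
   - Chaos experiments in staging only

   Red (less than 20% of the budget remaining):
   - Deploy reliability-related changes only
   - Feature freeze
   - SRE team acts as deployment gatekeeper
   - Focus on root cause analysis and resolution

   Exhausted (0% of the budget remaining):
   - Complete halt of all non-essential deployments
   - Incident-level response
   - Joint SRE/development reliability sprint
   - Executive reporting and recovery planning

3. Exceptions
   - Security patches: deploy regardless of budget status
   - Legal requirements: deploy regardless of budget status
   - Data integrity issues: respond immediately

4. Review cadence
   - Weekly: SLO dashboard review
   - Monthly: Error Budget consumption pattern analysis
   - Quarterly: SLO target reassessment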

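2.3 Error Budget Decision Making

Scenario: 35 of the 43.2 monthly Error Budget minutes consumed (8.2 minutes remaining)

Q: Should we proceed with the new feature deployment?

Analysis:
- Remaining budget: 8.2 minutes (19%)
- Status: Red (below 20%)
- Deployment risk: similar past deployments caused 2 minutes of downtime on average

Decision:
-> Defer the feature deployment
-> Spend the remaining budget on reliability improvements
-> Proceed with the deployment once the budget resets next month

Communication:
"We have consumed 81% of this month's Error Budget.
 Per our policy, we will deploy only reliability-related changes.
 The deployment of Feature X is deferred to next month."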

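3. Incident Management Lifecycle

3.1 Incident Severity Levels

Severity definitions:

SEV1 (Critical):
  Definition: Total outage of a core service, many users affected
  Examples: Entire payment system down, data breach
  Response: Immediate (within 5 minutes)
  Escalation: VP + full on-call team
  Communication: Status update every 15 minutes
  Postmortem: Required (within 48 hours)

SEV2 (High):
  Definition: Major feature degradation, a significant share of users affected
  Examples: Search failing for 50% of requests, API latency up 10x
  Response: Within 15 minutes
  Escalation: Team lead + on-call
  Communication: Status update every 30 minutes
  Postmortem: Required (within 1 week)

SEV3 (Medium):
  Definition: Partial feature degradation, few users affected
  Examples: Slow image loading for users in a specific region
  Response: Within 1 hour
  Escalation: On-call engineer
  Communication: As needed
  Postmortem: Optional (when there is learning value)

SEV4 (Low):
  Definition: Minor issue, minimal user impact
  Examples: Internal dashboard loading slowly
  Response: Next business day
  Escalation: Owning team
  Communication: Not needed
  Postmortem: Not needed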

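3.2 Incident Response Process

Incident lifecycle:

1. Detection
   ┌─ Automated alerts (monitoring system)
   ├─ User reports (customer support team)
   └─ Internal discovery (found directly by an engineer)
   │
   ▼
2. Triage
   ┌─ Assess severity (SEV1-4)
   ├─ Assign an Incident Commander
   └─ Assemble the response team
   │
   ▼
3. Mitigation
   ┌─ Apply immediately available mitigations
   ├─ Rollback, scale-up, traffic shift
   └─ Verify service recovery
   │
   ▼
4. Resolution
   ┌─ Identify the root cause
   ├─ Deploy a permanent fix
   └─ Confirm stability through monitoring
   │
   ▼
5. Postmortem
   ┌─ Write the timeline
   ├─ Root cause analysis
   ├─ Derive action items
   └─ Share company-wide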

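3.3 Incident Commander Role

The Incident Commander (IC) is the cornerstone of incident response.

Incident Commander responsibilities:

1. Coordination
   - Assign roles within the response team
   - Prioritize work
   - Prevent duplicate effort
   - Secure the necessary resources

2. Communication
   - Internal status updates (Slack channel)
   - External status updates (status page)
   - Executive reporting
   - Customer support briefing

3. Decision Making
   - Decide whether to roll back
   - Judge when to escalate
   - Request additional resources
   - Declare the incident closed

4. Documentation
   - Record timestamps of key events
   - Record the rationale for decisions
   - Collect information for the postmortem

Incident Commander communication template:

"Incident Status Update - [Time]
 Severity: SEV[N]
 Status: [Investigating / Mitigating / Monitoring]
 Impact: [Description of impact scope]
 Current Action: [Actions in progress]
 Next Steps: [Planned next actions]
 Next Update: [Time]"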

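3.4 Incident Response Tools

Incident response workflow tools:

Alerting:
- PagerDuty: on-call management and alerting
- OpsGenie: alert routing and escalation
- Grafana Alerting: metric-based alerts

Communication:
- Slack (dedicated incident channel created automatically)
- Zoom/Google Meet (war room)
- StatusPage (external status page)

Incident management:
- Incident.io: Slack-integrated incident management
- Rootly: automated incident workflows
- Blameless: incident + postmortem platform
- FireHydrant: incident response automation

Documentation:
- Confluence/Notion: postmortem storage
- Google Docs: real-time collaborative documents
- Jira: action item tracking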

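4. Incident Communication

4.1 Internal Communication

Slack incident channel structure:

#incident-2025-0414-payment
  - Main channel for the incident response
  - IC, response team, and observers join
  - A bot records the timeline automatically

#incident-2025-0414-payment-comms
  - Coordination of external communication
  - Customer support and PR teams join
  - Customer-facing messages are drafted and approved here

Rules:
1. Keep the main channel to response-related content only
2. Move chatter and speculation into separate threads
3. Share every significant change in the channel
4. Anyone who is not the IC reports to the IC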

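4.2 External Communication

Status page update templates:

Investigating:
"We have confirmed that [service name] is currently experiencing [symptom].
 Our engineering team is investigating the cause.
 We will post an update as soon as more information is available."

Mitigating:
"We have identified the cause of the issue with [service name]
 and recovery work is in progress.
 Some users may still be affected."

Monitoring:
"We have applied a fix for [service name]
 and the service is returning to normal.
 We are continuing to monitor."

Resolved:
"The issue with [service name] has been resolved.
 The service was affected for approximately [N] minutes,
 from [start time] to [end time].
 We will share a detailed analysis in a postmortem."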

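5. Blameless Postmortem

5.1 Why Blameless

A blame culture causes the following problems.

Problems with blame culture:

1. Information hiding
   "If I report my mistake, I will be punished"
   -> The real cause stays hidden

2. Defensive behavior
   "I have to prove it was not my fault"
   -> Focus shifts from improving the system to avoiding responsibility

3. Stifled innovation
   "Mistakes are not allowed, so let's minimize change"
   -> Even necessary improvements do not get made

4. Broken trust
   "My colleagues might blame me"
   -> Team collaboration suffers

Blameless culture:
"Fix the system, not the person"
- Everyone acted in the way they believed was right
- An outage exposes a weakness in the system
- It is safe to share mistakes
- The root cause always lies in the system or process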

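5.2 Postmortem Template

Postmortem: [Incident title]
Date: YYYY-MM-DD
Author: [Name]
Reviewers: [Names]

1. Summary
   [Incident summary in 1-2 sentences]

2. Impact
   - Duration: [start] ~ [end] ([N] minutes)
   - Affected users: [count or percentage]
   - Affected services: [list of services]
   - Financial impact: [if any]

3. Timeline (all times KST)
   14:23 - Monitoring alert fires (payment-service error rate rising)
   14:25 - On-call engineer acknowledges
   14:28 - SEV2 incident declared, IC assigned
   14:30 - Incident channel created
   14:35 - Investigation: database connection pool exhaustion confirmed
   14:40 - Mitigation: connection pool size increased
   14:45 - Service recovery confirmed
   14:50 - Switched to monitoring state
   15:15 - Incident declared closed

4. Root Cause
   [Detailed technical explanation]

5. 5 Whys Analysis
   Why 1: Why did payments fail?
   -> The service could not obtain a database connection

   Why 2: Why could it not obtain a connection?
   -> The connection pool was fully exhausted

   Why 3: Why was the connection pool exhausted?
   -> Slow queries held connections for a long time

   Why 4: Why were there slow queries?
   -> Full scans against a table with no index

   Why 5: Why was there no index?
   -> There was no index review process when adding new tables

6. What Went Well
   - The alert fired within 2 minutes
   - The IC coordinated the team quickly
   - The mitigation was effective

7. What Needs Improvement
   - No monitoring of the database connection pool
   - No alerting for slow queries
   - An index review process for new tables is needed

8. Action Items
   [HIGH] Add alerts on database connection pool utilization
   Owner: DB team | Due: 2025-04-21

   [HIGH] Introduce an index review checklist for new tables/queries
   Owner: Backend team | Due: 2025-04-28

   [MED] Build automatic slow query detection and alerting
   Owner: SRE team | Due: 2025-05-15

   [LOW] Evaluate a mechanism for automatic connection pool sizing
   Owner: Infrastructure team | Due: 2025-06-01

9. Lessons Learned
   [Key lessons from this incident]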

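5.3 Postmortem Review Process

Postmortem review checklist:

Writing quality:
[ ] Is the timeline accurate and complete?
[ ] Has the root cause been analyzed in depth?
[ ] Do the 5 Whys end at a system or process? (They must not end at a person)
[ ] Is it written in blameless language?

Action items:
[ ] Does every action item have an owner?
[ ] Are the due dates realistic?
[ ] Are the priorities appropriate?
[ ] Are the actions effective at preventing recurrence?

Sharing:
[ ] Has it been shared with the relevant teams?
[ ] Is it scheduled for the company-wide postmortem review meeting?
[ ] Have similar services been checked for the same weakness?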

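5.4 Building a Postmortem Culture

Ways to practice postmortem culture:

1. Weekly postmortem review meeting
   - 30 minutes every Friday
   - Share the postmortems for that week's incidents
   - Open to every team

2. Postmortem reading club
   - Analyze public postmortems from other companies
   - Draw out lessons that apply to our own systems
   - Once a month

3. Blameless language guide
   Bad:  "Hong Gil-dong deployed a bad query and caused the outage"
   Good: "A query without an index was deployed to production.
          The code review process had no query performance check."

4. Action item tracking
   - Manage action items on a Jira board
   - Track the monthly completion rate
   - Review items that remain incomplete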

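6. On-Call Operations

6.1 On-Call Rotation

On-call rotation design:

Basic principles:
1. At least 2 people are always on call (Primary + Secondary)
2. One-week shifts (starting Monday 09:00)
3. No back-to-back on-call shifts (at least 2 weeks apart)
4. Public holidays are covered by volunteers or with extra compensation

Rotation example (team of 6):
Week 1: Alice (P) + Bob (S)
Week 2: Charlie (P) + Diana (S)
Week 3: Eve (P) + Frank (S)
Week 4: Bob (P) + Alice (S)
Week 5: Diana (P) + Charlie (S)
Week 6: Frank (P) + Eve (S)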
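
The six-week example follows a simple pattern: each pair serves once, then serves again with the roles swapped. A minimal sketch of that pattern is shown here; the function name `build_rotation` is an illustrative assumption, not part of any scheduling tool.

```python
def build_rotation(pairs):
    """Each pair serves once as (primary, secondary), then again with roles swapped."""
    first_pass = [(a, b) for a, b in pairs]
    second_pass = [(b, a) for a, b in pairs]
    return first_pass + second_pass


pairs = [("Alice", "Bob"), ("Charlie", "Diana"), ("Eve", "Frank")]
for week, (primary, secondary) in enumerate(build_rotation(pairs), start=1):
    print(f"Week {week}: {primary} (P) + {secondary} (S)")
```

With three pairs, every engineer gets two weeks off between shifts, which satisfies the "at least 2 weeks apart" principle.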

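Swap rules:
- Give at least 48 hours' notice
- The person swapping out is responsible for finding a replacement
- Notify the team lead of the swap
- Update the on-call calendar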

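6.2 Escalation Chain

Escalation policy:

Level 1: On-Call Primary (immediately)
  -> If no response within 5 minutes

Level 2: On-Call Secondary (5 minutes)
  -> If not resolved within 15 minutes

Level 3: Team Lead (15 minutes)
  -> If SEV1, or not resolved within 30 minutes

Level 4: Engineering Manager (30 minutes)
  -> If a SEV1 lasts longer than 1 hour

Level 5: VP/CTO (1 hour)

Automatic escalation settings (PagerDuty):

# escalation_delay_in_minutes is the wait on each rule before moving to the next,
# so 5 + 10 + 15 reproduces the 5 / 15 / 30 minute marks listed above
escalation_policy:
  name: "payment-service"
  rules:
    - targets:
        - type: "user_reference"
          id: "PRIMARY_ON_CALL"
      escalation_delay_in_minutes: 5
    - targets:
        - type: "user_reference"
          id: "SECONDARY_ON_CALL"
      escalation_delay_in_minutes: 10
    - targets:
        - type: "user_reference"
          id: "TEAM_LEAD"
      escalation_delay_in_minutes: 15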
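
To reason about who should already have been paged at a given point in an incident, the chain can be written as a lookup table. This is a simplified sketch: it treats every level as purely time-based and ignores the SEV1-only conditions, and the name `notified_so_far` is an assumption made for illustration.

```python
import bisect

# (minutes since the page fired, who is notified), taken from the escalation chain
ESCALATION_CHAIN = [
    (0, "On-Call Primary"),
    (5, "On-Call Secondary"),
    (15, "Team Lead"),
    (30, "Engineering Manager"),
    (60, "VP/CTO"),
]


def notified_so_far(elapsed_minutes):
    """Everyone paged once elapsed_minutes have passed without resolution."""
    thresholds = [minutes for minutes, _ in ESCALATION_CHAIN]
    count = bisect.bisect_right(thresholds, elapsed_minutes)
    return [who for _, who in ESCALATION_CHAIN[:count]]


print(notified_so_far(20))
```

At the 20-minute mark the sketch reports the primary, the secondary, and the team lead.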

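6.3 Managing On-Call Fatigue

On-call fatigue management strategies:

1. Improve alert quality
   Problem: too many alerts -> alert fatigue -> real problems get missed
   Solution:
   - Weekly alert review: disable unnecessary alerts
   - Deduplicate alerts
   - Add context to alerts (what is wrong and how to respond)
   - Target: fewer than 20 alerts per on-call week

2. Write runbooks
   Write a runbook for every recurring alert
   Runbook structure:
   - Description of the symptom
   - How to check the scope of impact
   - Step-by-step response procedure
   - Escalation criteria
   - Links to relevant dashboards

3. Compensation
   - On-call allowance (fixed amount per week)
   - Extra compensation for night/weekend pages
   - Compensatory time off
   - Special compensation for extended on-call periods

4. Mental health
   - Debriefing session after an on-call shift
   - Mental Health Day after a difficult incident
   - Full disconnection on days off call
   - A team culture of sharing on-call experiences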

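6.4 Runbook Example

Runbook: payment-service high error rate

Trigger: payment-service HTTP 5xx ratio > 1%

1. Assess the situation
   - Check the Grafana dashboard:
     [dashboard URL]
   - Check the error logs:
     kubectl logs -l app=payment-service --tail=100 -n production
   - Check recent deployments:
     kubectl rollout history deployment/payment-service -n production

2. Common causes

   Cause A: Database connection problem
   Check: connection pool metrics
   Action: reset or expand the connection pool
   Command:
     kubectl rollout restart deployment/payment-service -n production

   Cause B: External API outage
   Check: the external API's status page
   Action: force the circuit breaker open
   Command:
     curl -X POST http://payment-service/admin/circuit-breaker/open

   Cause C: Problem with a recent deployment
   Check: compare the deployment timeline with when the errors started
   Action: roll back to the previous version
   Command:
     kubectl rollout undo deployment/payment-service -n production

3. Escalation
   - If not resolved within 15 minutes: escalate to the team lead
   - If judged to be SEV1: declare an incident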

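7. Toil Elimination

7.1 What is Toil

These are the characteristics of Toil as defined in the Google SRE book.

Characteristics of Toil:

1. Manual
   Work that a person has to perform by hand
   Example: restarting a server manually

2. Repetitive
   The same work performed over and over
   Example: renewing certificates every week

3. Automatable
   Work a machine could do instead
   Example: running a disk cleanup script

4. Tactical
   Immediate, reactive work with no long-term value
   Example: applying a temporary workaround

5. Scales with Service
   The workload grows as the service grows
   Example: manual user provisioning

6. No Enduring Value
   The service is no better after the work is done
   Example: manual deployment approval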

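7.2 Measuring Toil

How to measure Toil:

1. Time tracking
   - Record all working time for 2 weeks
   - Classify it as Toil vs engineering time
   - Target: Toil < 50%

2. Toil categories:
   Category A: Incident response (urgent)
   Category B: Routine operational work (scheduled)
   Category C: Manual provisioning (request-driven)
   Category D: Manual monitoring/verification (checks)

3. Measurement template:

   Team: SRE team
   Period: 2025-04-01 ~ 2025-04-14

   | Task | Frequency | Time spent | Toil? | Automatable? |
   |------|-----------|------------|-------|--------------|
   | Certificate renewal | 1x per week | 30 min | Yes | Yes |
   | Disk cleanup | 1x per day | 15 min | Yes | Yes |
   | Deployment approval | 3x per day | 10 min each | Yes | Yes |
   | Capacity review | 1x per week | 2 hours | Partial | Partial |
   | Incident response | 2x per week | 1 hour each | No | No |

   Total time: 80 hours (2 weeks)
   Toil time: 45 hours (56%)
   Target: 40 hours or less (50%)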
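
The Toil ratio in the template is a straightforward aggregation over a time log. The sketch below shows one way to compute it. The individual log entries are hypothetical numbers chosen only so that they add up to the 80-hour total and 45 Toil hours quoted above, and the function name `summarize` is an assumption.

```python
from collections import defaultdict


def summarize(entries):
    """entries: iterable of (category, hours, is_toil) tuples."""
    total = 0.0
    toil = 0.0
    toil_by_category = defaultdict(float)
    for category, hours, is_toil in entries:
        total += hours
        if is_toil:
            toil += hours
            toil_by_category[category] += hours
    return toil / total, dict(toil_by_category)


# Hypothetical two-week log for one engineer (80 hours in total)
time_log = [
    ("routine operations", 20, True),
    ("manual provisioning", 15, True),
    ("manual verification", 10, True),
    ("automation development", 25, False),
    ("system design", 10, False),
]

ratio, breakdown = summarize(time_log)
print(f"Toil ratio: {ratio:.0%}")
print("Over the 50% target" if ratio > 0.5 else "Within the 50% target")
```

The per-category breakdown shows where automation would remove the most hours.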

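7.3 Toil Automation Priorities

Automation priority matrix:

              High frequency    Low frequency
High impact | P1: automate    | P2: plan       |
            | immediately     | automation     |
            |-----------------|----------------|
Low impact  | P3: add to      | P4: can be     |
            | backlog         | ignored        |

P1 automation examples:
- Automatic certificate renewal (cert-manager)
- Automatic disk cleanup (cron job)
- Automatic deployment approval (CI/CD pipeline)
- Autoscaling (HPA/VPA)

P2 automation examples:
- Automated capacity planning (forecast-based)
- Automated incident response (self-healing)
- Automatic application of security patches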
7.4 Automation Examples

# Automatic certificate renewal with cert-manager
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-service-tls
  namespace: production
spec:
  secretName: my-service-tls-secret
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - api.example.com
    - www.example.com
  renewBefore: 720h  # renew automatically 30 days before expiry

# Disk cleanup CronJob (assumes /data is mounted into the pod; volume definitions omitted)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: disk-cleanup
  namespace: production
spec:
  schedule: "0 3 * * *"  # every day at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cleanup
              image: busybox:1.36
              command:
                - /bin/sh
                - -c
                - |
                  find /data/logs -type f -mtime +7 -delete
                  find /data/tmp -type f -mtime +1 -delete
          restartPolicy: OnFailure

# HPA autoscaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60

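8. Capacity Planning

8.1 Demand Forecasting

Capacity planning framework:

1. Understand current capacity
   - CPU: 60% of the whole cluster in use
   - Memory: 55% of the whole cluster in use
   - Storage: growing by 100GB per month
   - Network: 40% of bandwidth in use at peak

2. Analyze the growth rate
   - Traffic growth over the last 12 months: 15% per month
   - Seasonal patterns: 3x on Black Friday, 2x on Lunar New Year
   - Planned events: a new feature launch expected to add 50% traffic

3. Keep headroom
   - Typical headroom: 30-50%
   - N+1 principle: the service survives the loss of one node
   - Peak readiness: able to handle 3x normal traffic

4. Provisioning lead time
   - Cloud: minutes (autoscaling)
   - On-premises: weeks to months (hardware purchase)
   - Hybrid: on-premises baseline + cloud burst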
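
The growth and headroom figures can be combined into a rough runway estimate. The sketch below assumes that resource utilization grows at the same compound rate as traffic, which is a simplification; the function name `months_until_full` and the `ceiling` parameter are illustrative assumptions.

```python
import math


def months_until_full(current_utilization, monthly_growth, ceiling=1.0):
    """Whole months until utilization reaches `ceiling` under compound growth."""
    if current_utilization >= ceiling:
        return 0
    months = math.log(ceiling / current_utilization) / math.log(1.0 + monthly_growth)
    return math.ceil(months)


# CPU at 60% with 15% monthly growth: months until the cluster is saturated
print(months_until_full(0.60, 0.15))        # 4
# Same inputs, but stopping at 70% to preserve 30% headroom
print(months_until_full(0.60, 0.15, 0.70))  # 2
```

Comparing the result with the provisioning lead time shows whether an order has to be placed now: with on-premises lead times of weeks to months, a two-month runway leaves little slack.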

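8.2 Load Testing

// k6 load test script
// k6-load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '5m', target: 100 },   // ramp up to 100 VUs over 5 minutes
    { duration: '10m', target: 100 },  // hold 100 VUs for 10 minutes
    { duration: '5m', target: 500 },   // ramp up to 500 VUs over 5 minutes
    { duration: '10m', target: 500 },  // hold 500 VUs for 10 minutes
    { duration: '5m', target: 1000 },  // ramp up to 1000 VUs over 5 minutes
    { duration: '10m', target: 1000 }, // hold 1000 VUs for 10 minutes
    { duration: '5m', target: 0 },     // ramp down over 5 minutes
  ],
  thresholds: {
    http_req_duration: ['p(99)<500'],  // p99 latency below 500ms
    http_req_failed: ['rate<0.01'],    // error rate below 1%
  },
};

// Each VU runs this function in a loop.
// The URL is a placeholder; replace it with the endpoint under test.
export default function () {
  const res = http.get('https://api.example.com/health');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}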

9. Release Engineering

9.1 Canary Deployment

# Canary deployment with Argo Rollouts
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
  namespace: production
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: payment-service-canary
      stableService: payment-service-stable
      trafficRouting:
        istio:
          virtualService:
            name: payment-service-vs
      steps:
        - setWeight: 5
        - pause:
            duration: 10m
        - setWeight: 20
        - pause:
            duration: 10m
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 50
        - pause:
            duration: 15m
        - setWeight: 100
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 2
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 60s
      successCondition: result[0] >= 0.999
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{
              service="payment-service",
              status=~"2.."
            }[5m]))
            /
            sum(rate(http_requests_total{
              service="payment-service"
            }[5m]))

9.2 Feature Flag

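# Safe releases with feature flags
import hashlib


class FeatureFlagManager:
    def __init__(self):
        # flag_name -> {'enabled': bool, 'whitelist': [...], 'percentage': int}
        self.flags = {}

    def is_enabled(self, flag_name, user_id=None, percentage=None):
        """
        Check whether a feature flag is enabled.
        - flag_name: name of the feature
        - user_id: user being evaluated
        - percentage: gradual rollout percentage (0-100); falls back to
          the value stored on the flag when omitted
        """
        flag = self.flags.get(flag_name)
        if not flag:
            return False

        # Globally disabled
        if not flag.get('enabled'):
            return False

        # Whitelisted users always get the feature
        if user_id is not None and user_id in flag.get('whitelist', []):
            return True

        # Gradual rollout
        if percentage is None:
            percentage = flag.get('percentage')
        if percentage is not None:
            # Built-in hash() is randomized per process for strings, so use a
            # stable digest to keep each user in the same bucket everywhere
            key = f"{flag_name}:{user_id}".encode()
            bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
            return bucket < percentage

        # Enabled with no rollout restriction
        return True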

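Feature flag rollout strategy:

Stage 1: Internal employees only (dogfooding)
   percentage: 0, whitelist: [list of internal employee IDs]

Stage 2: 1% rollout
   percentage: 1

Stage 3: 10% rollout
   percentage: 10

Stage 4: 50% rollout
   percentage: 50

Stage 5: Full rollout
   percentage: 100

At every stage:
- Monitor the error rate
- Collect user feedback
- Check business metrics
- Disable the flag immediately if a problem is found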

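10. Key Lessons from the Google SRE Book

This section summarizes the 10 most important lessons from the Google SRE book.

Lesson 1: 100% availability is the wrong target
  -> No system is perfect
  -> Define the appropriate level of reliability as an SLO

Lesson 2: Error Budget makes innovation possible
  -> With budget left, you can take risks
  -> With no budget left, you must focus on reliability

Lesson 3: Toil above 50% is a warning sign
  -> If more than 50% of SRE time goes to Toil
  -> Invest in automation or grow the team

Lesson 4: Monitoring should be symptom-based
  -> "CPU is at 90%" (cause) vs "Responses are slow" (symptom)
  -> Monitor the symptoms that users experience

Lesson 5: Run postmortems without blame
  -> Fix the system, not the person
  -> Build a culture where sharing mistakes is safe

Lesson 6: Simplicity is reliability
  -> Complex systems fail in unpredictable ways
  -> Keep things simple wherever possible

Lesson 7: Release engineering is central to reliability
  -> Canary deployments, feature flags, automatic rollback
  -> Deployments must be safe before they can be frequent

Lesson 8: Plan capacity before, not after
  -> Prepare before traffic grows
  -> Find the limits with load tests

Lesson 9: On-call must be sustainable
  -> Reduce alert fatigue
  -> Provide appropriate compensation and rest

Lesson 10: SRE is the whole organization's responsibility
  -> It is not the SRE team's job alone
  -> Development teams must also own reliability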

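11. SRE Tool Ecosystem

11.1 Incident Management Tools

Tool comparison:

PagerDuty:
  - Market leader
  - Strong on-call management
  - 800+ integrations
  - AI-based incident triage
  - Price: from $21 per user per month

OpsGenie (Atlassian):
  - Native Jira/Confluence integration
  - Flexible alert routing
  - Team collaboration features
  - Price: from $9 per user per month

Incident.io:
  - Slack native
  - Automated workflows
  - Automated postmortem generation
  - Price: from $16 per user per month

Rootly:
  - Slack-based incident management
  - Automatic runbook execution
  - Rich integrations
  - Price: from $15 per user per month

Blameless:
  - Specialized in postmortems
  - SLO management features
  - Integrated incidents + postmortems
  - Price: contact sales

FireHydrant:
  - Manages the full incident lifecycle
  - Status page integration
  - Automatic escalation
  - Price: free tier + paid plans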

11.2 Observability Tools

Three Pillars of Observability:

Logs:
  - ELK Stack (Elasticsearch + Logstash + Kibana)
  - Loki + Grafana
  - Datadog Logs
  - Splunk

Metrics:
  - Prometheus + Grafana
  - Datadog
  - New Relic
  - CloudWatch

Traces:
  - Jaeger
  - Zipkin
  - Datadog APM
  - OpenTelemetry (unified standard)

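12. Building an SRE Team

12.1 Team Models

SRE team models:

1. Centralized model
   Characteristics: one SRE team covers every service
   Pros: consistent practices, easy knowledge sharing
   Cons: bottlenecks, limited depth per service
   Best for: small organizations (fewer than 10 services)

2. Embedded model
   Characteristics: SREs are embedded in development teams
   Pros: deep understanding of the service, fast response
   Cons: less consistency, risk of SRE isolation
   Best for: large organizations (50 or more services)

3. Hybrid model
   Characteristics: central SRE team + an SRE champion in each team
   Pros: balances consistency and depth
   Cons: coordination cost
   Best for: mid-sized organizations (10-50 services)

4. Consulting model
   Characteristics: SREs step in only when needed
   Pros: scalable, cost-efficient
   Cons: lack of continuous involvement
   Best for: organizations just starting to adopt SRE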

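12.2 Hiring SREs

Core competencies of an SRE engineer:

Technical:
- System administration (Linux, networking, storage)
- Programming (Python, Go, Bash)
- Cloud platforms (AWS, GCP, Azure)
- Containers/orchestration (Docker, Kubernetes)
- Monitoring/observability (Prometheus, Grafana)
- CI/CD (Jenkins, GitHub Actions, ArgoCD)
- IaC (Terraform, Pulumi)

Soft skills:
- Problem solving (systematic debugging)
- Communication (clear messaging during incidents)
- Stress management (decision making under on-call pressure)
- Documentation (runbooks, postmortems)
- Collaboration (working with development teams)

Sample interview questions:
1. "API response times in production have suddenly become 10x slower.
    How would you debug it?"
2. "A service with a 99.9% SLO has exhausted its Error Budget.
    What actions would you take?"
3. "There is a manual operational task that keeps recurring.
    Draw up an automation plan."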

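12.3 SRE Onboarding

SRE onboarding program (12 weeks):

Weeks 1-2: Foundations
  - Architecture overview
  - Understanding the core services
  - Learning the monitoring tools
  - Setting up on-call tools

Weeks 3-4: Observation
  - Shadowing a senior SRE's on-call shift
  - Observing incident response
  - Reading and understanding runbooks

Weeks 5-6: Practice
  - Handling simple incidents (with senior support)
  - Updating runbooks
  - Improving monitoring alerts

Weeks 7-8: Independence
  - Serving as Secondary On-Call
  - Writing a postmortem
  - Starting a Toil automation project

Weeks 9-10: Deepening
  - Serving as Primary On-Call
  - Experience as Incident Commander
  - Taking part in SLO reviews

Weeks 11-12: Contribution
  - Completing the automation project
  - Improving the onboarding documentation
  - Preparing to mentor the next new member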

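13. SRE Time Allocation

Ideal SRE time allocation:

Engineering work: 50% or more
  - Automation development
  - Tooling improvements
  - System design
  - Code review

Operational work (Toil): 50% or less
  - On-call response
  - Deployment management
  - Manual provisioning
  - Manual monitoring

Warning signs:
- Toil > 50%: investment in automation needed
- Toil > 70%: team growth or service redesign needed
- Toil > 90%: crisis, immediate executive intervention needed

How to track SRE time:
- Record time by category in 2-week periods
- Quarterly Toil ratio report
- Set and track Toil reduction targets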

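14. Quiz

Use the quiz below to check what you have learned.

Q1: With an SLO of 99.9%, what is the monthly Error Budget?

Answer: About 43.2 minutes

Calculation: 30 days x 24 hours x 60 minutes x (1 - 0.999) = 30 x 24 x 60 x 0.001 = 43.2 minutes

This means roughly 43 minutes of downtime are allowed per month. Once that time is exceeded, the Error Budget is exhausted and, per the policy, feature deployments stop and the focus shifts to reliability improvements.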

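Q2: In a blameless postmortem, why must the "5 Whys" analysis end at a system rather than a person?

Answer:

If the 5 Whys analysis ends at "someone made a mistake", the corrective action stops at "train that person". That is not a fundamental fix.

Why it should end at a system or process:

Preventing recurrence: Once the system is fixed, the same mistake cannot happen regardless of who does the work.
Information sharing: When people fear blame they hide mistakes, which makes the real cause harder to find.
Scalable fixes: "Training" applies to one person, while "automated validation" applies to every deployment.

Bad example: "A developer deployed a query without an index" (blames a person)
Good example: "There was no query performance validation step before deployment" (improves the process)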

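Q3: List the six characteristics of Toil and explain why Toil should be kept at or below 50%.

Answer:

The six characteristics of Toil:

Manual: performed by a person by hand
Repetitive: the same work over and over
Automatable: could be done by a machine
Tactical: immediate, reactive work with no long-term value
Scales with Service: grows as the service grows
No Enduring Value: the service is no better after the work is done

Why it should be kept at or below 50%:

The core value of SRE is improving operations through engineering
When Toil exceeds 50%, there is not enough time left to invest in automation and system improvements
This creates a vicious cycle: too much Toil leaves no time to automate, and without automation Toil keeps growing
The Google SRE book treats Toil above 50% as a warning sign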

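Q4: Describe three strategies for reducing on-call fatigue.

Answer:

Improve alert quality
- Disable unnecessary alerts (weekly alert review)
- Add enough context to each alert (what is wrong and how to respond)
- Target: fewer than 20 alerts per on-call week
Write and maintain runbooks
- Document a step-by-step response procedure for every recurring alert
- With a runbook, responders can act quickly and accurately even on a night page
- Update runbooks regularly
Appropriate compensation and rest
- Provide an on-call allowance and extra pay for nights and weekends
- Offer a Mental Health Day after a difficult incident
- Fully disconnect on days off call to prevent burnout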

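Q5: Explain what the expression "class SRE implements DevOps" means.

Answer:

The expression borrows from object-oriented programming to describe the relationship between SRE and DevOps.

DevOps is the interface: it defines the culture, philosophy, and values. It states principles such as "development and operations must collaborate", "automate", and "improve continuously", but does not prescribe a concrete implementation.
SRE is the implementing class: it puts the DevOps philosophy into practice. It quantifies targets with SLOs/SLIs, provides a decision framework through the Error Budget, and executes through Toil measurement and automation.

In other words, DevOps defines "what should be done" and SRE makes concrete "how to do it". The two are complementary, not opposed.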

References

Site Reliability Engineering - Betsy Beyer et al. (Google, O'Reilly)
The Site Reliability Workbook - Betsy Beyer et al. (Google, O'Reilly)
Building Secure and Reliable Systems - Heather Adkins et al. (Google, O'Reilly)
Google SRE Resources - sre.google
Implementing Service Level Objectives - Alex Hidalgo (O'Reilly)
Incident Management for Operations - Rob Schnepp et al. (O'Reilly)
PagerDuty Incident Response Guide - response.pagerduty.com
Atlassian Incident Management Handbook - atlassian.com/incident-management
Blameless Postmortem Guide - blameless.com
Rootly SRE Guide - rootly.com/blog
Incident.io Blog - incident.io/blog
Netflix Tech Blog: SRE Practices - netflixtechblog.com
LinkedIn SRE Practices - engineering.linkedin.com
Dropbox SRE - dropbox.tech

SRE Practices Guide 2025: Incident Management, Postmortem, Error Budget, On-Call, Toil Elimination

Introduction: What is SRE

Site Reliability Engineering (SRE) is a software engineering approach created by Google, born from the question: "What happens when you ask a software engineer to design an operations function?"

Ben Treynor Sloss (Google VP of Engineering) famously defined it:

"SRE is what happens when you ask a software engineer to design an operations function."

SRE is not just a job title -- it is a culture and philosophy. The core principles are:

SRE Core Principles:
1. Apply software engineering to operations problems
2. Quantify reliability targets with SLOs
3. Balance innovation and reliability with Error Budgets
4. Reduce Toil and invest in automation
5. Monitoring should be symptom-based, not cause-based
6. Pursue simplicity
7. Learn continuously through blameless postmortems

SRE vs DevOps

Relationship between DevOps and SRE:

DevOps = Culture, philosophy, values
  - Collaboration between dev and ops
  - Continuous integration/delivery
  - Infrastructure as Code
  - Feedback loops

SRE = Concrete implementation of DevOps
  - "class SRE implements DevOps"
  - Measurable objectives (SLO/SLI)
  - Error Budget as a decision framework
  - Engineering-based operations approach

They complement, not compete:
DevOps defines "what to do", SRE defines "how to do it"

1. SLO, SLI, SLA

1.1 Concepts

SLA (Service Level Agreement)
= Contract between service provider and customer
= Financial/legal consequences for violations
Example: "99.9% monthly availability guaranteed, 10% service credit if missed"

SLO (Service Level Objective)
= Internal target
= Set stricter than SLA (to maintain buffer)
Example: "99.95% monthly availability target" (stricter than 99.9% SLA)

SLI (Service Level Indicator)
= Actual measurement
= Metrics used to evaluate SLOs
Example: "Actual availability over last 30 days: 99.97%"

1.2 Choosing Good SLIs

SLI Examples by Service Type:

API Services:
- Availability: Successful requests / Total requests
- Latency: Proportion of requests answered within 200ms
- Throughput: Requests processed per second

Data Pipelines:
- Freshness: Proportion of data processed within N minutes
- Completeness: Processed records / Expected records
- Correctness: Proportion of correctly processed records

Storage Systems:
- Durability: Proportion of data preserved without loss
- Availability: Successful read/write request ratio
- Latency: p50 read latency

1.3 SLO Setting Guide

SLO Setting Process:

Step 1: Start from the user perspective
  "What matters most to users?"
  -> Page load time, payment success rate, notification delay

Step 2: Define SLIs
  Availability SLI = Successful HTTP responses (2xx, 3xx) / Total responses
  Latency SLI = Proportion of responses within 200ms

Step 3: Set initial SLO
  - Analyze current performance data (last 30 days)
  - Set the initial SLO at a level the service already meets (for example, slightly below measured performance)
  - Don't set it too high!

Step 4: Calculate Error Budget
  SLO 99.9% -> Error Budget = 0.1%
  Monthly: 30 days x 24 hours x 60 minutes x 0.001 = 43.2 minutes

Step 5: Iterate
  - Review SLO every 4 weeks
  - Incorporate user feedback
  - Adjust SLO as needed
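
The two SLI definitions from Step 2 can be sketched as plain ratios over a batch of requests. The `Request` type, the function names, and the sample data are assumptions made for illustration; in production these ratios would normally come from the monitoring system rather than application code.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code
    duration_ms: float  # response time in milliseconds


def availability_sli(requests):
    """Share of requests answered with a 2xx or 3xx status."""
    good = sum(1 for r in requests if 200 <= r.status < 400)
    return good / len(requests)


def latency_sli(requests, threshold_ms=200.0):
    """Share of requests answered within threshold_ms."""
    fast = sum(1 for r in requests if r.duration_ms <= threshold_ms)
    return fast / len(requests)


sample = [
    Request(200, 120.0),
    Request(302, 80.0),
    Request(500, 950.0),
    Request(200, 250.0),
]
print(availability_sli(sample))  # 0.75
print(latency_sli(sample))       # 0.5
```

Both functions raise ZeroDivisionError on an empty batch, so a real implementation needs to decide how to report a window with no traffic.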

1.4 SLO Dashboard

# Prometheus + Grafana SLO Dashboard
# Availability SLO Query (2xx and 3xx count as success, matching the SLI definition)
availability_slo:
  target: 99.9
  window: 30d
  query: |
    sum(rate(http_requests_total{status=~"[23].."}[30d]))
    /
    sum(rate(http_requests_total[30d]))
    * 100

# Latency SLO Query
latency_slo:
  target: 99.0
  threshold: 200ms
  window: 30d
  query: |
    sum(rate(http_request_duration_seconds_bucket{le="0.2"}[30d]))
    /
    sum(rate(http_request_duration_seconds_count[30d]))
    * 100

# Error Budget Consumed (%)
# The outer parentheses matter: "/" binds tighter than "-" in PromQL
error_budget_consumption:
  query: |
    (
      1 - (
        sum(rate(http_requests_total{status=~"[23].."}[30d]))
        /
        sum(rate(http_requests_total[30d]))
      )
    )
    /
    (1 - 0.999)
    * 100

2. Error Budget Policy

2.1 What is Error Budget

Error Budget is the "acceptable unreliability" derived from the SLO.

Error Budget Calculations:

SLO 99.9% (Three Nines):
  Annual: 365d x 24h x 60m x 0.001 = 525.6 minutes (8h 45m)
  Monthly: 30d x 24h x 60m x 0.001 = 43.2 minutes
  Weekly: 7d x 24h x 60m x 0.001 = 10.08 minutes

SLO 99.95% (Three and a Half Nines):
  Annual: 262.8 minutes (4h 23m)
  Monthly: 21.6 minutes
  Weekly: 5.04 minutes

SLO 99.99% (Four Nines):
  Annual: 52.56 minutes
  Monthly: 4.32 minutes
  Weekly: 1.01 minutes
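
The figures above all come from one formula: window length in minutes multiplied by (1 - SLO). A minimal sketch follows; the function name `error_budget_minutes` is an assumption, and the SLO is passed as a fraction such as 0.999.

```python
def error_budget_minutes(slo, window_days):
    """Allowed downtime in minutes for an availability SLO over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return window_days * 24 * 60 * (1.0 - slo)


for slo in (0.999, 0.9995, 0.9999):
    monthly = error_budget_minutes(slo, 30)
    print(f"SLO {slo:.2%}: {monthly:.2f} minutes per 30 days")
```

This treats the budget as pure downtime. For a request-based SLI the same fraction applies to the number of failed requests rather than to minutes.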

2.2 Error Budget Policy Document

Error Budget Policy v2.0
Last Modified: 2025-04-14
Approved by: CTO, VP Engineering, SRE Director

1. Purpose
   Error Budget is a quantitative framework for maintaining
   balance between reliability and innovation.

2. Response by Budget Status

   Green (more than 50% budget remaining):
   - Normal feature development and deployment
   - Chaos experiments permitted
   - Aggressive release schedule possible

   Yellow (20-50% budget remaining):
   - Reduce deployment velocity (limit to 2x per week)
   - Prioritize reliability improvement work
   - Chaos experiments only in staging

   Red (less than 20% budget remaining):
   - Deploy only reliability-related changes
   - Feature freeze
   - SRE team acts as deployment gatekeeper
   - Focus on root cause analysis and resolution

   Exhausted (0% budget):
   - Complete halt of all non-essential deployments
   - Incident-level response
   - Joint SRE/dev reliability sprint
   - Executive reporting and recovery planning

3. Exceptions
   - Security patches: Deploy regardless of budget status
   - Legal requirements: Deploy regardless of budget status
   - Data integrity issues: Respond immediately

4. Review Cadence
   - Weekly: SLO dashboard review
   - Monthly: Error Budget consumption pattern analysis
   - Quarterly: SLO target reassessment

2.3 Error Budget Decision Making

Scenario: 35 of 43.2 monthly Error Budget minutes consumed (8.2 remaining)

Q: Should we proceed with the new feature deployment?

Analysis:
- Remaining budget: 8.2 minutes (19%)
- Status: Red (below 20%)
- Deployment risk: Average 2 minutes downtime from similar past deployments

Decision:
-> Defer feature deployment
-> Use remaining budget for reliability improvements
-> Proceed with deployment when budget resets next month

Communication:
"We have consumed 81% of this month's Error Budget.
 Per our policy, we will deploy only reliability-related changes.
 Feature X deployment is deferred to next month."
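
The status check in this scenario can be expressed as a small function over the policy thresholds. This is a sketch under assumptions: the name `budget_status` is illustrative, and a budget at exactly 50% remaining is treated as green.

```python
def budget_status(budget_minutes, consumed_minutes):
    """Return (status, remaining_fraction) under the policy thresholds."""
    remaining = max(budget_minutes - consumed_minutes, 0.0) / budget_minutes
    if remaining <= 0.0:
        status = "exhausted"
    elif remaining < 0.20:
        status = "red"
    elif remaining < 0.50:
        status = "yellow"
    else:
        status = "green"
    return status, remaining


status, remaining = budget_status(43.2, 35.0)
print(status, f"{remaining:.0%}")  # red 19%
```

A check like this could gate a deployment pipeline, with the security and legal exceptions from the policy handled as explicit overrides.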

3. Incident Management Lifecycle

3.1 Incident Severity Levels

Severity Definitions:

SEV1 (Critical):
  Definition: Core service total outage, many users affected
  Examples: Complete payment system down, data breach
  Response: Immediate (within 5 minutes)
  Escalation: VP + full on-call team
  Communication: Status update every 15 minutes
  Postmortem: Required (within 48 hours)

SEV2 (High):
  Definition: Major feature degradation, a significant share of users affected
  Examples: Search failing for 50% of requests, API latency up 10x
  Response: Within 15 minutes
  Escalation: Team lead + on-call
  Communication: Status update every 30 minutes
  Postmortem: Required (within 1 week)

SEV3 (Medium):
  Definition: Partial feature degradation, few users affected
  Examples: Slow image loading for specific region
  Response: Within 1 hour
  Escalation: On-call engineer
  Communication: As needed
  Postmortem: Optional (when learning value exists)

SEV4 (Low):
  Definition: Minor issue, minimal user impact
  Examples: Internal dashboard loading delay
  Response: Next business day
  Escalation: Owning team
  Communication: Not needed
  Postmortem: Not needed
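
One way to keep these definitions from drifting is to encode them as data that alerting and reporting scripts can share. The sketch below is a hypothetical encoding of the response and update targets listed above. The class and field names are assumptions, not part of any tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    respond_within_min: Optional[int]  # None = next business day
    update_every_min: Optional[int]    # None = no fixed cadence
    postmortem: str                    # "required", "optional", or "none"


# Values copied from the severity definitions above.
POLICIES = {
    "SEV1": SeverityPolicy(5, 15, "required"),
    "SEV2": SeverityPolicy(15, 30, "required"),
    "SEV3": SeverityPolicy(60, None, "optional"),
    "SEV4": SeverityPolicy(None, None, "none"),
}


def is_response_late(severity: str, minutes_since_alert: int) -> bool:
    """True when the first response missed the target for this severity."""
    limit = POLICIES[severity].respond_within_min
    return limit is not None and minutes_since_alert > limit
```

For example, `is_response_late("SEV1", 6)` is true, while a SEV4 is never flagged because its target is the next business day rather than a minute count.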

3.2 Incident Response Process

Incident Lifecycle:

1. Detection
   +-- Automated alerts (monitoring systems)
   +-- User reports (customer support)
   +-- Internal discovery (engineer directly)
   |
   v
2. Triage
   +-- Severity assessment (SEV1-4)
   +-- Assign Incident Commander
   +-- Assemble response team
   |
   v
3. Mitigation
   +-- Immediate mitigation actions
   +-- Rollback, scale up, traffic shift
   +-- Verify service recovery
   |
   v
4. Resolution
   +-- Identify root cause
   +-- Deploy permanent fix
   +-- Confirm stability via monitoring
   |
   v
5. Postmortem
   +-- Create timeline
   +-- Root cause analysis
   +-- Derive action items
   +-- Share company-wide

3.3 Incident Commander Role

The Incident Commander (IC) is the cornerstone of incident response.

Incident Commander Responsibilities:

1. Coordination
   - Assign roles to response team
   - Prioritize work
   - Prevent duplicate efforts
   - Secure necessary resources

2. Communication
   - Internal status updates (Slack channel)
   - External status updates (status page)
   - Executive reporting
   - Customer support briefing

3. Decision Making
   - Decide whether to rollback
   - Judge escalation needs
   - Request additional resources
   - Declare incident closure

4. Documentation
   - Record key event timestamps
   - Document decision rationale
   - Collect information for postmortem

IC Communication Template:

"Incident Status Update - [TIME]
 Severity: SEV[N]
 Status: [Investigating / Mitigating / Monitoring]
 Impact: [Description of impact]
 Current Action: [Ongoing actions]
 Next Steps: [Planned next actions]
 Next Update: [Time]"
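
If updates are posted by a bot or script, the template can be filled in programmatically. The following is a minimal sketch, assuming a hypothetical helper named `format_status_update`. It only builds the text and does not post anywhere.

```python
ALLOWED_STATUSES = {"Investigating", "Mitigating", "Monitoring"}


def format_status_update(time: str, severity: int, status: str, impact: str,
                         current_action: str, next_steps: str,
                         next_update: str) -> str:
    """Render the IC status update template as plain text."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {status}")
    lines = [
        f"Incident Status Update - {time}",
        f"Severity: SEV{severity}",
        f"Status: {status}",
        f"Impact: {impact}",
        f"Current Action: {current_action}",
        f"Next Steps: {next_steps}",
        f"Next Update: {next_update}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_status_update(
        time="14:30 UTC",
        severity=2,
        status="Investigating",
        impact="Payment requests are failing for some users",
        current_action="Checking database connection pool",
        next_steps="Increase pool size if exhaustion is confirmed",
        next_update="15:00 UTC",
    ))
```

Rejecting statuses outside the three in the template keeps updates consistent across incidents.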

3.4 Incident Response Tools

Incident Response Workflow Tools:

Alerting:
- PagerDuty: On-call management and alerting
- OpsGenie: Alert routing and escalation
- Grafana Alerting: Metric-based alerts

Communication:
- Slack (auto-create dedicated incident channels)
- Zoom/Google Meet (War Room)
- StatusPage (external status page)

Incident Management:
- Incident.io: Slack-integrated incident management
- Rootly: Automated incident workflows
- Blameless: Incident + postmortem platform
- FireHydrant: Incident response automation

Documentation:
- Confluence/Notion: Postmortem storage
- Google Docs: Real-time collaboration
- Jira: Action item tracking

4. Incident Communication

4.1 Internal Communication

Slack Incident Channel Structure:

#incident-2025-0414-payment
  - Main channel for incident response
  - IC, response team, observers participate
  - Bot automatically records timeline

#incident-2025-0414-payment-comms
  - External communication coordination
  - Customer support, PR team participate
  - Draft/approve customer-facing messages

Rules:
1. Main channel for response-related content only
2. Casual discussion or speculation in separate threads
3. Share all major changes in the channel
4. Non-IC members report to IC
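
The channel naming pattern above is easy to generate automatically when an incident is declared. This is a small sketch. The function name is hypothetical, and actually creating the channels through a chat API is out of scope here.

```python
import re
from datetime import date
from typing import Tuple


def incident_channels(day: date, service: str) -> Tuple[str, str]:
    """Return the (main, comms) channel names for an incident."""
    # Lowercase the service name and collapse anything else into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", service.lower()).strip("-")
    main = f"#incident-{day:%Y-%m%d}-{slug}"
    return main, f"{main}-comms"


if __name__ == "__main__":
    print(incident_channels(date(2025, 4, 14), "payment"))
```

For 14 April 2025 and the service name `payment`, this yields the two channel names shown in the structure above.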

4.2 External Communication

Status Page Update Templates:

Investigating:
"We are aware of an issue with [service name] causing [symptoms].
 Our engineering team is investigating the cause.
 We will provide updates as more information becomes available."

Mitigating:
"We have identified the cause of the [service name] issue
 and are implementing a fix.
 Some users may still experience impact."

Monitoring:
"We have applied a fix for the [service name] issue
 and the service is recovering.
 We are continuing to monitor."

Resolved:
"The [service name] issue has been resolved.
 Service was impacted for approximately [N] minutes
 from [start time] to [end time].
 A detailed postmortem will be shared."

5. Blameless Postmortem

5.1 Why Blameless

Blame culture causes the following problems:

Problems with Blame Culture:

1. Information hiding
   "If I report my mistake, I'll be punished"
   -> True causes remain hidden

2. Defensive behavior
   "I need to prove it wasn't my fault"
   -> Focus on avoiding blame rather than improving systems

3. Innovation suppression
   "We shouldn't change anything to avoid mistakes"
   -> Even necessary improvements don't get made

4. Trust destruction
   "My colleague might blame me"
   -> Team collaboration suffers

Blameless Culture:
"Fix the system, not the person"
- Everyone acted as they believed was correct
- Failures reveal system vulnerabilities
- Sharing mistakes is safe
- Root causes always lie in systems/processes

5.2 Postmortem Template

Postmortem: [Incident Title]
Date: YYYY-MM-DD
Author: [Name]
Reviewers: [Names]

1. Summary
   [1-2 sentence incident summary]

2. Impact
   - Duration: [Start] to [End] ([N] minutes)
   - Affected users: [Count or percentage]
   - Affected services: [Service list]
   - Financial impact: [If applicable]

3. Timeline (all times UTC)
   14:23 - Monitoring alert fires (payment-service error rate increase)
   14:25 - On-call engineer acknowledges
   14:28 - SEV2 incident declared, IC assigned
   14:30 - Incident channel created
   14:35 - Investigation: database connection pool exhaustion confirmed
   14:40 - Mitigation: increase connection pool size
   14:45 - Service recovery confirmed
   14:50 - Switch to monitoring state
   15:15 - Incident closure declared

4. Root Cause
   [Detailed technical explanation]

5. 5 Whys Analysis
   Why 1: Why did payments fail?
   -> Could not obtain a database connection

   Why 2: Why could it not obtain a connection?
   -> Connection pool was fully exhausted

   Why 3: Why was the connection pool exhausted?
   -> Slow queries were holding connections for too long

   Why 4: Why were there slow queries?
   -> Full table scan on a table without proper indexes

   Why 5: Why were indexes missing?
   -> No index review process when adding new tables

6. What Went Well
   - Alert fired within 2 minutes
   - IC quickly coordinated the team
   - Mitigation was effective

7. What Needs Improvement
   - No database connection pool monitoring
   - No slow query detection alerting
   - Need index review process for new tables

8. Action Items
   [HIGH] Add database connection pool utilization alerting
   Owner: DB Team | Due: 2025-04-21

   [HIGH] Implement new table/query index review checklist
   Owner: Backend Team | Due: 2025-04-28

   [MED] Build slow query auto-detection and alerting system
   Owner: SRE Team | Due: 2025-05-15

   [LOW] Investigate automatic connection pool sizing mechanism
   Owner: Infra Team | Due: 2025-06-01

9. Lessons Learned
   [Key lessons from this incident]
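
A timeline written in this form also gives response metrics almost for free. The sketch below derives a few intervals from the sample timeline in the template. The event labels are hypothetical names chosen for illustration, and it assumes all events fall on the same day.

```python
from datetime import datetime

# (time, event) pairs taken from the sample timeline above.
TIMELINE = [
    ("14:23", "alert_fired"),
    ("14:25", "acknowledged"),
    ("14:28", "incident_declared"),
    ("14:40", "mitigation_applied"),
    ("14:45", "recovery_confirmed"),
    ("15:15", "incident_closed"),
]


def minutes_between(timeline, start_event: str, end_event: str) -> int:
    """Whole minutes elapsed between two named events on the same day."""
    times = {event: datetime.strptime(t, "%H:%M") for t, event in timeline}
    delta = times[end_event] - times[start_event]
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    print("time to acknowledge:", minutes_between(TIMELINE, "alert_fired", "acknowledged"))
    print("time to recover:", minutes_between(TIMELINE, "alert_fired", "recovery_confirmed"))
```

For the sample incident this gives 2 minutes to acknowledge and 22 minutes from the first alert to confirmed recovery, which are the kinds of figures worth tracking across postmortems.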

5.3 Postmortem Review Process

Postmortem Review Checklist:

Writing Quality:
[ ] Is the timeline accurate and complete?
[ ] Is the root cause analyzed in depth?
[ ] Do the 5 Whys end at system/process? (must NOT end at a person)
[ ] Is the language blameless?

Action Items:
[ ] Do all action items have assigned owners?
[ ] Are deadlines realistic?
[ ] Are priorities appropriate?
[ ] Are the measures effective for preventing recurrence?

Sharing:
[ ] Has it been shared with relevant teams?
[ ] Is it scheduled for the company-wide postmortem review meeting?
[ ] Have similar services been checked for the same vulnerability?

5.4 Building Postmortem Culture

Postmortem Culture Practices:

1. Weekly Postmortem Review Meeting
   - Every Friday, 30 minutes
   - Share the week's incident postmortems
   - Open to all teams

2. Postmortem Reading Club
   - Analyze publicly available postmortems from other companies
   - Derive lessons applicable to our systems
   - Monthly

3. Blameless Language Guide
   Bad: "John deployed a query without indexes and caused the outage"
   Good: "A query without indexes was deployed to production.
          The code review process lacked a query performance review step."

4. Action Item Tracking
   - Manage action items via Jira board
   - Track monthly completion rate
   - Review incomplete items

6. On-Call Operations

6.1 On-Call Rotation

On-Call Rotation Design:

Core Principles:
1. Minimum 2 people always on-call (Primary + Secondary)
2. 1-week rotation (starts Monday 09:00)
3. No consecutive on-call (minimum 2-week gap)
4. Holidays are volunteer or extra-compensated

Rotation Example (6-person team):
Week 1: Alice (P) + Bob (S)
Week 2: Charlie (P) + Diana (S)
Week 3: Eve (P) + Frank (S)
Week 4: Bob (P) + Alice (S)
Week 5: Diana (P) + Charlie (S)
Week 6: Frank (P) + Eve (S)

Swap Rules:
- Minimum 48-hour advance notice
- Responsibility to find own replacement
- Notify team lead of swap
- Update on-call calendar
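
The example rotation follows a simple rule: pair people up, run through the pairs once, then run through them again with the Primary and Secondary roles swapped. A minimal sketch of that rule, with hypothetical function names, also checks the 2-week gap principle:

```python
def build_rotation(team):
    """Return a list of (primary, secondary) tuples, one per week."""
    if len(team) % 2:
        raise ValueError("team size must be even to form pairs")
    pairs = [(team[i], team[i + 1]) for i in range(0, len(team), 2)]
    swapped = [(secondary, primary) for primary, secondary in pairs]
    return pairs + swapped


def min_gap_weeks(rotation):
    """Smallest number of off-call weeks between two shifts of one person.

    The rotation is treated as repeating, so the gap wraps around the end.
    """
    n = len(rotation)
    weeks_by_person = {}
    for week, pair in enumerate(rotation):
        for person in pair:
            weeks_by_person.setdefault(person, []).append(week)
    gaps = []
    for weeks in weeks_by_person.values():
        for i, week in enumerate(weeks):
            nxt = weeks[(i + 1) % len(weeks)]
            gaps.append((nxt - week - 1) % n)
    return min(gaps)


if __name__ == "__main__":
    team = ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank"]
    for week, (primary, secondary) in enumerate(build_rotation(team), start=1):
        print(f"Week {week}: {primary} (P) + {secondary} (S)")
```

With the six-person team above, this reproduces the example schedule, and `min_gap_weeks` returns 2, matching the minimum-gap principle.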

6.2 Escalation Chain

Escalation Policy:

Level 1: On-Call Primary (Immediate)
  -> If no response within 5 minutes

Level 2: On-Call Secondary (5 minutes)
  -> If unresolved within 15 minutes

Level 3: Team Lead (15 minutes)
  -> If SEV1 or unresolved within 30 minutes

Level 4: Engineering Manager (30 minutes)
  -> If SEV1 persists for more than 1 hour

Level 5: VP/CTO (1 hour)

Automatic Escalation Configuration (PagerDuty):

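# Delays are per rule: the minutes an unacknowledged incident waits at a rule
# before moving on, so the cumulative times below are 5, 15, and 30 minutes.
escalation_policy:
  name: "payment-service"
  rules:
    - targets:
        - type: "user_reference"
          id: "PRIMARY_ON_CALL"
      escalation_delay_in_minutes: 5    # Secondary paged at 5 min
    - targets:
        - type: "user_reference"
          id: "SECONDARY_ON_CALL"
      escalation_delay_in_minutes: 10   # Team lead paged at 15 min
    - targets:
        - type: "user_reference"
          id: "TEAM_LEAD"
      escalation_delay_in_minutes: 15   # Team lead's window ends at 30 min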
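
The policy lists cumulative times (5, 15, 30, 60 minutes), while PagerDuty's `escalation_delay_in_minutes` counts from the moment each rule is reached. The sketch below converts between the two views. It models timing only and ignores the severity conditions in the policy, and the names are hypothetical.

```python
# (minutes since the incident triggered, who gets paged) from the policy above.
ESCALATION_LEVELS = [
    (0, "On-Call Primary"),
    (5, "On-Call Secondary"),
    (15, "Team Lead"),
    (30, "Engineering Manager"),
    (60, "VP/CTO"),
]


def paged_so_far(minutes_unacknowledged: int):
    """Everyone who has been paged after the given number of minutes."""
    return [who for start, who in ESCALATION_LEVELS
            if minutes_unacknowledged >= start]


def per_rule_delays(levels):
    """Convert cumulative start times into per-rule delays."""
    starts = [start for start, _ in levels]
    return [later - earlier for earlier, later in zip(starts, starts[1:])]


if __name__ == "__main__":
    print(paged_so_far(20))
    print(per_rule_delays(ESCALATION_LEVELS))
```

Here the cumulative schedule corresponds to per-rule delays of 5, 10, 15, and 30 minutes.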

6.3 On-Call Fatigue Management

On-Call Fatigue Management Strategies:

1. Improve Alert Quality
   Problem: Too many alerts -> alert fatigue -> miss real issues
   Solutions:
   - Weekly alert review: disable unnecessary alerts
   - Eliminate duplicate alerts
   - Add context to alerts (what is wrong, how to respond)
   - Target: fewer than 20 alerts per on-call week

2. Write Runbooks
   Create runbooks for all recurring alerts
   Runbook structure:
   - Symptom description
   - Impact scope assessment method
   - Step-by-step response procedure
   - Escalation criteria
   - Related dashboard links

3. Compensation
   - On-call stipend (fixed weekly amount)
   - Additional compensation for night/weekend calls
   - Compensatory time off
   - Special compensation for extended on-call periods

4. Mental Health
   - Debrief session after on-call rotation
   - Mental Health Day after difficult incidents
   - Complete disconnect when not on-call
   - Team culture of sharing on-call experiences

6.4 Runbook Example

Runbook: payment-service High Error Rate

Trigger: payment-service HTTP 5xx rate exceeds 1%

1. Assess the Situation
   - Check Grafana dashboard:
     [Dashboard URL]
   - Check error logs:
     kubectl logs -l app=payment-service --tail=100 -n production
   - Check recent deployments:
     kubectl rollout history deployment/payment-service -n production

2. Common Causes

   Cause A: Database connection issue
   Check: Review connection pool metrics
   Action: Reset or expand connection pool
   Command:
     kubectl rollout restart deployment/payment-service -n production

   Cause B: External API failure
   Check: Check external API status page
   Action: Force-open circuit breaker
   Command:
     curl -X POST http://payment-service/admin/circuit-breaker/open

   Cause C: Recent deployment issue
   Check: Compare deployment timeline with error start time
   Action: Rollback to previous version
   Command:
     kubectl rollout undo deployment/payment-service -n production

3. Escalation
   - If unresolved within 15 minutes: Escalate to team lead
   - If judged SEV1: Declare incident

7. Toil Elimination

7.1 What is Toil

Characteristics of Toil as defined in the Google SRE book:

Toil Characteristics:

1. Manual
   Work that requires a human to perform
   Example: Manually restarting servers

2. Repetitive
   Performing the same task repeatedly
   Example: Renewing certificates weekly

3. Automatable
   Work that a machine could do instead
   Example: Running disk cleanup scripts

4. Tactical
   Interrupt-driven, reactive work rather than strategy-driven, proactive work
   Example: Applying temporary workarounds

5. Scales with Service
   Work increases as service grows
   Example: Manual user provisioning

6. No Enduring Value
   Service is not improved after the work
   Example: Manual deployment approval

7.2 Measuring Toil

Toil Measurement Methods:

1. Time Tracking
   - Record all work activities for 2 weeks
   - Classify Toil vs Engineering time
   - Target: Toil below 50%

2. Toil Categories:
   Category A: Incident response (emergency)
   Category B: Scheduled operational tasks (planned)
   Category C: Manual provisioning (request-based)
   Category D: Manual monitoring/verification (checks)

3. Measurement Template:

   Team: SRE Team
   Period: 2025-04-01 to 2025-04-14

   | Task | Frequency | Time Spent | Toil? | Automatable? |
   |------|-----------|-----------|-------|-------------|
   | Certificate renewal | Weekly | 30min | Yes | Yes |
   | Disk cleanup | Daily | 15min | Yes | Yes |
   | Deploy approval | 3x daily | 10min each | Yes | Yes |
   | Capacity review | Weekly | 2 hours | Partial | Partial |
   | Incident response | 2x weekly | 1hr each | No | No |

   Total hours: 80 hours (2 weeks)
   Toil hours: 45 hours (56%)
   Target: below 40 hours (50%)
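
The percentages in the template are simple ratios, and the per-task hours can be derived from frequency and duration. The sketch below works through both. It assumes 10 working days per 2-week period, uses only the clear-cut rows of the sample table (the "Partial" row is left out), and all names are hypothetical. The sample rows are a subset of the team's work, so they do not add up to the 45-hour total.

```python
WORKDAYS = 10  # assumption: working days in a 2-week period

# (task, occurrences per 2 weeks, minutes each, counted as toil?)
SAMPLE_TASKS = [
    ("Certificate renewal", 2, 30, True),
    ("Disk cleanup", WORKDAYS, 15, True),
    ("Deploy approval", 3 * WORKDAYS, 10, True),
    ("Incident response", 4, 60, False),
]


def toil_hours(tasks) -> float:
    """Hours spent on rows marked as toil."""
    return sum(n * minutes for _, n, minutes, is_toil in tasks if is_toil) / 60


def toil_ratio(toil: float, total: float) -> float:
    """Toil as a fraction of total working hours."""
    return toil / total


if __name__ == "__main__":
    print("toil in sample rows (hours):", toil_hours(SAMPLE_TASKS))
    ratio = toil_ratio(45, 80)
    print(f"team toil ratio: {ratio:.0%}, over 50% target: {ratio > 0.5}")
```

The sample rows account for 8.5 hours of toil, and the team-level totals of 45 out of 80 hours give the 56% shown in the template.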

7.3 Toil Automation Priority

Automation Priority Matrix:

              High Frequency    Low Frequency
High Impact  | P1: Automate   | P2: Plan to  |
             | immediately    | automate     |
             |----------------|--------------|
Low Impact   | P3: Add to     | P4: Can      |
             | backlog        | ignore       |

P1 Automation Examples:
- Automatic certificate renewal (cert-manager)
- Automatic disk cleanup (cron job)
- Automatic deploy approval (CI/CD pipeline)
- Auto-scaling (HPA/VPA)

P2 Automation Examples:
- Capacity planning automation (prediction-based)
- Incident response automation (auto-recovery)
- Automatic security patch application

7.4 Automation Examples

# Automatic certificate renewal with cert-manager
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-service-tls
  namespace: production
spec:
  secretName: my-service-tls-secret
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - api.example.com
    - www.example.com
  renewBefore: 720h  # Auto-renew 30 days before expiry

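# Disk Cleanup Automation CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: disk-cleanup
  namespace: production
spec:
  schedule: "0 3 * * *"  # Daily at 3 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cleanup
              image: busybox:1.36
              command:
                - /bin/sh
                - -c
                - |
                  # Delete only regular files so the directories stay in place
                  find /data/logs -type f -mtime +7 -delete
                  find /data/tmp -type f -mtime +1 -delete
              # Without a mount, the job would only see the container's own filesystem
              volumeMounts:
                - name: data
                  mountPath: /data
          volumes:
            - name: data
              persistentVolumeClaim:
                claimName: app-data  # Assumed PVC name; point at the volume holding /data
          restartPolicy: OnFailure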

# HPA Auto-scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60

8. Capacity Planning

8.1 Demand Forecasting

Capacity Planning Framework:

1. Assess Current Capacity
   - CPU: 60% of total cluster in use
   - Memory: 55% of total cluster in use
   - Storage: Growing 100GB per month
   - Network: 40% of peak bandwidth in use

2. Growth Rate Analysis
   - Last 12 months traffic growth: 15% monthly
   - Seasonal patterns: Black Friday 3x, New Year 2x
   - Planned events: New feature launch expected 50% traffic increase

3. Headroom Buffer
   - General buffer: 30-50%
   - N+1 principle: Service survives 1 node failure
   - Peak preparedness: Handle 3x normal traffic

4. Provisioning Lead Time
   - Cloud: Minutes (auto-scaling)
   - On-premises: Weeks to months (hardware purchase)
   - Hybrid: Base on-premises + burst to cloud

8.2 Load Testing

// k6 Load Test Script
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '5m', target: 100 },   // Ramp up to 100 VUs
    { duration: '10m', target: 100 },   // Maintain 100 VUs
    { duration: '5m', target: 500 },    // Ramp up to 500 VUs
    { duration: '10m', target: 500 },   // Maintain 500 VUs
    { duration: '5m', target: 1000 },   // Ramp up to 1000 VUs
    { duration: '10m', target: 1000 },  // Maintain 1000 VUs
    { duration: '5m', target: 0 },      // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(99)<500'],
    http_req_failed: ['rate<0.01'],
  },
};

export default function () {
  const res = http.get('https://api.example.com/health');
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });
  sleep(1);
}

9. Release Engineering

9.1 Canary Deployment

# Argo Rollouts Canary Deployment
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
  namespace: production
spec:
  replicas: 10
  # selector and pod template (or workloadRef) omitted for brevity
  strategy:
    canary:
      canaryService: payment-service-canary
      stableService: payment-service-stable
      trafficRouting:
        istio:
          virtualService:
            name: payment-service-vs
      steps:
        - setWeight: 5
        - pause:
            duration: 10m
        - setWeight: 20
        - pause:
            duration: 10m
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 50
        - pause:
            duration: 15m
        - setWeight: 100
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 2
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
  namespace: production  # Must be in the same namespace as the Rollout
spec:
  metrics:
    - name: success-rate
      interval: 60s
      successCondition: result[0] >= 0.999
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{
              service="payment-service",
              status=~"2.."
            }[5m]))
            /
            sum(rate(http_requests_total{
              service="payment-service"
            }[5m]))

9.2 Feature Flags

Feature Flag Deployment Strategy:

Phase 1: Internal employees only (dogfooding)
   percentage: 0, whitelist: [internal employee IDs]

Phase 2: 1% rollout
   percentage: 1

Phase 3: 10% rollout
   percentage: 10

Phase 4: 50% rollout
   percentage: 50

Phase 5: Full rollout
   percentage: 100

At each phase:
- Monitor error rates
- Collect user feedback
- Check business metrics
- Immediately disable if issues found
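
A common way to implement percentage rollouts is to hash each user into a stable bucket from 0 to 99 and enable the flag for buckets below the current percentage. The sketch below illustrates that idea and is not taken from any particular feature-flag product. The function names are hypothetical, and `allowlist` plays the role of the `whitelist` in Phase 1.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Stable bucket in the range 0-99 for this flag and user."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str, percentage: int,
               allowlist=frozenset()) -> bool:
    """True if the flag is on for this user at the given rollout percentage."""
    if user_id in allowlist:
        return True
    return bucket(flag, user_id) < percentage


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    for pct in (0, 1, 10, 50, 100):
        enabled = sum(is_enabled("new-checkout", u, pct) for u in users)
        print(f"{pct:>3}% rollout -> {enabled} of {len(users)} users")
```

Because the bucket depends only on the flag name and user ID, a user who gets the feature at 10% keeps it at 50%, and setting the percentage back to 0 disables it for everyone outside the allowlist.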

10. Google SRE Book Key Lessons

Here are 10 key lessons drawn from the Google SRE book.

Lesson 1: 100% availability is the wrong target
  -> No perfect system exists
  -> Define appropriate reliability level with SLOs

Lesson 2: Error Budget enables innovation
  -> With budget, you can take risks
  -> Without budget, focus on reliability

Lesson 3: Toil exceeding 50% is a danger signal
  -> If more than 50% of SRE time is Toil
  -> Invest in automation or expand the team

Lesson 4: Monitoring should be symptom-based
  -> "CPU is at 90%" (cause) vs "Response time is slow" (symptom)
  -> Monitor symptoms that users experience

Lesson 5: Postmortems must be blameless
  -> Fix the system, not the person
  -> Create a culture where sharing mistakes is safe

Lesson 6: Simplicity is reliability
  -> Complex systems fail in unpredictable ways
  -> Keep things as simple as possible

Lesson 7: Release engineering is key to reliability
  -> Canary deployment, feature flags, automatic rollback
  -> Safe deployments enable frequent deployments

Lesson 8: Capacity planning is proactive, not reactive
  -> Prepare before traffic grows
  -> Understand limits through load testing

Lesson 9: On-call must be sustainable
  -> Reduce alert fatigue
  -> Provide appropriate compensation and rest

Lesson 10: Reliability is the whole organization's responsibility
  -> Not just the SRE team's job
  -> Development teams must share reliability responsibility

11. SRE Tool Ecosystem

11.1 Incident Management Tools

Tool Comparison:

PagerDuty:
  - Market leader
  - Strong on-call management
  - 800+ integrations
  - AI-based incident classification
  - Pricing: from $21/user/month

OpsGenie (Atlassian):
  - Native Jira/Confluence integration
  - Flexible alert routing
  - Team collaboration features
  - Pricing: from $9/user/month

Incident.io:
  - Slack-native
  - Automated workflows
  - Postmortem generation automation
  - Pricing: from $16/user/month

Rootly:
  - Slack-based incident management
  - Auto-execute runbooks
  - Rich integrations
  - Pricing: from $15/user/month

FireHydrant:
  - Full incident lifecycle management
  - Status page integration
  - Automatic escalation
  - Pricing: Free tier + paid plans

11.2 Observability Tools

Three Pillars of Observability:

Logs:
  - ELK Stack (Elasticsearch + Logstash + Kibana)
  - Loki + Grafana
  - Datadog Logs
  - Splunk

Metrics:
  - Prometheus + Grafana
  - Datadog
  - New Relic
  - CloudWatch

Traces:
  - Jaeger
  - Zipkin
  - Datadog APM
  - OpenTelemetry (unified standard)

12. Building an SRE Team

12.1 Team Models

SRE Team Models:

1. Centralized Model
   Traits: One SRE team covers all services
   Pros: Consistent practices, easy knowledge sharing
   Cons: Bottleneck, lack of per-service depth
   Fits: Small organizations (fewer than 10 services)

2. Embedded Model
   Traits: SREs embedded within development teams
   Pros: Deep service understanding, fast response
   Cons: Inconsistency, risk of SRE isolation
   Fits: Large organizations (50+ services)

3. Hybrid Model
   Traits: Central SRE team + SRE champions in each team
   Pros: Balance of consistency and depth
   Cons: Coordination overhead
   Fits: Medium organizations (10-50 services)

4. Consulting Model
   Traits: SRE engages only when needed
   Pros: Scalable, cost-efficient
   Cons: Lack of continuous involvement
   Fits: Organizations early in SRE adoption

12.2 SRE Hiring

Key SRE Competencies:

Technical:
- System administration (Linux, networking, storage)
- Programming (Python, Go, Bash)
- Cloud platforms (AWS, GCP, Azure)
- Containers/orchestration (Docker, Kubernetes)
- Monitoring/observability (Prometheus, Grafana)
- CI/CD (Jenkins, GitHub Actions, ArgoCD)
- IaC (Terraform, Pulumi)

Soft Skills:
- Problem solving (systematic debugging)
- Communication (clear during incidents)
- Stress management (decisions under on-call pressure)
- Documentation skills (runbooks, postmortems)
- Collaboration (working with dev teams)

12.3 SRE Onboarding

SRE Onboarding Program (12 Weeks):

Week 1-2: Foundations
  - Architecture overview
  - Core service understanding
  - Monitoring tool training
  - On-call tool setup

Week 3-4: Observation
  - Shadow senior SRE's on-call
  - Observe incident response
  - Read and understand runbooks

Week 5-6: Practice
  - Handle simple incidents (with senior support)
  - Update runbooks
  - Improve monitoring alerts

Week 7-8: Independence
  - Perform Secondary on-call
  - Write postmortems
  - Start Toil automation project

Week 9-10: Advanced
  - Perform Primary on-call
  - Experience as Incident Commander
  - Participate in SLO reviews

Week 11-12: Contribution
  - Complete automation project
  - Improve onboarding documentation
  - Prepare to mentor next new member

13. SRE Time Allocation

Ideal SRE Time Allocation:

Engineering Work: 50% or more
  - Automation development
  - Tool improvement
  - System design
  - Code review

Operational Work (Toil): 50% or less
  - On-call response
  - Deploy management
  - Manual provisioning
  - Manual monitoring

Warning Signs:
- Toil over 50%: Need automation investment
- Toil over 70%: Need team expansion or service redesign
- Toil over 90%: Crisis - immediate executive intervention needed

SRE Time Tracking Method:
- Record time by category in 2-week intervals
- Quarterly Toil ratio report
- Set and track Toil reduction targets
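
The warning thresholds can be checked automatically whenever the 2-week time records are tallied. A minimal sketch, assuming a hypothetical function that takes the Toil share as a fraction between 0 and 1:

```python
def toil_warning(toil_fraction: float) -> str:
    """Map a Toil share (0.0 to 1.0) to the warning levels above."""
    if not 0.0 <= toil_fraction <= 1.0:
        raise ValueError("toil_fraction must be between 0 and 1")
    if toil_fraction > 0.90:
        return "crisis: immediate executive intervention needed"
    if toil_fraction > 0.70:
        return "need team expansion or service redesign"
    if toil_fraction > 0.50:
        return "need automation investment"
    return "ok"


if __name__ == "__main__":
    for share in (0.40, 0.56, 0.75, 0.95):
        print(f"{share:.0%}: {toil_warning(share)}")
```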

14. Quiz

Test your understanding with these questions.

Q1: With an SLO of 99.9%, what is the monthly Error Budget?

Answer: Approximately 43.2 minutes

Calculation: 30 days x 24 hours x 60 minutes x (1 - 0.999) = 30 x 24 x 60 x 0.001 = 43.2 minutes

This means approximately 43 minutes of downtime is allowed per month. If this is exceeded, the Error Budget is exhausted, and per policy, feature deployments should stop to focus on reliability improvements.

Q2: Why must the "5 Whys" analysis in a blameless postmortem end at system/process, not at a person?

Answer:

If the 5 Whys analysis ends at "someone made a mistake," the improvement action becomes just "train that person" -- which is not a fundamental fix.

Reasons it must end at system/process:

Prevents recurrence: Fixing the system means the same mistake cannot happen regardless of who performs the task.
Information sharing: When people fear blame, they hide mistakes, making it harder to find the real causes.
Scalable solutions: "Training" applies to one person, but "automated verification" applies to all deployments.

Bad: "The developer deployed a query without indexes" (blaming person)
Good: "There was no query performance review step in the deployment process" (improving process)

Q3: List the 6 characteristics of Toil and explain why it should be kept below 50%.

Answer:

6 characteristics of Toil:

Manual: Requires a human to perform
Repetitive: Same task repeated
Automatable: Can be replaced by machines
Tactical: Interrupt-driven and reactive, not strategy-driven
Scales with Service: Increases as the service grows
No Enduring Value: The service is no better after the work is done

Why keep below 50%:

The core value of SRE is improving operations through engineering
When Toil exceeds 50%, there is insufficient time for automation and system improvement
This creates a vicious cycle: too much Toil to automate, and no automation means even more Toil
The Google SRE book sets a goal of keeping Toil below 50% of each SRE's time

Q4: Describe 3 strategies for reducing on-call fatigue.

Answer:

1. Improve alert quality
   - Disable unnecessary alerts (weekly alert review)
   - Add sufficient context to alerts (what is wrong, how to respond)
   - Target: fewer than 20 alerts per on-call week

2. Write and maintain runbooks
   - Document step-by-step response procedures for all recurring alerts
   - Runbooks enable fast, accurate response even during night calls
   - Regularly update runbooks

3. Provide adequate compensation and rest
   - Offer on-call stipend and extra compensation for night/weekend calls
   - Provide Mental Health Day after difficult incidents
   - Ensure complete disconnect when not on-call to prevent burnout

Q5: Explain what "class SRE implements DevOps" means.

Answer:

This expression uses an object-oriented programming analogy to explain the relationship between SRE and DevOps.

DevOps is the Interface: It defines culture, philosophy, and values. It prescribes principles like "dev and ops must collaborate," "automate everything," and "continuously improve" -- but does not specify concrete implementation methods.
SRE is the Implementation Class: It concretely implements the DevOps philosophy. It quantifies objectives with SLO/SLI, provides a decision-making framework with Error Budgets, and executes through Toil measurement and automation.

In other words, if DevOps defines "what to do," SRE specifies "how to do it." They are complementary, not competitive.

References

Site Reliability Engineering - Betsy Beyer et al. (Google, O'Reilly)
The Site Reliability Workbook - Betsy Beyer et al. (Google, O'Reilly)
Building Secure and Reliable Systems - Heather Adkins et al. (Google, O'Reilly)
Google SRE Resources - sre.google
Implementing Service Level Objectives - Alex Hidalgo (O'Reilly)
Incident Management for Operations - Rob Schnepp et al. (O'Reilly)
PagerDuty Incident Response Guide - response.pagerduty.com
Atlassian Incident Management Handbook - atlassian.com/incident-management
Blameless Postmortem Guide - blameless.com
Rootly SRE Guide - rootly.com/blog
Incident.io Blog - incident.io/blog
Netflix Tech Blog: SRE Practices - netflixtechblog.com
LinkedIn SRE Practices - engineering.linkedin.com
Dropbox SRE - dropbox.tech