SRE Practices Guide 2025: Incident Management, Postmortem, Error Budget, On-Call, Toil Elimination

Introduction: What is SRE

Site Reliability Engineering (SRE) is a software engineering approach to operations, created at Google.

Ben Treynor Sloss (Google VP of Engineering) famously defined it:

"SRE is what happens when you ask a software engineer to design an operations function."

SRE is not just a job title -- it is a culture and philosophy. The core principles are:

SRE Core Principles:
1. Apply software engineering to operations problems
2. Quantify reliability targets with SLOs
3. Balance innovation and reliability with Error Budgets
4. Reduce Toil and invest in automation
5. Monitoring should be symptom-based, not cause-based
6. Pursue simplicity
7. Learn continuously through blameless postmortems

SRE vs DevOps

Relationship between DevOps and SRE:

DevOps = Culture, philosophy, values
  - Collaboration between dev and ops
  - Continuous integration/delivery
  - Infrastructure as Code
  - Feedback loops

SRE = Concrete implementation of DevOps
  - "class SRE implements DevOps"
  - Measurable objectives (SLO/SLI)
  - Error Budget as a decision framework
  - Engineering-based operations approach

They complement, not compete:
DevOps defines "what to do", SRE defines "how to do it"

1. SLO, SLI, SLA

1.1 Concepts

SLA (Service Level Agreement)
= Contract between service provider and customer
= Financial/legal consequences for violations
Example: "99.9% monthly availability guaranteed, 10% service credit if missed"

SLO (Service Level Objective)
= Internal target
= Set stricter than SLA (to maintain buffer)
Example: "99.95% monthly availability target" (stricter than 99.9% SLA)

SLI (Service Level Indicator)
= Actual measurement
= Metrics used to evaluate SLOs
Example: "Actual availability over last 30 days: 99.97%"
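The relationship between the three terms can be sketched in a few lines of code. The request counts below are invented for illustration:

```python
# Minimal sketch: an SLI is what you measure; the SLO is the stricter
# internal target; the SLA is the looser contractual guarantee.

def availability_sli(successful: int, total: int) -> float:
    """SLI: actual measured availability as a percentage."""
    return 100.0 * successful / total

sli = availability_sli(successful=999_700, total=1_000_000)
SLO = 99.95   # internal target, stricter than the SLA
SLA = 99.90   # contractual guarantee with a buffer below the SLO

print(f"SLI: {sli:.2f}%")       # SLI: 99.97%
print("SLO met:", sli >= SLO)   # True
print("SLA met:", sli >= SLA)   # True
```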

1.2 Choosing Good SLIs

SLI Examples by Service Type:

API Services:
- Availability: Successful requests / Total requests
- Latency: Proportion of requests served within 200ms
- Throughput: Requests processed per second

Data Pipelines:
- Freshness: Proportion of data processed within N minutes
- Completeness: Processed records / Expected records
- Correctness: Proportion of correctly processed records

Storage Systems:
- Durability: Proportion of data preserved without loss
- Availability: Successful read/write request ratio
- Latency: p50 read latency

1.3 SLO Setting Guide

SLO Setting Process:

Step 1: Start from the user perspective
  "What matters most to users?"
  -> Page load time, payment success rate, notification delay

Step 2: Define SLIs
  Availability SLI = Successful HTTP responses (2xx, 3xx) / Total responses
  Latency SLI = Proportion of responses within 200ms

Step 3: Set initial SLO
  - Analyze current performance data (last 30 days)
  - Base the initial SLO on what the service already achieves (slightly below measured performance)
  - Don't set it too high!

Step 4: Calculate Error Budget
  SLO 99.9% -> Error Budget = 0.1%
  Monthly: 30 days x 24 hours x 60 minutes x 0.001 = 43.2 minutes

Step 5: Iterate
  - Review SLO every 4 weeks
  - Incorporate user feedback
  - Adjust SLO as needed

1.4 SLO Dashboard

# Prometheus + Grafana SLO Dashboard
# Availability SLO Query
availability_slo:
  target: 99.9
  window: 30d
  query: |
    sum(rate(http_requests_total{status=~"2.."}[30d]))
    /
    sum(rate(http_requests_total[30d]))
    * 100

# Latency SLO Query
latency_slo:
  target: 99.0
  threshold: 200ms
  window: 30d
  query: |
    sum(rate(http_request_duration_seconds_bucket{le="0.2"}[30d]))
    /
    sum(rate(http_request_duration_seconds_count[30d]))
    * 100

# Error Budget Consumption Rate
error_budget_consumption:
  query: |
    1 - (
      sum(rate(http_requests_total{status=~"2.."}[30d]))
      /
      sum(rate(http_requests_total[30d]))
    )
    /
    (1 - 0.999)
    * 100

2. Error Budget Policy

2.1 What is Error Budget

Error Budget is the "acceptable unreliability" derived from the SLO.

Error Budget Calculations:

SLO 99.9% (Three Nines):
  Annual: 365d x 24h x 60m x 0.001 = 525.6 minutes (8h 45m)
  Monthly: 30d x 24h x 60m x 0.001 = 43.2 minutes
  Weekly: 7d x 24h x 60m x 0.001 = 10.08 minutes

SLO 99.95% (Three and a Half Nines):
  Annual: 262.8 minutes (4h 23m)
  Monthly: 21.6 minutes
  Weekly: 5.04 minutes

SLO 99.99% (Four Nines):
  Annual: 52.56 minutes
  Monthly: 4.32 minutes
  Weekly: 1.01 minutes
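The table above follows from a one-line formula, which is worth having as a helper:

```python
# Allowed downtime for a given SLO over a window of N days.

def error_budget_minutes(slo: float, days: int) -> float:
    """Error Budget in minutes: window length times (1 - SLO)."""
    return days * 24 * 60 * (1 - slo / 100)

for slo in (99.9, 99.95, 99.99):
    print(f"SLO {slo}% -> monthly budget: {error_budget_minutes(slo, 30):.2f} min")
# SLO 99.9%  -> monthly budget: 43.20 min
# SLO 99.95% -> monthly budget: 21.60 min
# SLO 99.99% -> monthly budget: 4.32 min
```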

2.2 Error Budget Policy Document

Error Budget Policy v2.0
Last Modified: 2025-04-14
Approved by: CTO, VP Engineering, SRE Director

1. Purpose
   Error Budget is a quantitative framework for maintaining
   balance between reliability and innovation.

2. Response by Budget Status

   Green (more than 50% budget remaining):
   - Normal feature development and deployment
   - Chaos experiments permitted
   - Aggressive release schedule possible

   Yellow (20-50% budget remaining):
   - Reduce deployment velocity (limit to 2x per week)
   - Prioritize reliability improvement work
   - Chaos experiments only in staging

   Red (less than 20% budget remaining):
   - Deploy only reliability-related changes
   - Feature freeze
   - SRE team acts as deployment gatekeeper
   - Focus on root cause analysis and resolution

   Exhausted (0% budget):
   - Complete halt of all non-essential deployments
   - Incident-level response
   - Joint SRE/dev reliability sprint
   - Executive reporting and recovery planning

3. Exceptions
   - Security patches: Deploy regardless of budget status
   - Legal requirements: Deploy regardless of budget status
   - Data integrity issues: Respond immediately

4. Review Cadence
   - Weekly: SLO dashboard review
   - Monthly: Error Budget consumption pattern analysis
   - Quarterly: SLO target reassessment
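The status bands in section 2 can be encoded as a simple classifier; the thresholds below mirror the policy document above:

```python
# Map remaining Error Budget (as a percentage) to the policy status.
# Boundaries follow the policy: Green is strictly more than 50%,
# Red is strictly less than 20%.

def budget_status(remaining_pct: float) -> str:
    if remaining_pct <= 0:
        return "Exhausted"   # halt all non-essential deployments
    if remaining_pct < 20:
        return "Red"         # reliability-only changes, feature freeze
    if remaining_pct <= 50:
        return "Yellow"      # reduced deployment velocity
    return "Green"           # normal feature development

print(budget_status(81))  # Green
print(budget_status(35))  # Yellow
print(budget_status(19))  # Red
print(budget_status(0))   # Exhausted
```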

2.3 Error Budget Decision Making

Scenario: 35 of 43.2 monthly Error Budget minutes consumed (8.2 remaining)

Q: Should we proceed with the new feature deployment?

Analysis:
- Remaining budget: 8.2 minutes (19%)
- Status: Red (below 20%)
- Deployment risk: Average 2 minutes downtime from similar past deployments

Decision:
-> Defer feature deployment
-> Use remaining budget for reliability improvements
-> Proceed with deployment when budget resets next month

Communication:
"We have consumed 81% of this month's Error Budget.
 Per our policy, we will deploy only reliability-related changes.
 Feature X deployment is deferred to next month."

3. Incident Management Lifecycle

3.1 Incident Severity Levels

Severity Definitions:

SEV1 (Critical):
  Definition: Core service total outage, many users affected
  Examples: Complete payment system down, data breach
  Response: Immediate (within 5 minutes)
  Escalation: VP + full on-call team
  Communication: Status update every 15 minutes
  Postmortem: Required (within 48 hours)

SEV2 (High):
  Definition: Major feature degradation, significant users affected
  Examples: Search 50% failure, API latency 10x increase
  Response: Within 15 minutes
  Escalation: Team lead + on-call
  Communication: Status update every 30 minutes
  Postmortem: Required (within 1 week)

SEV3 (Medium):
  Definition: Partial feature degradation, few users affected
  Examples: Slow image loading for specific region
  Response: Within 1 hour
  Escalation: On-call engineer
  Communication: As needed
  Postmortem: Optional (when learning value exists)

SEV4 (Low):
  Definition: Minor issue, minimal user impact
  Examples: Internal dashboard loading delay
  Response: Next business day
  Escalation: Owning team
  Communication: Not needed
  Postmortem: Not needed

3.2 Incident Response Process

Incident Lifecycle:

1. Detection
   +-- Automated alerts (monitoring systems)
   +-- User reports (customer support)
   +-- Internal discovery (engineer directly)
   |
   v
2. Triage
   +-- Severity assessment (SEV1-4)
   +-- Assign Incident Commander
   +-- Assemble response team
   |
   v
3. Mitigation
   +-- Immediate mitigation actions
   +-- Rollback, scale up, traffic shift
   +-- Verify service recovery
   |
   v
4. Resolution
   +-- Identify root cause
   +-- Deploy permanent fix
   +-- Confirm stability via monitoring
   |
   v
5. Postmortem
   +-- Create timeline
   +-- Root cause analysis
   +-- Derive action items
   +-- Share company-wide

3.3 Incident Commander Role

The Incident Commander (IC) is the cornerstone of incident response.

Incident Commander Responsibilities:

1. Coordination
   - Assign roles to response team
   - Prioritize work
   - Prevent duplicate efforts
   - Secure necessary resources

2. Communication
   - Internal status updates (Slack channel)
   - External status updates (status page)
   - Executive reporting
   - Customer support briefing

3. Decision Making
   - Decide whether to rollback
   - Judge escalation needs
   - Request additional resources
   - Declare incident closure

4. Documentation
   - Record key event timestamps
   - Document decision rationale
   - Collect information for postmortem

IC Communication Template:

"Incident Status Update - [TIME]
 Severity: SEV[N]
 Status: [Investigating / Mitigating / Monitoring]
 Impact: [Description of impact]
 Current Action: [Ongoing actions]
 Next Steps: [Planned next actions]
 Next Update: [Time]"

3.4 Incident Response Tools

Incident Response Workflow Tools:

Alerting:
- PagerDuty: On-call management and alerting
- OpsGenie: Alert routing and escalation
- Grafana Alerting: Metric-based alerts

Communication:
- Slack (auto-create dedicated incident channels)
- Zoom/Google Meet (War Room)
- StatusPage (external status page)

Incident Management:
- Incident.io: Slack-integrated incident management
- Rootly: Automated incident workflows
- Blameless: Incident + postmortem platform
- FireHydrant: Incident response automation

Documentation:
- Confluence/Notion: Postmortem storage
- Google Docs: Real-time collaboration
- Jira: Action item tracking

4. Incident Communication

4.1 Internal Communication

Slack Incident Channel Structure:

#incident-2025-0414-payment
  - Main channel for incident response
  - IC, response team, observers participate
  - Bot automatically records timeline

#incident-2025-0414-payment-comms
  - External communication coordination
  - Customer support, PR team participate
  - Draft/approve customer-facing messages

Rules:
1. Main channel for response-related content only
2. Casual discussion or speculation in separate threads
3. Share all major changes in the channel
4. Non-IC members report to IC

4.2 External Communication

Status Page Update Templates:

Investigating:
"We are aware of an issue with [service name] causing [symptoms].
 Our engineering team is investigating the cause.
 We will provide updates as more information becomes available."

Mitigating:
"We have identified the cause of the [service name] issue
 and are implementing a fix.
 Some users may still experience impact."

Monitoring:
"We have applied a fix for the [service name] issue
 and the service is recovering.
 We are continuing to monitor."

Resolved:
"The [service name] issue has been resolved.
 Service was impacted for approximately [N] minutes
 from [start time] to [end time].
 A detailed postmortem will be shared."

5. Blameless Postmortem

5.1 Why Blameless

Blame culture causes the following problems:

Problems with Blame Culture:

1. Information hiding
   "If I report my mistake, I'll be punished"
   -> True causes remain hidden

2. Defensive behavior
   "I need to prove it wasn't my fault"
   -> Focus on avoiding blame rather than improving systems

3. Innovation suppression
   "We shouldn't change anything to avoid mistakes"
   -> Even necessary improvements don't get made

4. Trust destruction
   "My colleague might blame me"
   -> Team collaboration suffers

Blameless Culture:
"Fix the system, not the person"
- Everyone acted as they believed was correct
- Failures reveal system vulnerabilities
- Sharing mistakes is safe
- Root causes always lie in systems/processes

5.2 Postmortem Template

Postmortem: [Incident Title]
Date: YYYY-MM-DD
Author: [Name]
Reviewers: [Names]

1. Summary
   [1-2 sentence incident summary]

2. Impact
   - Duration: [Start] to [End] ([N] minutes)
   - Affected users: [Count or percentage]
   - Affected services: [Service list]
   - Financial impact: [If applicable]

3. Timeline (all times UTC)
   14:23 - Monitoring alert fires (payment-service error rate increase)
   14:25 - On-call engineer acknowledges
   14:28 - SEV2 incident declared, IC assigned
   14:30 - Incident channel created
   14:35 - Investigation: database connection pool exhaustion confirmed
   14:40 - Mitigation: increase connection pool size
   14:45 - Service recovery confirmed
   14:50 - Switch to monitoring state
   15:15 - Incident closure declared

4. Root Cause
   [Detailed technical explanation]

5. 5 Whys Analysis
   Why 1: Why did payments fail?
   -> Could not obtain a database connection

   Why 2: Why could it not obtain a connection?
   -> Connection pool was fully exhausted

   Why 3: Why was the connection pool exhausted?
   -> Slow queries were holding connections for too long

   Why 4: Why were there slow queries?
   -> Full table scan on a table without proper indexes

   Why 5: Why were indexes missing?
   -> No index review process when adding new tables

6. What Went Well
   - Alert fired within 2 minutes
   - IC quickly coordinated the team
   - Mitigation was effective

7. What Needs Improvement
   - No database connection pool monitoring
   - No slow query detection alerting
   - Need index review process for new tables

8. Action Items
   [HIGH] Add database connection pool utilization alerting
   Owner: DB Team | Due: 2025-04-21

   [HIGH] Implement new table/query index review checklist
   Owner: Backend Team | Due: 2025-04-28

   [MED] Build slow query auto-detection and alerting system
   Owner: SRE Team | Due: 2025-05-15

   [LOW] Investigate automatic connection pool sizing mechanism
   Owner: Infra Team | Due: 2025-06-01

9. Lessons Learned
   [Key lessons from this incident]

5.3 Postmortem Review Process

Postmortem Review Checklist:

Writing Quality:
[ ] Is the timeline accurate and complete?
[ ] Is the root cause analyzed in depth?
[ ] Do the 5 Whys end at system/process? (must NOT end at a person)
[ ] Is the language blameless?

Action Items:
[ ] Do all action items have assigned owners?
[ ] Are deadlines realistic?
[ ] Are priorities appropriate?
[ ] Are the measures effective for preventing recurrence?

Sharing:
[ ] Has it been shared with relevant teams?
[ ] Is it scheduled for the company-wide postmortem review meeting?
[ ] Have similar services been checked for the same vulnerability?

5.4 Building Postmortem Culture

Postmortem Culture Practices:

1. Weekly Postmortem Review Meeting
   - Every Friday, 30 minutes
   - Share the week's incident postmortems
   - Open to all teams

2. Postmortem Reading Club
   - Analyze publicly available postmortems from other companies
   - Derive lessons applicable to our systems
   - Monthly

3. Blameless Language Guide
   Bad: "John deployed a query without indexes and caused the outage"
   Good: "A query without indexes was deployed to production.
          The code review process lacked a query performance review step."

4. Action Item Tracking
   - Manage action items via Jira board
   - Track monthly completion rate
   - Review incomplete items

6. On-Call Operations

6.1 On-Call Rotation

On-Call Rotation Design:

Core Principles:
1. Minimum 2 people always on-call (Primary + Secondary)
2. 1-week rotation (starts Monday 09:00)
3. No consecutive on-call (minimum 2-week gap)
4. Holidays are volunteer or extra-compensated

Rotation Example (6-person team):
Week 1: Alice (P) + Bob (S)
Week 2: Charlie (P) + Diana (S)
Week 3: Eve (P) + Frank (S)
Week 4: Bob (P) + Alice (S)
Week 5: Diana (P) + Charlie (S)
Week 6: Frank (P) + Eve (S)

Swap Rules:
- Minimum 48-hour advance notice
- Responsibility to find own replacement
- Notify team lead of swap
- Update on-call calendar
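The six-week schedule above can be generated mechanically: pair up the team, cycle the pairs through primary duty, then repeat with roles swapped. A minimal sketch (names as in the example; assumes an even team size):

```python
# Generate a Primary/Secondary rotation: fixed pairs take turns as
# primary, then swap roles, giving everyone at least a 2-week gap.
from itertools import cycle

def build_rotation(team: list[str], weeks: int) -> list[tuple[str, str]]:
    pairs = [(team[i], team[i + 1]) for i in range(0, len(team), 2)]
    # First pass: (Primary, Secondary); second pass: roles swapped.
    schedule = pairs + [(s, p) for p, s in pairs]
    return [pair for _, pair in zip(range(weeks), cycle(schedule))]

team = ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank"]
for week, (primary, secondary) in enumerate(build_rotation(team, 6), start=1):
    print(f"Week {week}: {primary} (P) + {secondary} (S)")
# Matches the rotation table above: Week 1 Alice (P) + Bob (S), ...
```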

6.2 Escalation Chain

Escalation Policy:

Level 1: On-Call Primary (Immediate)
  -> If no response within 5 minutes

Level 2: On-Call Secondary (5 minutes)
  -> If unresolved within 15 minutes

Level 3: Team Lead (15 minutes)
  -> If SEV1 or unresolved within 30 minutes

Level 4: Engineering Manager (30 minutes)
  -> If SEV1 persists for more than 1 hour

Level 5: VP/CTO (1 hour)

Automatic Escalation Configuration (PagerDuty):

escalation_policy:
  name: "payment-service"
  rules:
    - targets:
        - type: "user_reference"
          id: "PRIMARY_ON_CALL"
      escalation_delay_in_minutes: 5
    - targets:
        - type: "user_reference"
          id: "SECONDARY_ON_CALL"
      escalation_delay_in_minutes: 15
    - targets:
        - type: "user_reference"
          id: "TEAM_LEAD"
      escalation_delay_in_minutes: 30

6.3 On-Call Fatigue Management

On-Call Fatigue Management Strategies:

1. Improve Alert Quality
   Problem: Too many alerts -> alert fatigue -> miss real issues
   Solutions:
   - Weekly alert review: disable unnecessary alerts
   - Eliminate duplicate alerts
   - Add context to alerts (what is wrong, how to respond)
   - Target: fewer than 20 alerts per on-call week

2. Write Runbooks
   Create runbooks for all recurring alerts
   Runbook structure:
   - Symptom description
   - Impact scope assessment method
   - Step-by-step response procedure
   - Escalation criteria
   - Related dashboard links

3. Compensation
   - On-call stipend (fixed weekly amount)
   - Additional compensation for night/weekend calls
   - Compensatory time off
   - Special compensation for extended on-call periods

4. Mental Health
   - Debrief session after on-call rotation
   - Mental Health Day after difficult incidents
   - Complete disconnect when not on-call
   - Team culture of sharing on-call experiences

6.4 Runbook Example

Runbook: payment-service High Error Rate

Trigger: payment-service HTTP 5xx rate exceeds 1%

1. Assess the Situation
   - Check Grafana dashboard:
     [Dashboard URL]
   - Check error logs:
     kubectl logs -l app=payment-service --tail=100 -n production
   - Check recent deployments:
     kubectl rollout history deployment/payment-service -n production

2. Common Causes

   Cause A: Database connection issue
   Check: Review connection pool metrics
   Action: Reset or expand connection pool
   Command:
     kubectl rollout restart deployment/payment-service -n production

   Cause B: External API failure
   Check: Check external API status page
   Action: Force-open circuit breaker
   Command:
     curl -X POST http://payment-service/admin/circuit-breaker/open

   Cause C: Recent deployment issue
   Check: Compare deployment timeline with error start time
   Action: Rollback to previous version
   Command:
     kubectl rollout undo deployment/payment-service -n production

3. Escalation
   - If unresolved within 15 minutes: Escalate to team lead
   - If judged SEV1: Declare incident

7. Toil Elimination

7.1 What is Toil

Characteristics of Toil as defined in the Google SRE book:

Toil Characteristics:

1. Manual
   Work that requires a human to perform
   Example: Manually restarting servers

2. Repetitive
   Performing the same task repeatedly
   Example: Renewing certificates weekly

3. Automatable
   Work that a machine could do instead
   Example: Running disk cleanup scripts

4. Tactical
   Immediate response with no long-term value
   Example: Applying temporary workarounds

5. Scales with Service
   Work increases as service grows
   Example: Manual user provisioning

6. No Enduring Value
   Service is not improved after the work
   Example: Manual deployment approval

7.2 Measuring Toil

Toil Measurement Methods:

1. Time Tracking
   - Record all work activities for 2 weeks
   - Classify Toil vs Engineering time
   - Target: Toil below 50%

2. Toil Categories:
   Category A: Incident response (emergency)
   Category B: Scheduled operational tasks (planned)
   Category C: Manual provisioning (request-based)
   Category D: Manual monitoring/verification (checks)

3. Measurement Template:

   Team: SRE Team
   Period: 2025-04-01 to 2025-04-14

   | Task | Frequency | Time Spent | Toil? | Automatable? |
   |------|-----------|-----------|-------|-------------|
   | Certificate renewal | Weekly | 30min | Yes | Yes |
   | Disk cleanup | Daily | 15min | Yes | Yes |
   | Deploy approval | 3x daily | 10min each | Yes | Yes |
   | Capacity review | Weekly | 2 hours | Partial | Partial |
   | Incident response | 2x weekly | 1hr each | No | No |

   Total hours: 80 hours (2 weeks)
   Toil hours: 45 hours (56%)
   Target: below 40 hours (50%)
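The summary lines in the template reduce to a single ratio, sketched here with the numbers from the example:

```python
# Toil ratio from tracked hours, as in the measurement template above.

def toil_ratio(toil_hours: float, total_hours: float) -> float:
    return 100.0 * toil_hours / total_hours

ratio = toil_ratio(toil_hours=45, total_hours=80)
print(f"Toil: {ratio:.0f}%")            # Toil: 56%
print("Over 50% target:", ratio > 50)   # True -> invest in automation
```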

7.3 Toil Automation Priority

Automation Priority Matrix:

              High Frequency    Low Frequency
High Impact  | P1: Automate   | P2: Plan to  |
             | immediately    | automate     |
             |----------------|--------------|
Low Impact   | P3: Add to     | P4: Can      |
             | backlog        | ignore       |

P1 Automation Examples:
- Automatic certificate renewal (cert-manager)
- Automatic disk cleanup (cron job)
- Automatic deploy approval (CI/CD pipeline)
- Auto-scaling (HPA/VPA)

P2 Automation Examples:
- Capacity planning automation (prediction-based)
- Incident response automation (auto-recovery)
- Automatic security patch application
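The matrix above is a two-axis decision; as code it is a four-way branch. Whether a task counts as "high" frequency or impact is a judgment call, so the inputs here are deliberately just booleans:

```python
# Priority from the automation matrix: impact first, then frequency.

def automation_priority(high_frequency: bool, high_impact: bool) -> str:
    if high_impact and high_frequency:
        return "P1: automate immediately"
    if high_impact:
        return "P2: plan to automate"
    if high_frequency:
        return "P3: add to backlog"
    return "P4: can ignore"

print(automation_priority(high_frequency=True, high_impact=True))    # P1
print(automation_priority(high_frequency=False, high_impact=False))  # P4
```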

7.4 Automation Examples

# Automatic certificate renewal with cert-manager
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-service-tls
  namespace: production
spec:
  secretName: my-service-tls-secret
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - api.example.com
    - www.example.com
  renewBefore: 720h  # Auto-renew 30 days before expiry
---
# Disk Cleanup Automation CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: disk-cleanup
  namespace: production
spec:
  schedule: "0 3 * * *"  # Daily at 3 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cleanup
              image: busybox:1.36
              command:
                - /bin/sh
                - -c
                - |
                  find /data/logs -mtime +7 -delete
                  find /data/tmp -mtime +1 -delete
          restartPolicy: OnFailure
---
# HPA Auto-scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60

8. Capacity Planning

8.1 Demand Forecasting

Capacity Planning Framework:

1. Assess Current Capacity
   - CPU: 60% of total cluster in use
   - Memory: 55% of total cluster in use
   - Storage: Growing 100GB per month
   - Network: 40% of peak bandwidth in use

2. Growth Rate Analysis
   - Last 12 months traffic growth: 15% monthly
   - Seasonal patterns: Black Friday 3x, New Year 2x
   - Planned events: New feature launch expected 50% traffic increase

3. Headroom Buffer
   - General buffer: 30-50%
   - N+1 principle: Service survives 1 node failure
   - Peak preparedness: Handle 3x normal traffic

4. Provisioning Lead Time
   - Cloud: Minutes (auto-scaling)
   - On-premises: Weeks to months (hardware purchase)
   - Hybrid: Base on-premises + burst to cloud
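A useful output of steps 1 and 2 is "how long until we hit the ceiling". A minimal sketch, assuming utilization compounds at the monthly growth rate from the analysis above:

```python
# Months of headroom left, given current utilization and a compounding
# monthly growth rate: solve utilization * (1 + g)^m = ceiling for m.
import math

def months_until_exhausted(utilization: float, monthly_growth: float,
                           ceiling: float = 1.0) -> float:
    return math.log(ceiling / utilization) / math.log(1 + monthly_growth)

m = months_until_exhausted(utilization=0.60, monthly_growth=0.15)
print(f"CPU headroom exhausted in ~{m:.1f} months")  # ~3.7 months
```

With cloud lead times measured in minutes this is comfortable; with on-premises lead times of weeks to months, 3.7 months means ordering hardware now.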

8.2 Load Testing

// k6 Load Test Script
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '5m', target: 100 },   // Ramp up to 100 VUs
    { duration: '10m', target: 100 },   // Maintain 100 VUs
    { duration: '5m', target: 500 },    // Ramp up to 500 VUs
    { duration: '10m', target: 500 },   // Maintain 500 VUs
    { duration: '5m', target: 1000 },   // Ramp up to 1000 VUs
    { duration: '10m', target: 1000 },  // Maintain 1000 VUs
    { duration: '5m', target: 0 },      // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(99)<500'],
    http_req_failed: ['rate<0.01'],
  },
};

export default function () {
  const res = http.get('https://api.example.com/health');
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });
  sleep(1);
}

9. Release Engineering

9.1 Canary Deployment

# Argo Rollouts Canary Deployment
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
  namespace: production
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: payment-service-canary
      stableService: payment-service-stable
      trafficRouting:
        istio:
          virtualService:
            name: payment-service-vs
      steps:
        - setWeight: 5
        - pause:
            duration: 10m
        - setWeight: 20
        - pause:
            duration: 10m
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 50
        - pause:
            duration: 15m
        - setWeight: 100
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 2
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 60s
      successCondition: result[0] >= 0.999
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{
              service="payment-service",
              status=~"2.."
            }[5m]))
            /
            sum(rate(http_requests_total{
              service="payment-service"
            }[5m]))

9.2 Feature Flags

Feature Flag Deployment Strategy:

Phase 1: Internal employees only (dogfooding)
   percentage: 0, whitelist: [internal employee IDs]

Phase 2: 1% rollout
   percentage: 1

Phase 3: 10% rollout
   percentage: 10

Phase 4: 50% rollout
   percentage: 50

Phase 5: Full rollout
   percentage: 100

At each phase:
- Monitor error rates
- Collect user feedback
- Check business metrics
- Immediately disable if issues found
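A common way to implement the percentage phases is to hash the user ID into a stable bucket, so each user gets a consistent answer and raising the percentage only ever adds users. A minimal sketch; the flag name and IDs are illustrative:

```python
# Stable percentage rollout: hash(flag, user) -> bucket 0-99, enabled
# when the bucket falls below the rollout percentage.
import hashlib

def is_enabled(flag: str, user_id: str, percentage: int,
               whitelist: frozenset = frozenset()) -> bool:
    if user_id in whitelist:
        return True  # Phase 1: dogfooding via whitelist
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percentage

# Same user, same bucket: nobody flips from enabled back to disabled
# as the rollout moves 1% -> 10% -> 50% -> 100%.
print(is_enabled("new-checkout", "user-42", percentage=0))               # False
print(is_enabled("new-checkout", "emp-1", percentage=0,
                 whitelist=frozenset({"emp-1"})))                        # True
```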

10. Google SRE Book Key Lessons

The 10 most important lessons from the Google SRE book.

Lesson 1: 100% availability is the wrong target
  -> No perfect system exists
  -> Define appropriate reliability level with SLOs

Lesson 2: Error Budget enables innovation
  -> With budget, you can take risks
  -> Without budget, focus on reliability

Lesson 3: Toil exceeding 50% is a danger signal
  -> If more than 50% of SRE time is Toil
  -> Invest in automation or expand the team

Lesson 4: Monitoring should be symptom-based
  -> "CPU is at 90%" (cause) vs "Response time is slow" (symptom)
  -> Monitor symptoms that users experience

Lesson 5: Postmortems must be blameless
  -> Fix the system, not the person
  -> Create a culture where sharing mistakes is safe

Lesson 6: Simplicity is reliability
  -> Complex systems fail in unpredictable ways
  -> Keep things as simple as possible

Lesson 7: Release engineering is key to reliability
  -> Canary deployment, feature flags, automatic rollback
  -> Safe deployments enable frequent deployments

Lesson 8: Capacity planning is proactive, not reactive
  -> Prepare before traffic grows
  -> Understand limits through load testing

Lesson 9: On-call must be sustainable
  -> Reduce alert fatigue
  -> Provide appropriate compensation and rest

Lesson 10: SRE is the whole organization's responsibility
  -> Not just the SRE team's job
  -> Development teams must share reliability responsibility

11. SRE Tool Ecosystem

11.1 Incident Management Tools

Tool Comparison:

PagerDuty:
  - Market leader
  - Strong on-call management
  - 800+ integrations
  - AI-based incident classification
  - Pricing: from $21/user/month

OpsGenie (Atlassian):
  - Native Jira/Confluence integration
  - Flexible alert routing
  - Team collaboration features
  - Pricing: from $9/user/month

Incident.io:
  - Slack-native
  - Automated workflows
  - Postmortem generation automation
  - Pricing: from $16/user/month

Rootly:
  - Slack-based incident management
  - Auto-execute runbooks
  - Rich integrations
  - Pricing: from $15/user/month

FireHydrant:
  - Full incident lifecycle management
  - Status page integration
  - Automatic escalation
  - Pricing: Free tier + paid plans

11.2 Observability Tools

Three Pillars of Observability:

Logs:
  - ELK Stack (Elasticsearch + Logstash + Kibana)
  - Loki + Grafana
  - Datadog Logs
  - Splunk

Metrics:
  - Prometheus + Grafana
  - Datadog
  - New Relic
  - CloudWatch

Traces:
  - Jaeger
  - Zipkin
  - Datadog APM
  - OpenTelemetry (unified standard)

12. Building an SRE Team

12.1 Team Models

SRE Team Models:

1. Centralized Model
   Traits: One SRE team covers all services
   Pros: Consistent practices, easy knowledge sharing
   Cons: Bottleneck, lack of per-service depth
   Fits: Small organizations (fewer than 10 services)

2. Embedded Model
   Traits: SREs embedded within development teams
   Pros: Deep service understanding, fast response
   Cons: Inconsistency, risk of SRE isolation
   Fits: Large organizations (50+ services)

3. Hybrid Model
   Traits: Central SRE team + SRE champions in each team
   Pros: Balance of consistency and depth
   Cons: Coordination overhead
   Fits: Medium organizations (10-50 services)

4. Consulting Model
   Traits: SRE engages only when needed
   Pros: Scalable, cost-efficient
   Cons: Lack of continuous involvement
   Fits: Organizations early in SRE adoption

12.2 SRE Hiring

SRE Engineer Key Competencies:

Technical:
- System administration (Linux, networking, storage)
- Programming (Python, Go, Bash)
- Cloud platforms (AWS, GCP, Azure)
- Containers/orchestration (Docker, Kubernetes)
- Monitoring/observability (Prometheus, Grafana)
- CI/CD (Jenkins, GitHub Actions, ArgoCD)
- IaC (Terraform, Pulumi)

Soft Skills:
- Problem solving (systematic debugging)
- Communication (clear during incidents)
- Stress management (decisions under on-call pressure)
- Documentation skills (runbooks, postmortems)
- Collaboration (working with dev teams)

12.3 SRE Onboarding

SRE Onboarding Program (12 Weeks):

Week 1-2: Foundations
  - Architecture overview
  - Core service understanding
  - Monitoring tool training
  - On-call tool setup

Week 3-4: Observation
  - Shadow senior SRE's on-call
  - Observe incident response
  - Read and understand runbooks

Week 5-6: Practice
  - Handle simple incidents (with senior support)
  - Update runbooks
  - Improve monitoring alerts

Week 7-8: Independence
  - Perform Secondary on-call
  - Write postmortems
  - Start Toil automation project

Week 9-10: Advanced
  - Perform Primary on-call
  - Experience as Incident Commander
  - Participate in SLO reviews

Week 11-12: Contribution
  - Complete automation project
  - Improve onboarding documentation
  - Prepare to mentor next new member

13. SRE Time Allocation

Ideal SRE Time Allocation:

Engineering Work: 50% or more
  - Automation development
  - Tool improvement
  - System design
  - Code review

Operational Work (Toil): 50% or less
  - On-call response
  - Deploy management
  - Manual provisioning
  - Manual monitoring

Warning Signs:
- Toil over 50%: Need automation investment
- Toil over 70%: Need team expansion or service redesign
- Toil over 90%: Crisis - immediate executive intervention needed

SRE Time Tracking Method:
- Record time by category in 2-week intervals
- Quarterly Toil ratio report
- Set and track Toil reduction targets
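The tracking method above can be sketched as a small script. The category names and hour figures below are illustrative examples, not a prescribed format:

```python
# Minimal sketch of a toil-ratio tracker. Categories and hours are
# hypothetical examples of one SRE's 2-week time log.
hours = {
    "automation development": 30,   # engineering
    "system design": 10,            # engineering
    "on-call response": 20,         # toil
    "manual provisioning": 15,      # toil
}

ENGINEERING = {"automation development", "system design"}

toil_hours = sum(h for cat, h in hours.items() if cat not in ENGINEERING)
total_hours = sum(hours.values())
toil_ratio = toil_hours / total_hours

print(f"Toil ratio: {toil_ratio:.0%}")  # 35 / 75, about 47%

# Map the ratio to the warning levels described above.
if toil_ratio > 0.9:
    print("Crisis: immediate executive intervention needed")
elif toil_ratio > 0.7:
    print("Expand the team or redesign the service")
elif toil_ratio > 0.5:
    print("Invest in automation")
else:
    print("Within the 50% target")
```

Aggregating these per-person ratios each quarter gives the team-level Toil report described above.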

14. Quiz

Test your understanding with these questions.

Q1: With an SLO of 99.9%, what is the monthly Error Budget?

Answer: Approximately 43.2 minutes

Calculation: 30 days x 24 hours x 60 minutes x (1 - 0.999) = 30 x 24 x 60 x 0.001 = 43.2 minutes

This means approximately 43 minutes of downtime is allowed per month. If this is exceeded, the Error Budget is exhausted, and per policy, feature deployments should stop to focus on reliability improvements.
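The calculation generalizes to any SLO. A minimal helper (illustrative, not a standard API):

```python
# Error budget (allowed downtime in minutes) for a given SLO
# over a period, assuming a 30-day month by default.
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of downtime allowed for the period at the given SLO."""
    return round(days * 24 * 60 * (1 - slo), 2)

print(error_budget_minutes(0.999))   # 43.2 minutes
print(error_budget_minutes(0.9995))  # 21.6 minutes
print(error_budget_minutes(0.9999))  # 4.32 minutes
```

Note how each additional "nine" shrinks the budget roughly tenfold, which is why 99.99% targets demand far heavier automation investment than 99.9%.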

Q2: Why must the "5 Whys" analysis in a blameless postmortem end at system/process, not at a person?

Answer:

If the 5 Whys analysis ends at "someone made a mistake," the improvement action becomes just "train that person" -- which is not a fundamental fix.

Reasons it must end at system/process:

  1. Prevents recurrence: Fixing the system means the same mistake cannot happen regardless of who performs the task.
  2. Information sharing: When people fear blame, they hide mistakes, making it harder to find the real causes.
  3. Scalable solutions: "Training" applies to one person, but "automated verification" applies to all deployments.

Bad: "The developer deployed a query without indexes" (blaming a person)
Good: "There was no query performance review step in the deployment process" (improving the process)

Q3: List the 5 characteristics of Toil and explain why it should be kept below 50%.

Answer:

5 characteristics of Toil:

  1. Manual: Requires a human to perform
  2. Repetitive: Same task repeated
  3. Automatable: Can be replaced by machines
  4. Tactical: Immediate response with no long-term value
  5. Scales with Service: Increases as the service grows

Why keep below 50%:

  • The core value of SRE is improving operations through engineering
  • When Toil exceeds 50%, there is insufficient time for automation and system improvement
  • This creates a vicious cycle: too much Toil to automate, and no automation means even more Toil
  • The Google SRE book defines Toil over 50% as a "danger signal"

Q4: Describe 3 strategies for reducing on-call fatigue.

Answer:

  1. Improve alert quality

    • Disable unnecessary alerts (weekly alert review)
    • Add sufficient context to alerts (what is wrong, how to respond)
    • Target: fewer than 20 alerts per on-call week
  2. Write and maintain runbooks

    • Document step-by-step response procedures for all recurring alerts
    • Runbooks enable fast, accurate response even during night calls
    • Regularly update runbooks
  3. Provide adequate compensation and rest

    • Offer on-call stipend and extra compensation for night/weekend calls
    • Provide Mental Health Day after difficult incidents
    • Ensure complete disconnect when not on-call to prevent burnout

Q5: Explain what "class SRE implements DevOps" means.

Answer:

This expression uses an object-oriented programming analogy to explain the relationship between SRE and DevOps.

  • DevOps is the Interface: It defines culture, philosophy, and values. It prescribes principles like "dev and ops must collaborate," "automate everything," and "continuously improve" -- but does not specify concrete implementation methods.

  • SRE is the Implementation Class: It concretely implements the DevOps philosophy. It quantifies objectives with SLO/SLI, provides a decision-making framework with Error Budgets, and executes through Toil measurement and automation.

In other words, if DevOps defines "what to do," SRE specifies "how to do it." They are complementary, not competitive.
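The analogy can be rendered literally as (hypothetical) Python. The class and method names below are illustrative, chosen only to mirror the interface/implementation relationship described above:

```python
# Illustrative only: the OOP analogy from the answer, written out.
# DevOps declares the principles; SRE supplies concrete mechanisms.
from abc import ABC, abstractmethod

class DevOps(ABC):
    """The 'interface': values and principles, no prescribed mechanics."""

    @abstractmethod
    def set_reliability_target(self) -> str: ...

    @abstractmethod
    def balance_velocity_and_reliability(self) -> str: ...

class SRE(DevOps):
    """The 'implementation class': concrete, measurable practices."""

    def set_reliability_target(self) -> str:
        return "Define SLOs measured by SLIs (e.g. 99.9% availability)"

    def balance_velocity_and_reliability(self) -> str:
        return "Gate feature deployments on the remaining Error Budget"

print(SRE().set_reliability_target())
```

Instantiating `DevOps` directly would raise a `TypeError`, which captures the point: the philosophy alone is not executable until an implementation like SRE fills in the methods.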


References

  1. Site Reliability Engineering - Betsy Beyer et al. (Google, O'Reilly)
  2. The Site Reliability Workbook - Betsy Beyer et al. (Google, O'Reilly)
  3. Building Secure and Reliable Systems - Heather Adkins et al. (Google, O'Reilly)
  4. Google SRE Resources - sre.google
  5. Implementing Service Level Objectives - Alex Hidalgo (O'Reilly)
  6. Incident Management for Operations - Rob Schnepp et al. (O'Reilly)
  7. PagerDuty Incident Response Guide - response.pagerduty.com
  8. Atlassian Incident Management Handbook - atlassian.com/incident-management
  9. Blameless Postmortem Guide - blameless.com
  10. Rootly SRE Guide - rootly.com/blog
  11. Incident.io Blog - incident.io/blog
  12. Netflix Tech Blog: SRE Practices - netflixtechblog.com
  13. LinkedIn SRE Practices - engineering.linkedin.com
  14. Dropbox SRE - dropbox.tech