SRE Practices Guide 2025: Incident Management, Postmortem, Error Budget, On-Call, Toil Elimination

Author: Youngju Kim (@fjvbn20031)
Introduction: What is SRE
Site Reliability Engineering (SRE) is a software engineering approach to operations created at Google. Ben Treynor Sloss (Google VP of Engineering) famously defined it:
"SRE is what happens when you ask a software engineer to design an operations function."
SRE is not just a job title -- it is a culture and philosophy. The core principles are:
SRE Core Principles:
1. Apply software engineering to operations problems
2. Quantify reliability targets with SLOs
3. Balance innovation and reliability with Error Budgets
4. Reduce Toil and invest in automation
5. Monitoring should be symptom-based, not cause-based
6. Pursue simplicity
7. Learn continuously through blameless postmortems
SRE vs DevOps
Relationship between DevOps and SRE:
DevOps = Culture, philosophy, values
- Collaboration between dev and ops
- Continuous integration/delivery
- Infrastructure as Code
- Feedback loops
SRE = Concrete implementation of DevOps
- "class SRE implements DevOps"
- Measurable objectives (SLO/SLI)
- Error Budget as a decision framework
- Engineering-based operations approach
They complement, not compete:
DevOps defines "what to do", SRE defines "how to do it"
1. SLO, SLI, SLA
1.1 Concepts
SLA (Service Level Agreement)
= Contract between service provider and customer
= Financial/legal consequences for violations
Example: "99.9% monthly availability guaranteed, 10% service credit if missed"
SLO (Service Level Objective)
= Internal target
= Set stricter than SLA (to maintain buffer)
Example: "99.95% monthly availability target" (stricter than 99.9% SLA)
SLI (Service Level Indicator)
= Actual measurement
= Metrics used to evaluate SLOs
Example: "Actual availability over last 30 days: 99.97%"
1.2 Choosing Good SLIs
SLI Examples by Service Type:
API Services:
- Availability: Successful requests / Total requests
- Latency: Proportion of requests served within 200ms (with a target like p99)
- Throughput: Requests processed per second
Data Pipelines:
- Freshness: Proportion of data processed within N minutes
- Completeness: Processed records / Expected records
- Correctness: Proportion of correctly processed records
Storage Systems:
- Durability: Proportion of data preserved without loss
- Availability: Successful read/write request ratio
- Latency: p50 read latency
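The ratio-style SLIs above reduce to simple arithmetic once the request counts are available. A minimal Python sketch (the counts here are made-up illustration values, not real measurements):

```python
def availability_sli(successful: int, total: int) -> float:
    """Availability SLI: successful requests / total requests, as a percentage."""
    if total == 0:
        return 100.0  # no traffic: nothing failed, treat as fully available
    return successful / total * 100


def latency_sli(fast_enough: int, total: int) -> float:
    """Latency SLI: share of requests answered within the threshold (e.g. 200ms)."""
    if total == 0:
        return 100.0
    return fast_enough / total * 100


print(f"Availability: {availability_sli(999_700, 1_000_000):.2f}%")
print(f"Latency:      {latency_sli(995_000, 1_000_000):.2f}%")
```

In practice these ratios come from your metrics backend (as in the Prometheus queries later in this guide); the point is that a good SLI is always "good events / valid events".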
1.3 SLO Setting Guide
SLO Setting Process:
Step 1: Start from the user perspective
"What matters most to users?"
-> Page load time, payment success rate, notification delay
Step 2: Define SLIs
Availability SLI = Successful HTTP responses (2xx, 3xx) / Total responses
Latency SLI = Proportion of responses within 200ms
Step 3: Set initial SLO
- Analyze current performance data (last 30 days)
- Set the initial SLO at or slightly below current measured performance
- Don't set it too high!
Step 4: Calculate Error Budget
SLO 99.9% -> Error Budget = 0.1%
Monthly: 30 days x 24 hours x 60 minutes x 0.001 = 43.2 minutes
Step 5: Iterate
- Review SLO every 4 weeks
- Incorporate user feedback
- Adjust SLO as needed
1.4 SLO Dashboard
# Prometheus + Grafana SLO Dashboard
# Availability SLO Query
availability_slo:
  target: 99.9
  window: 30d
  query: |
    sum(rate(http_requests_total{status=~"2.."}[30d]))
    /
    sum(rate(http_requests_total[30d]))
    * 100

# Latency SLO Query
latency_slo:
  target: 99.0
  threshold: 200ms
  window: 30d
  query: |
    sum(rate(http_request_duration_seconds_bucket{le="0.2"}[30d]))
    /
    sum(rate(http_request_duration_seconds_count[30d]))
    * 100

# Error Budget Consumption Rate
error_budget_consumption:
  query: |
    (
      1 - (
        sum(rate(http_requests_total{status=~"2.."}[30d]))
        /
        sum(rate(http_requests_total[30d]))
      )
    )
    /
    (1 - 0.999)
    * 100
2. Error Budget Policy
2.1 What is Error Budget
Error Budget is the "acceptable unreliability" derived from the SLO.
Error Budget Calculations:
SLO 99.9% (Three Nines):
Annual: 365d x 24h x 60m x 0.001 = 525.6 minutes (8h 45m)
Monthly: 30d x 24h x 60m x 0.001 = 43.2 minutes
Weekly: 7d x 24h x 60m x 0.001 = 10.08 minutes
SLO 99.95% (Three and a Half Nines):
Annual: 262.8 minutes (4h 23m)
Monthly: 21.6 minutes
Weekly: 5.04 minutes
SLO 99.99% (Four Nines):
Annual: 52.56 minutes
Monthly: 4.32 minutes
Weekly: 1.01 minutes
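These budget tables follow directly from the window length and the SLO. A small helper reproduces them (a sketch; the 30-day "month" matches the convention used above):

```python
def error_budget_minutes(slo_percent: float, days: float) -> float:
    """Allowed downtime in minutes for a given SLO over a window of `days` days."""
    return days * 24 * 60 * (1 - slo_percent / 100)


for slo in (99.9, 99.95, 99.99):
    print(f"SLO {slo}%: "
          f"monthly {error_budget_minutes(slo, 30):.2f} min, "
          f"weekly {error_budget_minutes(slo, 7):.2f} min")
```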
2.2 Error Budget Policy Document
Error Budget Policy v2.0
Last Modified: 2025-04-14
Approved by: CTO, VP Engineering, SRE Director
1. Purpose
Error Budget is a quantitative framework for maintaining
balance between reliability and innovation.
2. Response by Budget Status
Green (more than 50% budget remaining):
- Normal feature development and deployment
- Chaos experiments permitted
- Aggressive release schedule possible
Yellow (20-50% budget remaining):
- Reduce deployment velocity (limit to 2x per week)
- Prioritize reliability improvement work
- Chaos experiments only in staging
Red (less than 20% budget remaining):
- Deploy only reliability-related changes
- Feature freeze
- SRE team acts as deployment gatekeeper
- Focus on root cause analysis and resolution
Exhausted (0% budget):
- Complete halt of all non-essential deployments
- Incident-level response
- Joint SRE/dev reliability sprint
- Executive reporting and recovery planning
3. Exceptions
- Security patches: Deploy regardless of budget status
- Legal requirements: Deploy regardless of budget status
- Data integrity issues: Respond immediately
4. Review Cadence
- Weekly: SLO dashboard review
- Monthly: Error Budget consumption pattern analysis
- Quarterly: SLO target reassessment
2.3 Error Budget Decision Making
Scenario: 35 of 43.2 monthly Error Budget minutes consumed (8.2 remaining)
Q: Should we proceed with the new feature deployment?
Analysis:
- Remaining budget: 8.2 minutes (19%)
- Status: Red (below 20%)
- Deployment risk: Average 2 minutes downtime from similar past deployments
Decision:
-> Defer feature deployment
-> Use remaining budget for reliability improvements
-> Proceed with deployment when budget resets next month
Communication:
"We have consumed 81% of this month's Error Budget.
Per our policy, we will deploy only reliability-related changes.
Feature X deployment is deferred to next month."
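The green/yellow/red/exhausted tiers can be encoded directly, which keeps the deploy-or-defer decision mechanical rather than ad hoc. A sketch with thresholds taken from the policy document earlier in this section:

```python
def budget_status(remaining_fraction: float) -> str:
    """Map remaining error budget (0.0 - 1.0) to the policy tier."""
    if remaining_fraction <= 0:
        return "Exhausted"
    if remaining_fraction < 0.20:
        return "Red"
    if remaining_fraction < 0.50:
        return "Yellow"
    return "Green"


# Scenario above: 8.2 of 43.2 monthly minutes remain
remaining = 8.2 / 43.2
print(f"{remaining:.0%} remaining -> {budget_status(remaining)}")
```

With 19% remaining the function lands in "Red", matching the decision above to defer the feature deployment.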
3. Incident Management Lifecycle
3.1 Incident Severity Levels
Severity Definitions:
SEV1 (Critical):
Definition: Core service total outage, many users affected
Examples: Complete payment system down, data breach
Response: Immediate (within 5 minutes)
Escalation: VP + full on-call team
Communication: Status update every 15 minutes
Postmortem: Required (within 48 hours)
SEV2 (High):
Definition: Major feature degradation, significant users affected
Examples: Search 50% failure, API latency 10x increase
Response: Within 15 minutes
Escalation: Team lead + on-call
Communication: Status update every 30 minutes
Postmortem: Required (within 1 week)
SEV3 (Medium):
Definition: Partial feature degradation, few users affected
Examples: Slow image loading for specific region
Response: Within 1 hour
Escalation: On-call engineer
Communication: As needed
Postmortem: Optional (when learning value exists)
SEV4 (Low):
Definition: Minor issue, minimal user impact
Examples: Internal dashboard loading delay
Response: Next business day
Escalation: Owning team
Communication: Not needed
Postmortem: Not needed
3.2 Incident Response Process
Incident Lifecycle:
1. Detection
   +-- Automated alerts (monitoring systems)
   +-- User reports (customer support)
   +-- Internal discovery (engineer directly)
       |
       v
2. Triage
   +-- Severity assessment (SEV1-4)
   +-- Assign Incident Commander
   +-- Assemble response team
       |
       v
3. Mitigation
   +-- Immediate mitigation actions
   +-- Rollback, scale up, traffic shift
   +-- Verify service recovery
       |
       v
4. Resolution
   +-- Identify root cause
   +-- Deploy permanent fix
   +-- Confirm stability via monitoring
       |
       v
5. Postmortem
   +-- Create timeline
   +-- Root cause analysis
   +-- Derive action items
   +-- Share company-wide
3.3 Incident Commander Role
The Incident Commander (IC) is the cornerstone of incident response.
Incident Commander Responsibilities:
1. Coordination
- Assign roles to response team
- Prioritize work
- Prevent duplicate efforts
- Secure necessary resources
2. Communication
- Internal status updates (Slack channel)
- External status updates (status page)
- Executive reporting
- Customer support briefing
3. Decision Making
- Decide whether to rollback
- Judge escalation needs
- Request additional resources
- Declare incident closure
4. Documentation
- Record key event timestamps
- Document decision rationale
- Collect information for postmortem
IC Communication Template:
"Incident Status Update - [TIME]
Severity: SEV[N]
Status: [Investigating / Mitigating / Monitoring]
Impact: [Description of impact]
Current Action: [Ongoing actions]
Next Steps: [Planned next actions]
Next Update: [Time]"
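Teams often wire this template into a bot so that updates stay consistent under incident pressure. A hypothetical helper that fills the template (the function and field names are illustrative, not a real tool's API):

```python
def status_update(severity: int, status: str, impact: str,
                  action: str, next_steps: str,
                  now: str, next_update: str) -> str:
    """Render the IC communication template as a ready-to-post message."""
    return (
        f"Incident Status Update - {now}\n"
        f"Severity: SEV{severity}\n"
        f"Status: {status}\n"
        f"Impact: {impact}\n"
        f"Current Action: {action}\n"
        f"Next Steps: {next_steps}\n"
        f"Next Update: {next_update}"
    )


print(status_update(2, "Mitigating", "Payment failures for ~5% of users",
                    "Expanding DB connection pool", "Monitor recovery",
                    "14:40 UTC", "15:10 UTC"))
```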
3.4 Incident Response Tools
Incident Response Workflow Tools:
Alerting:
- PagerDuty: On-call management and alerting
- OpsGenie: Alert routing and escalation
- Grafana Alerting: Metric-based alerts
Communication:
- Slack (auto-create dedicated incident channels)
- Zoom/Google Meet (War Room)
- StatusPage (external status page)
Incident Management:
- Incident.io: Slack-integrated incident management
- Rootly: Automated incident workflows
- Blameless: Incident + postmortem platform
- FireHydrant: Incident response automation
Documentation:
- Confluence/Notion: Postmortem storage
- Google Docs: Real-time collaboration
- Jira: Action item tracking
4. Incident Communication
4.1 Internal Communication
Slack Incident Channel Structure:
#incident-2025-0414-payment
- Main channel for incident response
- IC, response team, observers participate
- Bot automatically records timeline
#incident-2025-0414-payment-comms
- External communication coordination
- Customer support, PR team participate
- Draft/approve customer-facing messages
Rules:
1. Main channel for response-related content only
2. Casual discussion or speculation in separate threads
3. Share all major changes in the channel
4. Non-IC members report to IC
4.2 External Communication
Status Page Update Templates:
Investigating:
"We are aware of an issue with [service name] causing [symptoms].
Our engineering team is investigating the cause.
We will provide updates as more information becomes available."
Mitigating:
"We have identified the cause of the [service name] issue
and are implementing a fix.
Some users may still experience impact."
Monitoring:
"We have applied a fix for the [service name] issue
and the service is recovering.
We are continuing to monitor."
Resolved:
"The [service name] issue has been resolved.
Service was impacted for approximately [N] minutes
from [start time] to [end time].
A detailed postmortem will be shared."
5. Blameless Postmortem
5.1 Why Blameless
Blame culture causes the following problems:
Problems with Blame Culture:
1. Information hiding
"If I report my mistake, I'll be punished"
-> True causes remain hidden
2. Defensive behavior
"I need to prove it wasn't my fault"
-> Focus on avoiding blame rather than improving systems
3. Innovation suppression
"We shouldn't change anything to avoid mistakes"
-> Even necessary improvements don't get made
4. Trust destruction
"My colleague might blame me"
-> Team collaboration suffers
Blameless Culture:
"Fix the system, not the person"
- Everyone acted as they believed was correct
- Failures reveal system vulnerabilities
- Sharing mistakes is safe
- Root causes always lie in systems/processes
5.2 Postmortem Template
Postmortem: [Incident Title]
Date: YYYY-MM-DD
Author: [Name]
Reviewers: [Names]
1. Summary
[1-2 sentence incident summary]
2. Impact
- Duration: [Start] to [End] ([N] minutes)
- Affected users: [Count or percentage]
- Affected services: [Service list]
- Financial impact: [If applicable]
3. Timeline (all times UTC)
14:23 - Monitoring alert fires (payment-service error rate increase)
14:25 - On-call engineer acknowledges
14:28 - SEV2 incident declared, IC assigned
14:30 - Incident channel created
14:35 - Investigation: database connection pool exhaustion confirmed
14:40 - Mitigation: increase connection pool size
14:45 - Service recovery confirmed
14:50 - Switch to monitoring state
15:15 - Incident closure declared
4. Root Cause
[Detailed technical explanation]
5. 5 Whys Analysis
Why 1: Why did payments fail?
-> Could not obtain a database connection
Why 2: Why could it not obtain a connection?
-> Connection pool was fully exhausted
Why 3: Why was the connection pool exhausted?
-> Slow queries were holding connections for too long
Why 4: Why were there slow queries?
-> Full table scan on a table without proper indexes
Why 5: Why were indexes missing?
-> No index review process when adding new tables
6. What Went Well
- Alert fired within 2 minutes
- IC quickly coordinated the team
- Mitigation was effective
7. What Needs Improvement
- No database connection pool monitoring
- No slow query detection alerting
- Need index review process for new tables
8. Action Items
[HIGH] Add database connection pool utilization alerting
Owner: DB Team | Due: 2025-04-21
[HIGH] Implement new table/query index review checklist
Owner: Backend Team | Due: 2025-04-28
[MED] Build slow query auto-detection and alerting system
Owner: SRE Team | Due: 2025-05-15
[LOW] Investigate automatic connection pool sizing mechanism
Owner: Infra Team | Due: 2025-06-01
9. Lessons Learned
[Key lessons from this incident]
5.3 Postmortem Review Process
Postmortem Review Checklist:
Writing Quality:
[ ] Is the timeline accurate and complete?
[ ] Is the root cause analyzed in depth?
[ ] Do the 5 Whys end at system/process? (must NOT end at a person)
[ ] Is the language blameless?
Action Items:
[ ] Do all action items have assigned owners?
[ ] Are deadlines realistic?
[ ] Are priorities appropriate?
[ ] Are the measures effective for preventing recurrence?
Sharing:
[ ] Has it been shared with relevant teams?
[ ] Is it scheduled for the company-wide postmortem review meeting?
[ ] Have similar services been checked for the same vulnerability?
5.4 Building Postmortem Culture
Postmortem Culture Practices:
1. Weekly Postmortem Review Meeting
- Every Friday, 30 minutes
- Share the week's incident postmortems
- Open to all teams
2. Postmortem Reading Club
- Analyze publicly available postmortems from other companies
- Derive lessons applicable to our systems
- Monthly
3. Blameless Language Guide
Bad: "John deployed a query without indexes and caused the outage"
Good: "A query without indexes was deployed to production.
The code review process lacked a query performance review step."
4. Action Item Tracking
- Manage action items via Jira board
- Track monthly completion rate
- Review incomplete items
6. On-Call Operations
6.1 On-Call Rotation
On-Call Rotation Design:
Core Principles:
1. Minimum 2 people always on-call (Primary + Secondary)
2. 1-week rotation (starts Monday 09:00)
3. No consecutive on-call (minimum 2-week gap)
4. Holidays are volunteer or extra-compensated
Rotation Example (6-person team):
Week 1: Alice (P) + Bob (S)
Week 2: Charlie (P) + Diana (S)
Week 3: Eve (P) + Frank (S)
Week 4: Bob (P) + Alice (S)
Week 5: Diana (P) + Charlie (S)
Week 6: Frank (P) + Eve (S)
Swap Rules:
- Minimum 48-hour advance notice
- Responsibility to find own replacement
- Notify team lead of swap
- Update on-call calendar
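The 6-person example above follows a simple pattern: fixed pairs, with roles swapped in the second half of the cycle, so nobody serves consecutive weeks and everyone alternates between Primary and Secondary. A sketch that generates it:

```python
def build_rotation(team: list[str], weeks: int) -> list[tuple[str, str]]:
    """Weekly (Primary, Secondary) pairs. Fixed pairs swap roles in the
    second half of the cycle, so nobody is on call two weeks in a row."""
    pairs = [(team[i], team[i + 1]) for i in range(0, len(team), 2)]
    schedule = pairs + [(s, p) for (p, s) in pairs]
    return [schedule[w % len(schedule)] for w in range(weeks)]


team = ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank"]
for week, (primary, secondary) in enumerate(build_rotation(team, 6), start=1):
    print(f"Week {week}: {primary} (P) + {secondary} (S)")
```

The output reproduces the table above; with 3 pairs the gap between any two of one person's on-call weeks is two weeks, satisfying the "minimum 2-week gap" rule.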
6.2 Escalation Chain
Escalation Policy:
Level 1: On-Call Primary (Immediate)
-> If no response within 5 minutes
Level 2: On-Call Secondary (5 minutes)
-> If unresolved within 15 minutes
Level 3: Team Lead (15 minutes)
-> If SEV1 or unresolved within 30 minutes
Level 4: Engineering Manager (30 minutes)
-> If SEV1 persists for more than 1 hour
Level 5: VP/CTO (1 hour)
Automatic Escalation Configuration (PagerDuty):
escalation_policy:
  name: "payment-service"
  rules:
    - targets:
        - type: "user_reference"
          id: "PRIMARY_ON_CALL"
      escalation_delay_in_minutes: 5
    - targets:
        - type: "user_reference"
          id: "SECONDARY_ON_CALL"
      escalation_delay_in_minutes: 15
    - targets:
        - type: "user_reference"
          id: "TEAM_LEAD"
      escalation_delay_in_minutes: 30
6.3 On-Call Fatigue Management
On-Call Fatigue Management Strategies:
1. Improve Alert Quality
Problem: Too many alerts -> alert fatigue -> miss real issues
Solutions:
- Weekly alert review: disable unnecessary alerts
- Eliminate duplicate alerts
- Add context to alerts (what is wrong, how to respond)
- Target: fewer than 20 alerts per on-call week
2. Write Runbooks
Create runbooks for all recurring alerts
Runbook structure:
- Symptom description
- Impact scope assessment method
- Step-by-step response procedure
- Escalation criteria
- Related dashboard links
3. Compensation
- On-call stipend (fixed weekly amount)
- Additional compensation for night/weekend calls
- Compensatory time off
- Special compensation for extended on-call periods
4. Mental Health
- Debrief session after on-call rotation
- Mental Health Day after difficult incidents
- Complete disconnect when not on-call
- Team culture of sharing on-call experiences
6.4 Runbook Example
Runbook: payment-service High Error Rate
Trigger: payment-service HTTP 5xx rate exceeds 1%
1. Assess the Situation
- Check Grafana dashboard:
[Dashboard URL]
- Check error logs:
kubectl logs -l app=payment-service --tail=100 -n production
- Check recent deployments:
kubectl rollout history deployment/payment-service -n production
2. Common Causes
Cause A: Database connection issue
Check: Review connection pool metrics
Action: Reset or expand connection pool
Command:
kubectl rollout restart deployment/payment-service -n production
Cause B: External API failure
Check: Check external API status page
Action: Force-open circuit breaker
Command:
curl -X POST http://payment-service/admin/circuit-breaker/open
Cause C: Recent deployment issue
Check: Compare deployment timeline with error start time
Action: Rollback to previous version
Command:
kubectl rollout undo deployment/payment-service -n production
3. Escalation
- If unresolved within 15 minutes: Escalate to team lead
- If judged SEV1: Declare incident
7. Toil Elimination
7.1 What is Toil
Characteristics of Toil as defined in the Google SRE book:
Toil Characteristics:
1. Manual
Work that requires a human to perform
Example: Manually restarting servers
2. Repetitive
Performing the same task repeatedly
Example: Renewing certificates weekly
3. Automatable
Work that a machine could do instead
Example: Running disk cleanup scripts
4. Tactical
Immediate response with no long-term value
Example: Applying temporary workarounds
5. Scales with Service
Work increases as service grows
Example: Manual user provisioning
6. No Enduring Value
Service is not improved after the work
Example: Manual deployment approval
7.2 Measuring Toil
Toil Measurement Methods:
1. Time Tracking
- Record all work activities for 2 weeks
- Classify Toil vs Engineering time
- Target: Toil below 50%
2. Toil Categories:
Category A: Incident response (emergency)
Category B: Scheduled operational tasks (planned)
Category C: Manual provisioning (request-based)
Category D: Manual monitoring/verification (checks)
3. Measurement Template:
Team: SRE Team
Period: 2025-04-01 to 2025-04-14
| Task | Frequency | Time Spent | Toil? | Automatable? |
|------|-----------|-----------|-------|-------------|
| Certificate renewal | Weekly | 30min | Yes | Yes |
| Disk cleanup | Daily | 15min | Yes | Yes |
| Deploy approval | 3x daily | 10min each | Yes | Yes |
| Capacity review | Weekly | 2 hours | Partial | Partial |
| Incident response | 2x weekly | 1hr each | No | No |
Total hours: 80 hours (2 weeks)
Toil hours: 45 hours (56%)
Target: below 40 hours (50%)
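Given the tracked hours, the toil ratio and its status are one line each (a sketch; the threshold mirrors the 50% target above):

```python
def toil_ratio(toil_hours: float, total_hours: float) -> float:
    """Share of tracked time spent on toil, as a percentage."""
    return toil_hours / total_hours * 100


# Numbers from the measurement template above: 45 toil hours out of 80
ratio = toil_ratio(45, 80)
print(f"Toil: {ratio:.1f}% -> {'over target' if ratio > 50 else 'within target'}")
```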
7.3 Toil Automation Priority
Automation Priority Matrix:
                High Frequency     Low Frequency
High Impact  | P1: Automate     | P2: Plan to    |
             |     immediately  |     automate   |
             |------------------|----------------|
Low Impact   | P3: Add to       | P4: Can        |
             |     backlog      |     ignore     |
P1 Automation Examples:
- Automatic certificate renewal (cert-manager)
- Automatic disk cleanup (cron job)
- Automatic deploy approval (CI/CD pipeline)
- Auto-scaling (HPA/VPA)
P2 Automation Examples:
- Capacity planning automation (prediction-based)
- Incident response automation (auto-recovery)
- Automatic security patch application
7.4 Automation Examples
# Automatic certificate renewal with cert-manager
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-service-tls
  namespace: production
spec:
  secretName: my-service-tls-secret
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - api.example.com
    - www.example.com
  renewBefore: 720h # Auto-renew 30 days before expiry
# Disk Cleanup Automation CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: disk-cleanup
  namespace: production
spec:
  schedule: "0 3 * * *" # Daily at 3 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cleanup
              image: busybox:1.36
              command:
                - /bin/sh
                - -c
                - |
                  find /data/logs -mtime +7 -delete
                  find /data/tmp -mtime +1 -delete
          restartPolicy: OnFailure
# HPA Auto-scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
8. Capacity Planning
8.1 Demand Forecasting
Capacity Planning Framework:
1. Assess Current Capacity
- CPU: 60% of total cluster in use
- Memory: 55% of total cluster in use
- Storage: Growing 100GB per month
- Network: 40% of peak bandwidth in use
2. Growth Rate Analysis
- Last 12 months traffic growth: 15% monthly
- Seasonal patterns: Black Friday 3x, New Year 2x
- Planned events: New feature launch expected 50% traffic increase
3. Headroom Buffer
- General buffer: 30-50%
- N+1 principle: Service survives 1 node failure
- Peak preparedness: Handle 3x normal traffic
4. Provisioning Lead Time
- Cloud: Minutes (auto-scaling)
- On-premises: Weeks to months (hardware purchase)
- Hybrid: Base on-premises + burst to cloud
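Steps 1-3 combine into a simple runway estimate: at a compounding growth rate, how many months until utilization crosses the headroom ceiling? A sketch (the 60% CPU, 15%/month, and 30%-buffer numbers come from the example above):

```python
def months_until_ceiling(utilization: float, monthly_growth: float,
                         ceiling: float = 0.70) -> int:
    """Months until compounding growth pushes utilization past the headroom
    ceiling (e.g. a 30% buffer means a ceiling of 0.70)."""
    assert monthly_growth > 0, "growth rate must be positive"
    months = 0
    while utilization < ceiling:
        utilization *= 1 + monthly_growth
        months += 1
    return months


# CPU at 60%, growing 15% per month, keeping a 30% buffer
print(f"Months of CPU runway: {months_until_ceiling(0.60, 0.15)}")
```

The answer should then be compared against provisioning lead time: two months of runway is comfortable on cloud auto-scaling but alarming for on-premises hardware purchases.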
8.2 Load Testing
// k6 Load Test Script
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '5m', target: 100 },   // Ramp up to 100 VUs
    { duration: '10m', target: 100 },  // Maintain 100 VUs
    { duration: '5m', target: 500 },   // Ramp up to 500 VUs
    { duration: '10m', target: 500 },  // Maintain 500 VUs
    { duration: '5m', target: 1000 },  // Ramp up to 1000 VUs
    { duration: '10m', target: 1000 }, // Maintain 1000 VUs
    { duration: '5m', target: 0 },     // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(99)<500'],
    http_req_failed: ['rate<0.01'],
  },
};

export default function () {
  const res = http.get('https://api.example.com/health');
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });
  sleep(1);
}
9. Release Engineering
9.1 Canary Deployment
# Argo Rollouts Canary Deployment
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
  namespace: production
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: payment-service-canary
      stableService: payment-service-stable
      trafficRouting:
        istio:
          virtualService:
            name: payment-service-vs
      steps:
        - setWeight: 5
        - pause:
            duration: 10m
        - setWeight: 20
        - pause:
            duration: 10m
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 50
        - pause:
            duration: 15m
        - setWeight: 100
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 2
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 60s
      successCondition: result[0] >= 0.999
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{
              service="payment-service",
              status=~"2.."
            }[5m]))
            /
            sum(rate(http_requests_total{
              service="payment-service"
            }[5m]))
9.2 Feature Flags
Feature Flag Deployment Strategy:
Phase 1: Internal employees only (dogfooding)
percentage: 0, whitelist: [internal employee IDs]
Phase 2: 1% rollout
percentage: 1
Phase 3: 10% rollout
percentage: 10
Phase 4: 50% rollout
percentage: 50
Phase 5: Full rollout
percentage: 100
At each phase:
- Monitor error rates
- Collect user feedback
- Check business metrics
- Immediately disable if issues found
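A common way to implement the percentage phases is deterministic bucketing: hash the (flag, user) pair into a 0-99 bucket so each user's decision is stable, and a user enabled at 1% stays enabled at every later phase. A sketch (the flag name and user IDs are illustrative):

```python
import hashlib


def flag_enabled(flag: str, user_id: str, percentage: int,
                 whitelist: frozenset[str] = frozenset()) -> bool:
    """Deterministic percentage rollout via a stable 0-99 bucket per user."""
    if user_id in whitelist:  # Phase 1: internal dogfooding bypasses the percentage
        return True
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percentage


user = "user-4242"  # illustrative ID
for pct in (0, 1, 10, 50, 100):
    print(f"{pct:3d}% -> {flag_enabled('new-checkout', user, pct)}")
```

Because the bucket depends only on the flag and the user, monitoring during each phase compares a fixed cohort against the rest, and disabling the flag is just setting the percentage back to 0.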
10. Google SRE Book Key Lessons
The 10 most important lessons from the Google SRE book.
Lesson 1: 100% availability is the wrong target
-> No perfect system exists
-> Define appropriate reliability level with SLOs
Lesson 2: Error Budget enables innovation
-> With budget, you can take risks
-> Without budget, focus on reliability
Lesson 3: Toil exceeding 50% is a danger signal
-> If more than 50% of SRE time is Toil
-> Invest in automation or expand the team
Lesson 4: Monitoring should be symptom-based
-> "CPU is at 90%" (cause) vs "Response time is slow" (symptom)
-> Monitor symptoms that users experience
Lesson 5: Postmortems must be blameless
-> Fix the system, not the person
-> Create a culture where sharing mistakes is safe
Lesson 6: Simplicity is reliability
-> Complex systems fail in unpredictable ways
-> Keep things as simple as possible
Lesson 7: Release engineering is key to reliability
-> Canary deployment, feature flags, automatic rollback
-> Safe deployments enable frequent deployments
Lesson 8: Capacity planning is proactive, not reactive
-> Prepare before traffic grows
-> Understand limits through load testing
Lesson 9: On-call must be sustainable
-> Reduce alert fatigue
-> Provide appropriate compensation and rest
Lesson 10: SRE is the whole organization's responsibility
-> Not just the SRE team's job
-> Development teams must share reliability responsibility
11. SRE Tool Ecosystem
11.1 Incident Management Tools
Tool Comparison:
PagerDuty:
- Market leader
- Strong on-call management
- 800+ integrations
- AI-based incident classification
- Pricing: from $21/user/month
OpsGenie (Atlassian):
- Native Jira/Confluence integration
- Flexible alert routing
- Team collaboration features
- Pricing: from $9/user/month
Incident.io:
- Slack-native
- Automated workflows
- Postmortem generation automation
- Pricing: from $16/user/month
Rootly:
- Slack-based incident management
- Auto-execute runbooks
- Rich integrations
- Pricing: from $15/user/month
FireHydrant:
- Full incident lifecycle management
- Status page integration
- Automatic escalation
- Pricing: Free tier + paid plans
11.2 Observability Tools
Three Pillars of Observability:
Logs:
- ELK Stack (Elasticsearch + Logstash + Kibana)
- Loki + Grafana
- Datadog Logs
- Splunk
Metrics:
- Prometheus + Grafana
- Datadog
- New Relic
- CloudWatch
Traces:
- Jaeger
- Zipkin
- Datadog APM
- OpenTelemetry (unified standard)
12. Building an SRE Team
12.1 Team Models
SRE Team Models:
1. Centralized Model
Traits: One SRE team covers all services
Pros: Consistent practices, easy knowledge sharing
Cons: Bottleneck, lack of per-service depth
Fits: Small organizations (fewer than 10 services)
2. Embedded Model
Traits: SREs embedded within development teams
Pros: Deep service understanding, fast response
Cons: Inconsistency, risk of SRE isolation
Fits: Large organizations (50+ services)
3. Hybrid Model
Traits: Central SRE team + SRE champions in each team
Pros: Balance of consistency and depth
Cons: Coordination overhead
Fits: Medium organizations (10-50 services)
4. Consulting Model
Traits: SRE engages only when needed
Pros: Scalable, cost-efficient
Cons: Lack of continuous involvement
Fits: Organizations early in SRE adoption
12.2 SRE Hiring
SRE Engineer Key Competencies:
Technical:
- System administration (Linux, networking, storage)
- Programming (Python, Go, Bash)
- Cloud platforms (AWS, GCP, Azure)
- Containers/orchestration (Docker, Kubernetes)
- Monitoring/observability (Prometheus, Grafana)
- CI/CD (Jenkins, GitHub Actions, ArgoCD)
- IaC (Terraform, Pulumi)
Soft Skills:
- Problem solving (systematic debugging)
- Communication (clear during incidents)
- Stress management (decisions under on-call pressure)
- Documentation skills (runbooks, postmortems)
- Collaboration (working with dev teams)
12.3 SRE Onboarding
SRE Onboarding Program (12 Weeks):
Week 1-2: Foundations
- Architecture overview
- Core service understanding
- Monitoring tool training
- On-call tool setup
Week 3-4: Observation
- Shadow senior SRE's on-call
- Observe incident response
- Read and understand runbooks
Week 5-6: Practice
- Handle simple incidents (with senior support)
- Update runbooks
- Improve monitoring alerts
Week 7-8: Independence
- Perform Secondary on-call
- Write postmortems
- Start Toil automation project
Week 9-10: Advanced
- Perform Primary on-call
- Experience as Incident Commander
- Participate in SLO reviews
Week 11-12: Contribution
- Complete automation project
- Improve onboarding documentation
- Prepare to mentor next new member
13. SRE Time Allocation
Ideal SRE Time Allocation:
Engineering Work: 50% or more
- Automation development
- Tool improvement
- System design
- Code review
Operational Work (Toil): 50% or less
- On-call response
- Deploy management
- Manual provisioning
- Manual monitoring
Warning Signs:
- Toil over 50%: Need automation investment
- Toil over 70%: Need team expansion or service redesign
- Toil over 90%: Crisis - immediate executive intervention needed
SRE Time Tracking Method:
- Record time by category in 2-week intervals
- Quarterly Toil ratio report
- Set and track Toil reduction targets
14. Quiz
Test your understanding with these questions.
Q1: With an SLO of 99.9%, what is the monthly Error Budget?
Answer: Approximately 43.2 minutes
Calculation: 30 days x 24 hours x 60 minutes x (1 - 0.999) = 30 x 24 x 60 x 0.001 = 43.2 minutes
This means approximately 43 minutes of downtime is allowed per month. If this is exceeded, the Error Budget is exhausted, and per policy, feature deployments should stop to focus on reliability improvements.
Q2: Why must the "5 Whys" analysis in a blameless postmortem end at system/process, not at a person?
Answer:
If the 5 Whys analysis ends at "someone made a mistake," the improvement action becomes just "train that person" -- which is not a fundamental fix.
Reasons it must end at system/process:
- Prevents recurrence: Fixing the system means the same mistake cannot happen regardless of who performs the task.
- Information sharing: When people fear blame, they hide mistakes, making it harder to find the real causes.
- Scalable solutions: "Training" applies to one person, but "automated verification" applies to all deployments.
Bad: "The developer deployed a query without indexes" (blames the person)
Good: "There was no query performance review step in the deployment process" (improves the process)
Q3: List the 5 characteristics of Toil and explain why it should be kept below 50%.
Answer:
5 characteristics of Toil:
- Manual: Requires a human to perform
- Repetitive: Same task repeated
- Automatable: Can be replaced by machines
- Tactical: Immediate response with no long-term value
- Scales with Service: Increases as the service grows
Why keep below 50%:
- The core value of SRE is improving operations through engineering
- When Toil exceeds 50%, there is insufficient time for automation and system improvement
- This creates a vicious cycle: too much Toil to automate, and no automation means even more Toil
- The Google SRE book defines Toil over 50% as a "danger signal"
Q4: Describe 3 strategies for reducing on-call fatigue.
Answer:
- Improve alert quality
- Disable unnecessary alerts (weekly alert review)
- Add sufficient context to alerts (what is wrong, how to respond)
- Target: fewer than 20 alerts per on-call week
- Write and maintain runbooks
- Document step-by-step response procedures for all recurring alerts
- Runbooks enable fast, accurate response even during night calls
- Regularly update runbooks
- Provide adequate compensation and rest
- Offer on-call stipend and extra compensation for night/weekend calls
- Provide Mental Health Day after difficult incidents
- Ensure complete disconnect when not on-call to prevent burnout
Q5: Explain what "class SRE implements DevOps" means.
Answer:
This expression uses an object-oriented programming analogy to explain the relationship between SRE and DevOps.
- DevOps is the Interface: It defines culture, philosophy, and values. It prescribes principles like "dev and ops must collaborate," "automate everything," and "continuously improve" -- but does not specify concrete implementation methods.
- SRE is the Implementation Class: It concretely implements the DevOps philosophy. It quantifies objectives with SLO/SLI, provides a decision-making framework with Error Budgets, and executes through Toil measurement and automation.
In other words, if DevOps defines "what to do," SRE specifies "how to do it." They are complementary, not competitive.
References
- Site Reliability Engineering - Betsy Beyer et al. (Google, O'Reilly)
- The Site Reliability Workbook - Betsy Beyer et al. (Google, O'Reilly)
- Building Secure and Reliable Systems - Heather Adkins et al. (Google, O'Reilly)
- Google SRE Resources - sre.google
- Implementing Service Level Objectives - Alex Hidalgo (O'Reilly)
- Incident Management for Operations - Rob Schnepp et al. (O'Reilly)
- PagerDuty Incident Response Guide - response.pagerduty.com
- Atlassian Incident Management Handbook - atlassian.com/incident-management
- Blameless Postmortem Guide - blameless.com
- Rootly SRE Guide - rootly.com/blog
- Incident.io Blog - incident.io/blog
- Netflix Tech Blog: SRE Practices - netflixtechblog.com
- LinkedIn SRE Practices - engineering.linkedin.com
- Dropbox SRE - dropbox.tech