Chaos Engineering Complete Guide 2025: Resilience Testing, Chaos Monkey, Litmus, Game Day
Introduction: Why Chaos Engineering
Modern distributed systems consist of dozens to hundreds of microservices, message queues, databases, caches, and load balancers. In these complex systems, failure is not a question of "if" but "when."
In 2012, Netflix suffered a major service outage when AWS's US-East-1 region failed. Incidents like this cemented the philosophy Netflix had adopted since its cloud migration: "instead of avoiding failure, build systems resilient to failure." Chaos Engineering grew out of that philosophy.
Chaos Engineering is not simply "randomly killing servers." It is a discipline grounded in scientific experimentation methodology that proactively discovers system vulnerabilities and strengthens resilience so services continue operating normally even during failures.
What This Post Covers
- Core principles and experiment design of Chaos Engineering
- Netflix Chaos Monkey origin story and the Simian Army
- Tool comparison: Litmus Chaos, Chaos Mesh, Gremlin, AWS FIS
- Fault type experiments: network, CPU, memory, Pod, node, AZ
- Kubernetes chaos experiment YAML examples
- Game Day operations guide
- Progressive adoption strategy (dev to staging to production)
- Netflix and Amazon case studies
- CI/CD integration and observability
1. Core Principles of Chaos Engineering
The principles of Chaos Engineering are defined at principlesofchaos.org.
1.1 Steady State Hypothesis
Every chaos experiment begins with defining "steady state."
Steady State Examples:
- API response time p99 < 200ms
- Error rate < 0.1%
- Order processing success rate > 99.9%
- Database replication lag < 1 second
The steady state hypothesis is: "Even after injecting a specific failure, the system will maintain its steady state."
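Checking a steady-state hypothesis ultimately reduces to comparing observed metrics against thresholds. A minimal sketch of that check (metric names and limits are illustrative, not taken from any particular tool):

```python
# Minimal sketch: evaluate a steady-state hypothesis against observed metrics.
# Metric names and thresholds are illustrative.

def steady_state_holds(metrics: dict, thresholds: dict) -> bool:
    """Return True if every observed metric satisfies its (operator, limit) rule."""
    ops = {"<": lambda a, b: a < b, ">": lambda a, b: a > b}
    return all(ops[op](metrics[name], limit)
               for name, (op, limit) in thresholds.items())

thresholds = {
    "p99_latency_ms":    ("<", 200),    # API response time p99 < 200ms
    "error_rate_pct":    ("<", 0.1),    # Error rate < 0.1%
    "order_success_pct": (">", 99.9),   # Order success rate > 99.9%
}

during_chaos = {"p99_latency_ms": 185, "error_rate_pct": 0.04,
                "order_success_pct": 99.95}
print(steady_state_holds(during_chaos, thresholds))  # True: hypothesis holds
```

The same function can be run before the experiment (baseline), during injection, and after recovery, so the verdict is mechanical rather than a judgment call.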
1.2 Vary Real-World Events
You must simulate failures that could actually occur in production.
| Failure Type | Real-World Case | Experiment Method |
|---|---|---|
| Server Down | Hardware failure, OOM Kill | Force-terminate Pod/instance |
| Network Latency | Cross-region communication delay | Inject latency via tc command |
| Network Partition | Switch failure | Block traffic via iptables rules |
| CPU Overload | Sudden traffic spike | Generate CPU load with stress-ng |
| Disk Full | Log explosion | Fill disk with dd command |
| DNS Failure | DNS server down | Manipulate DNS responses |
| AZ Failure | Data center outage | Block all AZ traffic |
1.3 Run Experiments in Production
True confidence can only be validated in production environments. Staging environments differ from production in traffic patterns, data volumes, and infrastructure configuration.
Of course, this does not mean starting in production right away. Adopt progressively, but the ultimate goal should be production experimentation.
1.4 Minimize Blast Radius
Blast Radius Control Strategies:
1. Start small: Single instance -> instance group -> AZ -> region
2. Time limit: Set experiment duration (e.g., 5 minutes)
3. Auto-abort: Stop experiment immediately when key metrics exceed thresholds
4. Traffic limit: Apply experiment to only 1% of total traffic
5. Rollback plan: Verify rollback procedure before experiment
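Strategy 3 (auto-abort) is the one most worth automating first. A minimal sketch of the idea, with a hypothetical error-rate threshold:

```python
# Sketch of an auto-abort guard: stop the experiment as soon as a key metric
# crosses its threshold. The 5% limit and sampling interval are illustrative.
from typing import Optional

def first_abort_tick(error_rates, max_error_rate: float = 0.05) -> Optional[int]:
    """Return the index of the first reading that violates the abort
    condition, or None if the experiment can run to completion."""
    for tick, rate in enumerate(error_rates):
        if rate > max_error_rate:
            return tick
    return None

# Error-rate samples collected every 10s during a hypothetical experiment:
samples = [0.01, 0.02, 0.03, 0.09, 0.20]
tick = first_abort_tick(samples)
print(f"abort at tick {tick}" if tick is not None else "ran to completion")
# prints "abort at tick 3"
```

In a real setup this loop would sit next to the chaos controller and trigger the experiment's halt API the moment a violation is seen, rather than after the run.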
1.5 Automate Experiments
Manual experiments cannot scale. Automation includes:
- Experiment scheduling (running at specific times each week)
- CI/CD pipeline integration
- Automated result collection and report generation
- Automatic abort condition monitoring
2. Chaos Monkey: Netflix's Beginning
2.1 Origin Story
In 2010, as Netflix migrated entirely from its own data centers to AWS, the team faced a fundamental question:
"In cloud infrastructure, servers can disappear at any time. Can our system handle this?"
Chaos Monkey was born to answer this question. It randomly terminates instances in production, forcing every service to be resilient against single-instance failures.
2.2 The Simian Army
After the success of Chaos Monkey, Netflix created the Simian Army to experiment with a wider variety of failure scenarios.
Simian Army Members:
+----------------------+--------------------------------------+
| Name | Role |
+----------------------+--------------------------------------+
| Chaos Monkey | Random instance termination |
| Latency Monkey | Network latency injection |
| Conformity Monkey | Detect non-compliant instances |
| Doctor Monkey | Unhealthy instance health checks |
| Janitor Monkey | Clean up unused resources |
| Security Monkey | Detect security vulnerabilities |
| Chaos Gorilla | Full AZ failure simulation |
| Chaos Kong | Full region failure simulation |
+----------------------+--------------------------------------+
2.3 Chaos Monkey Configuration Example
# Chaos Monkey Configuration (Spinnaker Integration)
chaos_monkey:
  enabled: true
  leashed: false
  schedule:
    frequency: "weekday"
    start_hour: 10
    end_hour: 16
    timezone: "America/New_York"
  grouping: "cluster"
  probability: 1.0
  exceptions:
    - account: "prod-critical"
      region: "us-east-1"
      stack: "payment"
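With `grouping: "cluster"` and `probability: 1.0`, each cluster independently gets at most one instance terminated per scheduled run. A rough simulation of that selection logic (an illustration of the documented behavior, not Netflix's actual implementation):

```python
import random

# Rough sketch of Chaos Monkey-style victim selection: with grouping
# "cluster", each cluster independently yields at most one termination
# per run, subject to the configured probability.

def pick_victims(clusters, probability, rng):
    victims = []
    for name, instances in clusters.items():
        if instances and rng.random() < probability:
            victims.append(rng.choice(instances))
    return victims

clusters = {
    "web":    ["web-1", "web-2", "web-3"],
    "orders": ["orders-1", "orders-2"],
}
rng = random.Random(42)  # seeded for reproducibility
print(pick_victims(clusters, probability=1.0, rng=rng))  # one victim per cluster
```

Lowering `probability` below 1.0 would make some clusters skip a given run entirely, which is how teams dial blast radius down while keeping the schedule.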
3. Chaos Tool Comparison
3.1 Tool Overview
| Tool | Developer | Environment | License | Features |
|---|---|---|---|---|
| Litmus Chaos | CNCF | Kubernetes | Apache 2.0 | ChaosHub, GitOps native |
| Chaos Mesh | CNCF | Kubernetes | Apache 2.0 | Powerful dashboard, TimeChaos |
| Gremlin | Gremlin Inc. | All | Commercial | SaaS-based, enterprise features |
| AWS FIS | AWS | AWS | Pay-per-use | Native AWS service integration |
| Azure Chaos Studio | Microsoft | Azure | Pay-per-use | Native Azure service integration |
| Steadybit | Steadybit | All | Commercial | Safety guardrails emphasis |
3.2 Litmus Chaos in Detail
Litmus is a CNCF Incubating project and a Kubernetes-native chaos engineering framework.
# Litmus Chaos Installation
apiVersion: v1
kind: Namespace
metadata:
  name: litmus
---
# LitmusChaos Operator Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chaos-operator-ce
  namespace: litmus
spec:
  replicas: 1
  selector:
    matchLabels:
      name: chaos-operator
  template:
    metadata:
      labels:
        name: chaos-operator
    spec:
      serviceAccountName: litmus
      containers:
        - name: chaos-operator
          image: litmuschaos/chaos-operator:3.0.0
          env:
            - name: CHAOS_RUNNER_IMAGE
              value: "litmuschaos/chaos-runner:3.0.0"
            - name: WATCH_NAMESPACE
              value: ""
The core concepts of Litmus are ChaosExperiment, ChaosEngine, and ChaosResult.
# Pod Delete Experiment Definition
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
  namespace: litmus
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["delete", "list", "get"]
    image: "litmuschaos/go-runner:3.0.0"
    args:
      - -c
      - ./experiments -name pod-delete
    command:
      - /bin/bash
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "30"
      - name: CHAOS_INTERVAL
        value: "10"
      - name: FORCE
        value: "false"
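To actually run this experiment, a ChaosEngine binds it to a target application. A minimal sketch; the target namespace, label, and service account name are placeholders:

```yaml
# Sketch: ChaosEngine binding the pod-delete experiment to an app
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-engine
  namespace: litmus
spec:
  appinfo:
    appns: "default"              # placeholder target namespace
    applabel: "app=my-service"    # placeholder target label
    appkind: "deployment"
  engineState: "active"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
```

Once applied, the operator launches a chaos runner, and the outcome is written to a ChaosResult object whose verdict can be inspected with kubectl.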
3.3 Chaos Mesh in Detail
Chaos Mesh is a Kubernetes chaos platform supporting diverse fault types.
# Chaos Mesh Installation (Helm)
# helm repo add chaos-mesh https://charts.chaos-mesh.org
# helm install chaos-mesh chaos-mesh/chaos-mesh -n chaos-mesh --create-namespace

# Network Delay Experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-example
  namespace: default
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: my-service
  delay:
    latency: "200ms"
    jitter: "50ms"
    correlation: "50"
  duration: "5m"
  scheduler:
    cron: "@every 1h"
3.4 AWS Fault Injection Service (FIS)
AWS FIS is a chaos engineering service natively integrated with AWS services.
{
  "description": "EC2 instance termination experiment",
  "targets": {
    "myInstances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "Environment": "staging",
        "ChaosReady": "true"
      },
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "stopInstances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": {},
      "targets": {
        "Instances": "myInstances"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789:alarm:HighErrorRate"
    }
  ],
  "roleArn": "arn:aws:iam::123456789:role/FISRole"
}
3.5 Gremlin
Gremlin is a commercial SaaS-based chaos engineering platform.
# Gremlin Agent Installation (Kubernetes)
helm repo add gremlin https://helm.gremlin.com
helm install gremlin gremlin/gremlin \
--set gremlin.secret.managed=true \
--set gremlin.secret.type=secret \
--set gremlin.secret.teamID=YOUR_TEAM_ID \
--set gremlin.secret.clusterID=my-cluster \
--set gremlin.secret.teamSecret=YOUR_SECRET
Gremlin's advantages include an intuitive UI, fine-grained control, and safety features (Safety Net). Basic chaos experiments are available even on the free tier.
4. Fault Type Experiments
4.1 Network Failures
Network failures are the most common and critical issues in distributed systems.
# Chaos Mesh: Network Partition
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition
  namespace: default
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: order-service
  direction: both
  target:
    selector:
      namespaces:
        - default
      labelSelectors:
        app: payment-service
    mode: all
  duration: "2m"

# Chaos Mesh: Packet Loss
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: packet-loss
  namespace: default
spec:
  action: loss
  mode: all
  selector:
    labelSelectors:
      app: api-gateway
  loss:
    loss: "25"
    correlation: "50"
  duration: "3m"

# Chaos Mesh: Bandwidth Limitation
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: bandwidth-limit
  namespace: default
spec:
  action: bandwidth
  mode: all
  selector:
    labelSelectors:
      app: file-service
  bandwidth:
    rate: "1mbps"
    limit: 20971520
    buffer: 10000
  duration: "5m"
4.2 CPU Stress
# Chaos Mesh: CPU Stress
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
  namespace: default
spec:
  mode: one
  selector:
    labelSelectors:
      app: compute-service
  stressors:
    cpu:
      workers: 4
      load: 80
  duration: "5m"
4.3 Memory Pressure
# Chaos Mesh: Memory Stress
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress
  namespace: default
spec:
  mode: one
  selector:
    labelSelectors:
      app: cache-service
  stressors:
    memory:
      workers: 2
      size: "512MB"
  duration: "3m"
4.4 Pod Kill
# Chaos Mesh: Pod Kill
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: default
spec:
  action: pod-kill
  mode: fixed
  value: "2"
  selector:
    namespaces:
      - default
    labelSelectors:
      app: web-frontend
  gracePeriod: 0
  duration: "1m"

# Chaos Mesh: Container Kill
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: container-kill
  namespace: default
spec:
  action: container-kill
  mode: one
  selector:
    labelSelectors:
      app: multi-container-app
  containerNames:
    - sidecar-proxy
  duration: "30s"
4.5 Node Drain
# Litmus: Node Drain Experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: node-drain-engine
  namespace: litmus
spec:
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: node-drain
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: TARGET_NODE
              value: "worker-node-02"
            - name: APP_NAMESPACE
              value: "default"
            - name: APP_LABEL
              value: "app=critical-service"
4.6 AZ Failure Simulation
AZ (Availability Zone) failure is the experiment with the largest blast radius.
# AWS FIS: AZ Failure Simulation
experiment_template:
  description: "AZ-a failure simulation"
  targets:
    azSubnets:
      resourceType: "aws:ec2:subnet"
      resourceTags:
        AvailabilityZone: "us-east-1a"
      selectionMode: "ALL"
  actions:
    disruptConnectivity:
      actionId: "aws:network:disrupt-connectivity"
      parameters:
        scope: "all"
        duration: "PT10M"
      targets:
        Subnets: "azSubnets"
  stopConditions:
    - source: "aws:cloudwatch:alarm"
      value: "arn:aws:cloudwatch:region:account:alarm:CriticalErrorRate"
4.7 DNS Failure
# Chaos Mesh: DNS Failure
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: dns-failure
  namespace: default
spec:
  action: error
  mode: all
  selector:
    labelSelectors:
      app: external-api-client
  patterns:
    - "external-api.example.com"
  duration: "2m"
5. Experiment Design Methodology
5.1 Systematic Experiment Design
Chaos experiments follow the same structure as scientific experiments.
Chaos Experiment Design Framework:
1. Hypothesis
"Even if 1 payment-service Pod is terminated,
order success rate will remain above 99.9%"
2. Steady State Definition
- Order success rate: 99.95%
- API response time p99: 180ms
- Error rate: 0.05%
3. Variables
- Independent variable: Force-terminate 1 Pod
- Dependent variables: Order success rate, response time, error rate
4. Blast Radius
- Target: 1 of 3 payment-service Pods
- Impact scope: Approximately 33% of payment requests
5. Abort Conditions
- Order success rate < 99.0%
- API response time p99 > 1000ms
- Concurrent errors > 100/min
6. Execute and Observe
- Collect baseline for 5 minutes before experiment
- Execute experiment (2 minutes)
- Observe recovery for 5 minutes after experiment
7. Analyze and Improve
- Verify/falsify hypothesis
- Document discovered vulnerabilities
- Create improvement work tickets
5.2 Experiment Checklist
Pre-Experiment Checklist:
[ ] Is the hypothesis clearly defined?
[ ] Are steady state metrics being collected?
[ ] Is the blast radius appropriately limited?
[ ] Are abort conditions configured?
[ ] Is the rollback procedure documented?
[ ] Have relevant teams been notified?
[ ] Is the monitoring dashboard ready?
[ ] Is an on-call engineer standing by?
During-Experiment Checklist:
[ ] Are metrics being monitored in real-time?
[ ] Are abort conditions being continuously checked?
[ ] Are there any unexpected side effects?
Post-Experiment Checklist:
[ ] Has the system returned to steady state?
[ ] Have experiment results been recorded?
[ ] Have action items been created for discovered issues?
[ ] Have results been shared with the team?
5.3 Experiment Result Template
Experiment ID: CE-2025-042
Date: 2025-04-14
Experimenter: SRE Team (Jane Doe)
Hypothesis: Even when 1 payment-service Pod is terminated,
order success rate remains above 99.9%
Steady State Baseline:
- Order success rate: 99.97%
- p99 response time: 165ms
- Error rate: 0.03%
Experiment Results:
- Order success rate: 99.12% (Hypothesis FALSIFIED!)
- p99 response time: 892ms
- Error rate: 0.88%
- Recovery time: 45 seconds
Findings:
1. Kubernetes readiness probe interval set to 30s,
causing traffic to continue routing to terminating Pod
2. Circuit breaker timeout set too high (10 seconds)
3. Retry logic retrying to the same Pod
Improvement Actions:
- [HIGH] Reduce readiness probe interval to 5 seconds
- [HIGH] Adjust circuit breaker timeout to 2 seconds
- [MED] Modify retry logic to route to different instances
6. Game Day Operations Guide
6.1 What is Game Day?
Game Day is a training session where the team gathers to execute planned failure scenarios, respond in real-time, and learn. Inspired by military "War Games," it improves response capabilities during actual incidents.
6.2 Game Day Planning
Game Day Planning Template:
1. Purpose
- Verify payment system AZ failure resilience
- Train incident response process
- Onboard new team members
2. Schedule
- Date: Monday, April 14, 2025
- Time: 14:00 - 17:00 EST
- Location: Conference Room + Zoom
3. Participants
- Game Master: SRE Lead (injects scenarios)
- Incident Commander: Backend Lead
- Response Team: 3 Backend, 1 Frontend, 1 DBA
- Observers: CTO, PM
4. Scenarios (hidden from participants)
Scenario 1: Kill 50% of payment-service Pods
Scenario 2: Database read replica connection lost
Scenario 3: External payment gateway 5-second delay
5. Success Criteria
- Detection time < 5 minutes
- Mitigation time < 15 minutes
- Zero customer impact
6. Safety Measures
- Kill switch: Game master can stop all experiments immediately
- Run in staging, not production
- No impact on actual customer traffic
6.3 Game Day Roles
Role Definitions:
Game Master
- Design scenarios and inject failures
- Authority to abort experiments
- Provide hints (when needed)
- Time management
Incident Commander
- Coordinate response team
- Decision making
- Communication management
- Escalation judgment
Response Team (Responders)
- Diagnose and resolve issues
- Reference and execute runbooks
- Report status in real-time
Observers
- Record response process
- Note process improvements
- Do NOT intervene (important!)
Scribe
- Record timeline
- Document key decisions
- Organize findings
6.4 Game Day Execution Timeline
14:00 - 14:15 Kickoff Briefing
- Explain rules
- Confirm safety measures
- Verify tool access
14:15 - 14:20 Scenario 1 Injection (Game Master)
14:20 - 14:50 Response Team Detection and Response
14:50 - 15:00 Scenario 1 Review
15:00 - 15:10 Break
15:10 - 15:15 Scenario 2 Injection
15:15 - 15:45 Response Team Detection and Response
15:45 - 15:55 Scenario 2 Review
15:55 - 16:00 Scenario 3 Injection
16:00 - 16:30 Response Team Detection and Response
16:30 - 16:40 Scenario 3 Review
16:40 - 17:00 Comprehensive Retrospective
- What went well
- What to improve
- Derive action items
6.5 Game Day Retrospective
Retrospective Questions:
Detection
- Which alert fired first?
- Could you accurately identify the problem from the alert?
- Were there any missed alerts?
- How can we reduce detection time?
Response
- Was the runbook helpful?
- Were communication channels effective?
- Was role assignment clear?
- Was escalation needed?
Recovery
- Was recovery time acceptable?
- Did auto-recovery mechanisms work?
- Which parts required manual intervention?
System
- Which parts did not behave as expected?
- What monitoring needs improvement?
- What tasks need automation?
7. Progressive Adoption Strategy
7.1 Phase 1: Development Environment (1-2 weeks)
Goal: Build team chaos engineering capabilities
Activities:
1. Install chaos tools (Chaos Mesh or Litmus)
2. Run simple experiments
- Pod Kill in development environment
- Network latency injection
3. Experiment design methodology training
4. Write first experiment report
Success Metrics:
- Every team member has run at least 1 experiment
- Experiment design process documented
7.2 Phase 2: Staging Environment (2-4 weeks)
Goal: Systematic experiments in production-like environment
Activities:
1. Deploy chaos tools to staging
2. Systematic experiments on core services
- Inter-service communication failures
- Database connection pool exhaustion
- Full cache failure
3. Execute first Game Day
4. Integrate basic experiments into CI/CD pipeline
Success Metrics:
- At least 1 regular experiment per week
- 1 Game Day completed
- 3+ discovered vulnerabilities improved
7.3 Phase 3: Production Canary (4-8 weeks)
Goal: Begin safe production experiments
Activities:
1. Build production chaos experiment safety measures
- Configure auto-abort conditions
- Implement kill switch
- Limit experiment impact scope
2. Small-scale production experiments
- Single Pod Kill on 1% of total traffic
- Network latency on non-critical services
3. Enhance observability dashboards
Success Metrics:
- 5+ production experiments completed safely
- Zero customer impact
- Auto-abort mechanism verified at least once
7.4 Phase 4: Production Normalization (8+ weeks)
Goal: Establish chaos engineering as routine practice
Activities:
1. Automated periodic experiment execution
2. Mandatory chaos tests for new service deployments
3. Quarterly large-scale Game Days
4. AZ/region-level failure simulation
5. Chaos culture evangelization (company-wide training)
Success Metrics:
- 10+ automated experiments per month
- Quarterly Game Days established
- MTTR reduced by 50%
- Failure-related incidents reduced by 30%
8. Netflix Case Study
8.1 Chaos Kong: Region Evacuation Drill
Netflix uses an experiment called Chaos Kong to simulate full AWS region failure.
Chaos Kong Process:
1. Preparation Phase
- Verify traffic routing
- Confirm alternative region capacity
- Enhance monitoring
- Notify relevant teams in advance
2. Execution Phase
- Route all US-East-1 traffic to US-West-2
- Leverage DNS-based global load balancing
- Real-time metric monitoring during transition
3. Observation Points
- Traffic transition completion time
- Alternative region auto-scaling response time
- User experience impact (playback quality, start time)
- Data consistency
4. Recovery Phase
- Restore traffic to original region
- Verify data synchronization
- Full results analysis
8.2 Netflix Chaos Maturity Model
Level 0: No Chaos
- Manual response to failures
- Hope that "our system works fine"
Level 1: Basic Chaos
- Instance termination with Chaos Monkey
- Individual service resilience verification
Level 2: Systematic Chaos
- Diverse fault type experiments
- Regular Game Day execution
- Results-based improvements
Level 3: Advanced Chaos
- AZ failure simulation
- CI/CD integration
- Automated experiments
Level 4: Chaos Culture
- Region evacuation drill (Chaos Kong)
- All teams performing chaos experiments
- Resilience as core architecture decision criterion
9. Amazon Case Study
9.1 Amazon GameDay
Amazon has been running failure drills called GameDay since 2004.
Amazon GameDay Principles:
1. "Everything fails, all the time" - Werner Vogels
- Treat failure as a normal part of operations
2. Progressive complexity increase
- Single service -> service chain -> full system
3. Blameless culture
- Turn failure response failures into learning opportunities
4. Documentation
- Record all GameDay results in detail
5. Repetition
- Repeat same scenarios to verify improvements
9.2 AWS FIS Integration Example
{
  "description": "Multi-AZ resilience test",
  "targets": {
    "ec2Instances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "chaos-ready": "true"
      },
      "filters": [
        {
          "path": "Placement.AvailabilityZone",
          "values": ["us-east-1a"]
        }
      ],
      "selectionMode": "ALL"
    },
    "rdsInstances": {
      "resourceType": "aws:rds:db",
      "resourceTags": {
        "chaos-ready": "true"
      },
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "stopEC2": {
      "actionId": "aws:ec2:stop-instances",
      "targets": {
        "Instances": "ec2Instances"
      },
      "startAfter": []
    },
    "failoverRDS": {
      "actionId": "aws:rds:failover-db-cluster",
      "targets": {
        "Clusters": "rdsInstances"
      },
      "startAfter": ["stopEC2"]
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:region:account:alarm:HighErrorRate"
    }
  ]
}
10. CI/CD Integration
10.1 Pipeline Integration Strategy
# GitHub Actions: Chaos Test Integration
name: Chaos Testing Pipeline

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * 1-5'  # Weekdays at 2 AM

jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to Staging
        run: kubectl apply -f k8s/staging/

  chaos-test:
    needs: deploy-staging
    runs-on: ubuntu-latest
    steps:
      - name: Install Litmus
        run: |
          kubectl apply -f https://litmuschaos.github.io/litmus/3.0.0/litmus-3.0.0.yaml
      - name: Run Pod Delete Experiment
        run: |
          kubectl apply -f chaos/experiments/pod-delete.yaml
          sleep 120
          kubectl get chaosresult pod-delete-result -o jsonpath='{.status.experimentStatus.verdict}'
      - name: Run Network Delay Experiment
        run: |
          kubectl apply -f chaos/experiments/network-delay.yaml
          sleep 180
          kubectl get chaosresult network-delay-result -o jsonpath='{.status.experimentStatus.verdict}'
      - name: Verify System Recovery
        run: |
          ./scripts/verify-steady-state.sh
      - name: Collect Results
        if: always()
        run: |
          kubectl get chaosresults -o yaml > chaos-results.yaml
      - name: Upload Results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: chaos-results
          path: chaos-results.yaml
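The pipeline above reads each ChaosResult verdict but does not yet fail the build on it. A sketch of a gate that could do so, operating on the JSON shape of Litmus ChaosResult objects as returned by kubectl (the sample data below is fabricated):

```python
# Sketch: fail a CI job unless every Litmus ChaosResult verdict is "Pass".
# Input mirrors `kubectl get chaosresults -o json`; sample data is fabricated.
import json

def failed_experiments(chaosresults_json: str) -> list:
    """Return the names of all results whose verdict is not 'Pass'."""
    items = json.loads(chaosresults_json).get("items", [])
    return [r["metadata"]["name"] for r in items
            if r.get("status", {}).get("experimentStatus", {}).get("verdict") != "Pass"]

sample = json.dumps({"items": [
    {"metadata": {"name": "pod-delete-result"},
     "status": {"experimentStatus": {"verdict": "Pass"}}},
    {"metadata": {"name": "network-delay-result"},
     "status": {"experimentStatus": {"verdict": "Fail"}}},
]})

bad = failed_experiments(sample)
print(bad)  # ['network-delay-result']
# In CI, the script would exit non-zero when `bad` is non-empty.
```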
10.2 Canary Deployment with Chaos Integration
# Argo Rollouts + Chaos Integration
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause:
            duration: 5m
        # Run chaos experiment on canary
        - analysis:
            templates:
              - templateName: chaos-analysis
            args:
              - name: service-name
                value: my-service
        - setWeight: 50
        - pause:
            duration: 10m
        - setWeight: 100
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: chaos-analysis
spec:
  metrics:
    - name: success-rate-during-chaos
      interval: 30s
      successCondition: result[0] >= 99.5
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status="200",service="my-service"}[5m]))
            /
            sum(rate(http_requests_total{service="my-service"}[5m]))
            * 100
11. Observability and Chaos
11.1 Monitoring Dashboard
Metrics that must be monitored during chaos experiments.
Key Metrics (Golden Signals):
1. Latency
- p50, p90, p99 response times
- Inter-service call latency
2. Traffic
- Requests per second (RPS)
- Traffic distribution
3. Errors
- HTTP 5xx rate
- Application error rate
- Timeout rate
4. Saturation
- CPU utilization
- Memory utilization
- Network bandwidth
- Disk I/O
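Two of these signals can be computed directly from raw request records. A small self-contained sketch (field names and data are illustrative):

```python
import math

# Sketch: computing two golden signals (latency p99, error rate) from raw
# request records. The records and field layout are illustrative.

def percentile(samples, p):
    """Nearest-rank percentile (simple, no interpolation)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

requests = [  # (latency_ms, http_status)
    (120, 200), (95, 200), (110, 200), (600, 500), (130, 200),
    (105, 200), (98, 200), (115, 200), (102, 200), (125, 200),
]
latencies = [ms for ms, _ in requests]
errors = sum(1 for _, status in requests if status >= 500)

print(f"p99 latency: {percentile(latencies, 99)}ms")  # 600ms (the worst sample)
print(f"error rate:  {errors / len(requests):.1%}")   # 10.0%
```

Note how a single slow failed request dominates p99 while barely moving the mean, which is exactly why chaos dashboards track tail latency rather than averages.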
11.2 Grafana Dashboard Example
{
  "dashboard": {
    "title": "Chaos Engineering Dashboard",
    "panels": [
      {
        "title": "Request Success Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"2..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
          }
        ],
        "thresholds": {
          "steps": [
            { "color": "red", "value": 0 },
            { "color": "yellow", "value": 99 },
            { "color": "green", "value": 99.9 }
          ]
        }
      },
      {
        "title": "P99 Latency",
        "type": "timeseries",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
          }
        ]
      },
      {
        "title": "Active Chaos Experiments",
        "type": "stat",
        "targets": [
          {
            "expr": "count(chaos_experiment_status{phase=\"Running\"})"
          }
        ]
      }
    ]
  }
}
11.3 Alert Configuration
# Prometheus AlertManager Rules
groups:
  - name: chaos-safety-alerts
    rules:
      - alert: ChaosExperimentHighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[2m]))
          /
          sum(rate(http_requests_total[2m]))
          > 0.05
        for: 1m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "Error rate exceeds 5% during chaos experiment"
          description: "Immediately abort experiment and check status"
      - alert: ChaosExperimentHighLatency
        expr: |
          histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[2m]))
          > 2.0
        for: 1m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "P99 latency exceeds 2 seconds during chaos experiment"
12. Relationship with SRE
Chaos Engineering is one of the core practices of SRE (Site Reliability Engineering).
Relationship between SRE and Chaos Engineering:
SRE Goal: Operate reliable and scalable systems
|
+-- SLO/SLI/SLA Definition -> Success criteria for chaos experiments
|
+-- Error Budget -> Determines whether to run chaos experiments
| (Aggressive experiments only when budget is sufficient)
|
+-- Incident Management -> Source of Game Day scenarios
| (Reproduce past failures to verify improvements)
|
+-- Postmortems -> New chaos experiment ideas
| (Experiment on vulnerabilities found in failure analysis)
|
+-- Toil Elimination -> Chaos experiment automation
(Automate repetitive manual experiments)
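The error-budget gate can be made concrete with a small calculation. A sketch; the "run aggressive experiments only above 50% remaining budget" policy is an illustrative choice, not a standard:

```python
# Sketch: deciding whether the error budget permits aggressive chaos
# experiments. The 50%-remaining cutoff is an illustrative policy.

def error_budget_remaining(slo: float, observed_availability: float) -> float:
    """Fraction of the error budget still unspent (can go negative)."""
    budget = 1.0 - slo
    spent = 1.0 - observed_availability
    return (budget - spent) / budget

slo = 0.999        # 99.9% availability target
observed = 0.9996  # 99.96% actually delivered this period
remaining = error_budget_remaining(slo, observed)
print(f"{remaining:.0%} of the error budget remains")  # prints "60% ..."
print("run aggressive experiments" if remaining > 0.5 else "hold off")
```

When the budget is nearly spent, the same logic argues for pausing injection entirely, since real incidents are already consuming the reliability headroom experiments would need.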
13. Quiz
Test your understanding with these questions.
Q1: What is the first step in Chaos Engineering?
Answer: Establishing a Steady State Hypothesis
Every chaos experiment begins with defining the system's steady state and forming the hypothesis: "Even after injecting this failure, the system will maintain steady state." Injecting failures without a hypothesis is not chaos engineering -- it is just destruction.
Q2: What level of failure does Netflix's Chaos Kong simulate?
Answer: Full AWS region failure
Chaos Kong is Netflix's most extreme chaos experiment, simulating a Region Evacuation where all traffic from an entire AWS region is redirected to another region. This verifies that Netflix's service remains operational even during region-level failures.
Q3: Describe 3 strategies for minimizing blast radius.
Answer:
- Start small: Begin with a single instance and progressively expand to instance groups, AZs, and regions.
- Time limits: Explicitly set experiment duration to prevent prolonged impact.
- Auto-abort conditions: Automatically stop the experiment when key metrics exceed thresholds. For example, immediately abort if error rate exceeds 5%.
Additional important strategies include traffic limiting (applying experiments to only a portion of traffic) and having a rollback plan.
Q4: What is the role of the Game Master during a Game Day?
Answer:
The Game Master is the operator of the Game Day and performs the following roles:
- Scenario design: Design failure scenarios in advance (kept secret from participants).
- Failure injection: Use chaos tools to inject failures at planned moments.
- Experiment abort authority: Use the kill switch to immediately stop if situation becomes uncontrollable.
- Hint provision: Provide hints if the response team gets stuck.
- Time management: Manage the overall timeline.
Q5: List the 4 phases of progressive Chaos Engineering adoption in order.
Answer:
- Phase 1: Development Environment (1-2 weeks) - Tool installation, basic experiments, team training
- Phase 2: Staging Environment (2-4 weeks) - Systematic experiments, first Game Day, CI/CD integration
- Phase 3: Production Canary (4-8 weeks) - Small-scale production experiments, safety measures
- Phase 4: Production Normalization (8+ weeks) - Automated regular experiments, AZ/region failure simulation
The key principle is "slowly, safely, progressively" expanding scope.
References
- Principles of Chaos Engineering - principlesofchaos.org
- Chaos Engineering: System Resiliency in Practice - Casey Rosenthal, Nora Jones (O'Reilly)
- Netflix Tech Blog: Chaos Engineering - netflixtechblog.com
- Litmus Chaos Documentation - litmuschaos.io/docs
- Chaos Mesh Documentation - chaos-mesh.org/docs
- AWS Fault Injection Service - AWS Official Documentation
- Azure Chaos Studio - Microsoft Official Documentation
- Gremlin Documentation - gremlin.com/docs
- Google SRE Book - Chapter 17: Testing for Reliability - sre.google/sre-book
- The Practice of Cloud System Administration - Thomas Limoncelli et al.
- Awesome Chaos Engineering - GitHub Repository
- CNCF Chaos Engineering Whitepaper - cncf.io
- Learning Chaos Engineering - Russ Miles (O'Reilly)
- Chaos Engineering Adoption Guide - Gremlin Blog