Chaos Engineering Deep Dive — Netflix Simian Army, LitmusChaos/Chaos Mesh, AWS FIS, Game Day
Intro — "You deliberately kill production servers?"
Newcomers to Chaos Engineering usually react:
"Outages are bad — why manufacture them?"
Answer: Outages happen anyway. It hurts less when you meet them in a controlled setting first. In 2010 Netflix learned this paradox while migrating to AWS and unleashed Chaos Monkey on the world.
This post covers:
- Why Netflix started killing servers — the origin
- The 4 principles of Chaos Engineering
- Simian Army — from Chaos Monkey to Chaos Kong
- LitmusChaos / Chaos Mesh for Kubernetes
- AWS Fault Injection Simulator
- Game Day design and execution
- Combining observability with chaos
- Blameless postmortems — the culture pillar
- 10 real-world chaos recipes
- Maturity model — where is your org?
1. Origins — Netflix's 2010 decision
Problem: from monolith to cloud
Netflix suffered a 3-day DVD shipping outage in 2008. The decision: "We can't solve this in our own data center. Move to the cloud." The AWS migration taught them:
- Individual cloud instances are unreliable (commodity VMs)
- Networks partition (across AZs and regions)
- Dependent services are always failing somewhere
Two choices:
- Write perfect code (impossible)
- Design assuming failure (realistic)
Chaos Monkey is born (2010)
A simple tool: randomly terminate EC2 instances. Internal pushback was heavy — "You're crazy." But engineers soon designed their services to survive single-instance loss. Results:
- Restart-tolerant architecture
- Rolling replacement becomes normal
- Outages become routine, not news
Open-sourced in 2012. The term Chaos Engineering enters the world.
2. The 4 Principles
From principlesofchaos.org, formalized by Netflix and Google SRE:
1. Define "Steady State"
There must be a measurable indicator of "healthy." Examples:
- Requests per second
- P99 latency
- Error rate
- Business metrics (checkout success rate)
Without a clear definition, you can't tell if the experiment succeeded. Netflix's "Starts Per Second (SPS)" is a classic business metric.
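One way to make the steady state operational is to encode it as an alert, so every experiment has an objective pass/fail signal. A minimal Prometheus alerting-rule sketch, assuming a hypothetical checkout_requests_total counter and a 95 percent threshold:
groups:
  - name: steady-state
    rules:
      - alert: CheckoutSuccessRateBelowSteadyState
        expr: |
          sum(rate(checkout_requests_total{status="success"}[5m]))
            /
          sum(rate(checkout_requests_total[5m])) < 0.95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Checkout success rate below 95 percent; abort any running chaos experiment
The same alert can double as a stop condition for the auto-abort pattern covered later.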
2. Vary real-world events
Inject events that actually happen:
- Instance down
- Network latency/partition
- DNS failure
- Dependency timeout
- Region outage
- Disk full
- Clock skew
Focus on environmental events, not code bugs.
3. Experiment in production
Staging is not production:
- Different traffic patterns
- Different data sizes
- Different third-party dependencies
Introduce gradually — start at 10 percent traffic. But production is the goal.
4. Automate and run continuously
One-off tests catch nothing. Only continuous experimentation catches regressions. Chaos belongs in the CI/CD pipeline.
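As an illustration only (not any specific team's pipeline), a scheduled chaos job in GitHub Actions could look like this; the manifest path and the steady-state check script are hypothetical:
name: scheduled-chaos
on:
  schedule:
    - cron: "0 15 * * 1-5"   # weekday business hours so engineers are around (adjust for your timezone)
jobs:
  chaos:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # cluster credential setup omitted for brevity
      - name: Inject pod-delete chaos
        run: kubectl apply -f chaos/pod-delete-chaosengine.yaml
      - name: Verify steady state after the experiment
        run: ./scripts/check-steady-state.sh   # hypothetical script comparing live metrics to the steady-state definition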
3. Simian Army — the full Netflix lineup
Chaos Monkey (2010)
- Random EC2 termination
- Business hours only (engineers available)
Latency Monkey (2012)
- Inject latency into service-to-service calls
- Validate handling of slow external APIs
Conformity Monkey
- Terminate instances not matching standards (old AMIs, bad tags)
Doctor Monkey
- Detect and act on unhealthy instances (CPU 100 percent, unresponsive)
Janitor Monkey
- Clean unused resources (orphan EBS, unused IPs) — also cost control
Security Monkey
- Auto-detect bad IAM / security group configuration
Chaos Gorilla (2011)
- Simulate full AZ outage
Chaos Kong (2013)
- Take down an entire AWS region
- Netflix survived the real 2016 region outage thanks to this
FIT (Failure Injection Testing)
- Targeted fault injection by service / user / request
- Fine-grained chaos
Most were open-sourced; many have been succeeded by newer tools.
4. Kubernetes era — LitmusChaos vs Chaos Mesh
LitmusChaos (CNCF Graduated 2024)
- CNCF graduated project
- Kubernetes-native (CRD-based)
- ChaosHub — dozens of pre-built experiments
- GitOps-friendly
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-chaos
spec:
  engineState: active                 # assumed: start the experiment immediately
  chaosServiceAccount: litmus-admin   # assumed service account with permissions to run experiments
  appinfo:
    appns: default
    applabel: app=nginx
    appkind: deployment               # assumed workload kind of the target app
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # total experiment runtime in seconds
              value: "30"
            - name: CHAOS_INTERVAL         # seconds between successive pod deletions
              value: "10"
Experiment types: Pod delete, container kill, network loss/latency/corruption, CPU/memory/IO stress, node drain, AWS/Azure/GCP resource manipulation.
Chaos Mesh (CNCF Incubating)
- Built by PingCAP (TiDB company)
- Excellent UI dashboard
- Strong scheduling — cron-based recurring experiments (see the Schedule sketch after the example below)
- Workflow support — multi-step experiments
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: delay-example
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: nginx
  delay:
    latency: "100ms"
    correlation: "25"
    jitter: "10ms"
  duration: "5m"
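In Chaos Mesh 2.x, cron-style recurrence lives in a separate Schedule resource that wraps an experiment spec. A sketch that reruns the delay experiment every night (the cron expression and limits are arbitrary choices):
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: nightly-delay
spec:
  schedule: "0 3 * * *"          # every day at 03:00
  concurrencyPolicy: Forbid      # never start a new run while one is still active
  historyLimit: 5
  type: NetworkChaos
  networkChaos:
    action: delay
    mode: one
    selector:
      labelSelectors:
        app: nginx
    delay:
      latency: "100ms"
    duration: "5m"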
Comparison
| Aspect | LitmusChaos | Chaos Mesh |
|---|---|---|
| Maturity | CNCF Graduated | CNCF Incubating |
| UI | Good | Excellent |
| Experiment count | ChaosHub 50+ | Built-in 30+ |
| GitOps | Strong | Medium |
| Scheduling | Basic cron | Advanced workflow |
| Learning curve | Medium | Low |
Quick pick:
- Fast onboarding — Chaos Mesh
- Enterprise policy + GitOps — LitmusChaos
5. AWS Fault Injection Simulator (FIS)
AWS's official chaos service (GA 2021).
Key features
- EC2 stop/terminate
- EBS volume detach
- Network disruption (latency, packet loss)
- RDS failover trigger
- CloudWatch alarm auto-abort
- IAM-based safety
Region shutdown experiment (careful)
FIS Template
├── Action: aws:network:disrupt-connectivity
├── Target: subnet in us-east-1
├── Stop Condition: CloudWatch alarm 'business metric below threshold'
└── IAM Role: restricted permissions
Essentially Netflix's Chaos Kong as an AWS-managed service.
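A rough CloudFormation sketch of such an experiment template; the role ARN, alarm ARN, and tag values are placeholders, and the field names should be checked against the current AWS::FIS::ExperimentTemplate schema:
Resources:
  AzConnectivityDisrupt:
    Type: AWS::FIS::ExperimentTemplate
    Properties:
      Description: Disrupt network connectivity for tagged subnets in one AZ
      RoleArn: arn:aws:iam::123456789012:role/fis-experiment-role   # placeholder
      StopConditions:
        - Source: aws:cloudwatch:alarm
          Value: arn:aws:cloudwatch:us-east-1:123456789012:alarm:checkout-success-rate   # placeholder
      Targets:
        TargetSubnets:
          ResourceType: aws:ec2:subnet
          ResourceTags:
            chaos-target: "true"
          SelectionMode: ALL
      Actions:
        DisruptConnectivity:
          ActionId: aws:network:disrupt-connectivity
          Parameters:
            duration: PT10M      # ISO 8601 duration: 10 minutes
            scope: all
          Targets:
            Subnets: TargetSubnets
      Tags:
        Name: az-connectivity-disrupt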
Azure Chaos Studio / GCP
- Azure Chaos Studio launched 2021, AKS integration
- GCP: no official product, Chaos Mesh recommended
6. Game Day — organizational disaster drill
Why Game Day
Automated chaos is "tech validation." Game Day is "organizational validation." It tests:
- Who responds? (on-call rotation)
- Which runbooks actually work?
- Do Slack/PagerDuty alerts fire correctly?
- Where does communication slow down?
- Does the escalation chain work?
Template
- Set objective — e.g. "Recover to us-west-2 within 15 min of us-east-1 shutdown"
- Choose participants — SRE, backend, DBA, frontend, PM; share schedule
- Write scenario — "10:00 RDS primary failover; 10:05 +10 percent write traffic; 10:10 restart half of cache nodes"
- Execute — separate observers from operators; live timeline in Slack
- Retrospective — what worked, what surprised you, which runbooks to update, what to automate
Practical tips
- First Game Day: low risk — weekday afternoon, low traffic
- Keep a chat channel open — decisions logged
- Time-box — "1 hour, abort if not recovered"
- Control blast radius — start with 1–5 percent
7. Observability meets chaos
Chaos without Observability = Guessing
During experiments, watch:
- Steady state metrics
- Error propagation in dependencies
- Cascading failure signals
The three pillars (see the OpenTelemetry deep dive):
- Metrics — continuous numerics
- Logs — anomalous events
- Traces — how failure propagates
Canary + Chaos pattern
- Deploy to 10 percent canary
- Inject chaos only on the canary (see the sketch after this list)
- Verify graceful failure handling
- Pass → full rollout
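A minimal sketch of scoping chaos to the canary with Chaos Mesh, assuming canary pods carry a track: canary label (the label name is an assumption, not something defined in this post):
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: canary-scoped-kill
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      track: canary        # only canary pods are eligible targets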
Auto-abort
Stop automatically if a business metric drops:
stopConditions:
  - source: cloudwatch
    alarm: checkout-success-rate-below-95
Prevents accidental major outages.
8. Blameless postmortems — the culture pillar
Why "blameless"
Blaming individuals leads to:
- Hidden incidents
- Halted learning
- Destroyed psychological safety
Focus on "how did the system allow failure?" — not who erred.
Template
- Impact — users, duration, revenue
- Timeline — minute-by-minute
- Root cause — Five Whys
- What went well — detection, teamwork
- What went badly — missed alerts, weak runbook
- Action items — owner + due date
Five Whys example
Problem: payments failed for 15 min
- Why? Payment service unresponsive
- Why? DB connection pool exhausted
- Why? A query scanned without an index
- Why? Recent deploy added the query, missed the index
- Why? PR review checklist lacked "verify index"
Action: add index check to PR template. Root cause lands on process gap, not developer error.
Just Culture
Blameless ≠ no accountability. Willful negligence / repeated carelessness warrants action. See John Allspaw (Etsy) on Just Culture.
9. 10 chaos recipes
1. Random pod delete
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: random-pod-kill          # metadata added so the manifest is applyable
spec:
  action: pod-kill
  mode: random-max-percent       # kill up to `value` percent of matching pods
  value: "25"
  selector:
    namespaces: [production]
    labelSelectors: { tier: backend }
  scheduler:
    cron: "0 */6 * * *"          # Chaos Mesh 1.x style; 2.x moves recurrence to a Schedule resource
Validate: readinessProbe, graceful termination, PodDisruptionBudget.
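A minimal PodDisruptionBudget sketch for the backend tier targeted above; the minAvailable value is an assumption, tune it to your replica count:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend-pdb
  namespace: production
spec:
  minAvailable: 2               # keep at least 2 backend pods up during voluntary disruptions
  selector:
    matchLabels:
      tier: backend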
2. CPU stress
Validate: HPA scale-out, throttling response.
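A hedged Chaos Mesh StressChaos sketch for this recipe; the target label and load figures are assumptions:
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
spec:
  mode: one
  selector:
    labelSelectors:
      tier: backend
  stressors:
    cpu:
      workers: 2           # stress workers spawned in the target pod
      load: 80             # approximate CPU load percentage per worker
  duration: "5m"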
3. Network latency
networkChaos:
  action: delay
  delay: { latency: "500ms" }
Validate: timeouts, circuit breaker.
4. DNS failure
An outsized share of real outages trace back to DNS ("it's always DNS"). Validate: DNS cache policy, reconnection behavior.
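One way to simulate this with Chaos Mesh is DNSChaos, which requires the optional chaos-dns-server component to be installed; the hostname pattern below is a placeholder:
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: dns-error
spec:
  action: error                  # resolution fails; "random" would return random IPs instead
  mode: all
  patterns:
    - payments.example.com       # placeholder hostname
  selector:
    namespaces:
      - production
  duration: "2m"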
5. Disk full
dd if=/dev/zero of=/tmp/fill bs=1M count=10000
Validate: log rotation, disk alarms.
6. Dependency 5xx
Fault injection, e.g. through Istio's VirtualService (which configures Envoy's fault filter):
fault:
  abort:
    percentage:
      value: 50        # abort half of the requests
    httpStatus: 503
Validate: retry policy, fallback.
7. DB failover
Validate: pool reconnect, client retry.
8. Region/AZ shutdown
AWS FIS subnet network disruption. Validate: multi-AZ, auto-failover.
9. Clock skew
Break NTP, skew time. Validate: JWT validation, event timestamps, TLS certs.
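Rather than breaking NTP on the host, Chaos Mesh's TimeChaos can shift the clock a container observes; a sketch with an assumed target label:
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: clock-skew
spec:
  mode: one
  selector:
    labelSelectors:
      app: auth-service      # assumed label of a token-issuing service
  timeOffset: "-10m"         # the target observes time shifted 10 minutes into the past
  duration: "10m"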
10. Traffic surge
k6/Locust at 3x expected. Validate: rate limiting, HPA, CDN hit ratio.
10. Chaos maturity model
Level 1 — Chaos-curious
Interested, no tooling, reactive incident handling, blameful postmortems.
Level 2 — Staging chaos
Occasional staging experiments, runbooks, on-call rotation.
Level 3 — Production chaos
Controlled production experiments, defined steady state, quarterly Game Day.
Level 4 — Continuous chaos
Chaos in CI/CD, auto-expanding blast radius, auto-abort + observability.
Level 5 — Engineering culture
Every PR considers "what is this change's chaos scenario?"; blameless postmortems are natural; failure is seen as learning.
Netflix, Google, and Amazon operate at Level 5. Most organizations sit at Level 1–2. Climb step by step.
11. Architecture antipatterns chaos reveals
- Hidden dependency — "That service can't die, right?" Chaos says otherwise.
- Cascading failure — A depends on B; B down triggers A retry storm; C falls too.
- Single point of failure — single DB primary, single LB, single DNS.
- Inadequate timeouts — 30 s client wait looks fine until chaos shows the UX pain.
- Incomplete retries — retry without backoff amplifies load.
12. Regulated environments (finance, health, gov)
"We're too regulated to do prod chaos."
Approach 1: Regulatory Sandboxing
Many regs (GDPR, PCI-DSS, HIPAA) allow prod-mirror testing. Mask real data, run chaos.
Approach 2: Minimal Blast Radius
Start at 0.1 percent of traffic, limit exposure to internal users, and keep strict audit logs.
Approach 3: Game Day only in prod
1–2 times/year, pre-approved, formal report.
Approach 4: Real incidents as chaos
Structured learning on every incident — you are already doing chaos.
13. 12-point checklist
- Define steady state first (business metric)
- Start in staging, then 10 percent prod
- Declare blast radius
- Always set stop conditions
- Observability first
- Pair with runbooks
- PDB / preStop hooks
- On-call must know
- Document results
- Blameless postmortem culture
- Regular Game Days (quarterly)
- Executive sponsorship — chaos is a culture investment
Next post — Feature Flags and Progressive Delivery
If chaos explores failures in production, Feature Flags explore new features there. Next time:
- History — from Facebook "Dark Launch" to LaunchDarkly
- Flag types — release, experiment, ops, permission
- Progressive Delivery — Canary → Rolling → Blue/Green
- A/B Testing vs flags
- Trunk-based development
- Flag debt
- Unleash, LaunchDarkly, GrowthBook, Flagsmith
- OpenFeature standardization
- Rollout strategies (bucket, geo, time)
"Separate deploy from release. The code is already there, the feature just isn't on yet — that is the modern default."