Chaos Engineering Deep Dive — Netflix Simian Army, LitmusChaos/Chaos Mesh, AWS FIS, Game Day

Intro — "You deliberately kill production servers?"

Newcomers to Chaos Engineering usually react:

"Outages are bad — why manufacture them?"

Answer: Outages happen anyway, and they hurt less when you first meet them in a controlled setting. Netflix internalized this lesson during its 2010 migration to AWS and unleashed Chaos Monkey on the world.

This post covers:

  • Why Netflix started killing servers — the origin
  • The 4 principles of Chaos Engineering
  • Simian Army — from Chaos Monkey to Chaos Kong
  • LitmusChaos / Chaos Mesh for Kubernetes
  • AWS Fault Injection Simulator
  • Game Day design and execution
  • Combining observability with chaos
  • Blameless postmortems — the culture pillar
  • 10 real-world chaos recipes
  • Maturity model — where is your org?

1. Origins — Netflix's 2010 decision

Problem: from monolith to cloud

Netflix suffered a 3-day DVD shipping outage in 2008. The decision: "We can't solve this in our own DC. Move to the cloud." The AWS migration taught:

  • Individual cloud instances are unreliable (commodity VMs)
  • Networks partition (across AZs and regions)
  • Dependent services are always failing somewhere

Two choices:

  1. Write perfect code (impossible)
  2. Design assuming failure (realistic)

Chaos Monkey is born (2010)

A simple tool: randomly terminate EC2 instances. Internal pushback was heavy — "You're crazy." But engineers soon designed their services to survive single-instance loss. Results:

  • Restart-tolerant architecture
  • Rolling replacement becomes normal
  • Outages become routine, not news

Open-sourced in 2012. The term Chaos Engineering enters the world.


2. The 4 Principles

From principlesofchaos.org, formalized by the Netflix chaos engineering team:

1. Define "Steady State"

There must be a measurable indicator of "healthy." Examples:

  • Requests per second
  • P99 latency
  • Error rate
  • Business metrics (checkout success rate)

Without a clear definition, you can't tell if the experiment succeeded. Netflix's "Starts Per Second (SPS)" is a classic business metric.
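It helps to write the steady state down explicitly before the experiment. The snippet below is a hypothetical declaration format, not any specific tool's schema, and the metric names are illustrative:

steadyState:
  name: checkout-success-rate
  source: prometheus
  query: sum(rate(checkout_success_total[5m])) / sum(rate(checkout_requests_total[5m]))
  tolerance: ">= 0.99"    # abort or fail the experiment if the rate drops below 99 percent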

2. Vary real-world events

Inject events that actually happen:

  • Instance down
  • Network latency/partition
  • DNS failure
  • Dependency timeout
  • Region outage
  • Disk full
  • Clock skew

Focus on environmental events, not code bugs.

3. Experiment in production

Staging is not production:

  • Different traffic patterns
  • Different data sizes
  • Different third-party dependencies

Introduce gradually — start at 10 percent traffic. But production is the goal.

4. Automate and run continuously

A one-off test only proves resilience at a single point in time; only continuous experimentation catches regressions. Chaos belongs in the CI/CD pipeline.
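A minimal sketch of what this can look like, assuming a GitHub Actions workflow whose runner has access to the target cluster; the manifest path and steady-state script are illustrative:

name: nightly-chaos
on:
  schedule:
    - cron: "0 2 * * *"    # run off-peak
jobs:
  chaos:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run the pod-delete experiment
        run: kubectl apply -f chaos/pod-delete-chaosengine.yaml    # illustrative manifest path
      - name: Verify steady state
        run: ./scripts/check-steady-state.sh    # illustrative: fail the job if the metric drops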


3. Simian Army — the full Netflix lineup

Chaos Monkey (2010)

  • Random EC2 termination
  • Business hours only (engineers available)

Latency Monkey (2012)

  • Inject latency into service-to-service calls
  • Validate handling of slow external APIs

Conformity Monkey

  • Terminate instances not matching standards (old AMIs, bad tags)

Doctor Monkey

  • Detect and act on unhealthy instances (CPU 100 percent, unresponsive)

Janitor Monkey

  • Clean unused resources (orphan EBS, unused IPs) — also cost control

Security Monkey

  • Auto-detect bad IAM / security group configuration

Chaos Gorilla (2011)

  • Simulate full AZ outage

Chaos Kong (2013)

  • Take down an entire AWS region
  • Netflix credits this drill with riding out real region-scale AWS incidents

FIT (Failure Injection Testing)

  • Targeted fault injection by service / user / request
  • Fine-grained chaos

Most were open-sourced; many have been succeeded by newer tools.


4. Kubernetes era — LitmusChaos vs Chaos Mesh

LitmusChaos (CNCF Graduated 2024)

  • CNCF graduated project
  • Kubernetes-native (CRD-based)
  • ChaosHub — dozens of pre-built experiments
  • GitOps-friendly
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-chaos
spec:
  engineState: active                  # set to "stop" to halt the experiment
  appinfo:
    appns: default
    applabel: app=nginx
    appkind: deployment
  chaosServiceAccount: litmus-admin    # service account with permission to run the experiment
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION   # seconds
          value: "30"
        - name: CHAOS_INTERVAL         # seconds between deletions
          value: "10"

Experiment types: Pod delete, container kill, network loss/latency/corruption, CPU/memory/IO stress, node drain, AWS/Azure/GCP resource manipulation.

Chaos Mesh (CNCF Incubating)

  • Built by PingCAP (TiDB company)
  • Excellent UI dashboard
  • Strong scheduling — cron-based recurring experiments (Schedule sketch below)
  • Workflow support — multi-step experiments
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: delay-example
spec:
  action: delay
  mode: one
  selector:
    namespaces:
    - default
    labelSelectors:
      app: nginx
  delay:
    latency: "100ms"
    correlation: "25"
    jitter: "10ms"
  duration: "5m"
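The cron-style scheduling above is handled by a separate Schedule resource in Chaos Mesh 2.x. A minimal sketch that re-runs the delay experiment nightly; the name and cron expression are illustrative:

apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: nightly-network-delay
spec:
  schedule: "0 2 * * *"           # every night at 02:00
  type: NetworkChaos
  concurrencyPolicy: Forbid       # skip a run if the previous one is still active
  historyLimit: 5
  networkChaos:                   # same shape as the NetworkChaos spec above
    action: delay
    mode: one
    selector:
      namespaces:
      - default
      labelSelectors:
        app: nginx
    delay:
      latency: "100ms"
    duration: "5m"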

Comparison

Aspect             LitmusChaos       Chaos Mesh
Maturity           CNCF Graduated    CNCF Incubating
UI                 Good              Excellent
Experiment count   ChaosHub 50+      Built-in 30+
GitOps             Strong            Medium
Scheduling         Basic cron        Advanced workflow
Learning curve     Medium            Low

Quick pick:

  • Fast onboarding — Chaos Mesh
  • Enterprise policy + GitOps — LitmusChaos

5. AWS Fault Injection Simulator (FIS)

AWS's official chaos service (GA 2021).

Key features

  • EC2 stop/terminate
  • EBS volume detach
  • Network disruption (latency, packet loss)
  • RDS failover trigger
  • CloudWatch alarm auto-abort
  • IAM-based safety

Region shutdown experiment (careful)

FIS Template
  ├── Action: aws:network:disrupt-connectivity
  ├── Target: subnet in us-east-1
  ├── Stop Condition: CloudWatch alarm 'business metric below threshold'
  └── IAM Role: restricted permissions

Essentially Netflix's Chaos Kong as an AWS-managed service.
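As a gentler first experiment than a region drill, here is a minimal sketch of a template that stops one tagged EC2 instance, written as a CloudFormation AWS::FIS::ExperimentTemplate resource; the role ARN, alarm ARN, and tag are placeholders, and field shapes should be checked against the current FIS documentation:

Resources:
  StopOneInstance:
    Type: AWS::FIS::ExperimentTemplate
    Properties:
      Description: Stop one chaos-ready EC2 instance for 10 minutes
      RoleArn: arn:aws:iam::<account-id>:role/<fis-experiment-role>
      StopConditions:
        - Source: aws:cloudwatch:alarm
          Value: <alarm-arn>                      # auto-abort if the business metric alarm fires
      Targets:
        backendInstances:
          ResourceType: aws:ec2:instance
          ResourceTags:
            chaos-ready: "true"                   # only opt-in instances are eligible
          SelectionMode: "COUNT(1)"
      Actions:
        stopInstance:
          ActionId: aws:ec2:stop-instances
          Parameters:
            startInstancesAfterDuration: PT10M    # restart automatically after 10 minutes
          Targets:
            Instances: backendInstances
      Tags:
        Name: stop-one-instance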

Azure Chaos Studio / GCP

  • Azure Chaos Studio launched 2021, AKS integration
  • GCP: no official product, Chaos Mesh recommended

6. Game Day — organizational disaster drill

Why Game Day

Automated chaos is "tech validation." Game Day is "organizational validation." It tests:

  • Who responds? (on-call rotation)
  • Which runbooks actually work?
  • Do Slack/PagerDuty alerts fire correctly?
  • Where does communication slow down?
  • Does the escalation chain work?

Template

  1. Set objective — e.g. "Recover to us-west-2 within 15 min of us-east-1 shutdown"
  2. Choose participants — SRE, backend, DBA, frontend, PM; share schedule
  3. Write scenario — "10:00 RDS primary failover; 10:05 +10 percent write traffic; 10:10 restart half of cache nodes"
  4. Execute — separate observers from operators; live timeline in Slack
  5. Retrospective — what worked, what surprised you, which runbooks to update, what to automate

Practical tips

  • First Game Day: low risk — weekday afternoon, low traffic
  • Keep a chat channel open — decisions logged
  • Time-box — "1 hour, abort if not recovered"
  • Control blast radius — start with 1–5 percent

7. Observability meets chaos

Chaos without Observability = Guessing

During experiments, watch:

  • Steady state metrics
  • Error propagation in dependencies
  • Cascading failure signals

The three pillars (see the OpenTelemetry deep dive):

  • Metrics — continuous numerics
  • Logs — anomalous events
  • Traces — how failure propagates

Canary + Chaos pattern

  • Deploy to 10 percent canary
  • Inject chaos only on the canary (selector sketch below)
  • Verify graceful failure handling
  • Pass → full rollout
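A sketch of the canary-only targeting, assuming Chaos Mesh and a rollout tool that labels canary pods; the app and rollout labels are assumptions about your deployment setup:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: canary-only-kill
spec:
  action: pod-kill
  mode: all                # kill every matching pod, but the selector only matches the canary
  selector:
    namespaces:
    - production
    labelSelectors:
      app: checkout        # illustrative service
      rollout: canary      # assumed label set by your rollout tooling on canary pods only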

Auto-abort

Stop automatically if a business metric drops. In an AWS FIS experiment template, the stop condition points at a CloudWatch alarm:

stopConditions:
- source: aws:cloudwatch:alarm
  value: arn:aws:cloudwatch:<region>:<account-id>:alarm:checkout-success-rate-below-95

Prevents accidental major outages.


8. Blameless postmortems — the culture pillar

Why "blameless"

Blaming individuals leads to:

  • Hidden incidents
  • Halted learning
  • Destroyed psychological safety

Focus on "how did the system allow failure?" — not who erred.

Template

  1. Impact — users, duration, revenue
  2. Timeline — minute-by-minute
  3. Root cause — Five Whys
  4. What went well — detection, teamwork
  5. What went badly — missed alerts, weak runbook
  6. Action items — owner + due date

Five Whys example

Problem: payments failed for 15 min

  1. Why? Payment service unresponsive
  2. Why? DB connection pool exhausted
  3. Why? A query scanned without an index
  4. Why? Recent deploy added the query, missed the index
  5. Why? PR review checklist lacked "verify index"

Action: add index check to PR template. Root cause lands on process gap, not developer error.

Just Culture

Blameless ≠ no accountability. Willful negligence / repeated carelessness warrants action. See John Allspaw (Etsy) on Just Culture.


9. 10 chaos recipes

1. Random pod delete

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: random-pod-kill
spec:
  action: pod-kill
  mode: random-max-percent      # kill up to "value" percent of matching pods
  value: "25"
  selector:
    namespaces: [production]
    labelSelectors: { tier: backend }
# To run this every 6 hours ("0 */6 * * *"), wrap the spec in a Schedule object as sketched
# in section 4; the inline scheduler field was removed in Chaos Mesh 2.x.

Validate: readinessProbe, graceful termination, PodDisruptionBudget.
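The PodDisruptionBudget piece, as a minimal sketch for the backend tier targeted above; the minAvailable value is illustrative:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend-pdb
spec:
  minAvailable: 2         # keep at least 2 backend pods during voluntary disruptions (drains, rollouts)
  selector:
    matchLabels:
      tier: backend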

2. CPU stress

Validate: HPA scale-out, throttling response.
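A minimal sketch using Chaos Mesh's StressChaos; the target label and load values are illustrative:

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
spec:
  mode: one
  selector:
    labelSelectors:
      app: backend        # illustrative target
  stressors:
    cpu:
      workers: 2          # number of stress workers
      load: 80            # target CPU load per worker, in percent
  duration: "10m"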

3. Network latency

networkChaos:
  action: delay
  delay: { latency: "500ms" }

Validate: timeouts, circuit breaker.

4. DNS failure

"It's always DNS" is a joke for a reason: DNS failures are common and easy to overlook. Validate: DNS cache policy, reconnection.
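A sketch using Chaos Mesh's DNSChaos, which requires the optional chaos-dns-server component; the domains and namespace are placeholders:

apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: dns-error
spec:
  action: error           # return DNS errors; "random" returns random IPs instead
  mode: all
  patterns:
    - payments.example.com
    - "*.thirdparty.example"
  selector:
    namespaces:
    - production
  duration: "2m"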

5. Disk full

dd if=/dev/zero of=/tmp/fill bs=1M count=10000   # writes ~10 GB of zeros to fill the volume

Validate: log rotation, disk alarms.

6. Dependency 5xx

Istio VirtualService fault injection (it configures Envoy's fault filter under the hood):

fault:
  abort:
    percentage:
      value: 50           # abort half of matching requests
    httpStatus: 503

Validate: retry policy, fallback.

7. DB failover

Validate: pool reconnect, client retry.

8. Region/AZ shutdown

AWS FIS subnet network disruption. Validate: multi-AZ, auto-failover.

9. Clock skew

Break NTP, skew time. Validate: JWT validation, event timestamps, TLS certs.
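One way to do this without touching the node's NTP is Chaos Mesh's TimeChaos, which skews the clock seen by the target containers; the label is illustrative:

apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: clock-skew
spec:
  mode: one
  selector:
    labelSelectors:
      app: auth           # illustrative target
  timeOffset: "-10m"      # shift the container clock back by 10 minutes
  duration: "5m"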

10. Traffic surge

k6/Locust at 3x expected. Validate: rate limiting, HPA, CDN hit ratio.


10. Chaos maturity model

Level 1 — Chaos-curious

Interested, but no tooling; incident handling is reactive and postmortems assign blame.

Level 2 — Staging chaos

Occasional staging experiments, runbooks, on-call rotation.

Level 3 — Production chaos

Controlled production experiments, defined steady state, quarterly Game Day.

Level 4 — Continuous chaos

Chaos in CI/CD, auto-expanding blast radius, auto-abort + observability.

Level 5 — Engineering culture

Every PR considers "what is this change's chaos scenario?"; blameless postmortems are natural; failure is seen as learning.

Netflix, Google, and Amazon operate at Level 5. Most organizations are at Level 1–2. Climb one step at a time.


11. Architecture antipatterns chaos reveals

  1. Hidden dependency — "That service can't die, right?" Chaos says otherwise.
  2. Cascading failure — A depends on B; when B goes down, A's retry storm overloads shared resources and takes C down with it.
  3. Single point of failure — single DB primary, single LB, single DNS.
  4. Inadequate timeouts — 30 s client wait looks fine until chaos shows the UX pain.
  5. Incomplete retries — retry without backoff amplifies load.

12. Regulated environments (finance, health, gov)

"We're too regulated to do prod chaos."

Approach 1: Regulatory Sandboxing

Many regs (GDPR, PCI-DSS, HIPAA) allow prod-mirror testing. Mask real data, run chaos.

Approach 2: Minimal Blast Radius

Start at 0.1 percent. Internal users only. Audit logs strict.

Approach 3: Game Day only in prod

1–2 times/year, pre-approved, formal report.

Approach 4: Real incidents as chaos

Structured learning on every incident — you are already doing chaos.


13. 12-point checklist

  1. Define steady state first (business metric)
  2. Start in staging, then 10 percent prod
  3. Declare blast radius
  4. Always set stop conditions
  5. Observability first
  6. Pair with runbooks
  7. PDB / preStop hooks
  8. On-call must know
  9. Document results
  10. Blameless postmortem culture
  11. Regular Game Days (quarterly)
  12. Executive sponsorship — chaos is a culture investment

Next post — Feature Flags and Progressive Delivery

If chaos explores failures in production, Feature Flags explore new features there. Next time:

  • History — from Facebook "Dark Launch" to LaunchDarkly
  • Flag types — release, experiment, ops, permission
  • Progressive Delivery — Canary → Rolling → Blue/Green
  • A/B Testing vs flags
  • Trunk-based development
  • Flag debt
  • Unleash, LaunchDarkly, GrowthBook, Flagsmith
  • OpenFeature standardization
  • Rollout strategies (bucket, geo, time)

"Separate deploy from release. The code is already there, the feature just isn't on yet — that is the modern default."