Chaos Engineering Deep Dive — Netflix Simian Army, LitmusChaos/Chaos Mesh, AWS FIS, Game Day

Intro — "You deliberately kill production servers?"

Newcomers to Chaos Engineering usually react:

"Outages are bad — why manufacture them?"

Answer: Outages happen anyway, and they hurt less when you first meet them in a controlled setting. Netflix internalized this lesson during its 2010 migration to AWS and unleashed Chaos Monkey on the world.

This post covers:

  • Why Netflix started killing servers — the origin
  • The 4 principles of Chaos Engineering
  • Simian Army — from Chaos Monkey to Chaos Kong
  • LitmusChaos / Chaos Mesh for Kubernetes
  • AWS Fault Injection Simulator
  • Game Day design and execution
  • Combining observability with chaos
  • Blameless postmortems — the culture pillar
  • 10 real-world chaos recipes
  • Maturity model — where is your org?

1. Origins — Netflix's 2010 decision

Problem: from monolith to cloud

Netflix suffered a 3-day DVD shipping outage in 2008. The decision: "We can't solve this in our own DC. Move to the cloud." The AWS migration taught:

  • Individual cloud instances are unreliable (commodity VMs)
  • Networks partition (across AZs and regions)
  • Dependent services are always failing somewhere

Two choices:

  1. Write perfect code (impossible)
  2. Design assuming failure (realistic)

Chaos Monkey is born (2010)

A simple tool: randomly terminate EC2 instances. Internal pushback was heavy — "You're crazy." But engineers soon designed their services to survive single-instance loss. Results:

  • Restart-tolerant architecture
  • Rolling replacement becomes normal
  • Outages become routine, not news

Open-sourced in 2012. The term Chaos Engineering enters the world.


2. The 4 Principles

From principlesofchaos.org, formalized by the Netflix chaos engineering team:

1. Define "Steady State"

There must be a measurable indicator of "healthy." Examples:

  • Requests per second
  • P99 latency
  • Error rate
  • Business metrics (checkout success rate)

Without a clear definition, you can't tell if the experiment succeeded. Netflix's "Starts Per Second (SPS)" is a classic business metric.
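It helps to write the steady state down explicitly before the experiment. The snippet below is a hypothetical declaration format, not any specific tool's schema, and the metric names are illustrative:

steadyState:
  name: checkout-success-rate
  source: prometheus
  query: sum(rate(checkout_success_total[5m])) / sum(rate(checkout_requests_total[5m]))
  tolerance: ">= 0.99"    # abort or fail the experiment if the rate drops below 99 percent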

2. Vary real-world events

Inject events that actually happen:

  • Instance down
  • Network latency/partition
  • DNS failure
  • Dependency timeout
  • Region outage
  • Disk full
  • Clock skew

Focus on environmental events, not code bugs.

3. Experiment in production

Staging is not production:

  • Different traffic patterns
  • Different data sizes
  • Different third-party dependencies

Introduce gradually — start at 10 percent traffic. But production is the goal.

4. Automate and run continuously

A one-off test only proves resilience at a single point in time; only continuous experimentation catches regressions. Chaos belongs in the CI/CD pipeline.
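A minimal sketch of what this can look like, assuming a GitHub Actions workflow whose runner has access to the target cluster; the manifest path and steady-state script are illustrative:

name: nightly-chaos
on:
  schedule:
    - cron: "0 2 * * *"    # run off-peak
jobs:
  chaos:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run the pod-delete experiment
        run: kubectl apply -f chaos/pod-delete-chaosengine.yaml    # illustrative manifest path
      - name: Verify steady state
        run: ./scripts/check-steady-state.sh    # illustrative: fail the job if the metric drops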


3. Simian Army — the full Netflix lineup

Chaos Monkey (2010)

  • Random EC2 termination
  • Business hours only (engineers available)

Latency Monkey (2012)

  • Inject latency into service-to-service calls
  • Validate handling of slow external APIs

Conformity Monkey

  • Terminate instances not matching standards (old AMIs, bad tags)

Doctor Monkey

  • Detect and act on unhealthy instances (CPU 100 percent, unresponsive)

Janitor Monkey

  • Clean unused resources (orphan EBS, unused IPs) — also cost control

Security Monkey

  • Auto-detect bad IAM / security group configuration

Chaos Gorilla (2011)

  • Simulate full AZ outage

Chaos Kong (2013)

  • Take down an entire AWS region
  • Netflix credits this drill with riding out real region-scale AWS incidents

FIT (Failure Injection Testing)

  • Targeted fault injection by service / user / request
  • Fine-grained chaos

Most were open-sourced; many have been succeeded by newer tools.


4. Kubernetes era — LitmusChaos vs Chaos Mesh

LitmusChaos (CNCF Graduated 2024)

  • CNCF graduated project
  • Kubernetes-native (CRD-based)
  • ChaosHub — dozens of pre-built experiments
  • GitOps-friendly
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-chaos
spec:
  engineState: active                  # set to "stop" to halt the experiment
  appinfo:
    appns: default
    applabel: app=nginx
    appkind: deployment
  chaosServiceAccount: litmus-admin    # service account with permission to run the experiment
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION   # seconds
          value: "30"
        - name: CHAOS_INTERVAL         # seconds between deletions
          value: "10"

Experiment types: Pod delete, container kill, network loss/latency/corruption, CPU/memory/IO stress, node drain, AWS/Azure/GCP resource manipulation.

Chaos Mesh (CNCF Incubating)

  • Built by PingCAP (TiDB company)
  • Excellent UI dashboard
  • Strong scheduling — cron-based recurring experiments (Schedule sketch below)
  • Workflow support — multi-step experiments
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: delay-example
spec:
  action: delay
  mode: one
  selector:
    namespaces:
    - default
    labelSelectors:
      app: nginx
  delay:
    latency: "100ms"
    correlation: "25"
    jitter: "10ms"
  duration: "5m"
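The cron-style scheduling above is handled by a separate Schedule resource in Chaos Mesh 2.x. A minimal sketch that re-runs the delay experiment nightly; the name and cron expression are illustrative:

apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: nightly-network-delay
spec:
  schedule: "0 2 * * *"           # every night at 02:00
  type: NetworkChaos
  concurrencyPolicy: Forbid       # skip a run if the previous one is still active
  historyLimit: 5
  networkChaos:                   # same shape as the NetworkChaos spec above
    action: delay
    mode: one
    selector:
      namespaces:
      - default
      labelSelectors:
        app: nginx
    delay:
      latency: "100ms"
    duration: "5m"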

Comparison

Aspect             LitmusChaos       Chaos Mesh
Maturity           CNCF Graduated    CNCF Incubating
UI                 Good              Excellent
Experiment count   ChaosHub 50+      Built-in 30+
GitOps             Strong            Medium
Scheduling         Basic cron        Advanced workflow
Learning curve     Medium            Low

Quick pick:

  • Fast onboarding — Chaos Mesh
  • Enterprise policy + GitOps — LitmusChaos

5. AWS Fault Injection Simulator (FIS)

AWS's official chaos service (GA 2021).

Key features

  • EC2 stop/terminate
  • EBS volume detach
  • Network disruption (latency, packet loss)
  • RDS failover trigger
  • CloudWatch alarm auto-abort
  • IAM-based safety

Region shutdown experiment (careful)

FIS Template
  ├── Action: aws:network:disrupt-connectivity
  ├── Target: subnet in us-east-1
  ├── Stop Condition: CloudWatch alarm 'business metric below threshold'
  └── IAM Role: restricted permissions

Essentially Netflix's Chaos Kong as an AWS-managed service.
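As a gentler first experiment than a region drill, here is a minimal sketch of a template that stops one tagged EC2 instance, written as a CloudFormation AWS::FIS::ExperimentTemplate resource; the role ARN, alarm ARN, and tag are placeholders, and field shapes should be checked against the current FIS documentation:

Resources:
  StopOneInstance:
    Type: AWS::FIS::ExperimentTemplate
    Properties:
      Description: Stop one chaos-ready EC2 instance for 10 minutes
      RoleArn: arn:aws:iam::<account-id>:role/<fis-experiment-role>
      StopConditions:
        - Source: aws:cloudwatch:alarm
          Value: <alarm-arn>                      # auto-abort if the business metric alarm fires
      Targets:
        backendInstances:
          ResourceType: aws:ec2:instance
          ResourceTags:
            chaos-ready: "true"                   # only opt-in instances are eligible
          SelectionMode: "COUNT(1)"
      Actions:
        stopInstance:
          ActionId: aws:ec2:stop-instances
          Parameters:
            startInstancesAfterDuration: PT10M    # restart automatically after 10 minutes
          Targets:
            Instances: backendInstances
      Tags:
        Name: stop-one-instance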

Azure Chaos Studio / GCP

  • Azure Chaos Studio launched 2021, AKS integration
  • GCP: no official product, Chaos Mesh recommended

6. Game Day — organizational disaster drill

Why Game Day

Automated chaos is "tech validation." Game Day is "organizational validation." It tests:

  • Who responds? (on-call rotation)
  • Which runbooks actually work?
  • Do Slack/PagerDuty alerts fire correctly?
  • Where does communication slow down?
  • Does the escalation chain work?

Template

  1. Set objective — e.g. "Recover to us-west-2 within 15 min of us-east-1 shutdown"
  2. Choose participants — SRE, backend, DBA, frontend, PM; share schedule
  3. Write scenario — "10:00 RDS primary failover; 10:05 +10 percent write traffic; 10:10 restart half of cache nodes"
  4. Execute — separate observers from operators; live timeline in Slack
  5. Retrospective — what worked, what surprised you, which runbooks to update, what to automate

Practical tips

  • First Game Day: low risk — weekday afternoon, low traffic
  • Keep a chat channel open — decisions logged
  • Time-box — "1 hour, abort if not recovered"
  • Control blast radius — start with 1–5 percent

7. Observability meets chaos

Chaos without Observability = Guessing

During experiments, watch:

  • Steady state metrics
  • Error propagation in dependencies
  • Cascading failure signals

The three pillars (see the OpenTelemetry deep dive):

  • Metrics — continuous numerics
  • Logs — anomalous events
  • Traces — how failure propagates

Canary + Chaos pattern

  • Deploy to 10 percent canary
  • Inject chaos only on the canary (selector sketch below)
  • Verify graceful failure handling
  • Pass → full rollout
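A sketch of the canary-only targeting, assuming Chaos Mesh and a rollout tool that labels canary pods; the app and rollout labels are assumptions about your deployment setup:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: canary-only-kill
spec:
  action: pod-kill
  mode: all                # kill every matching pod, but the selector only matches the canary
  selector:
    namespaces:
    - production
    labelSelectors:
      app: checkout        # illustrative service
      rollout: canary      # assumed label set by your rollout tooling on canary pods only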

Auto-abort

Stop automatically if a business metric drops. In an AWS FIS experiment template, the stop condition points at a CloudWatch alarm:

stopConditions:
- source: aws:cloudwatch:alarm
  value: arn:aws:cloudwatch:<region>:<account-id>:alarm:checkout-success-rate-below-95

Prevents accidental major outages.


8. Blameless postmortems — the culture pillar

Why "blameless"

Blaming individuals leads to:

  • Hidden incidents
  • Halted learning
  • Destroyed psychological safety

Focus on "how did the system allow failure?" — not who erred.

Template

  1. Impact — users, duration, revenue
  2. Timeline — minute-by-minute
  3. Root cause — Five Whys
  4. What went well — detection, teamwork
  5. What went badly — missed alerts, weak runbook
  6. Action items — owner + due date

Five Whys example

Problem: payments failed for 15 min

  1. Why? Payment service unresponsive
  2. Why? DB connection pool exhausted
  3. Why? A query scanned without an index
  4. Why? Recent deploy added the query, missed the index
  5. Why? PR review checklist lacked "verify index"

Action: add index check to PR template. Root cause lands on process gap, not developer error.

Just Culture

Blameless ≠ no accountability. Willful negligence / repeated carelessness warrants action. See John Allspaw (Etsy) on Just Culture.


9. 10 chaos recipes

1. Random pod delete

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: random-pod-kill
spec:
  action: pod-kill
  mode: random-max-percent      # kill up to "value" percent of matching pods
  value: "25"
  selector:
    namespaces: [production]
    labelSelectors: { tier: backend }
# To run this every 6 hours ("0 */6 * * *"), wrap the spec in a Schedule object as sketched
# in section 4; the inline scheduler field was removed in Chaos Mesh 2.x.

Validate: readinessProbe, graceful termination, PodDisruptionBudget.
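The PodDisruptionBudget piece, as a minimal sketch for the backend tier targeted above; the minAvailable value is illustrative:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend-pdb
spec:
  minAvailable: 2         # keep at least 2 backend pods during voluntary disruptions (drains, rollouts)
  selector:
    matchLabels:
      tier: backend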

2. CPU stress

Validate: HPA scale-out, throttling response.
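A minimal sketch using Chaos Mesh's StressChaos; the target label and load values are illustrative:

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
spec:
  mode: one
  selector:
    labelSelectors:
      app: backend        # illustrative target
  stressors:
    cpu:
      workers: 2          # number of stress workers
      load: 80            # target CPU load per worker, in percent
  duration: "10m"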

3. Network latency

networkChaos:
  action: delay
  delay: { latency: "500ms" }

Validate: timeouts, circuit breaker.

4. DNS failure

"It's always DNS" is a joke for a reason: DNS failures are common and easy to overlook. Validate: DNS cache policy, reconnection.
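A sketch using Chaos Mesh's DNSChaos, which requires the optional chaos-dns-server component; the domains and namespace are placeholders:

apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: dns-error
spec:
  action: error           # return DNS errors; "random" returns random IPs instead
  mode: all
  patterns:
    - payments.example.com
    - "*.thirdparty.example"
  selector:
    namespaces:
    - production
  duration: "2m"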

5. Disk full

dd if=/dev/zero of=/tmp/fill bs=1M count=10000   # writes ~10 GB of zeros to fill the volume

Validate: log rotation, disk alarms.

6. Dependency 5xx

Istio VirtualService fault injection (it configures Envoy's fault filter under the hood):

fault:
  abort:
    percentage:
      value: 50           # abort half of matching requests
    httpStatus: 503

Validate: retry policy, fallback.

7. DB failover

Validate: pool reconnect, client retry.

8. Region/AZ shutdown

AWS FIS subnet network disruption. Validate: multi-AZ, auto-failover.

9. Clock skew

Break NTP, skew time. Validate: JWT validation, event timestamps, TLS certs.
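One way to do this without touching the node's NTP is Chaos Mesh's TimeChaos, which skews the clock seen by the target containers; the label is illustrative:

apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: clock-skew
spec:
  mode: one
  selector:
    labelSelectors:
      app: auth           # illustrative target
  timeOffset: "-10m"      # shift the container clock back by 10 minutes
  duration: "5m"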

10. Traffic surge

k6/Locust at 3x expected. Validate: rate limiting, HPA, CDN hit ratio.


10. Chaos maturity model

Level 1 — Chaos-curious

Interested, but no tooling; incident handling is reactive and postmortems assign blame.

Level 2 — Staging chaos

Occasional staging experiments, runbooks, on-call rotation.

Level 3 — Production chaos

Controlled production experiments, defined steady state, quarterly Game Day.

Level 4 — Continuous chaos

Chaos in CI/CD, auto-expanding blast radius, auto-abort + observability.

Level 5 — Engineering culture

Every PR considers "what is this change's chaos scenario?"; blameless postmortems are natural; failure is seen as learning.

Netflix, Google, and Amazon operate at Level 5. Most organizations are at Level 1–2. Climb one step at a time.


11. Architecture antipatterns chaos reveals

  1. Hidden dependency — "That service can't die, right?" Chaos says otherwise.
  2. Cascading failure — A depends on B; when B goes down, A's retry storm overloads shared resources and takes C down with it.
  3. Single point of failure — single DB primary, single LB, single DNS.
  4. Inadequate timeouts — 30 s client wait looks fine until chaos shows the UX pain.
  5. Incomplete retries — retry without backoff amplifies load.

12. Regulated environments (finance, health, gov)

"We're too regulated to do prod chaos."

Approach 1: Regulatory Sandboxing

Many regs (GDPR, PCI-DSS, HIPAA) allow prod-mirror testing. Mask real data, run chaos.

Approach 2: Minimal Blast Radius

Start at 0.1 percent. Internal users only. Audit logs strict.

Approach 3: Game Day only in prod

1–2 times/year, pre-approved, formal report.

Approach 4: Real incidents as chaos

Structured learning on every incident — you are already doing chaos.


13. 12-point checklist

  1. Define steady state first (business metric)
  2. Start in staging, then 10 percent prod
  3. Declare blast radius
  4. Always set stop conditions
  5. Observability first
  6. Pair with runbooks
  7. PDB / preStop hooks
  8. On-call must know
  9. Document results
  10. Blameless postmortem culture
  11. Regular Game Days (quarterly)
  12. Executive sponsorship — chaos is a culture investment

Next post — Feature Flags and Progressive Delivery

If chaos explores failures in production, Feature Flags explore new features there. Next time:

  • History — from Facebook "Dark Launch" to LaunchDarkly
  • Flag types — release, experiment, ops, permission
  • Progressive Delivery — Canary → Rolling → Blue/Green
  • A/B Testing vs flags
  • Trunk-based development
  • Flag debt
  • Unleash, LaunchDarkly, GrowthBook, Flagsmith
  • OpenFeature standardization
  • Rollout strategies (bucket, geo, time)

"Separate deploy from release. The code is already there, the feature just isn't on yet — that is the modern default."