Chaos Engineering Complete Guide 2025: Resilience Testing, Chaos Monkey, Litmus, Game Day
Introduction: Why Chaos Engineering
Modern distributed systems consist of dozens to hundreds of microservices, message queues, databases, caches, and load balancers. In these complex systems, failure is not a question of "if" but "when."
In 2012, Netflix suffered a major service outage when AWS's US-East-1 region failed. Incidents like this cemented the philosophy Netflix had adopted since its cloud migration: "instead of avoiding failure, build systems resilient to failure." Chaos Engineering grew out of that philosophy.
Chaos Engineering is not simply "randomly killing servers." It is a discipline grounded in scientific experimentation methodology that proactively discovers system vulnerabilities and strengthens resilience so services continue operating normally even during failures.
What This Post Covers
- Core principles and experiment design of Chaos Engineering
- Netflix Chaos Monkey origin story and the Simian Army
- Tool comparison: Litmus Chaos, Chaos Mesh, Gremlin, AWS FIS
- Fault type experiments: network, CPU, memory, Pod, node, AZ
- Kubernetes chaos experiment YAML examples
- Game Day operations guide
- Progressive adoption strategy (dev to staging to production)
- Netflix and Amazon case studies
- CI/CD integration and observability
1. Core Principles of Chaos Engineering
The principles of Chaos Engineering are defined at principlesofchaos.org.
1.1 Steady State Hypothesis
Every chaos experiment begins with defining "steady state."
Steady State Examples:
- API response time p99 < 200ms
- Error rate < 0.1%
- Order processing success rate > 99.9%
- Database replication lag < 1 second
The steady state hypothesis is: "Even after injecting a specific failure, the system will maintain its steady state."
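Checking a steady-state hypothesis ultimately reduces to comparing observed metrics against thresholds. A minimal sketch of that check (metric names and limits are illustrative, not taken from any particular tool):

```python
# Minimal sketch: evaluate a steady-state hypothesis against observed metrics.
# Metric names and thresholds are illustrative.

def steady_state_holds(metrics: dict, thresholds: dict) -> bool:
    """Return True if every observed metric satisfies its (operator, limit) rule."""
    ops = {"<": lambda a, b: a < b, ">": lambda a, b: a > b}
    return all(ops[op](metrics[name], limit)
               for name, (op, limit) in thresholds.items())

thresholds = {
    "p99_latency_ms":    ("<", 200),    # API response time p99 < 200ms
    "error_rate_pct":    ("<", 0.1),    # Error rate < 0.1%
    "order_success_pct": (">", 99.9),   # Order success rate > 99.9%
}

during_chaos = {"p99_latency_ms": 185, "error_rate_pct": 0.04,
                "order_success_pct": 99.95}
print(steady_state_holds(during_chaos, thresholds))  # True: hypothesis holds
```

The same function can be run before the experiment (baseline), during injection, and after recovery, so the verdict is mechanical rather than a judgment call.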
1.2 Vary Real-World Events
You must simulate failures that could actually occur in production.
| Failure Type | Real-World Case | Experiment Method |
|---|---|---|
| Server Down | Hardware failure, OOM Kill | Force-terminate Pod/instance |
| Network Latency | Cross-region communication delay | Inject latency via tc command |
| Network Partition | Switch failure | Block traffic via iptables rules |
| CPU Overload | Sudden traffic spike | Generate CPU load with stress-ng |
| Disk Full | Log explosion | Fill disk with dd command |
| DNS Failure | DNS server down | Manipulate DNS responses |
| AZ Failure | Data center outage | Block all AZ traffic |
1.3 Run Experiments in Production
True confidence can only be validated in production environments. Staging environments differ from production in traffic patterns, data volumes, and infrastructure configuration.
Of course, this does not mean starting in production right away. Adopt progressively, but the ultimate goal should be production experimentation.
1.4 Minimize Blast Radius
Blast Radius Control Strategies:
1. Start small: Single instance -> instance group -> AZ -> region
2. Time limit: Set experiment duration (e.g., 5 minutes)
3. Auto-abort: Stop experiment immediately when key metrics exceed thresholds
4. Traffic limit: Apply experiment to only 1% of total traffic
5. Rollback plan: Verify rollback procedure before experiment
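Strategy 3 (auto-abort) is the one most worth automating first. A minimal sketch of the idea, with a hypothetical error-rate threshold:

```python
# Sketch of an auto-abort guard: stop the experiment as soon as a key metric
# crosses its threshold. The 5% limit and sampling interval are illustrative.
from typing import Optional

def first_abort_tick(error_rates, max_error_rate: float = 0.05) -> Optional[int]:
    """Return the index of the first reading that violates the abort
    condition, or None if the experiment can run to completion."""
    for tick, rate in enumerate(error_rates):
        if rate > max_error_rate:
            return tick
    return None

# Error-rate samples collected every 10s during a hypothetical experiment:
samples = [0.01, 0.02, 0.03, 0.09, 0.20]
tick = first_abort_tick(samples)
print(f"abort at tick {tick}" if tick is not None else "ran to completion")
# prints "abort at tick 3"
```

In a real setup this loop would sit next to the chaos controller and trigger the experiment's halt API the moment a violation is seen, rather than after the run.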
1.5 Automate Experiments
Manual experiments cannot scale. Automation includes:
- Experiment scheduling (running at specific times each week)
- CI/CD pipeline integration
- Automated result collection and report generation
- Automatic abort condition monitoring
2. Chaos Monkey: Netflix's Beginning
2.1 Origin Story
In 2010, as Netflix migrated entirely from its own data centers to AWS, the team faced a fundamental question:
"In cloud infrastructure, servers can disappear at any time. Can our system handle this?"
Chaos Monkey was born to answer this question. It randomly terminates instances in production, forcing every service to be resilient against single-instance failures.
2.2 The Simian Army
After the success of Chaos Monkey, Netflix created the Simian Army to experiment with a wider variety of failure scenarios.
Simian Army Members:
+----------------------+--------------------------------------+
| Name | Role |
+----------------------+--------------------------------------+
| Chaos Monkey | Random instance termination |
| Latency Monkey | Network latency injection |
| Conformity Monkey | Detect non-compliant instances |
| Doctor Monkey | Unhealthy instance health checks |
| Janitor Monkey | Clean up unused resources |
| Security Monkey | Detect security vulnerabilities |
| Chaos Gorilla | Full AZ failure simulation |
| Chaos Kong | Full region failure simulation |
+----------------------+--------------------------------------+
2.3 Chaos Monkey Configuration Example
# Chaos Monkey Configuration (Spinnaker Integration)
chaos_monkey:
  enabled: true
  leashed: false
  schedule:
    frequency: "weekday"
    start_hour: 10
    end_hour: 16
    timezone: "America/New_York"
  grouping: "cluster"
  probability: 1.0
  exceptions:
    - account: "prod-critical"
      region: "us-east-1"
      stack: "payment"
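With `grouping: "cluster"` and `probability: 1.0`, each cluster independently gets at most one instance terminated per scheduled run. A rough simulation of that selection logic (an illustration of the documented behavior, not Netflix's actual implementation):

```python
import random

# Rough sketch of Chaos Monkey-style victim selection: with grouping
# "cluster", each cluster independently yields at most one termination
# per run, subject to the configured probability.

def pick_victims(clusters, probability, rng):
    victims = []
    for name, instances in clusters.items():
        if instances and rng.random() < probability:
            victims.append(rng.choice(instances))
    return victims

clusters = {
    "web":    ["web-1", "web-2", "web-3"],
    "orders": ["orders-1", "orders-2"],
}
rng = random.Random(42)  # seeded for reproducibility
print(pick_victims(clusters, probability=1.0, rng=rng))  # one victim per cluster
```

Lowering `probability` below 1.0 would make some clusters skip a given run entirely, which is how teams dial blast radius down while keeping the schedule.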
3. Chaos Tool Comparison
3.1 Tool Overview
| Tool | Developer | Environment | License | Features |
|---|---|---|---|---|
| Litmus Chaos | CNCF | Kubernetes | Apache 2.0 | ChaosHub, GitOps native |
| Chaos Mesh | CNCF | Kubernetes | Apache 2.0 | Powerful dashboard, TimeChaos |
| Gremlin | Gremlin Inc. | All | Commercial | SaaS-based, enterprise features |
| AWS FIS | AWS | AWS | Pay-per-use | Native AWS service integration |
| Azure Chaos Studio | Microsoft | Azure | Pay-per-use | Native Azure service integration |
| Steadybit | Steadybit | All | Commercial | Safety guardrails emphasis |
3.2 Litmus Chaos in Detail
Litmus is a CNCF Incubating project and a Kubernetes-native chaos engineering framework.
# Litmus Chaos Installation
apiVersion: v1
kind: Namespace
metadata:
  name: litmus
---
# LitmusChaos Operator Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chaos-operator-ce
  namespace: litmus
spec:
  replicas: 1
  selector:
    matchLabels:
      name: chaos-operator
  template:
    metadata:
      labels:
        name: chaos-operator
    spec:
      serviceAccountName: litmus
      containers:
        - name: chaos-operator
          image: litmuschaos/chaos-operator:3.0.0
          env:
            - name: CHAOS_RUNNER_IMAGE
              value: "litmuschaos/chaos-runner:3.0.0"
            - name: WATCH_NAMESPACE
              value: ""
The core concepts of Litmus are ChaosExperiment, ChaosEngine, and ChaosResult.
# Pod Delete Experiment Definition
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
  namespace: litmus
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["delete", "list", "get"]
    image: "litmuschaos/go-runner:3.0.0"
    args:
      - -c
      - ./experiments -name pod-delete
    command:
      - /bin/bash
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "30"
      - name: CHAOS_INTERVAL
        value: "10"
      - name: FORCE
        value: "false"
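To actually run this experiment, a ChaosEngine binds it to a target application. A minimal sketch; the target namespace, label, and service account name are placeholders:

```yaml
# Sketch: ChaosEngine binding the pod-delete experiment to an app
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-engine
  namespace: litmus
spec:
  appinfo:
    appns: "default"              # placeholder target namespace
    applabel: "app=my-service"    # placeholder target label
    appkind: "deployment"
  engineState: "active"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
```

Once applied, the operator launches a chaos runner, and the outcome is written to a ChaosResult object whose verdict can be inspected with kubectl.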
3.3 Chaos Mesh in Detail
Chaos Mesh is a Kubernetes chaos platform supporting diverse fault types.
# Chaos Mesh Installation (Helm)
# helm repo add chaos-mesh https://charts.chaos-mesh.org
# helm install chaos-mesh chaos-mesh/chaos-mesh -n chaos-mesh --create-namespace

# Network Delay Experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-example
  namespace: default
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: my-service
  delay:
    latency: "200ms"
    jitter: "50ms"
    correlation: "50"
  duration: "5m"
  scheduler:
    cron: "@every 1h"
3.4 AWS Fault Injection Service (FIS)
AWS FIS is a chaos engineering service natively integrated with AWS services.
{
  "description": "EC2 instance termination experiment",
  "targets": {
    "myInstances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "Environment": "staging",
        "ChaosReady": "true"
      },
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "stopInstances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": {},
      "targets": {
        "Instances": "myInstances"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789:alarm:HighErrorRate"
    }
  ],
  "roleArn": "arn:aws:iam::123456789:role/FISRole"
}
3.5 Gremlin
Gremlin is a commercial SaaS-based chaos engineering platform.
# Gremlin Agent Installation (Kubernetes)
helm repo add gremlin https://helm.gremlin.com
helm install gremlin gremlin/gremlin \
--set gremlin.secret.managed=true \
--set gremlin.secret.type=secret \
--set gremlin.secret.teamID=YOUR_TEAM_ID \
--set gremlin.secret.clusterID=my-cluster \
--set gremlin.secret.teamSecret=YOUR_SECRET
Gremlin's advantages include an intuitive UI, fine-grained control, and safety features (Safety Net). Basic chaos experiments are available even on the free tier.
4. Fault Type Experiments
4.1 Network Failures
Network failures are the most common and critical issues in distributed systems.
# Chaos Mesh: Network Partition
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition
  namespace: default
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: order-service
  direction: both
  target:
    selector:
      namespaces:
        - default
      labelSelectors:
        app: payment-service
    mode: all
  duration: "2m"

# Chaos Mesh: Packet Loss
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: packet-loss
  namespace: default
spec:
  action: loss
  mode: all
  selector:
    labelSelectors:
      app: api-gateway
  loss:
    loss: "25"
    correlation: "50"
  duration: "3m"

# Chaos Mesh: Bandwidth Limitation
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: bandwidth-limit
  namespace: default
spec:
  action: bandwidth
  mode: all
  selector:
    labelSelectors:
      app: file-service
  bandwidth:
    rate: "1mbps"
    limit: 20971520
    buffer: 10000
  duration: "5m"
4.2 CPU Stress
# Chaos Mesh: CPU Stress
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
  namespace: default
spec:
  mode: one
  selector:
    labelSelectors:
      app: compute-service
  stressors:
    cpu:
      workers: 4
      load: 80
  duration: "5m"
4.3 Memory Pressure
# Chaos Mesh: Memory Stress
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress
  namespace: default
spec:
  mode: one
  selector:
    labelSelectors:
      app: cache-service
  stressors:
    memory:
      workers: 2
      size: "512MB"
  duration: "3m"
4.4 Pod Kill
# Chaos Mesh: Pod Kill
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: default
spec:
  action: pod-kill
  mode: fixed
  value: "2"
  selector:
    namespaces:
      - default
    labelSelectors:
      app: web-frontend
  gracePeriod: 0
  duration: "1m"

# Chaos Mesh: Container Kill
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: container-kill
  namespace: default
spec:
  action: container-kill
  mode: one
  selector:
    labelSelectors:
      app: multi-container-app
  containerNames:
    - sidecar-proxy
  duration: "30s"
4.5 Node Drain
# Litmus: Node Drain Experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: node-drain-engine
  namespace: litmus
spec:
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: node-drain
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: TARGET_NODE
              value: "worker-node-02"
            - name: APP_NAMESPACE
              value: "default"
            - name: APP_LABEL
              value: "app=critical-service"
4.6 AZ Failure Simulation
AZ (Availability Zone) failure is the experiment with the largest blast radius.
# AWS FIS: AZ Failure Simulation
experiment_template:
  description: "AZ-a failure simulation"
  targets:
    azSubnets:
      resourceType: "aws:ec2:subnet"
      resourceTags:
        AvailabilityZone: "us-east-1a"
      selectionMode: "ALL"
  actions:
    disruptConnectivity:
      actionId: "aws:network:disrupt-connectivity"
      parameters:
        scope: "all"
        duration: "PT10M"
      targets:
        Subnets: "azSubnets"
  stopConditions:
    - source: "aws:cloudwatch:alarm"
      value: "arn:aws:cloudwatch:region:account:alarm:CriticalErrorRate"
4.7 DNS Failure
# Chaos Mesh: DNS Failure
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: dns-failure
  namespace: default
spec:
  action: error
  mode: all
  selector:
    labelSelectors:
      app: external-api-client
  patterns:
    - "external-api.example.com"
  duration: "2m"
5. Experiment Design Methodology
5.1 Systematic Experiment Design
Chaos experiments follow the same structure as scientific experiments.
Chaos Experiment Design Framework:
1. Hypothesis
"Even if 1 payment-service Pod is terminated,
order success rate will remain above 99.9%"
2. Steady State Definition
- Order success rate: 99.95%
- API response time p99: 180ms
- Error rate: 0.05%
3. Variables
- Independent variable: Force-terminate 1 Pod
- Dependent variables: Order success rate, response time, error rate
4. Blast Radius
- Target: 1 of 3 payment-service Pods
- Impact scope: Approximately 33% of payment requests
5. Abort Conditions
- Order success rate < 99.0%
- API response time p99 > 1000ms
- Concurrent errors > 100/min
6. Execute and Observe
- Collect baseline for 5 minutes before experiment
- Execute experiment (2 minutes)
- Observe recovery for 5 minutes after experiment
7. Analyze and Improve
- Verify/falsify hypothesis
- Document discovered vulnerabilities
- Create improvement work tickets
5.2 Experiment Checklist
Pre-Experiment Checklist:
[ ] Is the hypothesis clearly defined?
[ ] Are steady state metrics being collected?
[ ] Is the blast radius appropriately limited?
[ ] Are abort conditions configured?
[ ] Is the rollback procedure documented?
[ ] Have relevant teams been notified?
[ ] Is the monitoring dashboard ready?
[ ] Is an on-call engineer standing by?
During-Experiment Checklist:
[ ] Are metrics being monitored in real-time?
[ ] Are abort conditions being continuously checked?
[ ] Are there any unexpected side effects?
Post-Experiment Checklist:
[ ] Has the system returned to steady state?
[ ] Have experiment results been recorded?
[ ] Have action items been created for discovered issues?
[ ] Have results been shared with the team?
5.3 Experiment Result Template
Experiment ID: CE-2025-042
Date: 2025-04-14
Experimenter: SRE Team (Jane Doe)
Hypothesis: Even when 1 payment-service Pod is terminated,
order success rate remains above 99.9%
Steady State Baseline:
- Order success rate: 99.97%
- p99 response time: 165ms
- Error rate: 0.03%
Experiment Results:
- Order success rate: 99.12% (Hypothesis FALSIFIED!)
- p99 response time: 892ms
- Error rate: 0.88%
- Recovery time: 45 seconds
Findings:
1. Kubernetes readiness probe interval set to 30s,
causing traffic to continue routing to terminating Pod
2. Circuit breaker timeout set too high (10 seconds)
3. Retry logic retrying to the same Pod
Improvement Actions:
- [HIGH] Reduce readiness probe interval to 5 seconds
- [HIGH] Adjust circuit breaker timeout to 2 seconds
- [MED] Modify retry logic to route to different instances
6. Game Day Operations Guide
6.1 What is Game Day?
Game Day is a training session where the team gathers to execute planned failure scenarios, respond in real-time, and learn. Inspired by military "War Games," it improves response capabilities during actual incidents.
6.2 Game Day Planning
Game Day Planning Template:
1. Purpose
- Verify payment system AZ failure resilience
- Train incident response process
- Onboard new team members
2. Schedule
- Date: Monday, April 14, 2025
- Time: 14:00 - 17:00 EST
- Location: Conference Room + Zoom
3. Participants
- Game Master: SRE Lead (injects scenarios)
- Incident Commander: Backend Lead
- Response Team: 3 Backend, 1 Frontend, 1 DBA
- Observers: CTO, PM
4. Scenarios (hidden from participants)
Scenario 1: Kill 50% of payment-service Pods
Scenario 2: Database read replica connection lost
Scenario 3: External payment gateway 5-second delay
5. Success Criteria
- Detection time < 5 minutes
- Mitigation time < 15 minutes
- Zero customer impact
6. Safety Measures
- Kill switch: Game master can stop all experiments immediately
- Run in staging, not production
- No impact on actual customer traffic
6.3 Game Day Roles
Role Definitions:
Game Master
- Design scenarios and inject failures
- Authority to abort experiments
- Provide hints (when needed)
- Time management
Incident Commander
- Coordinate response team
- Decision making
- Communication management
- Escalation judgment
Response Team (Responders)
- Diagnose and resolve issues
- Reference and execute runbooks
- Report status in real-time
Observers
- Record response process
- Note process improvements
- Do NOT intervene (important!)
Scribe
- Record timeline
- Document key decisions
- Organize findings
6.4 Game Day Execution Timeline
14:00 - 14:15 Kickoff Briefing
- Explain rules
- Confirm safety measures
- Verify tool access
14:15 - 14:20 Scenario 1 Injection (Game Master)
14:20 - 14:50 Response Team Detection and Response
14:50 - 15:00 Scenario 1 Review
15:00 - 15:10 Break
15:10 - 15:15 Scenario 2 Injection
15:15 - 15:45 Response Team Detection and Response
15:45 - 15:55 Scenario 2 Review
15:55 - 16:00 Scenario 3 Injection
16:00 - 16:30 Response Team Detection and Response
16:30 - 16:40 Scenario 3 Review
16:40 - 17:00 Comprehensive Retrospective
- What went well
- What to improve
- Derive action items
6.5 Game Day Retrospective
Retrospective Questions:
Detection
- Which alert fired first?
- Could you accurately identify the problem from the alert?
- Were there any missed alerts?
- How can we reduce detection time?
Response
- Was the runbook helpful?
- Were communication channels effective?
- Was role assignment clear?
- Was escalation needed?
Recovery
- Was recovery time acceptable?
- Did auto-recovery mechanisms work?
- Which parts required manual intervention?
System
- Which parts did not behave as expected?
- What monitoring needs improvement?
- What tasks need automation?
7. Progressive Adoption Strategy
7.1 Phase 1: Development Environment (1-2 weeks)
Goal: Build team chaos engineering capabilities
Activities:
1. Install chaos tools (Chaos Mesh or Litmus)
2. Run simple experiments
- Pod Kill in development environment
- Network latency injection
3. Experiment design methodology training
4. Write first experiment report
Success Metrics:
- Every team member has run at least 1 experiment
- Experiment design process documented
7.2 Phase 2: Staging Environment (2-4 weeks)
Goal: Systematic experiments in production-like environment
Activities:
1. Deploy chaos tools to staging
2. Systematic experiments on core services
- Inter-service communication failures
- Database connection pool exhaustion
- Full cache failure
3. Execute first Game Day
4. Integrate basic experiments into CI/CD pipeline
Success Metrics:
- At least 1 regular experiment per week
- 1 Game Day completed
- 3+ discovered vulnerabilities improved
7.3 Phase 3: Production Canary (4-8 weeks)
Goal: Begin safe production experiments
Activities:
1. Build production chaos experiment safety measures
- Configure auto-abort conditions
- Implement kill switch
- Limit experiment impact scope
2. Small-scale production experiments
- Single Pod Kill on 1% of total traffic
- Network latency on non-critical services
3. Enhance observability dashboards
Success Metrics:
- 5+ production experiments completed safely
- Zero customer impact
- Auto-abort mechanism verified at least once
7.4 Phase 4: Production Normalization (8+ weeks)
Goal: Establish chaos engineering as routine practice
Activities:
1. Automated periodic experiment execution
2. Mandatory chaos tests for new service deployments
3. Quarterly large-scale Game Days
4. AZ/region-level failure simulation
5. Chaos culture evangelization (company-wide training)
Success Metrics:
- 10+ automated experiments per month
- Quarterly Game Days established
- MTTR reduced by 50%
- Failure-related incidents reduced by 30%
8. Netflix Case Study
8.1 Chaos Kong: Region Evacuation Drill
Netflix uses an experiment called Chaos Kong to simulate full AWS region failure.
Chaos Kong Process:
1. Preparation Phase
- Verify traffic routing
- Confirm alternative region capacity
- Enhance monitoring
- Notify relevant teams in advance
2. Execution Phase
- Route all US-East-1 traffic to US-West-2
- Leverage DNS-based global load balancing
- Real-time metric monitoring during transition
3. Observation Points
- Traffic transition completion time
- Alternative region auto-scaling response time
- User experience impact (playback quality, start time)
- Data consistency
4. Recovery Phase
- Restore traffic to original region
- Verify data synchronization
- Full results analysis
8.2 Netflix Chaos Maturity Model
Level 0: No Chaos
- Manual response to failures
- Hope that "our system works fine"
Level 1: Basic Chaos
- Instance termination with Chaos Monkey
- Individual service resilience verification
Level 2: Systematic Chaos
- Diverse fault type experiments
- Regular Game Day execution
- Results-based improvements
Level 3: Advanced Chaos
- AZ failure simulation
- CI/CD integration
- Automated experiments
Level 4: Chaos Culture
- Region evacuation drill (Chaos Kong)
- All teams performing chaos experiments
- Resilience as core architecture decision criterion
9. Amazon Case Study
9.1 Amazon GameDay
Amazon has been running failure drills called GameDay since 2004.
Amazon GameDay Principles:
1. "Everything fails, all the time" - Werner Vogels
- Treat failure as a normal part of operations
2. Progressive complexity increase
- Single service -> service chain -> full system
3. Blameless culture
- Turn failure response failures into learning opportunities
4. Documentation
- Record all GameDay results in detail
5. Repetition
- Repeat same scenarios to verify improvements
9.2 AWS FIS Integration Example
{
  "description": "Multi-AZ resilience test",
  "targets": {
    "ec2Instances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "chaos-ready": "true"
      },
      "filters": [
        {
          "path": "Placement.AvailabilityZone",
          "values": ["us-east-1a"]
        }
      ],
      "selectionMode": "ALL"
    },
    "rdsInstances": {
      "resourceType": "aws:rds:db",
      "resourceTags": {
        "chaos-ready": "true"
      },
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "stopEC2": {
      "actionId": "aws:ec2:stop-instances",
      "targets": {
        "Instances": "ec2Instances"
      },
      "startAfter": []
    },
    "failoverRDS": {
      "actionId": "aws:rds:failover-db-cluster",
      "targets": {
        "Clusters": "rdsInstances"
      },
      "startAfter": ["stopEC2"]
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:region:account:alarm:HighErrorRate"
    }
  ]
}
10. CI/CD Integration
10.1 Pipeline Integration Strategy
# GitHub Actions: Chaos Test Integration
name: Chaos Testing Pipeline

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * 1-5'  # Weekdays at 2 AM

jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to Staging
        run: kubectl apply -f k8s/staging/

  chaos-test:
    needs: deploy-staging
    runs-on: ubuntu-latest
    steps:
      - name: Install Litmus
        run: |
          kubectl apply -f https://litmuschaos.github.io/litmus/3.0.0/litmus-3.0.0.yaml
      - name: Run Pod Delete Experiment
        run: |
          kubectl apply -f chaos/experiments/pod-delete.yaml
          sleep 120
          kubectl get chaosresult pod-delete-result -o jsonpath='{.status.experimentStatus.verdict}'
      - name: Run Network Delay Experiment
        run: |
          kubectl apply -f chaos/experiments/network-delay.yaml
          sleep 180
          kubectl get chaosresult network-delay-result -o jsonpath='{.status.experimentStatus.verdict}'
      - name: Verify System Recovery
        run: |
          ./scripts/verify-steady-state.sh
      - name: Collect Results
        if: always()
        run: |
          kubectl get chaosresults -o yaml > chaos-results.yaml
      - name: Upload Results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: chaos-results
          path: chaos-results.yaml
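The pipeline above reads each ChaosResult verdict but does not yet fail the build on it. A sketch of a gate that could do so, operating on the JSON shape of Litmus ChaosResult objects as returned by kubectl (the sample data below is fabricated):

```python
# Sketch: fail a CI job unless every Litmus ChaosResult verdict is "Pass".
# Input mirrors `kubectl get chaosresults -o json`; sample data is fabricated.
import json

def failed_experiments(chaosresults_json: str) -> list:
    """Return the names of all results whose verdict is not 'Pass'."""
    items = json.loads(chaosresults_json).get("items", [])
    return [r["metadata"]["name"] for r in items
            if r.get("status", {}).get("experimentStatus", {}).get("verdict") != "Pass"]

sample = json.dumps({"items": [
    {"metadata": {"name": "pod-delete-result"},
     "status": {"experimentStatus": {"verdict": "Pass"}}},
    {"metadata": {"name": "network-delay-result"},
     "status": {"experimentStatus": {"verdict": "Fail"}}},
]})

bad = failed_experiments(sample)
print(bad)  # ['network-delay-result']
# In CI, the script would exit non-zero when `bad` is non-empty.
```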
10.2 Canary Deployment with Chaos Integration
# Argo Rollouts + Chaos Integration
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause:
            duration: 5m
        # Run chaos experiment on canary
        - analysis:
            templates:
              - templateName: chaos-analysis
            args:
              - name: service-name
                value: my-service
        - setWeight: 50
        - pause:
            duration: 10m
        - setWeight: 100
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: chaos-analysis
spec:
  metrics:
    - name: success-rate-during-chaos
      interval: 30s
      successCondition: result[0] >= 99.5
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status="200",service="my-service"}[5m]))
            /
            sum(rate(http_requests_total{service="my-service"}[5m]))
            * 100
11. Observability and Chaos
11.1 Monitoring Dashboard
Metrics that must be monitored during chaos experiments.
Key Metrics (Golden Signals):
1. Latency
- p50, p90, p99 response times
- Inter-service call latency
2. Traffic
- Requests per second (RPS)
- Traffic distribution
3. Errors
- HTTP 5xx rate
- Application error rate
- Timeout rate
4. Saturation
- CPU utilization
- Memory utilization
- Network bandwidth
- Disk I/O
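Two of these signals can be computed directly from raw request records. A small self-contained sketch (field names and data are illustrative):

```python
import math

# Sketch: computing two golden signals (latency p99, error rate) from raw
# request records. The records and field layout are illustrative.

def percentile(samples, p):
    """Nearest-rank percentile (simple, no interpolation)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

requests = [  # (latency_ms, http_status)
    (120, 200), (95, 200), (110, 200), (600, 500), (130, 200),
    (105, 200), (98, 200), (115, 200), (102, 200), (125, 200),
]
latencies = [ms for ms, _ in requests]
errors = sum(1 for _, status in requests if status >= 500)

print(f"p99 latency: {percentile(latencies, 99)}ms")  # 600ms (the worst sample)
print(f"error rate:  {errors / len(requests):.1%}")   # 10.0%
```

Note how a single slow failed request dominates p99 while barely moving the mean, which is exactly why chaos dashboards track tail latency rather than averages.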
11.2 Grafana Dashboard Example
{
  "dashboard": {
    "title": "Chaos Engineering Dashboard",
    "panels": [
      {
        "title": "Request Success Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"2..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
          }
        ],
        "thresholds": {
          "steps": [
            { "color": "red", "value": 0 },
            { "color": "yellow", "value": 99 },
            { "color": "green", "value": 99.9 }
          ]
        }
      },
      {
        "title": "P99 Latency",
        "type": "timeseries",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
          }
        ]
      },
      {
        "title": "Active Chaos Experiments",
        "type": "stat",
        "targets": [
          {
            "expr": "count(chaos_experiment_status{phase=\"Running\"})"
          }
        ]
      }
    ]
  }
}
11.3 Alert Configuration
# Prometheus AlertManager Rules
groups:
  - name: chaos-safety-alerts
    rules:
      - alert: ChaosExperimentHighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[2m]))
          /
          sum(rate(http_requests_total[2m]))
          > 0.05
        for: 1m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "Error rate exceeds 5% during chaos experiment"
          description: "Immediately abort experiment and check status"
      - alert: ChaosExperimentHighLatency
        expr: |
          histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[2m]))
          > 2.0
        for: 1m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "P99 latency exceeds 2 seconds during chaos experiment"
12. Relationship with SRE
Chaos Engineering is one of the core practices of SRE (Site Reliability Engineering).
Relationship between SRE and Chaos Engineering:
SRE Goal: Operate reliable and scalable systems
|
+-- SLO/SLI/SLA Definition -> Success criteria for chaos experiments
|
+-- Error Budget -> Determines whether to run chaos experiments
| (Aggressive experiments only when budget is sufficient)
|
+-- Incident Management -> Source of Game Day scenarios
| (Reproduce past failures to verify improvements)
|
+-- Postmortems -> New chaos experiment ideas
| (Experiment on vulnerabilities found in failure analysis)
|
+-- Toil Elimination -> Chaos experiment automation
(Automate repetitive manual experiments)
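The error-budget gate can be made concrete with a small calculation. A sketch; the "run aggressive experiments only above 50% remaining budget" policy is an illustrative choice, not a standard:

```python
# Sketch: deciding whether the error budget permits aggressive chaos
# experiments. The 50%-remaining cutoff is an illustrative policy.

def error_budget_remaining(slo: float, observed_availability: float) -> float:
    """Fraction of the error budget still unspent (can go negative)."""
    budget = 1.0 - slo
    spent = 1.0 - observed_availability
    return (budget - spent) / budget

slo = 0.999        # 99.9% availability target
observed = 0.9996  # 99.96% actually delivered this period
remaining = error_budget_remaining(slo, observed)
print(f"{remaining:.0%} of the error budget remains")  # prints "60% ..."
print("run aggressive experiments" if remaining > 0.5 else "hold off")
```

When the budget is nearly spent, the same logic argues for pausing injection entirely, since real incidents are already consuming the reliability headroom experiments would need.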
13. Quiz
Test your understanding with these questions.
Q1: What is the first step in Chaos Engineering?
Answer: Establishing a Steady State Hypothesis
Every chaos experiment begins with defining the system's steady state and forming the hypothesis: "Even after injecting this failure, the system will maintain steady state." Injecting failures without a hypothesis is not chaos engineering -- it is just destruction.
Q2: What level of failure does Netflix's Chaos Kong simulate?
Answer: Full AWS region failure
Chaos Kong is Netflix's most extreme chaos experiment, simulating a Region Evacuation where all traffic from an entire AWS region is redirected to another region. This verifies that Netflix's service remains operational even during region-level failures.
Q3: Describe 3 strategies for minimizing blast radius.
Answer:
- Start small: Begin with a single instance and progressively expand to instance groups, AZs, and regions.
- Time limits: Explicitly set experiment duration to prevent prolonged impact.
- Auto-abort conditions: Automatically stop the experiment when key metrics exceed thresholds. For example, immediately abort if error rate exceeds 5%.
Additional important strategies include traffic limiting (applying experiments to only a portion of traffic) and having a rollback plan.
Q4: What is the role of the Game Master during a Game Day?
Answer:
The Game Master is the operator of the Game Day and performs the following roles:
- Scenario design: Design failure scenarios in advance (kept secret from participants).
- Failure injection: Use chaos tools to inject failures at planned moments.
- Experiment abort authority: Use the kill switch to immediately stop if situation becomes uncontrollable.
- Hint provision: Provide hints if the response team gets stuck.
- Time management: Manage the overall timeline.
Q5: List the 4 phases of progressive Chaos Engineering adoption in order.
Answer:
- Phase 1: Development Environment (1-2 weeks) - Tool installation, basic experiments, team training
- Phase 2: Staging Environment (2-4 weeks) - Systematic experiments, first Game Day, CI/CD integration
- Phase 3: Production Canary (4-8 weeks) - Small-scale production experiments, safety measures
- Phase 4: Production Normalization (8+ weeks) - Automated regular experiments, AZ/region failure simulation
The key principle is "slowly, safely, progressively" expanding scope.
References
- Principles of Chaos Engineering - principlesofchaos.org
- Chaos Engineering: System Resiliency in Practice - Casey Rosenthal, Nora Jones (O'Reilly)
- Netflix Tech Blog: Chaos Engineering - netflixtechblog.com
- Litmus Chaos Documentation - litmuschaos.io/docs
- Chaos Mesh Documentation - chaos-mesh.org/docs
- AWS Fault Injection Service - AWS Official Documentation
- Azure Chaos Studio - Microsoft Official Documentation
- Gremlin Documentation - gremlin.com/docs
- Google SRE Book - Chapter 17: Testing for Reliability - sre.google/sre-book
- The Practice of Cloud System Administration - Thomas Limoncelli et al.
- Awesome Chaos Engineering - GitHub Repository
- CNCF Chaos Engineering Whitepaper - cncf.io
- Learning Chaos Engineering - Russ Miles (O'Reilly)
- Chaos Engineering Adoption Guide - Gremlin Blog