Practical Guide to Chaos Engineering: Fault Injection and Resilience Validation in Kubernetes with Litmus and Chaos Mesh


Introduction

"Everything fails, all the time." This statement by AWS CTO Werner Vogels captures a fundamental truth about operating distributed systems. No matter how robustly a system is designed, unexpected failures can occur. The question is not whether failures will happen, but how quickly and safely the system recovers when they do.

Chaos Engineering is a methodology rooted in this philosophy that proactively validates system resilience by intentionally injecting faults in a controlled environment. Since Netflix introduced Chaos Monkey in the early 2010s, Chaos Engineering has established itself as a core engineering practice for ensuring the reliability of production systems. In Kubernetes environments in particular, various types of failures can occur -- Pod crashes, network latency, node outages, I/O errors, and more -- making it essential to experiment with these scenarios in advance and build response mechanisms for operational stability.

This post provides an in-depth comparison of the two most widely used open-source Chaos Engineering platforms for Kubernetes environments -- LitmusChaos and Chaos Mesh -- and comprehensively covers everything from designing fault injection experiments in real production environments, to GameDay operations, SLO-based steady state validation, and failure case studies with recovery procedures.

Chaos Engineering Principles

Chaos Engineering is not simply about "breaking systems." It is a systematic approach that follows the scientific experimental method. The core principles of Chaos Engineering, as established by Netflix, are as follows.

1. Build a Hypothesis Around Steady State

Before running an experiment, define the "steady state" of the system. This must be expressed as measurable business metrics or technical indicators. For example, "API response time at p99 is under 500ms," "error rate is below 0.1%," or "order throughput exceeds 100 requests per second" serve as steady state hypotheses.
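As a sketch, such a hypothesis can be encoded as Prometheus alerting rules so that breaches are detected automatically during an experiment. The metric names and thresholds below are illustrative assumptions, not values from a specific system:

```yaml
# steady-state-rules.yaml -- illustrative steady state hypothesis as Prometheus rules
groups:
  - name: steady-state-hypothesis
    rules:
      - alert: SteadyStateP99LatencyBreached
        # Hypothesis: p99 API response time stays under 500ms
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
        for: 1m
        labels:
          severity: chaos-abort
        annotations:
          summary: "p99 latency above 500ms during chaos experiment"
      - alert: SteadyStateErrorRateBreached
        # Hypothesis: error rate stays below 0.1%
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.001
        for: 1m
        labels:
          severity: chaos-abort
```

Firing alerts with a `chaos-abort` label can then double as abort signals for the experiment.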

2. Vary Real-World Events

The faults injected in experiments must reflect scenarios that can actually occur. Typical examples include Pod crashes, network partitions, disk I/O latency, CPU overload, and DNS failures.

3. Run Experiments in Production

To most accurately validate actual system behavior, experiments should be run in the production environment. However, this should only be done after sufficient safeguards (abort conditions, blast radius limitations) are in place.

4. Automate Experiments to Run Continuously

Chaos Engineering is not a one-time event. The goal is to integrate experiments into CI/CD pipelines for continuous execution.
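A minimal pipeline sketch: the GitHub Actions workflow below applies a chaos manifest on a schedule and waits for the verdict. The file paths, resource names, and the verdict JSONPath are assumptions for illustration; adjust them to your own manifests and Litmus version:

```yaml
# .github/workflows/chaos.yaml -- illustrative nightly chaos run (names/paths assumed)
name: nightly-chaos
on:
  schedule:
    - cron: '0 2 * * 1-5'   # weekday nights, outside business hours
jobs:
  pod-kill:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumes cluster credentials are already configured for kubectl
      - name: Apply chaos experiment
        run: kubectl apply -f chaos/pod-kill.yaml
      - name: Wait for experiment verdict
        run: |
          kubectl wait --for=jsonpath='{.status.experiments[0].verdict}'=Pass \
            chaosengine/pod-delete-chaos -n production --timeout=10m
```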

5. Minimize Blast Radius

Limit the scope of experimental impact to the absolute minimum, and ensure mechanisms are in place to immediately halt experiments when anomalies occur.

Litmus vs Chaos Mesh Comparison

The following table compares the two platforms across various dimensions to help organizations choose the right tool for their requirements.

| Category | LitmusChaos | Chaos Mesh |
|---|---|---|
| CNCF Status | Incubating (promoted in 2022) | Incubating (promoted in 2022) |
| Developed By | ChaosNative (acquired by Harness) | PingCAP / Community |
| Supported Envs | Kubernetes, VM, Cloud, Bare Metal | Kubernetes only |
| Installation | Helm / kubectl / Operator | Helm / kubectl |
| Web UI | ChaosCenter (full dashboard) | Chaos Dashboard |
| Experiment Def. | ChaosExperiment + ChaosEngine CRD | Per-chaos-type CRDs (PodChaos, NetworkChaos, etc.) |
| Experiment Hub | ChaosHub (community-shared experiments) | Built-in experiments + custom |
| SLO Integration | Probe-based steady state validation | StatusCheck + Grafana integration |
| RBAC | Native support (project/team level) | Kubernetes RBAC integration |
| Scheduling | CronWorkflow-based | Schedule CRD-based |
| Learning Curve | Moderate (concepts can be complex) | Low (intuitive CRD structure) |
| Community Size | GitHub Stars 4.4k+ | GitHub Stars 6.5k+ |
| CI/CD Integration | API/SDK, GitHub Actions support | API, CLI-based |
| Commercial Support | Harness Chaos Engineering | Community-based |

Selection Criteria Summary: If you need comprehensive Chaos Engineering that extends beyond Kubernetes, or if systematic experiment management and ChaosHub-based experiment sharing are important, LitmusChaos is the better fit. If you are working exclusively in Kubernetes and want to get started quickly, or prefer intuitive CRD-based fault injection, Chaos Mesh is an excellent choice.

Installation Guide

LitmusChaos Installation

LitmusChaos consists of ChaosCenter (management dashboard) and Chaos Infrastructure (experiment execution agent).

# Create LitmusChaos namespace
kubectl create namespace litmus

# Add Helm repository
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

# Install ChaosCenter (includes MongoDB, Auth Server, Frontend, Backend)
helm upgrade --install litmus litmuschaos/litmus \
  --namespace litmus \
  --set portal.server.service.type=ClusterIP \
  --set portal.frontend.service.type=ClusterIP \
  --set mongodb.persistence.enabled=true \
  --set mongodb.persistence.storageClass=standard \
  --set mongodb.persistence.size=20Gi

# Verify installation
kubectl get pods -n litmus

# Access ChaosCenter UI (port forwarding)
kubectl port-forward svc/litmus-frontend-service -n litmus 9091:9091

# Default admin credentials: admin / litmus
# Be sure to change the password after first login

Chaos Mesh Installation

Chaos Mesh consists of Controller Manager, Chaos Daemon, and Dashboard.

# Create Chaos Mesh namespace
kubectl create namespace chaos-mesh

# Add Helm repository
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update

# Install Chaos Mesh
helm upgrade --install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
  --set dashboard.securityMode=true \
  --set dashboard.create=true \
  --version 2.7.0

# Verify installation
kubectl get pods -n chaos-mesh

# Verify CRDs
kubectl get crd | grep chaos-mesh

# Access Dashboard (port forwarding)
kubectl port-forward svc/chaos-dashboard -n chaos-mesh 2333:2333

An important note: chaosDaemon.runtime and chaosDaemon.socketPath in Chaos Mesh must be configured to match the container runtime of your cluster. The socket paths differ for containerd, CRI-O, and Docker, so you must verify your cluster's runtime environment first.
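For reference, these are the commonly used runtime/socket pairings; verify the actual socket path on your nodes before installing, since distributions can relocate it:

```yaml
# Helm values per container runtime -- confirm the socket path on your own nodes
# containerd (default on most recent clusters):
chaosDaemon:
  runtime: containerd
  socketPath: /run/containerd/containerd.sock
# CRI-O:
#   runtime: crio
#   socketPath: /var/run/crio/crio.sock
# Docker:
#   runtime: docker
#   socketPath: /var/run/docker.sock
```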

ChaosExperiment Design

Pod Kill Experiment

This is the most fundamental chaos experiment, which forcefully terminates specific Pods to validate the application's self-healing capabilities.

Litmus - Pod Delete Experiment:

# litmus-pod-delete.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-chaos
  namespace: production
spec:
  engineState: active
  annotationCheck: 'false'
  appinfo:
    appns: production
    applabel: app=payment-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Experiment duration (seconds)
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            # Pod deletion interval (seconds)
            - name: CHAOS_INTERVAL
              value: '10'
            # Force delete
            - name: FORCE
              value: 'false'
            # Number of Pods to delete
            - name: PODS_AFFECTED_PERC
              value: '50'
            # Execution order (parallel or serial)
            - name: SEQUENCE
              value: 'parallel'
        probe:
          - name: payment-health-check
            type: httpProbe
            mode: Continuous
            httpProbe/inputs:
              url: 'http://payment-service.production.svc:8080/health'
              method:
                get:
                  criteria: ==
                  responseCode: '200'
            runProperties:
              probeTimeout: 5s
              interval: 5s
              retry: 3
              probePollingInterval: 2s

Chaos Mesh - PodChaos Experiment:

# chaos-mesh-pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-payment
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: fixed-percent
  value: '50'
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  # gracePeriod: 0 means force kill
  gracePeriod: 30
  duration: '60s'
---
# Scheduled experiment (every Wednesday at 10:00 AM)
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: scheduled-pod-kill
  namespace: chaos-mesh
spec:
  schedule: '0 10 * * 3'
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - production
      labelSelectors:
        app: payment-service
    gracePeriod: 30
    duration: '30s'

Network Latency Experiment

Inject network latency to verify that timeout configurations and Circuit Breaker patterns between microservices are functioning correctly.

Chaos Mesh - NetworkChaos (Latency Injection):

# chaos-mesh-network-delay.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-between-services
  namespace: chaos-mesh
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: order-service
  delay:
    latency: '200ms'
    jitter: '50ms'
    correlation: '25'
  direction: to
  target:
    selector:
      namespaces:
        - production
      labelSelectors:
        app: payment-service
    mode: all
  duration: '5m'
---
# Network packet loss experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-loss-experiment
  namespace: chaos-mesh
spec:
  action: loss
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-gateway
  loss:
    loss: '30'
    correlation: '25'
  direction: both
  duration: '3m'

IO Fault Experiment

Simulate disk I/O faults to validate the resilience of databases, caches, and logging systems.

Chaos Mesh - IOChaos (I/O Fault Injection):

# chaos-mesh-io-fault.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-latency-database
  namespace: chaos-mesh
spec:
  action: latency
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: postgres
  volumePath: /var/lib/postgresql/data
  path: '**/*'
  delay: '100ms'
  percent: 50
  duration: '5m'
---
# I/O error injection (return errors on read operations)
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-error-experiment
  namespace: chaos-mesh
spec:
  action: fault
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: postgres
  volumePath: /var/lib/postgresql/data
  path: '**/*.dat'
  errno: 5
  percent: 30
  duration: '3m'

CPU Stress Experiment

Simulate CPU overload conditions to validate HPA (Horizontal Pod Autoscaler) scale-out behavior and service quality degradation responses.

Chaos Mesh - StressChaos (CPU/Memory Stress):

# chaos-mesh-stress.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-api
  namespace: chaos-mesh
spec:
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  stressors:
    cpu:
      workers: 4
      load: 80
    memory:
      workers: 2
      size: '512MB'
  duration: '5m'
  containerNames:
    - api-server
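For the scale-out validation this experiment targets, the stressed Deployment needs an HPA in place; the stress generated inside the container's cgroup counts toward Pod CPU usage and should push utilization past the target. A minimal sketch (resource names and thresholds are assumptions):

```yaml
# hpa-api-server.yaml -- illustrative HPA paired with the StressChaos experiment
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```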

GameDay Process

A GameDay is an organizational event for conducting Chaos Engineering experiments. Beyond simple technical testing, the core purpose is to validate the team's incident response processes and communication in a realistic setting.

Phase 1: Preparation (1-2 Weeks Before)

  • Define Scope: Finalize the target services, blast radius, and experiment types.
  • Establish Hypotheses: Define steady state hypotheses based on SLOs. For example, formalize them as: "Even if 50% of order service Pods are terminated, p99 response time should remain under 1 second and error rate should stay below 1%."
  • Prepare Observability Tools: Pre-configure dashboards (Grafana), log systems (ELK/Loki), and tracing systems (Jaeger/Tempo) for monitoring during the experiment.
  • Rollback Plan: Document emergency abort procedures and recovery steps for unexpected situations during the experiment.
  • Notify Stakeholders: Share the experiment schedule and expected blast radius with relevant teams (operations, business, customer support) in advance.

Phase 2: Experiment Execution (Day Of)

  • Pre-Flight Check: Verify that the current system state is healthy before the experiment. Never run chaos experiments while an existing incident is in progress.
  • Fault Injection: Execute the pre-defined ChaosExperiments sequentially. Start with lightweight experiments and progressively increase intensity.
  • Real-Time Observation: Monitor steady state hypothesis metrics in real time and immediately record any anomalies.
  • Abort Criteria Evaluation: Immediately halt the experiment if pre-defined abort conditions are reached (e.g., error rate exceeds 5%, response time exceeds 5 seconds).

Phase 3: Post-Analysis (Within 1 Week After the Experiment)

  • Document Results: Record each experiment's hypothesis, actual results, and observed anomalies in detail.
  • Identify Improvements: When a hypothesis fails (i.e., the system did not recover as expected), assign specific improvement items with owners.
  • Track Action Items: Continuously track the implementation status of identified improvements and re-validate them in the next GameDay.

SLO-Based Steady State Hypothesis Validation

The success or failure of a Chaos Engineering experiment depends on the definition and validation of the Steady State Hypothesis. Defining steady state based on SLOs (Service Level Objectives) enables deriving experiment results that are meaningful from a business perspective.

SLO Definition Examples

| Service | SLI (Indicator) | SLO (Target) | Measurement Method |
|---|---|---|---|
| API Gateway | Availability | 99.9% (monthly) | Prometheus `up` metric |
| Order Service | Response time p99 | Under 500ms | `http_request_duration_seconds` histogram |
| Payment Service | Error rate | Below 0.1% | `http_requests_total{status=~"5.."}` / total requests |
| Search Service | Throughput | Over 1000 requests/sec | `rate(http_requests_total[5m])` |
| Database | Query latency p95 | Under 100ms | `pg_stat_statements` |

Validation with LitmusChaos Probes

LitmusChaos automatically validates steady state hypotheses during experiment execution through its Probe mechanism. It supports various probe types including HTTP Probe, CMD Probe, Prometheus Probe, and Kubernetes Probe. If experiment results fail to meet the hypothesis, the experiment is marked as "Failed."

HTTP Probe can verify an endpoint's response code and response time at three stages: before the experiment (SOT), during the experiment (Continuous), and after the experiment (EOT). Prometheus Probe executes Prometheus queries to verify that SLI metrics do not breach SLO thresholds. This enables objective determination of experiment success or failure.
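For example, a Prometheus Probe checking the error-rate SLO could be attached to the experiment spec like this (a sketch following the ChaosEngine probe schema; the Prometheus endpoint and query are assumptions):

```yaml
# Probe excerpt for a ChaosEngine experiment spec -- endpoint/query are illustrative
probe:
  - name: error-rate-slo-check
    type: promProbe
    mode: Continuous
    promProbe/inputs:
      endpoint: http://prometheus.monitoring.svc:9090
      query: sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m]))
      comparator:
        criteria: '<'      # experiment passes only while error rate stays below SLO
        value: '0.001'
    runProperties:
      probeTimeout: 10s
      interval: 10s
      retry: 2
```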

Validation with Chaos Mesh StatusCheck

In Chaos Mesh, steady state validation can be automated by combining StatusCheck CRDs with Workflows. HTTP requests or custom scripts are executed before and after experiment runs to verify system state, and Grafana dashboard integration provides visual tracking of SLO metric changes.
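A StatusCheck sketch for continuous health verification might look like the following; the URL and thresholds are assumptions, and field names should be checked against your Chaos Mesh version:

```yaml
# statuscheck-order-service.yaml -- illustrative continuous HTTP health check
apiVersion: chaos-mesh.org/v1alpha1
kind: StatusCheck
metadata:
  name: order-service-health
  namespace: chaos-mesh
spec:
  mode: Continuous
  type: HTTP
  intervalSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3    # three consecutive failures mark the check as failed
  successThreshold: 1
  http:
    url: http://order-service.production.svc:8080/health
    method: GET
    criteria:
      statusCode: '200'
```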

Failure Case Studies and Recovery Procedures

Case 1: Chaos Experiment Fails to Clean Up

Symptom: Network latency persists even after the NetworkChaos experiment's duration has expired, causing ongoing impact to production services.

Root Cause: The Chaos Daemon was OOM-killed or the node restarted, preventing cleanup of tc (traffic control) rules. Alternatively, the Chaos Mesh Controller Manager failed to process the deletion event.

Recovery Procedure:

# 1. Force delete the in-progress Chaos resource
kubectl delete networkchaos network-delay-between-services -n chaos-mesh

# 2. Verify the Chaos Daemon is running normally
kubectl get pods -n chaos-mesh -l app.kubernetes.io/component=chaos-daemon

# 3. Restart the Chaos Daemon if it is unhealthy
kubectl rollout restart daemonset chaos-daemon -n chaos-mesh

# 4. Manually clean up tc rules (on the affected node)
# After SSH-ing into the node
tc qdisc show dev eth0
tc qdisc del dev eth0 root netem 2>/dev/null || true

# 5. Clean up iptables rules (for network partition experiments)
iptables -L -n | grep CHAOS
iptables -F CHAOS-INPUT 2>/dev/null || true
iptables -F CHAOS-OUTPUT 2>/dev/null || true

Preventive Measures: Set sufficient resource requests/limits for the Chaos Daemon, and configure monitoring alerts to verify automatic cleanup after experiments. In production environments, always set the duration field to short values and increase gradually.

Case 2: Chaos Experiment Affects Unintended Pods

Symptom: Due to label selector errors, the chaos experiment injects faults into Pods belonging to services other than the intended target, causing cascading service failures.

Root Cause: The label selector was configured too broadly, or the namespace was omitted, causing Pods across the entire cluster to be included as targets.

Recovery Procedure:

  1. Immediately delete the Chaos resource to halt the experiment.
  2. Check the status of affected Pods and perform a rollout restart of Deployments if necessary.
  3. Modify the experiment's selectors to target only the correct Pods.

Preventive Measures: Before running experiments, execute a --dry-run or a script that pre-validates the target Pod list. Always explicitly specify namespaces, and enable annotationCheck to ensure only Pods with chaos-target annotations are selected.
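The opt-in guard works in two halves, sketched below: the ChaosEngine enables annotationCheck, and only workloads carrying the Litmus chaos annotation become eligible targets (Deployment name is illustrative):

```yaml
# ChaosEngine side: refuse to run unless the target has explicitly opted in
spec:
  annotationCheck: 'true'
---
# Target workload side: explicit opt-in via annotation
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: production
  annotations:
    litmuschaos.io/chaos: "true"
```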

Case 3: StressChaos Destabilizes an Entire Node

Symptom: A CPU/memory stress experiment affects not only the target Pod but also other Pods on the same node, causing the entire node to enter a NotReady state.

Root Cause: The stress experiment ran without cgroup limits, causing node-level resource exhaustion. Alternatively, the target Pod had no resource limits set, allowing the stress to propagate across the entire node.

Recovery Procedure:

  1. Immediately delete the Chaos resource.
  2. If the node is in a NotReady state, restart kubelet or drain the node and reboot it.
  3. Verify the recovery status of affected workloads.

Preventive Measures: Always set resource limits on Pods targeted by StressChaos experiments. Use the containerNames field to inject stress into specific containers only, and start with low load levels (CPU load 20-30%) before gradually increasing.
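A container resources block like the following keeps the injected stress confined to the Pod's own cgroup instead of exhausting the node (values and image are illustrative):

```yaml
# Excerpt from the target Deployment's Pod template -- values are illustrative
containers:
  - name: api-server
    image: example/api-server:1.0   # placeholder image
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: "1"       # stress cannot exceed 1 core within this container
        memory: 1Gi
```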

Case 4: Data Integrity Issues After IOChaos Experiment

Symptom: After an I/O fault injection experiment, database WAL (Write-Ahead Log) files become corrupted, preventing the database from starting normally.

Root Cause: I/O error injection (action: fault) was applied to a critical database path, causing transaction logs to be written incompletely.

Recovery Procedure:

  1. Delete the database Pod and restart from the PVC to attempt automatic recovery.
  2. Use WAL recovery tools to restore the corrupted logs.
  3. If automatic recovery is not possible, restore data from the most recent backup.

Preventive Measures: When targeting databases with IOChaos experiments, always inject faults only on read-only paths. Use the path field to exclude WAL directories, and set the percent value low (10-20%) to simulate only partial I/O failures. Always verify that a recent database backup exists before running the experiment.

Case 5: LitmusChaos ChaosEngine Fails to Complete

Symptom: The ChaosEngine resource status remains stuck at Running and the experiment appears to run indefinitely.

Root Cause: The Chaos Runner Pod entered a crash loop, or the experiment execution Pod failed to pull its image, preventing the experiment from starting.

Recovery Procedure:

  1. Check the ChaosEngine status and related Pods: kubectl describe chaosengine <name> -n <namespace>
  2. Examine the logs of the Chaos Runner and experiment Pods.
  3. Change the ChaosEngine's engineState to stop to halt the experiment.
  4. After resolving the issue, create a new ChaosEngine to retry.

Preventive Measures: Pre-pull the container images used in experiments onto nodes, and set timeouts on ChaosEngines so they automatically abort if not completed within the specified time.

Operational Checklist

Experiment Design Checklist

  • Is the Steady State Hypothesis defined with measurable metrics?
  • Is the experiment's blast radius clearly bounded?
  • Have label selectors and namespaces been pre-validated to target only the intended Pods?
  • Are abort criteria defined in advance?
  • Are rollback/recovery procedures documented and understood by team members?

Production Deployment Checklist

  • Has the experiment been thoroughly tested in a staging environment before applying to production?
  • Has the current system state been verified as healthy before starting the experiment?
  • Is real-time monitoring (Grafana dashboards, alerts) configured for the experiment duration?
  • Is the experiment scheduled to avoid business-critical time windows (payment peaks, promotional events)?
  • Have the experiment schedule and blast radius been shared with relevant teams in advance?
  • Are experiment execution permissions appropriately restricted via Chaos RBAC?

During Experiment Checklist

  • Are all SLI metrics being monitored in real time?
  • Is the team prepared to immediately halt the experiment if abort conditions are reached?
  • Are experiment results (success/failure) being recorded?
  • Are unexpected side effects being reported immediately upon discovery?

Post-Experiment Checklist

  • Has it been confirmed that all injected faults have been fully cleaned up?
  • Has it been verified that system-level changes such as tc rules and iptables rules have been reverted?
  • Have experiment results been documented and improvement items identified?
  • Have owners and deadlines been assigned to identified improvement items?
  • Has the next GameDay schedule and re-validation plan been established?
