# Practical Guide to Chaos Engineering: Fault Injection and Resilience Validation in Kubernetes with Litmus and Chaos Mesh
- Introduction
- Chaos Engineering Principles
- Litmus vs Chaos Mesh Comparison
- Installation Guide
- ChaosExperiment Design
- GameDay Process
- SLO-Based Steady State Hypothesis Validation
- Failure Case Studies and Recovery Procedures
- Operational Checklist
- References

## Introduction
"Everything fails, all the time." This statement by Amazon CTO Werner Vogels captures a fundamental truth about operating distributed systems. No matter how robustly a system is designed, unexpected failures can occur. The question is not whether failures will happen, but how quickly and safely the system recovers when they do.
Chaos Engineering is a methodology rooted in this philosophy that proactively validates system resilience by intentionally injecting faults in a controlled environment. Since Netflix introduced Chaos Monkey in the early 2010s, Chaos Engineering has established itself as a core engineering practice for ensuring the reliability of production systems. In Kubernetes environments in particular, various types of failures can occur -- Pod crashes, network latency, node outages, I/O errors, and more -- making it essential to experiment with these scenarios in advance and build response mechanisms for operational stability.
This post provides an in-depth comparison of the two most widely used open-source Chaos Engineering platforms for Kubernetes environments -- LitmusChaos and Chaos Mesh -- and comprehensively covers everything from designing fault injection experiments in real production environments, to GameDay operations, SLO-based steady state validation, and failure case studies with recovery procedures.
## Chaos Engineering Principles
Chaos Engineering is not simply about "breaking systems." It is a systematic approach that follows the scientific experimental method. The core principles of Chaos Engineering, as established by Netflix, are as follows.
### 1. Build a Hypothesis Around Steady State
Before running an experiment, define the "steady state" of the system. This must be expressed as measurable business metrics or technical indicators. For example, "API response time at p99 is under 500ms," "error rate is below 0.1%," or "order throughput exceeds 100 requests per second" serve as steady state hypotheses.
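A steady-state target translates directly into an error budget, which is what abort criteria are usually derived from. As a quick illustration (my own arithmetic, with example numbers, not tied to any platform), here is the monthly downtime budget implied by a 99.9% availability SLO:

```bash
# Monthly error budget implied by an availability SLO.
# A 30-day month has 43,200 minutes; 99.9% availability leaves 0.1% of that.
slo=99.9
total_minutes=$((30 * 24 * 60))   # 43200
budget=$(awk -v slo="$slo" -v total="$total_minutes" \
  'BEGIN { printf "%.1f", total * (100 - slo) / 100 }')
echo "error budget: ${budget} minutes/month"   # -> error budget: 43.2 minutes/month
```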
### 2. Vary Real-World Events
The faults injected in experiments must reflect scenarios that can actually occur. Typical examples include Pod crashes, network partitions, disk I/O latency, CPU overload, and DNS failures.
### 3. Run Experiments in Production
To most accurately validate actual system behavior, experiments should be run in the production environment. However, this should only be done after sufficient safeguards (abort conditions, blast radius limitations) are in place.
### 4. Automate Experiments to Run Continuously
Chaos Engineering is not a one-time event. The goal is to integrate experiments into CI/CD pipelines for continuous execution.
### 5. Minimize Blast Radius
Limit the scope of experimental impact to the absolute minimum, and ensure mechanisms are in place to immediately halt experiments when anomalies occur.
## Litmus vs Chaos Mesh Comparison
The following table compares the two platforms across various dimensions to help organizations choose the right tool for their requirements.
| Category | LitmusChaos | Chaos Mesh |
|---|---|---|
| CNCF Status | Incubating (promoted in 2022) | Incubating (promoted in 2022) |
| Developed By | ChaosNative (acquired by Harness) | PingCAP / Community |
| Supported Envs | Kubernetes, VM, Cloud, Bare Metal | Kubernetes only |
| Installation | Helm / kubectl / Operator | Helm / kubectl |
| Web UI | ChaosCenter (full dashboard) | Chaos Dashboard |
| Experiment Def. | ChaosExperiment + ChaosEngine CRD | Per-chaos-type CRDs (PodChaos, NetworkChaos, etc.) |
| Experiment Hub | ChaosHub (community-shared experiments) | Built-in experiments + custom |
| SLO Integration | Probe-based steady state validation | StatusCheck + Grafana integration |
| RBAC | Native support (project/team level) | Kubernetes RBAC integration |
| Scheduling | CronWorkflow-based | Schedule CRD-based |
| Learning Curve | Moderate (concepts can be complex) | Low (intuitive CRD structure) |
| Community Size | GitHub Stars 4.4k+ | GitHub Stars 6.5k+ |
| CI/CD Integration | API/SDK, GitHub Actions support | API, CLI-based |
| Commercial Support | Harness Chaos Engineering | Community-based |
Selection Criteria Summary: If you need comprehensive Chaos Engineering that extends beyond Kubernetes, or if systematic experiment management and ChaosHub-based experiment sharing are important, LitmusChaos is the better fit. If you are working exclusively in Kubernetes and want to get started quickly, or prefer intuitive CRD-based fault injection, Chaos Mesh is an excellent choice.
## Installation Guide

### LitmusChaos Installation

LitmusChaos consists of ChaosCenter (the management dashboard) and Chaos Infrastructure (the experiment execution agents).

```bash
# Create the LitmusChaos namespace
kubectl create namespace litmus

# Add the Helm repository
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

# Install ChaosCenter (includes MongoDB, Auth Server, Frontend, Backend)
helm upgrade --install litmus litmuschaos/litmus \
  --namespace litmus \
  --set portal.server.service.type=ClusterIP \
  --set portal.frontend.service.type=ClusterIP \
  --set mongodb.persistence.enabled=true \
  --set mongodb.persistence.storageClass=standard \
  --set mongodb.persistence.size=20Gi

# Verify the installation
kubectl get pods -n litmus

# Access the ChaosCenter UI (port forwarding)
kubectl port-forward svc/litmus-frontend-service -n litmus 9091:9091

# Default admin credentials: admin / litmus
# Be sure to change the password after first login
```
### Chaos Mesh Installation

Chaos Mesh consists of a Controller Manager, the Chaos Daemon, and a Dashboard.

```bash
# Create the Chaos Mesh namespace
kubectl create namespace chaos-mesh

# Add the Helm repository
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update

# Install Chaos Mesh
helm upgrade --install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
  --set dashboard.securityMode=true \
  --set dashboard.create=true \
  --version 2.7.0

# Verify the installation
kubectl get pods -n chaos-mesh

# Verify the CRDs
kubectl get crd | grep chaos-mesh

# Access the Dashboard (port forwarding)
kubectl port-forward svc/chaos-dashboard -n chaos-mesh 2333:2333
```
One important caveat: `chaosDaemon.runtime` and `chaosDaemon.socketPath` must match the container runtime of your cluster. The socket paths differ between containerd, CRI-O, and Docker, so verify the runtime on your nodes before installing.
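Checking the runtime and picking the flags can be scripted. The sketch below is a hypothetical helper: the version string would come from `kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.containerRuntimeVersion}'`, and the socket paths are the common defaults, so verify them on your own nodes.

```bash
# Map a node's containerRuntimeVersion string to chaosDaemon Helm values.
# Socket paths are common defaults -- confirm them on your nodes.
runtime_version="containerd://1.7.12"   # stand-in value; query your own cluster
case "${runtime_version%%:*}" in
  containerd) echo "--set chaosDaemon.runtime=containerd --set chaosDaemon.socketPath=/run/containerd/containerd.sock" ;;
  cri-o)      echo "--set chaosDaemon.runtime=crio --set chaosDaemon.socketPath=/var/run/crio/crio.sock" ;;
  docker)     echo "--set chaosDaemon.runtime=docker --set chaosDaemon.socketPath=/var/run/docker.sock" ;;
  *)          echo "unknown runtime: ${runtime_version}" >&2; exit 1 ;;
esac
```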
## ChaosExperiment Design

### Pod Kill Experiment
This is the most fundamental chaos experiment, which forcefully terminates specific Pods to validate the application's self-healing capabilities.
Litmus - Pod Delete Experiment:
```yaml
# litmus-pod-delete.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-chaos
  namespace: production
spec:
  engineState: active
  annotationCheck: 'false'
  appinfo:
    appns: production
    applabel: app=payment-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Experiment duration (seconds)
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            # Pod deletion interval (seconds)
            - name: CHAOS_INTERVAL
              value: '10'
            # Force delete
            - name: FORCE
              value: 'false'
            # Percentage of target Pods to delete
            - name: PODS_AFFECTED_PERC
              value: '50'
            # Execution order (parallel or serial)
            - name: SEQUENCE
              value: 'parallel'
        probe:
          - name: payment-health-check
            type: httpProbe
            mode: Continuous
            httpProbe/inputs:
              url: 'http://payment-service.production.svc:8080/health'
              method:
                get:
                  criteria: ==
                  responseCode: '200'
            runProperties:
              probeTimeout: 5s
              interval: 5s
              retry: 3
              probePollingInterval: 2s
```
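With `PODS_AFFECTED_PERC: '50'`, the actual blast radius depends on the deployment's replica count. A quick back-of-the-envelope check before activating the engine (example numbers, not tied to any real deployment):

```bash
# Blast-radius sanity check: how many pods does PODS_AFFECTED_PERC touch,
# and do enough replicas survive to keep serving traffic?
replicas=6          # e.g. from: kubectl get deploy payment-service -o jsonpath='{.spec.replicas}'
affected_perc=50    # PODS_AFFECTED_PERC value
affected=$(( replicas * affected_perc / 100 ))
surviving=$(( replicas - affected ))
echo "affected=${affected} surviving=${surviving}"   # -> affected=3 surviving=3
```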
Chaos Mesh - PodChaos Experiment:
```yaml
# chaos-mesh-pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-payment
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: fixed-percent
  value: '50'
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  # gracePeriod: 0 means force kill
  gracePeriod: 30
  duration: '60s'
---
# Scheduled experiment (every Wednesday at 10:00 AM)
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: scheduled-pod-kill
  namespace: chaos-mesh
spec:
  schedule: '0 10 * * 3'
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - production
      labelSelectors:
        app: payment-service
    gracePeriod: 30
    duration: '30s'
```
### Network Latency Experiment
Inject network latency to verify that timeout configurations and Circuit Breaker patterns between microservices are functioning correctly.
Chaos Mesh - NetworkChaos (Latency Injection):
```yaml
# chaos-mesh-network-delay.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-between-services
  namespace: chaos-mesh
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: order-service
  delay:
    latency: '200ms'
    jitter: '50ms'
    correlation: '25'
  direction: to
  target:
    selector:
      namespaces:
        - production
      labelSelectors:
        app: payment-service
    mode: all
  duration: '5m'
---
# Network packet loss experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-loss-experiment
  namespace: chaos-mesh
spec:
  action: loss
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-gateway
  loss:
    loss: '30'
    correlation: '25'
  direction: both
  duration: '3m'
```
### I/O Fault Experiment
Simulate disk I/O faults to validate the resilience of databases, caches, and logging systems.
Chaos Mesh - IOChaos (I/O Fault Injection):
```yaml
# chaos-mesh-io-fault.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-latency-database
  namespace: chaos-mesh
spec:
  action: latency
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: postgres
  volumePath: /var/lib/postgresql/data
  path: '**/*'
  delay: '100ms'
  percent: 50
  duration: '5m'
---
# I/O error injection (return errors on read operations)
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-error-experiment
  namespace: chaos-mesh
spec:
  action: fault
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: postgres
  volumePath: /var/lib/postgresql/data
  path: '**/*.dat'
  errno: 5
  percent: 30
  duration: '3m'
```
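The `errno: 5` above makes the affected operations fail with EIO ("Input/output error"), the same error a failing disk would surface. You can confirm the numeric mapping on any machine with Python installed:

```bash
# errno 5 is EIO; print the number and its message string.
python3 -c "import errno, os; print(errno.EIO, os.strerror(errno.EIO))"   # -> 5 Input/output error
```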
### CPU Stress Experiment
Simulate CPU overload conditions to validate HPA (Horizontal Pod Autoscaler) scale-out behavior and service quality degradation responses.
Chaos Mesh - StressChaos (CPU/Memory Stress):
```yaml
# chaos-mesh-stress.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-api
  namespace: chaos-mesh
spec:
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  stressors:
    cpu:
      workers: 4
      load: 80
    memory:
      workers: 2
      size: '512MB'
  duration: '5m'
  containerNames:
    - api-server
```
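Whether the stress actually triggers a scale-out can be predicted with the formula the HPA controller itself uses: desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization). With example numbers (hypothetical utilization observed under stress):

```bash
# HPA scale-out prediction during the stress experiment (example numbers).
current_replicas=3
current_util=80     # observed CPU utilization (%) under stress
target_util=50      # HPA target utilization (%)
desired=$(awk -v r="$current_replicas" -v c="$current_util" -v t="$target_util" \
  'BEGIN { d = r * c / t; if (d > int(d)) d = int(d) + 1; print d }')
echo "desired replicas: ${desired}"   # -> desired replicas: 5
```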
## GameDay Process
A GameDay is an organizational event for conducting Chaos Engineering experiments. Beyond simple technical testing, the core purpose is to validate the team's incident response processes and communication in a realistic setting.
### Phase 1: Preparation (1-2 Weeks Before)
- Define Scope: Finalize the target services, blast radius, and experiment types.
- Establish Hypotheses: Define steady state hypotheses based on SLOs. For example, formalize them as: "Even if 50% of order service Pods are terminated, p99 response time should remain under 1 second and error rate should stay below 1%."
- Prepare Observability Tools: Pre-configure dashboards (Grafana), log systems (ELK/Loki), and tracing systems (Jaeger/Tempo) for monitoring during the experiment.
- Rollback Plan: Document emergency abort procedures and recovery steps for unexpected situations during the experiment.
- Notify Stakeholders: Share the experiment schedule and expected blast radius with relevant teams (operations, business, customer support) in advance.
### Phase 2: Experiment Execution (Day Of)
- Pre-Flight Check: Verify that the current system state is healthy before the experiment. Never run chaos experiments while an existing incident is in progress.
- Fault Injection: Execute the pre-defined ChaosExperiments sequentially. Start with lightweight experiments and progressively increase intensity.
- Real-Time Observation: Monitor steady state hypothesis metrics in real time and immediately record any anomalies.
- Abort Criteria Evaluation: Immediately halt the experiment if pre-defined abort conditions are reached (e.g., error rate exceeds 5%, response time exceeds 5 seconds).
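The abort evaluation in the last step can be scripted rather than eyeballed. A minimal sketch, using the example thresholds above (error rate 5%, response time 5 seconds); in practice the two inputs would be fed by Prometheus queries:

```bash
# Abort gate: halt the experiment when either threshold is breached.
# Inputs would normally come from Prometheus; here they are example values.
check_abort() {
  local error_rate_pct="$1" p99_ms="$2"
  if awk -v e="$error_rate_pct" -v p="$p99_ms" 'BEGIN { exit !(e > 5 || p > 5000) }'; then
    echo "ABORT"
  else
    echo "CONTINUE"
  fi
}
check_abort 0.4 600     # healthy -> CONTINUE
check_abort 6.2 1200    # error rate above 5% -> ABORT
```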
### Phase 3: Post-Analysis (Within 1 Week After the Experiment)
- Document Results: Record each experiment's hypothesis, actual results, and observed anomalies in detail.
- Identify Improvements: When a hypothesis fails (i.e., the system did not recover as expected), assign specific improvement items with owners.
- Track Action Items: Continuously track the implementation status of identified improvements and re-validate them in the next GameDay.
## SLO-Based Steady State Hypothesis Validation
The success or failure of a Chaos Engineering experiment depends on the definition and validation of the Steady State Hypothesis. Defining steady state based on SLOs (Service Level Objectives) enables deriving experiment results that are meaningful from a business perspective.
### SLO Definition Examples
| Service | SLI (Indicator) | SLO (Target) | Measurement Method |
|---|---|---|---|
| API Gateway | Availability | 99.9% (monthly) | Prometheus up metric |
| Order Service | Response time p99 | Under 500ms | Histogram http_request_duration_seconds |
| Payment Service | Error rate | Below 0.1% | http_requests_total{status=~"5.."} / total requests |
| Search Service | Throughput | Over 1000 requests/sec | rate(http_requests_total[5m]) |
| Database | Query latency p95 | Under 100ms | pg_stat_statements |
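As a concrete instance of the error-rate row, the SLI is simply failed requests divided by total requests over the measurement window. With made-up sample counts:

```bash
# Error-rate SLI from two counters over the same window (example numbers).
errors=42        # increase of http_requests_total{status=~"5.."}
total=120000     # increase of http_requests_total
awk -v e="$errors" -v t="$total" \
  'BEGIN { printf "error rate: %.3f%% (SLO: below 0.100%%)\n", e / t * 100 }'
# -> error rate: 0.035% (SLO: below 0.100%)
```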
### Validation with LitmusChaos Probes
LitmusChaos automatically validates steady state hypotheses during experiment execution through its Probe mechanism. It supports various probe types including HTTP Probe, CMD Probe, Prometheus Probe, and Kubernetes Probe. If experiment results fail to meet the hypothesis, the experiment is marked as "Failed."
HTTP Probe can verify an endpoint's response code and response time at three stages: before the experiment (SOT), during the experiment (Continuous), and after the experiment (EOT). Prometheus Probe executes Prometheus queries to verify that SLI metrics do not breach SLO thresholds. This enables objective determination of experiment success or failure.
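As a sketch of what a Prometheus Probe might look like inside a ChaosEngine's experiment spec (the endpoint and query are placeholders, and the probe schema varies between Litmus versions, so check the documentation for yours before copying):

```yaml
probe:
  - name: error-rate-slo-check
    type: promProbe
    mode: Continuous
    promProbe/inputs:
      endpoint: http://prometheus.monitoring.svc:9090   # placeholder endpoint
      query: sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m]))
      comparator:
        criteria: '<'        # experiment is marked Failed if the SLI breaches this
        value: '0.001'       # error rate must stay below 0.1%
    runProperties:
      probeTimeout: 5s
      interval: 10s
      retry: 2
```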
### Validation with Chaos Mesh StatusCheck
In Chaos Mesh, steady state validation can be automated by combining StatusCheck CRDs with Workflows. HTTP requests or custom scripts are executed before and after experiment runs to verify system state, and Grafana dashboard integration provides visual tracking of SLO metric changes.
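A minimal StatusCheck sketch (field names follow the Chaos Mesh v2.x CRD; the URL is a placeholder, and the schema is worth re-checking against your installed version):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StatusCheck
metadata:
  name: payment-steady-state
  namespace: chaos-mesh
spec:
  mode: Continuous
  type: HTTP
  intervalSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3       # consecutive failures before the check is marked Failed
  successThreshold: 1
  http:
    url: http://payment-service.production.svc:8080/health   # placeholder URL
    method: GET
    criteria:
      statusCode: '200'
```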
## Failure Case Studies and Recovery Procedures
### Case 1: Chaos Experiment Fails to Clean Up
Symptom: Network latency persists even after the NetworkChaos experiment's duration has expired, causing ongoing impact to production services.
Root Cause: The Chaos Daemon was OOM-killed or the node restarted, preventing cleanup of tc (traffic control) rules. Alternatively, the Chaos Mesh Controller Manager failed to process the deletion event.
Recovery Procedure:
```bash
# 1. Force-delete the in-progress Chaos resource
kubectl delete networkchaos network-delay-between-services -n chaos-mesh

# 2. Verify the Chaos Daemon is running normally
kubectl get pods -n chaos-mesh -l app.kubernetes.io/component=chaos-daemon

# 3. Restart the Chaos Daemon if it is unhealthy
kubectl rollout restart daemonset chaos-daemon -n chaos-mesh

# 4. Manually clean up tc rules (on the affected node, after SSH-ing in)
tc qdisc show dev eth0
tc qdisc del dev eth0 root netem 2>/dev/null || true

# 5. Clean up iptables rules (for network partition experiments)
iptables -L -n | grep CHAOS
iptables -F CHAOS-INPUT 2>/dev/null || true
iptables -F CHAOS-OUTPUT 2>/dev/null || true
```
Preventive Measures: Set sufficient resource requests/limits for the Chaos Daemon, and configure monitoring alerts to verify automatic cleanup after experiments. In production environments, always set the duration field to short values and increase gradually.
### Case 2: Chaos Experiment Affects Unintended Pods
Symptom: Due to label selector errors, the chaos experiment injects faults into Pods belonging to services other than the intended target, causing cascading service failures.
Root Cause: The label selector was configured too broadly, or the namespace was omitted, causing Pods across the entire cluster to be included as targets.
Recovery Procedure:
- Immediately delete the Chaos resource to halt the experiment.
- Check the status of affected Pods and perform a rollout restart of Deployments if necessary.
- Modify the experiment's selectors to target only the correct Pods.
Preventive Measures: Before running experiments, execute a --dry-run or a script that pre-validates the target Pod list. Always explicitly specify namespaces, and enable annotationCheck to ensure only Pods with chaos-target annotations are selected.
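The pre-validation script mentioned above can be as simple as counting what the selector matches and refusing to proceed past a cap. A sketch (the pod list is stubbed in with stand-in data; in a real run it would come from the kubectl command shown in the comment):

```bash
# Guard: abort if the experiment's selector matches more pods than expected.
# In a real run, populate the list with:
#   kubectl get pods -n production -l app=payment-service -o name
max_targets=5
pod_list=$(printf 'pod/payment-1\npod/payment-2\npod/payment-3')   # stand-in data
count=$(printf '%s\n' "$pod_list" | wc -l | tr -d ' ')
if [ "$count" -gt "$max_targets" ]; then
  echo "ABORT: selector matches ${count} pods (max ${max_targets})" >&2
  exit 1
fi
echo "OK: ${count} pods targeted"   # -> OK: 3 pods targeted
```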
### Case 3: StressChaos Destabilizes an Entire Node
Symptom: A CPU/memory stress experiment affects not only the target Pod but also other Pods on the same node, causing the entire node to enter a NotReady state.
Root Cause: The stress experiment ran without cgroup limits, causing node-level resource exhaustion. Alternatively, the target Pod had no resource limits set, allowing the stress to propagate across the entire node.
Recovery Procedure:
- Immediately delete the Chaos resource.
- If the node is in a NotReady state, restart kubelet or drain the node and reboot it.
- Verify the recovery status of affected workloads.
Preventive Measures: Always set resource limits on Pods targeted by StressChaos experiments. Use the containerNames field to inject stress into specific containers only, and start with low load levels (CPU load 20-30%) before gradually increasing.
### Case 4: Data Integrity Issues After an IOChaos Experiment
Symptom: After an I/O fault injection experiment, database WAL (Write-Ahead Log) files become corrupted, preventing the database from starting normally.
Root Cause: I/O error injection (action: fault) was applied to a critical database path, causing transaction logs to be written incompletely.
Recovery Procedure:
- Delete the database Pod and restart from the PVC to attempt automatic recovery.
- Use WAL recovery tools to restore the corrupted logs.
- If automatic recovery is not possible, restore data from the most recent backup.
Preventive Measures: When targeting databases with IOChaos experiments, always inject faults only on read-only paths. Use the path field to exclude WAL directories, and set the percent value low (10-20%) to simulate only partial I/O failures. Always verify that a recent database backup exists before running the experiment.
### Case 5: LitmusChaos ChaosEngine Fails to Complete
Symptom: The ChaosEngine resource status remains stuck at Running and the experiment appears to run indefinitely.
Root Cause: The Chaos Runner Pod entered a crash loop, or the experiment execution Pod failed to pull its image, preventing the experiment from starting.
Recovery Procedure:
- Check the ChaosEngine status and its related Pods: `kubectl describe chaosengine <name> -n <namespace>`
- Examine the logs of the Chaos Runner and experiment Pods.
- Set the ChaosEngine's `engineState` to `stop` to halt the experiment.
- After resolving the issue, create a new ChaosEngine to retry.
Preventive Measures: Pre-pull the container images used in experiments onto nodes, and set timeouts on ChaosEngines so they automatically abort if not completed within the specified time.
## Operational Checklist
### Experiment Design Checklist
- Is the Steady State Hypothesis defined with measurable metrics?
- Is the experiment's blast radius clearly bounded?
- Have label selectors and namespaces been pre-validated to target only the intended Pods?
- Are abort criteria defined in advance?
- Are rollback/recovery procedures documented and understood by team members?
### Production Deployment Checklist
- Has the experiment been thoroughly tested in a staging environment before applying to production?
- Has the current system state been verified as healthy before starting the experiment?
- Is real-time monitoring (Grafana dashboards, alerts) configured for the experiment duration?
- Is the experiment scheduled to avoid business-critical time windows (payment peaks, promotional events)?
- Have the experiment schedule and blast radius been shared with relevant teams in advance?
- Are experiment execution permissions appropriately restricted via Chaos RBAC?
### During Experiment Checklist
- Are all SLI metrics being monitored in real time?
- Is the team prepared to immediately halt the experiment if abort conditions are reached?
- Are experiment results (success/failure) being recorded?
- Are unexpected side effects being reported immediately upon discovery?
### Post-Experiment Checklist
- Has it been confirmed that all injected faults have been fully cleaned up?
- Has it been verified that system-level changes such as tc rules and iptables rules have been reverted?
- Have experiment results been documented and improvement items identified?
- Have owners and deadlines been assigned to identified improvement items?
- Has the next GameDay schedule and re-validation plan been established?
## References
- LitmusChaos Official Documentation
- LitmusChaos GitHub Repository
- Chaos Mesh Official Documentation
- Chaos Mesh GitHub Repository
- Principles of Chaos Engineering
- CNCF - LitmusChaos Q4 2025 Update
- Running Chaos Experiments in Kubernetes with Chaos Mesh - Hands-on Guide
- Top 5 Chaos Engineering Platforms Compared - Loft Labs
- AWS Blog - Chaos Engineering with LitmusChaos on Amazon EKS