- Introduction
- Architecture and Core Concepts
- Installation and Service Mesh Enablement
- mTLS Configuration: SPIRE-Based Mutual Authentication
- L4/L7 Traffic Management
- Performance Comparison: Cilium vs Istio vs Linkerd
- Troubleshooting: Failure Scenarios and Recovery
- Operations Notes
- Operations Checklist
- References

Introduction
A service mesh is an infrastructure layer that makes communication between microservices secure and observable. Traditional service meshes such as Istio and Linkerd inject a sidecar proxy into each Pod. This approach works, but each sidecar consumes additional CPU and memory, lengthens Pod startup, and adds a network hop that increases latency.
Cilium Service Mesh fundamentally changes this paradigm. It uses eBPF to handle L4 traffic at the kernel level and only uses a per-node shared Envoy proxy (DaemonSet) when L7 functionality is needed. Without sidecars, resource overhead is dramatically reduced and the complexity of managing per-Pod proxies disappears.
This article comprehensively covers the architectural principles of Cilium Service Mesh, installation, mTLS configuration, traffic management, performance comparison, as well as failure scenarios and recovery procedures you may encounter in production.
Architecture and Core Concepts
Limitations of the Traditional Sidecar Model
In a traditional service mesh, an Envoy sidecar is injected into every Pod. If you have 100 Pods, 100 additional Envoy instances are running.
Cost of the Sidecar Model:
- Each Envoy sidecar per Pod consumes approximately 50-100MB of memory
- Pod startup delay due to sidecar initialization (additional 2-5 seconds)
- Additional latency from all traffic passing through the sidecar (~1ms)
- Full Pod rolling restart required when upgrading sidecars
Cilium Service Mesh Sidecarless Architecture
Cilium Service Mesh implements service mesh functionality in two layers.
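L4 Layer (eBPF): TCP connection management, load balancing, network policy enforcement, and enforcement of mutual-authentication results are handled by eBPF programs inside the kernel. The mTLS handshake itself is performed out of band by the Cilium Agent using SPIRE-issued identities, and payload encryption is delegated to WireGuard or IPsec. Because the datapath runs in the kernel without sidecars, overhead is minimal.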
L7 Layer (Shared Envoy): Only when L7 functionality is needed, such as HTTP routing, header-based traffic splitting, and gRPC filtering, a single shared Envoy proxy per node (DaemonSet) handles the traffic. Since it is per-node rather than per-Pod, the number of Envoy instances is dramatically reduced.
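To make the per-Pod versus per-node difference concrete, here is a minimal arithmetic sketch. It uses the figures quoted in this article (50-100MB per sidecar, roughly 150MB for a shared node Envoy); the function name and the 5-node cluster in the example are illustrative assumptions, not measurements.

```python
def envoy_footprint(pods: int, nodes: int,
                    sidecar_mb: tuple[int, int] = (50, 100),
                    shared_envoy_mb: int = 150) -> dict:
    """Compare proxy instance counts and memory for the two models.

    sidecar_mb is the (low, high) per-sidecar estimate quoted in the article;
    shared_envoy_mb is the article's per-node figure for the shared Envoy.
    """
    low, high = sidecar_mb
    return {
        # Sidecar model: one Envoy per Pod.
        "sidecar_instances": pods,
        "sidecar_memory_mb": (pods * low, pods * high),
        # Sidecarless model: one Envoy per node, regardless of Pod count.
        "shared_instances": nodes,
        "shared_memory_mb": nodes * shared_envoy_mb,
    }


if __name__ == "__main__":
    # Hypothetical cluster: 100 Pods spread over 5 nodes.
    print(envoy_footprint(pods=100, nodes=5))
```

With these assumed inputs, the sidecar model runs 100 proxies against 5 for the shared model; the memory gap grows with Pod density per node.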
Core Components
- Cilium Agent (DaemonSet): Loads and manages eBPF programs on each node. Serves as the data plane of the service mesh.
- Cilium Operator (Deployment): Handles cluster-level resource management, IP pool allocation, and CRD synchronization.
- Envoy DaemonSet: A shared proxy that only handles traffic with L7 policies applied. The Cilium Agent automatically injects Envoy configuration.
- Hubble: A built-in observability tool that monitors all network flows in real time.
Installation and Service Mesh Enablement
Prerequisites
# Check kernel version (5.10+ required, 6.1+ recommended)
uname -r
# Verify eBPF support
cat /boot/config-$(uname -r) | grep CONFIG_BPF
# CONFIG_BPF=y
# CONFIG_BPF_SYSCALL=y
# CONFIG_BPF_JIT=y
# Install Cilium CLI
CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/main/stable.txt)
curl -L --fail --remote-name-all \
https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-linux-amd64.tar.gz
sudo tar xzvf cilium-linux-amd64.tar.gz -C /usr/local/bin
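The kernel check above can be automated when validating many nodes. The following is a small sketch that classifies a `uname -r` style string against the 5.10 (required) and 6.1 (recommended) thresholds stated in this article; the function names and the sample release string are illustrative assumptions.

```python
import re

REQUIRED = (5, 10)      # minimum kernel stated in this article
RECOMMENDED = (6, 1)    # recommended kernel stated in this article


def parse_kernel_release(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_verdict(release: str) -> str:
    """Return 'unsupported', 'supported', or 'recommended'."""
    version = parse_kernel_release(release)
    if version < REQUIRED:
        return "unsupported"
    if version < RECOMMENDED:
        return "supported"
    return "recommended"


if __name__ == "__main__":
    # Hypothetical sample value; on a real node pass the output of `uname -r`.
    print(kernel_verdict("5.15.0-91-generic"))
```

Comparing integer tuples avoids the classic mistake of comparing version strings lexically, where "5.9" would sort after "5.10".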
Installing Service Mesh with Helm
# Add Helm repo
helm repo add cilium https://helm.cilium.io/
helm repo update
# Install Cilium Service Mesh (kube-proxy replacement + service mesh enabled)
helm install cilium cilium/cilium --version 1.19.0 \
--namespace kube-system \
--set kubeProxyReplacement=true \
--set k8sServiceHost="API_SERVER_IP" \
--set k8sServicePort=6443 \
--set envoyConfig.enabled=true \
--set ingressController.enabled=true \
--set ingressController.loadbalancerMode=shared \
--set hubble.enabled=true \
--set hubble.relay.enabled=true \
--set hubble.ui.enabled=true \
--set encryption.enabled=true \
--set encryption.type=wireguard \
--set authentication.mutual.spire.enabled=true \
--set authentication.mutual.spire.install.enabled=true
Here is an explanation of the key parameters in this configuration:
- envoyConfig.enabled=true: Enables L7 traffic management via the CiliumEnvoyConfig CRD
- ingressController.enabled=true: Uses Cilium as the Kubernetes Ingress controller
- authentication.mutual.spire.enabled=true: Enables SPIRE-based mTLS authentication
- encryption.type=wireguard: WireGuard-based transparent encryption between nodes
Verifying the Installation
# Check overall Cilium status
cilium status --wait
# Expected output:
# /¯¯\
# /¯¯\__/¯¯\ Cilium: OK
# \__/¯¯\__/ Operator: OK
# /¯¯\__/¯¯\ Envoy DaemonSet: OK
# \__/¯¯\__/ Hubble Relay: OK
# \__/ ClusterMesh: disabled
# SPIRE Server: OK
# SPIRE Agent: OK
# Verify service mesh features
cilium config view | grep -E "envoy|mesh|mutual"
# Connectivity test (including service mesh)
cilium connectivity test
mTLS Configuration: SPIRE-Based Mutual Authentication
Why mTLS Is Needed
One of the core features of a service mesh is mutual authentication (mTLS) for inter-service communication. mTLS requires both the client and server to present certificates, verifying each other's identity. This prevents man-in-the-middle (MITM) attacks and implements workload identity-based security independent of network policies.
Cilium implements mTLS using the SPIFFE/SPIRE framework. Each workload is automatically assigned a SPIFFE ID, and certificate issuance and renewal happen transparently.
Applying mTLS Authentication Policies
# mtls-authentication-policy.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: require-mutual-auth
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: payment-service
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: order-service
      authentication:
        mode: required # mTLS authentication required
      toPorts:
        - ports:
            - port: '8080'
              protocol: TCP
This policy requires mTLS authentication for inbound traffic to payment-service. If order-service does not have a valid SPIFFE ID, the connection will be rejected.
Enforcing Cluster-Wide mTLS
# cluster-wide-mtls.yaml
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: enforce-mtls-cluster-wide
spec:
  endpointSelector:
    matchExpressions:
      - key: io.kubernetes.pod.namespace
        operator: NotIn
        values:
          - kube-system
  ingress:
    - fromEndpoints:
        - {}
      authentication:
        mode: required
Important Note: Before enabling cluster-wide mTLS, make sure all workloads are registered with SPIRE. Unregistered workloads will have their communication blocked immediately. Always test in a staging environment first, and apply to production gradually on a per-namespace basis.
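# Check workloads registered with SPIRE
# (namespace assumes the chart default for the bundled SPIRE install, cilium-spire)
kubectl exec -n cilium-spire spire-server-0 -c spire-server -- \
/opt/spire/bin/spire-server entry show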
# Check mTLS authentication status
kubectl exec -n kube-system ds/cilium -- \
cilium-dbg identity list | grep -i auth
# JSON output wraps each flow under .flow and reports the drop reason as an enum name
hubble observe --namespace production --verdict DROPPED -o json | \
jq 'select(.flow.drop_reason_desc == "AUTH_REQUIRED")'
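For the gradual, per-namespace rollout recommended above, one option is to generate a namespaced policy for each namespace in turn rather than applying the cluster-wide policy at once. The sketch below builds such manifests as JSON, which kubectl accepts alongside YAML. The policy fields mirror the examples in this section; the function names, the empty endpointSelector (selecting every endpoint in the namespace), and the namespace list are illustrative assumptions.

```python
import json


def mtls_policy(namespace: str) -> dict:
    """Build a CiliumNetworkPolicy requiring mutual auth for all ingress
    to every endpoint in one namespace."""
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumNetworkPolicy",
        "metadata": {"name": "require-mutual-auth", "namespace": namespace},
        "spec": {
            # Empty selector: all endpoints in this namespace (assumption for
            # this sketch; narrow it with matchLabels for a finer rollout).
            "endpointSelector": {},
            "ingress": [
                {
                    "fromEndpoints": [{}],
                    "authentication": {"mode": "required"},
                }
            ],
        },
    }


def rollout_manifests(namespaces: list[str]) -> list[str]:
    """Render one JSON manifest per namespace, in rollout order."""
    return [json.dumps(mtls_policy(ns), indent=2) for ns in namespaces]


if __name__ == "__main__":
    # Hypothetical rollout order: staging first, production last.
    for manifest in rollout_manifests(["staging", "production"]):
        print(manifest)
```

Each rendered manifest can be reviewed and applied separately, so a failure in one namespace does not block traffic in the others.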
L4/L7 Traffic Management
L7 Routing with CiliumEnvoyConfig
Cilium Service Mesh performs L7 traffic management through the CiliumEnvoyConfig CRD. This serves a similar role to Istio's VirtualService/DestinationRule.
# l7-traffic-split.yaml
apiVersion: cilium.io/v2
kind: CiliumEnvoyConfig
metadata:
  name: api-traffic-split
  namespace: production
spec:
  services:
    - name: api-service
      namespace: production
  backendServices:
    - name: api-service-v1
      namespace: production
    - name: api-service-v2
      namespace: production
  resources:
    - '@type': type.googleapis.com/envoy.config.listener.v3.Listener
      name: api-listener
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                '@type': type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: api-traffic
                route_config:
                  name: api-routes
                  virtual_hosts:
                    - name: api-host
                      domains: ['*']
                      routes:
                        - match:
                            prefix: '/'
                            headers:
                              - name: 'x-canary'
                                # string_match replaces the deprecated exact_match field
                                string_match:
                                  exact: 'true'
                          route:
                            cluster: 'production/api-service-v2'
                        - match:
                            prefix: '/'
                          route:
                            weighted_clusters:
                              clusters:
                                - name: 'production/api-service-v1'
                                  weight: 90
                                - name: 'production/api-service-v2'
                                  weight: 10
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      '@type': type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
This configuration defines two routing rules:
- Requests with the x-canary: true header are routed to v2
- Remaining requests are distributed with 90% weight to v1 and 10% to v2 (canary deployment)
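The two routing rules can be modeled in a few lines to see how they interact. This is only a simulation of the decision logic described above, not Envoy's implementation; the function name is hypothetical, and the sketch assumes header names have already been lowercased.

```python
import random

# Cluster names and weights taken from the CiliumEnvoyConfig in this section.
WEIGHTED_CLUSTERS = [
    ("production/api-service-v1", 90),
    ("production/api-service-v2", 10),
]


def pick_cluster(headers: dict[str, str], rng: random.Random) -> str:
    """Return the cluster a request would be routed to."""
    # Rule 1: the header match is listed first, so it wins over the weights.
    if headers.get("x-canary") == "true":
        return "production/api-service-v2"
    # Rule 2: weighted split over the remaining traffic.
    total = sum(weight for _, weight in WEIGHTED_CLUSTERS)
    point = rng.randrange(total)  # uniform integer in [0, total)
    for name, weight in WEIGHTED_CLUSTERS:
        if point < weight:
            return name
        point -= weight
    raise AssertionError("unreachable: weights cover the whole range")


if __name__ == "__main__":
    rng = random.Random()
    picks = [pick_cluster({}, rng) for _ in range(10_000)]
    share_v2 = picks.count("production/api-service-v2") / len(picks)
    print(f"v2 share without the canary header: {share_v2:.1%}")
```

Route order matters: because the header rule is evaluated first, a request carrying x-canary: true never enters the weighted split.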
L4 Load Balancing Policy
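L4 load balancing and L4 policy enforcement are handled directly by eBPF without going through Envoy, making them very efficient. The following policy restricts backend-service to the L4 peers and ports it needs.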
# l4-lb-policy.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: backend-l4-policy
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: backend-service
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: api-gateway
      toPorts:
        - ports:
            - port: '8080'
              protocol: TCP
            - port: '9090'
              protocol: TCP
  egress:
    - toEndpoints:
        - matchLabels:
            app: database
      toPorts:
        - ports:
            - port: '5432'
              protocol: TCP
    - toEndpoints:
        - matchLabels:
            app: cache
      toPorts:
        - ports:
            - port: '6379'
              protocol: TCP
L7 HTTP Policies and Rate Limiting
# l7-rate-limiting.yaml
# Note: the HTTP rules below allow or deny requests by method, path, and header.
# This manifest does not itself configure a rate limit.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: api-l7-with-ratelimit
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: public-api
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: '8080'
              protocol: TCP
          rules:
            http:
              - method: GET
                path: '/api/v1/products.*'
              - method: GET
                path: '/api/v1/categories.*'
              - method: POST
                path: '/api/v1/orders'
                headers:
                  - 'Content-Type: application/json'
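To reason about which requests the HTTP rules above admit, the allow-list can be mirrored in a short sketch. It assumes the path and method patterns are regular expressions that must match the whole value, and it uses Python's re module, which is not the same regex dialect as the proxy's (the patterns here are valid in both). The HttpRule class and is_allowed function are hypothetical names for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class HttpRule:
    method: str
    path: str
    headers: tuple[str, ...] = ()


# Rules copied from the CiliumNetworkPolicy in this section.
RULES = (
    HttpRule("GET", "/api/v1/products.*"),
    HttpRule("GET", "/api/v1/categories.*"),
    HttpRule("POST", "/api/v1/orders",
             headers=("Content-Type: application/json",)),
)


def is_allowed(method: str, path: str, headers: dict[str, str]) -> bool:
    """Return True if any rule admits the request (default deny otherwise)."""
    # HTTP header names are case-insensitive, so normalize them once.
    normalized = {name.lower(): value for name, value in headers.items()}
    for rule in RULES:
        if not re.fullmatch(rule.method, method):
            continue
        if not re.fullmatch(rule.path, path):
            continue
        required = [h.split(":", 1) for h in rule.headers]
        if all(normalized.get(name.strip().lower()) == value.strip()
               for name, value in required):
            return True
    return False


if __name__ == "__main__":
    print(is_allowed("GET", "/api/v1/products/42", {}))
    print(is_allowed("POST", "/api/v1/orders", {}))
```

Under the full-match assumption, POST /api/v1/orders/1 is rejected because the orders pattern has no trailing wildcard, while the products and categories patterns admit any suffix.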
Performance Comparison: Cilium vs Istio vs Linkerd
Performance is one of the most important criteria when choosing a service mesh. Below are benchmark results measured on the same workload (gRPC microservice, 100 RPS).
| Item | Cilium Service Mesh | Istio (Sidecar) | Linkerd |
|---|---|---|---|
| Architecture | Sidecarless (eBPF + Node Envoy) | Per-Pod Sidecar Envoy | Per-Pod Sidecar linkerd2-proxy |
| P50 Latency Overhead | ~0.1ms (L4), ~0.3ms (L7) | ~1.0ms | ~0.5ms |
| P99 Latency Overhead | ~0.3ms (L4), ~0.8ms (L7) | ~3.0ms | ~1.5ms |
| Memory Overhead per Pod | 0MB (L4) / ~150MB shared per node | ~50-100MB | ~20-30MB |
| CPU Overhead per Pod | Nearly none (L4) | ~10-50m | ~5-20m |
| Pod Startup Delay | 0s (no sidecar) | 2-5s (sidecar injection) | 1-3s |
| mTLS | SPIRE/WireGuard | Built-in (istiod CA, formerly Citadel) | Built-in (own PKI) |
| L7 Features | CiliumEnvoyConfig | VirtualService/DestinationRule | HTTPRoute/ServiceProfile |
| Observability | Hubble (built-in) | Kiali, Jaeger (separate) | Linkerd Viz (built-in) |
| Gateway API Support | Native | Native | Native |
| Community | CNCF Graduated | CNCF Graduated | CNCF Graduated |
| Learning Curve | Medium (eBPF understanding required) | High (complex CRD system) | Low |
How to Run the Benchmark
# Performance measurement using fortio
kubectl run fortio-client --rm -it --image=fortio/fortio -- \
load -c 50 -qps 1000 -t 60s -json - \
http://api-service.production:8080/api/v1/health
# Key analysis points:
# - P50, P90, P99 latency
# - Maximum QPS (Queries Per Second)
# - Error rate
# - CPU/Memory usage (measured separately with kubectl top pods)
# Check L7 latency using Hubble metrics
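# JSON output wraps each flow under .flow; latency is only present on responses
hubble observe --namespace production --protocol http -o json | \
jq 'select(.flow.l7.latency_ns != null) | (.flow.l7.latency_ns | tonumber) / 1000000'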
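Once latency samples have been collected with and without the mesh, the P50/P90/P99 figures and the overhead between two runs can be computed as follows. This is a minimal sketch using the nearest-rank percentile definition; load-testing tools may interpolate differently, so small discrepancies against their reports are expected. The function names and sample values are illustrative assumptions.

```python
import math


def percentile(samples_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct percent
    of all samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    # Multiply before dividing to keep the rank exact for whole-number inputs.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def overhead_ms(with_mesh: list[float], baseline: list[float],
                pct: float) -> float:
    """Latency overhead at a percentile: mesh run minus baseline run."""
    return percentile(with_mesh, pct) - percentile(baseline, pct)


if __name__ == "__main__":
    # Hypothetical samples in milliseconds, not measured data.
    baseline = [1.0, 1.1, 1.2, 1.3, 5.0]
    with_mesh = [1.1, 1.2, 1.3, 1.5, 5.8]
    for pct in (50, 90, 99):
        print(pct, round(overhead_ms(with_mesh, baseline, pct), 3))
```

Comparing percentiles of two runs, rather than averaging per-request differences, matches how the overhead rows in the table above are expressed.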
Key Takeaway: In the table above, Cilium Service Mesh shows roughly 5-10x lower latency overhead than the sidecar models for L4 traffic, and roughly 2-4x lower for L7 traffic. Even when L7 features are used, the shared Envoy model adds no per-Pod memory overhead, because a single Envoy per node handles all L7 traffic on that node.
Troubleshooting: Failure Scenarios and Recovery
Case 1: Inter-Service Communication Failure Due to mTLS Authentication Failure
Symptoms: A specific service suddenly cannot communicate with other services, and Authentication required drops are observed in Hubble
# Check authentication failure traffic
hubble observe --namespace production --verdict DROPPED -o compact
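# Check SPIRE agent status
# (namespace assumes the chart default for the bundled SPIRE install, cilium-spire;
# adjust if authentication.mutual.spire.install.namespace was overridden)
kubectl get pods -n cilium-spire -l app=spire-agent
kubectl logs -n cilium-spire -l app=spire-agent --tail=50
# Check SVID (certificate) status for a specific workload
kubectl exec -n cilium-spire spire-server-0 -c spire-server -- \
/opt/spire/bin/spire-server entry show -selector k8s:ns:production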
# Check authentication status of Cilium endpoints
kubectl exec -n kube-system ds/cilium -- \
cilium-dbg endpoint list -o json | jq '.[].status.policy.realized.auth'
Causes and Solutions:
- SPIRE Agent OOMKilled: Increase the memory limit of the SPIRE agent. For large clusters, 512Mi or more is recommended
- Certificate Expiration: The default SVID TTL in SPIRE is 1 hour. If the SPIRE server goes down, certificates cannot be renewed, and communication will break after 1 hour. Ensure high availability of the SPIRE server
- Incorrect Selectors: authentication.mode: required in the CiliumNetworkPolicy may have been applied to an unintended scope. Test with a specific namespace first
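The certificate-expiration cause above implies a fixed recovery window once the SPIRE server is down. A minimal sketch of that calculation, assuming the 1-hour default TTL mentioned above and that the last successful renewal time is known (the function name and timestamps are hypothetical):

```python
from datetime import datetime, timedelta, timezone

DEFAULT_SVID_TTL = timedelta(hours=1)  # default TTL stated in this article


def outage_budget(last_renewal: datetime, now: datetime,
                  ttl: timedelta = DEFAULT_SVID_TTL) -> timedelta:
    """Time left before a certificate issued at last_renewal expires.

    Returns zero once the certificate has already expired.
    """
    remaining = last_renewal + ttl - now
    return max(remaining, timedelta(0))


if __name__ == "__main__":
    # Hypothetical timestamps for illustration.
    issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    now = datetime(2024, 1, 1, 12, 40, tzinfo=timezone.utc)
    print(outage_budget(issued, now))
```

Because workloads renew at different times, the effective budget for the cluster is set by the certificate closest to expiry, which can be well under the full TTL.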
Case 2: L7 Policy Failure Due to Envoy DaemonSet Outage
Symptoms: L7 HTTP policies (path/header-based filtering) are not working, and only L4 policies are applied
# Check Envoy DaemonSet status
kubectl get ds -n kube-system cilium-envoy
kubectl describe ds -n kube-system cilium-envoy
# Check Envoy logs on a specific node
kubectl logs -n kube-system -l k8s-app=cilium-envoy --tail=100
# Check connectivity between Cilium Agent and Envoy
kubectl exec -n kube-system ds/cilium -- \
cilium-dbg status --verbose | grep -A5 "Envoy"
# Check CiliumEnvoyConfig status
kubectl get cec -n production -o yaml
Causes and Solutions:
- Envoy Pod Resource Shortage: When a node has heavy L7 traffic, Envoy may run out of memory. Increase envoy.resources.limits.memory in Helm values
- Invalid CiliumEnvoyConfig: If there are Envoy configuration syntax errors, the corresponding listener will not be loaded. Use the cilium-dbg envoy config command to verify the actually loaded configuration
- Node Affinity Issues: If Envoy is not scheduled on specific nodes, L7 policies will not work on those nodes
Case 3: Connection Disruption After Service Mesh Upgrade
Symptoms: After a Cilium version upgrade, communication between some Pods fails intermittently
# Check Cilium Agent rolling restart status
kubectl rollout status ds/cilium -n kube-system
# Check eBPF map synchronization status
kubectl exec -n kube-system ds/cilium -- \
cilium-dbg bpf endpoint list
# Wait for endpoint recovery
kubectl exec -n kube-system ds/cilium -- \
cilium-dbg endpoint list | grep -v ready
Recovery Procedure:
- First, verify that the Cilium Agent has restarted successfully on all nodes
- Confirm that eBPF maps have been reloaded properly using cilium-dbg bpf endpoint list
- If the problem persists, restart the affected Pods to re-register the endpoints
- As a last resort, run cilium-dbg endpoint regenerate --all to regenerate the eBPF programs for all endpoints
Note: When upgrading, always verify changes beforehand with helm diff upgrade and upgrade one minor version at a time. Skipping from Cilium 1.17 to 1.19 is not supported.
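The one-minor-version-at-a-time rule can be turned into a small planning helper. This sketch only computes the sequence of intermediate minor versions; it does not look up patch releases or perform any upgrade, and the function name is a hypothetical one chosen for illustration.

```python
def upgrade_path(current: str, target: str) -> list[str]:
    """List the minor versions to step through, in order, ending at target."""
    cur_major, cur_minor = (int(part) for part in current.split(".")[:2])
    tgt_major, tgt_minor = (int(part) for part in target.split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("this sketch only handles upgrades within one major version")
    if tgt_minor < cur_minor:
        raise ValueError("downgrades are out of scope")
    # One hop per minor version; an empty list means only a patch upgrade.
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


if __name__ == "__main__":
    print(upgrade_path("1.17", "1.19"))
```

For the example in the note, the result is two hops (1.18, then 1.19), each of which should be followed by the verification steps in this section before proceeding.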
Case 4: Hubble Observation Data Not Being Collected
Symptoms: The hubble observe command returns empty results or times out
# Check Hubble Relay status
kubectl get pods -n kube-system -l k8s-app=hubble-relay
kubectl logs -n kube-system -l k8s-app=hubble-relay --tail=50
# Test Hubble connectivity
cilium hubble port-forward &
hubble status
# Cilium Agent's Hubble monitoring status
kubectl exec -n kube-system ds/cilium -- \
cilium-dbg monitor --type drop --type trace
Solution: The most common cause is a disconnected gRPC connection for Hubble Relay. Restart the Relay with kubectl rollout restart deployment hubble-relay -n kube-system.
Operations Notes
Migration from Istio to Cilium Service Mesh
When migrating from an existing Istio environment to Cilium Service Mesh, proceed gradually rather than switching all at once.
Phase 1 - Coexistence (2-4 weeks): Install Cilium as the CNI but keep service mesh features disabled. Istio sidecars continue to operate.
Phase 2 - Per-Namespace Transition: Starting with non-critical namespaces, disable Istio sidecar injection and enable Cilium's mTLS and L7 policies.
Phase 3 - Full Transition: Remove Istio from all namespaces and switch entirely to Cilium Service Mesh.
# Disable Istio sidecar injection per namespace
kubectl label namespace staging istio-injection-
kubectl rollout restart deployment -n staging
# Apply Cilium mTLS policy
kubectl apply -f cilium-mtls-policy-staging.yaml
# Verify after transition
hubble observe --namespace staging --protocol http
Resource Sizing Guidelines
Cilium Agent (DaemonSet):
- Small clusters (10 nodes or fewer): CPU 200m / Memory 256Mi
- Medium clusters (50 nodes or fewer): CPU 500m / Memory 512Mi
- Large clusters (100 nodes or more): CPU 1000m / Memory 1Gi
Envoy DaemonSet:
- Low L7 traffic: CPU 100m / Memory 128Mi
- High L7 traffic: CPU 500m / Memory 512Mi
- Very high L7 throughput: CPU 1000m / Memory 1Gi
SPIRE Server:
- Under 1,000 workloads: CPU 200m / Memory 256Mi
- Over 5,000 workloads: CPU 500m / Memory 512Mi, HA configuration is mandatory
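The Cilium Agent tiers above can be encoded as a lookup for use in provisioning scripts. The guidelines leave clusters of 51-99 nodes unspecified; this sketch assumes they fall into the large tier, which is the conservative choice. The function name is hypothetical, and the returned values are starting points to tune against observed usage.

```python
def cilium_agent_resources(node_count: int) -> dict[str, str]:
    """Map cluster size to the Cilium Agent sizing tiers in this article."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    if node_count <= 10:
        return {"cpu": "200m", "memory": "256Mi"}
    if node_count <= 50:
        return {"cpu": "500m", "memory": "512Mi"}
    # Assumption: 51-99 nodes are sized like large clusters (100+ nodes).
    return {"cpu": "1000m", "memory": "1Gi"}


if __name__ == "__main__":
    for nodes in (8, 40, 120):
        print(nodes, cilium_agent_resources(nodes))
```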
Essential Monitoring Metrics
# Key metrics to collect in Prometheus
# 1. Cilium Agent status
# cilium_agent_api_process_time_seconds - API processing time
# cilium_agent_bootstrap_seconds - Agent startup time
# cilium_bpf_map_ops_total - BPF map operation count
# 2. Service mesh related
# cilium_proxy_upstream_reply_seconds - L7 proxy upstream response time
# cilium_proxy_redirects - Number of connections redirected to L7 proxy
# cilium_auth_map_entries - mTLS authentication map entry count
# 3. Hubble observability
# hubble_flows_processed_total - Total processed flows
# hubble_tcp_flags_total - Count by TCP flags
# Import Grafana dashboards
# Official Cilium dashboard: https://grafana.com/grafana/dashboards/16611
Operations Checklist
Pre-Deployment Checklist
- Verify the kernel version is 5.10 or higher (6.1 or higher recommended)
- Verify that CONFIG_BPF, CONFIG_BPF_SYSCALL, and CONFIG_BPF_JIT are enabled
- When installing in kube-proxy replacement mode, verify that the existing kube-proxy has been removed
- Verify that the SPIRE server is deployed in an HA configuration (required for production)
- Verify that CiliumNetworkPolicy does not conflict with existing Kubernetes NetworkPolicy
- Verify that PodDisruptionBudget is configured for Cilium DaemonSet and SPIRE
- Verify that cilium connectivity test passes
Upgrade Checklist
- Always check Breaking Changes in the Cilium release notes
- Upgrade one minor version at a time sequentially (1.17 -> 1.18 -> 1.19)
- Verify changes beforehand with helm diff upgrade
- Upgrade in the staging environment first and observe for at least 24 hours
- Monitor Agent rolling restart during upgrade with cilium status
- Re-run cilium connectivity test after upgrade
Incident Response Checklist
- Cilium Agent failure: Pods on the affected node can continue communicating using existing eBPF maps, but policy updates are suspended
- Envoy DaemonSet failure: Only L7 policies are affected; L4 policies continue to work via eBPF
- SPIRE server failure: Recovery is required before the TTL (default 1 hour) of existing certificates expires
- etcd failure (Cilium KVStore): Cilium Agent operates with local cache, but new policies cannot be applied
References
- Cilium Official Documentation - Service Mesh
- Cilium Official Documentation - Mutual Authentication (mTLS)
- Cilium GitHub - Service Mesh Examples
- Cilium Official Documentation - CiliumEnvoyConfig
- SPIFFE/SPIRE Official Documentation
- Kubernetes Official Blog - Service Mesh Interface
- Isovalent - Cilium Service Mesh Performance Benchmark
- Cilium Official Documentation - Hubble Observability