Service Mesh Deep Dive — Envoy, Istio, Linkerd, Cilium eBPF, Ambient Mesh, xDS/mTLS

Author: Youngju Kim (@fjvbn20031)
Intro — "Answer or Fad?"
Around 2017 the consensus was "microservices require a Service Mesh." By 2021 people asked "100MB per sidecar — is this really efficient?" By 2023-2024, Istio Ambient Mesh and Cilium's sidecar-less mesh arrived in force.
What happened in eight years? This post covers:
- Why a Service Mesh was needed — pains born of microservices
- How Envoy works — listener, filter, cluster, xDS
- Istio's control plane — what istiod actually does
- How mTLS becomes automatic
- Ambient Mesh — Istio without sidecars
- Cilium eBPF — L7 handled in the kernel
- Linkerd's Rust sidecar — the minimalist school
- Selection criteria in the field
1. Why Service Meshes Were Born
Splitting a monolith into 100 microservices creates:
- Retries — how many, with what backoff?
- Timeouts — how long to wait?
- Circuit breaking — fail fast when downstream keeps erroring
- Load balancing — round robin? Least request?
- Traffic splitting — route 10% to the new version
- mTLS — encrypt and authenticate between services
- Observability — trace IDs, latency histograms per call
- AuthN/AuthZ — which service can call which?
From 2015-2017 people solved this with per-language libraries (Netflix Hystrix, Ribbon, Eureka). But as stacks diversified into Go, Python, and Node, every language needed its own implementation, and a library upgrade meant redeploying every service.
The Answer: Outsource to a Network Proxy
If A calls B over HTTP/gRPC, a proxy in the middle can add these features regardless of language. Lyft open-sourced Envoy in 2016; Google, IBM, and Lyft built Istio on top of it in 2017, and the Service Mesh era began.
2. Envoy — The Modern Network Proxy Standard
Why Envoy
- Written in C++ — fast like nginx, with a modern design
- Native HTTP/2 and gRPC
- Dynamic config (xDS) — no restart needed
- Strong observability — stats, logs, traces built-in
- Filter chains — extensible via WASM and Lua
The Four Core Concepts
┌─────────────────────────────────────────────────┐
│ Envoy │
│ │
│ ┌─────────┐ ┌──────────────┐ ┌────────┐ │
│ │Listener │ -> │ Filter Chain │ -> │ Cluster│ │
│ └─────────┘ └──────────────┘ └────────┘ │
└─────────────────────────────────────────────────┘
- Listener — which port to listen on (L4)
- Filter Chain — how to process incoming connections
- Cluster — where to send traffic (upstream IP:Port list)
- Endpoint — actual IPs within a Cluster
Filters are pluggable, so WAF, Rate Limit, and JWT validation can be inserted.
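For example, JWT validation can sit ahead of the router in the HTTP filter chain. A minimal sketch, assuming a hypothetical issuer URL and a jwks_cluster defined elsewhere in the config:

http_filters:
  - name: envoy.filters.http.jwt_authn
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.jwt_authn.v3.JwtAuthentication
      providers:
        example_provider:
          issuer: https://issuer.example.com        # hypothetical issuer
          remote_jwks:
            http_uri:
              uri: https://issuer.example.com/.well-known/jwks.json
              cluster: jwks_cluster                 # upstream cluster serving the JWKS
              timeout: 5s
  - name: envoy.filters.http.router                 # terminal filter: forwards to the Cluster
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router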
xDS API — The Heart of Dynamic Config
Envoy receives config from five discovery services:
- LDS (Listener), RDS (Route), CDS (Cluster), EDS (Endpoint), SDS (Secret)
These are aggregated into ADS (Aggregated Discovery Service). Envoy connects to the control plane via a gRPC stream, and the control plane pushes config changes. Updates apply without downtime — this is how changes to a Kubernetes Service or Deployment propagate instantly.
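A minimal bootstrap sketch of the ADS wiring; the node identity, cluster name, and istiod address here are illustrative:

node:
  id: reviews-sidecar              # identity reported to the control plane
  cluster: reviews
dynamic_resources:
  ads_config:                      # one gRPC stream carries all resource types
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
      - envoy_grpc: { cluster_name: xds-control-plane }
  lds_config: { ads: {}, resource_api_version: V3 }   # Listeners via ADS
  cds_config: { ads: {}, resource_api_version: V3 }   # Clusters via ADS
static_resources:
  clusters:
    - name: xds-control-plane      # the bootstrap must know how to reach the control plane
      type: STRICT_DNS
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config: { http2_protocol_options: {} }   # xDS requires HTTP/2
      load_assignment:
        cluster_name: xds-control-plane
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: istiod.istio-system.svc, port_value: 15010 }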
3. Istio — The Most Widely Used Mesh
Istio 1.0's Complexity
Istio 1.0 (2018) had four components: Pilot, Mixer, Citadel, Galley. Mixer received a separate gRPC call per request and became a bottleneck. In 1.5, Mixer was removed; telemetry and policy moved into the Envoy proxies themselves (Telemetry v2, in-proxy filters).
Today's Istio: Just istiod
┌──────────────────────────────────┐
│ istiod │
│ Pilot (xDS) + Citadel + Galley │
└─────────────┬────────────────────┘
│ xDS
▼
Envoy Sidecar
(one per Pod)
How Sidecar Injection Works
Sidecar injection is a Mutating Admission Webhook:
- A user creates a Pod in a namespace labeled istio-injection=enabled
- The API Server calls Istio's Mutating Webhook
- The webhook injects the istio-proxy sidecar container
- An istio-init initContainer is also added to insert iptables rules
The iptables rules:
All outbound traffic → REDIRECT to 127.0.0.1:15001 (Envoy)
All inbound traffic → REDIRECT to 127.0.0.1:15006 (Envoy)
The app thinks it talks directly to the network, but iptables transparently intercepts into Envoy.
VirtualService and DestinationRule
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts: [reviews]
  http:
    - match:
        - headers:
            user-agent: { regex: ".*Mobile.*" }
      route:
        - destination: { host: reviews, subset: v2 }
    - route:
        - destination: { host: reviews, subset: v1 }
          weight: 90
        - destination: { host: reviews, subset: v2 }
          weight: 10
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    connectionPool:
      tcp: { maxConnections: 100 }
      http: { http1MaxPendingRequests: 10 }
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
  subsets:
    - name: v1
      labels: { version: v1 }
    - name: v2
      labels: { version: v2 }
4. mTLS — How "Automatic Encryption" Actually Works
Certificate Flow
- istiod acts as its own CA with a root key
- Each Envoy sidecar sends a CSR to istiod on startup
- The CSR carries the Pod's Service Account identity
- istiod validates and returns a signed cert (typically 24h validity)
- The cert is rotated automatically via SDS before it expires
The cert SAN contains a SPIFFE ID:
spiffe://cluster.local/ns/default/sa/reviews
Handshake
Envoy A connects to Envoy B: TLS ClientHello, B presents its cert, A validates against istiod's root CA. B also demands A's cert (the "m" in mTLS). Both sides get an authenticated SPIFFE identity, which AuthorizationPolicy uses to enforce "SA X may call SA Y."
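A sketch of such a policy, assuming productpage's service account should be the only allowed caller of reviews (all names illustrative):

apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: reviews-allow-productpage
  namespace: default
spec:
  selector:
    matchLabels:
      app: reviews            # applies to the reviews workloads
  action: ALLOW               # anything not matched below is denied
  rules:
    - from:
        - source:
            # the SPIFFE identity established during the mTLS handshake
            principals: ["cluster.local/ns/default/sa/productpage"]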
STRICT vs PERMISSIVE
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # root namespace = mesh-wide default
spec:
  mtls:
    mode: STRICT
Production target is STRICT. During migration, PERMISSIVE allows both plaintext and mTLS.
5. The Cost of Sidecars
Memory
50-150MB per Envoy. 1000 Pods = 100GB just in sidecars. Tiny apps suffer the most — "150MB sidecar for a 100MB app."
Startup Latency
initContainer, sidecar boot, xDS config fetch, app ready. 2-3s extra. Painful for Jobs and CronJobs.
Lifecycle Mismatch
App container exits but sidecar still runs, or vice versa. Kubernetes 1.28 introduced native Sidecar Containers (Alpha; Beta since 1.29), which helps, but the memory issue remains.
Double Hop
App A -> Envoy A -> Envoy B -> App B. Four L7 parses per request.
Debugging Pain
A 503 appeared. App A? Envoy A? Network? Envoy B? App B? Five places to check.
6. Ambient Mesh — Istio's Sidecar-less Revolution
Announced 2022, Beta with Istio 1.22 (2024), GA with Istio 1.24 (late 2024).
Core Idea: Split L4 and L7
Ztunnel (per-node L4 proxy) — one Rust proxy per Node; handles mTLS, AuthN, and L4 routing. Pod traffic is redirected via iptables/eBPF.
Waypoint Proxy (per-service L7 proxy) — deployed only for services that need L7 features; a regular Deployment that can autoscale.
┌──────────────────────────────┐
│ Node │
│ [App Pod] [App Pod] │
│ │ │ │
│ ▼ ▼ │
│ [Ztunnel] (L4 mTLS) │
└──────┼───────────────────────┘
│
▼ (only when L7 needed)
[Waypoint Proxy Pod]
▼
[Destination Service]
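Enrolling a service behind a Waypoint uses a Gateway API resource. A sketch of roughly what istioctl waypoint apply generates (HBONE is Istio's HTTP-based overlay transport):

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: waypoint
  namespace: default
  labels:
    istio.io/waypoint-for: service   # this waypoint serves Service-addressed traffic
spec:
  gatewayClassName: istio-waypoint   # programs a Waypoint, not an edge gateway
  listeners:
    - name: mesh
      port: 15008                    # HBONE port
      protocol: HBONE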
Pros and Cons
Pros: less memory, gradual adoption (L4 only or add Waypoint), separated lifecycle.
Cons: younger, extra hop when Waypoint is used, tooling less mature.
7. Cilium Service Mesh — eBPF in the Kernel
eBPF in One Minute
eBPF is a Linux kernel virtual machine that safely runs bytecode at kernel hooks — networking (XDP, TC), syscall tracing, profiling, security. Extends the kernel without modules or reboots.
Cilium's Bet: Move the Proxy into the Kernel
Packets from a Pod are routed in-kernel without passing through a userspace proxy. kube-proxy is replaced (no iptables). L7 still uses Envoy, but only one per Node.
┌────────────────────────────────┐
│ Node │
│ [App Pod] --- (eBPF hook) │
│ │ │
│ ▼ │
│ Kernel eBPF Program │
│ (L3/L4 policy, LB) │
│ │ │
│ ▼ │
│ [Envoy (per-node, L7 only)] │
└────────────────────────────────┘
Performance (per Cilium benchmarks)
- P99 latency 2-3x better vs sidecars
- CPU usage reduced 40%+
- Near-zero memory overhead per connection
Features
- L3-L7 NetworkPolicy (HTTP method/path; see the sketch after this list)
- Hubble — flow logs, service map
- Tetragon — runtime security
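A sketch of such an L7 policy, with illustrative labels and port; eBPF enforces the L3/L4 part in-kernel and hands only matching flows to the per-node Envoy for the HTTP rules:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: reviews-l7
spec:
  endpointSelector:
    matchLabels:
      app: reviews              # policy applies to these Pods
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: productpage    # only this caller is allowed
      toPorts:
        - ports:
            - port: "9080"
              protocol: TCP
          rules:
            http:               # L7 rules pull in the per-node proxy
              - method: GET
                path: "/reviews/.*"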
8. Linkerd — The Minimalist Philosophy
Linkerd 2.0 (2018) uses a Rust proxy, linkerd2-proxy.
- Not Envoy — a custom proxy
- Memory: 10-20MB (roughly 1/10 of Envoy)
- Five solid features: mTLS, retry/timeout, metrics, load balancing (EWMA), service profiles
Philosophy: "Don't bloat the mesh — less ops burden." No WASM, no fancy CRDs.
Target: under ~100 microservices, Istio feels heavy, "I just want mTLS and retries for free."
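Adoption matches the philosophy: one annotation and the proxy injector does the rest. A sketch, assuming namespace-wide injection (namespace name illustrative):

apiVersion: v1
kind: Namespace
metadata:
  name: shop
  annotations:
    linkerd.io/inject: enabled   # the injector webhook adds linkerd2-proxy to new Pods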
9. Selection Guide
Do you need a Service Mesh?
├── < 10 services → No. Libraries are fine.
├── Single language → Use a language-native RPC lib.
└── Mixed langs + policy complexity → candidate.
Which mesh?
├── Already on Istio → consider Ambient migration
├── Large (1000+) + perf → Cilium
├── Minimalism + Rust trust → Linkerd
└── Rich features + community → Istio (Sidecar or Ambient)
2025 Landscape
- Netflix, Uber, Airbnb — own libraries/proxies (controlled language surface)
- Cloud-native startups — Cilium rising fast
- Enterprises (finance/telecom) — Istio for stability
- Small teams — Linkerd or nothing
10. Practical Tuning
Connection Pool
trafficPolicy:
  connectionPool:
    tcp:
      maxConnections: 1000
      connectTimeout: 5s
    http:
      http1MaxPendingRequests: 1024
      http2MaxRequests: 10000
      maxRequestsPerConnection: 10000
Circuit Breaker
outlierDetection:
  consecutive5xxErrors: 5
  interval: 30s
  baseEjectionTime: 60s
  maxEjectionPercent: 50
The 50% rule matters — otherwise all endpoints get ejected and the service dies.
Retry Budget
retries:
  attempts: 3
  perTryTimeout: 2s
  retryOn: 5xx,connect-failure,refused-stream
Retries amplify load. Combine with circuit breakers.
11. Observability — The Hidden Gift
Service Mesh automatically provides:
- istio_requests_total — counter by source/destination/response code
- istio_request_duration_milliseconds — latency histogram
- istio_tcp_sent_bytes_total / istio_tcp_received_bytes_total — TCP byte counters
Plus tracing headers (x-request-id and the B3 set: x-b3-traceid, x-b3-spanid) for distributed tracing. The mesh emits spans, but the app must propagate these headers or the trace breaks.
Access logs: Envoy writes JSON/custom format per request — Loki or Elasticsearch gives you "show me all 5xx" queries.
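Access logging and trace sampling are both controlled through Istio's Telemetry API. A mesh-wide sketch; the 1% trace sampling rate is an illustrative starting point:

apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system          # root namespace = mesh-wide defaults
spec:
  accessLogging:
    - providers:
        - name: envoy              # built-in Envoy access-log provider
  tracing:
    - randomSamplingPercentage: 1.0   # sample 1% of requests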
12. Pitfalls and Anti-Patterns
- Mixed mTLS modes — STRICT and PERMISSIVE mixed is debugging hell. Unify by namespace.
- Confusing Ingress Gateway with Sidecar — different lifecycle and tuning.
- Injecting into all services — Redis/Kafka sidecars hurt more than they help (L7 features meaningless for TCP, connection reuse broken).
- Control plane fat-fingers — a bad VirtualService breaks every Envoy. Use revision-based canary control planes (see the sketch after this list).
- WASM filter cost — an EnvoyFilter with a WASM module adds several ms of latency. Test under prod-like load before rollout.
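For the control-plane canary: namespaces opt into a specific istiod revision via a label, so a new control plane rolls out namespace by namespace. A sketch with an illustrative revision name:

apiVersion: v1
kind: Namespace
metadata:
  name: default
  labels:
    istio.io/rev: "1-24-1"   # points injection at the istiod revision named 1-24-1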
13. The Future — Will Service Mesh Disappear?
Predictions:
- eBPF supplants sidecars — Cilium's direction, already reality
- Absorbed into platform engineering — users get policies without knowing about "mesh"
- gRPC client-side LB returns — apps consume xDS directly (Google's internal way)
- Gateway API standardization — GAMMA initiative unifies mesh CRDs
The common theme: "the sidecar pattern was a five-year transitional design." But the problems it solved — language-neutral policy, automatic mTLS, observability — aren't going away. Only the implementation evolves.
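On the last point: under GAMMA, the same HTTPRoute used at the edge can describe mesh traffic by attaching to a Service instead of a Gateway. A sketch, with illustrative backend Service names and the 90/10 split from earlier:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: reviews-split
spec:
  parentRefs:
    - group: ""                # core API group
      kind: Service            # attaching to a Service marks this as a mesh route
      name: reviews
      port: 9080
  rules:
    - backendRefs:
        - name: reviews-v1     # 90% of traffic
          port: 9080
          weight: 90
        - name: reviews-v2     # 10% canary
          port: 9080
          weight: 10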
14. 12-Point Field Checklist
- Ask five times if you really need a mesh
- New adoption? Pick Ambient over Sidecar
- Check CNI compatibility — already on Cilium? Consider Cilium Mesh
- mTLS: start PERMISSIVE, graduate to STRICT
- Set resource requests/limits — sidecar OOM kills the service
- Enable Telemetry v2 — no perf regression
- Match Envoy and Istio versions
- Run istioctl analyze in CI
- Gateway on a separate NodePool
- Sample access logs — 100% is a log-cost bomb
- Configure retries + circuit breaker as a pair
- Upgrade via revision-based canary — no big-bang data plane restart
Next — OpenTelemetry: Closing the Observability Loop
Service Mesh sprays metrics and traces automatically, but in real operations, application and infrastructure traces must form one chain. Next post: OpenTelemetry's birth (OpenTracing + OpenCensus), Span/Trace/Context Propagation semantics, the Collector architecture (Receiver/Processor/Exporter), head vs tail sampling, three pillars unified (logs + metrics + traces), profiles as a fourth pillar (Pyroscope/Parca), eBPF auto-instrumentation, and OTLP internals.
"Observability isn't log collection. It's the design philosophy that lets a distributed system explain its own state."