Service Mesh Deep Dive — Envoy, Istio, Linkerd, Cilium eBPF, Ambient Mesh, xDS/mTLS

Intro — "Answer or Fad?"

Around 2017 the consensus was "microservices require a Service Mesh." By 2021 people asked "100MB per sidecar — is this really efficient?" By 2023-2024, Istio Ambient Mesh and Cilium's sidecar-less mesh arrived in force.

What happened in eight years? This post covers:

  • Why a Service Mesh was needed — pains born of microservices
  • How Envoy works — listener, filter, cluster, xDS
  • Istio's control plane — what istiod actually does
  • How mTLS becomes automatic
  • Ambient Mesh — Istio without sidecars
  • Cilium eBPF — L7 handled in the kernel
  • Linkerd's Rust sidecar — the minimalist school
  • Selection criteria in the field

1. Why Service Meshes Were Born

Splitting a monolith into 100 microservices creates:

  1. Retries — how many, with what backoff?
  2. Timeouts — how long to wait?
  3. Circuit breaking — fail fast when downstream keeps erroring
  4. Load balancing — round robin? Least request?
  5. Traffic splitting — route 10% to the new version
  6. mTLS — encrypt and authenticate between services
  7. Observability — trace IDs, latency histograms per call
  8. AuthN/AuthZ — which service can call which?

From 2015-2017, teams solved this with per-language libraries (Netflix Hystrix, Ribbon, Eureka, all JVM-based). But in polyglot stacks with Go, Python, and Node, every language needed its own implementation, and a library upgrade meant redeploying every service.

The Answer: Outsource to a Network Proxy

If A calls B over HTTP/gRPC, a proxy in the middle can add features regardless of language. Lyft open-sourced Envoy in 2016, Google and IBM built Istio on top in 2017, and the Service Mesh era began.


2. Envoy — The Modern Network Proxy Standard

Why Envoy

  • Written in C++ — as fast as nginx, with a modern design
  • Native HTTP/2 and gRPC
  • Dynamic config (xDS) — no restart needed
  • Strong observability — stats, logs, traces built-in
  • Filter chains — extensible via WASM and Lua

The Four Core Concepts

┌─────────────────────────────────────────────────┐
│                  Envoy                          │
│                                                 │
│  ┌─────────┐    ┌──────────────┐    ┌────────┐ │
│  │Listener │ -> │ Filter Chain │ -> │ Cluster│ │
│  └─────────┘    └──────────────┘    └────────┘ │
└─────────────────────────────────────────────────┘
  • Listener — which port to listen on (L4)
  • Filter Chain — how to process incoming connections
  • Cluster — where to send traffic (upstream IP:Port list)
  • Endpoint — actual IPs within a Cluster

Filters are pluggable, so WAF, Rate Limit, and JWT validation can be inserted.

xDS API — The Heart of Dynamic Config

Envoy receives config from five discovery services:

  • LDS (Listener), RDS (Route), CDS (Cluster), EDS (Endpoint), SDS (Secret)

These are aggregated into ADS (Aggregated Discovery Service). Envoy connects to the control plane via a gRPC stream, and the control plane pushes config changes. Updates apply without downtime — this is how changes to a Kubernetes Service or Deployment propagate instantly.
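A minimal Envoy bootstrap wiring everything through ADS looks roughly like this sketch (the node id, cluster name, control-plane address, and port are illustrative assumptions, not Istio's actual generated config):

```yaml
node:
  id: sidecar-example          # identity Envoy reports to the control plane (assumed)
  cluster: demo
dynamic_resources:
  ads_config:                  # one gRPC stream carries all xDS resource types (ADS)
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
    - envoy_grpc:
        cluster_name: xds_cluster
  lds_config:                  # listeners arrive over the ADS stream...
    resource_api_version: V3
    ads: {}
  cds_config:                  # ...and so do clusters
    resource_api_version: V3
    ads: {}
static_resources:
  clusters:
  - name: xds_cluster          # bootstrap needs one static cluster: the control plane itself
    type: STRICT_DNS
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}   # xDS is gRPC, so HTTP/2 is required
    load_assignment:
      cluster_name: xds_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: istiod.istio-system.svc, port_value: 15010 }
```

Everything except that one static cluster is then delivered and updated live over the stream.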


3. Istio — The Most Widely Used Mesh

Istio 1.0's Complexity

Early Istio (1.0, 2018) split the control plane into separate components: Pilot, Mixer, Citadel, and (from 1.1) Galley. Mixer received a separate gRPC call per request and became a bottleneck; in 1.5 it was removed, with its telemetry work moved into the Envoy proxies themselves (Telemetry v2, implemented as in-proxy/WASM filters).

Today's Istio: Just istiod

┌──────────────────────────────────┐
│          istiod                  │
│  Pilot (xDS) + Citadel + Galley  │
└─────────────┬────────────────────┘
              │ xDS
       Envoy Sidecar
     (one per Pod)

How Sidecar Injection Works

A Mutating Admission Webhook.

  1. User creates a Pod in a namespace labeled istio-injection=enabled
  2. API Server calls Istio's Mutating Webhook
  3. Webhook injects the istio-proxy sidecar container
  4. An istio-init initContainer is also added to insert iptables rules

The iptables rules:

All outbound traffic  → REDIRECT to 127.0.0.1:15001 (Envoy)
All inbound traffic   → REDIRECT to 127.0.0.1:15006 (Envoy)

The app thinks it talks directly to the network, but iptables transparently intercepts into Envoy.
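Injection is armed per namespace by a plain Kubernetes label; a sketch with an assumed namespace name:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: demo                   # hypothetical namespace
  labels:
    istio-injection: enabled   # tells the mutating webhook to inject sidecars here
```

Any Pod created in this namespace afterward goes through steps 2-4 above automatically.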

VirtualService and DestinationRule

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts: [reviews]
  http:
  - match:
    - headers:
        user-agent: { regex: ".*Mobile.*" }
    route:
    - destination: { host: reviews, subset: v2 }
  - route:
    - destination: { host: reviews, subset: v1 }
      weight: 90
    - destination: { host: reviews, subset: v2 }
      weight: 10
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    connectionPool:
      tcp: { maxConnections: 100 }
      http: { http1MaxPendingRequests: 10 }
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
  subsets:
  - name: v1
    labels: { version: v1 }
  - name: v2
    labels: { version: v2 }

4. mTLS — How "Automatic Encryption" Actually Works

Certificate Flow

  1. istiod acts as its own CA with a root key
  2. Each Envoy sidecar sends a CSR to istiod on startup
  3. The CSR carries the Pod's Service Account identity
  4. istiod validates and returns a signed cert (typically 24h validity)
  5. Envoy fetches a fresh cert before expiry via SDS, with no proxy restart

The cert SAN contains a SPIFFE ID:

spiffe://cluster.local/ns/default/sa/reviews

Handshake

Envoy A connects to Envoy B: TLS ClientHello, B presents its cert, A validates against istiod's root CA. B also demands A's cert (the "m" in mTLS). Both sides get an authenticated SPIFFE identity, which AuthorizationPolicy uses to enforce "SA X may call SA Y."
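An AuthorizationPolicy expressing "SA X may call SA Y" could look like this sketch (service names and namespace are assumptions; note that Istio principals drop the spiffe:// prefix):

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: reviews-allow-productpage   # hypothetical name
  namespace: default
spec:
  selector:
    matchLabels:
      app: reviews                  # policy attaches to the reviews workload
  action: ALLOW
  rules:
  - from:
    - source:
        principals:                 # the SPIFFE identity proven during the mTLS handshake
        - "cluster.local/ns/default/sa/productpage"
    to:
    - operation:
        methods: ["GET"]
```

Because the identity comes from the verified client cert, not from an IP or a header, it cannot be spoofed by a compromised Pod elsewhere in the cluster.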

STRICT vs PERMISSIVE

apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
spec:
  mtls:
    mode: STRICT

Production target is STRICT. During migration, PERMISSIVE allows both plaintext and mTLS.


5. The Cost of Sidecars

Memory

50-150MB per Envoy. 1000 Pods = 100GB just in sidecars. Tiny apps suffer the most — "150MB sidecar for a 100MB app."

Startup Latency

The Pod must run the init container, boot the sidecar, and fetch its xDS config before the app is ready: typically 2-3s extra. Painful for Jobs and CronJobs.

Lifecycle Mismatch

App container exits but sidecar still runs, or vice versa. Kubernetes 1.28 introduced native sidecar containers (Alpha; Beta in 1.29), which helps, but the memory issue remains.

Double Hop

App A -> Envoy A -> Envoy B -> App B: every request crosses two extra L7 proxies, and the response crosses them both again on the way back.

Debugging Pain

A 503 appeared. App A? Envoy A? Network? Envoy B? App B? Five places to check.


6. Ambient Mesh — Istio's Sidecar-less Revolution

Announced 2022, Beta in Istio 1.22 (May 2024), GA in Istio 1.24 (November 2024).

Core Idea: Split L4 and L7

Ztunnel (per-node L4 proxy) — one Rust proxy per Node; handles mTLS, AuthN, and L4 routing. Pod traffic is redirected via iptables/eBPF.

Waypoint Proxy (per-service L7 proxy) — deployed only for services that need L7 features; a regular Deployment that can autoscale.

┌──────────────────────────────┐
│         Node                 │
│  [App Pod]  [App Pod]        │
│      │         │             │
│      ▼         ▼             │
│    [Ztunnel] (L4 mTLS)       │
└──────┼───────────────────────┘
       ▼ (only when L7 needed)
  [Waypoint Proxy Pod]
  [Destination Service]
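Waypoints are declared as a Gateway API resource of class istio-waypoint; a sketch with assumed names:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: waypoint               # hypothetical name
  namespace: default
  labels:
    istio.io/waypoint-for: service   # this waypoint serves Services in the namespace
spec:
  gatewayClassName: istio-waypoint   # tells istiod to deploy a waypoint proxy
  listeners:
  - name: mesh
    port: 15008                # HBONE: the mTLS tunnel protocol ztunnel speaks
    protocol: HBONE
```

Workloads are then enrolled by labeling the namespace or Service with istio.io/use-waypoint pointing at the Gateway's name; everything without that label stays on the L4-only ztunnel path.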

Pros and Cons

Pros: less memory, gradual adoption (L4 only or add Waypoint), separated lifecycle.

Cons: younger, extra hop when Waypoint is used, tooling less mature.


7. Cilium Service Mesh — eBPF in the Kernel

eBPF in One Minute

eBPF is a Linux kernel virtual machine that safely runs bytecode at kernel hooks — networking (XDP, TC), syscall tracing, profiling, security. Extends the kernel without modules or reboots.

Cilium's Bet: Move the Proxy into the Kernel

Packets from a Pod are routed in-kernel without passing through a userspace proxy. kube-proxy is replaced (no iptables). L7 still uses Envoy, but only one per Node.

┌────────────────────────────────┐
│       Node                     │
│  [App Pod] --- (eBPF hook)     │
│                   │            │
│                   ▼            │
│         Kernel eBPF Program    │
│         (L3/L4 policy, LB)     │
│                   │            │
│                   ▼            │
│  [Envoy (per-node, L7 only)]   │
└────────────────────────────────┘

Performance (per Cilium benchmarks)

  • P99 latency 2-3x better vs sidecars
  • CPU usage reduced 40%+
  • Near-zero memory overhead per connection

Features

  • L3-L7 NetworkPolicy (HTTP method/path)
  • Hubble — flow logs, service map
  • Tetragon — runtime security
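An L3-L7 policy of the kind listed above, as a sketch assuming a reviews service on port 9080 called by productpage:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: reviews-allow-get      # hypothetical name
spec:
  endpointSelector:
    matchLabels:
      app: reviews             # policy applies to reviews Pods
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: productpage       # only productpage may connect (enforced in-kernel at L3/L4)
    toPorts:
    - ports:
      - port: "9080"
        protocol: TCP
      rules:
        http:                  # L7 part: only GETs on /reviews/* pass
        - method: GET
          path: "/reviews/.*"
```

The identity match and port filter run as eBPF in the kernel; only connections that need the HTTP rule are handed to the per-node Envoy.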

8. Linkerd — The Minimalist Philosophy

Linkerd 2.0 (2018) uses a Rust proxy, linkerd2-proxy.

  • Not Envoy — a custom proxy
  • Memory: 10-20MB (roughly 1/10 of Envoy)
  • Five solid features: mTLS, retry/timeout, metrics, load balancing (EWMA), service profiles

Philosophy: "Don't bloat the mesh — less ops burden." No WASM, no fancy CRDs.

Target: teams running under ~100 microservices, for whom Istio feels heavy and who "just want mTLS and retries for free."


9. Selection Guide

Do you need a Service Mesh?
├── < 10 services     → No. Libraries are fine.
├── Single language   → Use a language-native RPC lib.
└── Mixed langs + policy complexity → candidate.

Which mesh?
├── Already on Istio        → consider Ambient migration
├── Large (1000+) + perf    → Cilium
├── Minimalism + Rust trust → Linkerd
└── Rich features + community → Istio (Sidecar or Ambient)

2025 Landscape

  • Netflix, Uber, Airbnb — own libraries/proxies (controlled language surface)
  • Cloud-native startups — Cilium rising fast
  • Enterprises (finance/telecom) — Istio for stability
  • Small teams — Linkerd or nothing

10. Practical Tuning

Connection Pool

trafficPolicy:
  connectionPool:
    tcp:
      maxConnections: 1000
      connectTimeout: 5s
    http:
      http1MaxPendingRequests: 1024
      http2MaxRequests: 10000
      maxRequestsPerConnection: 10000

Circuit Breaker

outlierDetection:
  consecutive5xxErrors: 5
  interval: 30s
  baseEjectionTime: 60s
  maxEjectionPercent: 50

maxEjectionPercent matters: it caps how much of the cluster outlier detection may eject at once. Set it too high and a correlated failure can eject every endpoint, killing the service entirely.

Retry Budget

retries:
  attempts: 3
  perTryTimeout: 2s
  retryOn: 5xx,connect-failure,refused-stream

Retries amplify load. Combine with circuit breakers.


11. Observability — The Hidden Gift

Service Mesh automatically provides:

  • istio_requests_total — counter by source/destination/code
  • istio_request_duration_milliseconds — histogram
  • istio_tcp_sent_bytes_total / received_bytes_total

Plus tracing headers (Envoy's x-request-id and the B3 set: x-b3-traceid, x-b3-spanid, x-b3-sampled) for distributed tracing. The mesh emits spans, but the app must propagate these headers on its own outbound calls or the trace breaks.

Access logs: Envoy writes JSON/custom format per request — Loki or Elasticsearch gives you "show me all 5xx" queries.
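Access logging can be enabled mesh-wide through the Telemetry API; a sketch using Istio's built-in envoy log provider (resource name assumed):

```yaml
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: mesh-logging           # hypothetical name
  namespace: istio-system      # root namespace = applies mesh-wide
spec:
  accessLogging:
  - providers:
    - name: envoy              # built-in provider: Envoy's standard access log
```

Scoping the same resource to a workload namespace instead limits logging to that namespace, which pairs well with the log-sampling advice below.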


12. Pitfalls and Anti-Patterns

  1. Mixed mTLS modes — STRICT and PERMISSIVE mixed is debugging hell. Unify by namespace.
  2. Confusing Ingress Gateway with Sidecar — different lifecycle and tuning.
  3. Injecting into all services — Redis/Kafka sidecars hurt more than they help (L7 features meaningless for TCP, connection reuse broken).
  4. Control plane fat-fingers — a bad VirtualService breaks every Envoy. Use revision-based canary control planes.
  5. WASM filter cost — EnvoyFilter with WASM adds several ms latency. Simulate in prod-like load.

13. The Future — Will Service Mesh Disappear?

Predictions:

  1. eBPF supplants sidecars — Cilium's direction, already reality
  2. Absorbed into platform engineering — users get policies without knowing about "mesh"
  3. Proxyless gRPC returns — clients consume xDS directly, as Google runs it internally
  4. Gateway API standardization — GAMMA initiative unifies mesh CRDs

The common theme: "the sidecar pattern was a five-year transitional design." But the problems it solved — language-neutral policy, automatic mTLS, observability — aren't going away. Only the implementation evolves.


14. 12-Point Field Checklist

  1. Ask five times if you really need a mesh
  2. New adoption? Pick Ambient over Sidecar
  3. Check CNI compatibility — already on Cilium? Consider Cilium Mesh
  4. mTLS: start PERMISSIVE, graduate to STRICT
  5. Set resource requests/limits — sidecar OOM kills the service
  6. Enable Telemetry v2 — in-proxy telemetry with minimal overhead
  7. Match Envoy and Istio versions
  8. Run istioctl analyze in CI
  9. Gateway on a separate NodePool
  10. Sample access logs — 100% is a log-cost bomb
  11. Configure retries + circuit breaker as a pair
  12. Upgrade via revision-based canary — no big-bang data plane restart

Next — OpenTelemetry: Closing the Observability Loop

Service Mesh sprays metrics and traces automatically, but in real operations, application and infrastructure traces must form one chain. Next post: OpenTelemetry's birth (OpenTracing + OpenCensus), Span/Trace/Context Propagation semantics, the Collector architecture (Receiver/Processor/Exporter), head vs tail sampling, three pillars unified (logs + metrics + traces), profiles as a fourth pillar (Pyroscope/Parca), eBPF auto-instrumentation, and OTLP internals.

"Observability isn't log collection. It's the design philosophy that lets a distributed system explain its own state."
