Service Mesh Deep Dive — Envoy, Istio, Linkerd, Cilium eBPF, Ambient Mesh, xDS/mTLS

Intro — "Answer or Fad?"

Around 2017 the consensus was "microservices require a Service Mesh." By 2021 people asked "100MB per sidecar — is this really efficient?" By 2023-2024, Istio Ambient Mesh and Cilium's sidecar-less mesh arrived in force.

What happened in eight years? This post covers:

  • Why a Service Mesh was needed — pains born of microservices
  • How Envoy works — listener, filter, cluster, xDS
  • Istio's control plane — what istiod actually does
  • How mTLS becomes automatic
  • Ambient Mesh — Istio without sidecars
  • Cilium eBPF — L7 handled in the kernel
  • Linkerd's Rust sidecar — the minimalist school
  • Selection criteria in the field

1. Why Service Meshes Were Born

Splitting a monolith into 100 microservices creates:

  1. Retries — how many, with what backoff?
  2. Timeouts — how long to wait?
  3. Circuit breaking — fail fast when downstream keeps erroring
  4. Load balancing — round robin? Least request?
  5. Traffic splitting — route 10% to the new version
  6. mTLS — encrypt and authenticate between services
  7. Observability — trace IDs, latency histograms per call
  8. AuthN/AuthZ — which service can call which?

From 2015-2017, teams solved this with per-language libraries (Netflix Hystrix, Ribbon, Eureka, all JVM-based). But in polyglot stacks with Go, Python, and Node, every language needed its own implementation, and a library upgrade meant redeploying every service.

The Answer: Outsource to a Network Proxy

If A calls B over HTTP/gRPC, a proxy in the middle can add features regardless of language. Lyft open-sourced Envoy in 2016, Google and IBM built Istio on top in 2017, and the Service Mesh era began.


2. Envoy — The Modern Network Proxy Standard

Why Envoy

  • Written in C++ — as fast as nginx, with a modern design
  • Native HTTP/2 and gRPC
  • Dynamic config (xDS) — no restart needed
  • Strong observability — stats, logs, traces built-in
  • Filter chains — extensible via WASM and Lua

The Four Core Concepts

┌─────────────────────────────────────────────────┐
│                  Envoy                          │
│                                                 │
│  ┌─────────┐    ┌──────────────┐    ┌────────┐ │
│  │Listener │ -> │ Filter Chain │ -> │ Cluster│ │
│  └─────────┘    └──────────────┘    └────────┘ │
└─────────────────────────────────────────────────┘
  • Listener — which port to listen on (L4)
  • Filter Chain — how to process incoming connections
  • Cluster — where to send traffic (upstream IP:Port list)
  • Endpoint — actual IPs within a Cluster

Filters are pluggable, so WAF, Rate Limit, and JWT validation can be inserted.

xDS API — The Heart of Dynamic Config

Envoy receives config from five discovery services:

  • LDS (Listener), RDS (Route), CDS (Cluster), EDS (Endpoint), SDS (Secret)

These are aggregated into ADS (Aggregated Discovery Service). Envoy connects to the control plane via a gRPC stream, and the control plane pushes config changes. Updates apply without downtime — this is how changes to a Kubernetes Service or Deployment propagate instantly.
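A minimal Envoy bootstrap wiring everything through ADS looks roughly like this sketch (the node id, cluster name, control-plane address, and port are illustrative assumptions, not Istio's actual generated config):

```yaml
node:
  id: sidecar-example          # identity Envoy reports to the control plane (assumed)
  cluster: demo
dynamic_resources:
  ads_config:                  # one gRPC stream carries all xDS resource types (ADS)
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
    - envoy_grpc:
        cluster_name: xds_cluster
  lds_config:                  # listeners arrive over the ADS stream...
    resource_api_version: V3
    ads: {}
  cds_config:                  # ...and so do clusters
    resource_api_version: V3
    ads: {}
static_resources:
  clusters:
  - name: xds_cluster          # bootstrap needs one static cluster: the control plane itself
    type: STRICT_DNS
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}   # xDS is gRPC, so HTTP/2 is required
    load_assignment:
      cluster_name: xds_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: istiod.istio-system.svc, port_value: 15010 }
```

Everything except that one static cluster is then delivered and updated live over the stream.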


3. Istio — The Most Widely Used Mesh

Istio 1.0's Complexity

Early Istio (1.0, 2018) split the control plane into separate components: Pilot, Mixer, Citadel, and (from 1.1) Galley. Mixer received a separate gRPC call per request and became a bottleneck; in 1.5 it was removed, with its telemetry work moved into the Envoy proxies themselves (Telemetry v2, implemented as in-proxy/WASM filters).

Today's Istio: Just istiod

┌──────────────────────────────────┐
│          istiod                  │
│  Pilot (xDS) + Citadel + Galley  │
└─────────────┬────────────────────┘
              │ xDS
       Envoy Sidecar
     (one per Pod)

How Sidecar Injection Works

A Mutating Admission Webhook.

  1. User creates a Pod in a namespace labeled istio-injection=enabled
  2. API Server calls Istio's Mutating Webhook
  3. Webhook injects the istio-proxy sidecar container
  4. An istio-init initContainer is also added to insert iptables rules

The iptables rules:

All outbound traffic  → REDIRECT to 127.0.0.1:15001 (Envoy)
All inbound traffic   → REDIRECT to 127.0.0.1:15006 (Envoy)

The app thinks it talks directly to the network, but iptables transparently intercepts into Envoy.
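Injection is armed per namespace by a plain Kubernetes label; a sketch with an assumed namespace name:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: demo                   # hypothetical namespace
  labels:
    istio-injection: enabled   # tells the mutating webhook to inject sidecars here
```

Any Pod created in this namespace afterward goes through steps 2-4 above automatically.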

VirtualService and DestinationRule

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts: [reviews]
  http:
  - match:
    - headers:
        user-agent: { regex: ".*Mobile.*" }
    route:
    - destination: { host: reviews, subset: v2 }
  - route:
    - destination: { host: reviews, subset: v1 }
      weight: 90
    - destination: { host: reviews, subset: v2 }
      weight: 10
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    connectionPool:
      tcp: { maxConnections: 100 }
      http: { http1MaxPendingRequests: 10 }
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
  subsets:
  - name: v1
    labels: { version: v1 }
  - name: v2
    labels: { version: v2 }

4. mTLS — How "Automatic Encryption" Actually Works

Certificate Flow

  1. istiod acts as its own CA with a root key
  2. Each Envoy sidecar sends a CSR to istiod on startup
  3. The CSR carries the Pod's Service Account identity
  4. istiod validates and returns a signed cert (typically 24h validity)
  5. Envoy fetches a fresh cert before expiry via SDS, with no proxy restart

The cert SAN contains a SPIFFE ID:

spiffe://cluster.local/ns/default/sa/reviews

Handshake

Envoy A connects to Envoy B: TLS ClientHello, B presents its cert, A validates against istiod's root CA. B also demands A's cert (the "m" in mTLS). Both sides get an authenticated SPIFFE identity, which AuthorizationPolicy uses to enforce "SA X may call SA Y."
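An AuthorizationPolicy expressing "SA X may call SA Y" could look like this sketch (service names and namespace are assumptions; note that Istio principals drop the spiffe:// prefix):

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: reviews-allow-productpage   # hypothetical name
  namespace: default
spec:
  selector:
    matchLabels:
      app: reviews                  # policy attaches to the reviews workload
  action: ALLOW
  rules:
  - from:
    - source:
        principals:                 # the SPIFFE identity proven during the mTLS handshake
        - "cluster.local/ns/default/sa/productpage"
    to:
    - operation:
        methods: ["GET"]
```

Because the identity comes from the verified client cert, not from an IP or a header, it cannot be spoofed by a compromised Pod elsewhere in the cluster.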

STRICT vs PERMISSIVE

apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
spec:
  mtls:
    mode: STRICT

Production target is STRICT. During migration, PERMISSIVE allows both plaintext and mTLS.


5. The Cost of Sidecars

Memory

50-150MB per Envoy. 1000 Pods = 100GB just in sidecars. Tiny apps suffer the most — "150MB sidecar for a 100MB app."

Startup Latency

The Pod must run the init container, boot the sidecar, and fetch its xDS config before the app is ready: typically 2-3s extra. Painful for Jobs and CronJobs.

Lifecycle Mismatch

App container exits but sidecar still runs, or vice versa. Kubernetes 1.28 introduced native sidecar containers (Alpha; Beta in 1.29), which helps, but the memory issue remains.

Double Hop

App A -> Envoy A -> Envoy B -> App B: every request crosses two extra L7 proxies, and the response crosses them both again on the way back.

Debugging Pain

A 503 appeared. App A? Envoy A? Network? Envoy B? App B? Five places to check.


6. Ambient Mesh — Istio's Sidecar-less Revolution

Announced 2022, Beta in Istio 1.22 (May 2024), GA in Istio 1.24 (November 2024).

Core Idea: Split L4 and L7

Ztunnel (per-node L4 proxy) — one Rust proxy per Node; handles mTLS, AuthN, and L4 routing. Pod traffic is redirected via iptables/eBPF.

Waypoint Proxy (per-service L7 proxy) — deployed only for services that need L7 features; a regular Deployment that can autoscale.

┌──────────────────────────────┐
│         Node                 │
│  [App Pod]  [App Pod]        │
│      │         │             │
│      ▼         ▼             │
│    [Ztunnel] (L4 mTLS)       │
└──────┼───────────────────────┘
       ▼ (only when L7 needed)
  [Waypoint Proxy Pod]
  [Destination Service]
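Waypoints are declared as a Gateway API resource of class istio-waypoint; a sketch with assumed names:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: waypoint               # hypothetical name
  namespace: default
  labels:
    istio.io/waypoint-for: service   # this waypoint serves Services in the namespace
spec:
  gatewayClassName: istio-waypoint   # tells istiod to deploy a waypoint proxy
  listeners:
  - name: mesh
    port: 15008                # HBONE: the mTLS tunnel protocol ztunnel speaks
    protocol: HBONE
```

Workloads are then enrolled by labeling the namespace or Service with istio.io/use-waypoint pointing at the Gateway's name; everything without that label stays on the L4-only ztunnel path.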

Pros and Cons

Pros: less memory, gradual adoption (L4 only or add Waypoint), separated lifecycle.

Cons: younger, extra hop when Waypoint is used, tooling less mature.


7. Cilium Service Mesh — eBPF in the Kernel

eBPF in One Minute

eBPF is a Linux kernel virtual machine that safely runs bytecode at kernel hooks — networking (XDP, TC), syscall tracing, profiling, security. Extends the kernel without modules or reboots.

Cilium's Bet: Move the Proxy into the Kernel

Packets from a Pod are routed in-kernel without passing through a userspace proxy. kube-proxy is replaced (no iptables). L7 still uses Envoy, but only one per Node.

┌────────────────────────────────┐
│       Node                     │
│  [App Pod] --- (eBPF hook)     │
│                   │            │
│                   ▼            │
│         Kernel eBPF Program    │
│         (L3/L4 policy, LB)     │
│                   │            │
│                   ▼            │
│  [Envoy (per-node, L7 only)]   │
└────────────────────────────────┘

Performance (per Cilium benchmarks)

  • P99 latency 2-3x better vs sidecars
  • CPU usage reduced 40%+
  • Near-zero memory overhead per connection

Features

  • L3-L7 NetworkPolicy (HTTP method/path)
  • Hubble — flow logs, service map
  • Tetragon — runtime security
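An L3-L7 policy of the kind listed above, as a sketch assuming a reviews service on port 9080 called by productpage:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: reviews-allow-get      # hypothetical name
spec:
  endpointSelector:
    matchLabels:
      app: reviews             # policy applies to reviews Pods
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: productpage       # only productpage may connect (enforced in-kernel at L3/L4)
    toPorts:
    - ports:
      - port: "9080"
        protocol: TCP
      rules:
        http:                  # L7 part: only GETs on /reviews/* pass
        - method: GET
          path: "/reviews/.*"
```

The identity match and port filter run as eBPF in the kernel; only connections that need the HTTP rule are handed to the per-node Envoy.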

8. Linkerd — The Minimalist Philosophy

Linkerd 2.0 (2018) uses a Rust proxy, linkerd2-proxy.

  • Not Envoy — a custom proxy
  • Memory: 10-20MB (roughly 1/10 of Envoy)
  • Five solid features: mTLS, retry/timeout, metrics, load balancing (EWMA), service profiles

Philosophy: "Don't bloat the mesh — less ops burden." No WASM, no fancy CRDs.

Target: teams running under ~100 microservices, for whom Istio feels heavy and who "just want mTLS and retries for free."


9. Selection Guide

Do you need a Service Mesh?
├── < 10 services     → No. Libraries are fine.
├── Single language   → Use a language-native RPC lib.
└── Mixed langs + policy complexity → candidate.

Which mesh?
├── Already on Istio        → consider Ambient migration
├── Large (1000+) + perf    → Cilium
├── Minimalism + Rust trust → Linkerd
└── Rich features + community → Istio (Sidecar or Ambient)

2025 Landscape

  • Netflix, Uber, Airbnb — own libraries/proxies (controlled language surface)
  • Cloud-native startups — Cilium rising fast
  • Enterprises (finance/telecom) — Istio for stability
  • Small teams — Linkerd or nothing

10. Practical Tuning

Connection Pool

trafficPolicy:
  connectionPool:
    tcp:
      maxConnections: 1000
      connectTimeout: 5s
    http:
      http1MaxPendingRequests: 1024
      http2MaxRequests: 10000
      maxRequestsPerConnection: 10000

Circuit Breaker

outlierDetection:
  consecutive5xxErrors: 5
  interval: 30s
  baseEjectionTime: 60s
  maxEjectionPercent: 50

maxEjectionPercent matters: it caps how much of the cluster outlier detection may eject at once. Set it too high and a correlated failure can eject every endpoint, killing the service entirely.

Retry Budget

retries:
  attempts: 3
  perTryTimeout: 2s
  retryOn: 5xx,connect-failure,refused-stream

Retries amplify load. Combine with circuit breakers.


11. Observability — The Hidden Gift

Service Mesh automatically provides:

  • istio_requests_total — counter by source/destination/code
  • istio_request_duration_milliseconds — histogram
  • istio_tcp_sent_bytes_total / received_bytes_total

Plus tracing headers (Envoy's x-request-id and the B3 set: x-b3-traceid, x-b3-spanid, x-b3-sampled) for distributed tracing. The mesh emits spans, but the app must propagate these headers on its own outbound calls or the trace breaks.

Access logs: Envoy writes JSON/custom format per request — Loki or Elasticsearch gives you "show me all 5xx" queries.
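Access logging can be enabled mesh-wide through the Telemetry API; a sketch using Istio's built-in envoy log provider (resource name assumed):

```yaml
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: mesh-logging           # hypothetical name
  namespace: istio-system      # root namespace = applies mesh-wide
spec:
  accessLogging:
  - providers:
    - name: envoy              # built-in provider: Envoy's standard access log
```

Scoping the same resource to a workload namespace instead limits logging to that namespace, which pairs well with the log-sampling advice below.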


12. Pitfalls and Anti-Patterns

  1. Mixed mTLS modes — STRICT and PERMISSIVE mixed is debugging hell. Unify by namespace.
  2. Confusing Ingress Gateway with Sidecar — different lifecycle and tuning.
  3. Injecting into all services — Redis/Kafka sidecars hurt more than they help (L7 features meaningless for TCP, connection reuse broken).
  4. Control plane fat-fingers — a bad VirtualService breaks every Envoy. Use revision-based canary control planes.
  5. WASM filter cost — EnvoyFilter with WASM adds several ms latency. Simulate in prod-like load.

13. The Future — Will Service Mesh Disappear?

Predictions:

  1. eBPF supplants sidecars — Cilium's direction, already reality
  2. Absorbed into platform engineering — users get policies without knowing about "mesh"
  3. Proxyless gRPC returns — clients consume xDS directly, as Google runs it internally
  4. Gateway API standardization — GAMMA initiative unifies mesh CRDs

The common theme: "the sidecar pattern was a five-year transitional design." But the problems it solved — language-neutral policy, automatic mTLS, observability — aren't going away. Only the implementation evolves.


14. 12-Point Field Checklist

  1. Ask five times if you really need a mesh
  2. New adoption? Pick Ambient over Sidecar
  3. Check CNI compatibility — already on Cilium? Consider Cilium Mesh
  4. mTLS: start PERMISSIVE, graduate to STRICT
  5. Set resource requests/limits — sidecar OOM kills the service
  6. Enable Telemetry v2 — in-proxy telemetry with minimal overhead
  7. Match Envoy and Istio versions
  8. Run istioctl analyze in CI
  9. Gateway on a separate NodePool
  10. Sample access logs — 100% is a log-cost bomb
  11. Configure retries + circuit breaker as a pair
  12. Upgrade via revision-based canary — no big-bang data plane restart

Next — OpenTelemetry: Closing the Observability Loop

Service Mesh sprays metrics and traces automatically, but in real operations, application and infrastructure traces must form one chain. Next post: OpenTelemetry's birth (OpenTracing + OpenCensus), Span/Trace/Context Propagation semantics, the Collector architecture (Receiver/Processor/Exporter), head vs tail sampling, three pillars unified (logs + metrics + traces), profiles as a fourth pillar (Pyroscope/Parca), eBPF auto-instrumentation, and OTLP internals.

"Observability isn't log collection. It's the design philosophy that lets a distributed system explain its own state."
