Service Mesh Production Guide: mTLS, Traffic Management, and Observability with Istio, Envoy, and Linkerd

Introduction

As microservice architectures have proliferated, the complexity of service-to-service communication has grown dramatically. Implementing cross-cutting concerns such as authentication, encryption, traffic management, observability, and fault isolation directly in each service leads to widespread code duplication and a growing operational burden. A service mesh extracts these networking concerns into the infrastructure layer, providing consistent security, observability, and traffic control without application code changes.

This guide explains the core service mesh concepts of data plane and control plane, compares Istio (Envoy sidecar) with Linkerd, and covers mTLS configuration, traffic splitting, circuit breaking, observability tools, and the latest Ambient Mesh with practical examples. We also address common production failure scenarios and their resolutions.

Service Mesh Core Concepts

Data Plane and Control Plane

A service mesh consists of two main layers.

| Layer | Role | Implementation |
| ------------- | ----------------------------------------------------------------- | ---------------------------------------------- |
| Data Plane | Intercepts and handles all network traffic between services | Envoy (Istio), linkerd2-proxy (Linkerd) |
| Control Plane | Distributes proxy config, manages certificates, service discovery | Istiod (Istio), destination/identity (Linkerd) |

Istio vs Linkerd Comparison

| Aspect | Istio | Linkerd |
| -------------------- | ------------------------------------------- | ------------------------------------- |
| Proxy | Envoy (C++) | linkerd2-proxy (Rust) |
| Control Plane | Istiod (unified) | destination, identity, proxy-injector |
| Proxy Memory | ~50MB+ / sidecar | ~20-30MB / sidecar |
| Control Plane Memory | 1-2GB (production) | 200-300MB |
| L7 Features | Very rich (header routing, mirroring, etc.) | Core features focused |
| Learning Curve | Steep | Gentle |
| CRD Count | 50+ | ~10 |
| Ambient Mode | Supported (ztunnel + waypoint) | Not supported |
| Best For | Large-scale, complex traffic management | Small-to-mid scale, quick adoption |

Performance Overhead Comparison

```text
Benchmark results (at 2000 RPS)

P99 latency added:
  No mesh:        baseline
  Linkerd:        +2.0ms
  Istio Sidecar:  +5.8ms
  Istio Ambient:  +2.4ms

Resource usage (per sidecar):
  Envoy:          ~50MB RAM, ~0.5 vCPU
  linkerd2-proxy: ~20MB RAM, ~0.2 vCPU

At high load (12800 RPS), Istio Ambient recorded the lowest latency,
with a ~11ms difference at P99 compared to Linkerd
```
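To make the per-sidecar figures concrete, the arithmetic can be scaled to a whole cluster. The following is a minimal sketch that assumes one sidecar per pod and treats the quoted MB values as MiB; the `ProxyFootprint` type and `mesh_overhead` helper are illustrative names, not part of any mesh tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyFootprint:
    """Approximate per-sidecar cost, using the figures quoted above."""
    ram_mb: float
    vcpu: float


# Per-sidecar figures from the benchmark summary above.
ENVOY = ProxyFootprint(ram_mb=50, vcpu=0.5)
LINKERD2_PROXY = ProxyFootprint(ram_mb=20, vcpu=0.2)


def mesh_overhead(pod_count: int, proxy: ProxyFootprint) -> tuple[float, float]:
    """Return (total RAM in GiB, total vCPU), assuming one sidecar per pod."""
    total_ram_gib = pod_count * proxy.ram_mb / 1024
    total_vcpu = pod_count * proxy.vcpu
    return total_ram_gib, total_vcpu


if __name__ == "__main__":
    for name, proxy in (("Envoy", ENVOY), ("linkerd2-proxy", LINKERD2_PROXY)):
        ram, cpu = mesh_overhead(1000, proxy)
        print(f"{name}: {ram:.1f} GiB RAM, {cpu:.0f} vCPU for 1000 pods")
```

Actual usage depends heavily on traffic volume and on how much configuration each proxy receives, so treat the output as a planning estimate only.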

Istio Architecture and Configuration

Istio Installation

```bash
# Install istioctl (pin the version so the directory name below matches)
curl -L https://istio.io/downloadIstio | ISTIO_VERSION=1.24.0 sh -
cd istio-1.24.0
export PATH=$PWD/bin:$PATH

# Install with the default profile (the usual starting point for production)
istioctl install --set profile=default -y

# Enable automatic sidecar injection for the namespace
kubectl label namespace default istio-injection=enabled

# Verify installation
istioctl verify-install
kubectl get pods -n istio-system
```

VirtualService and DestinationRule

```yaml
# VirtualService: define traffic routing rules
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: reviews-route
  namespace: default
spec:
  hosts:
    - reviews
  http:
    - match:
        - headers:
            end-user:
              exact: beta-tester
      route:
        - destination:
            host: reviews
            subset: v2
          weight: 100
    - route:
        - destination:
            host: reviews
            subset: v1
          weight: 90
        - destination:
            host: reviews
            subset: v2
          weight: 10
```

```yaml
# DestinationRule: define service subsets and policies
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: reviews-destination
  namespace: default
spec:
  host: reviews
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: DEFAULT
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    loadBalancer:
      simple: ROUND_ROBIN
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
      trafficPolicy:
        loadBalancer:
          simple: LEAST_REQUEST
```
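The routing behavior of the `reviews-route` rules can be pictured as "first matching rule wins, then pick a subset by weight". Below is a rough simulation of that idea, not Envoy's implementation; the `RULES` structure and `pick_subset` function are hypothetical names introduced only for illustration.

```python
import random
from collections import Counter

# Mirrors the reviews-route VirtualService above: an ordered list of rules,
# each with an optional header match and a weighted list of subsets.
RULES = [
    {"match": {"end-user": "beta-tester"}, "route": [("v2", 100)]},
    {"match": None, "route": [("v1", 90), ("v2", 10)]},
]


def pick_subset(headers: dict[str, str], rng: random.Random) -> str:
    """Return the subset chosen for a request; the first matching rule wins."""
    for rule in RULES:
        match = rule["match"]
        if match is None or all(headers.get(k) == v for k, v in match.items()):
            subsets = [name for name, _ in rule["route"]]
            weights = [weight for _, weight in rule["route"]]
            return rng.choices(subsets, weights=weights, k=1)[0]
    raise LookupError("no rule matched")


if __name__ == "__main__":
    rng = random.Random(42)
    counts = Counter(pick_subset({}, rng) for _ in range(10_000))
    print(counts)  # roughly a 90/10 split between v1 and v2
    print(pick_subset({"end-user": "beta-tester"}, rng))
```

Rule order matters here: placing the catch-all rule first would shadow the header match, which is also true of the `http` list in a VirtualService.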

Traffic Splitting (Canary Deployment)

```yaml
# Canary deployment: gradually increase traffic to v2
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: my-service-canary
spec:
  hosts:
    - my-service
  http:
    - route:
        - destination:
            host: my-service
            subset: stable
          weight: 95
        - destination:
            host: my-service
            subset: canary
          weight: 5
```

```bash
# Gradual canary traffic increase script
# 5% → 10% → 25% → 50% → 100%
for weight in 10 25 50 100; do
  stable_weight=$((100 - weight))
  kubectl patch virtualservice my-service-canary --type=json \
    -p="[
      {\"op\":\"replace\",\"path\":\"/spec/http/0/route/0/weight\",\"value\":${stable_weight}},
      {\"op\":\"replace\",\"path\":\"/spec/http/0/route/1/weight\",\"value\":${weight}}
    ]"
  echo "Canary weight: ${weight}%, Stable weight: ${stable_weight}%"
  echo "Monitoring for 5 minutes..."
  sleep 300
done
```
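The escaped JSON in the shell loop is easy to get wrong, so it can help to generate the patch body programmatically. This sketch builds the same two JSON Patch operations for a given canary weight; `canary_patch` is an illustrative helper, and it assumes the stable route is at index 0 and the canary route at index 1, as in the VirtualService above.

```python
import json


def canary_patch(canary_weight: int) -> str:
    """Build the JSON Patch body for one canary step (weights sum to 100)."""
    if not 0 <= canary_weight <= 100:
        raise ValueError("canary_weight must be between 0 and 100")
    stable_weight = 100 - canary_weight
    ops = [
        {"op": "replace", "path": "/spec/http/0/route/0/weight", "value": stable_weight},
        {"op": "replace", "path": "/spec/http/0/route/1/weight", "value": canary_weight},
    ]
    return json.dumps(ops)


if __name__ == "__main__":
    for step in (10, 25, 50, 100):
        print(canary_patch(step))
```

The resulting string can be passed to `kubectl patch ... --type=json -p` in place of the hand-escaped payload.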

Circuit Breaker Configuration

```yaml
# Circuit breaker via DestinationRule
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: payment-service-cb
spec:
  host: payment-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 50
      http:
        http1MaxPendingRequests: 50
        http2MaxRequests: 100
        maxRequestsPerConnection: 10
        maxRetries: 3
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
      minHealthPercent: 30
```

```bash
# Check circuit breaker status
istioctl proxy-config cluster <pod-name> --fqdn payment-service.default.svc.cluster.local -o json | grep -A 20 "outlierDetection"

# Verify circuit breaker activity in Envoy stats
kubectl exec <pod-name> -c istio-proxy -- pilot-agent request GET stats | grep "circuit_breakers"
```
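The interaction between `consecutive5xxErrors` and `maxEjectionPercent` is easier to see in a toy model. The sketch below is a deliberately simplified simulation, not Envoy's actual algorithm: it ignores `interval`, `baseEjectionTime`, and `minHealthPercent`, and the `OutlierDetector` class is a hypothetical name.

```python
from dataclasses import dataclass, field


@dataclass
class OutlierDetector:
    """Simplified model of consecutive-5xx ejection with an ejection cap."""
    hosts: list[str]
    consecutive_5xx_errors: int = 5   # consecutive5xxErrors above
    max_ejection_percent: int = 50    # maxEjectionPercent above
    _streak: dict[str, int] = field(default_factory=dict)
    ejected: set[str] = field(default_factory=set)

    def record(self, host: str, status: int) -> None:
        """Record one response and eject the host if it crosses the threshold."""
        if status < 500:
            self._streak[host] = 0  # any non-5xx response resets the streak
            return
        self._streak[host] = self._streak.get(host, 0) + 1
        if self._streak[host] < self.consecutive_5xx_errors or host in self.ejected:
            return
        # Skip the ejection if the pool has already reached the cap.
        allowed = len(self.hosts) * self.max_ejection_percent // 100
        if len(self.ejected) < allowed:
            self.ejected.add(host)

    def healthy_hosts(self) -> list[str]:
        """Hosts still eligible to receive traffic."""
        return [h for h in self.hosts if h not in self.ejected]


if __name__ == "__main__":
    detector = OutlierDetector(hosts=["a", "b", "c", "d"])
    for host in ("a", "b", "c"):
        for _ in range(5):
            detector.record(host, 503)
    print(sorted(detector.ejected), detector.healthy_hosts())
```

With four hosts and a 50% cap, the third failing host stays in rotation, which illustrates why the cap protects against ejecting an entire pool during a widespread outage.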

mTLS Configuration and Security

Strict mTLS Enforcement

```yaml
# Apply Strict mTLS to entire namespace
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: default
spec:
  mtls:
    mode: STRICT
---
# Apply Strict mTLS mesh-wide (create in istio-system namespace)
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
```

Port Exclusions (Legacy Service Integration)

```yaml
# PERMISSIVE mode for specific ports on a service
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: legacy-integration
  namespace: default
spec:
  selector:
    matchLabels:
      app: legacy-adapter
  mtls:
    mode: STRICT
  portLevelMtls:
    8080:
      mode: PERMISSIVE
```
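When mesh-wide, namespace-wide, workload, and port-level settings coexist, the most specific one applies. The following sketch models that precedence under the assumption that unset levels simply inherit from the next broader one; `effective_mtls_mode` is an illustrative function, not an Istio API.

```python
from typing import Optional


def effective_mtls_mode(
    port: int,
    mesh_mode: Optional[str] = None,
    namespace_mode: Optional[str] = None,
    workload_mode: Optional[str] = None,
    port_level: Optional[dict[int, str]] = None,
) -> str:
    """Resolve the mTLS mode for one port, most specific setting first."""
    if port_level and port in port_level:
        return port_level[port]
    for mode in (workload_mode, namespace_mode, mesh_mode):
        if mode is not None:
            return mode
    return "PERMISSIVE"  # fallback when no PeerAuthentication applies


if __name__ == "__main__":
    # The legacy-adapter policy above: STRICT, except port 8080.
    for port in (8080, 9090):
        mode = effective_mtls_mode(
            port,
            mesh_mode="STRICT",
            workload_mode="STRICT",
            port_level={8080: "PERMISSIVE"},
        )
        print(port, mode)
```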

Certificate Management and SPIFFE

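```bash
# Check the mTLS policies and settings that apply to a pod
# (the older `istioctl authn tls-check` command is no longer available)
istioctl x describe pod <pod-name>

# View certificate information
istioctl proxy-config secret <pod-name> -o json

# SPIFFE ID format: spiffe://cluster.local/ns/NAMESPACE/sa/SERVICE_ACCOUNT
# Istio automatically assigns SPIFFE IDs based on Kubernetes service accounts

# Check certificate expiry (default 24h, auto-renewed)
# Workload certificates are delivered over SDS rather than mounted as files,
# so decode the certificate from the proxy's secret dump
istioctl proxy-config secret <pod-name> -o json \
  | jq -r '.dynamicActiveSecrets[0].secret.tlsCertificate.certificateChain.inlineBytes' \
  | base64 --decode \
  | openssl x509 -noout -dates

# Force issuance of a new workload certificate (for debugging)
# Note: istio-ca-root-cert is a ConfigMap holding the root certificate,
# not the workload certificate. Recreating the pod makes its proxy
# request a fresh certificate from Istiod.
kubectl delete pod <pod-name>
```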
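The SPIFFE ID format shown above is regular enough to parse directly, which is handy when mapping certificate identities to AuthorizationPolicy principals. This is a small sketch that handles only the `ns/.../sa/...` layout; `WorkloadIdentity`, `parse_spiffe_id`, and `to_principal` are hypothetical helper names.

```python
from typing import NamedTuple


class WorkloadIdentity(NamedTuple):
    trust_domain: str
    namespace: str
    service_account: str


def parse_spiffe_id(spiffe_id: str) -> WorkloadIdentity:
    """Split spiffe://TRUST_DOMAIN/ns/NAMESPACE/sa/SERVICE_ACCOUNT into parts."""
    prefix = "spiffe://"
    if not spiffe_id.startswith(prefix):
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    parts = spiffe_id[len(prefix):].split("/")
    if len(parts) != 5 or parts[1] != "ns" or parts[3] != "sa" or not all(parts):
        raise ValueError(f"unexpected SPIFFE ID layout: {spiffe_id!r}")
    return WorkloadIdentity(parts[0], parts[2], parts[4])


def to_principal(identity: WorkloadIdentity) -> str:
    """Format the identity as a principal string (no spiffe:// prefix)."""
    return f"{identity.trust_domain}/ns/{identity.namespace}/sa/{identity.service_account}"


if __name__ == "__main__":
    ident = parse_spiffe_id("spiffe://cluster.local/ns/default/sa/order-service")
    print(ident)
    print(to_principal(ident))
```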

Authorization Policy (Access Control)

```yaml
# Allow access only from specific services
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: payment-access
  namespace: default
spec:
  selector:
    matchLabels:
      app: payment-service
  rules:
    - from:
        - source:
            principals:
              - cluster.local/ns/default/sa/order-service
              - cluster.local/ns/default/sa/checkout-service
      to:
        - operation:
            methods: ['POST', 'GET']
            paths: ['/api/v1/payments/*']
---
# Deny all access (default deny policy)
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: deny-all
  namespace: default
spec: {}
```
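To reason about what the `payment-access` policy permits, it can help to model ALLOW semantics: a request passes only if some rule matches its principal, method, and path, and a policy with no rules matches nothing. The sketch below is a simplified evaluator covering only those three fields; `Rule`, `path_matches`, and `is_allowed` are illustrative names, not Istio code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    principals: tuple[str, ...]
    methods: tuple[str, ...]
    paths: tuple[str, ...]


# The single rule from the payment-access policy above.
PAYMENT_ACCESS = [
    Rule(
        principals=(
            "cluster.local/ns/default/sa/order-service",
            "cluster.local/ns/default/sa/checkout-service",
        ),
        methods=("POST", "GET"),
        paths=("/api/v1/payments/*",),
    )
]


def path_matches(pattern: str, path: str) -> bool:
    """Exact, prefix ('/a/*'), suffix ('*/a'), or presence ('*') matching."""
    if pattern == "*":
        return bool(path)
    if pattern.endswith("*"):
        return path.startswith(pattern[:-1])
    if pattern.startswith("*"):
        return path.endswith(pattern[1:])
    return pattern == path


def is_allowed(rules: list[Rule], principal: str, method: str, path: str) -> bool:
    """ALLOW semantics: permit only if at least one rule matches every field."""
    return any(
        principal in rule.principals
        and method in rule.methods
        and any(path_matches(p, path) for p in rule.paths)
        for rule in rules
    )


if __name__ == "__main__":
    caller = "cluster.local/ns/default/sa/order-service"
    print(is_allowed(PAYMENT_ACCESS, caller, "POST", "/api/v1/payments/123"))
    print(is_allowed([], caller, "POST", "/api/v1/payments/123"))  # deny-all
```

Note that the prefix pattern `/api/v1/payments/*` does not cover `/api/v1/payments` without the trailing slash, which is a common source of unexpected denials.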

Ambient Mesh (Sidecar-less Mode)

Ambient Mesh Architecture

Ambient Mesh eliminates sidecars and provides mesh functionality through two layers.

| Layer | Component | Functionality |
| ------------------- | ------------------------------ | -------------------------------------------- |
| L4 (Secure Overlay) | ztunnel (DaemonSet per node) | mTLS, L4 authorization, L4 telemetry |
| L7 (Waypoint) | waypoint proxy (per namespace) | HTTP routing, L7 authorization, L7 telemetry |

```bash
# Install Istio with Ambient mode
istioctl install --set profile=ambient -y

# Add namespace to Ambient mesh
kubectl label namespace default istio.io/dataplane-mode=ambient

# Verify ztunnel DaemonSet
kubectl get pods -n istio-system -l app=ztunnel

# Deploy Waypoint proxy for L7 features
istioctl waypoint apply --namespace default --name default-waypoint

# Verify Waypoint proxy
kubectl get pods -n default -l istio.io/gateway-name=default-waypoint
```

Ambient vs Sidecar Comparison

```text
Sidecar mode:
  Pros: Full L7 control, mature ecosystem
  Cons: Per-pod proxy overhead, restart required
  Resources: ~50MB RAM + ~0.5 vCPU / pod

Ambient mode:
  Pros: No pod restart needed, lower resource overhead
  Cons: L7 requires waypoint, relatively newer technology
  Resources: ztunnel ~30MB RAM / node + shared waypoint

Selection criteria:
  Migrating existing workloads → Consider Ambient first
  Fine-grained L7 control needed → Keep Sidecar
  Resource savings priority → Ambient
  Stability priority → Sidecar (more mature)
```

Observability

Kiali Dashboard

```bash
# Install Kiali (Istio addon)
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/addons/kiali.yaml

# Access Kiali dashboard
istioctl dashboard kiali

# Information Kiali provides:
# - Service-to-service traffic flow graph
# - Request success rate / error rate
# - P50/P90/P99 latency
# - mTLS status (lock icon)
# - Istio configuration validation (error highlighting)
```

Distributed Tracing (Jaeger/Zipkin)

```bash
# Install Jaeger
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/addons/jaeger.yaml

# Access Jaeger dashboard
istioctl dashboard jaeger

# Applications MUST forward trace headers on their outbound requests.
# The following headers must be forwarded:
#   x-request-id
#   x-b3-traceid
#   x-b3-spanid
#   x-b3-parentspanid
#   x-b3-sampled
#   x-b3-flags
#   traceparent
#   tracestate
```

```python
# Python Flask trace header propagation example
import requests
from flask import Flask, request

app = Flask(__name__)

# Headers that carry trace context between services
TRACE_HEADERS = [
    'x-request-id',
    'x-b3-traceid',
    'x-b3-spanid',
    'x-b3-parentspanid',
    'x-b3-sampled',
    'x-b3-flags',
    'traceparent',
    'tracestate',
]


def propagate_headers():
    """Copy trace headers from the incoming request for use on outbound calls."""
    headers = {}
    for header in TRACE_HEADERS:
        value = request.headers.get(header)
        if value:
            headers[header] = value
    return headers


@app.route('/api/orders')
def get_orders():
    # Propagate trace headers when calling downstream services
    headers = propagate_headers()
    response = requests.get(
        'http://payment-service:8080/api/payments',
        headers=headers,
        timeout=5,  # avoid hanging indefinitely if the service is unresponsive
    )
    return response.json()
```

Prometheus + Grafana Metrics

```bash
# Install Prometheus and Grafana
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/addons/prometheus.yaml
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/addons/grafana.yaml

# Access Grafana dashboard
istioctl dashboard grafana

# Key metrics automatically collected by Istio:
#   istio_requests_total                - Total request count
#   istio_request_duration_milliseconds - Request latency
#   istio_request_bytes                 - Request size
#   istio_response_bytes                - Response size
#   istio_tcp_connections_opened_total  - TCP connection count
```

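```promql
# Example Prometheus queries

# Error rate by service (5xx)
# Aggregate both sides by service; dividing the raw series would only pair
# series with identical labels (including response_code) and not yield a ratio
sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
/
sum(rate(istio_requests_total[5m])) by (destination_service_name)

# P99 latency
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket[5m]))
  by (le, destination_service_name))
```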
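The P99 query relies on estimating a quantile from cumulative histogram buckets. The sketch below shows the linear-interpolation idea behind that estimate; it is a simplified stand-in for PromQL's `histogram_quantile`, not a reimplementation of every edge case, and the bucket values in the example are made up for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # quantile falls in the +Inf bucket
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            width = upper_bound - lower_bound
            return lower_bound + width * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, count
    return lower_bound


if __name__ == "__main__":
    # Hypothetical latency buckets in milliseconds: (le, cumulative count).
    sample = [(10, 50), (50, 90), (100, 99), (math.inf, 100)]
    for q in (0.5, 0.95, 0.99):
        print(q, histogram_quantile(q, sample))
```

Because the estimate interpolates within a bucket, its accuracy is bounded by bucket width, so wide buckets around the latency range of interest produce coarse P99 values.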

Failure Scenarios and Responses

Scenario 1: Sidecar Injection Failure

```bash
# Symptom: Pod is Running but sidecar (istio-proxy) is missing

# 1. Check namespace labels
kubectl get namespace default --show-labels
# Verify istio-injection=enabled label exists

# 2. Check container list in pod
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].name}'
# If istio-proxy is not listed, injection failed

# 3. Diagnose root cause
# a. Missing namespace label
kubectl label namespace default istio-injection=enabled

# b. Pod has injection-disabled annotation
kubectl get pod <pod-name> -o jsonpath='{.metadata.annotations.sidecar\.istio\.io/inject}'
# "false" means injection is disabled

# c. Check webhook configuration
kubectl get mutatingwebhookconfiguration istio-sidecar-injector -o yaml

# 4. Manual injection (emergency)
istioctl kube-inject -f deployment.yaml | kubectl apply -f -

# 5. Restart pods (to apply injection)
kubectl rollout restart deployment <deployment-name>
```
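The first diagnostic steps above boil down to two inputs: the namespace's `istio-injection` label and the pod's `sidecar.istio.io/inject` setting. The sketch below encodes a simplified version of that decision, assuming a disabled setting on either level wins and that namespaces are not injected by default; `should_inject` is a hypothetical helper, and revision labels are not modeled.

```python
from typing import Optional


def should_inject(namespace_label: Optional[str], pod_setting: Optional[str]) -> bool:
    """Decide whether a sidecar is expected for a pod.

    namespace_label: value of istio-injection on the namespace, if any.
    pod_setting: value of sidecar.istio.io/inject on the pod, if any.
    """
    # An explicit opt-out at either level prevents injection.
    if namespace_label == "disabled" or pod_setting == "false":
        return False
    # Otherwise an explicit opt-in at either level enables it.
    return namespace_label == "enabled" or pod_setting == "true"


if __name__ == "__main__":
    cases = [("enabled", None), ("enabled", "false"), (None, None), (None, "true")]
    for ns_label, pod in cases:
        print(ns_label, pod, should_inject(ns_label, pod))
```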

Scenario 2: Certificate Rotation Failure

```bash
# Symptom: Service-to-service communication failure, TLS handshake errors

# 1. Check certificate status
istioctl proxy-config secret <pod-name>
# Verify VALID status and expiry time

# 2. Check Istiod logs for certificate errors
kubectl logs -n istio-system deployment/istiod | grep -i "certificate\|cert\|error"

# 3. Verify CA certificate
kubectl get secret istio-ca-secret -n istio-system -o jsonpath='{.data.ca-cert\.pem}' | base64 -d | openssl x509 -noout -dates

# 4. Force certificate renewal
# Restart the pod's istio-proxy
kubectl delete pod <pod-name>

# 5. Restart Istiod (if CA issue)
kubectl rollout restart deployment istiod -n istio-system

# 6. Root CA rotation (planned operation)
# Create new Root CA and gradually transition via intermediate CA
# Refer to official CA rotation guide
```

Scenario 3: Excessive Memory Usage (Envoy OOM)

```bash
# Symptom: istio-proxy container restarting with OOMKilled

# 1. Check current resource usage
kubectl top pod <pod-name> --containers

# 2. Check Envoy memory statistics (Envoy admin /memory endpoint)
kubectl exec <pod-name> -c istio-proxy -- pilot-agent request GET memory

# 3. Adjust resource limits
# A merge patch adds the annotation even if it is not present yet;
# a JSON Patch "replace" fails when the path does not exist
kubectl patch deployment <deployment-name> --type=merge \
  -p='{"spec":{"template":{"metadata":{"annotations":{"sidecar.istio.io/proxyMemoryLimit":"512Mi"}}}}}'

# 4. Global proxy resource settings (IstioOperator)
# Set in istio-operator.yaml (see below)
```

```yaml
spec:
  meshConfig:
    defaultConfig:
      proxyMetadata: {}
  values:
    global:
      proxy:
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
```

Operational Notes

1. **Gradual Adoption**: Do not apply the service mesh across the entire cluster at once. Start with non-critical workloads and expand incrementally. Begin with PERMISSIVE mTLS mode and transition to STRICT.

2. **Resource Budget**: Budget approximately 50MB RAM + 0.5 vCPU per pod for Envoy sidecars. This overhead can be substantial in large clusters.

3. **Trace Header Propagation**: Distributed tracing requires applications to propagate trace headers (x-b3-traceid, etc.). The service mesh does not do this automatically.

4. **CRD Management**: Istio uses 50+ CRDs. Always verify CRD compatibility during upgrades and use canary upgrades.

5. **Consider Ambient Mesh**: For new deployments, seriously evaluate Ambient Mesh. It provides L4 security immediately without sidecar overhead, and L7 features can be added via waypoint proxies only where needed.

6. **Istiod High Availability**: In production, run Istiod with at least 2 replicas and configure a Pod Disruption Budget.

```bash
# Scale Istiod replicas
# (if a HorizontalPodAutoscaler manages istiod, raise its minReplicas instead)
kubectl scale deployment istiod -n istio-system --replicas=3

# Configure PDB
kubectl apply -f - <<ENDF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: istiod-pdb
  namespace: istio-system
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: istiod
ENDF
```
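The reason for pairing multiple replicas with a PDB becomes clear from the arithmetic: a `minAvailable` budget only permits evictions while enough healthy pods remain. A minimal sketch of that calculation, using a hypothetical `allowed_disruptions` helper:

```python
def allowed_disruptions(healthy_replicas: int, min_available: int) -> int:
    """Voluntary evictions a minAvailable budget permits right now."""
    return max(0, healthy_replicas - min_available)


if __name__ == "__main__":
    for replicas in (1, 2, 3):
        print(replicas, allowed_disruptions(replicas, min_available=1))
```

With a single replica and `minAvailable: 1` the result is zero, which means a node drain would block on Istiod; running two or more replicas leaves room for voluntary disruptions without losing the control plane.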

Conclusion

A service mesh solves three core challenges in microservice environments at the infrastructure level: security, observability, and traffic management. Istio excels with rich features and fine-grained control, while Linkerd offers lightweight operation and quick adoption. The latest Ambient Mesh significantly lowers the barrier to service mesh adoption by eliminating sidecar overhead while still providing core security features.

The most important factor in production is gradual adoption. Start with PERMISSIVE mTLS to gain observability, verify stability, and then transition to STRICT mode. This phased approach is the key to success. The consistent observability and security that a service mesh provides will greatly reduce the complexity of operating microservices.

