Service Mesh Complete Guide 2025: Istio vs Linkerd, mTLS, Traffic Management, Observability
Introduction: Why Service Mesh?
As microservices architecture has become the standard, environments with dozens to hundreds of services communicating over the network are now commonplace. Several problems recur in this complex inter-service communication landscape.
Security concerns: Without encryption, service-to-service communication is vulnerable to eavesdropping even within internal networks. Implementing TLS individually in every service and managing certificates is a massive operational burden.
Observability gaps: When requests traverse multiple services, identifying where latency occurs or which service returns errors becomes difficult.
Traffic control challenges: Advanced traffic management such as canary deployments, A/B testing, and circuit breakers must be implemented directly in application code.
A Service Mesh solves all of these problems at the infrastructure layer. Without modifying a single line of application code, you can transparently add security, observability, and traffic control at the network level.
1. Service Mesh Architecture
A Service Mesh consists of two main planes.
1.1 Data Plane
The data plane is the collection of proxies that handle actual service traffic. Deployed as a sidecar in each service Pod, they intercept all inbound and outbound traffic.
┌─────────────────────────────────────────────┐
│ Pod │
│ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Application │◄──►│ Sidecar Proxy │ │
│ │ Container │ │ (Envoy/linkerd2) │ │
│ └─────────────┘ └─────────────────────┘ │
└─────────────────────────────────────────────┘
Key responsibilities of the sidecar proxy:
- Transparently intercept all traffic (using iptables rules)
- Perform mTLS encryption and decryption
- Load balancing (round robin, least connections, etc.)
- Collect metrics and propagate distributed tracing headers
- Apply retries, timeouts, and circuit breaking
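To make the load-balancing bullet concrete, here is a minimal Python sketch of two strategies a sidecar proxy applies, round robin and least connections. The classes and endpoint addresses are illustrative, not any mesh's actual implementation.

```python
import itertools

class RoundRobinBalancer:
    """Cycle through endpoints in order, as a sidecar proxy might."""
    def __init__(self, endpoints):
        self._cycle = itertools.cycle(endpoints)

    def pick(self):
        return next(self._cycle)

class LeastConnectionsBalancer:
    """Pick the endpoint with the fewest in-flight connections."""
    def __init__(self, endpoints):
        self.active = {ep: 0 for ep in endpoints}

    def pick(self):
        ep = min(self.active, key=self.active.get)
        self.active[ep] += 1   # caller must call release() when done
        return ep

    def release(self, ep):
        self.active[ep] -= 1

rr = RoundRobinBalancer(["10.0.0.1", "10.0.0.2"])
print(rr.pick(), rr.pick(), rr.pick())  # 10.0.0.1 10.0.0.2 10.0.0.1
```

Real proxies track in-flight requests per upstream host in exactly this spirit, just with health checking and connection pooling layered on top.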
1.2 Control Plane
The control plane centrally manages and configures the data plane proxies.
Istio control plane (Istiod):
# Key functions managed by Istiod
- Service discovery: syncs the service list from the Kubernetes API
- Configuration distribution: converts VirtualService, DestinationRule, etc. into Envoy configuration
- Certificate management: issues and renews mTLS certificates (built-in CA)
- Policy enforcement: distributes AuthorizationPolicy and PeerAuthentication
Linkerd control plane:
# Linkerd control plane components
- destination: service discovery + policy distribution
- identity: mTLS certificate issuance (trust-anchor based)
- proxy-injector: automatic sidecar injection on Pod creation
- heartbeat: periodic telemetry reporting
2. Istio Deep Dive
2.1 Istio Architecture Overview
Istio is the most feature-rich Service Mesh. Co-developed by Google, IBM, and Lyft, it is now a CNCF graduated project.
# Install Istio (istioctl)
curl -L https://istio.io/downloadIstio | sh -
cd istio-1.24.0
export PATH=$PWD/bin:$PATH
# Profile-based installation
istioctl install --set profile=demo -y
# Enable automatic sidecar injection for the namespace
kubectl label namespace default istio-injection=enabled
2.2 Envoy Sidecar Proxy
Istio's data plane uses the Envoy proxy. Envoy is a high-performance L4/L7 proxy written in C++ that provides the following features.
# Envoy core features
- HTTP/1.1, HTTP/2, and gRPC support
- Automatic retries and circuit breaking
- Dynamic configuration updates (xDS API)
- Rich metrics and tracing
- WebAssembly (Wasm) extension support
- Hot restart (graceful restart)
Memory overhead is roughly 40-100MB per Pod, and CPU overhead is on the order of a few milliseconds per request.
2.3 VirtualService
VirtualService is the core resource for defining traffic routing rules in Istio.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: reviews-route
spec:
hosts:
- reviews
http:
# Canary deployment: 90% v1, 10% v2
- route:
- destination:
host: reviews
subset: v1
weight: 90
- destination:
host: reviews
subset: v2
weight: 10
timeout: 5s
retries:
attempts: 3
perTryTimeout: 2s
retryOn: 5xx,reset,connect-failure
2.4 DestinationRule
DestinationRule defines the policies applied to traffic after the routing decision has been made.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: reviews-destination
spec:
host: reviews
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
h2UpgradePolicy: DEFAULT
http1MaxPendingRequests: 100
http2MaxRequests: 1000
outlierDetection:
consecutive5xxErrors: 5
interval: 30s
baseEjectionTime: 30s
maxEjectionPercent: 50
loadBalancer:
simple: LEAST_REQUEST
subsets:
- name: v1
labels:
version: v1
- name: v2
labels:
version: v2
2.5 Gateway
An Istio Gateway manages traffic entering the mesh from outside.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
name: bookinfo-gateway
spec:
selector:
istio: ingressgateway
servers:
- port:
number: 443
name: https
protocol: HTTPS
tls:
mode: SIMPLE
credentialName: bookinfo-cert
hosts:
- "bookinfo.example.com"
2.6 PeerAuthentication
PeerAuthentication defines the mTLS policy between services.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: istio-system
spec:
# Apply STRICT mTLS across the entire mesh
mtls:
mode: STRICT
---
# PERMISSIVE mode for a specific namespace only
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: legacy-compat
namespace: legacy-apps
spec:
mtls:
mode: PERMISSIVE
2.7 AuthorizationPolicy
AuthorizationPolicy defines access control between services.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: reviews-viewer
namespace: default
spec:
selector:
matchLabels:
app: reviews
action: ALLOW
rules:
- from:
- source:
principals: ["cluster.local/ns/default/sa/productpage"]
to:
- operation:
methods: ["GET"]
paths: ["/reviews/*"]
3. Linkerd Deep Dive
3.1 Linkerd Architecture Overview
Linkerd is a Service Mesh focused on simplicity and low overhead. Developed by Buoyant, it is a CNCF graduated project.
# Install the Linkerd CLI
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | sh
export PATH=$HOME/.linkerd2/bin:$PATH
# Pre-flight checks
linkerd check --pre
# Install
linkerd install --crds | kubectl apply -f -
linkerd install | kubectl apply -f -
# Verify
linkerd check
# Viz extension (dashboard + metrics)
linkerd viz install | kubectl apply -f -
3.2 linkerd2-proxy: A Micro-Proxy Written in Rust
Linkerd's key differentiator is its data plane proxy. linkerd2-proxy is written in Rust, which brings the following advantages.
Performance comparison (linkerd2-proxy vs Envoy)
========================================
Memory usage:  ~20MB vs ~50-100MB
P99 latency:   ~1ms added vs ~2-5ms added
Binary size:   ~13MB vs ~50MB
Security:      Rust memory-safety guarantees
Feature scope: Service Mesh only vs general-purpose proxy
linkerd2-proxy stays lightweight by implementing only the features a Service Mesh needs. Since it is not a general-purpose proxy like Envoy, features such as Wasm extensions are absent, but it delivers excellent performance on the core functionality.
3.3 ServiceProfile
Linkerd's ServiceProfile defines per-service routing and observability settings.
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
name: webapp.default.svc.cluster.local
namespace: default
spec:
routes:
- name: GET /api/users
condition:
method: GET
pathRegex: /api/users
responseClasses:
- condition:
status:
min: 500
max: 599
isFailure: true
- name: POST /api/orders
condition:
method: POST
pathRegex: /api/orders
isRetryable: true
timeout: 10s
3.4 TrafficSplit (SMI)
Linkerd implements traffic splitting with the SMI (Service Mesh Interface) TrafficSplit standard. (The SMI project has since been archived; recent Linkerd releases favor Gateway API HTTPRoute for traffic splits.)
apiVersion: split.smi-spec.io/v1alpha4
kind: TrafficSplit
metadata:
name: webapp-split
namespace: default
spec:
service: webapp
backends:
- service: webapp-v1
weight: 900
- service: webapp-v2
weight: 100
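The weight values above are relative, so 900/100 yields a 90/10 split. A small Python sketch of the weighted selection a TrafficSplit implies (the function and backend names are hypothetical, not Linkerd code):

```python
import random

def pick_backend(backends, rng):
    """Weighted random choice over TrafficSplit-style backends.
    `backends` mirrors the YAML above: a list of (service, weight) pairs."""
    total = sum(weight for _, weight in backends)
    point = rng.uniform(0, total)
    cumulative = 0.0
    for service, weight in backends:
        cumulative += weight
        if point <= cumulative:
            return service
    return backends[-1][0]  # guard against float rounding

rng = random.Random(42)
backends = [("webapp-v1", 900), ("webapp-v2", 100)]
counts = {"webapp-v1": 0, "webapp-v2": 0}
for _ in range(10_000):
    counts[pick_backend(backends, rng)] += 1
# counts["webapp-v1"] lands close to 9,000: a 90/10 split
```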
3.5 Linkerd Multi-cluster
Linkerd supports multi-cluster communication natively.
# Install the multi-cluster extension
linkerd multicluster install | kubectl apply -f -
# Link a remote cluster
linkerd multicluster link --cluster-name=west \
--api-server-address="https://west.example.com:6443" | \
kubectl apply -f -
# Verify service mirroring
linkerd multicluster gateways
4. Istio vs Linkerd Detailed Comparison
| Dimension | Istio | Linkerd |
|---|---|---|
| Data plane proxy | Envoy (C++) | linkerd2-proxy (Rust) |
| Memory overhead (per Pod) | 50-100MB | 10-20MB |
| P99 latency added | 2-5ms | 0.5-1ms |
| Installation complexity | High (various profiles) | Low (single command) |
| CRD count | 50+ | Under 10 |
| Learning curve | Steep | Gentle |
| Traffic management | Very rich (VirtualService) | Basic (ServiceProfile) |
| Security policies | Fine-grained RBAC (AuthorizationPolicy) | Basic mTLS + Server/Authorization |
| Protocol support | HTTP, gRPC, TCP, WebSocket | HTTP, gRPC, TCP |
| Wasm extensions | Supported | Not supported |
| Multi-cluster | Supported (complex) | Supported (relatively simple) |
| Ambient Mesh | Supported (sidecar-less mode) | N/A |
| Gateway API | Full support | Partial support |
| Community size | Very large (CNCF graduated) | Large (CNCF graduated) |
| Operational complexity | High | Low |
| Best for | Large scale, complex policies | Small to medium scale, simplicity preferred |
Selection Criteria Summary
Choose Istio when:
- You need fine-grained traffic management (weighted routing, fault injection, traffic mirroring)
- You need complex security policies (JWT validation, external authorization)
- You need Wasm-based extension plugins
- You want to use Ambient Mesh (sidecar-less mode)
Choose Linkerd when:
- Minimizing resource overhead is a priority
- You want fast adoption and simple operations
- The core features (mTLS, metrics, retries) are sufficient
- Your operations team is small
5. mTLS (Mutual TLS)
5.1 How mTLS Works
In a Service Mesh, mTLS automatically encrypts service-to-service communication.
Service A (client)              Service B (server)
│                               │
│── ClientHello ───────────────►│
│◄─ ServerHello + server cert ──│
│── client certificate ────────►│
│◄─ certificate verified ───────│
│                               │
│◄════ encrypted traffic ══════►│
The key difference from regular TLS: in mTLS, both sides present and verify certificates, which lets the server confirm the client's identity as well.
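Python's standard ssl module shows the switch concisely: a server-side context performs plain TLS by default and becomes mutual only when told to require a client certificate. A minimal sketch (the certificate file paths in the comments are placeholders):

```python
import ssl

def mtls_server_context():
    """Server-side context that also verifies the client's certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Plain TLS default on the server side: verify_mode == ssl.CERT_NONE
    ctx.verify_mode = ssl.CERT_REQUIRED  # mTLS: demand a client certificate
    # A real deployment would also load key material, e.g.:
    #   ctx.load_cert_chain("server.crt", "server.key")
    #   ctx.load_verify_locations("ca.crt")  # CA that signed client certs
    return ctx

ctx = mtls_server_context()
print(ctx.verify_mode is ssl.CERT_REQUIRED)  # True
```

In a mesh, the sidecar proxy builds the equivalent of this context on both ends, with certificates issued and rotated by the control plane.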
5.2 The SPIFFE Identity Framework
Both Istio and Linkerd use the SPIFFE (Secure Production Identity Framework For Everyone) standard.
SPIFFE ID format:
spiffe://cluster.local/ns/NAMESPACE/sa/SERVICE_ACCOUNT
Examples:
spiffe://cluster.local/ns/production/sa/frontend
spiffe://cluster.local/ns/production/sa/backend-api
A SPIFFE ID maps to a Kubernetes ServiceAccount, proving a Pod's identity at the network level.
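The mapping is mechanical, which a tiny helper can illustrate; these functions are written for this article, not part of any SPIFFE SDK:

```python
def spiffe_id(namespace, service_account, trust_domain="cluster.local"):
    """Build the SPIFFE ID for a Kubernetes ServiceAccount."""
    return f"spiffe://{trust_domain}/ns/{namespace}/sa/{service_account}"

def parse_spiffe_id(uri):
    """Split a SPIFFE ID into (trust_domain, namespace, service_account)."""
    rest = uri.removeprefix("spiffe://")
    trust_domain, _, namespace, _, service_account = rest.split("/")
    return trust_domain, namespace, service_account

assert spiffe_id("production", "frontend") == \
    "spiffe://cluster.local/ns/production/sa/frontend"
```

Authorization policies like the AuthorizationPolicy examples later in this guide match against exactly these principals.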
5.3 Automatic Certificate Rotation
# Istio: certificate lifetime configuration (MeshConfig)
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
meshConfig:
defaultConfig:
# Workload certificates default to 24 hours
# Customizable via proxyMetadata
certificates: []
values:
pilot:
env:
# Maximum certificate lifetime
MAX_WORKLOAD_CERT_TTL: "48h"
# Default certificate lifetime
DEFAULT_WORKLOAD_CERT_TTL: "24h"
Linkerd certificate management:
# Create the trust anchor (10-year lifetime)
step certificate create root.linkerd.cluster.local ca.crt ca.key \
--profile root-ca --no-password --insecure --not-after=87600h
# Create the issuer certificate (48-hour lifetime, auto-renewed)
step certificate create identity.linkerd.cluster.local issuer.crt issuer.key \
--profile intermediate-ca --not-after=48h --no-password --insecure \
--ca ca.crt --ca-key ca.key
# Install with the certificates
linkerd install \
--identity-trust-anchors-file ca.crt \
--identity-issuer-certificate-file issuer.crt \
--identity-issuer-key-file issuer.key | kubectl apply -f -
6. Traffic Management
6.1 Canary Releases
# Istio - progressive canary deployment
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: reviews
spec:
hosts:
- reviews
http:
- match:
- headers:
x-canary-user:
exact: "true"
route:
- destination:
host: reviews
subset: v2
- route:
- destination:
host: reviews
subset: v1
weight: 95
- destination:
host: reviews
subset: v2
weight: 5
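Rule order in the VirtualService above matters: the header match is evaluated first, and only unmatched traffic falls through to the weighted split. A Python sketch of that decision (subset names and the header come from the YAML; the logic is simplified):

```python
import random

def route(headers, rng=None):
    """First rule: an exact header match always selects the canary subset.
    Fallback rule: a weighted 95/5 split, as in the VirtualService above."""
    rng = rng or random.Random()
    if headers.get("x-canary-user") == "true":
        return "reviews-v2"
    return "reviews-v2" if rng.random() < 0.05 else "reviews-v1"

# Header-matched requests always hit the canary subset
assert route({"x-canary-user": "true"}) == "reviews-v2"
```

This is why internal testers can be pinned to v2 with a header while only 5% of anonymous traffic reaches it.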
Automated canary analysis with Flagger:
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: reviews
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: reviews
service:
port: 9080
analysis:
interval: 1m
threshold: 5
maxWeight: 50
stepWeight: 10
metrics:
- name: request-success-rate
thresholdRange:
min: 99
interval: 1m
- name: request-duration
thresholdRange:
max: 500
interval: 1m
6.2 Traffic Mirroring (Shadow Traffic)
Mirroring sends a copy of production traffic to a new version so its behavior can be validated under real conditions.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: reviews-mirror
spec:
hosts:
- reviews
http:
- route:
- destination:
host: reviews
subset: v1
mirror:
host: reviews
subset: v2
mirrorPercentage:
value: 100.0
Key characteristics of mirroring:
- Responses to mirrored traffic are discarded (no impact on the client)
- A -shadow suffix is appended to the Host header of mirrored requests
- The new version's performance and error rate can be validated with real traffic
6.3 Fault Injection
Fault injection deliberately introduces failures to test the system's resilience.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: ratings-fault
spec:
hosts:
- ratings
http:
- fault:
delay:
percentage:
value: 10
fixedDelay: 5s
abort:
percentage:
value: 5
httpStatus: 503
route:
- destination:
host: ratings
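The fault block above can be read as two independent dice rolls per request. A Python sketch under that interpretation (the percentages mirror the YAML; the function itself is illustrative):

```python
import random
import time

def inject_fault(rng, delay_pct=10, fixed_delay_s=5, abort_pct=5):
    """Mirror the fault block above: a 10% chance of a fixed delay and an
    independent 5% chance of aborting with HTTP 503.
    Returns a status code to abort with, or None to forward the request."""
    if rng.uniform(0, 100) < delay_pct:
        time.sleep(fixed_delay_s)   # injected latency
    if rng.uniform(0, 100) < abort_pct:
        return 503                  # injected abort
    return None

rng = random.Random(1)
aborted = sum(inject_fault(rng, fixed_delay_s=0) == 503 for _ in range(1000))
# aborted lands near 50, i.e. about 5% of requests
```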
6.4 Circuit Breaking
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: reviews-circuit-breaker
spec:
host: reviews
trafficPolicy:
connectionPool:
tcp:
maxConnections: 50
http:
http1MaxPendingRequests: 100
http2MaxRequests: 100
maxRequestsPerConnection: 10
outlierDetection:
consecutive5xxErrors: 3
interval: 10s
baseEjectionTime: 30s
maxEjectionPercent: 30
minHealthPercent: 70
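The consecutive5xxErrors logic can be sketched in a few lines of Python. This toy detector ignores ejection timing (interval, baseEjectionTime, maxEjectionPercent) and shows only the streak counting:

```python
class OutlierDetector:
    """Eject a host after N consecutive 5xx responses, in the spirit of
    the outlierDetection block above (ejection timing is omitted)."""
    def __init__(self, consecutive_5xx=3):
        self.threshold = consecutive_5xx
        self.streak = {}      # host -> current consecutive 5xx count
        self.ejected = set()

    def record(self, host, status):
        if 500 <= status <= 599:
            self.streak[host] = self.streak.get(host, 0) + 1
            if self.streak[host] >= self.threshold:
                self.ejected.add(host)
        else:
            self.streak[host] = 0  # any success resets the streak

    def healthy(self, host):
        return host not in self.ejected

d = OutlierDetector()
for status in (503, 503, 503):
    d.record("reviews-1", status)
print(d.healthy("reviews-1"))  # False
```

In the real implementation, ejected hosts return to the pool after baseEjectionTime, and maxEjectionPercent caps how many hosts can be ejected at once.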
6.5 Retries and Timeouts
# Istio
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: reviews-retry
spec:
hosts:
- reviews
http:
- timeout: 10s
retries:
attempts: 3
perTryTimeout: 3s
retryOn: 5xx,reset,connect-failure,retriable-4xx
route:
- destination:
host: reviews
# Linkerd ServiceProfile
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
name: reviews.default.svc.cluster.local
spec:
routes:
- name: GET /reviews
condition:
method: GET
pathRegex: /reviews/.*
isRetryable: true
timeout: 10s
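Both snippets above describe the same client-side behavior: a bounded number of retries, each with its own budget, all inside an overall deadline. A Python sketch of that loop (the function and the flaky backend are invented for illustration):

```python
import time

def call_with_retries(fn, attempts=3, per_try_timeout=3.0, overall_timeout=10.0):
    """Up to `attempts` tries, each limited by its own budget, all bounded
    by the overall deadline, as in the VirtualService retries block."""
    deadline = time.monotonic() + overall_timeout
    last_exc = None
    for _ in range(attempts):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            return fn(timeout=min(per_try_timeout, remaining))
        except Exception as exc:   # stands in for 5xx / reset / connect-failure
            last_exc = exc
    raise last_exc or TimeoutError("overall timeout exceeded")

calls = {"n": 0}
def flaky(timeout):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("connect-failure")
    return "200 OK"

print(call_with_retries(flaky))  # 200 OK
```

The mesh's advantage is that this loop lives in the proxy, applied uniformly, instead of being reimplemented in every service.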
7. Observability
7.1 Metrics (Prometheus)
A Service Mesh automatically collects the following metrics.
Golden Signals
================================
1. Latency: request processing time
2. Traffic: requests per second
3. Errors: ratio of failed requests
4. Saturation: resource utilization
Key Istio metrics:
- istio_requests_total: total request count (by source, destination, response code)
- istio_request_duration_milliseconds: request duration
- istio_request_bytes / istio_response_bytes: request/response sizes
Key Linkerd metrics:
- request_total: total request count
- response_latency_ms: response latency
- tcp_open_total: open TCP connections
# Prometheus scrape configuration (Istio)
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
data:
prometheus.yml: |
scrape_configs:
- job_name: 'envoy-stats'
metrics_path: /stats/prometheus
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
7.2 Distributed Tracing (Jaeger / Zipkin)
The mesh's proxies emit trace spans automatically, allowing a request's full path across services to be reconstructed.
# Istio telemetry configuration
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
name: mesh-default
namespace: istio-system
spec:
tracing:
- providers:
- name: jaeger
randomSamplingPercentage: 10
customTags:
environment:
literal:
value: "production"
Important: applications must forward the following headers on their outbound calls (the proxy generates them, but carrying them from inbound to outbound requests is the application's responsibility).
Tracing headers to propagate:
- x-request-id
- x-b3-traceid
- x-b3-spanid
- x-b3-parentspanid
- x-b3-sampled
- x-b3-flags
- traceparent (W3C Trace Context)
- tracestate
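A sketch of the propagation an application must perform, in Python: copy exactly these headers from the inbound request onto every outbound call. The dict-based request representation is illustrative; web frameworks expose equivalent header maps.

```python
TRACE_HEADERS = frozenset([
    "x-request-id", "x-b3-traceid", "x-b3-spanid", "x-b3-parentspanid",
    "x-b3-sampled", "x-b3-flags", "traceparent", "tracestate",
])

def propagate(inbound_headers):
    """Return only the tracing headers, to attach to outbound requests.
    Header names are matched case-insensitively, as HTTP requires."""
    return {name: value for name, value in inbound_headers.items()
            if name.lower() in TRACE_HEADERS}

outbound = propagate({"X-B3-TraceId": "80f1", "Cookie": "session=1"})
print(outbound)  # {'X-B3-TraceId': '80f1'}
```

Tracing libraries such as OpenTelemetry SDKs do this automatically once instrumented, which is usually preferable to hand-rolled copying.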
7.3 Kiali Dashboard
Kiali is an observability dashboard purpose-built for Istio.
# Install Kiali
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/addons/kiali.yaml
# Open the dashboard
istioctl dashboard kiali
Key Kiali features:
- Service topology graph visualization
- Real-time traffic flow monitoring
- Istio configuration validation and error detection
- Distributed tracing integration
- Metric-based health indicators
7.4 Grafana Dashboards
# Install Grafana with preconfigured dashboards
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/addons/grafana.yaml
# Open the dashboard
istioctl dashboard grafana
Key dashboards:
- Mesh Dashboard: overview of mesh-wide traffic
- Service Dashboard: per-service metrics
- Workload Dashboard: per-workload details
- Performance Dashboard: P50/P90/P99 latency
8. Kubernetes Gateway API
8.1 What Is the Gateway API?
The Kubernetes Gateway API is the next-generation traffic management standard that succeeds Ingress. Its role-oriented design cleanly separates the responsibilities of infrastructure, cluster, and application owners.
# GatewayClass: defined by the infrastructure administrator
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
name: istio
spec:
controllerName: istio.io/gateway-controller
---
# Gateway: defined by the cluster administrator
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
name: bookinfo-gateway
spec:
gatewayClassName: istio
listeners:
- name: https
protocol: HTTPS
port: 443
tls:
mode: Terminate
certificateRefs:
- name: bookinfo-tls
allowedRoutes:
namespaces:
from: Selector
selector:
matchLabels:
expose: "true"
---
# HTTPRoute: defined by the application developer
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
name: bookinfo-route
spec:
parentRefs:
- name: bookinfo-gateway
hostnames:
- "bookinfo.example.com"
rules:
- matches:
- path:
type: PathPrefix
value: /reviews
backendRefs:
- name: reviews
port: 9080
weight: 90
- name: reviews-v2
port: 9080
weight: 10
8.2 Istio Gateway vs Kubernetes Gateway API
Classic Istio approach:
Gateway + VirtualService + DestinationRule
Kubernetes Gateway API approach:
GatewayClass + Gateway + HTTPRoute
Benefits:
- A standardized API (portable across implementations)
- Role-based separation of responsibilities
- Better namespace isolation
- The same API works with Istio, Linkerd, Cilium, and others
9. Ambient Mesh
9.1 Limitations of Sidecars
Problems with the classic sidecar approach:
- An extra 50-100MB of memory per Pod
- A proxy hop added to every request (latency)
- Pod restarts required for sidecar injection
- Resource over-provisioning
9.2 Ambient Mesh Architecture
Istio's Ambient Mesh is a new mode that implements the service mesh without sidecars.
Classic sidecar mode:
┌────────────┐    ┌────────────┐
│ App + Envoy│───►│ App + Envoy│
└────────────┘    └────────────┘
Ambient Mesh mode:
┌────────────┐    ┌────────────┐
│    App     │    │    App     │
└─────┬──────┘    └──────┬─────┘
      │                  │
┌─────┴──────────────────┴─────┐ ← ztunnel (one per node, L4)
└──────────────┬───────────────┘
               │
        ┌──────┴──────┐ ← waypoint proxy (optional, L7)
        │  Waypoint   │
        └─────────────┘
ztunnel (Zero Trust Tunnel):
- Runs as one DaemonSet pod per node
- Handles L4 only: mTLS and basic authorization
- Written in Rust, very lightweight
- No Pod restarts required
Waypoint Proxy:
- Deployed only where L7 features are needed
- Can be deployed per namespace or per service
- Envoy-based, providing the full L7 feature set
# Install Istio in ambient mode
istioctl install --set profile=ambient -y
# Add a namespace to the ambient mesh
kubectl label namespace default istio.io/dataplane-mode=ambient
# Deploy a waypoint proxy (when L7 features are needed)
istioctl waypoint apply --namespace default --name reviews-waypoint
9.3 Benefits of Ambient Mesh
Resource savings (100-Pod cluster):
========================================
              Sidecar mode          Ambient mode
Memory:       5-10GB extra          200-500MB extra
CPU:          significant overhead  minimal overhead
Operations:   manage sidecars       manage only the ztunnel DaemonSet
Upgrades:     Pod restarts needed   rolling update of ztunnel
10. Security Deep Dive
10.1 RBAC (Role-Based Access Control)
# Namespace-level deny-all policy
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: deny-all
namespace: production
spec:
# An empty spec (no rules) denies all requests
{}
---
# Allow only a specific service
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: allow-frontend-to-api
namespace: production
spec:
selector:
matchLabels:
app: api-server
action: ALLOW
rules:
- from:
- source:
namespaces: ["production"]
principals: ["cluster.local/ns/production/sa/frontend"]
to:
- operation:
methods: ["GET", "POST"]
paths: ["/api/v1/*"]
when:
- key: request.headers[x-api-version]
values: ["v1", "v2"]
10.2 JWT Validation
# RequestAuthentication: defines JWT validation
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
name: jwt-auth
namespace: production
spec:
selector:
matchLabels:
app: api-server
jwtRules:
- issuer: "https://auth.example.com"
jwksUri: "https://auth.example.com/.well-known/jwks.json"
forwardOriginalToken: true
outputPayloadToHeader: "x-jwt-payload"
---
# Authorization based on JWT claims
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: require-jwt
namespace: production
spec:
selector:
matchLabels:
app: api-server
action: ALLOW
rules:
- from:
- source:
requestPrincipals: ["https://auth.example.com/*"]
when:
- key: request.auth.claims[role]
values: ["admin", "editor"]
10.3 External Authorization
# Integrating an external authorization service
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: ext-authz
namespace: production
spec:
selector:
matchLabels:
app: api-server
action: CUSTOM
provider:
name: "opa-ext-authz"
rules:
- to:
- operation:
paths: ["/admin/*"]
# Register the external authorization provider in MeshConfig
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
meshConfig:
extensionProviders:
- name: "opa-ext-authz"
envoyExtAuthzGrpc:
service: "opa.opa-system.svc.cluster.local"
port: 9191
includeRequestBodyInCheck:
maxRequestBytes: 1024
11. Production Operations Best Practices
11.1 Resource Limits
# Istio sidecar resource limits
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
meshConfig:
defaultConfig:
concurrency: 2
values:
global:
proxy:
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 256Mi
11.2 Progressive Rollout Strategy
# Step 1: PERMISSIVE mTLS (plaintext traffic still accepted)
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: default
spec:
mtls:
mode: PERMISSIVE
EOF
# Step 2: monitor metrics (check the share of mTLS traffic)
# inspect connection_security_policy on the istio_requests_total metric
# Step 3: switch to STRICT mTLS
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: default
spec:
mtls:
mode: STRICT
EOF
11.3 Debugging Tools
# Check Istio proxy status
istioctl proxy-status
# Dump the Envoy configuration
istioctl proxy-config all POD_NAME -o json
# Inspect routing rules
istioctl proxy-config route POD_NAME
# Inspect cluster configuration
istioctl proxy-config cluster POD_NAME
# Analyzer (detects configuration errors)
istioctl analyze --all-namespaces
# Linkerd diagnostics
linkerd check
linkerd diagnostics proxy-metrics POD_NAME
linkerd viz stat deploy
linkerd viz top deploy/webapp
linkerd viz tap deploy/webapp
11.4 Horizontal Pod Autoscaler Integration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: reviews-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: reviews
minReplicas: 3
maxReplicas: 20
metrics:
- type: Pods
pods:
metric:
name: istio_requests_per_second
target:
type: AverageValue
averageValue: "100"
- type: Pods
pods:
metric:
name: istio_request_duration_milliseconds_p99
target:
type: AverageValue
averageValue: "500"
11.5 Upgrade Strategy
# Istio canary upgrade
# 1. Install the new control plane version (revision-based)
istioctl install --set revision=1-24-0
# 2. Migrate gradually by relabeling namespaces
kubectl label namespace default istio.io/rev=1-24-0 --overwrite
# 3. Restart Pods to pick up the new proxy
kubectl rollout restart deployment -n default
# 4. Remove the old version
istioctl uninstall --revision 1-23-0
12. When Not to Use a Service Mesh
A Service Mesh is powerful, but it is not the right fit for every environment.
Situations where you should not use one:
- Few services: with five or fewer services, the mesh's complexity can outweigh its benefits.
- The team is new to Kubernetes: a Service Mesh is extra complexity layered on top of Kubernetes.
- Severely constrained resources: when the sidecar proxies' memory/CPU overhead cannot be absorbed.
- Extreme performance requirements: environments like HFT (high-frequency trading) where microsecond-level latency matters.
Consider alternatives:
- Just mTLS: cert-manager + TLS in the services themselves
- Basic observability: direct OpenTelemetry instrumentation
- Simple load balancing: Kubernetes Service (ClusterIP)
- Ingress only: NGINX Ingress Controller or Traefik
- Network policy: Kubernetes NetworkPolicy or Cilium
Quiz
Q1: Explain the roles of the data plane and the control plane in a Service Mesh.
Data plane: the set of sidecar proxies that intercept and handle actual service traffic. They perform mTLS encryption, load balancing, metrics collection, retries/timeouts, and more. Istio uses Envoy; Linkerd uses linkerd2-proxy.
Control plane: centrally manages and configures the data plane proxies. It handles service discovery, certificate issuance, and policy distribution. Istio's control plane is Istiod; Linkerd's consists of the destination/identity/proxy-injector components.
Q2: What is the key difference between mTLS and regular TLS?
In regular TLS, only the client verifies the server's certificate. In mTLS (mutual TLS), both sides present and verify certificates.
- The client verifies the server's certificate (same as regular TLS)
- The server also verifies the client's certificate (the extra step in mTLS)
- This enables bidirectional identity verification between services
- The SPIFFE standard maps a service's identity to its Kubernetes ServiceAccount
Q3: Explain the problems Istio's Ambient Mesh solves and its architecture.
Problems solved: the classic sidecar approach costs 50-100MB of memory per Pod, requires Pod restarts for sidecar injection, and adds a proxy hop to every request.
Architecture:
- ztunnel: an L4 proxy running as one DaemonSet pod per node. Written in Rust, very lightweight, handling only mTLS and basic authorization.
- Waypoint Proxy: deployed optionally, only where L7 features are needed. Envoy-based, providing the full L7 feature set including traffic management.
For a 100-Pod cluster, memory usage drops sharply, from 5-10GB (sidecars) to 200-500MB (ambient).
Q4: In which situations should you choose Istio versus Linkerd?
Choose Istio for:
- Fine-grained traffic management (weighted routing, fault injection, mirroring)
- Complex security policies (JWT validation, external authorization, RBAC)
- Wasm extension plugins
- Ambient Mesh (sidecar-less mode)
Choose Linkerd for:
- Minimal resource overhead (10-20MB per Pod)
- Fast adoption and simple operations
- When the core features (mTLS, metrics, retries) are enough
- Small operations teams
Q5: When should you not adopt a Service Mesh?
- Five or fewer services: the complexity outweighs the benefits
- The team is new to Kubernetes: a mesh adds another layer of complexity
- Severe resource constraints: the sidecar memory/CPU overhead cannot be absorbed
- Extreme low-latency requirements: environments where microsecond-level latency matters (e.g. HFT)
Alternatives: cert-manager (mTLS), OpenTelemetry (observability), Kubernetes NetworkPolicy (network security), NGINX Ingress (ingress)
참고 자료
Service Mesh Complete Guide 2025: Istio vs Linkerd, mTLS, Traffic Management, Observability
Introduction: Why Service Mesh?
As microservices architecture has become the standard, environments with dozens to hundreds of services communicating over the network are now commonplace. Several recurring problems emerge in this complex inter-service communication landscape.
Security concerns: Without encryption, service-to-service communication is vulnerable to eavesdropping even within internal networks. Implementing TLS individually in every service and managing certificates is a massive operational burden.
Observability gaps: When requests traverse multiple services, identifying where latency occurs or which service returns errors becomes incredibly difficult.
Traffic control challenges: Advanced traffic management like canary deployments, A/B testing, and circuit breakers must be implemented directly in application code.
Service Mesh solves all of these problems at the infrastructure layer. Without modifying a single line of application code, you can transparently add security, observability, and traffic control at the network level.
1. Service Mesh Architecture
A Service Mesh consists of two main planes.
1.1 Data Plane
The data plane is the collection of proxies that handle actual service traffic. Deployed as sidecars in each service Pod, they intercept all inbound/outbound traffic.
┌─────────────────────────────────────────────┐
│ Pod │
│ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Application │◄──►│ Sidecar Proxy │ │
│ │ Container │ │ (Envoy/linkerd2) │ │
│ └─────────────┘ └─────────────────────┘ │
└─────────────────────────────────────────────┘
Key responsibilities of sidecar proxies:
- Transparently intercept all traffic (using iptables rules)
- Perform mTLS encryption/decryption
- Load balancing (round robin, least connections, etc.)
- Collect metrics and propagate distributed tracing headers
- Apply retries, timeouts, and circuit breaking
1.2 Control Plane
The control plane centrally manages and configures the data plane proxies.
Istio Control Plane (Istiod):
# Key functions managed by Istiod
- Service discovery: Syncs service list from Kubernetes API
- Configuration distribution: Converts VirtualService, DestinationRule to Envoy config
- Certificate management: Issues/renews mTLS certificates (built-in CA)
- Policy enforcement: Distributes AuthorizationPolicy, PeerAuthentication
Linkerd Control Plane:
# Linkerd control plane components
- destination: Service discovery + policy distribution
- identity: mTLS certificate issuance (trust anchor based)
- proxy-injector: Automatic sidecar injection on Pod creation
- heartbeat: Telemetry collection
2. Istio Deep Dive
2.1 Istio Architecture Overview
Istio is the most feature-rich Service Mesh. Co-developed by Google, IBM, and Lyft, it is now a CNCF graduated project.
# Install Istio (istioctl)
curl -L https://istio.io/downloadIstio | sh -
cd istio-1.24.0
export PATH=$PWD/bin:$PATH
# Profile-based installation
istioctl install --set profile=demo -y
# Enable automatic sidecar injection for namespace
kubectl label namespace default istio-injection=enabled
2.2 Envoy Sidecar Proxy
Istio's data plane uses Envoy proxy. Envoy is a high-performance L4/L7 proxy written in C++.
# Envoy core features
- HTTP/1.1, HTTP/2, gRPC support
- Automatic retries and circuit breaking
- Dynamic configuration updates (xDS API)
- Rich metrics and tracing
- WebAssembly (Wasm) extension support
- Hot restart (graceful restart)
Memory overhead is approximately 40-100MB per Pod, with CPU overhead in the low milliseconds per request range.
2.3 VirtualService
VirtualService is the core resource for defining traffic routing rules in Istio.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: reviews-route
spec:
hosts:
- reviews
http:
# Canary deployment: 90% v1, 10% v2
- route:
- destination:
host: reviews
subset: v1
weight: 90
- destination:
host: reviews
subset: v2
weight: 10
timeout: 5s
retries:
attempts: 3
perTryTimeout: 2s
retryOn: 5xx,reset,connect-failure
2.4 DestinationRule
DestinationRule defines policies applied to traffic after routing decisions are made.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: reviews-destination
spec:
host: reviews
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
h2UpgradePolicy: DEFAULT
http1MaxPendingRequests: 100
http2MaxRequests: 1000
outlierDetection:
consecutive5xxErrors: 5
interval: 30s
baseEjectionTime: 30s
maxEjectionPercent: 50
loadBalancer:
simple: LEAST_REQUEST
subsets:
- name: v1
labels:
version: v1
- name: v2
labels:
version: v2
2.5 Gateway
Istio Gateway manages traffic entering the mesh from external sources.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
name: bookinfo-gateway
spec:
selector:
istio: ingressgateway
servers:
- port:
number: 443
name: https
protocol: HTTPS
tls:
mode: SIMPLE
credentialName: bookinfo-cert
hosts:
- "bookinfo.example.com"
2.6 PeerAuthentication
PeerAuthentication defines mTLS policies between services.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: istio-system
spec:
# Apply STRICT mTLS across the mesh
mtls:
mode: STRICT
---
# PERMISSIVE mode for specific namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: legacy-compat
namespace: legacy-apps
spec:
mtls:
mode: PERMISSIVE
2.7 AuthorizationPolicy
AuthorizationPolicy defines access control between services.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: reviews-viewer
namespace: default
spec:
selector:
matchLabels:
app: reviews
action: ALLOW
rules:
- from:
- source:
principals: ["cluster.local/ns/default/sa/productpage"]
to:
- operation:
methods: ["GET"]
paths: ["/reviews/*"]
3. Linkerd Deep Dive
3.1 Linkerd Architecture Overview
Linkerd is a Service Mesh focused on simplicity and lightness. Developed by Buoyant, it is a CNCF graduated project.
# Install Linkerd CLI
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | sh
export PATH=$HOME/.linkerd2/bin:$PATH
# Pre-flight checks
linkerd check --pre
# Install
linkerd install --crds | kubectl apply -f -
linkerd install | kubectl apply -f -
# Verify
linkerd check
# Viz extension (dashboard + metrics)
linkerd viz install | kubectl apply -f -
3.2 linkerd2-proxy: Micro-Proxy Written in Rust
Linkerd's key differentiator is its data plane proxy. linkerd2-proxy is written in Rust, offering these advantages:
Performance Comparison (linkerd2-proxy vs Envoy)
========================================
Memory usage: ~20MB vs ~50-100MB
P99 latency: ~1ms added vs ~2-5ms added
Binary size: ~13MB vs ~50MB
Security: Rust memory safety guaranteed
Feature scope: Service Mesh dedicated vs general-purpose proxy
linkerd2-proxy achieves its lightweight footprint by implementing only the features needed for Service Mesh. Unlike Envoy, it is not a general-purpose proxy, so features like Wasm extensions are absent, but it delivers excellent performance for core functionality.
3.3 ServiceProfile
Linkerd's ServiceProfile defines per-service routing and observability settings.
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
name: webapp.default.svc.cluster.local
namespace: default
spec:
routes:
- name: GET /api/users
condition:
method: GET
pathRegex: /api/users
responseClasses:
- condition:
status:
min: 500
max: 599
isFailure: true
- name: POST /api/orders
condition:
method: POST
pathRegex: /api/orders
isRetryable: true
timeout: 10s
3.4 TrafficSplit (SMI)
Linkerd uses the SMI (Service Mesh Interface) standard for traffic splitting.
apiVersion: split.smi-spec.io/v1alpha4
kind: TrafficSplit
metadata:
name: webapp-split
namespace: default
spec:
service: webapp
backends:
- service: webapp-v1
weight: 900
- service: webapp-v2
weight: 100
3.5 Linkerd Multi-cluster
Linkerd natively supports multi-cluster communication.
# Install multi-cluster extension
linkerd multicluster install | kubectl apply -f -
# Link remote cluster
linkerd multicluster link --cluster-name=west \
--api-server-address="https://west.example.com:6443" | \
kubectl apply -f -
# Verify service mirroring
linkerd multicluster gateways
4. Istio vs Linkerd Detailed Comparison
| Dimension | Istio | Linkerd |
|---|---|---|
| Data Plane Proxy | Envoy (C++) | linkerd2-proxy (Rust) |
| Memory Overhead (per Pod) | 50-100MB | 10-20MB |
| P99 Latency Added | 2-5ms | 0.5-1ms |
| Installation Complexity | High (various profiles) | Low (single command) |
| CRD Count | 50+ | Under 10 |
| Learning Curve | Steep | Gradual |
| Traffic Management | Very rich (VirtualService) | Basic (ServiceProfile) |
| Security Policies | Fine-grained RBAC (AuthorizationPolicy) | Basic mTLS + Server/Authorization |
| Protocol Support | HTTP, gRPC, TCP, WebSocket | HTTP, gRPC, TCP |
| Wasm Extensions | Supported | Not supported |
| Multi-cluster | Supported (complex) | Supported (relatively simple) |
| Ambient Mesh | Supported (sidecar-less mode) | N/A |
| Gateway API | Full support | Partial support |
| Community Size | Very large (CNCF graduated) | Large (CNCF graduated) |
| Operational Complexity | High | Low |
| Best For | Large scale, complex policies | Small-medium, simplicity preferred |
Selection Criteria Summary
Choose Istio when:
- You need fine-grained traffic management (weighted routing, fault injection, traffic mirroring)
- Complex security policies are required (JWT validation, external authorization)
- Wasm-based extension plugins are needed
- You want to use Ambient Mesh (sidecar-less mode)
Choose Linkerd when:
- Minimizing resource overhead is a priority
- You want fast adoption and simple operations
- Core features (mTLS, metrics, retries) are sufficient
- Your operations team is small
5. mTLS (Mutual TLS)
5.1 How mTLS Works
In a Service Mesh, mTLS automatically encrypts service-to-service communication.
Service A (client) Service B (server)
| |
|-- ClientHello -----------> |
|<- ServerHello + ServerCert |
|-- Client Certificate ----> |
|<- Certificate Verified --- |
| |
|<=== Encrypted Traffic ===> |
The key difference from regular TLS: in mTLS, both sides present and verify certificates, enabling the server to verify the client's identity as well.
5.2 SPIFFE Identity Framework
Both Istio and Linkerd use the SPIFFE (Secure Production Identity Framework For Everyone) standard.
SPIFFE ID format:
spiffe://cluster.local/ns/NAMESPACE/sa/SERVICE_ACCOUNT
Examples:
spiffe://cluster.local/ns/production/sa/frontend
spiffe://cluster.local/ns/production/sa/backend-api
SPIFFE IDs map to Kubernetes ServiceAccounts, proving Pod identity at the network level.
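A small helper illustrating how such IDs are assembled and parsed (an illustrative sketch, not an API from either mesh):

```python
def spiffe_id(trust_domain: str, namespace: str, service_account: str) -> str:
    """Build the SPIFFE ID a mesh derives from a Kubernetes ServiceAccount."""
    return f"spiffe://{trust_domain}/ns/{namespace}/sa/{service_account}"

def parse_spiffe_id(uri: str) -> dict:
    """Split a workload SPIFFE ID back into its components."""
    if not uri.startswith("spiffe://"):
        raise ValueError("not a SPIFFE ID")
    trust_domain, _, path = uri[len("spiffe://"):].partition("/")
    parts = path.split("/")
    # Expect the workload path shape: ns/<namespace>/sa/<service-account>
    if len(parts) != 4 or parts[0] != "ns" or parts[2] != "sa":
        raise ValueError("unexpected workload path")
    return {
        "trust_domain": trust_domain,
        "namespace": parts[1],
        "service_account": parts[3],
    }
```

Because the identity is derived from the ServiceAccount, authorization policies can refer to workloads by these stable IDs instead of IP addresses.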
5.3 Automatic Certificate Rotation
# Istio: Certificate lifetime configuration (MeshConfig)
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
meshConfig:
defaultConfig:
# Default workload certificate is 24 hours
# Customizable through proxyMetadata
certificates: []
values:
pilot:
env:
# Maximum certificate lifetime
MAX_WORKLOAD_CERT_TTL: "48h"
# Default certificate lifetime
DEFAULT_WORKLOAD_CERT_TTL: "24h"
Linkerd certificate management:
# Create trust anchor (10-year lifetime)
step certificate create root.linkerd.cluster.local ca.crt ca.key \
--profile root-ca --no-password --insecure --not-after=87600h
# Create issuer certificate (48-hour lifetime, auto-renewed)
step certificate create identity.linkerd.cluster.local issuer.crt issuer.key \
--profile intermediate-ca --not-after=48h --no-password --insecure \
--ca ca.crt --ca-key ca.key
# Install with certificates
linkerd install \
--identity-trust-anchors-file ca.crt \
--identity-issuer-certificate-file issuer.crt \
--identity-issuer-key-file issuer.key | kubectl apply -f -
6. Traffic Management
6.1 Canary Releases
# Istio - Progressive canary deployment
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: reviews
spec:
hosts:
- reviews
http:
- match:
- headers:
x-canary-user:
exact: "true"
route:
- destination:
host: reviews
subset: v2
- route:
- destination:
host: reviews
subset: v1
weight: 95
- destination:
host: reviews
subset: v2
weight: 5
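The 95/5 split above can be simulated with a simple weighted picker (an illustrative sketch of the per-request decision, not Envoy's actual load-balancing algorithm):

```python
import random
from collections import Counter

def pick_subset(weights: dict, rng: random.Random) -> str:
    """Choose a destination subset in proportion to VirtualService weights."""
    r = rng.uniform(0, sum(weights.values()))
    upto = 0.0
    for subset, w in weights.items():
        upto += w
        if r <= upto:
            return subset
    return subset  # guard against float rounding at the upper edge

rng = random.Random(42)
counts = Counter(pick_subset({"v1": 95, "v2": 5}, rng) for _ in range(10_000))
# Roughly 95% of simulated requests land on v1, 5% on v2.
```

This also shows why weights are statistical: any individual request may still hit the canary.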
Automated canary with Flagger:
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: reviews
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: reviews
service:
port: 9080
analysis:
interval: 1m
threshold: 5
maxWeight: 50
stepWeight: 10
metrics:
- name: request-success-rate
thresholdRange:
min: 99
interval: 1m
- name: request-duration
thresholdRange:
max: 500
interval: 1m
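With `stepWeight: 10` and `maxWeight: 50`, Flagger walks the canary weight through a fixed schedule before promoting; a one-line sketch of that progression:

```python
def canary_weight_schedule(step_weight: int, max_weight: int) -> list:
    """Traffic weights the canary receives at each analysis interval."""
    return list(range(step_weight, max_weight + 1, step_weight))

# With the Canary spec above: steps of 10% up to a 50% ceiling.
schedule = canary_weight_schedule(10, 50)  # [10, 20, 30, 40, 50]
```

If any metric check fails more than `threshold` times along the way, Flagger rolls the weight back to zero instead of promoting.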
6.2 Traffic Mirroring (Shadow Traffic)
Send a copy of production traffic to a new version to validate behavior in a real environment.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: reviews-mirror
spec:
hosts:
- reviews
http:
- route:
- destination:
host: reviews
subset: v1
mirror:
host: reviews
subset: v2
mirrorPercentage:
value: 100.0
Key characteristics of mirroring:
- Responses from mirrored traffic are discarded (no client impact)
- The `-shadow` suffix is added to the `Host` header
- Validate performance and error rates of the new version with real traffic
6.3 Fault Injection
Intentionally inject failures to test system resilience.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: ratings-fault
spec:
hosts:
- ratings
http:
- fault:
delay:
percentage:
value: 10
fixedDelay: 5s
abort:
percentage:
value: 5
httpStatus: 503
route:
- destination:
host: ratings
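A rough simulation of the percentages above (this sketch assumes the delay and abort faults are each sampled independently per request, which is my reading of the spec; both can apply to the same request):

```python
import random

def inject_fault(rng: random.Random, delay_pct=10.0, abort_pct=5.0) -> list:
    """Per-request fault decision mirroring the VirtualService above."""
    faults = []
    if rng.uniform(0, 100) < delay_pct:
        faults.append(("delay", "5s"))   # hold the request for fixedDelay
    if rng.uniform(0, 100) < abort_pct:
        faults.append(("abort", 503))    # return httpStatus without calling upstream
    return faults

rng = random.Random(7)
results = [inject_fault(rng) for _ in range(10_000)]
```

Running a chaos experiment like this against staging quickly reveals whether callers actually honor their timeouts and retry budgets.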
6.4 Circuit Breaking
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: reviews-circuit-breaker
spec:
host: reviews
trafficPolicy:
connectionPool:
tcp:
maxConnections: 50
http:
http1MaxPendingRequests: 100
http2MaxRequests: 100
maxRequestsPerConnection: 10
outlierDetection:
consecutive5xxErrors: 3
interval: 10s
baseEjectionTime: 30s
maxEjectionPercent: 30
minHealthPercent: 70
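The `consecutive5xxErrors: 3` behavior can be sketched as a per-host error-streak counter (a simplification: real Envoy outlier detection also honors `interval`, `baseEjectionTime`, and `maxEjectionPercent`, all elided here):

```python
class OutlierDetector:
    """Minimal sketch of consecutive-5xx ejection."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive: dict = {}   # host -> current 5xx streak
        self.ejected: set = set()

    def record(self, host: str, status: int) -> None:
        if 500 <= status < 600:
            self.consecutive[host] = self.consecutive.get(host, 0) + 1
            if self.consecutive[host] >= self.threshold:
                self.ejected.add(host)  # stop routing to this endpoint
        else:
            self.consecutive[host] = 0  # any success resets the streak
```

The key design point is that ejection is per-endpoint: one bad Pod is removed from the load-balancing pool while healthy replicas keep serving.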
6.5 Retries and Timeouts
# Istio
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: reviews-retry
spec:
hosts:
- reviews
http:
- timeout: 10s
retries:
attempts: 3
perTryTimeout: 3s
retryOn: 5xx,reset,connect-failure,retriable-4xx
route:
- destination:
host: reviews
# Linkerd ServiceProfile
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
name: reviews.default.svc.cluster.local
spec:
routes:
- name: GET /reviews
condition:
method: GET
pathRegex: /reviews/.*
isRetryable: true
timeout: 10s
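The retry policy above, reduced to its control flow (a sketch; `perTryTimeout` enforcement and the `retryOn` condition checks are elided, and a real proxy would only retry idempotent-safe failures):

```python
def call_with_retries(fn, attempts: int = 3):
    """Up to `attempts` total tries; re-raise the last failure if all fail."""
    last_exc = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:  # real proxies retry only matching retryOn conditions
            last_exc = exc
    raise last_exc

calls = {"n": 0}

def flaky():
    """Simulated upstream that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("upstream connection reset")
    return "ok"
```

Note how the mesh versions of this loop keep retry logic out of every application, and retry budgets prevent retries from amplifying an outage.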
7. Observability
7.1 Metrics (Prometheus)
Service Mesh automatically collects the following metrics:
Golden Signals
================================
1. Latency: Request processing time
2. Traffic: Requests per second
3. Error rate: Percentage of failed requests
4. Saturation: Resource utilization
Istio key metrics:
- istio_requests_total: Total request count (by source, destination, response code)
- istio_request_duration_milliseconds: Request duration
- istio_request_bytes / istio_response_bytes: Request/response sizes
Linkerd key metrics:
- request_total: Total request count
- response_latency_ms: Response latency
- tcp_open_total: TCP connection count
# Prometheus scraping configuration (Istio)
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
data:
prometheus.yml: |
scrape_configs:
- job_name: 'envoy-stats'
metrics_path: /stats/prometheus
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
7.2 Distributed Tracing (Jaeger / Zipkin)
Service Mesh automatically propagates tracing headers to track the complete request path.
# Istio telemetry configuration
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
name: mesh-default
namespace: istio-system
spec:
tracing:
- providers:
- name: jaeger
randomSamplingPercentage: 10
customTags:
environment:
literal:
value: "production"
Important: Applications must propagate the following headers. The sidecar generates them on the inbound side, but copying them onto outbound calls is the application's responsibility — without it, each hop starts a new, disconnected trace:
Tracing headers to propagate:
- x-request-id
- x-b3-traceid
- x-b3-spanid
- x-b3-parentspanid
- x-b3-sampled
- x-b3-flags
- traceparent (W3C Trace Context)
- tracestate
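A minimal sketch of the propagation an application must do: copy the inbound tracing headers onto every outbound call (header matching is case-insensitive per HTTP semantics):

```python
TRACE_HEADERS = [
    "x-request-id", "x-b3-traceid", "x-b3-spanid", "x-b3-parentspanid",
    "x-b3-sampled", "x-b3-flags", "traceparent", "tracestate",
]

def propagated_headers(incoming: dict) -> dict:
    """Extract the tracing headers to attach to outbound requests."""
    lowered = {k.lower(): v for k, v in incoming.items()}
    return {h: lowered[h] for h in TRACE_HEADERS if h in lowered}
```

In practice an OpenTelemetry or B3 middleware does this for you; the sketch only shows what that middleware must preserve.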
7.3 Kiali Dashboard
Kiali is a dedicated observability dashboard for Istio.
# Install Kiali
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/addons/kiali.yaml
# Access dashboard
istioctl dashboard kiali
Key Kiali features:
- Service topology graph visualization
- Real-time traffic flow monitoring
- Istio configuration validation and error detection
- Distributed tracing integration
- Metric-based health status display
7.4 Grafana Dashboards
# Install Grafana + pre-configured dashboards
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/addons/grafana.yaml
# Access dashboard
istioctl dashboard grafana
Key dashboards:
- Mesh Dashboard: Overall mesh traffic overview
- Service Dashboard: Individual service metrics
- Workload Dashboard: Workload-level details
- Performance Dashboard: P50/P90/P99 latency
8. Kubernetes Gateway API
8.1 What is Gateway API?
Kubernetes Gateway API is the next-generation traffic management standard replacing Ingress. Its role-based design clearly separates responsibilities between infrastructure, cluster, and application administrators.
# GatewayClass: Defined by infrastructure admin
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
name: istio
spec:
controllerName: istio.io/gateway-controller
---
# Gateway: Defined by cluster admin
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
name: bookinfo-gateway
spec:
gatewayClassName: istio
listeners:
- name: https
protocol: HTTPS
port: 443
tls:
mode: Terminate
certificateRefs:
- name: bookinfo-tls
allowedRoutes:
namespaces:
from: Selector
selector:
matchLabels:
expose: "true"
---
# HTTPRoute: Defined by application developer
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
name: bookinfo-route
spec:
parentRefs:
- name: bookinfo-gateway
hostnames:
- "bookinfo.example.com"
rules:
- matches:
- path:
type: PathPrefix
value: /reviews
backendRefs:
- name: reviews
port: 9080
weight: 90
- name: reviews-v2
port: 9080
weight: 10
8.2 Istio Gateway vs Kubernetes Gateway API
Legacy Istio approach:
Gateway + VirtualService + DestinationRule
Kubernetes Gateway API approach:
GatewayClass + Gateway + HTTPRoute
Benefits:
- Standardized API (portability across implementations)
- Role-based access control
- Better namespace isolation
- Same API usable across Istio, Linkerd, Cilium, etc.
9. Ambient Mesh
9.1 Limitations of Sidecars
Problems with the traditional sidecar approach:
- 50-100MB additional memory per Pod
- Extra proxy hop on every request (latency)
- Pod restart required for sidecar injection
- Resource over-provisioning
9.2 Ambient Mesh Architecture
Istio's Ambient Mesh implements Service Mesh without sidecars.
Traditional Sidecar Mode:
+--------------+ +--------------+
| App + Envoy |--->| App + Envoy |
+--------------+ +--------------+
Ambient Mesh Mode:
+--------------+ +--------------+
| App | | App |
+------+-------+ +-------+------+
| |
+------+--------------------+------+ <-- ztunnel (1 per node, L4)
+------------------+---------------+
|
+------+------+ <-- waypoint proxy (optional, L7)
| Waypoint |
+-------------+
ztunnel (Zero Trust Tunnel):
- Runs as a single DaemonSet per node
- Handles L4 functions only: mTLS, basic authentication
- Written in Rust, extremely lightweight
- No Pod restart required
Waypoint Proxy:
- Deployed only when L7 features are needed
- Can be deployed per-namespace or per-service
- Envoy-based, providing full L7 capabilities
# Install Istio in Ambient mode
istioctl install --set profile=ambient -y
# Add namespace to Ambient mesh
kubectl label namespace default istio.io/dataplane-mode=ambient
# Deploy Waypoint Proxy (when L7 features needed)
istioctl waypoint apply --namespace default --name reviews-waypoint
9.3 Benefits of Ambient Mesh
Resource Savings Comparison (100-Pod cluster):
========================================
Sidecar Mode Ambient Mode
Memory: 5-10GB added 200-500MB added
CPU: Significant Minimal overhead
Operations: Manage sidecars Only ztunnel DaemonSet
Upgrades: Pod restart ztunnel rolling update
10. Security Deep Dive
10.1 RBAC (Role-Based Access Control)
# Namespace-level deny policy
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: deny-all
namespace: production
spec:
  # An ALLOW policy with no rules matches nothing, so all requests are denied
{}
---
# Allow specific services only
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: allow-frontend-to-api
namespace: production
spec:
selector:
matchLabels:
app: api-server
action: ALLOW
rules:
- from:
- source:
namespaces: ["production"]
principals: ["cluster.local/ns/production/sa/frontend"]
to:
- operation:
methods: ["GET", "POST"]
paths: ["/api/v1/*"]
when:
- key: request.headers[x-api-version]
values: ["v1", "v2"]
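The `allow-frontend-to-api` policy above reads as an AND of its `from`/`to`/`when` clauses; a sketch of that evaluation (using `fnmatch` to approximate Istio's `*` path wildcard — the real matcher treats `*` as prefix/suffix globs):

```python
from fnmatch import fnmatch

RULE = {  # mirrors the allow-frontend-to-api policy
    "principals": {"cluster.local/ns/production/sa/frontend"},
    "methods": {"GET", "POST"},
    "paths": ["/api/v1/*"],
    "header_values": {"v1", "v2"},
}

def allows(req: dict, rule: dict = RULE) -> bool:
    """A request is allowed only if every clause of the rule matches."""
    return (
        req["principal"] in rule["principals"]          # from.source.principals
        and req["method"] in rule["methods"]            # to.operation.methods
        and any(fnmatch(req["path"], p) for p in rule["paths"])  # to.operation.paths
        and req["headers"].get("x-api-version") in rule["header_values"]  # when
    )
```

Combined with the deny-all default, any request that fails even one clause is rejected.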
10.2 JWT Validation
# RequestAuthentication: Define JWT validation
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
name: jwt-auth
namespace: production
spec:
selector:
matchLabels:
app: api-server
jwtRules:
- issuer: "https://auth.example.com"
jwksUri: "https://auth.example.com/.well-known/jwks.json"
forwardOriginalToken: true
outputPayloadToHeader: "x-jwt-payload"
---
# JWT claims-based authorization
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: require-jwt
namespace: production
spec:
selector:
matchLabels:
app: api-server
action: ALLOW
rules:
- from:
- source:
requestPrincipals: ["https://auth.example.com/*"]
when:
- key: request.auth.claims[role]
values: ["admin", "editor"]
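To see which claims such a policy inspects, here is a sketch that decodes a JWT payload without verifying the signature. Signature verification against the JWKS is exactly what RequestAuthentication does for you; never skip it in real code.

```python
import base64
import json

def jwt_claims(token: str) -> dict:
    """Decode the (UNVERIFIED) payload segment of a JWT."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64url padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

def role_allowed(token: str, allowed=("admin", "editor")) -> bool:
    """The claim check the AuthorizationPolicy above performs on request.auth.claims[role]."""
    return jwt_claims(token).get("role") in allowed
```

The mesh performs the equivalent check after validating the signature, so the application receives only requests whose claims already passed.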
10.3 External Authorization
# External authorization service integration
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: ext-authz
namespace: production
spec:
selector:
matchLabels:
app: api-server
action: CUSTOM
provider:
name: "opa-ext-authz"
rules:
- to:
- operation:
paths: ["/admin/*"]
# Register external authz provider in MeshConfig
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
meshConfig:
extensionProviders:
- name: "opa-ext-authz"
envoyExtAuthzGrpc:
service: "opa.opa-system.svc.cluster.local"
port: 9191
includeRequestBodyInCheck:
maxRequestBytes: 1024
11. Production Best Practices
11.1 Resource Limits
# Istio sidecar resource limits
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
meshConfig:
defaultConfig:
concurrency: 2
values:
global:
proxy:
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 256Mi
11.2 Progressive Rollout Strategy
# Step 1: PERMISSIVE mTLS (allow legacy traffic)
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: default
spec:
mtls:
mode: PERMISSIVE
EOF
# Step 2: Monitor metrics (check mTLS traffic ratio)
# Check connection_security_policy in istio_requests_total metric
# Step 3: Switch to STRICT mTLS
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: default
spec:
mtls:
mode: STRICT
EOF
11.3 Debugging Tools
# Check Istio proxy status
istioctl proxy-status
# Dump Envoy configuration
istioctl proxy-config all POD_NAME -o json
# Check routing rules
istioctl proxy-config route POD_NAME
# Check cluster configuration
istioctl proxy-config cluster POD_NAME
# Analysis tool (detect config errors)
istioctl analyze --all-namespaces
# Linkerd diagnostics
linkerd check
linkerd diagnostics proxy-metrics POD_NAME
linkerd viz stat deploy
linkerd viz top deploy/webapp
linkerd viz tap deploy/webapp
11.4 Horizontal Pod Autoscaler Integration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: reviews-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: reviews
minReplicas: 3
maxReplicas: 20
metrics:
- type: Pods
pods:
metric:
name: istio_requests_per_second
target:
type: AverageValue
averageValue: "100"
- type: Pods
pods:
metric:
name: istio_request_duration_milliseconds_p99
target:
type: AverageValue
averageValue: "500"
11.5 Upgrade Strategy
# Istio canary upgrade
# 1. Install new version control plane (revision-based)
istioctl install --set revision=1-24-0
# 2. Gradually switch by changing namespace labels
kubectl label namespace default istio.io/rev=1-24-0 --overwrite
# 3. Restart Pods to apply new proxy
kubectl rollout restart deployment -n default
# 4. Remove previous version
istioctl uninstall --revision 1-23-0
12. When NOT to Use Service Mesh
Service Mesh is powerful but not suitable for every environment.
Situations to avoid Service Mesh:
- Few services: With 5 or fewer services, the complexity likely outweighs the benefits.
- Team unfamiliar with Kubernetes: Service Mesh adds complexity on top of Kubernetes.
- Extremely limited resources: When the memory/CPU overhead of sidecar proxies cannot be absorbed.
- Extreme low-latency requirements: Environments where microsecond-level latency matters, such as high-frequency trading (HFT).
Alternatives to consider:
- Simple mTLS only: cert-manager + service-level TLS
- Basic observability: Direct OpenTelemetry instrumentation
- Simple load balancing: Kubernetes Service (ClusterIP)
- Ingress only: NGINX Ingress Controller or Traefik
- Network policies: Kubernetes NetworkPolicy or Cilium
Quiz
Q1: Explain the roles of the data plane and control plane in a Service Mesh.
Data Plane: The collection of sidecar proxies that intercept and process actual service traffic. They perform mTLS encryption, load balancing, metrics collection, retries/timeouts, and more. Istio uses Envoy, and Linkerd uses linkerd2-proxy.
Control Plane: Centrally manages and configures the data plane proxies. Responsible for service discovery, certificate issuance, and policy distribution. Istio uses Istiod, while Linkerd consists of destination/identity/proxy-injector components.
Q2: What is the key difference between mTLS and regular TLS?
In regular TLS, only the client verifies the server's certificate. In mTLS (mutual TLS), both sides present and verify certificates.
- Client verifies the server's certificate (same as regular TLS)
- Server also verifies the client's certificate (the additional mTLS step)
- This enables bidirectional identity verification between services
- The SPIFFE standard maps service identity to Kubernetes ServiceAccounts
Q3: Explain the problems Istio's Ambient Mesh solves and its architecture.
Problems solved: The traditional sidecar approach has 50-100MB memory overhead per Pod, requires Pod restart for sidecar injection, and adds a proxy hop on every request.
Architecture:
- ztunnel: An L4 proxy running as a single DaemonSet per node. Written in Rust, extremely lightweight, handling only mTLS and basic authentication.
- Waypoint Proxy: Optionally deployed only when L7 features are needed. Envoy-based, providing full L7 features like VirtualService and traffic management.
For a 100-Pod cluster, memory usage drops from 5-10GB (sidecar) to 200-500MB (Ambient).
Q4: When should you choose Istio vs Linkerd?
Choose Istio when:
- Fine-grained traffic management is needed (weighted routing, fault injection, mirroring)
- Complex security policies are required (JWT validation, external authorization, RBAC)
- Wasm extension plugins are needed
- Using Ambient Mesh (sidecar-less mode)
Choose Linkerd when:
- Minimizing resource overhead (10-20MB per Pod)
- Fast adoption and simple operations desired
- Core features (mTLS, metrics, retries) are sufficient
- Small operations team
Q5: When should you NOT adopt a Service Mesh?
- 5 or fewer services: Complexity outweighs benefits
- Team inexperienced with Kubernetes: Service Mesh adds another complexity layer
- Extremely limited resources: Cannot absorb sidecar memory/CPU overhead
- Extreme low-latency requirements: Microsecond-level latency critical environments (e.g., HFT)
Alternatives: cert-manager (mTLS), OpenTelemetry (observability), Kubernetes NetworkPolicy (network security), NGINX Ingress (ingress)
References
- Istio Official Documentation
- Linkerd Official Documentation
- Envoy Proxy Official Documentation
- CNCF Service Mesh Landscape
- Kubernetes Gateway API
- SPIFFE Standard
- Istio Ambient Mesh Official Blog
- Linkerd Benchmarks
- Flagger - Progressive Delivery
- Kiali Official Documentation
- SMI (Service Mesh Interface)
- Istio in Action (Manning)
- NIST Zero Trust Architecture (SP 800-207)