Service Mesh Complete Guide 2025: Istio vs Linkerd, mTLS, Traffic Management, Observability
Introduction: Why Service Mesh?
As microservices architecture has become the standard, environments with dozens to hundreds of services communicating over the network are now commonplace. Several problems recur in this complex inter-service communication landscape.
Security concerns: Without encryption, service-to-service communication is vulnerable to eavesdropping even within internal networks. Implementing TLS individually in every service and managing certificates is a massive operational burden.
Observability gaps: When requests traverse multiple services, identifying where latency occurs or which service returns errors becomes difficult.
Traffic control challenges: Advanced traffic management such as canary deployments, A/B testing, and circuit breakers must be implemented directly in application code.
A Service Mesh solves all of these problems at the infrastructure layer. Without modifying a single line of application code, you can transparently add security, observability, and traffic control at the network level.
1. Service Mesh Architecture
A Service Mesh consists of two main planes.
1.1 Data Plane
The data plane is the collection of proxies that handle actual service traffic. Deployed as a sidecar in each service Pod, they intercept all inbound and outbound traffic.
┌─────────────────────────────────────────────┐
│ Pod │
│ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Application │◄──►│ Sidecar Proxy │ │
│ │ Container │ │ (Envoy/linkerd2) │ │
│ └─────────────┘ └─────────────────────┘ │
└─────────────────────────────────────────────┘
Key responsibilities of the sidecar proxy:
- Transparently intercept all traffic (using iptables rules)
- Perform mTLS encryption and decryption
- Load balancing (round robin, least connections, etc.)
- Collect metrics and propagate distributed tracing headers
- Apply retries, timeouts, and circuit breaking
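To make the load-balancing bullet concrete, here is a minimal Python sketch of two strategies a sidecar proxy applies, round robin and least connections. The classes and endpoint addresses are illustrative, not any mesh's actual implementation.

```python
import itertools

class RoundRobinBalancer:
    """Cycle through endpoints in order, as a sidecar proxy might."""
    def __init__(self, endpoints):
        self._cycle = itertools.cycle(endpoints)

    def pick(self):
        return next(self._cycle)

class LeastConnectionsBalancer:
    """Pick the endpoint with the fewest in-flight connections."""
    def __init__(self, endpoints):
        self.active = {ep: 0 for ep in endpoints}

    def pick(self):
        ep = min(self.active, key=self.active.get)
        self.active[ep] += 1   # caller must call release() when done
        return ep

    def release(self, ep):
        self.active[ep] -= 1

rr = RoundRobinBalancer(["10.0.0.1", "10.0.0.2"])
print(rr.pick(), rr.pick(), rr.pick())  # 10.0.0.1 10.0.0.2 10.0.0.1
```

Real proxies track in-flight requests per upstream host in exactly this spirit, just with health checking and connection pooling layered on top.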
1.2 Control Plane
The control plane centrally manages and configures the data plane proxies.
Istio control plane (Istiod):
# Key functions managed by Istiod
- Service discovery: syncs the service list from the Kubernetes API
- Configuration distribution: converts VirtualService, DestinationRule, etc. into Envoy configuration
- Certificate management: issues and renews mTLS certificates (built-in CA)
- Policy enforcement: distributes AuthorizationPolicy and PeerAuthentication
Linkerd control plane:
# Linkerd control plane components
- destination: service discovery + policy distribution
- identity: mTLS certificate issuance (trust-anchor based)
- proxy-injector: automatic sidecar injection on Pod creation
- heartbeat: periodic telemetry reporting
2. Istio Deep Dive
2.1 Istio Architecture Overview
Istio is the most feature-rich Service Mesh. Co-developed by Google, IBM, and Lyft, it is now a CNCF graduated project.
# Install Istio (istioctl)
curl -L https://istio.io/downloadIstio | sh -
cd istio-1.24.0
export PATH=$PWD/bin:$PATH
# Profile-based installation
istioctl install --set profile=demo -y
# Enable automatic sidecar injection for the namespace
kubectl label namespace default istio-injection=enabled
2.2 Envoy Sidecar Proxy
Istio's data plane uses the Envoy proxy. Envoy is a high-performance L4/L7 proxy written in C++ that provides the following features.
# Envoy core features
- HTTP/1.1, HTTP/2, and gRPC support
- Automatic retries and circuit breaking
- Dynamic configuration updates (xDS API)
- Rich metrics and tracing
- WebAssembly (Wasm) extension support
- Hot restart (graceful restart)
Memory overhead is roughly 40-100MB per Pod, and CPU overhead is on the order of a few milliseconds per request.
2.3 VirtualService
VirtualService is the core resource for defining traffic routing rules in Istio.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: reviews-route
spec:
hosts:
- reviews
http:
# Canary deployment: 90% v1, 10% v2
- route:
- destination:
host: reviews
subset: v1
weight: 90
- destination:
host: reviews
subset: v2
weight: 10
timeout: 5s
retries:
attempts: 3
perTryTimeout: 2s
retryOn: 5xx,reset,connect-failure
2.4 DestinationRule
DestinationRule defines the policies applied to traffic after the routing decision has been made.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: reviews-destination
spec:
host: reviews
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
h2UpgradePolicy: DEFAULT
http1MaxPendingRequests: 100
http2MaxRequests: 1000
outlierDetection:
consecutive5xxErrors: 5
interval: 30s
baseEjectionTime: 30s
maxEjectionPercent: 50
loadBalancer:
simple: LEAST_REQUEST
subsets:
- name: v1
labels:
version: v1
- name: v2
labels:
version: v2
2.5 Gateway
An Istio Gateway manages traffic entering the mesh from outside.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
name: bookinfo-gateway
spec:
selector:
istio: ingressgateway
servers:
- port:
number: 443
name: https
protocol: HTTPS
tls:
mode: SIMPLE
credentialName: bookinfo-cert
hosts:
- "bookinfo.example.com"
2.6 PeerAuthentication
PeerAuthentication defines the mTLS policy between services.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: istio-system
spec:
# Apply STRICT mTLS across the entire mesh
mtls:
mode: STRICT
---
# PERMISSIVE mode for a specific namespace only
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: legacy-compat
namespace: legacy-apps
spec:
mtls:
mode: PERMISSIVE
2.7 AuthorizationPolicy
AuthorizationPolicy defines access control between services.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: reviews-viewer
namespace: default
spec:
selector:
matchLabels:
app: reviews
action: ALLOW
rules:
- from:
- source:
principals: ["cluster.local/ns/default/sa/productpage"]
to:
- operation:
methods: ["GET"]
paths: ["/reviews/*"]
3. Linkerd Deep Dive
3.1 Linkerd Architecture Overview
Linkerd is a Service Mesh focused on simplicity and low overhead. Developed by Buoyant, it is a CNCF graduated project.
# Install the Linkerd CLI
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | sh
export PATH=$HOME/.linkerd2/bin:$PATH
# Pre-flight checks
linkerd check --pre
# Install
linkerd install --crds | kubectl apply -f -
linkerd install | kubectl apply -f -
# Verify
linkerd check
# Viz extension (dashboard + metrics)
linkerd viz install | kubectl apply -f -
3.2 linkerd2-proxy: A Micro-Proxy Written in Rust
Linkerd's key differentiator is its data plane proxy. linkerd2-proxy is written in Rust, which brings the following advantages.
Performance comparison (linkerd2-proxy vs Envoy)
========================================
Memory usage:  ~20MB vs ~50-100MB
P99 latency:   ~1ms added vs ~2-5ms added
Binary size:   ~13MB vs ~50MB
Security:      Rust memory-safety guarantees
Feature scope: Service Mesh only vs general-purpose proxy
linkerd2-proxy stays lightweight by implementing only the features a Service Mesh needs. Since it is not a general-purpose proxy like Envoy, features such as Wasm extensions are absent, but it delivers excellent performance on the core functionality.
3.3 ServiceProfile
Linkerd's ServiceProfile defines per-service routing and observability settings.
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
name: webapp.default.svc.cluster.local
namespace: default
spec:
routes:
- name: GET /api/users
condition:
method: GET
pathRegex: /api/users
responseClasses:
- condition:
status:
min: 500
max: 599
isFailure: true
- name: POST /api/orders
condition:
method: POST
pathRegex: /api/orders
isRetryable: true
timeout: 10s
3.4 TrafficSplit (SMI)
Linkerd implements traffic splitting with the SMI (Service Mesh Interface) TrafficSplit standard. (The SMI project has since been archived; recent Linkerd releases favor Gateway API HTTPRoute for traffic splits.)
apiVersion: split.smi-spec.io/v1alpha4
kind: TrafficSplit
metadata:
name: webapp-split
namespace: default
spec:
service: webapp
backends:
- service: webapp-v1
weight: 900
- service: webapp-v2
weight: 100
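The weight values above are relative, so 900/100 yields a 90/10 split. A small Python sketch of the weighted selection a TrafficSplit implies (the function and backend names are hypothetical, not Linkerd code):

```python
import random

def pick_backend(backends, rng):
    """Weighted random choice over TrafficSplit-style backends.
    `backends` mirrors the YAML above: a list of (service, weight) pairs."""
    total = sum(weight for _, weight in backends)
    point = rng.uniform(0, total)
    cumulative = 0.0
    for service, weight in backends:
        cumulative += weight
        if point <= cumulative:
            return service
    return backends[-1][0]  # guard against float rounding

rng = random.Random(42)
backends = [("webapp-v1", 900), ("webapp-v2", 100)]
counts = {"webapp-v1": 0, "webapp-v2": 0}
for _ in range(10_000):
    counts[pick_backend(backends, rng)] += 1
# counts["webapp-v1"] lands close to 9,000: a 90/10 split
```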
3.5 Linkerd Multi-cluster
Linkerd supports multi-cluster communication natively.
# Install the multi-cluster extension
linkerd multicluster install | kubectl apply -f -
# Link a remote cluster
linkerd multicluster link --cluster-name=west \
--api-server-address="https://west.example.com:6443" | \
kubectl apply -f -
# Verify service mirroring
linkerd multicluster gateways
4. Istio vs Linkerd Detailed Comparison
| Dimension | Istio | Linkerd |
|---|---|---|
| Data plane proxy | Envoy (C++) | linkerd2-proxy (Rust) |
| Memory overhead (per Pod) | 50-100MB | 10-20MB |
| P99 latency added | 2-5ms | 0.5-1ms |
| Installation complexity | High (various profiles) | Low (single command) |
| CRD count | 50+ | Under 10 |
| Learning curve | Steep | Gentle |
| Traffic management | Very rich (VirtualService) | Basic (ServiceProfile) |
| Security policies | Fine-grained RBAC (AuthorizationPolicy) | Basic mTLS + Server/Authorization |
| Protocol support | HTTP, gRPC, TCP, WebSocket | HTTP, gRPC, TCP |
| Wasm extensions | Supported | Not supported |
| Multi-cluster | Supported (complex) | Supported (relatively simple) |
| Ambient Mesh | Supported (sidecar-less mode) | N/A |
| Gateway API | Full support | Partial support |
| Community size | Very large (CNCF graduated) | Large (CNCF graduated) |
| Operational complexity | High | Low |
| Best for | Large scale, complex policies | Small to medium scale, simplicity preferred |
Selection Criteria Summary
Choose Istio when:
- You need fine-grained traffic management (weighted routing, fault injection, traffic mirroring)
- You need complex security policies (JWT validation, external authorization)
- You need Wasm-based extension plugins
- You want to use Ambient Mesh (sidecar-less mode)
Choose Linkerd when:
- Minimizing resource overhead is a priority
- You want fast adoption and simple operations
- The core features (mTLS, metrics, retries) are sufficient
- Your operations team is small
5. mTLS (Mutual TLS)
5.1 How mTLS Works
In a Service Mesh, mTLS automatically encrypts service-to-service communication.
Service A (client)              Service B (server)
│                               │
│── ClientHello ───────────────►│
│◄─ ServerHello + server cert ──│
│── client certificate ────────►│
│◄─ certificate verified ───────│
│                               │
│◄════ encrypted traffic ══════►│
The key difference from regular TLS: in mTLS, both sides present and verify certificates, which lets the server confirm the client's identity as well.
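Python's standard ssl module shows the switch concisely: a server-side context performs plain TLS by default and becomes mutual only when told to require a client certificate. A minimal sketch (the certificate file paths in the comments are placeholders):

```python
import ssl

def mtls_server_context():
    """Server-side context that also verifies the client's certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Plain TLS default on the server side: verify_mode == ssl.CERT_NONE
    ctx.verify_mode = ssl.CERT_REQUIRED  # mTLS: demand a client certificate
    # A real deployment would also load key material, e.g.:
    #   ctx.load_cert_chain("server.crt", "server.key")
    #   ctx.load_verify_locations("ca.crt")  # CA that signed client certs
    return ctx

ctx = mtls_server_context()
print(ctx.verify_mode is ssl.CERT_REQUIRED)  # True
```

In a mesh, the sidecar proxy builds the equivalent of this context on both ends, with certificates issued and rotated by the control plane.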
5.2 The SPIFFE Identity Framework
Both Istio and Linkerd use the SPIFFE (Secure Production Identity Framework For Everyone) standard.
SPIFFE ID format:
spiffe://cluster.local/ns/NAMESPACE/sa/SERVICE_ACCOUNT
Examples:
spiffe://cluster.local/ns/production/sa/frontend
spiffe://cluster.local/ns/production/sa/backend-api
A SPIFFE ID maps to a Kubernetes ServiceAccount, proving a Pod's identity at the network level.
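The mapping is mechanical, which a tiny helper can illustrate; these functions are written for this article, not part of any SPIFFE SDK:

```python
def spiffe_id(namespace, service_account, trust_domain="cluster.local"):
    """Build the SPIFFE ID for a Kubernetes ServiceAccount."""
    return f"spiffe://{trust_domain}/ns/{namespace}/sa/{service_account}"

def parse_spiffe_id(uri):
    """Split a SPIFFE ID into (trust_domain, namespace, service_account)."""
    rest = uri.removeprefix("spiffe://")
    trust_domain, _, namespace, _, service_account = rest.split("/")
    return trust_domain, namespace, service_account

assert spiffe_id("production", "frontend") == \
    "spiffe://cluster.local/ns/production/sa/frontend"
```

Authorization policies like the AuthorizationPolicy examples later in this guide match against exactly these principals.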
5.3 Automatic Certificate Rotation
# Istio: certificate lifetime configuration (MeshConfig)
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
meshConfig:
defaultConfig:
# Workload certificates default to 24 hours
# Customizable via proxyMetadata
certificates: []
values:
pilot:
env:
# Maximum certificate lifetime
MAX_WORKLOAD_CERT_TTL: "48h"
# Default certificate lifetime
DEFAULT_WORKLOAD_CERT_TTL: "24h"
Linkerd certificate management:
# Create the trust anchor (10-year lifetime)
step certificate create root.linkerd.cluster.local ca.crt ca.key \
--profile root-ca --no-password --insecure --not-after=87600h
# Create the issuer certificate (48-hour lifetime, auto-renewed)
step certificate create identity.linkerd.cluster.local issuer.crt issuer.key \
--profile intermediate-ca --not-after=48h --no-password --insecure \
--ca ca.crt --ca-key ca.key
# Install with the certificates
linkerd install \
--identity-trust-anchors-file ca.crt \
--identity-issuer-certificate-file issuer.crt \
--identity-issuer-key-file issuer.key | kubectl apply -f -
6. Traffic Management
6.1 Canary Releases
# Istio - progressive canary deployment
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: reviews
spec:
hosts:
- reviews
http:
- match:
- headers:
x-canary-user:
exact: "true"
route:
- destination:
host: reviews
subset: v2
- route:
- destination:
host: reviews
subset: v1
weight: 95
- destination:
host: reviews
subset: v2
weight: 5
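Rule order in the VirtualService above matters: the header match is evaluated first, and only unmatched traffic falls through to the weighted split. A Python sketch of that decision (subset names and the header come from the YAML; the logic is simplified):

```python
import random

def route(headers, rng=None):
    """First rule: an exact header match always selects the canary subset.
    Fallback rule: a weighted 95/5 split, as in the VirtualService above."""
    rng = rng or random.Random()
    if headers.get("x-canary-user") == "true":
        return "reviews-v2"
    return "reviews-v2" if rng.random() < 0.05 else "reviews-v1"

# Header-matched requests always hit the canary subset
assert route({"x-canary-user": "true"}) == "reviews-v2"
```

This is why internal testers can be pinned to v2 with a header while only 5% of anonymous traffic reaches it.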
Automated canary analysis with Flagger:
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: reviews
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: reviews
service:
port: 9080
analysis:
interval: 1m
threshold: 5
maxWeight: 50
stepWeight: 10
metrics:
- name: request-success-rate
thresholdRange:
min: 99
interval: 1m
- name: request-duration
thresholdRange:
max: 500
interval: 1m
6.2 Traffic Mirroring (Shadow Traffic)
Mirroring sends a copy of production traffic to a new version so its behavior can be validated under real conditions.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: reviews-mirror
spec:
hosts:
- reviews
http:
- route:
- destination:
host: reviews
subset: v1
mirror:
host: reviews
subset: v2
mirrorPercentage:
value: 100.0
Key characteristics of mirroring:
- Responses to mirrored traffic are discarded (no impact on the client)
- A -shadow suffix is appended to the Host header of mirrored requests
- The new version's performance and error rate can be validated with real traffic
6.3 Fault Injection
Fault injection deliberately introduces failures to test the system's resilience.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: ratings-fault
spec:
hosts:
- ratings
http:
- fault:
delay:
percentage:
value: 10
fixedDelay: 5s
abort:
percentage:
value: 5
httpStatus: 503
route:
- destination:
host: ratings
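The fault block above can be read as two independent dice rolls per request. A Python sketch under that interpretation (the percentages mirror the YAML; the function itself is illustrative):

```python
import random
import time

def inject_fault(rng, delay_pct=10, fixed_delay_s=5, abort_pct=5):
    """Mirror the fault block above: a 10% chance of a fixed delay and an
    independent 5% chance of aborting with HTTP 503.
    Returns a status code to abort with, or None to forward the request."""
    if rng.uniform(0, 100) < delay_pct:
        time.sleep(fixed_delay_s)   # injected latency
    if rng.uniform(0, 100) < abort_pct:
        return 503                  # injected abort
    return None

rng = random.Random(1)
aborted = sum(inject_fault(rng, fixed_delay_s=0) == 503 for _ in range(1000))
# aborted lands near 50, i.e. about 5% of requests
```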
6.4 Circuit Breaking
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: reviews-circuit-breaker
spec:
host: reviews
trafficPolicy:
connectionPool:
tcp:
maxConnections: 50
http:
http1MaxPendingRequests: 100
http2MaxRequests: 100
maxRequestsPerConnection: 10
outlierDetection:
consecutive5xxErrors: 3
interval: 10s
baseEjectionTime: 30s
maxEjectionPercent: 30
minHealthPercent: 70
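The consecutive5xxErrors logic can be sketched in a few lines of Python. This toy detector ignores ejection timing (interval, baseEjectionTime, maxEjectionPercent) and shows only the streak counting:

```python
class OutlierDetector:
    """Eject a host after N consecutive 5xx responses, in the spirit of
    the outlierDetection block above (ejection timing is omitted)."""
    def __init__(self, consecutive_5xx=3):
        self.threshold = consecutive_5xx
        self.streak = {}      # host -> current consecutive 5xx count
        self.ejected = set()

    def record(self, host, status):
        if 500 <= status <= 599:
            self.streak[host] = self.streak.get(host, 0) + 1
            if self.streak[host] >= self.threshold:
                self.ejected.add(host)
        else:
            self.streak[host] = 0  # any success resets the streak

    def healthy(self, host):
        return host not in self.ejected

d = OutlierDetector()
for status in (503, 503, 503):
    d.record("reviews-1", status)
print(d.healthy("reviews-1"))  # False
```

In the real implementation, ejected hosts return to the pool after baseEjectionTime, and maxEjectionPercent caps how many hosts can be ejected at once.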
6.5 Retries and Timeouts
# Istio
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: reviews-retry
spec:
hosts:
- reviews
http:
- timeout: 10s
retries:
attempts: 3
perTryTimeout: 3s
retryOn: 5xx,reset,connect-failure,retriable-4xx
route:
- destination:
host: reviews
# Linkerd ServiceProfile
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
name: reviews.default.svc.cluster.local
spec:
routes:
- name: GET /reviews
condition:
method: GET
pathRegex: /reviews/.*
isRetryable: true
timeout: 10s
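Both snippets above describe the same client-side behavior: a bounded number of retries, each with its own budget, all inside an overall deadline. A Python sketch of that loop (the function and the flaky backend are invented for illustration):

```python
import time

def call_with_retries(fn, attempts=3, per_try_timeout=3.0, overall_timeout=10.0):
    """Up to `attempts` tries, each limited by its own budget, all bounded
    by the overall deadline, as in the VirtualService retries block."""
    deadline = time.monotonic() + overall_timeout
    last_exc = None
    for _ in range(attempts):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            return fn(timeout=min(per_try_timeout, remaining))
        except Exception as exc:   # stands in for 5xx / reset / connect-failure
            last_exc = exc
    raise last_exc or TimeoutError("overall timeout exceeded")

calls = {"n": 0}
def flaky(timeout):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("connect-failure")
    return "200 OK"

print(call_with_retries(flaky))  # 200 OK
```

The mesh's advantage is that this loop lives in the proxy, applied uniformly, instead of being reimplemented in every service.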
7. Observability
7.1 Metrics (Prometheus)
A Service Mesh automatically collects the following metrics.
Golden Signals
================================
1. Latency: request processing time
2. Traffic: requests per second
3. Errors: ratio of failed requests
4. Saturation: resource utilization
Key Istio metrics:
- istio_requests_total: total request count (by source, destination, response code)
- istio_request_duration_milliseconds: request duration
- istio_request_bytes / istio_response_bytes: request/response sizes
Key Linkerd metrics:
- request_total: total request count
- response_latency_ms: response latency
- tcp_open_total: open TCP connections
# Prometheus scrape configuration (Istio)
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
data:
prometheus.yml: |
scrape_configs:
- job_name: 'envoy-stats'
metrics_path: /stats/prometheus
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
7.2 Distributed Tracing (Jaeger / Zipkin)
The mesh's proxies emit trace spans automatically, allowing a request's full path across services to be reconstructed.
# Istio telemetry configuration
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
name: mesh-default
namespace: istio-system
spec:
tracing:
- providers:
- name: jaeger
randomSamplingPercentage: 10
customTags:
environment:
literal:
value: "production"
Important: applications must forward the following headers on their outbound calls (the proxy generates them, but carrying them from inbound to outbound requests is the application's responsibility).
Tracing headers to propagate:
- x-request-id
- x-b3-traceid
- x-b3-spanid
- x-b3-parentspanid
- x-b3-sampled
- x-b3-flags
- traceparent (W3C Trace Context)
- tracestate
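A sketch of the propagation an application must perform, in Python: copy exactly these headers from the inbound request onto every outbound call. The dict-based request representation is illustrative; web frameworks expose equivalent header maps.

```python
TRACE_HEADERS = frozenset([
    "x-request-id", "x-b3-traceid", "x-b3-spanid", "x-b3-parentspanid",
    "x-b3-sampled", "x-b3-flags", "traceparent", "tracestate",
])

def propagate(inbound_headers):
    """Return only the tracing headers, to attach to outbound requests.
    Header names are matched case-insensitively, as HTTP requires."""
    return {name: value for name, value in inbound_headers.items()
            if name.lower() in TRACE_HEADERS}

outbound = propagate({"X-B3-TraceId": "80f1", "Cookie": "session=1"})
print(outbound)  # {'X-B3-TraceId': '80f1'}
```

Tracing libraries such as OpenTelemetry SDKs do this automatically once instrumented, which is usually preferable to hand-rolled copying.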
7.3 Kiali Dashboard
Kiali is an observability dashboard purpose-built for Istio.
# Install Kiali
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/addons/kiali.yaml
# Open the dashboard
istioctl dashboard kiali
Key Kiali features:
- Service topology graph visualization
- Real-time traffic flow monitoring
- Istio configuration validation and error detection
- Distributed tracing integration
- Metric-based health indicators
7.4 Grafana Dashboards
# Install Grafana with preconfigured dashboards
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/addons/grafana.yaml
# Open the dashboard
istioctl dashboard grafana
Key dashboards:
- Mesh Dashboard: overview of mesh-wide traffic
- Service Dashboard: per-service metrics
- Workload Dashboard: per-workload details
- Performance Dashboard: P50/P90/P99 latency
8. Kubernetes Gateway API
8.1 What Is the Gateway API?
The Kubernetes Gateway API is the next-generation traffic management standard that succeeds Ingress. Its role-oriented design cleanly separates the responsibilities of infrastructure, cluster, and application owners.
# GatewayClass: defined by the infrastructure administrator
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
name: istio
spec:
controllerName: istio.io/gateway-controller
---
# Gateway: defined by the cluster administrator
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
name: bookinfo-gateway
spec:
gatewayClassName: istio
listeners:
- name: https
protocol: HTTPS
port: 443
tls:
mode: Terminate
certificateRefs:
- name: bookinfo-tls
allowedRoutes:
namespaces:
from: Selector
selector:
matchLabels:
expose: "true"
---
# HTTPRoute: defined by the application developer
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
name: bookinfo-route
spec:
parentRefs:
- name: bookinfo-gateway
hostnames:
- "bookinfo.example.com"
rules:
- matches:
- path:
type: PathPrefix
value: /reviews
backendRefs:
- name: reviews
port: 9080
weight: 90
- name: reviews-v2
port: 9080
weight: 10
8.2 Istio Gateway vs Kubernetes Gateway API
Classic Istio approach:
Gateway + VirtualService + DestinationRule
Kubernetes Gateway API approach:
GatewayClass + Gateway + HTTPRoute
Benefits:
- A standardized API (portable across implementations)
- Role-based separation of responsibilities
- Better namespace isolation
- The same API works with Istio, Linkerd, Cilium, and others
9. Ambient Mesh
9.1 Limitations of Sidecars
Problems with the classic sidecar approach:
- An extra 50-100MB of memory per Pod
- A proxy hop added to every request (latency)
- Pod restarts required for sidecar injection
- Resource over-provisioning
9.2 Ambient Mesh Architecture
Istio's Ambient Mesh is a new mode that implements the service mesh without sidecars.
Classic sidecar mode:
┌────────────┐    ┌────────────┐
│ App + Envoy│───►│ App + Envoy│
└────────────┘    └────────────┘
Ambient Mesh mode:
┌────────────┐    ┌────────────┐
│    App     │    │    App     │
└─────┬──────┘    └──────┬─────┘
      │                  │
┌─────┴──────────────────┴─────┐ ← ztunnel (one per node, L4)
└──────────────┬───────────────┘
               │
        ┌──────┴──────┐ ← waypoint proxy (optional, L7)
        │  Waypoint   │
        └─────────────┘
ztunnel (Zero Trust Tunnel):
- Runs as one DaemonSet pod per node
- Handles L4 only: mTLS and basic authorization
- Written in Rust, very lightweight
- No Pod restarts required
Waypoint Proxy:
- Deployed only where L7 features are needed
- Can be deployed per namespace or per service
- Envoy-based, providing the full L7 feature set
# Install Istio in ambient mode
istioctl install --set profile=ambient -y
# Add a namespace to the ambient mesh
kubectl label namespace default istio.io/dataplane-mode=ambient
# Deploy a waypoint proxy (when L7 features are needed)
istioctl waypoint apply --namespace default --name reviews-waypoint
9.3 Benefits of Ambient Mesh
Resource savings (100-Pod cluster):
========================================
              Sidecar mode          Ambient mode
Memory:       5-10GB extra          200-500MB extra
CPU:          significant overhead  minimal overhead
Operations:   manage sidecars       manage only the ztunnel DaemonSet
Upgrades:     Pod restarts needed   rolling update of ztunnel
10. Security Deep Dive
10.1 RBAC (Role-Based Access Control)
# Namespace-level deny-all policy
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: deny-all
namespace: production
spec:
# An empty spec (no rules) denies all requests
{}
---
# Allow only a specific service
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: allow-frontend-to-api
namespace: production
spec:
selector:
matchLabels:
app: api-server
action: ALLOW
rules:
- from:
- source:
namespaces: ["production"]
principals: ["cluster.local/ns/production/sa/frontend"]
to:
- operation:
methods: ["GET", "POST"]
paths: ["/api/v1/*"]
when:
- key: request.headers[x-api-version]
values: ["v1", "v2"]
10.2 JWT Validation
# RequestAuthentication: defines JWT validation
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
name: jwt-auth
namespace: production
spec:
selector:
matchLabels:
app: api-server
jwtRules:
- issuer: "https://auth.example.com"
jwksUri: "https://auth.example.com/.well-known/jwks.json"
forwardOriginalToken: true
outputPayloadToHeader: "x-jwt-payload"
---
# Authorization based on JWT claims
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: require-jwt
namespace: production
spec:
selector:
matchLabels:
app: api-server
action: ALLOW
rules:
- from:
- source:
requestPrincipals: ["https://auth.example.com/*"]
when:
- key: request.auth.claims[role]
values: ["admin", "editor"]
10.3 External Authorization
# Integrating an external authorization service
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: ext-authz
namespace: production
spec:
selector:
matchLabels:
app: api-server
action: CUSTOM
provider:
name: "opa-ext-authz"
rules:
- to:
- operation:
paths: ["/admin/*"]
# Register the external authorization provider in MeshConfig
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
meshConfig:
extensionProviders:
- name: "opa-ext-authz"
envoyExtAuthzGrpc:
service: "opa.opa-system.svc.cluster.local"
port: 9191
includeRequestBodyInCheck:
maxRequestBytes: 1024
11. Production Operations Best Practices
11.1 Resource Limits
# Istio sidecar resource limits
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
meshConfig:
defaultConfig:
concurrency: 2
values:
global:
proxy:
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 256Mi
11.2 Progressive Rollout Strategy
# Step 1: PERMISSIVE mTLS (plaintext traffic still accepted)
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: default
spec:
mtls:
mode: PERMISSIVE
EOF
# Step 2: monitor metrics (check the share of mTLS traffic)
# inspect connection_security_policy on the istio_requests_total metric
# Step 3: switch to STRICT mTLS
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: default
spec:
mtls:
mode: STRICT
EOF
11.3 Debugging Tools
# Check Istio proxy status
istioctl proxy-status
# Dump the Envoy configuration
istioctl proxy-config all POD_NAME -o json
# Inspect routing rules
istioctl proxy-config route POD_NAME
# Inspect cluster configuration
istioctl proxy-config cluster POD_NAME
# Analyzer (detects configuration errors)
istioctl analyze --all-namespaces
# Linkerd diagnostics
linkerd check
linkerd diagnostics proxy-metrics POD_NAME
linkerd viz stat deploy
linkerd viz top deploy/webapp
linkerd viz tap deploy/webapp
11.4 Horizontal Pod Autoscaler Integration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: reviews-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: reviews
minReplicas: 3
maxReplicas: 20
metrics:
- type: Pods
pods:
metric:
name: istio_requests_per_second
target:
type: AverageValue
averageValue: "100"
- type: Pods
pods:
metric:
name: istio_request_duration_milliseconds_p99
target:
type: AverageValue
averageValue: "500"
11.5 Upgrade Strategy
# Istio canary upgrade
# 1. Install the new control plane version (revision-based)
istioctl install --set revision=1-24-0
# 2. Migrate gradually by relabeling namespaces
kubectl label namespace default istio.io/rev=1-24-0 --overwrite
# 3. Restart Pods to pick up the new proxy
kubectl rollout restart deployment -n default
# 4. Remove the old version
istioctl uninstall --revision 1-23-0
12. When Not to Use a Service Mesh
A Service Mesh is powerful, but it is not the right fit for every environment.
Situations where you should not use one:
- Few services: with five or fewer services, the mesh's complexity can outweigh its benefits.
- The team is new to Kubernetes: a Service Mesh is extra complexity layered on top of Kubernetes.
- Severely constrained resources: when the sidecar proxies' memory/CPU overhead cannot be absorbed.
- Extreme performance requirements: environments like HFT (high-frequency trading) where microsecond-level latency matters.
Consider alternatives:
- Just mTLS: cert-manager + TLS in the services themselves
- Basic observability: direct OpenTelemetry instrumentation
- Simple load balancing: Kubernetes Service (ClusterIP)
- Ingress only: NGINX Ingress Controller or Traefik
- Network policy: Kubernetes NetworkPolicy or Cilium
Quiz
Q1: Explain the roles of the data plane and the control plane in a Service Mesh.
Data plane: the set of sidecar proxies that intercept and handle actual service traffic. They perform mTLS encryption, load balancing, metrics collection, retries/timeouts, and more. Istio uses Envoy; Linkerd uses linkerd2-proxy.
Control plane: centrally manages and configures the data plane proxies. It handles service discovery, certificate issuance, and policy distribution. Istio's control plane is Istiod; Linkerd's consists of the destination/identity/proxy-injector components.
Q2: What is the key difference between mTLS and regular TLS?
In regular TLS, only the client verifies the server's certificate. In mTLS (mutual TLS), both sides present and verify certificates.
- The client verifies the server's certificate (same as regular TLS)
- The server also verifies the client's certificate (the extra step in mTLS)
- This enables bidirectional identity verification between services
- The SPIFFE standard maps a service's identity to its Kubernetes ServiceAccount
Q3: Explain the problems Istio's Ambient Mesh solves and its architecture.
Problems solved: the classic sidecar approach costs 50-100MB of memory per Pod, requires Pod restarts for sidecar injection, and adds a proxy hop to every request.
Architecture:
- ztunnel: an L4 proxy running as one DaemonSet pod per node. Written in Rust, very lightweight, handling only mTLS and basic authorization.
- Waypoint Proxy: deployed optionally, only where L7 features are needed. Envoy-based, providing the full L7 feature set including traffic management.
For a 100-Pod cluster, memory usage drops sharply, from 5-10GB (sidecars) to 200-500MB (ambient).
Q4: In which situations should you choose Istio versus Linkerd?
Choose Istio for:
- Fine-grained traffic management (weighted routing, fault injection, mirroring)
- Complex security policies (JWT validation, external authorization, RBAC)
- Wasm extension plugins
- Ambient Mesh (sidecar-less mode)
Choose Linkerd for:
- Minimal resource overhead (10-20MB per Pod)
- Fast adoption and simple operations
- When the core features (mTLS, metrics, retries) are enough
- Small operations teams
Q5: When should you not adopt a Service Mesh?
- Five or fewer services: the complexity outweighs the benefits
- The team is new to Kubernetes: a mesh adds another layer of complexity
- Severe resource constraints: the sidecar memory/CPU overhead cannot be absorbed
- Extreme low-latency requirements: environments where microsecond-level latency matters (e.g. HFT)
Alternatives: cert-manager (mTLS), OpenTelemetry (observability), Kubernetes NetworkPolicy (network security), NGINX Ingress (ingress)
참고 자료
Service Mesh Complete Guide 2025: Istio vs Linkerd, mTLS, Traffic Management, Observability
Introduction: Why Service Mesh?
As microservices architecture has become the standard, environments with dozens to hundreds of services communicating over the network are now commonplace. Several recurring problems emerge in this complex inter-service communication landscape.
Security concerns: Without encryption, service-to-service communication is vulnerable to eavesdropping even within internal networks. Implementing TLS individually in every service and managing certificates is a massive operational burden.
Observability gaps: When requests traverse multiple services, identifying where latency occurs or which service returns errors becomes incredibly difficult.
Traffic control challenges: Advanced traffic management like canary deployments, A/B testing, and circuit breakers must be implemented directly in application code.
Service Mesh solves all of these problems at the infrastructure layer. Without modifying a single line of application code, you can transparently add security, observability, and traffic control at the network level.
1. Service Mesh Architecture
A Service Mesh consists of two main planes.
1.1 Data Plane
The data plane is the collection of proxies that handle actual service traffic. Deployed as sidecars in each service Pod, they intercept all inbound/outbound traffic.
┌─────────────────────────────────────────────┐
│ Pod │
│ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Application │◄──►│ Sidecar Proxy │ │
│ │ Container │ │ (Envoy/linkerd2) │ │
│ └─────────────┘ └─────────────────────┘ │
└─────────────────────────────────────────────┘
Key responsibilities of sidecar proxies:
- Transparently intercept all traffic (using iptables rules)
- Perform mTLS encryption/decryption
- Load balancing (round robin, least connections, etc.)
- Collect metrics and propagate distributed tracing headers
- Apply retries, timeouts, and circuit breaking
1.2 Control Plane
The control plane centrally manages and configures the data plane proxies.
Istio Control Plane (Istiod):
# Key functions managed by Istiod
- Service discovery: Syncs service list from Kubernetes API
- Configuration distribution: Converts VirtualService, DestinationRule to Envoy config
- Certificate management: Issues/renews mTLS certificates (built-in CA)
- Policy enforcement: Distributes AuthorizationPolicy, PeerAuthentication
Linkerd Control Plane:
# Linkerd control plane components
- destination: Service discovery + policy distribution
- identity: mTLS certificate issuance (trust anchor based)
- proxy-injector: Automatic sidecar injection on Pod creation
- heartbeat: Telemetry collection
2. Istio Deep Dive
2.1 Istio Architecture Overview
Istio is the most feature-rich Service Mesh. Co-developed by Google, IBM, and Lyft, it is now a CNCF graduated project.
# Install Istio (istioctl)
curl -L https://istio.io/downloadIstio | sh -
cd istio-1.24.0
export PATH=$PWD/bin:$PATH
# Profile-based installation
istioctl install --set profile=demo -y
# Enable automatic sidecar injection for namespace
kubectl label namespace default istio-injection=enabled
2.2 Envoy Sidecar Proxy
Istio's data plane uses Envoy proxy. Envoy is a high-performance L4/L7 proxy written in C++.
# Envoy core features
- HTTP/1.1, HTTP/2, gRPC support
- Automatic retries and circuit breaking
- Dynamic configuration updates (xDS API)
- Rich metrics and tracing
- WebAssembly (Wasm) extension support
- Hot restart (graceful restart)
Memory overhead is approximately 40-100MB per Pod, with CPU overhead in the low milliseconds per request range.
2.3 VirtualService
VirtualService is the core resource for defining traffic routing rules in Istio.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: reviews-route
spec:
hosts:
- reviews
http:
# Canary deployment: 90% v1, 10% v2
- route:
- destination:
host: reviews
subset: v1
weight: 90
- destination:
host: reviews
subset: v2
weight: 10
timeout: 5s
retries:
attempts: 3
perTryTimeout: 2s
retryOn: 5xx,reset,connect-failure
2.4 DestinationRule
DestinationRule defines policies applied to traffic after routing decisions are made.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: reviews-destination
spec:
host: reviews
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
h2UpgradePolicy: DEFAULT
http1MaxPendingRequests: 100
http2MaxRequests: 1000
outlierDetection:
consecutive5xxErrors: 5
interval: 30s
baseEjectionTime: 30s
maxEjectionPercent: 50
loadBalancer:
simple: LEAST_REQUEST
subsets:
- name: v1
labels:
version: v1
- name: v2
labels:
version: v2
2.5 Gateway
Istio Gateway manages traffic entering the mesh from external sources.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
name: bookinfo-gateway
spec:
selector:
istio: ingressgateway
servers:
- port:
number: 443
name: https
protocol: HTTPS
tls:
mode: SIMPLE
credentialName: bookinfo-cert
hosts:
- "bookinfo.example.com"
2.6 PeerAuthentication
PeerAuthentication defines mTLS policies between services.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: istio-system
spec:
# Apply STRICT mTLS across the mesh
mtls:
mode: STRICT
---
# PERMISSIVE mode for specific namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: legacy-compat
namespace: legacy-apps
spec:
mtls:
mode: PERMISSIVE
2.7 AuthorizationPolicy
AuthorizationPolicy defines access control between services.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: reviews-viewer
namespace: default
spec:
selector:
matchLabels:
app: reviews
action: ALLOW
rules:
- from:
- source:
principals: ["cluster.local/ns/default/sa/productpage"]
to:
- operation:
methods: ["GET"]
paths: ["/reviews/*"]
3. Linkerd Deep Dive
3.1 Linkerd Architecture Overview
Linkerd is a Service Mesh focused on simplicity and lightness. Developed by Buoyant, it is a CNCF graduated project.
# Install Linkerd CLI
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | sh
export PATH=$HOME/.linkerd2/bin:$PATH
# Pre-flight checks
linkerd check --pre
# Install
linkerd install --crds | kubectl apply -f -
linkerd install | kubectl apply -f -
# Verify
linkerd check
# Viz extension (dashboard + metrics)
linkerd viz install | kubectl apply -f -
3.2 linkerd2-proxy: Micro-Proxy Written in Rust
Linkerd's key differentiator is its data plane proxy. linkerd2-proxy is written in Rust, offering these advantages:
Performance Comparison (linkerd2-proxy vs Envoy)
========================================
Memory usage: ~20MB vs ~50-100MB
P99 latency: ~1ms added vs ~2-5ms added
Binary size: ~13MB vs ~50MB
Security: Rust memory safety guaranteed
Feature scope: Service Mesh dedicated vs general-purpose proxy
linkerd2-proxy achieves its lightweight footprint by implementing only the features needed for Service Mesh. Unlike Envoy, it is not a general-purpose proxy, so features like Wasm extensions are absent, but it delivers excellent performance for core functionality.
3.3 ServiceProfile
Linkerd's ServiceProfile defines per-service routing and observability settings.
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
name: webapp.default.svc.cluster.local
namespace: default
spec:
routes:
- name: GET /api/users
condition:
method: GET
pathRegex: /api/users
responseClasses:
- condition:
status:
min: 500
max: 599
isFailure: true
- name: POST /api/orders
condition:
method: POST
pathRegex: /api/orders
isRetryable: true
timeout: 10s
3.4 TrafficSplit (SMI)
Linkerd uses the SMI (Service Mesh Interface) standard for traffic splitting.
apiVersion: split.smi-spec.io/v1alpha4
kind: TrafficSplit
metadata:
name: webapp-split
namespace: default
spec:
service: webapp
backends:
- service: webapp-v1
weight: 900
- service: webapp-v2
weight: 100
3.5 Linkerd Multi-cluster
Linkerd natively supports multi-cluster communication.
# Install multi-cluster extension
linkerd multicluster install | kubectl apply -f -
# Link remote cluster
linkerd multicluster link --cluster-name=west \
--api-server-address="https://west.example.com:6443" | \
kubectl apply -f -
# Verify service mirroring
linkerd multicluster gateways
4. Istio vs Linkerd Detailed Comparison
| Dimension | Istio | Linkerd |
|---|---|---|
| Data Plane Proxy | Envoy (C++) | linkerd2-proxy (Rust) |
| Memory Overhead (per Pod) | 50-100MB | 10-20MB |
| P99 Latency Added | 2-5ms | 0.5-1ms |
| Installation Complexity | High (various profiles) | Low (single command) |
| CRD Count | 50+ | Under 10 |
| Learning Curve | Steep | Gradual |
| Traffic Management | Very rich (VirtualService) | Basic (ServiceProfile) |
| Security Policies | Fine-grained RBAC (AuthorizationPolicy) | Basic mTLS + Server/Authorization |
| Protocol Support | HTTP, gRPC, TCP, WebSocket | HTTP, gRPC, TCP |
| Wasm Extensions | Supported | Not supported |
| Multi-cluster | Supported (complex) | Supported (relatively simple) |
| Ambient Mesh | Supported (sidecar-less mode) | N/A |
| Gateway API | Full support | Partial support |
| Community Size | Very large (CNCF graduated) | Large (CNCF graduated) |
| Operational Complexity | High | Low |
| Best For | Large scale, complex policies | Small-medium, simplicity preferred |
Selection Criteria Summary
Choose Istio when:
- You need fine-grained traffic management (weighted routing, fault injection, traffic mirroring)
- Complex security policies are required (JWT validation, external authorization)
- Wasm-based extension plugins are needed
- You want to use Ambient Mesh (sidecar-less mode)
Choose Linkerd when:
- Minimizing resource overhead is a priority
- You want fast adoption and simple operations
- Core features (mTLS, metrics, retries) are sufficient
- Your operations team is small
5. mTLS (Mutual TLS)
5.1 How mTLS Works
In a Service Mesh, mTLS automatically encrypts service-to-service communication.
Service A (client) Service B (server)
| |
|-- ClientHello -----------> |
|<- ServerHello + ServerCert |
|-- Client Certificate ----> |
|<- Certificate Verified --- |
| |
|<=== Encrypted Traffic ===> |
The key difference from regular TLS: in mTLS, both sides present and verify certificates, enabling the server to verify the client's identity as well.
5.2 SPIFFE Identity Framework
Both Istio and Linkerd use the SPIFFE (Secure Production Identity Framework For Everyone) standard.
SPIFFE ID format:
spiffe://cluster.local/ns/NAMESPACE/sa/SERVICE_ACCOUNT
Examples:
spiffe://cluster.local/ns/production/sa/frontend
spiffe://cluster.local/ns/production/sa/backend-api
SPIFFE IDs map to Kubernetes ServiceAccounts, proving Pod identity at the network level.
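A small helper illustrating how such IDs are assembled and parsed (an illustrative sketch, not an API from either mesh):

```python
def spiffe_id(trust_domain: str, namespace: str, service_account: str) -> str:
    """Build the SPIFFE ID a mesh derives from a Kubernetes ServiceAccount."""
    return f"spiffe://{trust_domain}/ns/{namespace}/sa/{service_account}"

def parse_spiffe_id(uri: str) -> dict:
    """Split a workload SPIFFE ID back into its components."""
    if not uri.startswith("spiffe://"):
        raise ValueError("not a SPIFFE ID")
    trust_domain, _, path = uri[len("spiffe://"):].partition("/")
    parts = path.split("/")
    # Expect the workload path shape: ns/<namespace>/sa/<service-account>
    if len(parts) != 4 or parts[0] != "ns" or parts[2] != "sa":
        raise ValueError("unexpected workload path")
    return {
        "trust_domain": trust_domain,
        "namespace": parts[1],
        "service_account": parts[3],
    }
```

Because the identity is derived from the ServiceAccount, authorization policies can refer to workloads by these stable IDs instead of IP addresses.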
5.3 Automatic Certificate Rotation
# Istio: Certificate lifetime configuration (MeshConfig)
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
meshConfig:
defaultConfig:
# Default workload certificate is 24 hours
# Customizable through proxyMetadata
certificates: []
values:
pilot:
env:
# Maximum certificate lifetime
MAX_WORKLOAD_CERT_TTL: "48h"
# Default certificate lifetime
DEFAULT_WORKLOAD_CERT_TTL: "24h"
Linkerd certificate management:
# Create trust anchor (10-year lifetime)
step certificate create root.linkerd.cluster.local ca.crt ca.key \
--profile root-ca --no-password --insecure --not-after=87600h
# Create issuer certificate (48-hour lifetime, auto-renewed)
step certificate create identity.linkerd.cluster.local issuer.crt issuer.key \
--profile intermediate-ca --not-after=48h --no-password --insecure \
--ca ca.crt --ca-key ca.key
# Install with certificates
linkerd install \
--identity-trust-anchors-file ca.crt \
--identity-issuer-certificate-file issuer.crt \
--identity-issuer-key-file issuer.key | kubectl apply -f -
6. Traffic Management
6.1 Canary Releases
# Istio - Progressive canary deployment
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: reviews
spec:
hosts:
- reviews
http:
- match:
- headers:
x-canary-user:
exact: "true"
route:
- destination:
host: reviews
subset: v2
- route:
- destination:
host: reviews
subset: v1
weight: 95
- destination:
host: reviews
subset: v2
weight: 5
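The 95/5 split above can be simulated with a simple weighted picker (an illustrative sketch of the per-request decision, not Envoy's actual load-balancing algorithm):

```python
import random
from collections import Counter

def pick_subset(weights: dict, rng: random.Random) -> str:
    """Choose a destination subset in proportion to VirtualService weights."""
    r = rng.uniform(0, sum(weights.values()))
    upto = 0.0
    for subset, w in weights.items():
        upto += w
        if r <= upto:
            return subset
    return subset  # guard against float rounding at the upper edge

rng = random.Random(42)
counts = Counter(pick_subset({"v1": 95, "v2": 5}, rng) for _ in range(10_000))
# Roughly 95% of simulated requests land on v1, 5% on v2.
```

This also shows why weights are statistical: any individual request may still hit the canary.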
Automated canary with Flagger:
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: reviews
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: reviews
service:
port: 9080
analysis:
interval: 1m
threshold: 5
maxWeight: 50
stepWeight: 10
metrics:
- name: request-success-rate
thresholdRange:
min: 99
interval: 1m
- name: request-duration
thresholdRange:
max: 500
interval: 1m
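With `stepWeight: 10` and `maxWeight: 50`, Flagger walks the canary weight through a fixed schedule before promoting; a one-line sketch of that progression:

```python
def canary_weight_schedule(step_weight: int, max_weight: int) -> list:
    """Traffic weights the canary receives at each analysis interval."""
    return list(range(step_weight, max_weight + 1, step_weight))

# With the Canary spec above: steps of 10% up to a 50% ceiling.
schedule = canary_weight_schedule(10, 50)  # [10, 20, 30, 40, 50]
```

If any metric check fails more than `threshold` times along the way, Flagger rolls the weight back to zero instead of promoting.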
6.2 Traffic Mirroring (Shadow Traffic)
Send a copy of production traffic to a new version to validate behavior in a real environment.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: reviews-mirror
spec:
hosts:
- reviews
http:
- route:
- destination:
host: reviews
subset: v1
mirror:
host: reviews
subset: v2
mirrorPercentage:
value: 100.0
Key characteristics of mirroring:
- Responses from mirrored traffic are discarded (no client impact)
- The `-shadow` suffix is added to the `Host` header
- Validate performance and error rates of the new version with real traffic
6.3 Fault Injection
Intentionally inject failures to test system resilience.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: ratings-fault
spec:
hosts:
- ratings
http:
- fault:
delay:
percentage:
value: 10
fixedDelay: 5s
abort:
percentage:
value: 5
httpStatus: 503
route:
- destination:
host: ratings
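A rough simulation of the percentages above (this sketch assumes the delay and abort faults are each sampled independently per request, which is my reading of the spec; both can apply to the same request):

```python
import random

def inject_fault(rng: random.Random, delay_pct=10.0, abort_pct=5.0) -> list:
    """Per-request fault decision mirroring the VirtualService above."""
    faults = []
    if rng.uniform(0, 100) < delay_pct:
        faults.append(("delay", "5s"))   # hold the request for fixedDelay
    if rng.uniform(0, 100) < abort_pct:
        faults.append(("abort", 503))    # return httpStatus without calling upstream
    return faults

rng = random.Random(7)
results = [inject_fault(rng) for _ in range(10_000)]
```

Running a chaos experiment like this against staging quickly reveals whether callers actually honor their timeouts and retry budgets.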
6.4 Circuit Breaking
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: reviews-circuit-breaker
spec:
host: reviews
trafficPolicy:
connectionPool:
tcp:
maxConnections: 50
http:
http1MaxPendingRequests: 100
http2MaxRequests: 100
maxRequestsPerConnection: 10
outlierDetection:
consecutive5xxErrors: 3
interval: 10s
baseEjectionTime: 30s
maxEjectionPercent: 30
minHealthPercent: 70
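The `consecutive5xxErrors: 3` behavior can be sketched as a per-host error-streak counter (a simplification: real Envoy outlier detection also honors `interval`, `baseEjectionTime`, and `maxEjectionPercent`, all elided here):

```python
class OutlierDetector:
    """Minimal sketch of consecutive-5xx ejection."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive: dict = {}   # host -> current 5xx streak
        self.ejected: set = set()

    def record(self, host: str, status: int) -> None:
        if 500 <= status < 600:
            self.consecutive[host] = self.consecutive.get(host, 0) + 1
            if self.consecutive[host] >= self.threshold:
                self.ejected.add(host)  # stop routing to this endpoint
        else:
            self.consecutive[host] = 0  # any success resets the streak
```

The key design point is that ejection is per-endpoint: one bad Pod is removed from the load-balancing pool while healthy replicas keep serving.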
6.5 Retries and Timeouts
# Istio
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: reviews-retry
spec:
hosts:
- reviews
http:
- timeout: 10s
retries:
attempts: 3
perTryTimeout: 3s
retryOn: 5xx,reset,connect-failure,retriable-4xx
route:
- destination:
host: reviews
# Linkerd ServiceProfile
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
name: reviews.default.svc.cluster.local
spec:
routes:
- name: GET /reviews
condition:
method: GET
pathRegex: /reviews/.*
isRetryable: true
timeout: 10s
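The retry policy above, reduced to its control flow (a sketch; `perTryTimeout` enforcement and the `retryOn` condition checks are elided, and a real proxy would only retry idempotent-safe failures):

```python
def call_with_retries(fn, attempts: int = 3):
    """Up to `attempts` total tries; re-raise the last failure if all fail."""
    last_exc = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:  # real proxies retry only matching retryOn conditions
            last_exc = exc
    raise last_exc

calls = {"n": 0}

def flaky():
    """Simulated upstream that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("upstream connection reset")
    return "ok"
```

Note how the mesh versions of this loop keep retry logic out of every application, and retry budgets prevent retries from amplifying an outage.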
7. Observability
7.1 Metrics (Prometheus)
Service Mesh automatically collects the following metrics:
Golden Signals
================================
1. Latency: Request processing time
2. Traffic: Requests per second
3. Error rate: Percentage of failed requests
4. Saturation: Resource utilization
Istio key metrics:
- istio_requests_total: Total request count (by source, destination, response code)
- istio_request_duration_milliseconds: Request duration
- istio_request_bytes / istio_response_bytes: Request/response sizes
Linkerd key metrics:
- request_total: Total request count
- response_latency_ms: Response latency
- tcp_open_total: TCP connection count
# Prometheus scraping configuration (Istio)
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
data:
prometheus.yml: |
scrape_configs:
- job_name: 'envoy-stats'
metrics_path: /stats/prometheus
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
7.2 Distributed Tracing (Jaeger / Zipkin)
Service Mesh automatically propagates tracing headers to track the complete request path.
# Istio telemetry configuration
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
name: mesh-default
namespace: istio-system
spec:
tracing:
- providers:
- name: jaeger
randomSamplingPercentage: 10
customTags:
environment:
literal:
value: "production"
Important: Applications must propagate the following headers. The sidecar generates them on the inbound side, but copying them onto outbound calls is the application's responsibility — without it, each hop starts a new, disconnected trace:
Tracing headers to propagate:
- x-request-id
- x-b3-traceid
- x-b3-spanid
- x-b3-parentspanid
- x-b3-sampled
- x-b3-flags
- traceparent (W3C Trace Context)
- tracestate
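A minimal sketch of the propagation an application must do: copy the inbound tracing headers onto every outbound call (header matching is case-insensitive per HTTP semantics):

```python
TRACE_HEADERS = [
    "x-request-id", "x-b3-traceid", "x-b3-spanid", "x-b3-parentspanid",
    "x-b3-sampled", "x-b3-flags", "traceparent", "tracestate",
]

def propagated_headers(incoming: dict) -> dict:
    """Extract the tracing headers to attach to outbound requests."""
    lowered = {k.lower(): v for k, v in incoming.items()}
    return {h: lowered[h] for h in TRACE_HEADERS if h in lowered}
```

In practice an OpenTelemetry or B3 middleware does this for you; the sketch only shows what that middleware must preserve.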
7.3 Kiali Dashboard
Kiali is a dedicated observability dashboard for Istio.
# Install Kiali
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/addons/kiali.yaml
# Access dashboard
istioctl dashboard kiali
Key Kiali features:
- Service topology graph visualization
- Real-time traffic flow monitoring
- Istio configuration validation and error detection
- Distributed tracing integration
- Metric-based health status display
7.4 Grafana Dashboards
# Install Grafana + pre-configured dashboards
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/addons/grafana.yaml
# Access dashboard
istioctl dashboard grafana
Key dashboards:
- Mesh Dashboard: Overall mesh traffic overview
- Service Dashboard: Individual service metrics
- Workload Dashboard: Workload-level details
- Performance Dashboard: P50/P90/P99 latency
8. Kubernetes Gateway API
8.1 What is Gateway API?
Kubernetes Gateway API is the next-generation traffic management standard replacing Ingress. Its role-based design clearly separates responsibilities between infrastructure, cluster, and application administrators.
# GatewayClass: Defined by infrastructure admin
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
name: istio
spec:
controllerName: istio.io/gateway-controller
---
# Gateway: Defined by cluster admin
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
name: bookinfo-gateway
spec:
gatewayClassName: istio
listeners:
- name: https
protocol: HTTPS
port: 443
tls:
mode: Terminate
certificateRefs:
- name: bookinfo-tls
allowedRoutes:
namespaces:
from: Selector
selector:
matchLabels:
expose: "true"
---
# HTTPRoute: Defined by application developer
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
name: bookinfo-route
spec:
parentRefs:
- name: bookinfo-gateway
hostnames:
- "bookinfo.example.com"
rules:
- matches:
- path:
type: PathPrefix
value: /reviews
backendRefs:
- name: reviews
port: 9080
weight: 90
- name: reviews-v2
port: 9080
weight: 10
8.2 Istio Gateway vs Kubernetes Gateway API
Legacy Istio approach:
Gateway + VirtualService + DestinationRule
Kubernetes Gateway API approach:
GatewayClass + Gateway + HTTPRoute
Benefits:
- Standardized API (portability across implementations)
- Role-based access control
- Better namespace isolation
- Same API usable across Istio, Linkerd, Cilium, etc.
9. Ambient Mesh
9.1 Limitations of Sidecars
Problems with the traditional sidecar approach:
- 50-100MB additional memory per Pod
- Extra proxy hop on every request (latency)
- Pod restart required for sidecar injection
- Resource over-provisioning
9.2 Ambient Mesh Architecture
Istio's Ambient Mesh implements Service Mesh without sidecars.
Traditional Sidecar Mode:
+--------------+ +--------------+
| App + Envoy |--->| App + Envoy |
+--------------+ +--------------+
Ambient Mesh Mode:
+--------------+ +--------------+
| App | | App |
+------+-------+ +-------+------+
| |
+------+--------------------+------+ <-- ztunnel (1 per node, L4)
+------------------+---------------+
|
+------+------+ <-- waypoint proxy (optional, L7)
| Waypoint |
+-------------+
ztunnel (Zero Trust Tunnel):
- Runs as a single DaemonSet per node
- Handles L4 functions only: mTLS, basic authentication
- Written in Rust, extremely lightweight
- No Pod restart required
Waypoint Proxy:
- Deployed only when L7 features are needed
- Can be deployed per-namespace or per-service
- Envoy-based, providing full L7 capabilities
# Install Istio in Ambient mode
istioctl install --set profile=ambient -y
# Add namespace to Ambient mesh
kubectl label namespace default istio.io/dataplane-mode=ambient
# Deploy Waypoint Proxy (when L7 features needed)
istioctl waypoint apply --namespace default --name reviews-waypoint
9.3 Benefits of Ambient Mesh
Resource Savings Comparison (100-Pod cluster):
========================================
Sidecar Mode Ambient Mode
Memory: 5-10GB added 200-500MB added
CPU: Significant Minimal overhead
Operations: Manage sidecars Only ztunnel DaemonSet
Upgrades: Pod restart ztunnel rolling update
10. Security Deep Dive
10.1 RBAC (Role-Based Access Control)
# Namespace-level deny policy
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: deny-all
namespace: production
spec:
  # An ALLOW policy with no rules matches nothing, so all requests are denied
{}
---
# Allow specific services only
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: allow-frontend-to-api
namespace: production
spec:
selector:
matchLabels:
app: api-server
action: ALLOW
rules:
- from:
- source:
namespaces: ["production"]
principals: ["cluster.local/ns/production/sa/frontend"]
to:
- operation:
methods: ["GET", "POST"]
paths: ["/api/v1/*"]
when:
- key: request.headers[x-api-version]
values: ["v1", "v2"]
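The `allow-frontend-to-api` policy above reads as an AND of its `from`/`to`/`when` clauses; a sketch of that evaluation (using `fnmatch` to approximate Istio's `*` path wildcard — the real matcher treats `*` as prefix/suffix globs):

```python
from fnmatch import fnmatch

RULE = {  # mirrors the allow-frontend-to-api policy
    "principals": {"cluster.local/ns/production/sa/frontend"},
    "methods": {"GET", "POST"},
    "paths": ["/api/v1/*"],
    "header_values": {"v1", "v2"},
}

def allows(req: dict, rule: dict = RULE) -> bool:
    """A request is allowed only if every clause of the rule matches."""
    return (
        req["principal"] in rule["principals"]          # from.source.principals
        and req["method"] in rule["methods"]            # to.operation.methods
        and any(fnmatch(req["path"], p) for p in rule["paths"])  # to.operation.paths
        and req["headers"].get("x-api-version") in rule["header_values"]  # when
    )
```

Combined with the deny-all default, any request that fails even one clause is rejected.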
10.2 JWT Validation
# RequestAuthentication: Define JWT validation
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
name: jwt-auth
namespace: production
spec:
selector:
matchLabels:
app: api-server
jwtRules:
- issuer: "https://auth.example.com"
jwksUri: "https://auth.example.com/.well-known/jwks.json"
forwardOriginalToken: true
outputPayloadToHeader: "x-jwt-payload"
---
# JWT claims-based authorization
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: require-jwt
namespace: production
spec:
selector:
matchLabels:
app: api-server
action: ALLOW
rules:
- from:
- source:
requestPrincipals: ["https://auth.example.com/*"]
when:
- key: request.auth.claims[role]
values: ["admin", "editor"]
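To see which claims such a policy inspects, here is a sketch that decodes a JWT payload without verifying the signature. Signature verification against the JWKS is exactly what RequestAuthentication does for you; never skip it in real code.

```python
import base64
import json

def jwt_claims(token: str) -> dict:
    """Decode the (UNVERIFIED) payload segment of a JWT."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64url padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

def role_allowed(token: str, allowed=("admin", "editor")) -> bool:
    """The claim check the AuthorizationPolicy above performs on request.auth.claims[role]."""
    return jwt_claims(token).get("role") in allowed
```

The mesh performs the equivalent check after validating the signature, so the application receives only requests whose claims already passed.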
10.3 External Authorization
# External authorization service integration
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: ext-authz
namespace: production
spec:
selector:
matchLabels:
app: api-server
action: CUSTOM
provider:
name: "opa-ext-authz"
rules:
- to:
- operation:
paths: ["/admin/*"]
# Register external authz provider in MeshConfig
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
meshConfig:
extensionProviders:
- name: "opa-ext-authz"
envoyExtAuthzGrpc:
service: "opa.opa-system.svc.cluster.local"
port: 9191
includeRequestBodyInCheck:
maxRequestBytes: 1024
11. Production Best Practices
11.1 Resource Limits
# Istio sidecar resource limits
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
meshConfig:
defaultConfig:
concurrency: 2
values:
global:
proxy:
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 256Mi
11.2 Progressive Rollout Strategy
# Step 1: PERMISSIVE mTLS (allow legacy traffic)
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: default
spec:
mtls:
mode: PERMISSIVE
EOF
# Step 2: Monitor metrics (check mTLS traffic ratio)
# Check connection_security_policy in istio_requests_total metric
# Step 3: Switch to STRICT mTLS
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: default
spec:
mtls:
mode: STRICT
EOF
11.3 Debugging Tools
# Check Istio proxy status
istioctl proxy-status
# Dump Envoy configuration
istioctl proxy-config all POD_NAME -o json
# Check routing rules
istioctl proxy-config route POD_NAME
# Check cluster configuration
istioctl proxy-config cluster POD_NAME
# Analysis tool (detect config errors)
istioctl analyze --all-namespaces
# Linkerd diagnostics
linkerd check
linkerd diagnostics proxy-metrics POD_NAME
linkerd viz stat deploy
linkerd viz top deploy/webapp
linkerd viz tap deploy/webapp
11.4 Horizontal Pod Autoscaler Integration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: reviews-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: reviews
minReplicas: 3
maxReplicas: 20
metrics:
- type: Pods
pods:
metric:
name: istio_requests_per_second
target:
type: AverageValue
averageValue: "100"
- type: Pods
pods:
metric:
name: istio_request_duration_milliseconds_p99
target:
type: AverageValue
averageValue: "500"
11.5 Upgrade Strategy
# Istio canary upgrade
# 1. Install new version control plane (revision-based)
istioctl install --set revision=1-24-0
# 2. Gradually switch by changing namespace labels
kubectl label namespace default istio.io/rev=1-24-0 --overwrite
# 3. Restart Pods to apply new proxy
kubectl rollout restart deployment -n default
# 4. Remove previous version
istioctl uninstall --revision 1-23-0
12. When NOT to Use Service Mesh
Service Mesh is powerful but not suitable for every environment.
Situations to avoid Service Mesh:
- Few services: With 5 or fewer services, the complexity likely outweighs the benefits.
- Team unfamiliar with Kubernetes: Service Mesh adds complexity on top of Kubernetes.
- Extremely limited resources: When the memory/CPU overhead of sidecar proxies cannot be absorbed.
- Extreme low-latency requirements: Environments where microsecond-level latency matters, such as high-frequency trading (HFT).
Alternatives to consider:
- Simple mTLS only: cert-manager + service-level TLS
- Basic observability: Direct OpenTelemetry instrumentation
- Simple load balancing: Kubernetes Service (ClusterIP)
- Ingress only: NGINX Ingress Controller or Traefik
- Network policies: Kubernetes NetworkPolicy or Cilium
Quiz
Q1: Explain the roles of the data plane and control plane in a Service Mesh.
Data Plane: The collection of sidecar proxies that intercept and process actual service traffic. They perform mTLS encryption, load balancing, metrics collection, retries/timeouts, and more. Istio uses Envoy, and Linkerd uses linkerd2-proxy.
Control Plane: Centrally manages and configures the data plane proxies. Responsible for service discovery, certificate issuance, and policy distribution. Istio uses Istiod, while Linkerd consists of destination/identity/proxy-injector components.
Q2: What is the key difference between mTLS and regular TLS?
In regular TLS, only the client verifies the server's certificate. In mTLS (mutual TLS), both sides present and verify certificates.
- Client verifies the server's certificate (same as regular TLS)
- Server also verifies the client's certificate (the additional mTLS step)
- This enables bidirectional identity verification between services
- The SPIFFE standard maps service identity to Kubernetes ServiceAccounts
Q3: Explain the problems Istio's Ambient Mesh solves and its architecture.
Problems solved: The traditional sidecar approach has 50-100MB memory overhead per Pod, requires Pod restart for sidecar injection, and adds a proxy hop on every request.
Architecture:
- ztunnel: An L4 proxy running as a single DaemonSet per node. Written in Rust, extremely lightweight, handling only mTLS and basic authentication.
- Waypoint Proxy: Optionally deployed only when L7 features are needed. Envoy-based, providing full L7 features like VirtualService and traffic management.
For a 100-Pod cluster, memory usage drops from 5-10GB (sidecar) to 200-500MB (Ambient).
Q4: When should you choose Istio vs Linkerd?
Choose Istio when:
- Fine-grained traffic management is needed (weighted routing, fault injection, mirroring)
- Complex security policies are required (JWT validation, external authorization, RBAC)
- Wasm extension plugins are needed
- Using Ambient Mesh (sidecar-less mode)
Choose Linkerd when:
- Minimizing resource overhead (10-20MB per Pod)
- Fast adoption and simple operations desired
- Core features (mTLS, metrics, retries) are sufficient
- Small operations team
Q5: When should you NOT adopt a Service Mesh?
- 5 or fewer services: Complexity outweighs benefits
- Team inexperienced with Kubernetes: Service Mesh adds another complexity layer
- Extremely limited resources: Cannot absorb sidecar memory/CPU overhead
- Extreme low-latency requirements: Microsecond-level latency critical environments (e.g., HFT)
Alternatives: cert-manager (mTLS), OpenTelemetry (observability), Kubernetes NetworkPolicy (network security), NGINX Ingress (ingress)
References
- Istio Official Documentation
- Linkerd Official Documentation
- Envoy Proxy Official Documentation
- CNCF Service Mesh Landscape
- Kubernetes Gateway API
- SPIFFE Standard
- Istio Ambient Mesh Official Blog
- Linkerd Benchmarks
- Flagger - Progressive Delivery
- Kiali Official Documentation
- SMI (Service Mesh Interface)
- Istio in Action (Manning)
- NIST Zero Trust Architecture (SP 800-207)