Cilium Service Mesh: Building and Operating Sidecarless Service Mesh with eBPF

Introduction
Architecture and Core Concepts
Installation and Service Mesh Enablement
mTLS Configuration: SPIRE-Based Mutual Authentication
L4/L7 Traffic Management
Performance Comparison: Cilium vs Istio vs Linkerd
- How to Run the Benchmark
Troubleshooting: Failure Scenarios and Recovery
Operations Notes
Operations Checklist
References
Quiz

Introduction

A service mesh is an infrastructure layer that makes communication between microservices secure and observable. Traditional service mesh solutions such as Istio and Linkerd inject a sidecar proxy into each Pod. This approach works, but each sidecar consumes additional CPU and memory, Pod startup takes longer, and the extra network hops add latency.

Cilium Service Mesh fundamentally changes this paradigm. It uses eBPF to handle L4 traffic at the kernel level and only uses a per-node shared Envoy proxy (DaemonSet) when L7 functionality is needed. Without sidecars, resource overhead is dramatically reduced and the complexity of managing per-Pod proxies disappears.

This article covers the architecture of Cilium Service Mesh, installation, mTLS configuration, traffic management, a performance comparison, and the failure scenarios and recovery procedures you may encounter in production.

Architecture and Core Concepts

Limitations of the Traditional Sidecar Model

In a traditional service mesh, a sidecar proxy (Envoy in Istio's case) is injected into every Pod. With 100 Pods, 100 additional proxy instances are running.

Cost of the Sidecar Model:

Each Envoy sidecar consumes roughly 50-100 MB of memory
Sidecar initialization delays Pod startup (an additional 2-5 seconds)
All traffic passes through the sidecar, adding latency (~1 ms)
Full Pod rolling restart required when upgrading sidecars
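
To make the memory cost concrete, the following is a rough back-of-the-envelope sketch using this article's approximate figures (50-100 MB per sidecar, and about 150 MB for a shared per-node Envoy as listed in the comparison table later on). The helper names and the 5-node cluster size are assumptions chosen for illustration, not measurements.

```shell
#!/bin/sh
# Hypothetical helpers: estimate total proxy memory (MB) under each model.
# All figures are the article's rough numbers, not benchmark results.

sidecar_mem_mb() {        # args: pod_count per_sidecar_mb
  echo $(( $1 * $2 ))
}

shared_envoy_mem_mb() {   # args: node_count per_node_envoy_mb
  echo $(( $1 * $2 ))
}

pods=100   # from the example above
nodes=5    # assumed cluster size, for illustration only

echo "sidecars (50 MB each):  $(sidecar_mem_mb "$pods" 50) MB"
echo "sidecars (100 MB each): $(sidecar_mem_mb "$pods" 100) MB"
echo "shared Envoy per node:  $(shared_envoy_mem_mb "$nodes" 150) MB"
```

Under these assumptions the sidecar model needs 5,000-10,000 MB of proxy memory, while the per-node model needs 750 MB. The gap grows with Pod density per node.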

Cilium Service Mesh Sidecarless Architecture

Cilium Service Mesh implements service mesh functionality in two layers.

L4 Layer (eBPF): TCP connection management, load balancing, network policy enforcement, and enforcement of mutual-authentication results are handled by eBPF programs inside the kernel. Payload encryption is provided by WireGuard (or IPsec) rather than by the eBPF programs themselves. Because this path runs in the kernel without sidecars, overhead is minimal.

L7 Layer (Shared Envoy): When L7 functionality such as HTTP routing, header-based traffic splitting, or gRPC filtering is needed, a single shared Envoy proxy per node (DaemonSet) handles that traffic. Because the proxy is per-node rather than per-Pod, the number of Envoy instances drops dramatically.

Core Components

Cilium Agent (DaemonSet): Loads and manages eBPF programs on each node. Serves as the data plane of the service mesh.
Cilium Operator (Deployment): Handles cluster-level resource management, IP pool allocation, and CRD synchronization.
Envoy DaemonSet: A shared proxy that only handles traffic with L7 policies applied. The Cilium Agent automatically injects Envoy configuration.
Hubble: A built-in observability tool that monitors all network flows in real time.

Installation and Service Mesh Enablement

Prerequisites

# Check kernel version (5.10+ required, 6.1+ recommended)
uname -r

# Verify eBPF support
grep CONFIG_BPF "/boot/config-$(uname -r)"
# CONFIG_BPF=y
# CONFIG_BPF_SYSCALL=y
# CONFIG_BPF_JIT=y

# Install Cilium CLI
CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/main/stable.txt)
curl -L --fail --remote-name-all \
  https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-linux-amd64.tar.gz
sudo tar xzvf cilium-linux-amd64.tar.gz -C /usr/local/bin
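
The kernel check above can be automated. The following is a minimal sketch that assumes GNU `sort -V` is available; `version_ge` is a hypothetical helper name, and the 5.10 threshold is the minimum stated in this article.

```shell
#!/bin/sh
# version_ge A B: succeeds when version A >= version B.
# Relies on GNU sort -V (version sort); this is an assumption about the host.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Strip the distro suffix, e.g. 6.1.0-18-amd64 becomes 6.1.0
kernel="$(uname -r | cut -d- -f1)"

if version_ge "$kernel" "5.10"; then
  echo "kernel $kernel meets the 5.10 minimum"
else
  echo "kernel $kernel is older than 5.10"
fi
```

A plain string or numeric comparison would rank 5.4 above 5.10, which is why a version-aware sort is used here.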

Installing Service Mesh with Helm

# Add Helm repo
helm repo add cilium https://helm.cilium.io/
helm repo update

# Install Cilium Service Mesh (kube-proxy replacement + service mesh enabled)
helm install cilium cilium/cilium --version 1.19.0 \
  --namespace kube-system \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost="API_SERVER_IP" \
  --set k8sServicePort=6443 \
  --set envoyConfig.enabled=true \
  --set ingressController.enabled=true \
  --set ingressController.loadbalancerMode=shared \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true \
  --set encryption.enabled=true \
  --set encryption.type=wireguard \
  --set authentication.mutual.spire.enabled=true \
  --set authentication.mutual.spire.install.enabled=true

The key parameters in this configuration are:

envoyConfig.enabled=true: Enables L7 traffic management via the CiliumEnvoyConfig CRD
ingressController.enabled=true: Uses Cilium as the Kubernetes Ingress controller
authentication.mutual.spire.enabled=true: Enables SPIRE-based mTLS authentication
encryption.type=wireguard: WireGuard-based transparent encryption between nodes
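
For repeatable installs, the same settings can be kept in a values file instead of a long list of `--set` flags. The sketch below writes a file that mirrors the flags used above one-to-one; the file name `cilium-values.yaml` is an assumption, and `API_SERVER_IP` remains a placeholder to replace.

```shell
#!/bin/sh
# Write a Helm values file equivalent to the --set flags shown earlier.
# The file name is arbitrary; API_SERVER_IP is a placeholder.
cat > cilium-values.yaml <<'EOF'
kubeProxyReplacement: true
k8sServiceHost: "API_SERVER_IP"
k8sServicePort: 6443
envoyConfig:
  enabled: true
ingressController:
  enabled: true
  loadbalancerMode: shared
hubble:
  enabled: true
  relay:
    enabled: true
  ui:
    enabled: true
encryption:
  enabled: true
  type: wireguard
authentication:
  mutual:
    spire:
      enabled: true
      install:
        enabled: true
EOF

echo "wrote cilium-values.yaml"
# Then install with:
#   helm install cilium cilium/cilium --version 1.19.0 \
#     --namespace kube-system -f cilium-values.yaml
```

Keeping the file in version control also makes a later `helm diff upgrade` easier to review.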

Verifying the Installation

# Check overall Cilium status
cilium status --wait

# Expected output:
#     /¯¯\
#  /¯¯\__/¯¯\    Cilium:             OK
#  \__/¯¯\__/    Operator:           OK
#  /¯¯\__/¯¯\    Envoy DaemonSet:    OK
#  \__/¯¯\__/    Hubble Relay:       OK
#     \__/        ClusterMesh:        disabled
#                 SPIRE Server:       OK
#                 SPIRE Agent:        OK

# Verify service mesh features
cilium config view | grep -E "envoy|mesh|mutual"

# Connectivity test (including service mesh)
cilium connectivity test

mTLS Configuration: SPIRE-Based Mutual Authentication

Why mTLS Is Needed

One of the core features of a service mesh is mutual authentication (mTLS) for inter-service communication. mTLS requires both the client and server to present certificates, verifying each other's identity. This prevents man-in-the-middle (MITM) attacks and implements workload identity-based security independent of network policies.

Cilium implements mTLS using the SPIFFE/SPIRE framework. Each workload is automatically assigned a SPIFFE ID, and certificate issuance and renewal happen transparently.

Applying mTLS Authentication Policies

# mtls-authentication-policy.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: require-mutual-auth
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: payment-service
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: order-service
      authentication:
        mode: required # mTLS authentication required
      toPorts:
        - ports:
            - port: '8080'
              protocol: TCP

This policy requires mTLS authentication for inbound traffic to payment-service. If order-service does not have a valid SPIFFE ID, the connection will be rejected.

Enforcing Cluster-Wide mTLS

# cluster-wide-mtls.yaml
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: enforce-mtls-cluster-wide
spec:
  endpointSelector:
    matchExpressions:
      - key: io.kubernetes.pod.namespace
        operator: NotIn
        values:
          - kube-system
  ingress:
    - fromEndpoints:
        - {}
      authentication:
        mode: required

Important Note: Before enabling cluster-wide mTLS, make sure all workloads are registered with SPIRE. Unregistered workloads will have their communication blocked immediately. Always test in a staging environment first, and apply to production gradually on a per-namespace basis.
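
The per-namespace rollout recommended above could be sketched as follows. This is an illustration under assumptions: `render_mtls_policy`, the policy name `enforce-mtls`, and the `staging` namespace are hypothetical, and the generated policy reuses the same `authentication.mode: required` structure as the cluster-wide example. In a namespaced CiliumNetworkPolicy, an empty `fromEndpoints` selector is expected to match only endpoints in the same namespace, so cross-namespace callers would need their own rules.

```shell
#!/bin/sh
# render_mtls_policy NAMESPACE: print a namespace-scoped policy to stdout.
# Policy name and structure are illustrative; review before applying.
render_mtls_policy() {
  cat <<EOF
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: enforce-mtls
  namespace: $1
spec:
  endpointSelector: {}
  ingress:
    - fromEndpoints:
        - {}
      authentication:
        mode: required
EOF
}

# Start with a single non-critical namespace (hypothetical list).
for ns in staging; do
  render_mtls_policy "$ns" > "mtls-$ns.yaml"
  echo "wrote mtls-$ns.yaml (review, then: kubectl apply -f mtls-$ns.yaml)"
done
```

Extending the `for` list one namespace at a time, and watching Hubble for dropped flows between steps, keeps the blast radius of a misconfiguration small.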

# Check workloads registered with SPIRE
# (Cilium's bundled SPIRE install defaults to the cilium-spire namespace)
kubectl exec -n cilium-spire spire-server-0 -c spire-server -- \
  /opt/spire/bin/spire-server entry show

# Check mTLS authentication status
kubectl exec -n kube-system ds/cilium -- cilium-dbg bpf auth list
hubble observe --namespace production --verdict DROPPED -o json | \
  jq 'select(.flow.drop_reason_desc == "AUTH_REQUIRED")'

L4/L7 Traffic Management

L7 Routing with CiliumEnvoyConfig

Cilium Service Mesh performs L7 traffic management through the CiliumEnvoyConfig CRD. This serves a similar role to Istio's VirtualService/DestinationRule.

# l7-traffic-split.yaml
apiVersion: cilium.io/v2
kind: CiliumEnvoyConfig
metadata:
  name: api-traffic-split
  namespace: production
spec:
  services:
    - name: api-service
      namespace: production
  backendServices:
    - name: api-service-v1
      namespace: production
    - name: api-service-v2
      namespace: production
  resources:
    - '@type': type.googleapis.com/envoy.config.listener.v3.Listener
      name: api-listener
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                '@type': type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: api-traffic
                route_config:
                  name: api-routes
                  virtual_hosts:
                    - name: api-host
                      domains: ['*']
                      routes:
                        - match:
                            prefix: '/'
                            headers:
                              - name: 'x-canary'
                                string_match:
                                  exact: 'true'
                          route:
                            cluster: 'production/api-service-v2'
                        - match:
                            prefix: '/'
                          route:
                            weighted_clusters:
                              clusters:
                                - name: 'production/api-service-v1'
                                  weight: 90
                                - name: 'production/api-service-v2'
                                  weight: 10
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      '@type': type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

This configuration defines two routing rules:

Requests with the x-canary: true header are routed to v2
Remaining requests are distributed with 90% weight to v1 and 10% to v2 (canary deployment)
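
As a quick sanity check on the 90/10 split, the expected request counts can be worked out with integer arithmetic. The sketch below uses a hypothetical `expected_hits` helper and an assumed total of 10,000 requests without the x-canary header. Because the split is applied per request as a weighted choice, real counts will fluctuate around these values.

```shell
#!/bin/sh
# expected_hits TOTAL WEIGHT WEIGHT_SUM: expected share of TOTAL for one cluster.
expected_hits() {
  echo $(( $1 * $2 / $3 ))
}

total=10000   # assumed number of requests without the x-canary header

echo "v1: $(expected_hits "$total" 90 100)"   # prints "v1: 9000"
echo "v2: $(expected_hits "$total" 10 100)"   # prints "v2: 1000"
```

Requests that carry `x-canary: true` match the first route and are not part of this split.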

L4 Load Balancing Policy

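L4 load balancing and policy enforcement are handled directly in eBPF without going through Envoy, which keeps them very efficient. The policy below limits backend-service to ingress from api-gateway on ports 8080 and 9090, and to egress toward database (5432) and cache (6379).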

# l4-lb-policy.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: backend-l4-policy
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: backend-service
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: api-gateway
      toPorts:
        - ports:
            - port: '8080'
              protocol: TCP
            - port: '9090'
              protocol: TCP
  egress:
    - toEndpoints:
        - matchLabels:
            app: database
      toPorts:
        - ports:
            - port: '5432'
              protocol: TCP
    - toEndpoints:
        - matchLabels:
            app: cache
      toPorts:
        - ports:
            - port: '6379'
              protocol: TCP

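L7 HTTP Policies and Rate Limiting

The policy below allow-lists specific HTTP methods and paths for traffic from frontend to public-api. Despite its name, it does not configure a rate limit by itself; CiliumNetworkPolicy HTTP rules only match on request attributes such as method, path, and headers.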

# l7-rate-limiting.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: api-l7-with-ratelimit
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: public-api
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: '8080'
              protocol: TCP
          rules:
            http:
              - method: GET
                path: '/api/v1/products.*'
              - method: GET
                path: '/api/v1/categories.*'
              - method: POST
                path: '/api/v1/orders'
                headers:
                  - 'Content-Type: application/json'

Performance Comparison: Cilium vs Istio vs Linkerd

Performance is one of the most important criteria when choosing a service mesh. Below are benchmark results measured on the same workload (gRPC microservice, 100 RPS).

Item	Cilium Service Mesh	Istio (Sidecar)	Linkerd
Architecture	Sidecarless (eBPF + Node Envoy)	Per-Pod Sidecar Envoy	Per-Pod Sidecar linkerd2-proxy
P50 Latency Overhead	~0.1ms (L4), ~0.3ms (L7)	~1.0ms	~0.5ms
P99 Latency Overhead	~0.3ms (L4), ~0.8ms (L7)	~3.0ms	~1.5ms
Memory Overhead per Pod	0MB (L4) / ~150MB shared per node	~50-100MB	~20-30MB
CPU Overhead per Pod	Nearly none (L4)	~10-50m	~5-20m
Pod Startup Delay	0s (no sidecar)	2-5s (sidecar injection)	1-3s
mTLS	SPIRE/WireGuard	Built-in (istiod CA, formerly Citadel)	Built-in (own PKI)
L7 Features	CiliumEnvoyConfig	VirtualService/DestinationRule	HTTPRoute/ServiceProfile
Observability	Hubble (built-in)	Kiali, Jaeger (separate)	Linkerd Viz (built-in)
Gateway API Support	Native	Native	Native
Community	CNCF Graduated	CNCF Graduated	CNCF Graduated
Learning Curve	Medium (eBPF understanding required)	High (complex CRD system)	Low

How to Run the Benchmark

# Performance measurement using fortio
kubectl run fortio-client --rm -it --image=fortio/fortio -- \
  load -c 50 -qps 1000 -t 60s -json - \
  http://api-service.production:8080/api/v1/health

# Key analysis points:
# - P50, P90, P99 latency
# - Maximum QPS (Queries Per Second)
# - Error rate
# - CPU/Memory usage (measured separately with kubectl top pods)

# Check L7 latency from Hubble flow records (latency_ns converted to ms)
hubble observe --namespace production --protocol http -o json | \
  jq 'select(.flow.l7.latency_ns != null) | (.flow.l7.latency_ns | tonumber) / 1000000'
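
The P50/P90/P99 figures listed in the analysis points can also be computed by hand from raw latency samples. The following is a minimal sketch of the nearest-rank percentile method; `percentile` is a hypothetical helper, and the ten sample values are made up for illustration rather than taken from a real run.

```shell
#!/bin/sh
# percentile P: read one latency value per line on stdin and print the
# nearest-rank P-th percentile (P is an integer from 1 to 100).
percentile() {
  LC_ALL=C sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# Made-up latency samples in milliseconds.
samples="1.2 0.8 0.9 3.5 1.0 1.1 0.7 1.3 0.95 9.0"

for p in 50 90 99; do
  printf 'P%s: ' "$p"
  printf '%s\n' $samples | percentile "$p"
done
```

With these samples the median is 1.0 ms while P99 is 9.0 ms, which shows why tail percentiles should be compared alongside the median when evaluating a mesh.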

Key Takeaway: In the table above, the latency that Cilium Service Mesh adds for L4 traffic is roughly 5-10x lower than that of the sidecar-based meshes (about 0.1 ms versus 0.5-1.0 ms at P50). With L7 features enabled there is still no per-Pod proxy memory: a single shared Envoy per node (about 150 MB in the table) handles all L7 traffic on that node.

Troubleshooting: Failure Scenarios and Recovery

Case 1: Inter-Service Communication Failure Due to mTLS Authentication Failure

Symptoms: A specific service suddenly cannot communicate with other services, and Authentication required drops are observed in Hubble

# Check authentication failure traffic
hubble observe --namespace production --verdict DROPPED -o compact

# Check SPIRE agent status (cilium-spire is the default namespace of Cilium's bundled SPIRE install)
kubectl get pods -n cilium-spire -l app=spire-agent
kubectl logs -n cilium-spire -l app=spire-agent --tail=50

# Check the SPIRE registration entries created for Cilium identities
kubectl exec -n cilium-spire spire-server-0 -c spire-server -- \
  /opt/spire/bin/spire-server entry show -selector cilium:mutual-auth

# Check authentication status of Cilium endpoints
kubectl exec -n kube-system ds/cilium -- \
  cilium-dbg endpoint list -o json | jq '.[].status.policy.realized.auth'

Causes and Solutions:

SPIRE Agent OOMKilled: Increase the memory limit of the SPIRE agent. For large clusters, 512Mi or more is recommended
Certificate Expiration: The default SVID TTL in SPIRE is 1 hour. If the SPIRE server goes down, certificates cannot be renewed, and authenticated traffic starts failing as existing SVIDs expire (within 1 hour at the default TTL). Run the SPIRE server in a highly available configuration
Incorrect Selectors: authentication.mode: required in the CiliumNetworkPolicy may have been applied to an unintended scope. Test with a specific namespace first

Case 2: L7 Policy Failure Due to Envoy DaemonSet Outage

Symptoms: L7 HTTP policies (path/header-based filtering) are not working, and only L4 policies are applied

# Check Envoy DaemonSet status
kubectl get ds -n kube-system cilium-envoy
kubectl describe ds -n kube-system cilium-envoy

# Check recent Envoy logs (the label selector matches Envoy Pods on all nodes)
kubectl logs -n kube-system -l k8s-app=cilium-envoy --tail=100

# Check connectivity between Cilium Agent and Envoy
kubectl exec -n kube-system ds/cilium -- \
  cilium-dbg status --verbose | grep -A5 "Envoy"

# Check CiliumEnvoyConfig status
kubectl get cec -n production -o yaml

Causes and Solutions:

Envoy Pod Resource Shortage: When a node has heavy L7 traffic, Envoy may run out of memory. Increase envoy.resources.limits.memory in Helm values
Invalid CiliumEnvoyConfig: If there are Envoy configuration syntax errors, the corresponding listener will not be loaded. Use the cilium-dbg envoy config command to verify the actually loaded configuration
Node Affinity Issues: If Envoy is not scheduled on specific nodes, L7 policies will not work on those nodes

Case 3: Connection Disruption After Service Mesh Upgrade

Symptoms: After a Cilium version upgrade, communication between some Pods fails intermittently

# Check Cilium Agent rolling restart status
kubectl rollout status ds/cilium -n kube-system

# Check eBPF map synchronization status
kubectl exec -n kube-system ds/cilium -- \
  cilium-dbg bpf endpoint list

# List endpoints that are not yet in the ready state (repeat until only the header remains)
kubectl exec -n kube-system ds/cilium -- \
  cilium-dbg endpoint list | grep -v ready

Recovery Procedure:

First, verify that the Cilium Agent has restarted successfully on all nodes
Confirm that eBPF maps have been reloaded properly using cilium-dbg bpf endpoint list
If the problem persists, restart the affected Pods to re-register the endpoints
As a last resort, run cilium-dbg endpoint regenerate --all to regenerate the eBPF programs for all endpoints

Note: Before upgrading, always review the pending changes with helm diff upgrade (provided by the helm-diff plugin), and upgrade one minor version at a time. Jumping directly from Cilium 1.17 to 1.19 is not supported.
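
The one-minor-version-at-a-time rule can be turned into an explicit plan before any upgrade begins. The sketch below is illustrative only: cilium_upgrade_path is a hypothetical helper, not a Cilium CLI command, and it assumes both versions share the same major version.

```shell
# Hypothetical helper: print the minor-by-minor upgrade path between two
# versions, e.g. "1.17 -> 1.18 -> 1.19". Patch components are ignored.
cilium_upgrade_path() {
  local from="$1"
  local to="$2"
  local major="${from%%.*}"
  local from_minor="${from#*.}"
  local to_minor="${to#*.}"
  # Drop any patch component (1.17.3 -> minor 17)
  from_minor="${from_minor%%.*}"
  to_minor="${to_minor%%.*}"

  # Assumption: only same-major, forward upgrades are planned here
  if [ "${to%%.*}" != "$major" ] || [ "$to_minor" -lt "$from_minor" ]; then
    echo "unsupported path: $from -> $to" >&2
    return 1
  fi

  local path="$major.$from_minor"
  local m=$((from_minor + 1))
  while [ "$m" -le "$to_minor" ]; do
    path="$path -> $major.$m"
    m=$((m + 1))
  done
  echo "$path"
}

# Usage: cilium_upgrade_path 1.17 1.19
```

Each hop in the printed path would then get its own helm diff upgrade review and connectivity test.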

Case 4: Hubble Observation Data Not Being Collected

Symptoms: The hubble observe command returns empty results or times out

# Check Hubble Relay status
kubectl get pods -n kube-system -l k8s-app=hubble-relay
kubectl logs -n kube-system -l k8s-app=hubble-relay --tail=50

# Test Hubble connectivity
cilium hubble port-forward &
hubble status

# Stream drop and trace events from the Cilium Agent to confirm that flows are being generated
kubectl exec -n kube-system ds/cilium -- \
  cilium-dbg monitor --type drop --type trace

Solution: The most common cause is a broken gRPC connection between Hubble Relay and the Cilium Agents. Restart the Relay with kubectl rollout restart deployment hubble-relay -n kube-system.

Operations Notes

Migration from Istio to Cilium Service Mesh

When migrating from an existing Istio environment to Cilium Service Mesh, proceed gradually rather than switching all at once.

Phase 1 - Coexistence (2-4 weeks): Install Cilium as the CNI but keep service mesh features disabled. Istio sidecars continue to operate.

Phase 2 - Per-Namespace Transition: Starting with non-critical namespaces, disable Istio sidecar injection and enable Cilium's mTLS and L7 policies.

Phase 3 - Full Transition: Remove Istio from all namespaces and switch entirely to Cilium Service Mesh.

# Disable Istio sidecar injection per namespace
kubectl label namespace staging istio-injection-
kubectl rollout restart deployment -n staging

# Apply Cilium mTLS policy
kubectl apply -f cilium-mtls-policy-staging.yaml

# Verify after transition
hubble observe --namespace staging --protocol http
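
After the rollout restart, it is worth confirming that no Pod in the namespace still carries a sidecar. The following is a rough sketch: pods_with_sidecar is a hypothetical filter, and it assumes the sidecar container is named istio-proxy and appears under spec.containers (a setup that runs the sidecar as an init container would need the jsonpath adjusted).

```shell
# Hypothetical filter: read "<pod-name><TAB><space-separated container names>"
# lines on stdin and print the Pods that still contain an istio-proxy container.
pods_with_sidecar() {
  awk -F '\t' '{
    n = split($2, names, " ")
    for (i = 1; i <= n; i++) {
      if (names[i] == "istio-proxy") { print $1; break }
    }
  }'
}

# Usage sketch against a live cluster:
#   kubectl get pods -n staging \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}' \
#     | pods_with_sidecar
```

An empty result suggests the namespace is ready for the Cilium mTLS policy; any Pod name printed still needs a restart.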

Resource Sizing Guidelines

Cilium Agent (DaemonSet):

Small clusters (10 nodes or fewer): CPU 200m / Memory 256Mi
Medium clusters (50 nodes or fewer): CPU 500m / Memory 512Mi
Large clusters (100 nodes or more): CPU 1000m / Memory 1Gi

Envoy DaemonSet:

Low L7 traffic: CPU 100m / Memory 128Mi
High L7 traffic: CPU 500m / Memory 512Mi
Very high L7 throughput: CPU 1000m / Memory 1Gi

SPIRE Server:

Under 1,000 workloads: CPU 200m / Memory 256Mi
Over 5,000 workloads: CPU 500m / Memory 512Mi, HA configuration is mandatory
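
The Cilium Agent tiers above can be expressed as a small lookup for use in provisioning scripts. This is only a sketch: cilium_agent_sizing is a hypothetical helper, and because the guideline does not cover clusters of 51 to 99 nodes, the code assumes they round up to the large tier.

```shell
# Hypothetical helper: map a node count to the Cilium Agent sizing tiers
# listed in the guidelines above.
cilium_agent_sizing() {
  local nodes="$1"
  if [ "$nodes" -le 10 ]; then
    echo "cpu=200m memory=256Mi"
  elif [ "$nodes" -le 50 ]; then
    echo "cpu=500m memory=512Mi"
  else
    # Assumption: 51-99 nodes are not covered by the guideline,
    # so this sketch rounds them up to the large tier.
    echo "cpu=1000m memory=1Gi"
  fi
}

# Usage: cilium_agent_sizing 35
```

These values are starting points; actual usage observed in Prometheus should drive the final requests and limits.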

Essential Monitoring Metrics

# Key metrics to collect in Prometheus

# 1. Cilium Agent status
# cilium_agent_api_process_time_seconds - API processing time
# cilium_agent_bootstrap_seconds - Agent startup time
# cilium_bpf_map_ops_total - BPF map operation count

# 2. Service mesh related
# cilium_proxy_upstream_reply_seconds - L7 proxy upstream response time
# cilium_proxy_redirects - Number of connections redirected to L7 proxy
# cilium_auth_map_entries - mTLS authentication map entry count

# 3. Hubble observability
# hubble_flows_processed_total - Total processed flows
# hubble_tcp_flags_total - Count by TCP flags

# Import Grafana dashboards
# Official Cilium dashboard: https://grafana.com/grafana/dashboards/16611
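
Metric names can differ between Cilium versions, so it may be worth confirming that the names listed above are actually exported before building alerts on them. The following is a minimal sketch: missing_metrics is a hypothetical helper that only does prefix matching on text read from stdin.

```shell
# Hypothetical helper: read a metrics dump on stdin and print every expected
# metric name (passed as arguments) that does not appear in it.
missing_metrics() {
  local dump
  local name
  dump="$(cat)"
  for name in "$@"; do
    # Prefix match at line start, so histogram suffixes such as
    # _bucket, _sum and _count also count as present.
    if ! printf '%s\n' "$dump" | grep "^${name}" > /dev/null; then
      echo "$name"
    fi
  done
}

# Usage sketch against a live cluster:
#   kubectl exec -n kube-system ds/cilium -- cilium-dbg metrics list \
#     | missing_metrics cilium_proxy_redirects cilium_auth_map_entries
```

Hubble metrics may be served from a separate endpoint and therefore may not appear in the agent's own metrics output; in that case the same filter can be fed from that endpoint instead.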

Operations Checklist

Pre-Deployment Checklist

Verify the kernel version is 5.10 or higher (6.1 or higher recommended)
Verify that CONFIG_BPF, CONFIG_BPF_SYSCALL, and CONFIG_BPF_JIT are enabled
When installing in kube-proxy replacement mode, verify that the existing kube-proxy has been removed
Verify that the SPIRE server is deployed in an HA configuration (required for production)
Verify that CiliumNetworkPolicy does not conflict with existing Kubernetes NetworkPolicy
Verify that PodDisruptionBudget is configured for Cilium DaemonSet and SPIRE
Verify that cilium connectivity test passes

Upgrade Checklist

Always check Breaking Changes in the Cilium release notes
Upgrade sequentially, one minor version at a time (1.17 -> 1.18 -> 1.19)
Verify changes beforehand with helm diff upgrade
Upgrade in the staging environment first and observe for at least 24 hours
Monitor Agent rolling restart during upgrade with cilium status
Re-run cilium connectivity test after upgrade

Incident Response Checklist

Cilium Agent failure: Pods on the affected node can continue communicating using existing eBPF maps, but policy updates are suspended
Envoy DaemonSet failure: Only L7 policies are affected; L4 policies continue to work via eBPF
SPIRE server failure: Restore the server before the existing certificates reach their TTL (default 1 hour); after that, mutually authenticated traffic will start to fail
etcd failure (Cilium KVStore): Cilium Agent operates with local cache, but new policies cannot be applied
