Complete Guide to Kubernetes Network Policy and Service Mesh (Istio, Cilium, Calico Comparison)


Introduction

In a Kubernetes cluster, Pod-to-Pod communication is allow-all by default. This means any Pod can freely access any other Pod within the same cluster. While this isn't a major issue in small-scale development environments, it becomes a serious security threat in production environments where dozens or hundreds of microservices are running.

If an attacker compromises a single Pod, lateral movement to all services within the cluster becomes possible. To prevent this, Network Policy for network segmentation and Service Mesh for mTLS encryption and zero trust architecture are essential.

This article covers Kubernetes Network Policy from basics to advanced topics, and provides a comparative analysis of three major Service Mesh solutions: Istio, Cilium, and Calico. It includes real-world troubleshooting cases and performance benchmarks to help you make the best choice for your environment.

Kubernetes Network Policy Basics

What is Network Policy?

Network Policy is a Kubernetes-native resource that acts as a firewall rule controlling inbound (Ingress) and outbound (Egress) traffic at the Pod level. It operates based on label selectors and provides consistent policy enforcement even when Pods are restarted or moved between nodes.

Important prerequisite: Even if you create Network Policy resources, the policies will not take effect without a CNI plugin (Calico, Cilium, Antrea, etc.) that implements them. Default kubenet and Flannel do not support Network Policy.

Default Deny Policy

The starting point for all network security is the Default Deny policy. First block all traffic, then apply a whitelist approach that explicitly allows only necessary communication.

# default-deny-all.yaml
# Block all Ingress/Egress traffic for all Pods in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {} # Empty selector = all Pods in namespace
  policyTypes:
    - Ingress
    - Egress

Once this policy is applied, all Pods in the production namespace will have both inbound and outbound traffic completely blocked. Since DNS lookups will also fail, you must apply a DNS allow policy alongside it.

# allow-dns.yaml
# Policy to allow access to kube-dns (CoreDNS)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53

Allowing Specific Pod-to-Pod Communication

Here's an example of allowing frontend access to the backend API from a Default Deny state.

# allow-frontend-to-backend.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend-api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080

This policy allows only inbound traffic from Pods with the app: frontend label to TCP port 8080 of app: backend-api Pods.
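The label matching that drives this selection is simply a subset check: a selector matches a Pod when every key/value pair in `matchLabels` is present among the Pod's labels. A minimal illustrative sketch of that semantics (not the actual Kubernetes controller code):

```python
# Illustrative sketch of matchLabels selection: a selector matches a Pod
# when every selector key/value pair appears in the Pod's labels.
# Not the actual Kubernetes implementation.

def matches(selector: dict, pod_labels: dict) -> bool:
    """Return True if every matchLabels entry appears in pod_labels."""
    return all(pod_labels.get(k) == v for k, v in selector.items())

frontend = {"app": "frontend", "tier": "web"}
batch_job = {"app": "batch"}

# The ingress rule's podSelector {app: frontend} matches the frontend Pod...
print(matches({"app": "frontend"}, frontend))   # True
# ...but not an unrelated Pod, whose traffic stays blocked.
print(matches({"app": "frontend"}, batch_job))  # False
# An empty selector ({}) matches every Pod in the namespace.
print(matches({}, batch_job))                   # True
```

This is also why the `podSelector: {}` in the Default Deny policy above selects every Pod: an empty selector has no constraints to fail.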

Advanced Network Policy: Egress, CIDR, and Port Control

Controlling External Access with Egress Policy

When microservices need to access external APIs or databases, Egress policies can precisely restrict allowed targets.

# egress-external-api-and-db.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-egress-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend-api
  policyTypes:
    - Egress
  egress:
    # 1. Allow access to Redis in the same namespace
    - to:
        - podSelector:
            matchLabels:
              app: redis
      ports:
        - protocol: TCP
          port: 6379
    # 2. Allow access to external PostgreSQL RDS (CIDR-based)
    - to:
        - ipBlock:
            cidr: 10.100.0.0/16
      ports:
        - protocol: TCP
          port: 5432
    # 3. Allow external HTTPS API access
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8
              - 172.16.0.0/12
              - 192.168.0.0/16
      ports:
        - protocol: TCP
          port: 443
    # 4. Allow DNS
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53

Cross-Namespace Communication Control

Namespace isolation is essential in multi-tenant environments. Here's a pattern that allows access from specific namespaces only.

# cross-namespace-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-monitoring-access
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend-api
  policyTypes:
    - Ingress
  ingress:
    # Allow metrics scraping only from Prometheus in the monitoring namespace
    - from:
        - namespaceSelector:
            matchLabels:
              team: monitoring
          podSelector:
            matchLabels:
              app: prometheus
      ports:
        - protocol: TCP
          port: 9090
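A subtle but important detail in the policy above: because `namespaceSelector` and `podSelector` appear in the same `from` entry, they are ANDed, so only Prometheus Pods in `team: monitoring` namespaces are allowed. Written as two separate entries, the selectors would be ORed instead, which is far more permissive:

```yaml
# CAUTION: NOT equivalent to the policy above. As two separate `from`
# entries the selectors are ORed: traffic is allowed from ANY Pod in a
# team=monitoring namespace, OR from any app=prometheus Pod in the
# local namespace.
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            team: monitoring
      - podSelector:
          matchLabels:
            app: prometheus
```

The difference is a single `-`, and it is one of the most common sources of accidentally over-permissive policies.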

Limitations of Network Policy

Kubernetes basic Network Policy only supports L3/L4 level (IP, port, protocol) control. The following requirements cannot be addressed with basic Network Policy:

  • L7 (HTTP path, method, header) based filtering
  • mTLS encryption and service authentication
  • Traffic observability and distributed tracing
  • Advanced traffic management (canary deployments, circuit breakers, retries)
  • FQDN (domain) based Egress control

Service Mesh comes into play when these advanced features are needed.
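As a concrete illustration of the last bullet, FQDN-based egress cannot be expressed with the core NetworkPolicy API, but CNI-specific CRDs support it. Here is a hedged sketch using CiliumNetworkPolicy (covered in more depth later in this article); the domain is a placeholder, and in practice `toFQDNs` also requires a DNS visibility rule so the agent can learn which IPs the name resolves to:

```yaml
# Sketch only: FQDN-based egress via Cilium's CRD -- something the core
# NetworkPolicy API cannot express. The domain below is a placeholder.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-egress-external-api
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: backend-api
  egress:
    - toFQDNs:
        - matchName: api.example.com # placeholder external API
      toPorts:
        - ports:
            - port: '443'
              protocol: TCP
```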

Service Mesh Architecture Comparison (Istio vs Cilium vs Calico)

Here's a comparison of the architecture and features of three major solutions.

| Category | Istio (Ambient Mode) | Cilium Service Mesh | Calico (Enterprise) |
| --- | --- | --- | --- |
| Data Plane | ztunnel (L4) + Waypoint Proxy (L7) | eBPF (L3/L4) + per-node Envoy (L7) | iptables/eBPF + Envoy (L7) |
| Sidecar | Not required (Ambient Mode) | Not required | Optional |
| mTLS | Automatic (HBONE protocol) | WireGuard/IPsec | WireGuard, manual setup |
| L7 Policy | AuthorizationPolicy | CiliumNetworkPolicy | GlobalNetworkPolicy |
| Observability | Kiali, Jaeger, Prometheus | Hubble (built-in) | Calico Enterprise UI |
| Performance Overhead | Medium (via ztunnel) | Low (kernel level) | Medium |
| CPU Usage | Moderate | ~30% less (L4 baseline) | Moderate |
| QPS Performance | High (excellent at low connection counts) | High (excellent at high connection counts) | Moderate |
| Multi-cluster | Supported (East-West Gateway) | Cluster Mesh supported | Federation supported |
| Learning Curve | High | Medium | Medium |
| Community | Very large (CNCF Graduated) | Large (CNCF Graduated) | Large (Tigera-led) |
| Windows Nodes | Not supported | Not supported | Supported |
| Best Use Case | Large multi-cluster, precise L7 control | High-performance L4, eBPF-based observability | Hybrid environments, enterprise compliance |

Architecture Selection Criteria

  • Only L3/L4 network security needed: Kubernetes basic Network Policy + Calico/Cilium CNI
  • L7 traffic management + mTLS is key: Istio Ambient Mode
  • High performance + kernel-level observability: Cilium Service Mesh
  • Enterprise compliance + hybrid: Calico Enterprise

Building Service Mesh with Istio

Installing Istio Ambient Mode

Istio Ambient Mode became GA in Istio 1.24. It provides mTLS and L7 traffic management while reducing CPU/memory overhead by over 90% compared to the traditional sidecar approach.

# Install istioctl
curl -L https://istio.io/downloadIstio | ISTIO_VERSION=1.24.2 sh -
export PATH="$HOME/istio-1.24.2/bin:$PATH"

# Install with Ambient profile
istioctl install --set profile=ambient --skip-confirmation

# Verify installation
kubectl get pods -n istio-system
# NAME                                   READY   STATUS    RESTARTS   AGE
# istiod-7b69f4b6c-xxxxx                 1/1     Running   0          60s
# ztunnel-xxxxx                          1/1     Running   0          60s
# istio-cni-node-xxxxx                   1/1     Running   0          60s

# Enable Ambient mode for namespace
kubectl label namespace production istio.io/dataplane-mode=ambient

Deploying Waypoint Proxy (for L7 policies)

If only L4-level mTLS is needed, ztunnel alone is sufficient. Deploy a Waypoint Proxy when you need L7-level precise traffic control.

# Create Waypoint Proxy
istioctl waypoint apply --namespace production --name backend-waypoint

# Connect Waypoint to specific service
kubectl label service backend-api \
  istio.io/use-waypoint=backend-waypoint \
  -n production

Istio AuthorizationPolicy Configuration

Istio's L7 policies control down to HTTP methods, paths, and headers through AuthorizationPolicy.

# istio-auth-policy.yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: backend-api-policy
  namespace: production
spec:
  targetRefs:
    - kind: Service
      group: ''
      name: backend-api
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              - 'cluster.local/ns/production/sa/frontend'
      to:
        - operation:
            methods: ['GET', 'POST']
            paths: ['/api/v1/*']
    - from:
        - source:
            principals:
              - 'cluster.local/ns/monitoring/sa/prometheus'
      to:
        - operation:
            methods: ['GET']
            paths: ['/metrics']

This policy allows only GET and POST to /api/v1/ paths from the frontend service account, and only GET to /metrics from Prometheus. All other requests are rejected with 403 Forbidden.
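The evaluation logic behind an ALLOW policy can be summarized as: a request passes if any rule matches both its source principal and its operation (method plus path, where a trailing `*` is a prefix wildcard). A simplified, illustrative sketch of that logic (not Envoy's actual RBAC engine):

```python
# Simplified sketch of ALLOW AuthorizationPolicy evaluation: a request
# passes if ANY rule matches its source principal AND its operation
# (method + path). Illustrative only -- not Envoy's actual RBAC engine.
from fnmatch import fnmatch

RULES = [
    {"principals": ["cluster.local/ns/production/sa/frontend"],
     "methods": ["GET", "POST"], "paths": ["/api/v1/*"]},
    {"principals": ["cluster.local/ns/monitoring/sa/prometheus"],
     "methods": ["GET"], "paths": ["/metrics"]},
]

def allowed(principal: str, method: str, path: str) -> bool:
    """Return True if any rule allows the (principal, method, path) triple."""
    return any(
        principal in rule["principals"]
        and method in rule["methods"]
        and any(fnmatch(path, glob) for glob in rule["paths"])
        for rule in RULES
    )

print(allowed("cluster.local/ns/production/sa/frontend", "GET", "/api/v1/orders"))     # True
print(allowed("cluster.local/ns/production/sa/frontend", "DELETE", "/api/v1/orders"))  # False -> 403
print(allowed("cluster.local/ns/monitoring/sa/prometheus", "GET", "/metrics"))         # True
```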

Cilium eBPF-based Service Mesh

Installing Cilium and Enabling Service Mesh

Cilium leverages eBPF to handle networking at the kernel level. It processes L4 traffic without sidecar proxies and uses a per-node shared Envoy proxy only when L7 processing is needed.

# Install Cilium CLI
CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/main/stable.txt)
GOOS=$(go env GOOS)
GOARCH=$(go env GOARCH)
curl -L --fail --remote-name-all \
  "https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-${GOOS}-${GOARCH}.tar.gz"
sudo tar xzvfC "cilium-${GOOS}-${GOARCH}.tar.gz" /usr/local/bin

# Install Cilium with Helm (Service Mesh + Hubble enabled)
helm repo add cilium https://helm.cilium.io/
helm repo update

helm install cilium cilium/cilium --version 1.17.0 \
  --namespace kube-system \
  --set kubeProxyReplacement=true \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true \
  --set envoy.enabled=true \
  --set encryption.enabled=true \
  --set encryption.type=wireguard

# Verify installation
cilium status --wait
cilium connectivity test

Applying L7 Policies with CiliumNetworkPolicy

Cilium supports precise L7 protocol-level policies for HTTP, gRPC, Kafka, and more through its own CRD, CiliumNetworkPolicy.

# cilium-l7-policy.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: backend-l7-policy
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: backend-api
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: '8080'
              protocol: TCP
          rules:
            http:
              - method: 'GET'
                path: '/api/v1/products'
              - method: 'POST'
                path: '/api/v1/orders'
              - method: 'GET'
                path: '/healthz'
    - fromEndpoints:
        - matchLabels:
            app: prometheus
      toPorts:
        - ports:
            - port: '9090'
              protocol: TCP
          rules:
            http:
              - method: 'GET'
                path: '/metrics'

Network Observability with Hubble

Cilium's Hubble monitors all network flows in real-time based on eBPF.

# Observe after installing Hubble CLI
hubble observe --namespace production --follow

# Filter traffic for specific Pod
hubble observe --namespace production \
  --to-label app=backend-api \
  --verdict DROPPED

# Observe HTTP requests (L7)
hubble observe --namespace production \
  --protocol http \
  --http-status 5xx

# Visualize network flows (Hubble UI)
cilium hubble port-forward &
# Access http://localhost:12000 in browser

Hubble output example:

TIMESTAMP             SOURCE                  DESTINATION             TYPE      VERDICT   SUMMARY
Mar 14 10:23:01.123   production/frontend     production/backend-api  L7/HTTP   FORWARDED GET /api/v1/products => 200
Mar 14 10:23:01.456   production/attacker     production/backend-api  L7/HTTP   DROPPED   POST /api/v1/admin => Policy denied
Mar 14 10:23:02.789   production/backend-api  production/redis        L4/TCP    FORWARDED TCP 6379

mTLS and Zero Trust Networking

What is Zero Trust?

Zero Trust is a security model of "trust nothing, verify everything." It encrypts communication within the cluster and verifies identity in every service-to-service call. Since Network Policy alone cannot encrypt traffic, a Service Mesh providing mTLS (mutual TLS) is required.

Istio Ambient Mode's mTLS

Istio Ambient Mode uses the HBONE (HTTP-Based Overlay Network Environment) protocol to automatically mTLS-encrypt all traffic. ztunnel manages certificates at the node level and issues a unique SPIFFE-based workload ID for each Pod.

# istio-peer-auth.yaml
# STRICT mode: Reject plaintext traffic without mTLS
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: strict-mtls
  namespace: production
spec:
  mtls:
    mode: STRICT

When STRICT mode is set, all services in the namespace will only accept mTLS connections. Plaintext requests from services not included in the mesh are all rejected.

Cilium's WireGuard-based Encryption

Cilium uses kernel-built-in WireGuard to automatically encrypt inter-node traffic. Unlike Istio's mTLS, it operates at the kernel level rather than the application level, resulting in less performance overhead.

# Check WireGuard encryption status
cilium encrypt status

# Output example:
# Encryption:  Wireguard
# Keys in use: 2
# Errors:      0
# Interfaces:  cilium_wg0

# Check encryption key list
cilium encrypt get

Differences between WireGuard and mTLS:

  • WireGuard: L3-level inter-node encryption, kernel-level processing, no individual Pod identity
  • mTLS (Istio): L7-level inter-service encryption, SPIFFE-based workload ID, fine-grained authorization policies

In production environments, a dual security strategy is sometimes used: applying Cilium WireGuard for inter-node encryption while additionally implementing Istio mTLS for workload-level authentication.

Operational Considerations and Troubleshooting

1. Network Policy Application Order

Network Policies are additive: the effective allow set for a Pod is the union of the allow rules of every policy that selects it. The API has no deny rules and no priority ordering, so policies cannot conflict in the usual sense; if any single policy allows a given connection, it passes.
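The additive model can be sketched in a few lines: effective allow = union of the allow rules of every selecting policy, with default deny as the baseline. Illustrative only; real enforcement happens in the CNI data plane.

```python
# Sketch of NetworkPolicy's additive model: the effective ingress allow
# set for a Pod is the UNION of the allow rules of every policy that
# selects it. No deny rules, no priority ordering. Illustrative only.

def traffic_allowed(policies, source_labels, port) -> bool:
    """Allowed iff at least one selecting policy has a matching allow rule."""
    return any(
        rule["from_app"] == source_labels.get("app") and rule["port"] == port
        for policy in policies
        for rule in policy["ingress"]
    )

policies = [
    {"name": "allow-frontend", "ingress": [{"from_app": "frontend", "port": 8080}]},
    {"name": "allow-prometheus", "ingress": [{"from_app": "prometheus", "port": 9090}]},
]

# Either policy alone is enough to admit its traffic:
print(traffic_allowed(policies, {"app": "frontend"}, 8080))  # True
# Traffic no policy allows is dropped by the default-deny baseline:
print(traffic_allowed(policies, {"app": "attacker"}, 8080))  # False
```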

# Check all Network Policies applied to specific Pod
kubectl get networkpolicy -n production -o wide

# Check Pod labels (verify policy selector matching)
kubectl get pods -n production --show-labels

# For Calico, verify policy matching
calicoctl get networkpolicy -n production -o yaml

2. Network Policy Ignored Without CNI Plugin

This is the most common mistake. Even if Network Policy resources are created, policies are not enforced at all without a CNI plugin that implements them.

# Check CNI plugin
kubectl get pods -n kube-system | grep -E 'calico|cilium|antrea'

# Test to verify Network Policy is actually enforced
# 1. Apply Default Deny
kubectl apply -f default-deny-all.yaml

# 2. Test communication (should be blocked)
kubectl exec -n production deploy/frontend -- \
  curl -s --connect-timeout 3 http://backend-api:8080/healthz
# Timeout indicates policy is properly enforced

3. DNS Resolution Failure

If DNS allow is omitted after applying a Default Deny policy, all service discovery will be disrupted.

# Diagnose DNS issue
kubectl exec -n production deploy/frontend -- nslookup backend-api
# ;; connection timed out; no servers could be reached

# Check CoreDNS Pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Re-test after applying DNS policy
kubectl apply -f allow-dns.yaml
kubectl exec -n production deploy/frontend -- nslookup backend-api
# Server:    10.96.0.10
# Name:      backend-api.production.svc.cluster.local

4. Istio ztunnel Failure Response

ztunnel runs as a per-node DaemonSet, and when it fails, all Ambient Mesh traffic on that node is disrupted.

# Check ztunnel status
kubectl get pods -n istio-system -l app=ztunnel

# Check ztunnel logs (diagnose certificate issues)
kubectl logs -n istio-system -l app=ztunnel --tail=50

# Restart ztunnel
kubectl rollout restart daemonset/ztunnel -n istio-system

# Check xDS connection status with Istiod
istioctl proxy-status

5. Cilium eBPF Map Capacity Exceeded

In large-scale clusters, the default eBPF map sizes in Cilium may be insufficient.

# Check eBPF map usage
cilium bpf ct list global | wc -l
cilium bpf policy get --all

# Increase map sizes (modify Helm values)
# bpf.ctGlobalTCPMax: 524288  (increase from default)
# bpf.ctGlobalAnyMax: 262144
# bpf.policyMapMax: 65536

# Apply changes
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set bpf.ctGlobalTCPMax=524288

Failure Cases and Recovery Procedures

Case 1: Full Service Outage from Incorrect Egress Policy

Situation: The operations team applied Default Deny Egress for security hardening but omitted the DNS allow policy, disrupting all inter-service communication.

Symptoms: Access via service names failed from all Pods. Direct IP access still worked.

Recovery procedure:

# 1. Immediately identify the problem
kubectl get networkpolicy -n production

# 2. Urgently apply DNS allow policy
kubectl apply -f - <<'POLICY'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: emergency-allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
POLICY

# 3. Verify service recovery
kubectl exec -n production deploy/frontend -- nslookup backend-api

Lesson: Default Deny policies must always be applied alongside DNS allow policies. Always test in a staging environment before making policy changes, and prepare a rollback plan.

Case 2: mTLS Mismatch During Istio Upgrade

Situation: During an Istio version upgrade, mTLS handshake failures occurred between old-version sidecars and new-version ztunnel.

Symptoms: 503 errors and "upstream connect error" messages between some services.

Recovery procedure:

# 1. Temporarily change mTLS mode to PERMISSIVE (allow both plaintext+mTLS)
kubectl apply -f - <<'POLICY'
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: permissive-during-upgrade
  namespace: production
spec:
  mtls:
    mode: PERMISSIVE
POLICY

# 2. Restart all workloads to apply latest proxy
kubectl rollout restart deployment -n production

# 3. Restore STRICT after all Pods are replaced with new version
kubectl rollout status deployment -n production --timeout=300s
kubectl apply -f - <<'POLICY'
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: strict-mtls
  namespace: production
spec:
  mtls:
    mode: STRICT
POLICY

Case 3: Momentary Traffic Drop Due to Cilium Agent Restart

Situation: During a Cilium DaemonSet update, the eBPF programs on the node were briefly unloaded, interrupting Pod-to-Pod communication on that node for several seconds.

Recovery and prevention:

# Rolling Update strategy to update one node at a time
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set upgradeCompatibility=1.16 \
  --set rollOutCiliumPods=true

# Check PodDisruptionBudget
kubectl get pdb -n kube-system

# Monitor update progress
kubectl rollout status daemonset/cilium -n kube-system --timeout=600s

Performance Benchmarks and Selection Guide

Measured Benchmark Results (2025 baseline)

Here's a summary of comparison test results from recent large-scale enterprise environments.

| Metric | Network Policy Only | Istio Ambient | Cilium Service Mesh |
| --- | --- | --- | --- |
| P99 Latency (ms) | 1.2 | 3.8 | 2.1 |
| QPS (req/s) | 45,000 | 38,000 | 42,000 |
| QPS per Core | - | 2,178 | 1,815 |
| CPU Overhead | Baseline | +15% | +8% |
| Memory Overhead | Baseline | +120 MB/node | +80 MB/node |
| Low-connection Perf | - | Excellent | Moderate |
| High-connection Perf | - | Moderate | Excellent |

Note: Istio's higher QPS per Core includes L7 processing capability, and Cilium's CPU measurements exclude in-kernel WireGuard encryption costs.

Selection Guide Flowchart

  1. Only L3/L4 network isolation needed?
    • Yes: Kubernetes Network Policy + Calico or Cilium CNI is sufficient
    • No: Go to 2
  2. L7 traffic management (canary, retry, circuit breaker) needed?
    • Yes: Istio Ambient Mode or Cilium + Envoy
    • No: Go to 3
  3. mTLS-based zero trust required?
    • Yes: Istio Ambient Mode (SPIFFE-based workload ID)
    • No: Cilium WireGuard for inter-node encryption
  4. High performance + kernel-level observability a priority?
    • Yes: Cilium Service Mesh + Hubble
    • No: Istio (richer L7 features)
  5. Mixed Windows nodes or hybrid environment?
    • Yes: Calico Enterprise
    • No: Istio or Cilium

Recommendations by Operational Scale

  • Small clusters (10 nodes or fewer): Cilium CNI + basic Network Policy. Service Mesh adoption has limited benefit relative to overhead.
  • Medium clusters (10-100 nodes): Cilium Service Mesh or Istio Ambient. Choose based on L7 requirements.
  • Large clusters (100+ nodes): Istio Ambient Mode. Mature multi-cluster support, rich ecosystem, stability.
  • Hybrid/Multi-cloud: Calico Enterprise or Cilium Cluster Mesh.

Conclusion

Kubernetes network security goes beyond simply applying Network Policy; it requires a comprehensive strategy tailored to your application characteristics and security requirements.

Key takeaways:

  • Network Policy is the absolute baseline: Start with Default Deny and apply a whitelist approach that allows only necessary communication. Don't forget the DNS allow policy.
  • Introduce Service Mesh when needed: Adopt it when mTLS, L7 policies, and advanced traffic management are actually required. Unnecessary complexity only increases operational burden.
  • Istio vs Cilium is a trade-off: Istio has rich L7 features and ecosystem, while Cilium excels at kernel-level performance and observability. Choose based on your requirements.
  • Gradual adoption is key: The safest approach is to start with Default Deny policies, thoroughly validate Network Policies, and then add Service Mesh as needed.
  • Automation and testing: Policy changes must always be validated in staging first through CI/CD pipelines, and always have a rollback plan ready.

Network security is not a one-time setup. As services evolve, policies must be continuously reviewed and updated. Quarterly policy audits are recommended, and use observability tools like Hubble or Kiali to optimize policies based on actual traffic patterns.
