Kubernetes Networking Deep Dive Guide 2025: CNI, Service, Ingress, DNS, Network Policy

1. Fundamental Principles of the Kubernetes Networking Model

Kubernetes networking is built on three fundamental principles.

1.1 Core Requirements

  1. Every Pod gets a unique IP address -- can communicate with others without NAT
  2. Every Pod can reach every other Pod -- whether on the same node or different nodes
  3. The IP a Pod sees for itself is the same IP others see -- no NAT
┌────────────────────────────────────────────────────┐
│         Kubernetes Networking Fundamentals         │
│                                                    │
│  ┌──────────┐           ┌──────────┐               │
│  │  Node 1  │           │  Node 2  │               │
│  │          │           │          │               │
│  │ ┌──────┐ │  No NAT   │ ┌──────┐ │               │
│  │ │Pod A │<───────────>│ │Pod C │ │               │
│  │ │10.1.1│ │           │ │10.1.2│ │               │
│  │ └──────┘ │           │ └──────┘ │               │
│  │ ┌──────┐ │           │ ┌──────┐ │               │
│  │ │Pod B │ │           │ │Pod D │ │               │
│  │ │10.1.1│ │           │ │10.1.2│ │               │
│  │ └──────┘ │           │ └──────┘ │               │
│  └──────────┘           └──────────┘               │
│                                                    │
│  Pod A(10.1.1.2) -> Pod C(10.1.2.3): Direct comm   │
│  No NAT, no port mapping. Each Pod owns unique IP  │
└────────────────────────────────────────────────────┘

1.2 Networking Layer Hierarchy

┌──────────────────────────────────────────────────┐
│          Kubernetes Networking Layers            │
├──────────────────────────────────────────────────┤
│ L7: Ingress / Gateway API                        │
│     (HTTP routing, TLS termination)              │
├──────────────────────────────────────────────────┤
│ L4: Service                                      │
│     (ClusterIP, NodePort, LoadBalancer)          │
├──────────────────────────────────────────────────┤
│ L3: Pod Networking (CNI)                         │
│     (Pod-to-Pod, overlay/underlay)               │
├──────────────────────────────────────────────────┤
│ L2-L3: Node Networking                           │
│        (Physical/Virtual network)                │
└──────────────────────────────────────────────────┘

2. Pod-to-Pod Networking: How It Works Internally

2.1 Pod Communication on the Same Node

┌────────────────────────────────────────────┐
│                   Node                     │
│                                            │
│  ┌──────┐   veth pair   ┌──────────┐       │
│  │Pod A │<------------->│          │       │
│  │eth0  │               │   cbr0   │       │
│  └──────┘               │ (bridge) │       │
│                         │          │       │
│  ┌──────┐   veth pair   │          │       │
│  │Pod B │<------------->│          │       │
│  │eth0  │               └──────────┘       │
│  └──────┘                                  │
│                                            │
│  1. Pod A sends packet to Pod B            │
│  2. Travels via veth pair to bridge (cbr0) │
│  3. Bridge finds destination veth by MAC   │
│  4. Delivered to Pod B via its veth pair   │
└────────────────────────────────────────────┘

veth pair: A virtual ethernet pair -- one end is eth0 in the Pod namespace, the other connects to the node's bridge.

2.2 Pod Communication Across Nodes (Overlay)

┌──────────┐                      ┌──────────┐
│  Node 1  │                      │  Node 2  │
│          │                      │          │
│ ┌──────┐ │    VXLAN/IPinIP      │ ┌──────┐ │
│ │Pod A │ │   ┌───────────┐      │ │Pod C │ │
│ │10.1.1│ │-->│ Tunneling │----->│ │10.1.2│ │
│ └──────┘ │   │           │      │ └──────┘ │
│          │   │Encapsulate│      │          │
│ cbr0     │   │original   │      │ cbr0     │
│ 10.1.1.0 │   │in outer   │      │ 10.1.2.0 │
│          │   └───────────┘      │          │
│ eth0     │                      │ eth0     │
│ 192.168.1│                      │192.168.2 │
└──────────┘                      └──────────┘

VXLAN Encapsulation:
┌─────────┬───────┬─────────┬─────────┬─────────┐
│Outer IP │ UDP   │ VXLAN   │Inner IP │ Payload │
│Hdr      │ Hdr   │ Hdr     │Hdr      │         │
│192->192 │       │         │10->10   │ Data    │
└─────────┴───────┴─────────┴─────────┴─────────┘
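The "VXLAN -50 bytes" figure that appears throughout this guide falls out of simple header arithmetic. A minimal sketch of that sum (constants are standard header sizes, not taken from any particular CNI):

```python
# VXLAN wraps the Pod's original Ethernet frame in outer IP, UDP, and
# VXLAN headers. Relative to the physical NIC's MTU, the inner IP
# packet loses the sum of these headers plus the encapsulated inner
# Ethernet header (54 bytes if the outer frame also carries a VLAN tag).
OUTER_IPV4 = 20      # outer IPv4 header
OUTER_UDP = 8        # UDP header (VXLAN typically rides on UDP port 4789)
VXLAN_HDR = 8        # VXLAN header (flags + VNI)
INNER_ETHERNET = 14  # the encapsulated Pod frame's Ethernet header

vxlan_overhead = OUTER_IPV4 + OUTER_UDP + VXLAN_HDR + INNER_ETHERNET
print(vxlan_overhead)             # 50

# Pod-side MTU on a standard 1500-byte NIC:
print(1500 - vxlan_overhead)      # 1450
```

This is why Section 12.1 recommends setting the overlay MTU to the physical MTU minus 50 for VXLAN.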

2.3 Overlay vs Underlay Networking

| Feature              | Overlay (VXLAN/IPinIP) | Underlay (BGP/Direct)  |
|----------------------|------------------------|------------------------|
| Setup Difficulty     | Easy                   | Hard                   |
| Network Requirements | L3 connectivity only   | BGP support needed     |
| Performance Overhead | Yes (encapsulation)    | None                   |
| MTU Impact           | Reduced (50-54 bytes)  | None                   |
| Debugging            | Hard                   | Easy                   |
| Use Case             | Cloud, multi-tenant    | Bare metal, high perf  |

3. Detailed CNI Plugin Comparison

3.1 What Is CNI

CNI (Container Network Interface) is the standard interface for container runtimes to interact with network plugins.

CNI Flow During Pod Creation:

1. kubelet requests container creation via CRI
2. Container runtime creates network namespace
3. kubelet invokes CNI plugin (ADD command)
4. CNI plugin:
   a. Creates veth pair
   b. Assigns IP address to Pod (IPAM)
   c. Configures routing rules
   d. Sets up overlay tunnel if needed
5. Pod is network-ready
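The outcome of step 4 is reported back to the runtime as a JSON result. A sketch of that result's shape following the CNI spec (the sandbox path and addresses here are made-up examples):

```python
# Illustrative CNI ADD result, per the CNI spec's result format:
# the interfaces the plugin created, the IPs assigned by IPAM, and
# the routes installed in the Pod's network namespace.
cni_add_result = {
    "cniVersion": "1.0.0",
    "interfaces": [
        # "sandbox" is the Pod's network namespace path (example value)
        {"name": "eth0", "sandbox": "/var/run/netns/cni-abc123"},
    ],
    "ips": [
        {"address": "10.1.1.2/24", "gateway": "10.1.1.1", "interface": 0},
    ],
    "routes": [
        {"dst": "0.0.0.0/0"},      # default route via the gateway
    ],
}

# The runtime records the assigned address as the Pod IP
pod_ip = cni_add_result["ips"][0]["address"].split("/")[0]
print(pod_ip)   # 10.1.1.2
```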

3.2 Major CNI Plugin Comparison

| Feature                | Calico              | Cilium               | Flannel       | Weave Net     |
|------------------------|---------------------|----------------------|---------------|---------------|
| Developer              | Tigera              | Isovalent (Cisco)    | CoreOS        | Weaveworks    |
| Data Plane             | iptables/eBPF       | eBPF                 | VXLAN         | VXLAN/Sleeve  |
| Network Mode           | BGP/VXLAN/IPinIP    | VXLAN/Native         | VXLAN         | VXLAN         |
| Network Policy         | Rich                | Very rich (L7)       | None          | Basic         |
| Encryption             | WireGuard           | WireGuard/IPsec      | None          | IPsec         |
| kube-proxy Replacement | eBPF mode           | Native eBPF          | No            | No            |
| Observability          | Basic               | Hubble (powerful)    | None          | Scope         |
| Service Mesh           | None                | Built-in (optional)  | None          | None          |
| Multi-cluster          | Supported           | ClusterMesh          | Not supported | Supported     |
| Performance            | High (BGP)          | Very high (eBPF)     | Medium        | Low           |
| Complexity             | Medium              | High                 | Low           | Low           |
| Production Rec.        | Strongly rec.       | Strongly rec.        | Small only    | Not rec.      |

3.3 Calico Deep Dive

# Calico BGP mode configuration (IPPool)
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 10.244.0.0/16
  encapsulation: None        # No encapsulation with BGP
  natOutgoing: true
  nodeSelector: all()
  blockSize: 26              # /26 = 64 IPs per node

---
# Calico BGP Peering configuration
apiVersion: crd.projectcalico.org/v1
kind: BGPPeer
metadata:
  name: rack-tor-switch
spec:
  peerIP: 192.168.1.1
  asNumber: 64512
  nodeSelector: rack == "rack-1"

---
# Calico VXLAN mode (cloud environments)
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 10.244.0.0/16
  encapsulation: VXLAN       # VXLAN encapsulation
  natOutgoing: true
  vxlanMode: Always

3.4 Cilium Deep Dive

# Cilium Helm install (kube-proxy replacement mode)
# helm install cilium cilium/cilium --namespace kube-system \
#   --set kubeProxyReplacement=true \
#   --set k8sServiceHost=API_SERVER_IP \
#   --set k8sServicePort=6443 \
#   --set hubble.enabled=true \
#   --set hubble.relay.enabled=true \
#   --set hubble.ui.enabled=true \
#   --set encryption.enabled=true \
#   --set encryption.type=wireguard

# Verify Cilium status
# cilium status
# cilium connectivity test

# Cilium L7 Network Policy (HTTP-based policy)
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: l7-rule
  namespace: default
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http:
        - method: "GET"
          path: "/api/v1/.*"
        - method: "POST"
          path: "/api/v1/orders"

4. Service Deep Dive: ClusterIP to LoadBalancer

4.1 How Services Work

┌──────────────────────────────────────────────────┐
│              Service Traffic Flow                │
│                                                  │
│  Client Pod --> Service VIP --> kube-proxy       │
│                 (10.96.0.10)    iptables/IPVS    │
│                           ┌────────┼────────┐    │
│                           v        v        v    │
│                         Pod A    Pod B    Pod C  │
│                       10.1.1.2 10.1.1.3 10.1.2.4 │
│                                                  │
│  Service VIP is not bound to any real interface  │
│  iptables/IPVS rules DNAT to Pod IPs             │
└──────────────────────────────────────────────────┘

4.2 Service Types in Detail

# 1. ClusterIP (default) - internal access only
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: ClusterIP
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
  # Cluster IP: auto-assigned (e.g., 10.96.0.10)
  # Accessible only within cluster at 10.96.0.10:80

---
# 2. NodePort - external access via fixed port on all nodes
apiVersion: v1
kind: Service
metadata:
  name: my-nodeport-service
spec:
  type: NodePort
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
    nodePort: 30080    # Range: 30000-32767
  # Accessible via port 30080 on any node
  # NodeIP:30080 -> ClusterIP:80 -> Pod:8080

---
# 3. LoadBalancer - auto-provisions cloud load balancer
apiVersion: v1
kind: Service
metadata:
  name: my-lb-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
  - port: 443
    targetPort: 8443
  # External LB IP -> NodePort -> ClusterIP -> Pod

---
# 4. ExternalName - maps external DNS to cluster service
apiVersion: v1
kind: Service
metadata:
  name: external-db
spec:
  type: ExternalName
  externalName: mydb.example.com
  # external-db.default.svc.cluster.local -> mydb.example.com

---
# 5. Headless Service - returns Pod IPs directly, no ClusterIP
apiVersion: v1
kind: Service
metadata:
  name: my-headless-service
spec:
  clusterIP: None            # Declares Headless
  selector:
    app: my-stateful-app
  ports:
  - port: 5432
  # DNS queries return individual Pod IPs, not Service IP
  # Used with StatefulSet: pod-0.my-headless-service.default.svc

4.3 Session Affinity

apiVersion: v1
kind: Service
metadata:
  name: sticky-service
spec:
  selector:
    app: web
  ports:
  - port: 80
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800  # 3 hours
  # Same client IP routes to same Pod
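The sessionAffinity behavior above can be modelled in a few lines: first request from a client IP picks a backend, later requests within the timeout stick to it. This is a toy model, not kube-proxy's actual implementation:

```python
import time

# Toy ClientIP session affinity: the first request from a client IP
# picks a backend, and later requests within timeout_seconds stick to
# the same one; after the timeout, the entry is re-picked.
class AffinityBalancer:
    def __init__(self, backends, timeout_seconds=10800):
        self.backends = backends
        self.timeout = timeout_seconds
        self.table = {}    # client_ip -> (backend, last_seen)
        self.next_idx = 0  # simple round-robin for new clients

    def route(self, client_ip, now=None):
        now = time.time() if now is None else now
        entry = self.table.get(client_ip)
        if entry and now - entry[1] < self.timeout:
            backend = entry[0]              # sticky hit, refresh timestamp
        else:
            backend = self.backends[self.next_idx % len(self.backends)]
            self.next_idx += 1
        self.table[client_ip] = (backend, now)
        return backend

lb = AffinityBalancer(["10.1.1.2", "10.1.1.3"], timeout_seconds=10800)
print(lb.route("203.0.113.7", now=0))      # 10.1.1.2
print(lb.route("203.0.113.7", now=100))    # 10.1.1.2 (within timeout)
print(lb.route("203.0.113.7", now=20000))  # 10.1.1.3 (timeout expired)
```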

5. kube-proxy Modes: iptables vs IPVS vs eBPF

5.1 iptables Mode (Default)

iptables rule chain (per Service):

PREROUTING -> KUBE-SERVICES -> KUBE-SVC-XXXX -> KUBE-SEP-YYYY
                                               (probabilistic)

Service with 3 Pods:
KUBE-SVC-XXXX:
  -p tcp -d 10.96.0.10 --dport 80
    -> 33% probability KUBE-SEP-1 (DNAT -> 10.1.1.2:8080)
    -> 33% probability KUBE-SEP-2 (DNAT -> 10.1.1.3:8080)
    -> 34% probability KUBE-SEP-3 (DNAT -> 10.1.2.4:8080)

Problems:
- Rule count grows proportionally with Services/Pods O(n)
- Full chain rewrite on rule updates
- Severe degradation at 10,000+ Services
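The probabilistic selection above can be simulated. Note that the real chain uses cascading probabilities (1/3 on the first rule, then 1/2 of the remainder, then the rest), which works out to a uniform split; this is a toy model of that logic:

```python
import random

# Toy model of kube-proxy's iptables statistic-mode rules: rule i
# matches with probability 1/(n-i), so each of n endpoints ends up
# chosen uniformly even though rules are evaluated in order.
def iptables_pick(endpoints, rng):
    n = len(endpoints)
    for i, ep in enumerate(endpoints):
        if rng.random() < 1.0 / (n - i):   # statistic --probability
            return ep
    return endpoints[-1]                   # last rule always matches

rng = random.Random(42)
eps = ["10.1.1.2:8080", "10.1.1.3:8080", "10.1.2.4:8080"]
counts = {ep: 0 for ep in eps}
for _ in range(30000):
    counts[iptables_pick(eps, rng)] += 1
print(counts)  # each endpoint near 10000
```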

5.2 IPVS Mode

IPVS (IP Virtual Server):

┌───────────────────────────────────────────┐
│  IPVS Virtual Server: 10.96.0.10:80       │
│                                           │
│  Real Server 1: 10.1.1.2:8080  (weight 1) │
│  Real Server 2: 10.1.1.3:8080  (weight 1) │
│  Real Server 3: 10.1.2.4:8080  (weight 1) │
│                                           │
│  Algorithm: rr (Round Robin)              │
│  Others: lc, dh, sh, sed, nq              │
└───────────────────────────────────────────┘

Advantages:
- Hash table-based O(1) lookup
- Multiple load balancing algorithms
- Stable performance at scale
- Real-time statistics and connection tracking
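Two of the schedulers listed above, rr (round robin) and lc (least connections), can be sketched in user space. These are toy versions of the algorithms; real IPVS implements them in the kernel:

```python
from collections import deque

# Toy round-robin scheduler: cycle through real servers in order.
def round_robin(servers):
    q = deque(servers)
    while True:
        s = q[0]
        q.rotate(-1)
        yield s

# Toy least-connections scheduler: pick the real server with the
# fewest active connections.
def least_conn(servers, active):
    return min(servers, key=lambda s: active[s])

servers = ["10.1.1.2:8080", "10.1.1.3:8080", "10.1.2.4:8080"]
rr = round_robin(servers)
print([next(rr) for _ in range(4)])
# ['10.1.1.2:8080', '10.1.1.3:8080', '10.1.2.4:8080', '10.1.1.2:8080']

active = {"10.1.1.2:8080": 5, "10.1.1.3:8080": 1, "10.1.2.4:8080": 3}
print(least_conn(servers, active))   # '10.1.1.3:8080'
```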

# kube-proxy IPVS mode configuration
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"       # rr, lc, dh, sh, sed, nq
  syncPeriod: "30s"
  minSyncPeriod: "2s"

5.3 eBPF Mode (Cilium)

eBPF kube-proxy replacement:

┌────────────────────────────────────────────┐
│  Before: Pod -> iptables/IPVS -> Pod       │
│  eBPF:   Pod -> BPF program -> Pod (direct)│
│                                            │
│  ┌──────┐    BPF map     ┌──────┐          │
│  │Pod A │---(Service  --->│Pod B │         │
│  │      │    lookup)      │      │         │
│  └──────┘                 └──────┘         │
│                                            │
│  No iptables chain traversal!              │
│  Direct packet redirect in kernel space    │
└────────────────────────────────────────────┘

| Comparison          | iptables       | IPVS                | eBPF (Cilium)     |
|---------------------|----------------|---------------------|-------------------|
| Lookup Complexity   | O(n)           | O(1)                | O(1)              |
| Rule Updates        | Full rewrite   | Incremental         | Incremental       |
| Load Balancing      | Probabilistic  | Multiple algorithms | Maglev hash       |
| Connection Tracking | conntrack      | conntrack           | BPF conntrack     |
| Performance         | Medium         | High                | Very high         |
| L7 Policy           | No             | No                  | Yes               |
| Observability       | Limited        | Basic               | Hubble (powerful) |
| 10K Services        | Very slow      | Fast                | Very fast         |

6. DNS: CoreDNS and Service Discovery

6.1 CoreDNS Architecture

┌──────────────────────────────────────────────────┐
│               CoreDNS Operation                  │
│                                                  │
│  Pod --DNS query--> CoreDNS Pod (kube-system)    │
│        (nameserver 10.96.0.10)                   │
│           ┌──────────┼──────────┐                │
│           v          v          v                │
│        K8s API    Corefile    Upstream DNS       │
│     (Service/Pod  (Plugin     (External DNS)     │
│       records)     config)                       │
└──────────────────────────────────────────────────┘

6.2 DNS Record Format

Service DNS:
  my-service.my-namespace.svc.cluster.local
  |--svcname-| |-namespace--|

Pod DNS:
  10-1-1-2.my-namespace.pod.cluster.local
  |IP(dots->dashes)|

Headless Service Pod DNS (StatefulSet):
  pod-0.my-headless.my-namespace.svc.cluster.local
  |podname| |-svcname--|

SRV Records:
  _http._tcp.my-service.my-namespace.svc.cluster.local
  -> Returns port number and hostname
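The name formats above are purely mechanical, so they can be expressed as small helpers (illustrative functions, not a Kubernetes API):

```python
# Build the DNS names described above from their parts.
def service_dns(svc, ns, zone="cluster.local"):
    return f"{svc}.{ns}.svc.{zone}"

def pod_dns(pod_ip, ns, zone="cluster.local"):
    # Pod IP with dots replaced by dashes
    return f"{pod_ip.replace('.', '-')}.{ns}.pod.{zone}"

def statefulset_pod_dns(pod, headless_svc, ns, zone="cluster.local"):
    return f"{pod}.{headless_svc}.{ns}.svc.{zone}"

def srv_record(port_name, proto, svc, ns, zone="cluster.local"):
    return f"_{port_name}._{proto}.{service_dns(svc, ns, zone)}"

print(service_dns("my-service", "my-namespace"))
# my-service.my-namespace.svc.cluster.local
print(pod_dns("10.1.1.2", "my-namespace"))
# 10-1-1-2.my-namespace.pod.cluster.local
print(statefulset_pod_dns("pod-0", "my-headless", "my-namespace"))
# pod-0.my-headless.my-namespace.svc.cluster.local
print(srv_record("http", "tcp", "my-service", "my-namespace"))
# _http._tcp.my-service.my-namespace.svc.cluster.local
```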

6.3 CoreDNS Corefile Configuration

# CoreDNS Corefile (ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
          lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
          ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
          max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }

6.4 DNS Debugging

# Create a debug Pod for DNS testing
kubectl run dnsutils --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 \
  --restart=Never -- sleep infinity

# Service DNS lookup
kubectl exec dnsutils -- nslookup my-service.default.svc.cluster.local

# Pod DNS lookup
kubectl exec dnsutils -- nslookup 10-1-1-2.default.pod.cluster.local

# Detailed DNS response
kubectl exec dnsutils -- dig my-service.default.svc.cluster.local +search +short

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns -f

# Check resolv.conf
kubectl exec dnsutils -- cat /etc/resolv.conf
# nameserver 10.96.0.10
# search default.svc.cluster.local svc.cluster.local cluster.local
# options ndots:5

6.5 The ndots:5 Problem and Solutions

# Problem with default ndots:5:
# Looking up "api.example.com" (fewer than 5 dots):
# 1. api.example.com.default.svc.cluster.local (fail)
# 2. api.example.com.svc.cluster.local (fail)
# 3. api.example.com.cluster.local (fail)
# 4. api.example.com (success)
# -> 3 unnecessary queries for external DNS lookups!

# Solution 1: Customize Pod DNS config
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"       # Reduce ndots to 2
  containers:
  - name: app
    image: my-app

# Solution 2: Use FQDN directly (trailing dot)
# api.example.com.  <- trailing dot means FQDN
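The query amplification and both fixes can be seen in a toy model of the resolver's search-list behavior (a simplification of glibc's logic, not a real resolver):

```python
# Toy model of resolv.conf search-list expansion: names with fewer
# dots than ndots are tried against each search domain first; a
# trailing dot marks an FQDN and skips the search list entirely.
def queries_attempted(name, search_domains, ndots=5):
    if name.endswith("."):          # FQDN: exactly one query
        return [name.rstrip(".")]
    attempts = []
    if name.count(".") < ndots:     # search domains tried first
        attempts = [f"{name}.{d}" for d in search_domains]
    return attempts + [name]

search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

# Default ndots:5 -> 4 queries for one external name
print(len(queries_attempted("api.example.com", search, ndots=5)))   # 4
# Solution 1: ndots:2 -> 1 query (the name already has 2 dots)
print(len(queries_attempted("api.example.com", search, ndots=2)))   # 1
# Solution 2: trailing dot (FQDN) -> 1 query
print(len(queries_attempted("api.example.com.", search, ndots=5)))  # 1
```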

7. Ingress: HTTP/HTTPS Routing

7.1 Ingress Architecture

┌──────────────────────────────────────────────────┐
│             Ingress Architecture                 │
│                                                  │
│   External Traffic                               │
│         |                                        │
│         v                                        │
│  ┌─────────────────────┐                         │
│  │ Ingress Controller  │  (nginx, Traefik, etc.) │
│  │   (actual proxy)    │                         │
│  └─────────┬───────────┘                         │
│            | Watches Ingress resources           │
│            |   (routing rules)                   │
│     ┌──────┼──────┐                              │
│     v      v      v                              │
│   Svc A  Svc B  Svc C                            │
│   /api   /web   /docs                            │
└──────────────────────────────────────────────────┘

7.2 Ingress Resource Example

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - app.example.com
    secretName: app-tls-secret
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 80
      - path: /
        pathType: Prefix
        backend:
          service:
            name: frontend-service
            port:
              number: 80

7.3 Ingress Controller Comparison

| Feature       | nginx-ingress   | Traefik             | HAProxy        | Contour         |
|---------------|-----------------|---------------------|----------------|-----------------|
| Developer     | Kubernetes/F5   | Traefik Labs        | HAProxy Tech   | VMware          |
| Protocols     | HTTP/HTTPS/gRPC | HTTP/HTTPS/gRPC/TCP | HTTP/HTTPS/TCP | HTTP/HTTPS/gRPC |
| Config Reload | Restart needed  | Hot reload          | Hot reload     | Hot reload      |
| Rate Limiting | Annotations     | Middleware          | Built-in       | Not supported   |
| Auth          | Basic/OAuth     | Forward Auth        | Built-in       | Limited         |
| Performance   | High            | Medium              | Very high      | High            |
| Gateway API   | Supported       | Supported           | Partial        | Full support    |
| Community     | Very large      | Large               | Medium         | Medium          |

8. Gateway API: The Evolution of Ingress

8.1 Why Gateway API

Limitations of Ingress:

  • Non-standard configuration dependent on annotations
  • Only L7 HTTP supported (no TCP/UDP/gRPC)
  • Difficult role separation with a single resource
  • No standard support for advanced features like traffic splitting or header matching

8.2 Gateway API Resource Hierarchy

┌──────────────────────────────────────────────────┐
│          Gateway API Resource Hierarchy          │
│                                                  │
│  Infrastructure Admin:                           │
│  ┌─────────────┐                                 │
│  │GatewayClass │  Which controller to use        │
│  └──────┬──────┘                                 │
│         |  Cluster Operator:                     │
│  ┌──────v──────┐                                 │
│  │  Gateway    │ Listeners (port, protocol, TLS) │
│  └──────┬──────┘                                 │
│         |  Application Developer:                │
│  ┌──────v──────┐                                 │
│  │ HTTPRoute   │ Routing rules (host,path,hdr)   │
│  │ TCPRoute    │                                 │
│  │ GRPCRoute   │                                 │
│  └─────────────┘                                 │
└──────────────────────────────────────────────────┘

8.3 Gateway API Practical Example

# 1. GatewayClass - Defined by infra admin
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: cilium
spec:
  controllerName: io.cilium/gateway-controller

---
# 2. Gateway - Defined by cluster operator
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: my-gateway
  namespace: gateway-infra
spec:
  gatewayClassName: cilium
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    tls:
      mode: Terminate
      certificateRefs:
      - kind: Secret
        name: wildcard-tls
    allowedRoutes:
      namespaces:
        from: Selector
        selector:
          matchLabels:
            shared-gateway: "true"

---
# 3. HTTPRoute - Defined by developer
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: app-routes
  namespace: my-app
spec:
  parentRefs:
  - name: my-gateway
    namespace: gateway-infra
  hostnames:
  - "app.example.com"
  rules:
  # Header-based routing
  - matches:
    - headers:
      - name: "X-Canary"
        value: "true"
    backendRefs:
    - name: app-canary
      port: 80
      weight: 100
  # Traffic splitting (Canary deployment)
  - matches:
    - path:
        type: PathPrefix
        value: /api
    backendRefs:
    - name: app-stable
      port: 80
      weight: 90
    - name: app-canary
      port: 80
      weight: 10
  # Default routing
  - backendRefs:
    - name: app-stable
      port: 80
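The 90/10 split in the rule above amounts to weighted random selection per request. A toy model of that logic (real Gateway controllers implement weighting in their data plane):

```python
import random

# Pick a backendRef with probability proportional to its weight,
# mirroring the HTTPRoute's 90/10 canary split.
def pick_backend(backend_refs, rng):
    total = sum(w for _, w in backend_refs)
    r = rng.uniform(0, total)
    for name, w in backend_refs:
        if r < w:
            return name
        r -= w
    return backend_refs[-1][0]   # guard against rounding at the edge

rng = random.Random(0)
refs = [("app-stable", 90), ("app-canary", 10)]
hits = {"app-stable": 0, "app-canary": 0}
for _ in range(10000):
    hits[pick_backend(refs, rng)] += 1
print(hits)   # roughly 9000 stable / 1000 canary
```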

8.4 Ingress vs Gateway API Comparison

| Feature           | Ingress                    | Gateway API                |
|-------------------|----------------------------|----------------------------|
| Role Separation   | Single resource            | GatewayClass/Gateway/Route |
| Protocols         | HTTP/HTTPS only            | HTTP/HTTPS/TCP/UDP/gRPC    |
| Traffic Splitting | Annotations (non-standard) | Native weight support      |
| Header Matching   | Annotations (non-standard) | Standard spec              |
| TLS Config        | Basic                      | Fine-grained control       |
| Cross-namespace   | Difficult                  | Native support             |
| Extensibility     | Annotations only           | Policy API extension       |
| Status            | GA (stable)                | GA (v1.0+, 2023)           |

9. Network Policy: Microsegmentation

9.1 Basic Concept

Network Policy is a Pod-level firewall. It allows or blocks traffic to selected Pods.

Important: Without Network Policy, all traffic is allowed. Once any Network Policy applies to a Pod, only explicitly allowed traffic passes through.
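This "default allow, then whitelist-only" semantic can be captured in a tiny evaluator. A toy model of ingress evaluation with label selectors (not how any real CNI implements policy):

```python
# Toy Network Policy ingress evaluation:
# - if no policy selects the destination Pod, traffic is allowed;
# - once at least one policy selects it, traffic passes only if some
#   selecting policy explicitly allows the source's labels.
def ingress_allowed(dst_labels, src_labels, policies):
    # policies: list of (pod_selector, list_of_allowed_from_selectors)
    selecting = [p for p in policies
                 if p[0].items() <= dst_labels.items()]
    if not selecting:
        return True                        # no policy: default allow
    return any(sel.items() <= src_labels.items()
               for _, allowed in selecting for sel in allowed)

backend = {"app": "backend"}
frontend = {"app": "frontend"}
other = {"app": "cryptominer"}

# No policies yet: everything is allowed
print(ingress_allowed(backend, other, []))             # True

# Policy: backend accepts ingress only from frontend
policies = [({"app": "backend"}, [{"app": "frontend"}])]
print(ingress_allowed(backend, frontend, policies))    # True
print(ingress_allowed(backend, other, policies))       # False
```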

9.2 Default Deny Policy

# Deny all Ingress in namespace (default deny)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}           # Applies to all Pods
  policyTypes:
  - Ingress                 # Block Ingress only (Egress allowed)

---
# Deny all Egress too
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

9.3 Practical Network Policy Examples

# Backend Pod: Allow access only from frontend
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    # Allow only frontend Pods in same namespace
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
  egress:
  # Allow DB access
  - to:
    - podSelector:
        matchLabels:
          app: postgres
    ports:
    - protocol: TCP
      port: 5432
  # Allow DNS (essential!)
  - to:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

9.4 Cilium Network Policy (L7 Extension)

# Cilium: HTTP method/path based policy
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: api-l7-policy
spec:
  endpointSelector:
    matchLabels:
      app: api-server
  ingress:
  - fromEndpoints:
    - matchLabels:
        role: reader
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP
      rules:
        http:
        - method: GET                # Read only
  - fromEndpoints:
    - matchLabels:
        role: writer
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP
      rules:
        http:
        - method: GET
        - method: POST               # Write allowed too
        - method: PUT
        - method: DELETE

---
# Cilium: DNS-based Egress policy
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: dns-egress-policy
spec:
  endpointSelector:
    matchLabels:
      app: payment
  egress:
  - toFQDNs:
    - matchPattern: "*.stripe.com"   # Only Stripe API
    - matchName: "api.paypal.com"    # Only PayPal API
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP

10. eBPF Networking: Beyond iptables

10.1 What Is eBPF

eBPF (extended Berkeley Packet Filter) enables running programs in kernel space without modifying the kernel.

┌──────────────────────────────────────────────────┐
│               eBPF Architecture                  │
│                                                  │
│  User Space                                      │
│  ┌──────────────────────────────────────────┐    │
│  │  Cilium Agent / Hubble                   │    │
│  │  (eBPF program mgmt, policy enforcement) │    │
│  └──────────────┬───────────────────────────┘    │
│                 | BPF Syscall                    │
│  ───────────────┼─────────────────────────────   │
│  Kernel Space   |                                │
│  ┌──────────────v───────────────────────────┐    │
│  │  BPF Program (verified, JIT compiled)    │    │
│  │                                          │    │
│  │  Hook Points:                            │    │
│  │  ┌─────┐ ┌──────┐ ┌────────┐ ┌──────┐    │    │
│  │  │ XDP │ │tc/cls│ │ socket │ │kprobe│    │    │
│  │  └─────┘ └──────┘ └────────┘ └──────┘    │    │
│  │                                          │    │
│  │  BPF Maps (data sharing between progs)   │    │
│  │  ┌──────────────────────────────────┐    │    │
│  │  │ Service Map | Endpoint Map | CT  │    │    │
│  │  └──────────────────────────────────┘    │    │
│  └──────────────────────────────────────────┘    │
└──────────────────────────────────────────────────┘

10.2 Why eBPF Replaces iptables

| Comparison     | iptables                   | eBPF                      |
|----------------|----------------------------|---------------------------|
| Execution      | Kernel netfilter framework | Kernel hook points        |
| Rule Matching  | Linear scan O(n)           | Hash map O(1)             |
| Updates        | Full chain rewrite         | Map entry update          |
| Observability  | Limited (counters only)    | Rich metrics, events      |
| L7 Processing  | Not possible               | Possible (HTTP, DNS, etc.)|
| Conn. Tracking | conntrack (shared)         | BPF CT (efficient)        |
| CPU Usage      | High (at scale)            | Low                       |
| Scalability    | Degrades at 10K services   | Stable at 100K+           |
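The O(n) vs O(1) row is the crux, and it is easy to illustrate: a rule list must be walked entry by entry, while a hash map (a stand-in here for a BPF hash map) answers in a single probe regardless of Service count:

```python
# Linear scan over a rule list, counting comparisons -- the shape of
# an iptables chain traversal in the worst case.
def linear_lookup(rules, vip):
    comparisons = 0
    for rule_vip, backend in rules:
        comparisons += 1
        if rule_vip == vip:
            return backend, comparisons
    return None, comparisons

n = 10_000
rules = [(f"10.96.{i // 256}.{i % 256}", f"pod-{i}") for i in range(n)]
bpf_map = dict(rules)            # Python dict as a stand-in for a BPF map

target = rules[-1][0]            # worst case: the last rule in the chain
_, comparisons = linear_lookup(rules, target)
print(comparisons)               # 10000 comparisons for the linear scan
print(bpf_map[target])           # one hash lookup: 'pod-9999'
```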

10.3 Cilium Hubble: eBPF-Based Observability

# Monitor traffic with Hubble CLI
hubble observe --namespace production

# Check traffic for specific Pod
hubble observe --pod production/frontend-abc123

# Monitor DNS queries
hubble observe --protocol DNS

# Check dropped traffic (Network Policy)
hubble observe --verdict DROPPED

# Monitor HTTP requests (L7)
hubble observe --protocol HTTP --http-method GET

# Example output:
# TIMESTAMP            SOURCE              DESTINATION         TYPE     VERDICT
# Apr 14 09:15:23.456  prod/frontend-xxx   prod/backend-yyy    L4/TCP   FORWARDED
# Apr 14 09:15:23.789  prod/backend-yyy    prod/postgres-zzz   L4/TCP   FORWARDED
# Apr 14 09:15:24.012  prod/frontend-xxx   prod/redis-aaa      L4/TCP   DROPPED

11. Practical Network Debugging Guide

11.1 Packet Capture with tcpdump

# Run tcpdump via ephemeral container (K8s 1.25+)
kubectl debug -it pod/my-app-xxx \
  --image=nicolaka/netshoot \
  --target=my-app-container \
  -- tcpdump -i eth0 -nn port 8080

# Capture traffic with specific host
kubectl debug -it pod/my-app-xxx \
  --image=nicolaka/netshoot \
  --target=my-app-container \
  -- tcpdump -i eth0 -nn host 10.1.2.3

# Capture DNS queries
kubectl debug -it pod/my-app-xxx \
  --image=nicolaka/netshoot \
  --target=my-app-container \
  -- tcpdump -i eth0 -nn port 53

11.2 Network Connectivity Testing

# Comprehensive testing with netshoot Pod
kubectl run netshoot --image=nicolaka/netshoot --rm -it -- bash

# TCP connection test
curl -v telnet://my-service:8080

# DNS resolution check
dig my-service.default.svc.cluster.local
dig +trace my-service.default.svc.cluster.local

# MTU check
ping -M do -s 1472 target-pod-ip  # 1472 + 28 = 1500

# Routing table check
ip route show

# ARP table check
ip neigh show

# iptables rules check (on node)
iptables -t nat -L KUBE-SERVICES -n --line-numbers

11.3 Network Policy Troubleshooting

# 1. Check current Network Policies
kubectl get networkpolicy -A

# 2. Check policies applied to specific Pod
kubectl describe networkpolicy -n production

# 3. For Cilium: endpoint policy status
cilium endpoint list
cilium policy get

# 4. Connection test
kubectl exec -it frontend-pod -- curl -v http://backend-service:8080

# 5. Verify DNS is allowed (essential for Egress Policy)
kubectl exec -it my-pod -- nslookup kubernetes.default

# Common mistakes:
# - Missing DNS (UDP/TCP 53) in Egress Policy
# - Using podSelector without namespaceSelector
#   (won't match Pods in other namespaces)
# - Missing Egress in policyTypes (Egress rules ignored)

11.4 DNS Troubleshooting Checklist

# 1. Check CoreDNS Pod status
kubectl get pods -n kube-system -l k8s-app=kube-dns

# 2. Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100

# 3. Check DNS service endpoints
kubectl get endpoints kube-dns -n kube-system

# 4. Check resolv.conf
kubectl exec my-pod -- cat /etc/resolv.conf

# 5. Direct DNS query test
kubectl exec my-pod -- nslookup kubernetes.default.svc.cluster.local 10.96.0.10

# 6. External DNS resolution test
kubectl exec my-pod -- nslookup google.com

# For DNS performance issues:
# - Reduce ndots:5 to ndots:2
# - Use NodeLocal DNSCache
# - Enable autopath plugin

12. Performance Tuning

12.1 MTU Optimization

MTU Chain:
Pod (MTU) -> Overlay (VXLAN -50, IPinIP -20) -> Physical NIC (MTU)

Recommended settings:
- Physical NIC: Jumbo Frame 9000 (if possible)
- VXLAN overlay: Physical MTU - 50 = 8950
- IPinIP overlay: Physical MTU - 20 = 8980
- WireGuard encryption: Physical MTU - 60 = 8940

Default 1500 NIC:
- VXLAN: 1500 - 50 = 1450
- IPinIP: 1500 - 20 = 1480
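The arithmetic above reduces to one subtraction per encapsulation mode. A small helper capturing it (overhead constants as listed in this section):

```python
# Per-mode encapsulation overhead in bytes, as used above.
OVERHEAD = {
    "none": 0,
    "ipinip": 20,      # one extra IPv4 header
    "vxlan": 50,       # IP + UDP + VXLAN headers + inner Ethernet
    "wireguard": 60,   # WireGuard encapsulation overhead
}

def pod_mtu(physical_mtu, encapsulation):
    """Pod-side MTU = physical NIC MTU minus encapsulation overhead."""
    return physical_mtu - OVERHEAD[encapsulation]

print(pod_mtu(1500, "vxlan"))      # 1450
print(pod_mtu(1500, "ipinip"))     # 1480
print(pod_mtu(9000, "vxlan"))      # 8950
print(pod_mtu(9000, "wireguard"))  # 8940
```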

# Calico MTU configuration
apiVersion: crd.projectcalico.org/v1
kind: FelixConfiguration
metadata:
  name: default
spec:
  mtu: 8950              # Physical NIC 9000 - VXLAN 50
  vxlanMTU: 8950
  wireguardMTU: 8940

12.2 NodeLocal DNSCache

Caches DNS queries locally on nodes to reduce CoreDNS load.

# NodeLocal DNSCache DaemonSet (simplified)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-local-dns
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: node-local-dns
  template:
    spec:
      containers:
      - name: node-cache
        image: registry.k8s.io/dns/k8s-dns-node-cache:1.23.0
        args:
        - "-localip"
        - "169.254.20.10"    # Link-local IP
        - "-conf"
        - "/etc/Corefile"
        - "-upstreamsvc"
        - "kube-dns"

NodeLocal DNSCache Flow:

Pod -> 169.254.20.10 (NodeLocal) -> CoreDNS (on cache miss)
                |
          On cache hit: immediate response

Benefits:
- DNS latency reduced by 50%+
- CoreDNS load reduced by 70-80%
- Eliminates conntrack contention
- Prevents UDP DNS packet loss
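Why the cache cuts upstream load is visible in a toy TTL cache: repeated lookups within the TTL never leave the node. This is a simplified model, not NodeLocal DNSCache's implementation:

```python
# Toy TTL cache: lookups within the TTL are served locally; only
# misses and expiries go upstream (CoreDNS in the real deployment).
class DnsCache:
    def __init__(self, ttl=30):
        self.ttl = ttl
        self.store = {}            # name -> (ip, expiry)
        self.upstream_queries = 0

    def resolve(self, name, upstream, now):
        hit = self.store.get(name)
        if hit and hit[1] > now:
            return hit[0]          # cache hit: no upstream call
        self.upstream_queries += 1
        ip = upstream(name)
        self.store[name] = (ip, now + self.ttl)
        return ip

cache = DnsCache(ttl=30)
upstream = lambda name: "10.1.1.2"     # stand-in for a CoreDNS answer
for t in range(100):                   # 100 lookups over 100 seconds
    cache.resolve("my-service.default.svc.cluster.local", upstream, now=t)
print(cache.upstream_queries)          # 4 upstream queries instead of 100
```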

12.3 TCP Tuning

# Node-level kernel parameter tuning

# Connection tracking table size
net.netfilter.nf_conntrack_max = 1048576

# TCP buffer sizes
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# TCP connection reuse
net.ipv4.tcp_tw_reuse = 1

# SYN backlog
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535

# Keep-alive
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 10

13. Multi-Cluster Networking

13.1 Cilium ClusterMesh

┌──────────────┐         ┌──────────────┐
│  Cluster A   │         │  Cluster B   │
│ (us-east-1)  │         │ (eu-west-1)  │
│              │         │              │
│  ┌────────┐  │ Tunnel  │  ┌────────┐  │
│  │Cilium  │<───────────>│  │Cilium  │ │
│  │Agent   │  │         │  │Agent   │  │
│  └────────┘  │         │  └────────┘  │
│              │         │              │
│  Pod CIDR:   │         │  Pod CIDR:   │
│  10.1.0.0/16 │         │  10.2.0.0/16 │
│              │         │              │
│  Service:    │ Global  │  Service:    │
│  shared-db   │ Service │  shared-db   │
│  (annotated) │         │  (annotated) │
└──────────────┘         └──────────────┘

# Global Service configuration (on both clusters)
apiVersion: v1
kind: Service
metadata:
  name: shared-database
  annotations:
    service.cilium.io/global: "true"
    service.cilium.io/shared: "true"
spec:
  selector:
    app: postgres
  ports:
  - port: 5432

14. Quiz

Q1. Why does every Pod in Kubernetes get a unique IP and communicate without NAT?

Answer: Because of the three fundamental principles of the Kubernetes networking model:

  1. Every Pod gets a unique IP address
  2. Every Pod can reach every other Pod without NAT
  3. The IP a Pod sees for itself is the same IP others see

This model eliminates the complexity of port mapping and NAT, simplifying service discovery and network policies. CNI plugins implement these requirements.

Q2. Why is eBPF more suitable for Kubernetes networking than iptables?

Answer: iptables traverses rules linearly (O(n)) and requires full chain rewrites on updates. At 10,000+ Services, performance degrades severely.

eBPF offers:

  • Hash map-based O(1) lookup with consistent performance regardless of Service count
  • Map entry updates only for fast rule changes
  • L7 policies (HTTP method/path) processed directly in kernel
  • Hubble for rich observability

Cilium uses eBPF to fully replace kube-proxy.

Q3. What are the key reasons Gateway API replaces Ingress?

Answer: Core limitations of Ingress:

  • Non-standard annotations -- configuration differs per controller
  • HTTP/HTTPS only -- no TCP/UDP/gRPC support
  • Single resource -- hard to separate roles between infra admins, operators, and developers
  • Advanced features like traffic splitting and header matching not in standard

Gateway API uses hierarchical resources (GatewayClass/Gateway/Route) for role separation and supports traffic splitting, header matching, and multiple protocols as standard spec.

Q4. What is the most common mistake when setting up Egress Network Policy rules?

Answer: Forgetting to allow DNS (UDP/TCP port 53).

When Egress Policy is set, all outbound traffic is blocked. DNS queries are also blocked, so Service name resolution fails. You must allow UDP/TCP port 53 Egress to kube-dns (CoreDNS) Pods.

Other common mistakes:

  • Missing Egress in policyTypes
  • Trying to match Pods in other namespaces without namespaceSelector
  • CIDR range errors in Pod CIDR-based rules

Q5. Why does ndots:5 degrade external DNS lookup performance, and what are the solutions?

Answer: With the default Kubernetes ndots:5 setting, looking up domains like "api.example.com" (fewer than 5 dots) first tries appending search domains:

  1. api.example.com.default.svc.cluster.local (fail)
  2. api.example.com.svc.cluster.local (fail)
  3. api.example.com.cluster.local (fail)
  4. api.example.com (success)

This causes 3 unnecessary queries for a single external DNS lookup.

Solutions:

  • Reduce ndots to 2 in Pod dnsConfig
  • Use FQDN (add trailing dot: api.example.com.)
  • Use NodeLocal DNSCache for caching
  • Enable CoreDNS autopath plugin

15. References

  1. Kubernetes Networking Model - Official Documentation
  2. Cilium Documentation - eBPF-based Networking
  3. Calico Documentation - Project Calico
  4. Gateway API - Official Specification
  5. CoreDNS - DNS for Service Discovery
  6. Network Policies - Kubernetes Documentation
  7. eBPF.io - Introduction to eBPF
  8. Hubble - Network Observability for Kubernetes
  9. IPVS-based kube-proxy - Kubernetes Blog
  10. CNI Specification - Container Network Interface
  11. NodeLocal DNSCache - Kubernetes Documentation
  12. Cilium ClusterMesh - Multi-Cluster Networking
  13. Life of a Packet in Kubernetes - Conference Talk
