Kubernetes Networking Deep Dive Guide 2025: CNI, Service, Ingress, DNS, Network Policy

1. Fundamental Principles of the Kubernetes Networking Model

Kubernetes networking is built on three fundamental principles.

1.1 Core Requirements

  1. Every Pod gets a unique IP address -- can communicate with others without NAT
  2. Every Pod can reach every other Pod -- whether on the same node or different nodes
  3. The IP a Pod sees for itself is the same IP others see -- no NAT
┌────────────────────────────────────────────────────┐
│         Kubernetes Networking Fundamentals         │
│                                                    │
│  ┌──────────┐           ┌──────────┐               │
│  │  Node 1  │           │  Node 2  │               │
│  │          │           │          │               │
│  │ ┌──────┐ │  No NAT   │ ┌──────┐ │               │
│  │ │Pod A │<───────────>│ │Pod C │ │               │
│  │ │10.1.1│ │           │ │10.1.2│ │               │
│  │ └──────┘ │           │ └──────┘ │               │
│  │ ┌──────┐ │           │ ┌──────┐ │               │
│  │ │Pod B │ │           │ │Pod D │ │               │
│  │ │10.1.1│ │           │ │10.1.2│ │               │
│  │ └──────┘ │           │ └──────┘ │               │
│  └──────────┘           └──────────┘               │
│                                                    │
│  Pod A(10.1.1.2) -> Pod C(10.1.2.3): Direct comm   │
│  No NAT, no port mapping. Each Pod owns unique IP  │
└────────────────────────────────────────────────────┘

1.2 Networking Layer Hierarchy

┌──────────────────────────────────────────────────┐
│          Kubernetes Networking Layers            │
├──────────────────────────────────────────────────┤
│ L7: Ingress / Gateway API                        │
│     (HTTP routing, TLS termination)              │
├──────────────────────────────────────────────────┤
│ L4: Service                                      │
│     (ClusterIP, NodePort, LoadBalancer)          │
├──────────────────────────────────────────────────┤
│ L3: Pod Networking (CNI)                         │
│     (Pod-to-Pod, overlay/underlay)               │
├──────────────────────────────────────────────────┤
│ L2-L3: Node Networking                           │
│        (Physical/Virtual network)                │
└──────────────────────────────────────────────────┘

2. Pod-to-Pod Networking: How It Works Internally

2.1 Pod Communication on the Same Node

┌────────────────────────────────────────────┐
│                   Node                     │
│                                            │
│  ┌──────┐   veth pair   ┌──────────┐       │
│  │Pod A │<------------->│          │       │
│  │eth0  │               │   cbr0   │       │
│  └──────┘               │ (bridge) │       │
│                         │          │       │
│  ┌──────┐   veth pair   │          │       │
│  │Pod B │<------------->│          │       │
│  │eth0  │               └──────────┘       │
│  └──────┘                                  │
│                                            │
│  1. Pod A sends packet to Pod B            │
│  2. Travels via veth pair to bridge (cbr0) │
│  3. Bridge finds destination veth by MAC   │
│  4. Delivered to Pod B via its veth pair   │
└────────────────────────────────────────────┘

veth pair: A virtual ethernet pair -- one end is eth0 in the Pod namespace, the other connects to the node's bridge.

2.2 Pod Communication Across Nodes (Overlay)

┌──────────┐                      ┌──────────┐
│  Node 1  │                      │  Node 2  │
│          │                      │          │
│ ┌──────┐ │    VXLAN/IPinIP      │ ┌──────┐ │
│ │Pod A │ │   ┌───────────┐      │ │Pod C │ │
│ │10.1.1│ │-->│ Tunneling │----->│ │10.1.2│ │
│ └──────┘ │   │           │      │ └──────┘ │
│          │   │Encapsulate│      │          │
│ cbr0     │   │original   │      │ cbr0     │
│ 10.1.1.0 │   │in outer   │      │ 10.1.2.0 │
│          │   └───────────┘      │          │
│ eth0     │                      │ eth0     │
│ 192.168.1│                      │192.168.2 │
└──────────┘                      └──────────┘

VXLAN Encapsulation:
┌─────────┬───────┬─────────┬─────────┬─────────┐
│Outer IP │ UDP   │ VXLAN   │Inner IP │ Payload │
│Hdr      │ Hdr   │ Hdr     │Hdr      │         │
│192->192 │       │         │10->10   │ Data    │
└─────────┴───────┴─────────┴─────────┴─────────┘
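The "VXLAN -50 bytes" figure that appears throughout this guide falls out of simple header arithmetic. A minimal sketch of that sum (constants are standard header sizes, not taken from any particular CNI):

```python
# VXLAN wraps the Pod's original Ethernet frame in outer IP, UDP, and
# VXLAN headers. Relative to the physical NIC's MTU, the inner IP
# packet loses the sum of these headers plus the encapsulated inner
# Ethernet header (54 bytes if the outer frame also carries a VLAN tag).
OUTER_IPV4 = 20      # outer IPv4 header
OUTER_UDP = 8        # UDP header (VXLAN typically rides on UDP port 4789)
VXLAN_HDR = 8        # VXLAN header (flags + VNI)
INNER_ETHERNET = 14  # the encapsulated Pod frame's Ethernet header

vxlan_overhead = OUTER_IPV4 + OUTER_UDP + VXLAN_HDR + INNER_ETHERNET
print(vxlan_overhead)             # 50

# Pod-side MTU on a standard 1500-byte NIC:
print(1500 - vxlan_overhead)      # 1450
```

This is why Section 12.1 recommends setting the overlay MTU to the physical MTU minus 50 for VXLAN.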

2.3 Overlay vs Underlay Networking

| Feature              | Overlay (VXLAN/IPinIP) | Underlay (BGP/Direct)  |
|----------------------|------------------------|------------------------|
| Setup Difficulty     | Easy                   | Hard                   |
| Network Requirements | L3 connectivity only   | BGP support needed     |
| Performance Overhead | Yes (encapsulation)    | None                   |
| MTU Impact           | Reduced (50-54 bytes)  | None                   |
| Debugging            | Hard                   | Easy                   |
| Use Case             | Cloud, multi-tenant    | Bare metal, high perf  |

3. Detailed CNI Plugin Comparison

3.1 What Is CNI

CNI (Container Network Interface) is the standard interface for container runtimes to interact with network plugins.

CNI Flow During Pod Creation:

1. kubelet requests container creation via CRI
2. Container runtime creates network namespace
3. kubelet invokes CNI plugin (ADD command)
4. CNI plugin:
   a. Creates veth pair
   b. Assigns IP address to Pod (IPAM)
   c. Configures routing rules
   d. Sets up overlay tunnel if needed
5. Pod is network-ready
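The outcome of step 4 is reported back to the runtime as a JSON result. A sketch of that result's shape following the CNI spec (the sandbox path and addresses here are made-up examples):

```python
# Illustrative CNI ADD result, per the CNI spec's result format:
# the interfaces the plugin created, the IPs assigned by IPAM, and
# the routes installed in the Pod's network namespace.
cni_add_result = {
    "cniVersion": "1.0.0",
    "interfaces": [
        # "sandbox" is the Pod's network namespace path (example value)
        {"name": "eth0", "sandbox": "/var/run/netns/cni-abc123"},
    ],
    "ips": [
        {"address": "10.1.1.2/24", "gateway": "10.1.1.1", "interface": 0},
    ],
    "routes": [
        {"dst": "0.0.0.0/0"},      # default route via the gateway
    ],
}

# The runtime records the assigned address as the Pod IP
pod_ip = cni_add_result["ips"][0]["address"].split("/")[0]
print(pod_ip)   # 10.1.1.2
```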

3.2 Major CNI Plugin Comparison

| Feature                | Calico              | Cilium               | Flannel       | Weave Net     |
|------------------------|---------------------|----------------------|---------------|---------------|
| Developer              | Tigera              | Isovalent (Cisco)    | CoreOS        | Weaveworks    |
| Data Plane             | iptables/eBPF       | eBPF                 | VXLAN         | VXLAN/Sleeve  |
| Network Mode           | BGP/VXLAN/IPinIP    | VXLAN/Native         | VXLAN         | VXLAN         |
| Network Policy         | Rich                | Very rich (L7)       | None          | Basic         |
| Encryption             | WireGuard           | WireGuard/IPsec      | None          | IPsec         |
| kube-proxy Replacement | eBPF mode           | Native eBPF          | No            | No            |
| Observability          | Basic               | Hubble (powerful)    | None          | Scope         |
| Service Mesh           | None                | Built-in (optional)  | None          | None          |
| Multi-cluster          | Supported           | ClusterMesh          | Not supported | Supported     |
| Performance            | High (BGP)          | Very high (eBPF)     | Medium        | Low           |
| Complexity             | Medium              | High                 | Low           | Low           |
| Production Rec.        | Strongly rec.       | Strongly rec.        | Small only    | Not rec.      |

3.3 Calico Deep Dive

# Calico BGP mode configuration (IPPool)
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 10.244.0.0/16
  encapsulation: None        # No encapsulation with BGP
  natOutgoing: true
  nodeSelector: all()
  blockSize: 26              # /26 = 64 IPs per node

---
# Calico BGP Peering configuration
apiVersion: crd.projectcalico.org/v1
kind: BGPPeer
metadata:
  name: rack-tor-switch
spec:
  peerIP: 192.168.1.1
  asNumber: 64512
  nodeSelector: rack == "rack-1"

---
# Calico VXLAN mode (cloud environments)
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 10.244.0.0/16
  encapsulation: VXLAN       # VXLAN encapsulation
  natOutgoing: true
  vxlanMode: Always

3.4 Cilium Deep Dive

# Cilium Helm install (kube-proxy replacement mode)
# helm install cilium cilium/cilium --namespace kube-system \
#   --set kubeProxyReplacement=true \
#   --set k8sServiceHost=API_SERVER_IP \
#   --set k8sServicePort=6443 \
#   --set hubble.enabled=true \
#   --set hubble.relay.enabled=true \
#   --set hubble.ui.enabled=true \
#   --set encryption.enabled=true \
#   --set encryption.type=wireguard

# Verify Cilium status
# cilium status
# cilium connectivity test

# Cilium L7 Network Policy (HTTP-based policy)
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: l7-rule
  namespace: default
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http:
        - method: "GET"
          path: "/api/v1/.*"
        - method: "POST"
          path: "/api/v1/orders"

4. Service Deep Dive: ClusterIP to LoadBalancer

4.1 How Services Work

┌──────────────────────────────────────────────────┐
│              Service Traffic Flow                │
│                                                  │
│  Client Pod --> Service VIP --> kube-proxy       │
│                 (10.96.0.10)    iptables/IPVS    │
│                           ┌────────┼────────┐    │
│                           v        v        v    │
│                         Pod A    Pod B    Pod C  │
│                       10.1.1.2 10.1.1.3 10.1.2.4 │
│                                                  │
│  Service VIP is not bound to any real interface  │
│  iptables/IPVS rules DNAT to Pod IPs             │
└──────────────────────────────────────────────────┘

4.2 Service Types in Detail

# 1. ClusterIP (default) - internal access only
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: ClusterIP
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
  # Cluster IP: auto-assigned (e.g., 10.96.0.10)
  # Accessible only within cluster at 10.96.0.10:80

---
# 2. NodePort - external access via fixed port on all nodes
apiVersion: v1
kind: Service
metadata:
  name: my-nodeport-service
spec:
  type: NodePort
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
    nodePort: 30080    # Range: 30000-32767
  # Accessible via port 30080 on any node
  # NodeIP:30080 -> ClusterIP:80 -> Pod:8080

---
# 3. LoadBalancer - auto-provisions cloud load balancer
apiVersion: v1
kind: Service
metadata:
  name: my-lb-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
  - port: 443
    targetPort: 8443
  # External LB IP -> NodePort -> ClusterIP -> Pod

---
# 4. ExternalName - maps external DNS to cluster service
apiVersion: v1
kind: Service
metadata:
  name: external-db
spec:
  type: ExternalName
  externalName: mydb.example.com
  # external-db.default.svc.cluster.local -> mydb.example.com

---
# 5. Headless Service - returns Pod IPs directly, no ClusterIP
apiVersion: v1
kind: Service
metadata:
  name: my-headless-service
spec:
  clusterIP: None            # Declares Headless
  selector:
    app: my-stateful-app
  ports:
  - port: 5432
  # DNS queries return individual Pod IPs, not Service IP
  # Used with StatefulSet: pod-0.my-headless-service.default.svc

4.3 Session Affinity

apiVersion: v1
kind: Service
metadata:
  name: sticky-service
spec:
  selector:
    app: web
  ports:
  - port: 80
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800  # 3 hours
  # Same client IP routes to same Pod
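The sessionAffinity behavior above can be modelled in a few lines: first request from a client IP picks a backend, later requests within the timeout stick to it. This is a toy model, not kube-proxy's actual implementation:

```python
import time

# Toy ClientIP session affinity: the first request from a client IP
# picks a backend, and later requests within timeout_seconds stick to
# the same one; after the timeout, the entry is re-picked.
class AffinityBalancer:
    def __init__(self, backends, timeout_seconds=10800):
        self.backends = backends
        self.timeout = timeout_seconds
        self.table = {}    # client_ip -> (backend, last_seen)
        self.next_idx = 0  # simple round-robin for new clients

    def route(self, client_ip, now=None):
        now = time.time() if now is None else now
        entry = self.table.get(client_ip)
        if entry and now - entry[1] < self.timeout:
            backend = entry[0]              # sticky hit, refresh timestamp
        else:
            backend = self.backends[self.next_idx % len(self.backends)]
            self.next_idx += 1
        self.table[client_ip] = (backend, now)
        return backend

lb = AffinityBalancer(["10.1.1.2", "10.1.1.3"], timeout_seconds=10800)
print(lb.route("203.0.113.7", now=0))      # 10.1.1.2
print(lb.route("203.0.113.7", now=100))    # 10.1.1.2 (within timeout)
print(lb.route("203.0.113.7", now=20000))  # 10.1.1.3 (timeout expired)
```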

5. kube-proxy Modes: iptables vs IPVS vs eBPF

5.1 iptables Mode (Default)

iptables rule chain (per Service):

PREROUTING -> KUBE-SERVICES -> KUBE-SVC-XXXX -> KUBE-SEP-YYYY
                                               (probabilistic)

Service with 3 Pods:
KUBE-SVC-XXXX:
  -p tcp -d 10.96.0.10 --dport 80
    -> 33% probability KUBE-SEP-1 (DNAT -> 10.1.1.2:8080)
    -> 33% probability KUBE-SEP-2 (DNAT -> 10.1.1.3:8080)
    -> 34% probability KUBE-SEP-3 (DNAT -> 10.1.2.4:8080)

Problems:
- Rule count grows proportionally with Services/Pods O(n)
- Full chain rewrite on rule updates
- Severe degradation at 10,000+ Services
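The probabilistic selection above can be simulated. Note that the real chain uses cascading probabilities (1/3 on the first rule, then 1/2 of the remainder, then the rest), which works out to a uniform split; this is a toy model of that logic:

```python
import random

# Toy model of kube-proxy's iptables statistic-mode rules: rule i
# matches with probability 1/(n-i), so each of n endpoints ends up
# chosen uniformly even though rules are evaluated in order.
def iptables_pick(endpoints, rng):
    n = len(endpoints)
    for i, ep in enumerate(endpoints):
        if rng.random() < 1.0 / (n - i):   # statistic --probability
            return ep
    return endpoints[-1]                   # last rule always matches

rng = random.Random(42)
eps = ["10.1.1.2:8080", "10.1.1.3:8080", "10.1.2.4:8080"]
counts = {ep: 0 for ep in eps}
for _ in range(30000):
    counts[iptables_pick(eps, rng)] += 1
print(counts)  # each endpoint near 10000
```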

5.2 IPVS Mode

IPVS (IP Virtual Server):

┌───────────────────────────────────────────┐
│  IPVS Virtual Server: 10.96.0.10:80       │
│                                           │
│  Real Server 1: 10.1.1.2:8080  (weight 1) │
│  Real Server 2: 10.1.1.3:8080  (weight 1) │
│  Real Server 3: 10.1.2.4:8080  (weight 1) │
│                                           │
│  Algorithm: rr (Round Robin)              │
│  Others: lc, dh, sh, sed, nq              │
└───────────────────────────────────────────┘

Advantages:
- Hash table-based O(1) lookup
- Multiple load balancing algorithms
- Stable performance at scale
- Real-time statistics and connection tracking
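Two of the schedulers listed above, rr (round robin) and lc (least connections), can be sketched in user space. These are toy versions of the algorithms; real IPVS implements them in the kernel:

```python
from collections import deque

# Toy round-robin scheduler: cycle through real servers in order.
def round_robin(servers):
    q = deque(servers)
    while True:
        s = q[0]
        q.rotate(-1)
        yield s

# Toy least-connections scheduler: pick the real server with the
# fewest active connections.
def least_conn(servers, active):
    return min(servers, key=lambda s: active[s])

servers = ["10.1.1.2:8080", "10.1.1.3:8080", "10.1.2.4:8080"]
rr = round_robin(servers)
print([next(rr) for _ in range(4)])
# ['10.1.1.2:8080', '10.1.1.3:8080', '10.1.2.4:8080', '10.1.1.2:8080']

active = {"10.1.1.2:8080": 5, "10.1.1.3:8080": 1, "10.1.2.4:8080": 3}
print(least_conn(servers, active))   # '10.1.1.3:8080'
```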

# kube-proxy IPVS mode configuration
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"       # rr, lc, dh, sh, sed, nq
  syncPeriod: "30s"
  minSyncPeriod: "2s"

5.3 eBPF Mode (Cilium)

eBPF kube-proxy replacement:

┌────────────────────────────────────────────┐
│  Before: Pod -> iptables/IPVS -> Pod       │
│  eBPF:   Pod -> BPF program -> Pod (direct)│
│                                            │
│  ┌──────┐    BPF map     ┌──────┐          │
│  │Pod A │---(Service  --->│Pod B │         │
│  │      │    lookup)      │      │         │
│  └──────┘                 └──────┘         │
│                                            │
│  No iptables chain traversal!              │
│  Direct packet redirect in kernel space    │
└────────────────────────────────────────────┘

| Comparison          | iptables       | IPVS                | eBPF (Cilium)     |
|---------------------|----------------|---------------------|-------------------|
| Lookup Complexity   | O(n)           | O(1)                | O(1)              |
| Rule Updates        | Full rewrite   | Incremental         | Incremental       |
| Load Balancing      | Probabilistic  | Multiple algorithms | Maglev hash       |
| Connection Tracking | conntrack      | conntrack           | BPF conntrack     |
| Performance         | Medium         | High                | Very high         |
| L7 Policy           | No             | No                  | Yes               |
| Observability       | Limited        | Basic               | Hubble (powerful) |
| 10K Services        | Very slow      | Fast                | Very fast         |

6. DNS: CoreDNS and Service Discovery

6.1 CoreDNS Architecture

┌──────────────────────────────────────────────────┐
│               CoreDNS Operation                  │
│                                                  │
│  Pod --DNS query--> CoreDNS Pod (kube-system)    │
│        (nameserver 10.96.0.10)                   │
│           ┌──────────┼──────────┐                │
│           v          v          v                │
│        K8s API    Corefile    Upstream DNS       │
│     (Service/Pod  (Plugin     (External DNS)     │
│       records)     config)                       │
└──────────────────────────────────────────────────┘

6.2 DNS Record Format

Service DNS:
  my-service.my-namespace.svc.cluster.local
  |--svcname-| |-namespace--|

Pod DNS:
  10-1-1-2.my-namespace.pod.cluster.local
  |IP(dots->dashes)|

Headless Service Pod DNS (StatefulSet):
  pod-0.my-headless.my-namespace.svc.cluster.local
  |podname| |-svcname--|

SRV Records:
  _http._tcp.my-service.my-namespace.svc.cluster.local
  -> Returns port number and hostname
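The name formats above are purely mechanical, so they can be expressed as small helpers (illustrative functions, not a Kubernetes API):

```python
# Build the DNS names described above from their parts.
def service_dns(svc, ns, zone="cluster.local"):
    return f"{svc}.{ns}.svc.{zone}"

def pod_dns(pod_ip, ns, zone="cluster.local"):
    # Pod IP with dots replaced by dashes
    return f"{pod_ip.replace('.', '-')}.{ns}.pod.{zone}"

def statefulset_pod_dns(pod, headless_svc, ns, zone="cluster.local"):
    return f"{pod}.{headless_svc}.{ns}.svc.{zone}"

def srv_record(port_name, proto, svc, ns, zone="cluster.local"):
    return f"_{port_name}._{proto}.{service_dns(svc, ns, zone)}"

print(service_dns("my-service", "my-namespace"))
# my-service.my-namespace.svc.cluster.local
print(pod_dns("10.1.1.2", "my-namespace"))
# 10-1-1-2.my-namespace.pod.cluster.local
print(statefulset_pod_dns("pod-0", "my-headless", "my-namespace"))
# pod-0.my-headless.my-namespace.svc.cluster.local
print(srv_record("http", "tcp", "my-service", "my-namespace"))
# _http._tcp.my-service.my-namespace.svc.cluster.local
```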

6.3 CoreDNS Corefile Configuration

# CoreDNS Corefile (ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
          lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
          ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
          max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }

6.4 DNS Debugging

# Create a debug Pod for DNS testing
kubectl run dnsutils --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 \
  --restart=Never -- sleep infinity

# Service DNS lookup
kubectl exec dnsutils -- nslookup my-service.default.svc.cluster.local

# Pod DNS lookup
kubectl exec dnsutils -- nslookup 10-1-1-2.default.pod.cluster.local

# Detailed DNS response
kubectl exec dnsutils -- dig my-service.default.svc.cluster.local +search +short

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns -f

# Check resolv.conf
kubectl exec dnsutils -- cat /etc/resolv.conf
# nameserver 10.96.0.10
# search default.svc.cluster.local svc.cluster.local cluster.local
# options ndots:5

6.5 The ndots:5 Problem and Solutions

# Problem with default ndots:5:
# Looking up "api.example.com" (fewer than 5 dots):
# 1. api.example.com.default.svc.cluster.local (fail)
# 2. api.example.com.svc.cluster.local (fail)
# 3. api.example.com.cluster.local (fail)
# 4. api.example.com (success)
# -> 3 unnecessary queries for external DNS lookups!

# Solution 1: Customize Pod DNS config
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"       # Reduce ndots to 2
  containers:
  - name: app
    image: my-app

# Solution 2: Use FQDN directly (trailing dot)
# api.example.com.  <- trailing dot means FQDN
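The query amplification and both fixes can be seen in a toy model of the resolver's search-list behavior (a simplification of glibc's logic, not a real resolver):

```python
# Toy model of resolv.conf search-list expansion: names with fewer
# dots than ndots are tried against each search domain first; a
# trailing dot marks an FQDN and skips the search list entirely.
def queries_attempted(name, search_domains, ndots=5):
    if name.endswith("."):          # FQDN: exactly one query
        return [name.rstrip(".")]
    attempts = []
    if name.count(".") < ndots:     # search domains tried first
        attempts = [f"{name}.{d}" for d in search_domains]
    return attempts + [name]

search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

# Default ndots:5 -> 4 queries for one external name
print(len(queries_attempted("api.example.com", search, ndots=5)))   # 4
# Solution 1: ndots:2 -> 1 query (the name already has 2 dots)
print(len(queries_attempted("api.example.com", search, ndots=2)))   # 1
# Solution 2: trailing dot (FQDN) -> 1 query
print(len(queries_attempted("api.example.com.", search, ndots=5)))  # 1
```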

7. Ingress: HTTP/HTTPS Routing

7.1 Ingress Architecture

┌──────────────────────────────────────────────────┐
│             Ingress Architecture                 │
│                                                  │
│   External Traffic                               │
│         |                                        │
│         v                                        │
│  ┌─────────────────────┐                         │
│  │ Ingress Controller  │  (nginx, Traefik, etc.) │
│  │   (actual proxy)    │                         │
│  └─────────┬───────────┘                         │
│            | Watches Ingress resources           │
│            |   (routing rules)                   │
│     ┌──────┼──────┐                              │
│     v      v      v                              │
│   Svc A  Svc B  Svc C                            │
│   /api   /web   /docs                            │
└──────────────────────────────────────────────────┘

7.2 Ingress Resource Example

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - app.example.com
    secretName: app-tls-secret
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 80
      - path: /
        pathType: Prefix
        backend:
          service:
            name: frontend-service
            port:
              number: 80

7.3 Ingress Controller Comparison

| Feature       | nginx-ingress   | Traefik             | HAProxy        | Contour         |
|---------------|-----------------|---------------------|----------------|-----------------|
| Developer     | Kubernetes/F5   | Traefik Labs        | HAProxy Tech   | VMware          |
| Protocols     | HTTP/HTTPS/gRPC | HTTP/HTTPS/gRPC/TCP | HTTP/HTTPS/TCP | HTTP/HTTPS/gRPC |
| Config Reload | Restart needed  | Hot reload          | Hot reload     | Hot reload      |
| Rate Limiting | Annotations     | Middleware          | Built-in       | Not supported   |
| Auth          | Basic/OAuth     | Forward Auth        | Built-in       | Limited         |
| Performance   | High            | Medium              | Very high      | High            |
| Gateway API   | Supported       | Supported           | Partial        | Full support    |
| Community     | Very large      | Large               | Medium         | Medium          |

8. Gateway API: The Evolution of Ingress

8.1 Why Gateway API

Limitations of Ingress:

  • Non-standard configuration dependent on annotations
  • Only L7 HTTP supported (no TCP/UDP/gRPC)
  • Difficult role separation with a single resource
  • No standard support for advanced features like traffic splitting or header matching

8.2 Gateway API Resource Hierarchy

┌──────────────────────────────────────────────────┐
│          Gateway API Resource Hierarchy          │
│                                                  │
│  Infrastructure Admin:                           │
│  ┌─────────────┐                                 │
│  │GatewayClass │  Which controller to use        │
│  └──────┬──────┘                                 │
│         |  Cluster Operator:                     │
│  ┌──────v──────┐                                 │
│  │  Gateway    │ Listeners (port, protocol, TLS) │
│  └──────┬──────┘                                 │
│         |  Application Developer:                │
│  ┌──────v──────┐                                 │
│  │ HTTPRoute   │ Routing rules (host,path,hdr)   │
│  │ TCPRoute    │                                 │
│  │ GRPCRoute   │                                 │
│  └─────────────┘                                 │
└──────────────────────────────────────────────────┘

8.3 Gateway API Practical Example

# 1. GatewayClass - Defined by infra admin
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: cilium
spec:
  controllerName: io.cilium/gateway-controller

---
# 2. Gateway - Defined by cluster operator
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: my-gateway
  namespace: gateway-infra
spec:
  gatewayClassName: cilium
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    tls:
      mode: Terminate
      certificateRefs:
      - kind: Secret
        name: wildcard-tls
    allowedRoutes:
      namespaces:
        from: Selector
        selector:
          matchLabels:
            shared-gateway: "true"

---
# 3. HTTPRoute - Defined by developer
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: app-routes
  namespace: my-app
spec:
  parentRefs:
  - name: my-gateway
    namespace: gateway-infra
  hostnames:
  - "app.example.com"
  rules:
  # Header-based routing
  - matches:
    - headers:
      - name: "X-Canary"
        value: "true"
    backendRefs:
    - name: app-canary
      port: 80
      weight: 100
  # Traffic splitting (Canary deployment)
  - matches:
    - path:
        type: PathPrefix
        value: /api
    backendRefs:
    - name: app-stable
      port: 80
      weight: 90
    - name: app-canary
      port: 80
      weight: 10
  # Default routing
  - backendRefs:
    - name: app-stable
      port: 80
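The 90/10 split in the rule above amounts to weighted random selection per request. A toy model of that logic (real Gateway controllers implement weighting in their data plane):

```python
import random

# Pick a backendRef with probability proportional to its weight,
# mirroring the HTTPRoute's 90/10 canary split.
def pick_backend(backend_refs, rng):
    total = sum(w for _, w in backend_refs)
    r = rng.uniform(0, total)
    for name, w in backend_refs:
        if r < w:
            return name
        r -= w
    return backend_refs[-1][0]   # guard against rounding at the edge

rng = random.Random(0)
refs = [("app-stable", 90), ("app-canary", 10)]
hits = {"app-stable": 0, "app-canary": 0}
for _ in range(10000):
    hits[pick_backend(refs, rng)] += 1
print(hits)   # roughly 9000 stable / 1000 canary
```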

8.4 Ingress vs Gateway API Comparison

| Feature           | Ingress                    | Gateway API                |
|-------------------|----------------------------|----------------------------|
| Role Separation   | Single resource            | GatewayClass/Gateway/Route |
| Protocols         | HTTP/HTTPS only            | HTTP/HTTPS/TCP/UDP/gRPC    |
| Traffic Splitting | Annotations (non-standard) | Native weight support      |
| Header Matching   | Annotations (non-standard) | Standard spec              |
| TLS Config        | Basic                      | Fine-grained control       |
| Cross-namespace   | Difficult                  | Native support             |
| Extensibility     | Annotations only           | Policy API extension       |
| Status            | GA (stable)                | GA (v1.0+, 2023)           |

9. Network Policy: Microsegmentation

9.1 Basic Concept

Network Policy is a Pod-level firewall. It allows or blocks traffic to selected Pods.

Important: Without Network Policy, all traffic is allowed. Once any Network Policy applies to a Pod, only explicitly allowed traffic passes through.
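This "default allow, then whitelist-only" semantic can be captured in a tiny evaluator. A toy model of ingress evaluation with label selectors (not how any real CNI implements policy):

```python
# Toy Network Policy ingress evaluation:
# - if no policy selects the destination Pod, traffic is allowed;
# - once at least one policy selects it, traffic passes only if some
#   selecting policy explicitly allows the source's labels.
def ingress_allowed(dst_labels, src_labels, policies):
    # policies: list of (pod_selector, list_of_allowed_from_selectors)
    selecting = [p for p in policies
                 if p[0].items() <= dst_labels.items()]
    if not selecting:
        return True                        # no policy: default allow
    return any(sel.items() <= src_labels.items()
               for _, allowed in selecting for sel in allowed)

backend = {"app": "backend"}
frontend = {"app": "frontend"}
other = {"app": "cryptominer"}

# No policies yet: everything is allowed
print(ingress_allowed(backend, other, []))             # True

# Policy: backend accepts ingress only from frontend
policies = [({"app": "backend"}, [{"app": "frontend"}])]
print(ingress_allowed(backend, frontend, policies))    # True
print(ingress_allowed(backend, other, policies))       # False
```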

9.2 Default Deny Policy

# Deny all Ingress in namespace (default deny)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}           # Applies to all Pods
  policyTypes:
  - Ingress                 # Block Ingress only (Egress allowed)

---
# Deny all Egress too
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

9.3 Practical Network Policy Examples

# Backend Pod: Allow access only from frontend
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    # Allow only frontend Pods in same namespace
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
  egress:
  # Allow DB access
  - to:
    - podSelector:
        matchLabels:
          app: postgres
    ports:
    - protocol: TCP
      port: 5432
  # Allow DNS (essential!)
  - to:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

9.4 Cilium Network Policy (L7 Extension)

# Cilium: HTTP method/path based policy
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: api-l7-policy
spec:
  endpointSelector:
    matchLabels:
      app: api-server
  ingress:
  - fromEndpoints:
    - matchLabels:
        role: reader
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP
      rules:
        http:
        - method: GET                # Read only
  - fromEndpoints:
    - matchLabels:
        role: writer
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP
      rules:
        http:
        - method: GET
        - method: POST               # Write allowed too
        - method: PUT
        - method: DELETE

---
# Cilium: DNS-based Egress policy
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: dns-egress-policy
spec:
  endpointSelector:
    matchLabels:
      app: payment
  egress:
  - toFQDNs:
    - matchPattern: "*.stripe.com"   # Only Stripe API
    - matchName: "api.paypal.com"    # Only PayPal API
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP

10. eBPF Networking: Beyond iptables

10.1 What Is eBPF

eBPF (extended Berkeley Packet Filter) enables running programs in kernel space without modifying the kernel.

┌──────────────────────────────────────────────────┐
│               eBPF Architecture                  │
│                                                  │
│  User Space                                      │
│  ┌──────────────────────────────────────────┐    │
│  │  Cilium Agent / Hubble                   │    │
│  │  (eBPF program mgmt, policy enforcement) │    │
│  └──────────────┬───────────────────────────┘    │
│                 | BPF Syscall                    │
│  ───────────────┼─────────────────────────────   │
│  Kernel Space   |                                │
│  ┌──────────────v───────────────────────────┐    │
│  │  BPF Program (verified, JIT compiled)    │    │
│  │                                          │    │
│  │  Hook Points:                            │    │
│  │  ┌─────┐ ┌──────┐ ┌────────┐ ┌──────┐    │    │
│  │  │ XDP │ │tc/cls│ │ socket │ │kprobe│    │    │
│  │  └─────┘ └──────┘ └────────┘ └──────┘    │    │
│  │                                          │    │
│  │  BPF Maps (data sharing between progs)   │    │
│  │  ┌──────────────────────────────────┐    │    │
│  │  │ Service Map | Endpoint Map | CT  │    │    │
│  │  └──────────────────────────────────┘    │    │
│  └──────────────────────────────────────────┘    │
└──────────────────────────────────────────────────┘

10.2 Why eBPF Replaces iptables

| Comparison     | iptables                   | eBPF                      |
|----------------|----------------------------|---------------------------|
| Execution      | Kernel netfilter framework | Kernel hook points        |
| Rule Matching  | Linear scan O(n)           | Hash map O(1)             |
| Updates        | Full chain rewrite         | Map entry update          |
| Observability  | Limited (counters only)    | Rich metrics, events      |
| L7 Processing  | Not possible               | Possible (HTTP, DNS, etc.)|
| Conn. Tracking | conntrack (shared)         | BPF CT (efficient)        |
| CPU Usage      | High (at scale)            | Low                       |
| Scalability    | Degrades at 10K services   | Stable at 100K+           |
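The O(n) vs O(1) row is the crux, and it is easy to illustrate: a rule list must be walked entry by entry, while a hash map (a stand-in here for a BPF hash map) answers in a single probe regardless of Service count:

```python
# Linear scan over a rule list, counting comparisons -- the shape of
# an iptables chain traversal in the worst case.
def linear_lookup(rules, vip):
    comparisons = 0
    for rule_vip, backend in rules:
        comparisons += 1
        if rule_vip == vip:
            return backend, comparisons
    return None, comparisons

n = 10_000
rules = [(f"10.96.{i // 256}.{i % 256}", f"pod-{i}") for i in range(n)]
bpf_map = dict(rules)            # Python dict as a stand-in for a BPF map

target = rules[-1][0]            # worst case: the last rule in the chain
_, comparisons = linear_lookup(rules, target)
print(comparisons)               # 10000 comparisons for the linear scan
print(bpf_map[target])           # one hash lookup: 'pod-9999'
```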

10.3 Cilium Hubble: eBPF-Based Observability

# Monitor traffic with Hubble CLI
hubble observe --namespace production

# Check traffic for specific Pod
hubble observe --pod production/frontend-abc123

# Monitor DNS queries
hubble observe --protocol DNS

# Check dropped traffic (Network Policy)
hubble observe --verdict DROPPED

# Monitor HTTP requests (L7)
hubble observe --protocol HTTP --http-method GET

# Example output:
# TIMESTAMP            SOURCE              DESTINATION         TYPE     VERDICT
# Apr 14 09:15:23.456  prod/frontend-xxx   prod/backend-yyy    L4/TCP   FORWARDED
# Apr 14 09:15:23.789  prod/backend-yyy    prod/postgres-zzz   L4/TCP   FORWARDED
# Apr 14 09:15:24.012  prod/frontend-xxx   prod/redis-aaa      L4/TCP   DROPPED

11. Practical Network Debugging Guide

11.1 Packet Capture with tcpdump

# Run tcpdump via ephemeral container (K8s 1.25+)
kubectl debug -it pod/my-app-xxx \
  --image=nicolaka/netshoot \
  --target=my-app-container \
  -- tcpdump -i eth0 -nn port 8080

# Capture traffic with specific host
kubectl debug -it pod/my-app-xxx \
  --image=nicolaka/netshoot \
  --target=my-app-container \
  -- tcpdump -i eth0 -nn host 10.1.2.3

# Capture DNS queries
kubectl debug -it pod/my-app-xxx \
  --image=nicolaka/netshoot \
  --target=my-app-container \
  -- tcpdump -i eth0 -nn port 53

11.2 Network Connectivity Testing

# Comprehensive testing with netshoot Pod
kubectl run netshoot --image=nicolaka/netshoot --rm -it -- bash

# TCP connection test
curl -v telnet://my-service:8080

# DNS resolution check
dig my-service.default.svc.cluster.local
dig +trace my-service.default.svc.cluster.local

# MTU check
ping -M do -s 1472 target-pod-ip  # 1472 + 28 = 1500

# Routing table check
ip route show

# ARP table check
ip neigh show

# iptables rules check (on node)
iptables -t nat -L KUBE-SERVICES -n --line-numbers

11.3 Network Policy Troubleshooting

# 1. Check current Network Policies
kubectl get networkpolicy -A

# 2. Check policies applied to specific Pod
kubectl describe networkpolicy -n production

# 3. For Cilium: endpoint policy status
cilium endpoint list
cilium policy get

# 4. Connection test
kubectl exec -it frontend-pod -- curl -v http://backend-service:8080

# 5. Verify DNS is allowed (essential for Egress Policy)
kubectl exec -it my-pod -- nslookup kubernetes.default

# Common mistakes:
# - Missing DNS (UDP/TCP 53) in Egress Policy
# - Using podSelector without namespaceSelector
#   (won't match Pods in other namespaces)
# - Missing Egress in policyTypes (Egress rules ignored)

11.4 DNS Troubleshooting Checklist

# 1. Check CoreDNS Pod status
kubectl get pods -n kube-system -l k8s-app=kube-dns

# 2. Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100

# 3. Check DNS service endpoints
kubectl get endpoints kube-dns -n kube-system

# 4. Check resolv.conf
kubectl exec my-pod -- cat /etc/resolv.conf

# 5. Direct DNS query test
kubectl exec my-pod -- nslookup kubernetes.default.svc.cluster.local 10.96.0.10

# 6. External DNS resolution test
kubectl exec my-pod -- nslookup google.com

# For DNS performance issues:
# - Reduce ndots:5 to ndots:2
# - Use NodeLocal DNSCache
# - Enable autopath plugin

12. Performance Tuning

12.1 MTU Optimization

MTU Chain:
Pod (MTU) -> Overlay (VXLAN -50, IPinIP -20) -> Physical NIC (MTU)

Recommended settings:
- Physical NIC: Jumbo Frame 9000 (if possible)
- VXLAN overlay: Physical MTU - 50 = 8950
- IPinIP overlay: Physical MTU - 20 = 8980
- WireGuard encryption: Physical MTU - 60 = 8940

Default 1500 NIC:
- VXLAN: 1500 - 50 = 1450
- IPinIP: 1500 - 20 = 1480
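The arithmetic above reduces to one subtraction per encapsulation mode. A small helper capturing it (overhead constants as listed in this section):

```python
# Per-mode encapsulation overhead in bytes, as used above.
OVERHEAD = {
    "none": 0,
    "ipinip": 20,      # one extra IPv4 header
    "vxlan": 50,       # IP + UDP + VXLAN headers + inner Ethernet
    "wireguard": 60,   # WireGuard encapsulation overhead
}

def pod_mtu(physical_mtu, encapsulation):
    """Pod-side MTU = physical NIC MTU minus encapsulation overhead."""
    return physical_mtu - OVERHEAD[encapsulation]

print(pod_mtu(1500, "vxlan"))      # 1450
print(pod_mtu(1500, "ipinip"))     # 1480
print(pod_mtu(9000, "vxlan"))      # 8950
print(pod_mtu(9000, "wireguard"))  # 8940
```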

# Calico MTU configuration
apiVersion: crd.projectcalico.org/v1
kind: FelixConfiguration
metadata:
  name: default
spec:
  mtu: 8950              # Physical NIC 9000 - VXLAN 50
  vxlanMTU: 8950
  wireguardMTU: 8940

12.2 NodeLocal DNSCache

Caches DNS queries locally on nodes to reduce CoreDNS load.

# NodeLocal DNSCache DaemonSet (simplified)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-local-dns
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: node-local-dns
  template:
    spec:
      containers:
      - name: node-cache
        image: registry.k8s.io/dns/k8s-dns-node-cache:1.23.0
        args:
        - "-localip"
        - "169.254.20.10"    # Link-local IP
        - "-conf"
        - "/etc/Corefile"
        - "-upstreamsvc"
        - "kube-dns"

NodeLocal DNSCache Flow:

Pod -> 169.254.20.10 (NodeLocal) -> CoreDNS (on cache miss)
                |
          On cache hit: immediate response

Benefits:
- DNS latency reduced by 50%+
- CoreDNS load reduced by 70-80%
- Eliminates conntrack contention
- Prevents UDP DNS packet loss
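Why the cache cuts upstream load is visible in a toy TTL cache: repeated lookups within the TTL never leave the node. This is a simplified model, not NodeLocal DNSCache's implementation:

```python
# Toy TTL cache: lookups within the TTL are served locally; only
# misses and expiries go upstream (CoreDNS in the real deployment).
class DnsCache:
    def __init__(self, ttl=30):
        self.ttl = ttl
        self.store = {}            # name -> (ip, expiry)
        self.upstream_queries = 0

    def resolve(self, name, upstream, now):
        hit = self.store.get(name)
        if hit and hit[1] > now:
            return hit[0]          # cache hit: no upstream call
        self.upstream_queries += 1
        ip = upstream(name)
        self.store[name] = (ip, now + self.ttl)
        return ip

cache = DnsCache(ttl=30)
upstream = lambda name: "10.1.1.2"     # stand-in for a CoreDNS answer
for t in range(100):                   # 100 lookups over 100 seconds
    cache.resolve("my-service.default.svc.cluster.local", upstream, now=t)
print(cache.upstream_queries)          # 4 upstream queries instead of 100
```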

12.3 TCP Tuning

# Node-level kernel parameter tuning

# Connection tracking table size
net.netfilter.nf_conntrack_max = 1048576

# TCP buffer sizes
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# TCP connection reuse
net.ipv4.tcp_tw_reuse = 1

# SYN backlog
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535

# Keep-alive
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 10

13. Multi-Cluster Networking

13.1 Cilium ClusterMesh

┌──────────────┐         ┌──────────────┐
│  Cluster A   │         │  Cluster B   │
│ (us-east-1)  │         │ (eu-west-1)  │
│              │         │              │
│  ┌────────┐  │ Tunnel  │  ┌────────┐  │
│  │Cilium  │<───────────>│  │Cilium  │ │
│  │Agent   │  │         │  │Agent   │  │
│  └────────┘  │         │  └────────┘  │
│              │         │              │
│  Pod CIDR:   │         │  Pod CIDR:   │
│  10.1.0.0/16 │         │  10.2.0.0/16 │
│              │         │              │
│  Service:    │ Global  │  Service:    │
│  shared-db   │ Service │  shared-db   │
│  (annotated) │         │  (annotated) │
└──────────────┘         └──────────────┘

# Global Service configuration (on both clusters)
apiVersion: v1
kind: Service
metadata:
  name: shared-database
  annotations:
    service.cilium.io/global: "true"
    service.cilium.io/shared: "true"
spec:
  selector:
    app: postgres
  ports:
  - port: 5432

14. Quiz

Q1. Why does every Pod in Kubernetes get a unique IP and communicate without NAT?

Answer: Because of the three fundamental principles of the Kubernetes networking model:

  1. Every Pod gets a unique IP address
  2. Every Pod can reach every other Pod without NAT
  3. The IP a Pod sees for itself is the same IP others see

This model eliminates the complexity of port mapping and NAT, simplifying service discovery and network policies. CNI plugins implement these requirements.

Q2. Why is eBPF more suitable for Kubernetes networking than iptables?

Answer: iptables traverses rules linearly (O(n)) and requires full chain rewrites on updates. At 10,000+ Services, performance degrades severely.

eBPF offers:

  • Hash map-based O(1) lookup with consistent performance regardless of Service count
  • Map entry updates only for fast rule changes
  • L7 policies (HTTP method/path) processed directly in kernel
  • Hubble for rich observability

Cilium uses eBPF to fully replace kube-proxy.

Q3. What are the key reasons Gateway API replaces Ingress?

Answer: Core limitations of Ingress:

  • Non-standard annotations -- configuration differs per controller
  • HTTP/HTTPS only -- no TCP/UDP/gRPC support
  • Single resource -- hard to separate roles between infra admins, operators, and developers
  • Advanced features like traffic splitting and header matching not in standard

Gateway API uses hierarchical resources (GatewayClass/Gateway/Route) for role separation and supports traffic splitting, header matching, and multiple protocols as standard spec.

Q4. What is the most common mistake when setting up Egress Network Policy rules?

Answer: Forgetting to allow DNS (UDP/TCP port 53).

When Egress Policy is set, all outbound traffic is blocked. DNS queries are also blocked, so Service name resolution fails. You must allow UDP/TCP port 53 Egress to kube-dns (CoreDNS) Pods.

Other common mistakes:

  • Missing Egress in policyTypes
  • Trying to match Pods in other namespaces without namespaceSelector
  • CIDR range errors in Pod CIDR-based rules

Q5. Why does ndots:5 degrade external DNS lookup performance, and what are the solutions?

Answer: With the default Kubernetes ndots:5 setting, looking up domains like "api.example.com" (fewer than 5 dots) first tries appending search domains:

  1. api.example.com.default.svc.cluster.local (fail)
  2. api.example.com.svc.cluster.local (fail)
  3. api.example.com.cluster.local (fail)
  4. api.example.com (success)

This causes 3 unnecessary queries for a single external DNS lookup.

Solutions:

  • Reduce ndots to 2 in Pod dnsConfig
  • Use FQDN (add trailing dot: api.example.com.)
  • Use NodeLocal DNSCache for caching
  • Enable CoreDNS autopath plugin

15. References

  1. Kubernetes Networking Model - Official Documentation
  2. Cilium Documentation - eBPF-based Networking
  3. Calico Documentation - Project Calico
  4. Gateway API - Official Specification
  5. CoreDNS - DNS for Service Discovery
  6. Network Policies - Kubernetes Documentation
  7. eBPF.io - Introduction to eBPF
  8. Hubble - Network Observability for Kubernetes
  9. IPVS-based kube-proxy - Kubernetes Blog
  10. CNI Specification - Container Network Interface
  11. NodeLocal DNSCache - Kubernetes Documentation
  12. Cilium ClusterMesh - Multi-Cluster Networking
  13. Life of a Packet in Kubernetes - Conference Talk
