The Complete Guide to Container & Kubernetes Network Debugging

1. Docker Networking Models

The foundation of container network debugging starts with a solid understanding of the networking models Docker provides.

1.1 Bridge Network

The bridge driver is Docker's default network driver. Each container receives a virtual ethernet (veth) interface connected to the docker0 bridge.

# Inspect the bridge network details
docker network inspect bridge

# Check a container's IP address
docker inspect --format '{{.NetworkSettings.IPAddress}}' <container_id>

# List veth pairs
ip link show type veth

# Show interfaces connected to docker0
brctl show docker0

Common issues with bridge networking include:

  • Container-to-container communication failure: Verify that both containers are on the same bridge network.
  • No external connectivity: Check iptables NAT rules and IP forwarding settings.
  • Port conflicts: Inspect host port binding overlaps.

# Check iptables NAT rules
sudo iptables -t nat -L -n -v

# Verify IP forwarding is enabled
cat /proc/sys/net/ipv4/ip_forward

# Check port bindings
docker port <container_id>

1.2 Host Network

The container shares the host's network stack directly. There is no network isolation, which yields better performance but introduces port conflict risks.

# Run a container in host network mode
docker run --network host nginx

# Verify the container sees host interfaces
docker exec <container_id> ip addr show

# Confirm shared network stack
docker exec <container_id> ss -tlnp

1.3 Overlay Network

Overlay networks enable communication between containers across multiple Docker hosts. They require Docker Swarm or an external key-value store.

# Create an overlay network
docker network create --driver overlay my-overlay

# Check VXLAN tunnel status
ip -d link show type vxlan

# Inspect overlay network peer information
docker network inspect my-overlay --format '{{json .Peers}}'

Key considerations when debugging overlay networks:

  • Ensure the VXLAN port (UDP 4789) is allowed through firewalls
  • Verify MTU settings account for VXLAN overhead (50 bytes)
  • Check time synchronization between nodes
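The 50-byte VXLAN overhead translates directly into the MTU the overlay interfaces should carry. A quick sketch of the arithmetic (header sizes assume VXLAN over IPv4):

```shell
# VXLAN over IPv4 adds 20 (outer IP) + 8 (UDP) + 8 (VXLAN) + 14 (inner
# Ethernet) = 50 bytes of encapsulation overhead per packet.
vxlan_mtu() {
  echo $(( $1 - 50 ))
}

vxlan_mtu 1500   # standard Ethernet -> overlay MTU 1450
vxlan_mtu 9000   # jumbo frames     -> overlay MTU 8950
```

If the overlay interface's MTU is left at the host value, packets near full size are silently fragmented or dropped, which is why MTU mismatches show up as "large requests hang, small ones work".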

2. Kubernetes Networking Model and CNI

2.1 Core Principles of Kubernetes Networking

The Kubernetes networking model is built on three fundamental requirements:

  1. Every Pod can communicate with every other Pod without NAT
  2. Every node can communicate with every Pod without NAT
  3. The IP a Pod sees for itself is the same IP that other Pods see for it

# Check the cluster network CIDR
kubectl cluster-info dump | grep -m 1 cluster-cidr

# Get Pod CIDRs assigned to each node
kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'

# Identify the CNI plugin in use
ls /etc/cni/net.d/
cat /etc/cni/net.d/*.conflist
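A quick sanity check on the output above is confirming that a Pod IP actually falls inside its node's podCIDR. A pure-shell sketch of the IPv4 containment arithmetic (helper names are illustrative):

```shell
# ip_to_int: convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# ip_in_cidr <ip> <cidr>: succeed (exit 0) if ip lies inside cidr.
ip_in_cidr() {
  local ip=$1 net=${2%/*} bits=${2#*/}
  local mask=$(( 0xFFFFFFFF << (32 - bits) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# Example: does this Pod IP belong to the node's podCIDR?
ip_in_cidr 10.244.1.17 10.244.1.0/24 && echo "inside podCIDR"
```

A Pod IP outside every node's podCIDR usually points at a stale CNI IPAM state or a misconfigured cluster CIDR.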

2.2 Understanding CNI Plugins

CNI (Container Network Interface) is the standard for configuring network interfaces in containers.

# Locate CNI binaries
ls /opt/cni/bin/

# Inspect the CNI configuration
cat /etc/cni/net.d/10-calico.conflist

# Check kubelet logs for CNI-related events
journalctl -u kubelet | grep -i cni

Comparison of popular CNI plugins:

CNI Plugin | Data Plane    | Network Policy | Key Feature
-----------|---------------|----------------|----------------------------------
Calico     | iptables/eBPF | Supported      | BGP-based routing
Cilium     | eBPF          | Supported      | L7 policies, Hubble observability
Flannel    | VXLAN/host-gw | Not supported  | Simple configuration
Weave      | VXLAN         | Supported      | Automatic mesh networking

3. Pod-to-Pod Communication Debugging

3.1 Same-Node Pod Communication

# Get Pod IPs
kubectl get pods -o wide

# Ping from one Pod to another
kubectl exec -it <pod-a> -- ping <pod-b-ip>

# Trace the route between Pods on the same node
kubectl exec -it <pod-a> -- traceroute <pod-b-ip>

# Inspect veth interfaces on the node
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- ip link show

3.2 Cross-Node Pod Communication

# Check the routing table on the node
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- route -n

# Verify VXLAN or IPIP tunnel status
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- ip tunnel show

# Capture packets with tcpdump
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- \
  tcpdump -i any -nn host <target-pod-ip>

# Test for MTU issues (Path MTU Discovery)
kubectl exec -it <pod> -- ping -M do -s 1400 <target-pod-ip>

Common causes of cross-node communication failures:

  • Cloud security groups blocking Pod CIDR traffic
  • MTU mismatch causing packet fragmentation
  • Inconsistent routing tables across nodes
  • IPIP/VXLAN tunnel interfaces being down
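The `ping -M do -s 1400` probe above tests one fixed size; to probe a suspected MTU exactly, subtract the IP and ICMP header bytes from it. A small helper (assumes IPv4: 20-byte IP header plus 8-byte ICMP header):

```shell
# icmp_payload_for_mtu: the ping -s value that makes an IPv4 echo
# request exactly <mtu> bytes on the wire (20 IP + 8 ICMP header bytes).
icmp_payload_for_mtu() {
  echo $(( $1 - 28 ))
}

icmp_payload_for_mtu 1500   # -> 1472
icmp_payload_for_mtu 1450   # -> 1422 (typical VXLAN overlay MTU)

# Usage sketch: probe the overlay path at its expected MTU with DF set
#   kubectl exec -it <pod> -- ping -M do -s $(icmp_payload_for_mtu 1450) <target-pod-ip>
```

If the probe succeeds at 1422 but fails at 1472, the path MTU sits between the two values, which is consistent with unaccounted-for tunnel overhead.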

4. Pod-to-Service Communication Debugging

4.1 Verifying Service Basics

# Check the Service's ClusterIP and Endpoints
kubectl get svc <service-name> -o wide
kubectl get endpoints <service-name>

# Inspect the Endpoints in detail
kubectl get endpoints <service-name> -o yaml

# Test connectivity to the Service
kubectl exec -it <test-pod> -- curl -v http://<service-name>:<port>

# Verify Pods matching the Service selector
kubectl get pods -l <label-selector>
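Empty Endpoints almost always mean the selector does not match the Pod labels, and the matching rule is simple: every key=value pair in the selector must be present on the Pod. A sketch of that check, with labels passed as comma-separated key=value strings purely for illustration:

```shell
# selector_matches <selector> <labels>: succeed if every key=value pair
# in <selector> also appears in <labels> (both comma-separated).
selector_matches() {
  local sel=$1 labels=$2 pair
  local IFS=,
  for pair in $sel; do
    case ",$labels," in
      *",$pair,"*) ;;        # this selector pair is present, keep going
      *) return 1 ;;         # a selector pair is missing -> no match
    esac
  done
  return 0
}

selector_matches "app=myapp,tier=web" "app=myapp,tier=web,env=prod" && echo "match"
selector_matches "app=myapp" "app=other" || echo "no match -> Endpoints would be empty"
```

Note the asymmetry: extra Pod labels are fine, but a single selector pair with a typo (e.g. `app=my-app` vs `app=myapp`) silently produces an empty Endpoints list.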

4.2 kube-proxy and iptables/IPVS Rules

kube-proxy translates Service virtual IPs into actual Pod IPs.

# Determine the kube-proxy mode
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode

# iptables mode: inspect Service-related rules
sudo iptables -t nat -L KUBE-SERVICES -n
sudo iptables -t nat -L KUBE-SVC-<hash> -n
sudo iptables -t nat -L KUBE-SEP-<hash> -n

# IPVS mode: list virtual servers
sudo ipvsadm -Ln
sudo ipvsadm -Ln -t <cluster-ip>:<port>

# Check kube-proxy logs
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=100

Understanding the iptables chain flow is essential for debugging:

PREROUTING -> KUBE-SERVICES -> KUBE-SVC-xxx -> KUBE-SEP-xxx (DNAT)

# Trace the full iptables chain for a specific Service
SERVICE_IP=$(kubectl get svc <service-name> -o jsonpath='{.spec.clusterIP}')
sudo iptables -t nat -L KUBE-SERVICES -n | grep $SERVICE_IP

# Inspect conntrack entries for the Service
sudo conntrack -L -d $SERVICE_IP
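The grep above finds only the first hop; following a Service all the way means walking KUBE-SERVICES → KUBE-SVC → KUBE-SEP through `iptables-save` output. A sketch of that walk, demonstrated against canned rules so the chain-following logic is visible (on a real node you would pipe in `sudo iptables-save -t nat`; the chain hashes are made up):

```shell
# trace_kube_chain <cluster-ip>: read iptables-save output on stdin and
# print the Service's KUBE-SVC chain plus its KUBE-SEP endpoint chains.
trace_kube_chain() {
  awk -v ip="$1" '
    # hop 1: the KUBE-SERVICES rule for the ClusterIP names a KUBE-SVC chain
    $0 ~ "-A KUBE-SERVICES" && $0 ~ ("-d " ip "/32") {
      for (i = 1; i <= NF; i++) if ($i == "-j") svc = $(i + 1)
      print "service chain: " svc
    }
    # hop 2: the KUBE-SVC chain jumps to one KUBE-SEP chain per endpoint
    svc && $0 ~ ("-A " svc) && /KUBE-SEP/ {
      for (i = 1; i <= NF; i++) if ($i == "-j") print "endpoint chain: " $(i + 1)
    }'
}

# Demonstration on a minimal canned ruleset:
printf '%s\n' \
  '-A KUBE-SERVICES -d 10.96.0.10/32 -p tcp --dport 80 -j KUBE-SVC-ABCD' \
  '-A KUBE-SVC-ABCD -m statistic --mode random -j KUBE-SEP-1111' \
  '-A KUBE-SVC-ABCD -j KUBE-SEP-2222' \
  | trace_kube_chain 10.96.0.10
```

Each KUBE-SEP chain holds the DNAT rule to one backend Pod, so a Service with no endpoint chains here is the iptables-level view of an empty Endpoints object.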

5. Service DNS Resolution Issues

5.1 Verifying CoreDNS Operation

CoreDNS handles all in-cluster DNS resolution in Kubernetes.

# Check CoreDNS Pod status
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Inspect CoreDNS configuration
kubectl get configmap coredns -n kube-system -o yaml

# Test DNS resolution
kubectl exec -it <test-pod> -- nslookup <service-name>
kubectl exec -it <test-pod> -- nslookup <service-name>.<namespace>.svc.cluster.local

# Measure DNS response times
kubectl exec -it <test-pod> -- dig <service-name>.default.svc.cluster.local +stats

5.2 Diagnosing DNS Problems

# Check the Pod's DNS configuration
kubectl exec -it <pod> -- cat /etc/resolv.conf

# Review CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=200

# Deploy a DNS debugging Pod
kubectl run dnsutils --image=gcr.io/kubernetes-e2e-test-images/dnsutils:1.3 \
  --restart=Never -- sleep 3600

# Test external domain resolution
kubectl exec -it dnsutils -- nslookup google.com
kubectl exec -it dnsutils -- nslookup kubernetes.default

Common DNS issues and their resolutions:

  • ndots configuration: The default ndots:5 in resolv.conf causes multiple DNS queries for short names. Using FQDNs (with a trailing .) eliminates unnecessary lookups.
  • CoreDNS CrashLoopBackOff: Check logs and verify upstream DNS server connectivity.
  • DNS caching issues: Review the CoreDNS cache plugin configuration.

# Demonstrate the ndots effect: compare query counts
kubectl exec -it <pod> -- dig myservice.default.svc.cluster.local +search +showsearch
kubectl exec -it <pod> -- dig myservice.default.svc.cluster.local. +search +showsearch
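What `+showsearch` displays can be reasoned about directly: with ndots:5, any name containing fewer than five dots is tried against every search-list entry before (or instead of) being queried literally. A sketch of that decision, assuming the default three-entry Pod search path:

```shell
# queries_for: worst-case number of DNS lookups a name can trigger with
# ndots:5 and a 3-entry search path (<ns>.svc.cluster.local,
# svc.cluster.local, cluster.local), assuming every suffix query fails.
queries_for() {
  local name=$1
  case $name in
    *.) echo 1; return ;;    # trailing dot: absolute FQDN, exactly one query
  esac
  local dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
  if [ "$dots" -ge 5 ]; then
    echo 1                   # enough dots: the literal name is tried first
  else
    echo 4                   # up to 3 failing suffix queries + the literal name
  fi
}

queries_for myservice.default.svc.cluster.local    # -> 4 (only 4 dots!)
queries_for myservice.default.svc.cluster.local.   # -> 1
```

The surprise is the first example: even the "full" service name has only four dots, so without the trailing dot it still walks the search list in the worst case.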

6. External Traffic Debugging

6.1 Ingress / LoadBalancer Troubleshooting

# List all Ingress resources
kubectl get ingress -A
kubectl describe ingress <ingress-name>

# Check Ingress Controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100

# Verify LoadBalancer Service external IPs
kubectl get svc --field-selector spec.type=LoadBalancer

# Test NodePort access externally
curl -v http://<node-ip>:<node-port>

6.2 Egress Traffic Debugging

# Verify outbound connectivity from a Pod
kubectl exec -it <pod> -- curl -v https://httpbin.org/ip

# Check NAT gateway and routing
kubectl exec -it <pod> -- traceroute 8.8.8.8

# Inspect SNAT rules
sudo iptables -t nat -L POSTROUTING -n -v

7. Network Policy Debugging

7.1 Basic Network Policy Diagnostics

# List all Network Policies
kubectl get networkpolicy -A
kubectl describe networkpolicy <policy-name> -n <namespace>

# Find Network Policies applying to a specific Pod
kubectl get networkpolicy -n <namespace> -o json | \
  jq '.items[] | select(.spec.podSelector.matchLabels | to_entries[] |
  .key == "app" and .value == "myapp")'

# Test connectivity before and after policy application
kubectl exec -it <source-pod> -- nc -zv <target-pod-ip> <port>

7.2 Network Policy Troubleshooting Patterns

# Default deny-all policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

# Test connectivity after applying the policy
kubectl exec -it <pod> -- wget -qO- --timeout=2 http://<target-service>

# Check CNI-specific Network Policy logs
# Calico
kubectl logs -n calico-system -l k8s-app=calico-node --tail=100

# Cilium
kubectl exec -n kube-system <cilium-pod> -- cilium policy get
kubectl exec -n kube-system <cilium-pod> -- cilium endpoint list

8. Calico Troubleshooting

8.1 Checking Calico Status

# Verify Calico node health
kubectl get pods -n calico-system
calicoctl node status

# Check BGP peer status
calicoctl get bgpPeer
calicoctl node status | grep -A 5 "BGP"

# Inspect IP pools
calicoctl get ippool -o wide

# List Calico-specific network policies
calicoctl get networkpolicy -A
calicoctl get globalnetworkpolicy

8.2 Calico Debugging

# Increase Felix log level for debugging
calicoctl patch felixconfiguration default \
  --patch '{"spec":{"logSeverityScreen":"Debug"}}'

# Check BIRD routing daemon status
kubectl exec -n calico-system <calico-node-pod> -- birdcl show route

# Inspect IP-in-IP tunnel status
kubectl exec -n calico-system <calico-node-pod> -- ip tunnel show

# Review Calico-programmed iptables rules
sudo iptables -L -n | grep cali

9. Cilium Troubleshooting

9.1 Checking Cilium Status

# Get detailed Cilium agent status
kubectl exec -n kube-system <cilium-pod> -- cilium status --verbose

# List all Cilium endpoints
kubectl exec -n kube-system <cilium-pod> -- cilium endpoint list

# Inspect BPF maps
kubectl exec -n kube-system <cilium-pod> -- cilium bpf lb list
kubectl exec -n kube-system <cilium-pod> -- cilium bpf ct list global

# Observe network flows with Hubble
hubble observe --namespace <namespace>
hubble observe --pod <pod-name> --protocol TCP

9.2 Cilium Debugging

# Run the Cilium connectivity test suite (cilium CLI, run against the cluster)
cilium connectivity test

# Check eBPF program status
kubectl exec -n kube-system <cilium-pod> -- cilium bpf prog list

# Trace policy decisions between Pods
kubectl exec -n kube-system <cilium-pod> -- cilium policy trace \
  --src-k8s-pod <namespace>:<src-pod> \
  --dst-k8s-pod <namespace>:<dst-pod> \
  --dport <port>

# Monitor events in real time
kubectl exec -n kube-system <cilium-pod> -- cilium monitor --type drop
kubectl exec -n kube-system <cilium-pod> -- cilium monitor --type policy-verdict

10. Debug Containers and Ephemeral Containers

10.1 Using kubectl debug

# Attach a debug container to a running Pod
kubectl debug -it <pod-name> --image=nicolaka/netshoot --target=<container-name>

# Debug at the node level
kubectl debug node/<node-name> -it --image=ubuntu

# Create a copy of a Pod for debugging (no impact on original)
kubectl debug <pod-name> -it --copy-to=debug-pod --image=nicolaka/netshoot

# Copy a Pod with a shared process namespace (--share-processes requires --copy-to)
kubectl debug <pod-name> -it --copy-to=debug-pod --image=busybox --share-processes

10.2 Network Debugging with netshoot

# Deploy a netshoot container
kubectl run netshoot --image=nicolaka/netshoot --restart=Never -- sleep 3600

# Enter the container for comprehensive diagnostics
kubectl exec -it netshoot -- bash

# Commands to run inside the container:
# TCP connectivity test
nc -zv <service-ip> <port>

# HTTP request test
curl -v http://<service-name>.<namespace>.svc.cluster.local:<port>/health

# DNS lookup
dig +short <service-name>.<namespace>.svc.cluster.local

# Packet capture
tcpdump -i eth0 -nn port 80

# Network path tracing
mtr <target-ip>

# SSL/TLS connection verification
openssl s_client -connect <service-ip>:443
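Inside netshoot it helps to run such probes as a scripted battery rather than one by one. A minimal harness sketch (the service names and ports in the example battery are placeholders):

```shell
# check <label> <command...>: run one probe, report PASS/FAIL from its
# exit status, and keep going so one failure does not stop the battery.
check() {
  local label=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $label"
  else
    echo "FAIL  $label"
  fi
}

# Example battery (placeholders for real targets):
#   check "dns"  nslookup myservice.default.svc.cluster.local
#   check "tcp"  nc -zv -w 2 10.96.0.10 80
#   check "http" curl -sf http://myservice.default.svc.cluster.local/health

# The harness itself can be exercised with plain commands:
check "always-true"  true    # PASS  always-true
check "always-false" false   # FAIL  always-false
```

Running the same battery from several Pods and namespaces quickly separates "one broken workload" from "cluster-wide network problem".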

11. Practical Debugging Scenarios

Scenario 1: Pod Cannot Reach a Service

# Step 1: Verify the Service and its Endpoints
kubectl get svc <service-name> -o yaml
kubectl get endpoints <service-name>

# Step 2: If Endpoints are empty, check the selector
kubectl get pods -l <label-selector> --show-labels

# Step 3: Verify target Pods are Ready
kubectl get pods -l <label-selector> -o wide
kubectl describe pod <target-pod> | grep -A 5 "Conditions"

# Step 4: Inspect kube-proxy rules
sudo iptables -t nat -L KUBE-SERVICES -n | grep <cluster-ip>

# Step 5: Test direct Pod IP connectivity
kubectl exec -it <source-pod> -- curl -v http://<pod-ip>:<target-port>

Scenario 2: Intermittent Timeouts

# Step 1: Check conntrack table saturation
sudo sysctl net.netfilter.nf_conntrack_count
sudo sysctl net.netfilter.nf_conntrack_max
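The two sysctls are only meaningful relative to each other; trouble usually starts well before count reaches max. A quick utilization calculation (the 80% threshold is a rule of thumb, not a hard limit):

```shell
# conntrack_util <count> <max>: print conntrack table utilization and
# warn when it crosses a rule-of-thumb 80% threshold.
conntrack_util() {
  local count=$1 max=$2
  local pct=$(( count * 100 / max ))
  echo "conntrack: ${count}/${max} (${pct}%)"
  if [ "$pct" -ge 80 ]; then
    echo "WARNING: conntrack table nearly full - expect dropped connections"
  fi
}

# On a node, feed in the sysctls shown above:
#   conntrack_util "$(sysctl -n net.netfilter.nf_conntrack_count)" \
#                  "$(sysctl -n net.netfilter.nf_conntrack_max)"
conntrack_util 209000 262144   # 79%, just below the warning threshold
```

A saturated conntrack table is a classic cause of *intermittent* failures: existing entries keep working while new connections are dropped.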

# Step 2: Look for packet drops
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- \
  netstat -s | grep -i drop

# Step 3: Check network interface errors
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- \
  ip -s link show

# Step 4: Monitor TCP retransmissions
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- \
  netstat -s | grep -i retrans
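The raw retransmission counter means little without the total segment count beside it. A sketch that turns the two `netstat -s` lines into a rate (line wording assumed from typical Linux output):

```shell
# retrans_rate: read `netstat -s` TCP counters on stdin and print the
# retransmission percentage (segments retransmitted / segments sent).
retrans_rate() {
  awk '
    /segments sent out/      { sent = $1 }
    /segments retransmitted/ { retrans = $1 }
    END {
      if (sent > 0) printf "%.2f%% retransmitted (%d of %d)\n",
                           retrans * 100 / sent, retrans, sent
    }'
}

# Demonstration on canned counters (on a node: netstat -s | retrans_rate):
printf '%s\n' '    491022 segments sent out' '    1204 segments retransmitted' | retrans_rate
```

Sustained rates above a percent or so usually indicate packet loss on the path (MTU issues, congested links, or flapping tunnels) rather than an application problem.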

Scenario 3: Slow DNS Resolution

# Step 1: Measure DNS response times
kubectl exec -it <pod> -- dig <service-name>.default.svc.cluster.local +stats | grep "Query time"

# Step 2: Check CoreDNS resource usage
kubectl top pods -n kube-system -l k8s-app=kube-dns

# Step 3: Apply ndots optimization
# Add to Pod spec:
# dnsConfig:
#   options:
#   - name: ndots
#     value: "2"

# Step 4: Check CoreDNS cache hit rates
kubectl exec -n kube-system <coredns-pod> -- \
  wget -qO- http://localhost:9153/metrics | grep coredns_cache

12. Debugging Checklist

A systematic checklist for approaching network issues:

[ ] Pod status verification (Running, Ready)
[ ] Service Endpoints verification (not empty)
[ ] DNS resolution test (nslookup/dig)
[ ] Direct Pod IP connectivity test
[ ] kube-proxy rules inspection (iptables/IPVS)
[ ] Network Policy review (Ingress/Egress rules)
[ ] CNI plugin health check
[ ] Cross-node network connectivity
[ ] MTU configuration verification
[ ] Firewall / security group rules
[ ] conntrack table status
[ ] CoreDNS health and logs

Network debugging is most effective when approached layer by layer: L2 (Ethernet) -> L3 (IP/Routing) -> L4 (TCP/UDP) -> L7 (HTTP/DNS). By systematically verifying each layer, even the most complex networking issues can be methodically diagnosed and resolved.
