The Complete Guide to Container & Kubernetes Network Debugging

1. Docker Networking Models

Effective container network debugging starts with a solid understanding of the networking models Docker provides.

1.1 Bridge Network

The bridge driver is Docker's default network driver. Each container receives a virtual Ethernet (veth) pair whose host end connects to the docker0 bridge.

# Inspect the bridge network details
docker network inspect bridge

# Check a container's IP address
docker inspect --format '{{.NetworkSettings.IPAddress}}' <container_id>

# List veth pairs
ip link show type veth

# Show interfaces connected to docker0 (brctl is from the legacy bridge-utils
# package; "ip link show master docker0" works without it)
brctl show docker0

Common issues with bridge networking include:

  • Container-to-container communication failure: Verify that both containers are on the same bridge network.
  • No external connectivity: Check iptables NAT rules and IP forwarding settings.
  • Port conflicts: Inspect host port binding overlaps.

# Check iptables NAT rules
sudo iptables -t nat -L -n -v

# Verify IP forwarding is enabled
cat /proc/sys/net/ipv4/ip_forward

# Check port bindings
docker port <container_id>

1.2 Host Network

The container shares the host's network stack directly. There is no network isolation, which yields better performance but introduces port conflict risks.

# Run a container in host network mode
docker run --network host nginx

# Verify the container sees host interfaces
docker exec <container_id> ip addr show

# Confirm shared network stack
docker exec <container_id> ss -tlnp

1.3 Overlay Network

Overlay networks enable communication between containers across multiple Docker hosts. They require Docker Swarm or an external key-value store.

# Create an overlay network
docker network create --driver overlay my-overlay

# Check VXLAN tunnel status
ip -d link show type vxlan

# Inspect overlay network peer information
docker network inspect my-overlay --format '{{json .Peers}}'

Key considerations when debugging overlay networks:

  • Ensure the VXLAN port (UDP 4789) is allowed through firewalls
  • Verify MTU settings account for VXLAN overhead (50 bytes)
  • Check time synchronization between nodes
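The 50-byte figure is the sum of the outer headers VXLAN adds: Ethernet (14) + IP (20) + UDP (8) + VXLAN (8). A quick sketch for deriving the overlay MTU from the host MTU:

```shell
# VXLAN overhead: outer Ethernet (14) + outer IP (20) + outer UDP (8) + VXLAN (8)
HOST_MTU=1500                         # physical NIC MTU; adjust for your environment
VXLAN_OVERHEAD=$((14 + 20 + 8 + 8))   # = 50 bytes
CONTAINER_MTU=$((HOST_MTU - VXLAN_OVERHEAD))
echo "overlay interfaces should use MTU <= ${CONTAINER_MTU}"   # 1450 for a 1500-byte host MTU
```

If the underlay itself is encapsulated (as in some cloud networks), subtract that overhead as well.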

2. Kubernetes Networking Model and CNI

2.1 Core Principles of Kubernetes Networking

The Kubernetes networking model is built on three fundamental requirements:

  1. Every Pod can communicate with every other Pod without NAT
  2. Every node can communicate with every Pod without NAT
  3. The IP a Pod sees for itself is the same IP that other Pods see for it

# Check the cluster network CIDR
kubectl cluster-info dump | grep -m 1 cluster-cidr

# Get Pod CIDRs assigned to each node (one node per line)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'

# Identify the CNI plugin in use
ls /etc/cni/net.d/
cat /etc/cni/net.d/*.conflist

2.2 Understanding CNI Plugins

CNI (Container Network Interface) is the standard for configuring network interfaces in containers.

# Locate CNI binaries
ls /opt/cni/bin/

# Inspect the CNI configuration
cat /etc/cni/net.d/10-calico.conflist

# Check kubelet logs for CNI-related events
journalctl -u kubelet | grep -i cni
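The conflist that kubelet hands to the container runtime is plain JSON. The sketch below fabricates a minimal example (the file name, network name, and plugin chain are illustrative, not taken from any particular CNI) and extracts the plugin types from it:

```shell
# Write an illustrative conflist (real ones live in /etc/cni/net.d/)
cat > /tmp/10-example.conflist <<'EOF'
{
  "cniVersion": "0.3.1",
  "name": "example-net",
  "plugins": [
    { "type": "bridge", "bridge": "cni0", "ipam": { "type": "host-local" } },
    { "type": "portmap", "capabilities": { "portMappings": true } }
  ]
}
EOF

# List every plugin type in the chain (including the nested IPAM plugin)
grep -o '"type": *"[a-z-]*"' /tmp/10-example.conflist | cut -d'"' -f4
# -> bridge, host-local, portmap
```

Plugins in the `plugins` array run as a chain: the main plugin wires the interface, and helpers like portmap handle host port mappings.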

Comparison of popular CNI plugins:

CNI Plugin   Data Plane      Network Policy   Key Feature
Calico       iptables/eBPF   Supported        BGP-based routing
Cilium       eBPF            Supported        L7 policies, Hubble observability
Flannel      VXLAN/host-gw   Not supported    Simple configuration
Weave        VXLAN           Supported        Automatic mesh networking

3. Pod-to-Pod Communication Debugging

3.1 Same-Node Pod Communication

# Get Pod IPs
kubectl get pods -o wide

# Ping from one Pod to another
kubectl exec -it <pod-a> -- ping <pod-b-ip>

# Trace the route between Pods on the same node
kubectl exec -it <pod-a> -- traceroute <pod-b-ip>

# Inspect veth interfaces on the node
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- ip link show

3.2 Cross-Node Pod Communication

# Check the routing table on the node
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- route -n

# Verify VXLAN or IPIP tunnel status
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- ip tunnel show

# Capture packets with tcpdump
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- \
  tcpdump -i any -nn host <target-pod-ip>

# Test for MTU issues (Path MTU Discovery)
kubectl exec -it <pod> -- ping -M do -s 1400 <target-pod-ip>

Common causes of cross-node communication failures:

  • Cloud security groups blocking Pod CIDR traffic
  • MTU mismatch causing packet fragmentation
  • Inconsistent routing tables across nodes
  • IPIP/VXLAN tunnel interfaces being down

4. Pod-to-Service Communication Debugging

4.1 Verifying Service Basics

# Check the Service's ClusterIP and Endpoints
kubectl get svc <service-name> -o wide
kubectl get endpoints <service-name>

# Inspect the Endpoints in detail
kubectl get endpoints <service-name> -o yaml

# Test connectivity to the Service
kubectl exec -it <test-pod> -- curl -v http://<service-name>:<port>

# Verify Pods matching the Service selector
kubectl get pods -l <label-selector>

4.2 kube-proxy and iptables/IPVS Rules

kube-proxy translates Service virtual IPs into actual Pod IPs.

# Determine the kube-proxy mode
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode

# iptables mode: inspect Service-related rules
sudo iptables -t nat -L KUBE-SERVICES -n
sudo iptables -t nat -L KUBE-SVC-<hash> -n
sudo iptables -t nat -L KUBE-SEP-<hash> -n

# IPVS mode: list virtual servers
sudo ipvsadm -Ln
sudo ipvsadm -Ln -t <cluster-ip>:<port>

# Check kube-proxy logs
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=100

Understanding the iptables chain flow is essential for debugging:

PREROUTING -> KUBE-SERVICES -> KUBE-SVC-xxx -> KUBE-SEP-xxx (DNAT)

# Trace the full iptables chain for a specific Service
SERVICE_IP=$(kubectl get svc <service-name> -o jsonpath='{.spec.clusterIP}')
sudo iptables -t nat -L KUBE-SERVICES -n | grep $SERVICE_IP

# Inspect conntrack entries for the Service
sudo conntrack -L -d $SERVICE_IP
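The per-endpoint jump rules inside a KUBE-SVC chain use the iptables statistic module: with N endpoints, rule i (0-based) matches with probability 1/(N-i) and the final rule is unconditional, which makes the overall selection uniform. A sketch of the probabilities kube-proxy programs:

```shell
# For N ready endpoints, kube-proxy emits one KUBE-SEP jump per endpoint;
# rule i matches with probability 1/(N-i), so each endpoint gets 1/N overall.
N=3   # example endpoint count
awk -v n="$N" 'BEGIN {
  for (i = 0; i < n; i++)
    printf "rule %d: -m statistic --mode random --probability %.5f\n", i + 1, 1 / (n - i)
}'
# -> probabilities 0.33333, 0.50000, 1.00000 for three endpoints
```

These are the same numbers you see in the `--probability` matches when listing a KUBE-SVC chain, which is useful for sanity-checking that all endpoints are represented.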

5. Service DNS Resolution Issues

5.1 Verifying CoreDNS Operation

CoreDNS handles all in-cluster DNS resolution in Kubernetes.

# Check CoreDNS Pod status
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Inspect CoreDNS configuration
kubectl get configmap coredns -n kube-system -o yaml

# Test DNS resolution
kubectl exec -it <test-pod> -- nslookup <service-name>
kubectl exec -it <test-pod> -- nslookup <service-name>.<namespace>.svc.cluster.local

# Measure DNS response times
kubectl exec -it <test-pod> -- dig <service-name>.default.svc.cluster.local +stats

5.2 Diagnosing DNS Problems

# Check the Pod's DNS configuration
kubectl exec -it <pod> -- cat /etc/resolv.conf

# Review CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=200

# Deploy a DNS debugging Pod
kubectl run dnsutils --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 \
  --restart=Never -- sleep 3600

# Test external domain resolution
kubectl exec -it dnsutils -- nslookup google.com
kubectl exec -it dnsutils -- nslookup kubernetes.default

Common DNS issues and their resolutions:

  • ndots configuration: The default ndots:5 in resolv.conf causes multiple DNS queries for short names. Using FQDNs (with a trailing .) eliminates unnecessary lookups.
  • CoreDNS CrashLoopBackOff: Check logs and verify upstream DNS server connectivity.
  • DNS caching issues: Review the CoreDNS cache plugin configuration.

# Demonstrate the ndots effect: compare query counts
kubectl exec -it <pod> -- dig myservice.default.svc.cluster.local +search +showsearch
kubectl exec -it <pod> -- dig myservice.default.svc.cluster.local. +search +showsearch
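What dig's `+showsearch` reveals can also be reasoned out by hand: when a name has fewer dots than ndots, the resolver walks the search list before trying the literal name. A sketch of that expansion (the search domains shown are the typical in-cluster defaults for a Pod in the `default` namespace):

```shell
# Emulate the resolver's ndots search expansion for a short name
NAME="myservice"
NDOTS=5
SEARCH="default.svc.cluster.local svc.cluster.local cluster.local"

dots="${NAME//[^.]/}"                         # strip everything but the dots
QUERIES=""
if [ "${#dots}" -lt "$NDOTS" ]; then          # short name: search list goes first
  for domain in $SEARCH; do
    QUERIES="$QUERIES ${NAME}.${domain}."
  done
fi
QUERIES="$QUERIES ${NAME}."                   # the literal name is tried last
for q in $QUERIES; do echo "query: $q"; done  # four queries for one lookup
```

In the real resolver a trailing dot marks the name as absolute and bypasses the search list entirely, which is exactly why FQDNs cut query volume.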

6. External Traffic Debugging

6.1 Ingress / LoadBalancer Troubleshooting

# List all Ingress resources
kubectl get ingress -A
kubectl describe ingress <ingress-name>

# Check Ingress Controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100

# Verify LoadBalancer Service external IPs (spec.type is not a label, so filter the output)
kubectl get svc -A | grep LoadBalancer

# Test NodePort access externally
curl -v http://<node-ip>:<node-port>

6.2 Egress Traffic Debugging

# Verify outbound connectivity from a Pod
kubectl exec -it <pod> -- curl -v https://httpbin.org/ip

# Check NAT gateway and routing
kubectl exec -it <pod> -- traceroute 8.8.8.8

# Inspect SNAT rules
sudo iptables -t nat -L POSTROUTING -n -v

7. Network Policy Debugging

7.1 Basic Network Policy Diagnostics

# List all Network Policies
kubectl get networkpolicy -A
kubectl describe networkpolicy <policy-name> -n <namespace>

# Find Network Policies applying to a specific Pod
kubectl get networkpolicy -n <namespace> -o json | \
  jq '.items[] | select(.spec.podSelector.matchLabels.app == "myapp")'

# Test connectivity before and after policy application
kubectl exec -it <source-pod> -- nc -zv <target-pod-ip> <port>

7.2 Network Policy Troubleshooting Patterns

# Default deny-all policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

# Test connectivity after applying the policy
kubectl exec -it <pod> -- wget -qO- --timeout=2 http://<target-service>
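Once deny-all is confirmed to block traffic, selective allows are layered on top. A minimal sketch, assuming hypothetical app=frontend and app=backend labels and port 8080:

```yaml
# Allow frontend -> backend on TCP 8080 (labels and port are illustrative)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```

Re-run the connectivity test from a frontend Pod and from a Pod without the label to confirm the policy admits exactly the intended traffic.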

# Check CNI-specific Network Policy logs
# Calico
kubectl logs -n calico-system -l k8s-app=calico-node --tail=100

# Cilium
kubectl exec -n kube-system <cilium-pod> -- cilium policy get
kubectl exec -n kube-system <cilium-pod> -- cilium endpoint list

8. Calico Troubleshooting

8.1 Checking Calico Status

# Verify Calico node health
kubectl get pods -n calico-system
calicoctl node status

# Check BGP peer status
calicoctl get bgpPeer
calicoctl node status | grep -A 5 "BGP"

# Inspect IP pools
calicoctl get ippool -o wide

# List Calico-specific network policies
calicoctl get networkpolicy --all-namespaces
calicoctl get globalnetworkpolicy

8.2 Calico Debugging

# Increase Felix log level for debugging
calicoctl patch felixconfiguration default \
  --patch '{"spec":{"logSeverityScreen":"Debug"}}'

# Check BIRD routing daemon status
kubectl exec -n calico-system <calico-node-pod> -- birdcl -s /var/run/calico/bird.ctl show route

# Inspect IP-in-IP tunnel status
kubectl exec -n calico-system <calico-node-pod> -- ip tunnel show

# Review Calico-programmed iptables rules
sudo iptables -L -n | grep cali

9. Cilium Troubleshooting

9.1 Checking Cilium Status

# Get detailed Cilium agent status
kubectl exec -n kube-system <cilium-pod> -- cilium status --verbose

# List all Cilium endpoints
kubectl exec -n kube-system <cilium-pod> -- cilium endpoint list

# Inspect BPF maps
kubectl exec -n kube-system <cilium-pod> -- cilium bpf lb list
kubectl exec -n kube-system <cilium-pod> -- cilium bpf ct list global

# Observe network flows with Hubble
hubble observe --namespace <namespace>
hubble observe --pod <pod-name> --protocol TCP

9.2 Cilium Debugging

# Run the Cilium connectivity test (a cilium CLI command run against the
# cluster from your workstation, not inside the agent Pod)
cilium connectivity test

# Check loaded eBPF programs (bpftool is bundled in the Cilium image)
kubectl exec -n kube-system <cilium-pod> -- bpftool prog show

# Trace policy decisions between Pods
kubectl exec -n kube-system <cilium-pod> -- cilium policy trace \
  --src-k8s-pod <namespace>:<src-pod> \
  --dst-k8s-pod <namespace>:<dst-pod> \
  --dport <port>

# Monitor events in real time
kubectl exec -n kube-system <cilium-pod> -- cilium monitor --type drop
kubectl exec -n kube-system <cilium-pod> -- cilium monitor --type policy-verdict

10. Debug Containers and Ephemeral Containers

10.1 Using kubectl debug

# Attach a debug container to a running Pod
kubectl debug -it <pod-name> --image=nicolaka/netshoot --target=<container-name>

# Debug at the node level
kubectl debug node/<node-name> -it --image=ubuntu

# Create a copy of a Pod for debugging (no impact on original)
kubectl debug <pod-name> -it --copy-to=debug-pod --image=nicolaka/netshoot

# Debug a copy of the Pod with a shared process namespace
# (--share-processes only takes effect together with --copy-to)
kubectl debug <pod-name> -it --copy-to=debug-pod --image=busybox --share-processes

10.2 Network Debugging with netshoot

# Deploy a netshoot container
kubectl run netshoot --image=nicolaka/netshoot --restart=Never -- sleep 3600

# Enter the container for comprehensive diagnostics
kubectl exec -it netshoot -- bash

# Commands to run inside the container:
# TCP connectivity test
nc -zv <service-ip> <port>

# HTTP request test
curl -v http://<service-name>.<namespace>.svc.cluster.local:<port>/health

# DNS lookup
dig +short <service-name>.<namespace>.svc.cluster.local

# Packet capture
tcpdump -i eth0 -nn port 80

# Network path tracing
mtr <target-ip>

# SSL/TLS connection verification
openssl s_client -connect <service-ip>:443

11. Practical Debugging Scenarios

Scenario 1: Pod Cannot Reach a Service

# Step 1: Verify the Service and its Endpoints
kubectl get svc <service-name> -o yaml
kubectl get endpoints <service-name>

# Step 2: If Endpoints are empty, check the selector
kubectl get pods -l <label-selector> --show-labels

# Step 3: Verify target Pods are Ready
kubectl get pods -l <label-selector> -o wide
kubectl describe pod <target-pod> | grep -A 5 "Conditions"

# Step 4: Inspect kube-proxy rules
sudo iptables -t nat -L KUBE-SERVICES -n | grep <cluster-ip>

# Step 5: Test direct Pod IP connectivity
kubectl exec -it <source-pod> -- curl -v http://<pod-ip>:<target-port>

Scenario 2: Intermittent Timeouts

# Step 1: Check conntrack table saturation
sudo sysctl net.netfilter.nf_conntrack_count
sudo sysctl net.netfilter.nf_conntrack_max

# Step 2: Look for packet drops
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- \
  netstat -s | grep -i drop

# Step 3: Check network interface errors
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- \
  ip -s link show

# Step 4: Monitor TCP retransmissions
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- \
  netstat -s | grep -i retrans
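Step 1's two sysctls are most useful as a ratio; a sketch with illustrative numbers (substitute the live values from the commands above):

```shell
# Conntrack pressure check: timeouts commonly appear as the table nears capacity
COUNT=230000        # e.g. from: sysctl -n net.netfilter.nf_conntrack_count
MAX=262144          # e.g. from: sysctl -n net.netfilter.nf_conntrack_max
PCT=$((COUNT * 100 / MAX))
echo "conntrack table is ${PCT}% full"
if [ "$PCT" -ge 80 ]; then    # 80% is a rule-of-thumb threshold, not a hard limit
  echo "WARNING: consider raising net.netfilter.nf_conntrack_max"
fi
```

A full table causes new connections to be dropped silently, which presents exactly as intermittent timeouts.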

Scenario 3: Slow DNS Resolution

# Step 1: Measure DNS response times
kubectl exec -it <pod> -- dig <service-name>.default.svc.cluster.local +stats | grep "Query time"

# Step 2: Check CoreDNS resource usage
kubectl top pods -n kube-system -l k8s-app=kube-dns

# Step 3: Apply ndots optimization
# Add to Pod spec:
# dnsConfig:
#   options:
#   - name: ndots
#     value: "2"

# Step 4: Check CoreDNS cache hit rates (the CoreDNS image is distroless with
# no shell, so scrape the metrics port from a debug Pod instead)
kubectl exec -it netshoot -- \
  wget -qO- http://<coredns-pod-ip>:9153/metrics | grep coredns_cache
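Step 3's commented snippet, written out as a complete (illustrative) Pod spec fragment:

```yaml
# Pod with ndots capped at 2: names with two or more dots are tried as-is first
apiVersion: v1
kind: Pod
metadata:
  name: ndots-tuned          # illustrative name
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
  containers:
    - name: app
      image: nginx           # illustrative image
```

This trades fewer wasted queries for external names against one extra query for unqualified in-cluster names, so verify the workload uses namespaced or fully qualified Service names.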

12. Debugging Checklist

A systematic checklist for approaching network issues:

[ ] Pod status verification (Running, Ready)
[ ] Service Endpoints verification (not empty)
[ ] DNS resolution test (nslookup/dig)
[ ] Direct Pod IP connectivity test
[ ] kube-proxy rules inspection (iptables/IPVS)
[ ] Network Policy review (Ingress/Egress rules)
[ ] CNI plugin health check
[ ] Cross-node network connectivity
[ ] MTU configuration verification
[ ] Firewall / security group rules
[ ] conntrack table status
[ ] CoreDNS health and logs

Network debugging is most effective when approached layer by layer: L2 (Ethernet) -> L3 (IP/routing) -> L4 (TCP/UDP) -> L7 (HTTP/DNS). Verifying each layer in turn lets even the most complex networking issues be diagnosed and resolved methodically.