The Complete Guide to Container & Kubernetes Network Debugging

1. Docker Networking Models

Effective container network debugging starts with a solid understanding of the networking models Docker provides.

1.1 Bridge Network

The bridge driver is Docker's default network driver. Each container receives a virtual Ethernet (veth) pair whose host end connects to the docker0 bridge.

# Inspect the bridge network details
docker network inspect bridge

# Check a container's IP address
docker inspect --format '{{.NetworkSettings.IPAddress}}' <container_id>

# List veth pairs
ip link show type veth

# Show interfaces connected to docker0 (brctl is from the legacy bridge-utils
# package; "ip link show master docker0" works without it)
brctl show docker0

Common issues with bridge networking include:

  • Container-to-container communication failure: Verify that both containers are on the same bridge network.
  • No external connectivity: Check iptables NAT rules and IP forwarding settings.
  • Port conflicts: Inspect host port binding overlaps.

# Check iptables NAT rules
sudo iptables -t nat -L -n -v

# Verify IP forwarding is enabled
cat /proc/sys/net/ipv4/ip_forward

# Check port bindings
docker port <container_id>

1.2 Host Network

The container shares the host's network stack directly. There is no network isolation, which yields better performance but introduces port conflict risks.

# Run a container in host network mode
docker run --network host nginx

# Verify the container sees host interfaces
docker exec <container_id> ip addr show

# Confirm shared network stack
docker exec <container_id> ss -tlnp

1.3 Overlay Network

Overlay networks enable communication between containers across multiple Docker hosts. They require Docker Swarm or an external key-value store.

# Create an overlay network
docker network create --driver overlay my-overlay

# Check VXLAN tunnel status
ip -d link show type vxlan

# Inspect overlay network peer information
docker network inspect my-overlay --format '{{json .Peers}}'

Key considerations when debugging overlay networks:

  • Ensure the VXLAN port (UDP 4789) is allowed through firewalls
  • Verify MTU settings account for VXLAN overhead (50 bytes)
  • Check time synchronization between nodes
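The 50-byte figure is the sum of the outer headers VXLAN adds: Ethernet (14) + IP (20) + UDP (8) + VXLAN (8). A quick sketch for deriving the overlay MTU from the host MTU:

```shell
# VXLAN overhead: outer Ethernet (14) + outer IP (20) + outer UDP (8) + VXLAN (8)
HOST_MTU=1500                         # physical NIC MTU; adjust for your environment
VXLAN_OVERHEAD=$((14 + 20 + 8 + 8))   # = 50 bytes
CONTAINER_MTU=$((HOST_MTU - VXLAN_OVERHEAD))
echo "overlay interfaces should use MTU <= ${CONTAINER_MTU}"   # 1450 for a 1500-byte host MTU
```

If the underlay itself is encapsulated (as in some cloud networks), subtract that overhead as well.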

2. Kubernetes Networking Model and CNI

2.1 Core Principles of Kubernetes Networking

The Kubernetes networking model is built on three fundamental requirements:

  1. Every Pod can communicate with every other Pod without NAT
  2. Every node can communicate with every Pod without NAT
  3. The IP a Pod sees for itself is the same IP that other Pods see for it

# Check the cluster network CIDR
kubectl cluster-info dump | grep -m 1 cluster-cidr

# Get Pod CIDRs assigned to each node (one node per line)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'

# Identify the CNI plugin in use
ls /etc/cni/net.d/
cat /etc/cni/net.d/*.conflist

2.2 Understanding CNI Plugins

CNI (Container Network Interface) is the standard for configuring network interfaces in containers.

# Locate CNI binaries
ls /opt/cni/bin/

# Inspect the CNI configuration
cat /etc/cni/net.d/10-calico.conflist

# Check kubelet logs for CNI-related events
journalctl -u kubelet | grep -i cni
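The conflist that kubelet hands to the container runtime is plain JSON. The sketch below fabricates a minimal example (the file name, network name, and plugin chain are illustrative, not taken from any particular CNI) and extracts the plugin types from it:

```shell
# Write an illustrative conflist (real ones live in /etc/cni/net.d/)
cat > /tmp/10-example.conflist <<'EOF'
{
  "cniVersion": "0.3.1",
  "name": "example-net",
  "plugins": [
    { "type": "bridge", "bridge": "cni0", "ipam": { "type": "host-local" } },
    { "type": "portmap", "capabilities": { "portMappings": true } }
  ]
}
EOF

# List every plugin type in the chain (including the nested IPAM plugin)
grep -o '"type": *"[a-z-]*"' /tmp/10-example.conflist | cut -d'"' -f4
# -> bridge, host-local, portmap
```

Plugins in the `plugins` array run as a chain: the main plugin wires the interface, and helpers like portmap handle host port mappings.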

Comparison of popular CNI plugins:

CNI Plugin   Data Plane      Network Policy   Key Feature
Calico       iptables/eBPF   Supported        BGP-based routing
Cilium       eBPF            Supported        L7 policies, Hubble observability
Flannel      VXLAN/host-gw   Not supported    Simple configuration
Weave        VXLAN           Supported        Automatic mesh networking

3. Pod-to-Pod Communication Debugging

3.1 Same-Node Pod Communication

# Get Pod IPs
kubectl get pods -o wide

# Ping from one Pod to another
kubectl exec -it <pod-a> -- ping <pod-b-ip>

# Trace the route between Pods on the same node
kubectl exec -it <pod-a> -- traceroute <pod-b-ip>

# Inspect veth interfaces on the node
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- ip link show

3.2 Cross-Node Pod Communication

# Check the routing table on the node
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- route -n

# Verify VXLAN or IPIP tunnel status
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- ip tunnel show

# Capture packets with tcpdump
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- \
  tcpdump -i any -nn host <target-pod-ip>

# Test for MTU issues (Path MTU Discovery)
kubectl exec -it <pod> -- ping -M do -s 1400 <target-pod-ip>

Common causes of cross-node communication failures:

  • Cloud security groups blocking Pod CIDR traffic
  • MTU mismatch causing packet fragmentation
  • Inconsistent routing tables across nodes
  • IPIP/VXLAN tunnel interfaces being down

4. Pod-to-Service Communication Debugging

4.1 Verifying Service Basics

# Check the Service's ClusterIP and Endpoints
kubectl get svc <service-name> -o wide
kubectl get endpoints <service-name>

# Inspect the Endpoints in detail
kubectl get endpoints <service-name> -o yaml

# Test connectivity to the Service
kubectl exec -it <test-pod> -- curl -v http://<service-name>:<port>

# Verify Pods matching the Service selector
kubectl get pods -l <label-selector>

4.2 kube-proxy and iptables/IPVS Rules

kube-proxy translates Service virtual IPs into actual Pod IPs.

# Determine the kube-proxy mode
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode

# iptables mode: inspect Service-related rules
sudo iptables -t nat -L KUBE-SERVICES -n
sudo iptables -t nat -L KUBE-SVC-<hash> -n
sudo iptables -t nat -L KUBE-SEP-<hash> -n

# IPVS mode: list virtual servers
sudo ipvsadm -Ln
sudo ipvsadm -Ln -t <cluster-ip>:<port>

# Check kube-proxy logs
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=100

Understanding the iptables chain flow is essential for debugging:

PREROUTING -> KUBE-SERVICES -> KUBE-SVC-xxx -> KUBE-SEP-xxx (DNAT)

# Trace the full iptables chain for a specific Service
SERVICE_IP=$(kubectl get svc <service-name> -o jsonpath='{.spec.clusterIP}')
sudo iptables -t nat -L KUBE-SERVICES -n | grep $SERVICE_IP

# Inspect conntrack entries for the Service
sudo conntrack -L -d $SERVICE_IP
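The per-endpoint jump rules inside a KUBE-SVC chain use the iptables statistic module: with N endpoints, rule i (0-based) matches with probability 1/(N-i) and the final rule is unconditional, which makes the overall selection uniform. A sketch of the probabilities kube-proxy programs:

```shell
# For N ready endpoints, kube-proxy emits one KUBE-SEP jump per endpoint;
# rule i matches with probability 1/(N-i), so each endpoint gets 1/N overall.
N=3   # example endpoint count
awk -v n="$N" 'BEGIN {
  for (i = 0; i < n; i++)
    printf "rule %d: -m statistic --mode random --probability %.5f\n", i + 1, 1 / (n - i)
}'
# -> probabilities 0.33333, 0.50000, 1.00000 for three endpoints
```

These are the same numbers you see in the `--probability` matches when listing a KUBE-SVC chain, which is useful for sanity-checking that all endpoints are represented.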

5. Service DNS Resolution Issues

5.1 Verifying CoreDNS Operation

CoreDNS handles all in-cluster DNS resolution in Kubernetes.

# Check CoreDNS Pod status
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Inspect CoreDNS configuration
kubectl get configmap coredns -n kube-system -o yaml

# Test DNS resolution
kubectl exec -it <test-pod> -- nslookup <service-name>
kubectl exec -it <test-pod> -- nslookup <service-name>.<namespace>.svc.cluster.local

# Measure DNS response times
kubectl exec -it <test-pod> -- dig <service-name>.default.svc.cluster.local +stats

5.2 Diagnosing DNS Problems

# Check the Pod's DNS configuration
kubectl exec -it <pod> -- cat /etc/resolv.conf

# Review CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=200

# Deploy a DNS debugging Pod
kubectl run dnsutils --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 \
  --restart=Never -- sleep 3600

# Test external domain resolution
kubectl exec -it dnsutils -- nslookup google.com
kubectl exec -it dnsutils -- nslookup kubernetes.default

Common DNS issues and their resolutions:

  • ndots configuration: The default ndots:5 in resolv.conf causes multiple DNS queries for short names. Using FQDNs (with a trailing .) eliminates unnecessary lookups.
  • CoreDNS CrashLoopBackOff: Check logs and verify upstream DNS server connectivity.
  • DNS caching issues: Review the CoreDNS cache plugin configuration.

# Demonstrate the ndots effect: compare query counts
kubectl exec -it <pod> -- dig myservice.default.svc.cluster.local +search +showsearch
kubectl exec -it <pod> -- dig myservice.default.svc.cluster.local. +search +showsearch
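What dig's `+showsearch` reveals can also be reasoned out by hand: when a name has fewer dots than ndots, the resolver walks the search list before trying the literal name. A sketch of that expansion (the search domains shown are the typical in-cluster defaults for a Pod in the `default` namespace):

```shell
# Emulate the resolver's ndots search expansion for a short name
NAME="myservice"
NDOTS=5
SEARCH="default.svc.cluster.local svc.cluster.local cluster.local"

dots="${NAME//[^.]/}"                         # strip everything but the dots
QUERIES=""
if [ "${#dots}" -lt "$NDOTS" ]; then          # short name: search list goes first
  for domain in $SEARCH; do
    QUERIES="$QUERIES ${NAME}.${domain}."
  done
fi
QUERIES="$QUERIES ${NAME}."                   # the literal name is tried last
for q in $QUERIES; do echo "query: $q"; done  # four queries for one lookup
```

In the real resolver a trailing dot marks the name as absolute and bypasses the search list entirely, which is exactly why FQDNs cut query volume.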

6. External Traffic Debugging

6.1 Ingress / LoadBalancer Troubleshooting

# List all Ingress resources
kubectl get ingress -A
kubectl describe ingress <ingress-name>

# Check Ingress Controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100

# Verify LoadBalancer Service external IPs (spec.type is not a label, so filter the output)
kubectl get svc -A | grep LoadBalancer

# Test NodePort access externally
curl -v http://<node-ip>:<node-port>

6.2 Egress Traffic Debugging

# Verify outbound connectivity from a Pod
kubectl exec -it <pod> -- curl -v https://httpbin.org/ip

# Check NAT gateway and routing
kubectl exec -it <pod> -- traceroute 8.8.8.8

# Inspect SNAT rules
sudo iptables -t nat -L POSTROUTING -n -v

7. Network Policy Debugging

7.1 Basic Network Policy Diagnostics

# List all Network Policies
kubectl get networkpolicy -A
kubectl describe networkpolicy <policy-name> -n <namespace>

# Find Network Policies applying to a specific Pod
kubectl get networkpolicy -n <namespace> -o json | \
  jq '.items[] | select(.spec.podSelector.matchLabels.app == "myapp")'

# Test connectivity before and after policy application
kubectl exec -it <source-pod> -- nc -zv <target-pod-ip> <port>

7.2 Network Policy Troubleshooting Patterns

# Default deny-all policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

# Test connectivity after applying the policy
kubectl exec -it <pod> -- wget -qO- --timeout=2 http://<target-service>
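Once deny-all is confirmed to block traffic, selective allows are layered on top. A minimal sketch, assuming hypothetical app=frontend and app=backend labels and port 8080:

```yaml
# Allow frontend -> backend on TCP 8080 (labels and port are illustrative)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```

Re-run the connectivity test from a frontend Pod and from a Pod without the label to confirm the policy admits exactly the intended traffic.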

# Check CNI-specific Network Policy logs
# Calico
kubectl logs -n calico-system -l k8s-app=calico-node --tail=100

# Cilium
kubectl exec -n kube-system <cilium-pod> -- cilium policy get
kubectl exec -n kube-system <cilium-pod> -- cilium endpoint list

8. Calico Troubleshooting

8.1 Checking Calico Status

# Verify Calico node health
kubectl get pods -n calico-system
calicoctl node status

# Check BGP peer status
calicoctl get bgpPeer
calicoctl node status | grep -A 5 "BGP"

# Inspect IP pools
calicoctl get ippool -o wide

# List Calico-specific network policies
calicoctl get networkpolicy --all-namespaces
calicoctl get globalnetworkpolicy

8.2 Calico Debugging

# Increase Felix log level for debugging
calicoctl patch felixconfiguration default \
  --patch '{"spec":{"logSeverityScreen":"Debug"}}'

# Check BIRD routing daemon status
kubectl exec -n calico-system <calico-node-pod> -- birdcl -s /var/run/calico/bird.ctl show route

# Inspect IP-in-IP tunnel status
kubectl exec -n calico-system <calico-node-pod> -- ip tunnel show

# Review Calico-programmed iptables rules
sudo iptables -L -n | grep cali

9. Cilium Troubleshooting

9.1 Checking Cilium Status

# Get detailed Cilium agent status
kubectl exec -n kube-system <cilium-pod> -- cilium status --verbose

# List all Cilium endpoints
kubectl exec -n kube-system <cilium-pod> -- cilium endpoint list

# Inspect BPF maps
kubectl exec -n kube-system <cilium-pod> -- cilium bpf lb list
kubectl exec -n kube-system <cilium-pod> -- cilium bpf ct list global

# Observe network flows with Hubble
hubble observe --namespace <namespace>
hubble observe --pod <pod-name> --protocol TCP

9.2 Cilium Debugging

# Run the Cilium connectivity test (a cilium CLI command run against the
# cluster from your workstation, not inside the agent Pod)
cilium connectivity test

# Check loaded eBPF programs (bpftool is bundled in the Cilium image)
kubectl exec -n kube-system <cilium-pod> -- bpftool prog show

# Trace policy decisions between Pods
kubectl exec -n kube-system <cilium-pod> -- cilium policy trace \
  --src-k8s-pod <namespace>:<src-pod> \
  --dst-k8s-pod <namespace>:<dst-pod> \
  --dport <port>

# Monitor events in real time
kubectl exec -n kube-system <cilium-pod> -- cilium monitor --type drop
kubectl exec -n kube-system <cilium-pod> -- cilium monitor --type policy-verdict

10. Debug Containers and Ephemeral Containers

10.1 Using kubectl debug

# Attach a debug container to a running Pod
kubectl debug -it <pod-name> --image=nicolaka/netshoot --target=<container-name>

# Debug at the node level
kubectl debug node/<node-name> -it --image=ubuntu

# Create a copy of a Pod for debugging (no impact on original)
kubectl debug <pod-name> -it --copy-to=debug-pod --image=nicolaka/netshoot

# Debug a copy of the Pod with a shared process namespace
# (--share-processes only takes effect together with --copy-to)
kubectl debug <pod-name> -it --copy-to=debug-pod --image=busybox --share-processes

10.2 Network Debugging with netshoot

# Deploy a netshoot container
kubectl run netshoot --image=nicolaka/netshoot --restart=Never -- sleep 3600

# Enter the container for comprehensive diagnostics
kubectl exec -it netshoot -- bash

# Commands to run inside the container:
# TCP connectivity test
nc -zv <service-ip> <port>

# HTTP request test
curl -v http://<service-name>.<namespace>.svc.cluster.local:<port>/health

# DNS lookup
dig +short <service-name>.<namespace>.svc.cluster.local

# Packet capture
tcpdump -i eth0 -nn port 80

# Network path tracing
mtr <target-ip>

# SSL/TLS connection verification
openssl s_client -connect <service-ip>:443

11. Practical Debugging Scenarios

Scenario 1: Pod Cannot Reach a Service

# Step 1: Verify the Service and its Endpoints
kubectl get svc <service-name> -o yaml
kubectl get endpoints <service-name>

# Step 2: If Endpoints are empty, check the selector
kubectl get pods -l <label-selector> --show-labels

# Step 3: Verify target Pods are Ready
kubectl get pods -l <label-selector> -o wide
kubectl describe pod <target-pod> | grep -A 5 "Conditions"

# Step 4: Inspect kube-proxy rules
sudo iptables -t nat -L KUBE-SERVICES -n | grep <cluster-ip>

# Step 5: Test direct Pod IP connectivity
kubectl exec -it <source-pod> -- curl -v http://<pod-ip>:<target-port>

Scenario 2: Intermittent Timeouts

# Step 1: Check conntrack table saturation
sudo sysctl net.netfilter.nf_conntrack_count
sudo sysctl net.netfilter.nf_conntrack_max

# Step 2: Look for packet drops
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- \
  netstat -s | grep -i drop

# Step 3: Check network interface errors
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- \
  ip -s link show

# Step 4: Monitor TCP retransmissions
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- \
  netstat -s | grep -i retrans
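Step 1's two sysctls are most useful as a ratio; a sketch with illustrative numbers (substitute the live values from the commands above):

```shell
# Conntrack pressure check: timeouts commonly appear as the table nears capacity
COUNT=230000        # e.g. from: sysctl -n net.netfilter.nf_conntrack_count
MAX=262144          # e.g. from: sysctl -n net.netfilter.nf_conntrack_max
PCT=$((COUNT * 100 / MAX))
echo "conntrack table is ${PCT}% full"
if [ "$PCT" -ge 80 ]; then    # 80% is a rule-of-thumb threshold, not a hard limit
  echo "WARNING: consider raising net.netfilter.nf_conntrack_max"
fi
```

A full table causes new connections to be dropped silently, which presents exactly as intermittent timeouts.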

Scenario 3: Slow DNS Resolution

# Step 1: Measure DNS response times
kubectl exec -it <pod> -- dig <service-name>.default.svc.cluster.local +stats | grep "Query time"

# Step 2: Check CoreDNS resource usage
kubectl top pods -n kube-system -l k8s-app=kube-dns

# Step 3: Apply ndots optimization
# Add to Pod spec:
# dnsConfig:
#   options:
#   - name: ndots
#     value: "2"

# Step 4: Check CoreDNS cache hit rates (the CoreDNS image is distroless with
# no shell, so scrape the metrics port from a debug Pod instead)
kubectl exec -it netshoot -- \
  wget -qO- http://<coredns-pod-ip>:9153/metrics | grep coredns_cache
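Step 3's commented snippet, written out as a complete (illustrative) Pod spec fragment:

```yaml
# Pod with ndots capped at 2: names with two or more dots are tried as-is first
apiVersion: v1
kind: Pod
metadata:
  name: ndots-tuned          # illustrative name
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
  containers:
    - name: app
      image: nginx           # illustrative image
```

This trades fewer wasted queries for external names against one extra query for unqualified in-cluster names, so verify the workload uses namespaced or fully qualified Service names.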

12. Debugging Checklist

A systematic checklist for approaching network issues:

[ ] Pod status verification (Running, Ready)
[ ] Service Endpoints verification (not empty)
[ ] DNS resolution test (nslookup/dig)
[ ] Direct Pod IP connectivity test
[ ] kube-proxy rules inspection (iptables/IPVS)
[ ] Network Policy review (Ingress/Egress rules)
[ ] CNI plugin health check
[ ] Cross-node network connectivity
[ ] MTU configuration verification
[ ] Firewall / security group rules
[ ] conntrack table status
[ ] CoreDNS health and logs

Network debugging is most effective when approached layer by layer: L2 (Ethernet) -> L3 (IP/routing) -> L4 (TCP/UDP) -> L7 (HTTP/DNS). Verifying each layer in turn lets even the most complex networking issues be diagnosed and resolved methodically.