The Complete Guide to Network Troubleshooting: 6-Part Series Overview

Introduction

Network failures are among the most frequent and challenging issues in service operations. Behind a simple symptom like "the server is not responding" lie a multitude of potential root causes: DNS resolution failures, TCP handshake timeouts, TLS certificate expiration, container network isolation misconfigurations, and more.

This series is designed to help DevOps engineers, SREs, and backend developers take a systematic approach to network troubleshooting, covering each layer from the lower levels of the OSI model all the way up to cloud architecture.

Series Structure

The series consists of 6 parts. Each part can be referenced independently, but reading them in order will give you a comprehensive understanding of the entire network troubleshooting workflow.


Part 1. DNS Troubleshooting Deep Dive

Go to DNS Troubleshooting Deep Dive

DNS is the starting point of almost all network communication. When the translation of domain names to IP addresses fails or returns stale results, the impact can spread across your entire service.

Topics covered:

  • End-to-end DNS resolution flow (recursive vs. iterative queries)
  • Step-by-step diagnosis using dig, nslookup, and host
  • DNS cache issues and TTL-related troubleshooting
  • DNSSEC validation failure debugging
  • Internal DNS server / CoreDNS incident response
  • DNS-based service discovery problem resolution

Useful when you encounter:

  • "The domain suddenly stopped resolving"
  • "Different DNS servers are returning different results"
  • "Kubernetes Pods cannot reach external domains"
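
For the "different DNS servers are returning different results" symptom, a first step is usually to ask several resolvers the same question and compare the answers side by side. The following is a minimal sketch using dig, not the method from Part 1. The function names are hypothetical, example.com is a placeholder, and 1.1.1.1 and 8.8.8.8 are public resolvers chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; adjust resolver addresses and the record type to your environment.

# normalize_answers: sort records and join them with commas so answer order does not matter.
normalize_answers() {
    sort | paste -sd, -
}

# compare_resolvers NAME RESOLVER...: query NAME (A records) against each resolver
# and print one line per resolver with the normalized answer set.
compare_resolvers() {
    name=$1
    shift
    for resolver in "$@"; do
        answer=$(dig +short +time=2 +tries=1 "@$resolver" "$name" A | normalize_answers)
        printf '%-15s %s\n' "$resolver" "${answer:-<no answer>}"
    done
}

# Example (requires network access):
#   compare_resolvers example.com 1.1.1.1 8.8.8.8
```

If the lines differ between resolvers, caching and TTLs are a likely place to look next. Keep in mind that some domains intentionally return different addresses per resolver location.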

Part 2. TCP/IP Connection Debugging

Go to TCP/IP Connection Debugging

TCP/IP is the fundamental protocol stack for internet communication. This part covers how to diagnose problems at each stage: connection establishment, data transfer, and connection teardown.

Topics covered:

  • TCP 3-way handshake / 4-way teardown analysis
  • Packet capture and analysis with tcpdump and Wireshark
  • Socket state inspection using ss and netstat
  • TCP retransmission, window size, and Nagle algorithm issues
  • Resolving TIME_WAIT and CLOSE_WAIT accumulation
  • Packet fragmentation caused by MTU / MSS mismatch

Useful when you encounter:

  • "Connections intermittently time out"
  • "CLOSE_WAIT sockets keep accumulating"
  • "Packets above a certain size fail to transmit"
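
When sockets in states such as CLOSE_WAIT or TIME_WAIT are suspected of piling up, counting sockets per state gives a quick baseline. The sketch below assumes the usual `ss -tan` layout, where the first line is a header and the first column is the state. The function name is hypothetical. Note that ss prints these states with hyphens (CLOSE-WAIT, TIME-WAIT).

```shell
#!/bin/sh
# count_states: read `ss -tan` style output on stdin and print "<count> <state>"
# per TCP state, highest count first. The first input line is treated as the header.
count_states() {
    awk 'NR > 1 { count[$1]++ } END { for (s in count) print count[s], s }' | sort -rn
}

# Example:
#   ss -tan | count_states
```

Running it a few times, several seconds apart, shows whether a state is actually growing or only momentarily high.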

Part 3. HTTP/HTTPS Troubleshooting

Go to HTTP/HTTPS Troubleshooting

This part focuses on diagnosing issues with HTTP/HTTPS, the most widely used application-layer protocols. It covers TLS handshakes, certificate management, HTTP/2, and gRPC-related problems.

Topics covered:

  • Root cause analysis by HTTP status code (deep dive into 4xx, 5xx)
  • TLS/SSL handshake process and certificate chain verification
  • HTTPS debugging with curl and openssl s_client
  • HTTP/2 protocol issues (stream multiplexing, HPACK)
  • CORS, redirect loops, and proxy configuration problems
  • Let's Encrypt / ACME automatic certificate renewal failures

Useful when you encounter:

  • "SSL handshake is failing"
  • "Certificate has expired and the service is unreachable"
  • "Errors appear on certain clients after switching to HTTP/2"
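
For the expired-certificate symptom, it can help to check how long a certificate remains valid before it causes an outage. The following is a rough sketch built on `openssl s_client` and `openssl x509 -checkend`. The function names and the 14-day threshold are assumptions for illustration, and example.com is a placeholder host.

```shell
#!/bin/sh
# cert_valid_for FILE SECONDS: succeed if the PEM certificate in FILE
# will still be valid SECONDS from now; fail if it expires before then.
cert_valid_for() {
    openssl x509 -in "$1" -noout -checkend "$2" >/dev/null
}

# fetch_cert HOST [PORT]: print the server's leaf certificate in PEM form.
# Requires network access; PORT defaults to 443.
fetch_cert() {
    host=$1
    port=${2:-443}
    openssl s_client -connect "$host:$port" -servername "$host" </dev/null 2>/dev/null |
        openssl x509 -outform PEM
}

# Example: warn if the certificate expires within 14 days (1209600 seconds).
#   fetch_cert example.com > /tmp/leaf.pem
#   cert_valid_for /tmp/leaf.pem 1209600 || echo "certificate expires within 14 days"
```

This only inspects the leaf certificate. Chain and hostname problems need a separate look at the full `openssl s_client` output.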

Part 4. Network Performance Analysis

Go to Network Performance Analysis

When network problems manifest as performance degradation rather than outright failures, quantitative analysis is required. This part covers how to measure bandwidth, latency, and packet loss, and how to identify bottlenecks.

Topics covered:

  • Bandwidth measurement and benchmarking with iperf3
  • Network path analysis using mtr and traceroute
  • Understanding latency vs. throughput
  • TCP window scaling and buffer tuning
  • QoS policies and traffic shaping
  • Building network monitoring with Prometheus + Grafana

Useful when you encounter:

  • "API responses are slower than usual"
  • "There is high latency between specific regions"
  • "Packet loss occurs when traffic spikes"

Part 5. Container / Kubernetes Network Debugging

Go to Container/K8s Network Debugging

Container environments add another networking layer. This part helps you understand and debug container and Kubernetes network constructs, including veth pairs, bridge networks, CNI plugins, Services, and Ingress.

Topics covered:

  • Comparison of Docker network modes (bridge, host, overlay)
  • Kubernetes network model and CNI plugins (Calico, Cilium, Flannel)
  • Tracing Pod-to-Pod, Service, and external traffic flows
  • Network isolation and debugging with NetworkPolicy
  • Ingress Controller troubleshooting (Nginx, Traefik)
  • Entering network namespaces with kubectl debug and nsenter
  • eBPF-based network observability (Cilium Hubble)

Useful when you encounter:

  • "A Pod cannot communicate with another Pod"
  • "Cannot access the Service via its ClusterIP"
  • "External traffic through Ingress returns 504 on certain paths"

Part 6. Cloud Network Architecture Troubleshooting

Go to Cloud Network Troubleshooting

Networking in cloud environments such as AWS, GCP, and Azure introduces abstracted layers including VPCs, subnets, security groups, and routing tables. This part covers how to diagnose cloud-specific network issues.

Topics covered:

  • VPC design principles and subnet architecture strategies
  • Security Group / NACL rule debugging
  • VPC Peering, Transit Gateway, and PrivateLink connectivity issues
  • Route Table and Internet Gateway / NAT Gateway configuration validation
  • Traffic analysis with AWS VPC Flow Logs / GCP Flow Logs
  • Multi-region and hybrid cloud network troubleshooting
  • DNS integration (Route 53 Private Hosted Zones, Cloud DNS)

Useful when you encounter:

  • "Resources in the peered VPC are unreachable after setting up VPC Peering"
  • "Instances in a private subnet cannot access the internet"
  • "The Security Group is open but the connection is being refused"

Series Learning Roadmap

For the most effective learning experience, we recommend the following progression.

                          ┌───────────────────────────────┐
                          │  Part 1. DNS Troubleshooting  │
                          └──────────────┬────────────────┘
                          ┌──────────────▼────────────────┐
                          │   Part 2. TCP/IP Debugging    │
                          └──────────────┬────────────────┘
                          ┌──────────────▼────────────────┐
                          │      Part 3. HTTP/HTTPS       │
                          └──────────────┬────────────────┘
                          ┌──────────────▼────────────────┐
                          │ Part 4. Performance Analysis  │
                          └──────────────┬────────────────┘
                     ┌───────────────────┴─────────────────────┐
                     │                                         │
          ┌──────────▼───────────┐              ┌──────────────▼──────────┐
          │ Part 5. Container/K8s│              │  Part 6. Cloud Network  │
          └──────────────────────┘              └─────────────────────────┘

Role                      Recommended Path
Backend Developer         Part 1 → Part 3 → Part 2 → Part 4
DevOps / SRE              Part 1 → Part 2 → Part 3 → Part 4 → Part 5 → Part 6
Cloud Engineer            Part 1 → Part 2 → Part 6 → Part 5
Kubernetes Administrator  Part 1 → Part 2 → Part 5 → Part 4

Prerequisites

The following environments and tools are needed to follow along with the series.

Required Tools

Tool              Purpose                     Installation Check
dig / nslookup    DNS diagnostics             dig -v
curl              HTTP request testing        curl --version
tcpdump           Packet capture              tcpdump --version
ss / netstat      Socket state inspection     ss -v
traceroute / mtr  Route tracing               mtr --version
iperf3            Bandwidth measurement       iperf3 --version
openssl           TLS certificate inspection  openssl version

Tool           Purpose
Wireshark      GUI-based packet analysis
kubectl        Kubernetes cluster management
nsenter        Network namespace entry
Cilium Hubble  eBPF-based network observability

Lab Environment

  • Linux-based server (Ubuntu 22.04 / Rocky Linux 9 recommended)
  • Docker and Docker Compose
  • Kubernetes cluster (minikube, kind, or a managed cluster)
  • Cloud account (AWS Free Tier or GCP Free Tier)

Core Troubleshooting Principles

Before diving into the individual parts, here are the fundamental troubleshooting principles that apply throughout the entire series.

1. Work from the Bottom Up

Always verify network issues from the lowest layer upward.

Physical → Data Link → Network (IP) → Transport (TCP/UDP) → Application (HTTP)

Analyzing HTTP response codes when DNS resolution is failing is a waste of time.
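
The bottom-up order can be sketched as a small runner that executes one check per layer and stops at the first failure, so that higher-layer checks never run against a broken foundation. Everything named here is an assumption for illustration: the check_* functions, the TARGET_HOST and TARGET_PORT defaults, the 1.1.1.1 probe address, and the availability of ip, getent, nc, and curl on the host.

```shell
#!/bin/sh
# Hypothetical target; override via the environment.
TARGET_HOST=${TARGET_HOST:-example.com}
TARGET_PORT=${TARGET_PORT:-443}

# One illustrative check per layer. Each returns 0 on success.
check_route() { ip route get 1.1.1.1 >/dev/null 2>&1; }                      # Network: is there a route out?
check_dns()   { getent hosts "$TARGET_HOST" >/dev/null 2>&1; }               # Name resolution
check_tcp()   { nc -z -w 3 "$TARGET_HOST" "$TARGET_PORT" >/dev/null 2>&1; }  # Transport: TCP connect
check_http()  { curl -s -o /dev/null --max-time 5 "https://$TARGET_HOST/"; } # Application: HTTPS request

# run_bottom_up CHECK...: run the checks in the given order,
# print the result of each, and stop at the first failing one.
run_bottom_up() {
    for check in "$@"; do
        if "$check"; then
            echo "ok   $check"
        else
            echo "FAIL $check"
            return 1
        fi
    done
}

# Example:
#   run_bottom_up check_route check_dns check_tcp check_http
```

The first FAIL line names the layer to investigate. Anything listed after it was deliberately not attempted.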

2. Identify What Changed

Determining what changed just before the problem occurred is often the fastest path to diagnosis.

  • Was there a deployment?
  • Were infrastructure changes made?
  • Was a certificate due for renewal, or did one expire?
  • Were DNS records modified?
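
One way to make "what changed" answerable is to keep a snapshot of network state from a known-good moment and compare it when something breaks. This is a minimal sketch under the assumption that routes, resolver configuration, and listening sockets are the state of interest. The function names and file paths are hypothetical.

```shell
#!/bin/sh
# snapshot_net FILE: record routing, resolver, and listening-socket state to FILE.
snapshot_net() {
    {
        echo "## routes"
        ip route show 2>/dev/null
        echo "## resolvers"
        cat /etc/resolv.conf 2>/dev/null
        echo "## listeners"
        ss -tln 2>/dev/null
    } > "$1"
}

# changed_lines BEFORE AFTER: print lines removed (-) or added (+) between two snapshots.
changed_lines() {
    diff "$1" "$2" | sed -n 's/^> /+ /p; s/^< /- /p'
}

# Example:
#   snapshot_net /var/tmp/net-good.txt        # taken while everything works
#   snapshot_net /var/tmp/net-now.txt         # taken during the incident
#   changed_lines /var/tmp/net-good.txt /var/tmp/net-now.txt
```

Deployment history and DNS record changes live outside the host, so this covers only the local part of the checklist above.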

3. Isolate and Reproduce

If you can reproduce the problem, you are halfway to solving it.

  • Does it occur on a specific server only?
  • Does it occur only at certain times?
  • Does it occur only with certain request patterns?
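
Intermittent problems are easier to pin to a time window or request pattern when a probe runs repeatedly and every failure is timestamped. The sketch below wraps an arbitrary command. The function name is hypothetical, and the health-check URL in the usage comment is a placeholder.

```shell
#!/bin/sh
# probe_repeat COUNT INTERVAL COMMAND...: run COMMAND COUNT times, INTERVAL seconds apart.
# Each failed attempt is logged with a UTC timestamp; the last line is a summary.
probe_repeat() {
    count=$1
    interval=$2
    shift 2
    failures=0
    i=1
    while [ "$i" -le "$count" ]; do
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
            echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) attempt=$i failed"
        fi
        i=$((i + 1))
        # Sleep between attempts, but not after the last one.
        if [ "$i" -le "$count" ]; then
            sleep "$interval"
        fi
    done
    echo "failures=$failures total=$count"
}

# Example: probe a hypothetical endpoint every 5 seconds, 60 times.
#   probe_repeat 60 5 curl -sf --max-time 3 https://example.com/health
```

Running the same probe from two servers at once helps answer whether the problem is specific to one host.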

4. Rely on Logs and Metrics

Make decisions based on data, not assumptions.

# Example: check kernel logs for network-related events
dmesg | grep -i -E "net|eth|tcp|drop"

# Example: system network statistics
cat /proc/net/snmp | grep -i tcp
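
The raw counters in /proc/net/snmp are easier to act on once reduced to a ratio. As a rough sketch, the helper below reads the two Tcp: lines (a header line of field names followed by a line of values) and reports retransmitted segments as a share of segments sent. The function name is hypothetical, and the counters are cumulative since boot, so comparing two readings taken some time apart is more telling than a single value.

```shell
#!/bin/sh
# tcp_retrans_ratio [FILE]: print OutSegs, RetransSegs, and their ratio
# from an snmp-style counters file (default: /proc/net/snmp).
tcp_retrans_ratio() {
    awk '
        # First "Tcp:" line: remember the column index of each field name.
        $1 == "Tcp:" && !seen { for (i = 2; i <= NF; i++) col[$i] = i; seen = 1; next }
        # Second "Tcp:" line: look up the values by field name.
        $1 == "Tcp:" {
            out = $(col["OutSegs"])
            retrans = $(col["RetransSegs"])
            if (out > 0)
                printf "OutSegs=%s RetransSegs=%s ratio=%.2f%%\n", out, retrans, 100 * retrans / out
        }
    ' "${1:-/proc/net/snmp}"
}

# Example:
#   tcp_retrans_ratio
```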

Part    Title                            Link
Part 1  DNS Troubleshooting Deep Dive    Read
Part 2  TCP/IP Connection Debugging      Read
Part 3  HTTP/HTTPS Troubleshooting       Read
Part 4  Network Performance Analysis     Read
Part 5  Container/K8s Network Debugging  Read
Part 6  Cloud Network Troubleshooting    Read

Each part is designed to be used as a standalone reference. If you are facing a specific issue, jump directly to the relevant part. If you want to build comprehensive network troubleshooting skills, we recommend reading from Part 1 in sequence.