TCP Network Stack Deep Dive — State Machine, Congestion Control (Cubic vs BBR), Nagle, Delayed ACK, and the Evolution to QUIC (2025)
Author: Youngju Kim (@fjvbn20031)
0. The TCP You Think You Know
Every HTTP request rides on TCP. Yet consider:
- Why does TIME_WAIT in ss -tan sit around for 30 seconds?
- What distinguishes "connection reset by peer" from "broken pipe"?
- Why can sending 100 messages of 1 KB each take 4 seconds?
- Why does iperf deliver 1Gbps on a 10Gbps link?
- Why did Google build QUIC on UDP instead of TCP?
The answers live below: TCP state machine, congestion control evolution, infamous interaction bugs, and the QUIC era.
1. TCP State Machine — 11 States
1.1 Connection Setup: 3-Way Handshake
Client                             Server
  |  SYN (seq=x)                     |
  |--------------------------------->|  [LISTEN -> SYN_RCVD]
  |  SYN+ACK (seq=y, ack=x+1)        |
  |<---------------------------------|
  |  ACK (ack=y+1)                   |
  |--------------------------------->|  [SYN_RCVD -> ESTABLISHED]
Why three, not two? Bidirectional ISN (Initial Sequence Number) synchronization. Each direction must announce its own ISN and receive acknowledgment.
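For orientation, here is how those transitions line up with the sockets API. A minimal client sketch; the loopback address and port are placeholders and error handling is omitted:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080);                 /* placeholder port */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    /* connect() sends the SYN (client: CLOSED -> SYN_SENT) and returns once
       the SYN+ACK arrives and the final ACK is sent (-> ESTABLISHED).
       On the server, listen() created the LISTEN socket; the kernel drove
       LISTEN -> SYN_RCVD -> ESTABLISHED before accept() handed the socket over. */
    connect(fd, (struct sockaddr *)&addr, sizeof addr);
    close(fd);
    return 0;
}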
1.2 Connection Teardown: 4-Way Handshake
A                                  B
  |  FIN                             |
  |--------------------------------->|  [ESTABLISHED -> CLOSE_WAIT]
  |  ACK                             |
  |<---------------------------------|  [FIN_WAIT_1 -> FIN_WAIT_2]
  |  FIN                             |
  |<---------------------------------|  [CLOSE_WAIT -> LAST_ACK]
  |  ACK                             |
  |--------------------------------->|  [LAST_ACK -> CLOSED]
  [FIN_WAIT_2 -> TIME_WAIT]
Four, because TCP is full-duplex. Each direction closes independently.
1.3 TIME_WAIT — Misunderstood
The closing side waits 2*MSL (twice the Maximum Segment Lifetime), typically 30 seconds to 2 minutes; Linux hardcodes the wait at 60 seconds. Reasons:
- Prevent stray packets from joining a new connection: quickly reopened port pairs could receive delayed packets from the previous connection.
- Handle lost final ACK: if the peer retransmits FIN, we must respond with ACK; a CLOSED state would reply with RST.
1.4 TIME_WAIT Explosion
Short-lived outbound connections pile up tens of thousands of sockets in TIME_WAIT, exhausting ephemeral ports.
Wrong fix: forcing TIME_WAIT to 0 (unsafe, risks data corruption).
Right fixes:
- net.ipv4.tcp_tw_reuse=1: safely reuse ephemeral ports stuck in TIME_WAIT for outbound connections.
- Keep-Alive connections (HTTP/1.1 persistent, HTTP/2 multiplexing).
- Connection pools in DB drivers and HTTP clients.
1.5 CLOSE_WAIT — App Bug Signal
TIME_WAIT is normal; stacked CLOSE_WAIT is not. The peer sent FIN but your app never called close().
$ ss -tan | grep CLOSE_WAIT | wc -l
50000 # socket leak
Typical cause: file descriptors that are never closed on error paths (a missing try/finally or equivalent cleanup).
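In C the leak is usually an early return that skips close(). A minimal sketch of the fix; the handle_client name and logic are invented for illustration:

#include <unistd.h>

/* Hypothetical per-connection handler: every exit path must release the fd,
   otherwise the socket lingers in CLOSE_WAIT after the peer's FIN. */
int handle_client(int fd) {
    char buf[4096];
    int rc = -1;

    ssize_t n = read(fd, buf, sizeof buf);
    if (n <= 0)
        goto out;                  /* error or peer closed: still reach close() */

    if (write(fd, buf, (size_t)n) < 0)
        goto out;

    rc = 0;
out:
    close(fd);                     /* the C equivalent of try/finally */
    return rc;
}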
2. TCP Reliability Mechanisms
2.1 Sequence Numbers and ACK
Every byte has a sequence number. The receiver ACKs the next expected byte.
Send: [1000][1001][1002][1003]
Recv: ACK=1004 (got 1000-1003, expect 1004)
If a segment is lost, the receiver repeats the same ACK (duplicate ACK). Three dup ACKs trigger Fast Retransmit.
2.2 Retransmission — RTO and Fast Retransmit
- RTO: dynamic timeout derived from RTT measurements (Jacobson, 1988); a sketch of the estimator follows this list.
- Fast Retransmit: skip timeout on 3 dup ACKs; recovery within milliseconds.
- SACK: precise "got 1000-2000, 3000-4000, missing 2000-3000".
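The RTO mentioned in the first bullet is standardized in RFC 6298. A minimal sketch of that estimator, using the RFC's constants rather than any particular kernel's:

#include <math.h>

/* RFC 6298 retransmission-timeout estimator, in seconds.
   srtt == 0 is used here to mean "no sample yet". */
static double srtt, rttvar, rto = 1.0;

void on_rtt_sample(double r) {
    const double alpha = 0.125, beta = 0.25, g = 0.001;   /* g: clock granularity */
    if (srtt == 0.0) {                                    /* first measurement */
        srtt = r;
        rttvar = r / 2.0;
    } else {
        rttvar = (1 - beta) * rttvar + beta * fabs(srtt - r);
        srtt   = (1 - alpha) * srtt + alpha * r;
    }
    rto = srtt + fmax(g, 4.0 * rttvar);
    if (rto < 1.0)
        rto = 1.0;                                        /* RFC 6298's 1 s floor */
}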
2.3 Flow Control — Receive Window
Receivers advertise rwnd. Senders only send while unacked_bytes < rwnd.
2.4 Window Scaling
TCP's 16-bit window field tops out at 65,535 bytes, but 10Gbps x 100ms RTT needs a 125MB window. RFC 1323 (now RFC 7323) window scaling multiplies the advertised window by 2^shift, up to shift 14 (roughly a 1GB window).
sysctl net.ipv4.tcp_window_scaling
sysctl net.core.rmem_max
sysctl net.ipv4.tcp_rmem
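Applications can also override the buffer per socket with SO_RCVBUF, with the caveat that an explicit value disables tcp_rmem autotuning for that socket. A sketch with an illustrative 16MB request:

#include <sys/socket.h>

/* Request a large receive buffer for one socket (capped by net.core.rmem_max).
   Most apps should leave this alone and rely on tcp_rmem autotuning instead. */
void bump_rcvbuf(int fd) {
    int bytes = 16 * 1024 * 1024;
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof bytes);
}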
3. Congestion Control
3.1 The 1986 Congestion Collapse
UC Berkeley to LBL link dropped from 32Kbps to 40bps — 1000x degradation. Cause: retransmit storms. Van Jacobson's 1988 paper introduced congestion control.
3.2 Congestion Window (cwnd)
send = min(rwnd, cwnd)
rwnd comes from the receiver; cwnd is the sender's estimate of network capacity.
3.3 Slow Start
Initial cwnd = 10 MSS. Each ACK increments cwnd by 1 MSS, doubling per RTT.
cwnd: 10 -> 20 -> 40 -> 80 -> 160
Exponential growth in practice despite the "slow" name.
3.4 Congestion Avoidance
After ssthresh, cwnd += 1 MSS per RTT (linear).
3.5 Loss Detection
- 3 Dup ACK (Fast Retransmit): halve cwnd.
- Timeout: cwnd = 1 MSS, restart Slow Start.
This is AIMD (Additive Increase, Multiplicative Decrease) — the Reno core.
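Sections 3.3 through 3.5 combine into one small state machine. A toy Reno-style simulation; the loss points and values are invented, and real stacks count in bytes and add fast recovery:

#include <stdio.h>

int main(void) {
    double cwnd = 10, ssthresh = 64;           /* in MSS units */
    for (int rtt = 0; rtt < 40; rtt++) {
        int loss_by_dupack  = (rtt == 12);     /* pretend 3 dup ACKs arrive here */
        int loss_by_timeout = (rtt == 25);     /* and an RTO fires here */

        if (loss_by_timeout) {                 /* multiplicative decrease, hard reset */
            ssthresh = cwnd / 2;
            cwnd = 1;                          /* back to slow start */
        } else if (loss_by_dupack) {           /* fast retransmit: halve */
            ssthresh = cwnd / 2;
            cwnd = ssthresh;
        } else if (cwnd < ssthresh) {
            cwnd *= 2;                         /* slow start: doubles each RTT */
        } else {
            cwnd += 1;                         /* congestion avoidance: +1 MSS per RTT */
        }
        printf("rtt=%2d cwnd=%.1f\n", rtt, cwnd);
    }
    return 0;
}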
3.6 Reno to Cubic
Reno is too conservative on long fat networks. CUBIC (Linux default) grows cwnd as a cubic function of time, remembers the last loss point, and probes beyond it.
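CUBIC's growth curve is a published formula (RFC 8312). A minimal sketch using the RFC's default constants C=0.4 and beta=0.7; W_max is the window at the last loss, in MSS:

#include <math.h>
#include <stdio.h>

/* W_cubic(t) = C*(t-K)^3 + W_max, with K the time needed to climb back to W_max.
   Concave until K, flat around the old loss point, then convex probing beyond it. */
double cubic_window(double t, double w_max) {
    const double C = 0.4, beta = 0.7;
    double K = cbrt(w_max * (1.0 - beta) / C);
    return C * pow(t - K, 3.0) + w_max;
}

int main(void) {
    for (double t = 0; t <= 10; t += 1)
        printf("t=%4.1f  cwnd=%6.1f\n", t, cubic_window(t, 100.0));
    return 0;
}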
3.7 BBR — Google's 2016 Revolution
Traditional algorithms treat every loss as congestion. Modern networks see non-congestive loss (WiFi noise) and deep buffers that hide congestion behind latency (bufferbloat).
BBR (Bottleneck Bandwidth and RTT) idea:
"Measure actual bandwidth and RTT directly; size cwnd from them."
- Periodically probe for bandwidth.
- Track minimum RTT (detect bufferbloat).
- Keep cwnd near BW x RTT to minimize queuing (see the sketch below).
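That cwnd target is essentially the bandwidth-delay product times a gain. A back-of-the-envelope sketch; the bandwidth, RTT, and the gain of 2 (BBR v1's published default) are illustrative, and this is not the real probing state machine:

#include <stdio.h>

int main(void) {
    double btl_bw_bps = 100e6;       /* measured bottleneck bandwidth: 100 Mbps */
    double rt_prop_s  = 0.040;       /* measured minimum RTT: 40 ms */
    double bdp_bytes  = btl_bw_bps / 8 * rt_prop_s;   /* ~500 KB in flight */
    double cwnd_bytes = 2.0 * bdp_bytes;              /* cwnd_gain = 2 in BBR v1 */
    printf("BDP = %.0f bytes, cwnd target = %.0f bytes\n", bdp_bytes, cwnd_bytes);
    return 0;
}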
3.8 BBR Results
Google's published results for google.com and YouTube after deploying BBR:
- 4% reduction in YouTube rebuffering.
- Lower google.com response time.
- Large gains in developing regions.
Linux 4.9+ includes BBR. Enable with sysctl net.ipv4.tcp_congestion_control=bbr.
3.9 BBR Fairness Debate
BBR v1 took more bandwidth than Cubic when coexisting — a fairness issue. Improved in BBRv2 (2019) and v3 (2023).
| Algorithm | Trait | Use |
|---|---|---|
| Reno | AIMD, classic | Legacy |
| Cubic | Cubic growth | Linux default, WAN |
| BBR | Direct BW/RTT | High speed, non-congestive loss |
| DCTCP | ECN marking | Data center |
| Copa | Low latency | Video conferencing |
4. Nagle and Delayed ACK — The Worst Pairing
4.1 Nagle
Small packets are expensive (40-byte header for 1-byte payload = 2.5% efficiency).
Nagle: "If an unacked small packet is in flight, buffer further small data until ACK arrives."
int flag = 1;
setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));  /* disable Nagle */
Good for throughput; adds latency.
4.2 Delayed ACK
The receiver also batches ACKs — piggyback on next data, or wait up to 200ms (40ms default on Linux).
4.3 Nagle + Delayed ACK = 40ms Stall
A: sends first small packet
B: defers ACK (more might come)
A: wants to send the second small packet
-> Nagle waits for ACK
-> Delayed ACK waits for data
-> 40ms later, ACK fires
A: finally sends second
Real-time apps (Telnet, SSH, remote games) must set TCP_NODELAY.
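The classic trigger is a write-write-read pattern. A sketch; the helper name and HTTP strings are invented, and fd is assumed to be a connected TCP socket with Nagle left on:

#include <unistd.h>

void send_request_slow(int fd) {
    const char header[] = "GET / HTTP/1.1\r\nHost: example.com\r\n";
    const char end[]    = "\r\n";
    char reply[4096];

    write(fd, header, sizeof header - 1);   /* small packet #1 leaves immediately */
    write(fd, end, sizeof end - 1);         /* #2 is held by Nagle until #1 is ACKed,
                                               while the peer delays that ACK waiting
                                               for more data */
    read(fd, reply, sizeof reply);          /* so this read can stall up to ~40 ms */
}
/* Fix: coalesce the two writes (writev), or set TCP_NODELAY on fd. */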
4.4 TCP_CORK — The Opposite
"Buffer up, send in one shot." sendfile + TCP_CORK powers nginx static-file delivery.
int cork = 1;
setsockopt(fd, IPPROTO_TCP, TCP_CORK, &cork, sizeof(int));  /* cork: hold partial frames */
writev(fd, iov, 10);    /* queue headers + body pieces (iov prepared elsewhere) */
cork = 0;
setsockopt(fd, IPPROTO_TCP, TCP_CORK, &cork, sizeof(int));  /* uncork: flush full segments */
5. TCP Fast Open — Skip the Handshake
Every new connection pays for the handshake: the request reaches the server only after 1.5 RTT (SYN, SYN+ACK, then ACK carrying the request), and the response arrives after 2 RTT.
5.1 TFO (RFC 7413, 2014)
First connection: server issues a cookie. Later connections send SYN with cookie + request payload.
1st: SYN -> SYN+ACK(cookie) -> ACK -> GET
2nd: SYN(cookie, GET) -> SYN+ACK(data)
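On Linux the client side is exposed through the MSG_FASTOPEN flag and the server side through the TCP_FASTOPEN socket option, provided net.ipv4.tcp_fastopen permits it (1 = client, 2 = server, 3 = both). A sketch; the request string and queue length are illustrative:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Client: connect and send the first payload in one call; the payload rides
   in the SYN once a TFO cookie from a previous connection is cached. */
void tfo_client(int fd, const struct sockaddr_in *srv) {
    const char req[] = "GET / HTTP/1.1\r\nHost: example.com\r\n\r\n";
    sendto(fd, req, sizeof req - 1, MSG_FASTOPEN,
           (const struct sockaddr *)srv, sizeof *srv);
}

/* Server: opt the listening socket into TFO; qlen bounds pending TFO requests. */
void tfo_server(int listen_fd) {
    int qlen = 128;
    setsockopt(listen_fd, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof qlen);
}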
5.2 Why It Flopped
- Middleboxes drop non-standard SYN.
- Server-side cookie state.
- HTTP/2 reuse mitigates the pain.
QUIC's 0-RTT handshake sidesteps this.
6. Common Errors Decoded
6.1 Connection refused
RST to SYN — no listener on that port.
6.2 Connection reset by peer
RST during ESTABLISHED — peer crashed, SO_LINGER 0, or firewall active reset.
6.3 Broken pipe
Write to a socket the peer already closed.
6.4 Connection timed out
SYNs gone unanswered (5-7 retries, 60s+). Network black hole or dead server.
6.5 No route to host
Routing table has no path — VPN dropped, misconfigured route.
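In code these errors arrive as errno values from connect(), write(), and read(). A small decoder sketch; the mapping reflects common Linux behavior:

#include <errno.h>
#include <stdio.h>
#include <string.h>

void explain(int err) {
    switch (err) {
    case ECONNREFUSED: puts("connection refused: RST answered our SYN"); break;
    case ECONNRESET:   puts("connection reset by peer: RST mid-connection"); break;
    case EPIPE:        puts("broken pipe: wrote after the peer closed"); break;
    case ETIMEDOUT:    puts("connection timed out: SYNs/retransmits unanswered"); break;
    case EHOSTUNREACH: puts("no route to host"); break;
    default:           printf("%s\n", strerror(err));
    }
}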
7. Production Tuning — Key sysctls
7.1 Backlog and Queues
net.core.somaxconn # listen backlog cap
net.ipv4.tcp_max_syn_backlog # SYN_RCVD queue
net.core.netdev_max_backlog # NIC to kernel queue
An nginx listen 80 backlog=65535 directive only takes effect if net.core.somaxconn is raised to match.
7.2 TIME_WAIT
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_fin_timeout=30
net.ipv4.ip_local_port_range
7.3 Keep-alive
net.ipv4.tcp_keepalive_time=600
net.ipv4.tcp_keepalive_intvl=60
net.ipv4.tcp_keepalive_probes=3
Defaults of 2 hours are too long for load-balanced services.
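These sysctls are system-wide defaults; sockets can also opt in individually. A sketch using the Linux-specific TCP_KEEPIDLE/KEEPINTVL/KEEPCNT options, with illustrative values matching the sysctls above:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

void enable_keepalive(int fd) {
    int on = 1, idle = 600, intvl = 60, probes = 3;
    setsockopt(fd, SOL_SOCKET,  SO_KEEPALIVE,  &on,     sizeof on);
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,   sizeof idle);
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl,  sizeof intvl);
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,   &probes, sizeof probes);
}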
7.4 Buffers
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.ipv4.tcp_rmem="4096 131072 16777216"
net.ipv4.tcp_wmem="4096 131072 16777216"
Match BDP: 10Gbps x 50ms = 62MB.
7.5 Congestion Control
net.ipv4.tcp_congestion_control=bbr
net.core.default_qdisc=fq
8. QUIC — Leaving TCP Behind
8.1 TCP's Hard Limits
- Head-of-Line Blocking: one lost packet stalls all streams on the same TCP connection — HTTP/2 inherits this.
- Handshake cost: TCP 3-way + TLS 1.2 2-RTT = 3 RTT.
- Middlebox ossification: new TCP options get dropped by ISP firewalls. TFO and MPTCP failed here.
- Kernel coupling: TCP improvements require kernel upgrades.
8.2 QUIC's Answer — User-Space Transport over UDP
Google experiment (2012), IETF standard (RFC 9000, 2021), HTTP/3 base.
- Built on UDP — middleboxes just see UDP.
- User-space library — ships with the app.
- TLS 1.3 integrated.
- True stream multiplexing — no HoL across streams.
8.3 0-RTT Handshake
Reconnects piggyback data on the first packet using a session ticket.
First: QUIC handshake (1 RTT)
Reconnect: data immediately (0 RTT)
8.4 Connection Migration
Connection ID survives IP changes. WiFi to 4G handover without dropping.
8.5 HTTP/3 = HTTP over QUIC
Same semantics as HTTP/2, transport swap to QUIC. Major browsers and CDNs (Cloudflare, Akamai, Fastly) support it.
8.6 QUIC Downsides
- Higher CPU (mandatory crypto, user space).
- Some networks and middleboxes throttle or block heavy UDP traffic.
- Implementation complexity.
- Harder to observe — encrypted payloads, fewer tools.
Google and Meta measured ~10% latency improvement on mobile; desktop gains are smaller.
9. Debugging Toolkit
9.1 Inspect Connections
ss -tan
ss -tan state established
ss -tnp | grep :443
ss -s
9.2 Packet Capture
tcpdump -i any -w capture.pcap 'port 443'
wireshark capture.pcap
9.3 Congestion Control Observation
ss -ti
Sample:
ESTAB ... cubic cwnd:10 ssthresh:7 bytes_acked:1234 bytes_received:5678 rtt:25.3/3.1 rcv_rtt:25.1 delivered:10 ...
Small cwnd with many retrans = congestion.
9.4 bpftrace
bpftrace -e 'kprobe:tcp_retransmit_skb { printf("retrans pid=%d\n", pid); }'
9.5 mtr
mtr -r -c 100 example.com
Per-hop loss + RTT for ISP diagnosis.
10. Closing — 50 Years of TCP, and What Comes Next
TCP dates to Vint Cerf and Bob Kahn's 1974 paper. Highlights:
- 1988: Jacobson congestion control.
- 1992: Window Scaling.
- 1996: SACK.
- 2006: Cubic.
- 2016: BBR.
- 2021: QUIC (RFC).
QUIC pulled transport into user space. Apps can evolve their transport without kernel upgrades; user-space stacks such as Facebook's mvfst and Cloudflare's quiche, following the path Google's gQUIC opened, will shape the next decade. Meanwhile TCP stays: 80%+ of traffic still runs on it.
Next post: TLS/SSL and PKI internals — cert chains, cipher suites, 0-RTT replay risk, QUIC crypto integration, and post-quantum.
References
- RFC 9293 — Transmission Control Protocol (2022 revision).
- Van Jacobson — "Congestion Avoidance and Control" (SIGCOMM, 1988).
- Cardwell et al — "BBR: Congestion-Based Congestion Control" (ACM Queue, 2016).
- RFC 9000 — QUIC.
- RFC 9114 — HTTP/3.
- "Computer Networks: A Systems Approach" — Peterson & Davie.
- Brendan Gregg — Linux Performance blog.
- Marek Majkowski (Cloudflare) — TCP internals writing.
- "High Performance Browser Networking" — Ilya Grigorik (O'Reilly, 2013).