TCP Network Stack Deep Dive — State Machine, Congestion Control (Cubic vs BBR), Nagle, Delayed ACK, and the Evolution to QUIC (2025)
Author: Youngju Kim (@fjvbn20031)
0. The TCP You Think You Know
Every HTTP request rides on TCP. Yet consider:
- Why does TIME_WAIT in ss -tan sit around for 30 seconds?
- What distinguishes "connection reset by peer" from "broken pipe"?
- Why can sending 100 messages of 1 KB each take 4 seconds?
- Why does iperf deliver 1Gbps on a 10Gbps link?
- Why did Google build QUIC on UDP instead of TCP?
The answers live below: TCP state machine, congestion control evolution, infamous interaction bugs, and the QUIC era.
1. TCP State Machine — 11 States
1.1 Connection Setup: 3-Way Handshake
Client                             Server
  |  SYN (seq=x)                     |
  |--------------------------------->|  [LISTEN -> SYN_RCVD]
  |  SYN+ACK (seq=y, ack=x+1)        |
  |<---------------------------------|
  |  ACK (ack=y+1)                   |
  |--------------------------------->|  [SYN_RCVD -> ESTABLISHED]
Why three, not two? Bidirectional ISN (Initial Sequence Number) synchronization. Each direction must announce its own ISN and receive acknowledgment.
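For orientation, here is how those transitions line up with the sockets API. A minimal client sketch; the loopback address and port are placeholders and error handling is omitted:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080);                 /* placeholder port */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    /* connect() sends the SYN (client: CLOSED -> SYN_SENT) and returns once
       the SYN+ACK arrives and the final ACK is sent (-> ESTABLISHED).
       On the server, listen() created the LISTEN socket; the kernel drove
       LISTEN -> SYN_RCVD -> ESTABLISHED before accept() handed the socket over. */
    connect(fd, (struct sockaddr *)&addr, sizeof addr);
    close(fd);
    return 0;
}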
1.2 Connection Teardown: 4-Way Handshake
A                                  B
  |  FIN                             |
  |--------------------------------->|  [ESTABLISHED -> CLOSE_WAIT]
  |  ACK                             |
  |<---------------------------------|  [FIN_WAIT_1 -> FIN_WAIT_2]
  |  FIN                             |
  |<---------------------------------|  [CLOSE_WAIT -> LAST_ACK]
  |  ACK                             |
  |--------------------------------->|  [LAST_ACK -> CLOSED]
  [FIN_WAIT_2 -> TIME_WAIT]
Four, because TCP is full-duplex. Each direction closes independently.
1.3 TIME_WAIT — Misunderstood
The closing side waits 2*MSL (twice the Maximum Segment Lifetime), typically 30 seconds to 2 minutes; Linux hardcodes the wait at 60 seconds. Reasons:
- Prevent stray packets from joining a new connection: quickly reopened port pairs could receive delayed packets from the previous connection.
- Handle lost final ACK: if the peer retransmits FIN, we must respond with ACK; a CLOSED state would reply with RST.
1.4 TIME_WAIT Explosion
Short-lived outbound connections pile up tens of thousands of sockets in TIME_WAIT, exhausting ephemeral ports.
Wrong fix: forcing TIME_WAIT to 0 (unsafe, risks data corruption).
Right fixes:
- net.ipv4.tcp_tw_reuse=1: safely reuse ephemeral ports stuck in TIME_WAIT for outbound connections.
- Keep-Alive connections (HTTP/1.1 persistent, HTTP/2 multiplexing).
- Connection pools in DB drivers and HTTP clients.
1.5 CLOSE_WAIT — App Bug Signal
TIME_WAIT is normal; stacked CLOSE_WAIT is not. The peer sent FIN but your app never called close().
$ ss -tan | grep CLOSE_WAIT | wc -l
50000 # socket leak
Typical cause: file descriptors that are never closed on error paths (a missing try/finally or equivalent cleanup).
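In C the leak is usually an early return that skips close(). A minimal sketch of the fix; the handle_client name and logic are invented for illustration:

#include <unistd.h>

/* Hypothetical per-connection handler: every exit path must release the fd,
   otherwise the socket lingers in CLOSE_WAIT after the peer's FIN. */
int handle_client(int fd) {
    char buf[4096];
    int rc = -1;

    ssize_t n = read(fd, buf, sizeof buf);
    if (n <= 0)
        goto out;                  /* error or peer closed: still reach close() */

    if (write(fd, buf, (size_t)n) < 0)
        goto out;

    rc = 0;
out:
    close(fd);                     /* the C equivalent of try/finally */
    return rc;
}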
2. TCP Reliability Mechanisms
2.1 Sequence Numbers and ACK
Every byte has a sequence number. The receiver ACKs the next expected byte.
Send: [1000][1001][1002][1003]
Recv: ACK=1004 (got 1000-1003, expect 1004)
If a segment is lost, the receiver repeats the same ACK (duplicate ACK). Three dup ACKs trigger Fast Retransmit.
2.2 Retransmission — RTO and Fast Retransmit
- RTO: dynamic timeout derived from RTT measurements (Jacobson, 1988); a sketch of the estimator follows this list.
- Fast Retransmit: skip timeout on 3 dup ACKs; recovery within milliseconds.
- SACK: precise "got 1000-2000, 3000-4000, missing 2000-3000".
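The RTO mentioned in the first bullet is standardized in RFC 6298. A minimal sketch of that estimator, using the RFC's constants rather than any particular kernel's:

#include <math.h>

/* RFC 6298 retransmission-timeout estimator, in seconds.
   srtt == 0 is used here to mean "no sample yet". */
static double srtt, rttvar, rto = 1.0;

void on_rtt_sample(double r) {
    const double alpha = 0.125, beta = 0.25, g = 0.001;   /* g: clock granularity */
    if (srtt == 0.0) {                                    /* first measurement */
        srtt = r;
        rttvar = r / 2.0;
    } else {
        rttvar = (1 - beta) * rttvar + beta * fabs(srtt - r);
        srtt   = (1 - alpha) * srtt + alpha * r;
    }
    rto = srtt + fmax(g, 4.0 * rttvar);
    if (rto < 1.0)
        rto = 1.0;                                        /* RFC 6298's 1 s floor */
}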
2.3 Flow Control — Receive Window
Receivers advertise rwnd. Senders only send while unacked_bytes < rwnd.
2.4 Window Scaling
TCP's 16-bit window field tops out at 65,535 bytes, but 10Gbps x 100ms RTT needs a 125MB window. RFC 1323 (now RFC 7323) window scaling multiplies the advertised window by 2^shift, up to shift 14 (roughly a 1GB window).
sysctl net.ipv4.tcp_window_scaling
sysctl net.core.rmem_max
sysctl net.ipv4.tcp_rmem
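Applications can also override the buffer per socket with SO_RCVBUF, with the caveat that an explicit value disables tcp_rmem autotuning for that socket. A sketch with an illustrative 16MB request:

#include <sys/socket.h>

/* Request a large receive buffer for one socket (capped by net.core.rmem_max).
   Most apps should leave this alone and rely on tcp_rmem autotuning instead. */
void bump_rcvbuf(int fd) {
    int bytes = 16 * 1024 * 1024;
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof bytes);
}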
3. Congestion Control
3.1 The 1986 Congestion Collapse
UC Berkeley to LBL link dropped from 32Kbps to 40bps — 1000x degradation. Cause: retransmit storms. Van Jacobson's 1988 paper introduced congestion control.
3.2 Congestion Window (cwnd)
send = min(rwnd, cwnd)
rwnd comes from the receiver; cwnd is the sender's estimate of network capacity.
3.3 Slow Start
Initial cwnd = 10 MSS. Each ACK increments cwnd by 1 MSS, doubling per RTT.
cwnd: 10 -> 20 -> 40 -> 80 -> 160
Exponential growth in practice despite the "slow" name.
3.4 Congestion Avoidance
After ssthresh, cwnd += 1 MSS per RTT (linear).
3.5 Loss Detection
- 3 Dup ACK (Fast Retransmit): halve cwnd.
- Timeout: cwnd = 1 MSS, restart Slow Start.
This is AIMD (Additive Increase, Multiplicative Decrease) — the Reno core.
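Sections 3.3 through 3.5 combine into one small state machine. A toy Reno-style simulation; the loss points and values are invented, and real stacks count in bytes and add fast recovery:

#include <stdio.h>

int main(void) {
    double cwnd = 10, ssthresh = 64;           /* in MSS units */
    for (int rtt = 0; rtt < 40; rtt++) {
        int loss_by_dupack  = (rtt == 12);     /* pretend 3 dup ACKs arrive here */
        int loss_by_timeout = (rtt == 25);     /* and an RTO fires here */

        if (loss_by_timeout) {                 /* multiplicative decrease, hard reset */
            ssthresh = cwnd / 2;
            cwnd = 1;                          /* back to slow start */
        } else if (loss_by_dupack) {           /* fast retransmit: halve */
            ssthresh = cwnd / 2;
            cwnd = ssthresh;
        } else if (cwnd < ssthresh) {
            cwnd *= 2;                         /* slow start: doubles each RTT */
        } else {
            cwnd += 1;                         /* congestion avoidance: +1 MSS per RTT */
        }
        printf("rtt=%2d cwnd=%.1f\n", rtt, cwnd);
    }
    return 0;
}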
3.6 Reno to Cubic
Reno is too conservative on long fat networks. CUBIC (Linux default) grows cwnd as a cubic function of time, remembers the last loss point, and probes beyond it.
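CUBIC's growth curve is a published formula (RFC 8312). A minimal sketch using the RFC's default constants C=0.4 and beta=0.7; W_max is the window at the last loss, in MSS:

#include <math.h>
#include <stdio.h>

/* W_cubic(t) = C*(t-K)^3 + W_max, with K the time needed to climb back to W_max.
   Concave until K, flat around the old loss point, then convex probing beyond it. */
double cubic_window(double t, double w_max) {
    const double C = 0.4, beta = 0.7;
    double K = cbrt(w_max * (1.0 - beta) / C);
    return C * pow(t - K, 3.0) + w_max;
}

int main(void) {
    for (double t = 0; t <= 10; t += 1)
        printf("t=%4.1f  cwnd=%6.1f\n", t, cubic_window(t, 100.0));
    return 0;
}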
3.7 BBR — Google's 2016 Revolution
Traditional algorithms treat every loss as congestion. Modern networks see non-congestive loss (WiFi noise) and deep buffers that hide congestion behind latency (bufferbloat).
BBR (Bottleneck Bandwidth and RTT) idea:
"Measure actual bandwidth and RTT directly; size cwnd from them."
- Periodically probe for bandwidth.
- Track minimum RTT (detect bufferbloat).
- Keep cwnd near BW x RTT to minimize queuing (see the sketch below).
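That cwnd target is essentially the bandwidth-delay product times a gain. A back-of-the-envelope sketch; the bandwidth, RTT, and the gain of 2 (BBR v1's published default) are illustrative, and this is not the real probing state machine:

#include <stdio.h>

int main(void) {
    double btl_bw_bps = 100e6;       /* measured bottleneck bandwidth: 100 Mbps */
    double rt_prop_s  = 0.040;       /* measured minimum RTT: 40 ms */
    double bdp_bytes  = btl_bw_bps / 8 * rt_prop_s;   /* ~500 KB in flight */
    double cwnd_bytes = 2.0 * bdp_bytes;              /* cwnd_gain = 2 in BBR v1 */
    printf("BDP = %.0f bytes, cwnd target = %.0f bytes\n", bdp_bytes, cwnd_bytes);
    return 0;
}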
3.8 BBR Results
Google's published results for google.com and YouTube after deploying BBR:
- 4% reduction in YouTube rebuffering.
- Lower google.com response time.
- Large gains in developing regions.
Linux 4.9+ includes BBR. Enable with sysctl net.ipv4.tcp_congestion_control=bbr.
3.9 BBR Fairness Debate
BBR v1 took more bandwidth than Cubic when coexisting — a fairness issue. Improved in BBRv2 (2019) and v3 (2023).
| Algorithm | Trait | Use |
|---|---|---|
| Reno | AIMD, classic | Legacy |
| Cubic | Cubic growth | Linux default, WAN |
| BBR | Direct BW/RTT | High speed, non-congestive loss |
| DCTCP | ECN marking | Data center |
| Copa | Low latency | Video conferencing |
4. Nagle and Delayed ACK — The Worst Pairing
4.1 Nagle
Small packets are expensive (40-byte header for 1-byte payload = 2.5% efficiency).
Nagle: "If an unacked small packet is in flight, buffer further small data until ACK arrives."
int flag = 1;
setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));  /* disable Nagle */
Good for throughput; adds latency.
4.2 Delayed ACK
The receiver also batches ACKs — piggyback on next data, or wait up to 200ms (40ms default on Linux).
4.3 Nagle + Delayed ACK = 40ms Stall
A: sends first small packet
B: defers ACK (more might come)
A: wants to send the second small packet
-> Nagle waits for ACK
-> Delayed ACK waits for data
-> 40ms later, ACK fires
A: finally sends second
Real-time apps (Telnet, SSH, remote games) must set TCP_NODELAY.
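The classic trigger is a write-write-read pattern. A sketch; the helper name and HTTP strings are invented, and fd is assumed to be a connected TCP socket with Nagle left on:

#include <unistd.h>

void send_request_slow(int fd) {
    const char header[] = "GET / HTTP/1.1\r\nHost: example.com\r\n";
    const char end[]    = "\r\n";
    char reply[4096];

    write(fd, header, sizeof header - 1);   /* small packet #1 leaves immediately */
    write(fd, end, sizeof end - 1);         /* #2 is held by Nagle until #1 is ACKed,
                                               while the peer delays that ACK waiting
                                               for more data */
    read(fd, reply, sizeof reply);          /* so this read can stall up to ~40 ms */
}
/* Fix: coalesce the two writes (writev), or set TCP_NODELAY on fd. */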
4.4 TCP_CORK — The Opposite
"Buffer up, send in one shot." sendfile + TCP_CORK powers nginx static-file delivery.
int cork = 1;
setsockopt(fd, IPPROTO_TCP, TCP_CORK, &cork, sizeof(int));  /* cork: hold partial frames */
writev(fd, iov, 10);    /* queue headers + body pieces (iov prepared elsewhere) */
cork = 0;
setsockopt(fd, IPPROTO_TCP, TCP_CORK, &cork, sizeof(int));  /* uncork: flush full segments */
5. TCP Fast Open — Skip the Handshake
Every new connection pays for the handshake: the request reaches the server only after 1.5 RTT (SYN, SYN+ACK, then ACK carrying the request), and the response arrives after 2 RTT.
5.1 TFO (RFC 7413, 2014)
First connection: server issues a cookie. Later connections send SYN with cookie + request payload.
1st: SYN -> SYN+ACK(cookie) -> ACK -> GET
2nd: SYN(cookie, GET) -> SYN+ACK(data)
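On Linux the client side is exposed through the MSG_FASTOPEN flag and the server side through the TCP_FASTOPEN socket option, provided net.ipv4.tcp_fastopen permits it (1 = client, 2 = server, 3 = both). A sketch; the request string and queue length are illustrative:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Client: connect and send the first payload in one call; the payload rides
   in the SYN once a TFO cookie from a previous connection is cached. */
void tfo_client(int fd, const struct sockaddr_in *srv) {
    const char req[] = "GET / HTTP/1.1\r\nHost: example.com\r\n\r\n";
    sendto(fd, req, sizeof req - 1, MSG_FASTOPEN,
           (const struct sockaddr *)srv, sizeof *srv);
}

/* Server: opt the listening socket into TFO; qlen bounds pending TFO requests. */
void tfo_server(int listen_fd) {
    int qlen = 128;
    setsockopt(listen_fd, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof qlen);
}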
5.2 Why It Flopped
- Middleboxes drop non-standard SYN.
- Server-side cookie state.
- HTTP/2 reuse mitigates the pain.
QUIC's 0-RTT handshake sidesteps this.
6. Common Errors Decoded
6.1 Connection refused
RST to SYN — no listener on that port.
6.2 Connection reset by peer
RST during ESTABLISHED — peer crashed, SO_LINGER 0, or firewall active reset.
6.3 Broken pipe
Write to a socket the peer already closed.
6.4 Connection timed out
SYNs gone unanswered (5-7 retries, 60s+). Network black hole or dead server.
6.5 No route to host
Routing table has no path — VPN dropped, misconfigured route.
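In code these errors arrive as errno values from connect(), write(), and read(). A small decoder sketch; the mapping reflects common Linux behavior:

#include <errno.h>
#include <stdio.h>
#include <string.h>

void explain(int err) {
    switch (err) {
    case ECONNREFUSED: puts("connection refused: RST answered our SYN"); break;
    case ECONNRESET:   puts("connection reset by peer: RST mid-connection"); break;
    case EPIPE:        puts("broken pipe: wrote after the peer closed"); break;
    case ETIMEDOUT:    puts("connection timed out: SYNs/retransmits unanswered"); break;
    case EHOSTUNREACH: puts("no route to host"); break;
    default:           printf("%s\n", strerror(err));
    }
}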
7. Production Tuning — Key sysctls
7.1 Backlog and Queues
net.core.somaxconn # listen backlog cap
net.ipv4.tcp_max_syn_backlog # SYN_RCVD queue
net.core.netdev_max_backlog # NIC to kernel queue
An nginx listen 80 backlog=65535 directive only takes effect if net.core.somaxconn is raised to match.
7.2 TIME_WAIT
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_fin_timeout=30
net.ipv4.ip_local_port_range
7.3 Keep-alive
net.ipv4.tcp_keepalive_time=600
net.ipv4.tcp_keepalive_intvl=60
net.ipv4.tcp_keepalive_probes=3
Defaults of 2 hours are too long for load-balanced services.
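These sysctls are system-wide defaults; sockets can also opt in individually. A sketch using the Linux-specific TCP_KEEPIDLE/KEEPINTVL/KEEPCNT options, with illustrative values matching the sysctls above:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

void enable_keepalive(int fd) {
    int on = 1, idle = 600, intvl = 60, probes = 3;
    setsockopt(fd, SOL_SOCKET,  SO_KEEPALIVE,  &on,     sizeof on);
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,   sizeof idle);
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl,  sizeof intvl);
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,   &probes, sizeof probes);
}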
7.4 Buffers
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.ipv4.tcp_rmem="4096 131072 16777216"
net.ipv4.tcp_wmem="4096 131072 16777216"
Match BDP: 10Gbps x 50ms = 62MB.
7.5 Congestion Control
net.ipv4.tcp_congestion_control=bbr
net.core.default_qdisc=fq
8. QUIC — Leaving TCP Behind
8.1 TCP's Hard Limits
- Head-of-Line Blocking: one lost packet stalls all streams on the same TCP connection — HTTP/2 inherits this.
- Handshake cost: TCP 3-way + TLS 1.2 2-RTT = 3 RTT.
- Middlebox ossification: new TCP options get dropped by ISP firewalls. TFO and MPTCP failed here.
- Kernel coupling: TCP improvements require kernel upgrades.
8.2 QUIC's Answer — User-Space Transport over UDP
Google experiment (2012), IETF standard (RFC 9000, 2021), HTTP/3 base.
- Built on UDP — middleboxes just see UDP.
- User-space library — ships with the app.
- TLS 1.3 integrated.
- True stream multiplexing — no HoL across streams.
8.3 0-RTT Handshake
Reconnects piggyback data on the first packet using a session ticket.
First: QUIC handshake (1 RTT)
Reconnect: data immediately (0 RTT)
8.4 Connection Migration
Connection ID survives IP changes. WiFi to 4G handover without dropping.
8.5 HTTP/3 = HTTP over QUIC
Same semantics as HTTP/2, transport swap to QUIC. Major browsers and CDNs (Cloudflare, Akamai, Fastly) support it.
8.6 QUIC Downsides
- Higher CPU (mandatory crypto, user space).
- Some networks and middleboxes throttle or block heavy UDP traffic.
- Implementation complexity.
- Harder to observe — encrypted payloads, fewer tools.
Google and Meta measured ~10% latency improvement on mobile; desktop gains are smaller.
9. Debugging Toolkit
9.1 Inspect Connections
ss -tan
ss -tan state established
ss -tnp | grep :443
ss -s
9.2 Packet Capture
tcpdump -i any -w capture.pcap 'port 443'
wireshark capture.pcap
9.3 Congestion Control Observation
ss -ti
Sample:
ESTAB ... cubic cwnd:10 ssthresh:7 bytes_acked:1234 bytes_received:5678 rtt:25.3/3.1 rcv_rtt:25.1 delivered:10 ...
Small cwnd with many retrans = congestion.
9.4 bpftrace
bpftrace -e 'kprobe:tcp_retransmit_skb { printf("retrans pid=%d\n", pid); }'
9.5 mtr
mtr -r -c 100 example.com
Per-hop loss + RTT for ISP diagnosis.
10. Closing — 50 Years of TCP, and What Comes Next
TCP dates to Vint Cerf and Bob Kahn's 1974 paper. Highlights:
- 1988: Jacobson congestion control.
- 1992: Window Scaling.
- 1996: SACK.
- 2006: Cubic.
- 2016: BBR.
- 2021: QUIC (RFC).
QUIC pulled transport into user space. Apps can evolve their transport without kernel upgrades; user-space stacks such as Facebook's mvfst and Cloudflare's quiche, following the path Google's gQUIC opened, will shape the next decade. Meanwhile TCP stays: 80%+ of traffic still runs on it.
Next post: TLS/SSL and PKI internals — cert chains, cipher suites, 0-RTT replay risk, QUIC crypto integration, and post-quantum.
References
- RFC 9293 — Transmission Control Protocol (2022 revision).
- Van Jacobson — "Congestion Avoidance and Control" (SIGCOMM, 1988).
- Cardwell et al — "BBR: Congestion-Based Congestion Control" (ACM Queue, 2016).
- RFC 9000 — QUIC.
- RFC 9114 — HTTP/3.
- "Computer Networks: A Systems Approach" — Peterson & Davie.
- Brendan Gregg — Linux Performance blog.
- Marek Majkowski (Cloudflare) — TCP internals writing.
- "High Performance Browser Networking" — Ilya Grigorik (O'Reilly, 2013).