TCP Network Stack Deep Dive — State Machine, Congestion Control (Cubic vs BBR), Nagle, Delayed ACK, and the Evolution to QUIC (2025)
0. The TCP You Think You Know
Every HTTP request rides on TCP. Yet consider:
- Why does TIME_WAIT in ss -tan stick around for 30 seconds?
- What distinguishes "connection reset by peer" from "broken pipe"?
- Why can sending 100 one-kilobyte messages take a full 4 seconds?
- Why does iperf deliver only 1Gbps on a 10Gbps link?
- Why did Google build QUIC on UDP instead of TCP?
The answers live below: TCP state machine, congestion control evolution, infamous interaction bugs, and the QUIC era.
1. TCP State Machine — 11 States
1.1 Connection Setup: 3-Way Handshake
Client Server
| SYN (seq=x) |
|----------------->| [LISTEN -> SYN_RCVD]
| SYN+ACK(y,x+1) |
|<-----------------|
| ACK (ack=y+1) |
|----------------->| [SYN_RCVD -> ESTABLISHED]
Why three, not two? Bidirectional ISN (Initial Sequence Number) synchronization. Each direction must announce its own ISN and receive acknowledgment.
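A minimal sketch (assumptions: Linux, IPv4, placeholder address 192.0.2.1) of where this handshake lives in the socket API — connect() emits the SYN and blocks until the SYN+ACK arrives; the kernel sends the final ACK itself:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);        /* state: CLOSED */

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(80);
    inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr); /* placeholder address */

    /* connect() sends the SYN (-> SYN_SENT) and returns once the
     * SYN+ACK is in and the kernel has ACKed it (-> ESTABLISHED). */
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        perror("connect");  /* ECONNREFUSED here means RST answered our SYN */
        return 1;
    }
    close(fd);
    return 0;
}
```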
1.2 Connection Teardown: 4-Way Handshake
A                             B
| FIN          |
|------------->| [ESTABLISHED -> CLOSE_WAIT]
[ESTABLISHED -> FIN_WAIT_1]
| ACK          |
|<-------------|
[FIN_WAIT_1 -> FIN_WAIT_2]
| FIN          |
|<-------------| [CLOSE_WAIT -> LAST_ACK]
| ACK          |
|------------->|
[FIN_WAIT_2 -> TIME_WAIT] [LAST_ACK -> CLOSED]
Four, because TCP is full-duplex. Each direction closes independently.
1.3 TIME_WAIT — Misunderstood
The closing side waits 2MSL (twice the Maximum Segment Lifetime) — typically 30 seconds to 2 minutes; Linux hardcodes 60 seconds. Reasons:
- Prevent stray packets from joining a new connection: quickly reopened port pairs could receive delayed packets from the previous connection.
- Handle lost final ACK: if the peer retransmits FIN, we must respond with ACK; a CLOSED state would reply with RST.
1.4 TIME_WAIT Explosion
Servers making many short-lived outbound connections pile up tens of thousands of TIME_WAIT sockets, exhausting ephemeral ports.
Wrong fix: forcing TIME_WAIT to 0 (unsafe, risks data corruption).
Right fixes:
- net.ipv4.tcp_tw_reuse=1: safely reuse TIME_WAIT ports for outbound connections (relies on TCP timestamps).
- Keep-Alive connections (HTTP/1.1 persistent, HTTP/2 multiplexing).
- Connection pools in DB drivers and HTTP clients.
1.5 CLOSE_WAIT — App Bug Signal
TIME_WAIT is normal; stacked CLOSE_WAIT is not. The peer sent FIN but your app never called close().
$ ss -tan | grep CLOSE_WAIT | wc -l
50000 # socket leak
Typical cause: a missing try/finally (or equivalent cleanup path) around file descriptors — sketched below.
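A hypothetical sketch of the bug pattern (handle_client is an illustrative name, not from the post): an early return skips close(), our FIN never goes out, and the socket parks in CLOSE_WAIT forever.

```c
#include <unistd.h>

int handle_client(int fd) {
    char buf[4096];
    int rc = -1;

    ssize_t n = read(fd, buf, sizeof buf);
    if (n <= 0)
        goto out;   /* n == 0 is the peer's FIN: we are now in CLOSE_WAIT */
    /* ... process the request ... */
    rc = 0;
out:
    close(fd);      /* sends our FIN; forgetting this leaks a CLOSE_WAIT socket */
    return rc;
}
```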
2. TCP Reliability Mechanisms
2.1 Sequence Numbers and ACK
Every byte has a sequence number. The receiver ACKs the next expected byte.
Send: [1000][1001][1002][1003]
Recv: ACK=1004 (got 1000-1003, expect 1004)
If a segment is lost, the receiver repeats the same ACK (duplicate ACK). Three dup ACKs trigger Fast Retransmit.
2.2 Retransmission — RTO and Fast Retransmit
- RTO (Retransmission Timeout): dynamic timeout derived from RTT measurements (Jacobson, 1988) — see the estimator sketch after this list.
- Fast Retransmit: skip timeout on 3 dup ACKs; recovery within milliseconds.
- SACK: precise "got 1000-2000, 3000-4000, missing 2000-3000".
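A minimal sketch of that RTO smoothing, following RFC 6298 (variable names and sample RTTs are illustrative; Linux in practice floors RTO near 200ms rather than the RFC's 1 second):

```c
#include <stdio.h>

#define ALPHA 0.125   /* SRTT gain, 1/8 per RFC 6298 */
#define BETA  0.25    /* RTTVAR gain, 1/4            */

typedef struct { double srtt, rttvar, rto; int first; } rto_state;

static void rto_sample(rto_state *s, double rtt) {
    if (s->first) {                       /* first measurement */
        s->srtt   = rtt;
        s->rttvar = rtt / 2;
        s->first  = 0;
    } else {
        double err = rtt - s->srtt;       /* deviation from old SRTT */
        if (err < 0) err = -err;
        s->rttvar = (1 - BETA) * s->rttvar + BETA * err;
        s->srtt   = (1 - ALPHA) * s->srtt + ALPHA * rtt;
    }
    s->rto = s->srtt + 4 * s->rttvar;     /* RFC 6298: SRTT + 4*RTTVAR */
    if (s->rto < 1.0) s->rto = 1.0;       /* RFC floor; Linux uses ~200 ms */
}

int main(void) {
    rto_state s = { .first = 1 };
    double samples[] = { 0.100, 0.110, 0.095, 0.300, 0.105 };
    for (int i = 0; i < 5; i++) {
        rto_sample(&s, samples[i]);
        printf("rtt=%.3fs -> srtt=%.3f rttvar=%.3f rto=%.3f\n",
               samples[i], s.srtt, s.rttvar, s.rto);
    }
    return 0;
}
```

Note how the single 300ms outlier inflates RTTVAR and thus RTO — variance, not just the mean, drives the timeout.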
2.3 Flow Control — Receive Window
Receivers advertise rwnd. Senders only send while unacked_bytes < rwnd.
2.4 Window Scaling
TCP's 16-bit window field caps rwnd at 65,535 bytes, but 10Gbps x 100ms RTT needs 125MB in flight. Window scaling (RFC 7323, originally RFC 1323) multiplies rwnd by 2^shift, with shift up to 14.
sysctl net.ipv4.tcp_window_scaling # on by default on Linux
sysctl net.core.rmem_max # receive buffer ceiling
sysctl net.ipv4.tcp_rmem # min/default/max for autotuning
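As a per-socket counterpart to those sysctls, a sketch that requests a BDP-sized receive buffer — on Linux the kernel doubles the requested value and caps it at net.core.rmem_max, so reading it back shows what you actually got:

```c
#include <stdio.h>
#include <sys/socket.h>

int main(void) {
    int fd  = socket(AF_INET, SOCK_STREAM, 0);
    int bdp = 125 * 1024 * 1024;   /* 10 Gbps x 100 ms = 125 MB, from above */

    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bdp, sizeof bdp);

    int granted; socklen_t len = sizeof granted;
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len);
    printf("asked for %d bytes, kernel granted %d\n", bdp, granted);
    return 0;
}
```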
3. Congestion Control
3.1 The 1986 Congestion Collapse
The throughput from LBL to UC Berkeley dropped from 32Kbps to 40bps — a factor-of-1000 collapse. Cause: retransmit storms. Van Jacobson's 1988 paper introduced congestion control.
3.2 Congestion Window (cwnd)
send = min(rwnd, cwnd)
rwnd comes from the receiver; cwnd is the sender's estimate of network capacity.
3.3 Slow Start
Initial cwnd = 10 MSS (RFC 6928; historically 1-4). Each ACK grows cwnd by 1 MSS, doubling it per RTT.
cwnd: 10 -> 20 -> 40 -> 80 -> 160
Exponential growth in practice despite the "slow" name.
3.4 Congestion Avoidance
After ssthresh, cwnd += 1 MSS per RTT (linear).
3.5 Loss Detection
- 3 Dup ACK (Fast Retransmit): halve cwnd.
- Timeout: cwnd = 1 MSS, restart Slow Start.
This is AIMD (Additive Increase, Multiplicative Decrease) — the Reno core.
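A toy simulation of these Reno dynamics (units are MSS; the loss event at RTT 8 is invented for illustration):

```c
#include <stdio.h>

int main(void) {
    double cwnd = 10, ssthresh = 64;   /* initial cwnd = 10 MSS */
    for (int rtt = 1; rtt <= 16; rtt++) {
        if (rtt == 8) {                /* pretend 3 dup ACKs arrive */
            ssthresh = cwnd / 2;       /* multiplicative decrease   */
            cwnd = ssthresh;
        } else if (cwnd < ssthresh) {
            cwnd *= 2;                 /* slow start: doubles per RTT   */
        } else {
            cwnd += 1;                 /* congestion avoidance: +1 MSS  */
        }
        printf("rtt=%2d  cwnd=%5.1f  ssthresh=%5.1f\n", rtt, cwnd, ssthresh);
    }
    return 0;
}
```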
3.6 Reno to Cubic
Reno is too conservative on long fat networks. CUBIC (Linux default) grows cwnd as a cubic function of time, remembers the last loss point, and probes beyond it.
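The curve itself, sketched from RFC 8312 (constants are the RFC defaults; W_max = 100 MSS is an assumed value; compile with -lm):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    const double C = 0.4, beta = 0.7;        /* RFC 8312 defaults       */
    double w_max = 100.0;                    /* cwnd at last loss (MSS) */
    double K = cbrt(w_max * (1 - beta) / C); /* seconds back to w_max   */

    /* W(t) = C*(t - K)^3 + W_max */
    for (double t = 0; t <= 2 * K; t += K / 4)
        printf("t=%5.2fs  W(t)=%7.2f MSS\n",
               t, C * pow(t - K, 3) + w_max);
    return 0;
}
```

The output shows the signature shape: fast concave growth toward the old loss point, a plateau around it, then convex probing beyond.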
3.7 BBR — Google's 2016 Revolution
Traditional algorithms assume loss = congestion. Modern networks see non-congestive loss (WiFi noise, bufferbloat).
BBR (Bottleneck Bandwidth and RTT) idea:
"Measure actual bandwidth and RTT directly; size cwnd from them."
- Periodically probe for bandwidth.
- Track minimum RTT (detect bufferbloat).
- Keep cwnd near BW x RTT — minimize queuing (numeric sketch below).
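Numerically, the sizing rule looks like this — a toy illustration, not the real state machine; the 2.0 cwnd gain follows the steady-state value in the 2016 ACM Queue paper:

```c
#include <stdio.h>

int main(void) {
    double btl_bw_bps = 100e6;   /* measured bottleneck bandwidth: 100 Mbps */
    double min_rtt_s  = 0.040;   /* lowest RTT observed: 40 ms              */

    double bdp = btl_bw_bps / 8 * min_rtt_s;  /* bytes the pipe can hold   */
    double cwnd_gain = 2.0;                   /* headroom for ACK batching */

    printf("BDP          = %.0f bytes\n", bdp);            /* 500000 */
    printf("target cwnd  = %.0f bytes\n", cwnd_gain * bdp);
    printf("pacing rate  = %.0f bytes/s\n", btl_bw_bps / 8);
    return 0;
}
```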
3.8 BBR Results
Google's measurements on google.com and YouTube after deploying BBR:
- 4% reduction in YouTube rebuffering.
- Lower google.com response time.
- Large gains in developing regions.
Linux 4.9+ includes BBR. Enable with sysctl -w net.ipv4.tcp_congestion_control=bbr.
3.9 BBR Fairness Debate
BBR v1 took more bandwidth than Cubic when coexisting — a fairness issue. Improved in BBRv2 (2019) and v3 (2023).
| Algorithm | Trait | Use |
|---|---|---|
| Reno | AIMD, classic | Legacy |
| Cubic | Cubic growth | Linux default, WAN |
| BBR | Direct BW/RTT | High speed, non-congestive loss |
| DCTCP | ECN marking | Data center |
| Copa | Low latency | Video conferencing |
4. Nagle and Delayed ACK — The Worst Pairing
4.1 Nagle
Small packets are expensive: a 1-byte payload behind 40 bytes of TCP/IP headers is ~2.4% efficient.
Nagle: "If an unacked small packet is in flight, buffer further small data until ACK arrives."
int flag = 1; /* TCP_NODELAY disables Nagle on this socket */
setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(int));
Nagle is good for throughput but adds latency; TCP_NODELAY turns it off.
4.2 Delayed ACK
The receiver batches ACKs too — piggybacking on the next outgoing data, or waiting up to 200ms (Linux delays between 40ms and 200ms).
4.3 Nagle + Delayed ACK = 40ms Stall
A: sends first small packet
B: defers the ACK (more data might come)
A: wants to send a second small packet
-> Nagle waits for B's ACK
-> Delayed ACK waits for A's data
-> circular wait until the 40ms timer fires
A: finally sends the second packet
Real-time apps (Telnet, SSH, remote games) must set TCP_NODELAY.
4.4 TCP_CORK — The Opposite
"Buffer up, send in one shot." sendfile + TCP_CORK powers nginx static-file delivery.
int cork = 1;
setsockopt(fd, IPPROTO_TCP, TCP_CORK, &cork, sizeof(int)); /* hold partial frames */
writev(fd, iov, 10); /* headers + body accumulate in the kernel */
cork = 0;
setsockopt(fd, IPPROTO_TCP, TCP_CORK, &cork, sizeof(int)); /* uncork: flush full segments */
5. TCP Fast Open — Skip the Handshake
Each fresh connection spends a full round trip on the 3-way handshake before the request can even be sent, so handshake plus request costs at least 2 RTT to the first response byte.
5.1 TFO (RFC 7413, 2014)
First connection: server issues a cookie. Later connections send SYN with cookie + request payload.
1st: SYN -> SYN+ACK(cookie) -> ACK -> GET
2nd: SYN(cookie, GET) -> SYN+ACK(data)
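On Linux (3.7+), the client side is one call: sendto() with MSG_FASTOPEN folds connect and data-in-SYN together. A sketch, assuming a placeholder address and a server with net.ipv4.tcp_fastopen enabled:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(80);
    inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr);  /* placeholder */

    const char req[] = "GET / HTTP/1.1\r\nHost: example\r\n\r\n";
    /* 1st call: SYN goes out alone and the kernel caches the cookie.
     * Later calls: the request rides inside the SYN itself. */
    if (sendto(fd, req, strlen(req), MSG_FASTOPEN,
               (struct sockaddr *)&addr, sizeof addr) < 0)
        perror("sendto");
    return 0;
}
```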
5.2 Why It Flopped
- Middleboxes drop non-standard SYN.
- Server-side cookie state.
- HTTP/2 reuse mitigates the pain.
QUIC's 0-RTT handshake sidesteps this.
6. Common Errors Decoded
6.1 Connection refused
RST to SYN — no listener on that port.
6.2 Connection reset by peer
RST during ESTABLISHED — peer crashed, SO_LINGER 0, or firewall active reset.
6.3 Broken pipe
Write to a socket the peer already closed.
6.4 Connection timed out
SYNs go unanswered. Linux retries 6 times by default (net.ipv4.tcp_syn_retries), giving up after roughly two minutes. Network black hole or dead server.
6.5 No route to host
Routing table has no path — VPN dropped, misconfigured route.
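A sketch tying these messages to the errno values a C program actually sees (diagnose_send is an illustrative helper; MSG_NOSIGNAL turns the default SIGPIPE kill into a plain EPIPE return):

```c
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>

void diagnose_send(int fd, const void *buf, size_t len) {
    if (send(fd, buf, len, MSG_NOSIGNAL) < 0) {
        switch (errno) {
        case ECONNRESET: puts("connection reset by peer: RST mid-stream"); break;
        case EPIPE:      puts("broken pipe: writing after peer closed");   break;
        case ETIMEDOUT:  puts("timed out: retransmissions exhausted");     break;
        default:         perror("send");
        }
    }
}
```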
7. Production Tuning — Key sysctls
7.1 Backlog and Queues
net.core.somaxconn # listen backlog cap
net.ipv4.tcp_max_syn_backlog # SYN_RCVD queue
net.core.netdev_max_backlog # NIC to kernel queue
nginx's listen 80 backlog=65535 is silently clamped to net.core.somaxconn, so the kernel caps must be raised to match — see the sketch below.
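The clamp happens inside listen() itself, as this sketch shows — no error is returned when the kernel trims the backlog:

```c
#include <netinet/in.h>
#include <sys/socket.h>

int make_listener(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(80);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(fd, (struct sockaddr *)&addr, sizeof addr);

    /* Silently clamped to net.core.somaxconn; inspect with ss -ltn. */
    listen(fd, 65535);
    return fd;
}
```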
7.2 TIME_WAIT
net.ipv4.tcp_tw_reuse=1 # safe reuse of TIME_WAIT ports for outbound connects
net.ipv4.tcp_fin_timeout=30 # orphaned FIN_WAIT_2 timeout (not TIME_WAIT length)
net.ipv4.ip_local_port_range # widen the ephemeral port range
7.3 Keep-alive
net.ipv4.tcp_keepalive_time=600 # idle seconds before the first probe
net.ipv4.tcp_keepalive_intvl=60 # seconds between probes
net.ipv4.tcp_keepalive_probes=3 # failed probes before the connection dies
Defaults of 2 hours are too long for load-balanced services.
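These can also be set per socket with Linux's TCP_KEEP* options, overriding the sysctls — a sketch that detects a dead peer in roughly 13 minutes (600s idle, then 3 probes 60s apart) instead of 2+ hours:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

void enable_keepalive(int fd) {
    int on = 1, idle = 600, intvl = 60, cnt = 3;
    setsockopt(fd, SOL_SOCKET,  SO_KEEPALIVE,  &on,    sizeof on);
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,  sizeof idle);  /* idle before probing */
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof intvl); /* probe interval      */
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,   &cnt,   sizeof cnt);   /* probes before death */
}
```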
7.4 Buffers
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.ipv4.tcp_rmem="4096 131072 16777216"
net.ipv4.tcp_wmem="4096 131072 16777216"
Match the BDP: 10Gbps x 50ms = 62.5MB.
7.5 Congestion Control
net.ipv4.tcp_congestion_control=bbr
net.core.default_qdisc=fq
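The algorithm is also switchable per socket via Linux's TCP_CONGESTION sockopt — a sketch; setting a non-default algorithm as an unprivileged process may require it to appear in net.ipv4.tcp_allowed_congestion_control:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    char algo[16] = {0};
    socklen_t len = sizeof algo;
    getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, algo, &len);
    printf("default: %s\n", algo);   /* typically "cubic" */

    /* Opt this one socket into BBR. */
    setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "bbr", 3);
    return 0;
}
```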
8. QUIC — Leaving TCP Behind
8.1 TCP's Hard Limits
- Head-of-Line Blocking: one lost packet stalls all streams on the same TCP connection — HTTP/2 inherits this.
- Handshake cost: TCP 3-way + TLS 1.2 2-RTT = 3 RTT.
- Middlebox ossification: new TCP options get dropped by ISP firewalls; TFO and MPTCP both stumbled here.
- Kernel coupling: TCP improvements require kernel upgrades.
8.2 QUIC's Answer — User-Space Transport over UDP
Google experiment (2012), IETF standard (RFC 9000, 2021), HTTP/3 base.
- Built on UDP — middleboxes just see UDP.
- User-space library — ships with the app.
- TLS 1.3 integrated.
- True stream multiplexing — no HoL across streams.
8.3 0-RTT Handshake
Reconnects piggyback data on the first packet using a session ticket.
First: QUIC handshake (1 RTT)
Reconnect: data immediately (0 RTT)
8.4 Connection Migration
Connection ID survives IP changes. WiFi to 4G handover without dropping.
8.5 HTTP/3 = HTTP over QUIC
Same semantics as HTTP/2, transport swap to QUIC. Major browsers and CDNs (Cloudflare, Akamai, Fastly) support it.
8.6 QUIC Downsides
- Higher CPU (mandatory crypto, user space).
- UDP throttling or outright blocking on some middleboxes and enterprise networks.
- Implementation complexity.
- Harder to observe — encrypted payloads, fewer tools.
Google and Meta measured ~10% latency improvement on mobile; desktop gains are smaller.
9. Debugging Toolkit
9.1 Inspect Connections
ss -tan
ss -tan state established
ss -tnp | grep :443
ss -s
9.2 Packet Capture
tcpdump -i any -w capture.pcap 'port 443'
wireshark capture.pcap
9.3 Congestion Control Observation
ss -ti
Sample:
ESTAB ... cubic cwnd:10 ssthresh:7 bytes_acked:1234 bytes_received:5678 rtt:25.3/3.1 rcv_rtt:25.1 delivered:10 ...
Small cwnd with many retrans = congestion.
9.4 bpftrace
bpftrace -e 'kprobe:tcp_retransmit_skb { printf("retrans pid=%d\n", pid); }'
9.5 mtr
mtr -r -c 100 example.com
Per-hop loss + RTT for ISP diagnosis.
10. Closing — 50 Years of TCP, and What Comes Next
TCP dates to Vint Cerf and Bob Kahn's 1974 paper. Highlights:
- 1988: Jacobson congestion control.
- 1992: Window Scaling.
- 1996: SACK.
- 2006: Cubic.
- 2016: BBR.
- 2021: QUIC (RFC).
QUIC pulled transport into user space. Apps can evolve their transport without kernel upgrades — user-space stacks like Meta's mvfst and Cloudflare's quiche (heirs to Google's original gQUIC experiment) will shape the next decade. Meanwhile TCP stays: 80%+ of traffic still runs on it.
Next post: TLS/SSL and PKI internals — cert chains, cipher suites, 0-RTT replay risk, QUIC crypto integration, and post-quantum.
References
- RFC 9293 — Transmission Control Protocol (2022 revision).
- Van Jacobson — "Congestion Avoidance and Control" (SIGCOMM, 1988).
- Cardwell et al. — "BBR: Congestion-Based Congestion Control" (ACM Queue, 2016).
- RFC 9000 — QUIC.
- RFC 9114 — HTTP/3.
- "Computer Networks: A Systems Approach" — Peterson & Davie.
- Brendan Gregg — Linux Performance blog.
- Marek Majkowski (Cloudflare) — TCP internals writing.
- "High Performance Browser Networking" — Ilya Grigorik (O'Reilly, 2013).