TCP Network Stack Deep Dive — State Machine, Congestion Control (Cubic vs BBR), Nagle, Delayed ACK, and the Evolution to QUIC (2025)
0. The TCP You Think You Know

Every HTTP request rides on TCP. Yet consider:

  • Why does TIME_WAIT in ss -tan sit around for 60 seconds?
  • What distinguishes "connection reset by peer" from "broken pipe"?
  • Why can sending 100 small 1KB messages take 4 seconds?
  • Why does iperf deliver only 1Gbps on a 10Gbps link?
  • Why did Google build QUIC on UDP instead of TCP?

The answers live below: TCP state machine, congestion control evolution, infamous interaction bugs, and the QUIC era.

1. TCP State Machine — 11 States

1.1 Connection Setup: 3-Way Handshake

Client                        Server
  [CLOSED -> SYN_SENT]
  |  SYN (seq=x)           |
  |----------------------->|  [LISTEN -> SYN_RCVD]
  |  SYN+ACK (seq=y, ack=x+1)
  |<-----------------------|
  [SYN_SENT -> ESTABLISHED]
  |  ACK (ack=y+1)         |
  |----------------------->|  [SYN_RCVD -> ESTABLISHED]

Why three, not two? Bidirectional ISN (Initial Sequence Number) synchronization. Each direction must announce its own ISN and receive acknowledgment.

1.2 Connection Teardown: 4-Way Handshake

A                               B
 [ESTABLISHED -> FIN_WAIT_1]
 |   FIN        |
 |------------->|  [ESTABLISHED -> CLOSE_WAIT]
 |   ACK        |
 |<-------------|
 [FIN_WAIT_1 -> FIN_WAIT_2]
 |   FIN        |
 |<-------------|  [CLOSE_WAIT -> LAST_ACK]
 |   ACK        |
 |------------->|
 [FIN_WAIT_2 -> TIME_WAIT]      [LAST_ACK -> CLOSED]

Four, because TCP is full-duplex. Each direction closes independently.

1.3 TIME_WAIT — Misunderstood

The closing side waits 2MSL (twice the Maximum Segment Lifetime; MSL is nominally 30 seconds to 2 minutes). On Linux the TIME_WAIT period is hard-coded to 60 seconds. Reasons:

  1. Prevent stray packets from joining a new connection: quickly reopened port pairs could receive delayed packets from the previous connection.
  2. Handle lost final ACK: if the peer retransmits FIN, we must respond with ACK; a CLOSED state would reply with RST.

1.4 TIME_WAIT Explosion

Short-lived outbound connections pile up tens of thousands of TIME_WAIT sockets, exhausting the ephemeral port range.

Wrong fix: skipping TIME_WAIT entirely, e.g. the old net.ipv4.tcp_tw_recycle (broken behind NAT, removed in Linux 4.12). Without the 2MSL wait, delayed segments can corrupt data on a new connection reusing the same port pair.

Right fixes:

  • net.ipv4.tcp_tw_reuse=1: safely reuse TIME_WAIT ports for outgoing connections (relies on TCP timestamps).
  • Keep-Alive connections (HTTP/1.1 persistent, HTTP/2 multiplexing).
  • Connection pools in DB drivers and HTTP clients.

1.5 CLOSE_WAIT — App Bug Signal

TIME_WAIT is normal; stacked CLOSE_WAIT is not. The peer sent FIN but your app never called close().

$ ss -tan | grep CLOSE_WAIT | wc -l
50000   # socket leak

Typical cause: an error path that returns or throws before close() is called, e.g. a missing try/finally (or defer) around the file descriptor.
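
In C the equivalent is a single cleanup label so that every exit path reaches close(); a minimal sketch (process() is a hypothetical stand-in for the application logic):

#include <sys/socket.h>
#include <unistd.h>

extern int process(int fd);        /* hypothetical application logic */

int handle_conn(int listen_fd) {
    int rc = -1;
    int fd = accept(listen_fd, NULL, NULL);
    if (fd < 0)
        return -1;
    if (process(fd) < 0)           /* the peer may already have sent FIN... */
        goto out;                  /* ...but we still have to close() */
    rc = 0;
out:
    close(fd);                     /* skipping this parks the socket in CLOSE_WAIT */
    return rc;
}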

2. TCP Reliability Mechanisms

2.1 Sequence Numbers and ACK

Every byte has a sequence number. The receiver ACKs the next expected byte.

Send:  [1000][1001][1002][1003]
Recv:  ACK=1004  (got 1000-1003, expect 1004)

If a segment is lost, the receiver repeats the same ACK (duplicate ACK). Three dup ACKs trigger Fast Retransmit.

2.2 Retransmission — RTO and Fast Retransmit

  • RTO: dynamic timeout based on RTT measurement (Jacobson, 1988).
  • Fast Retransmit: skip timeout on 3 dup ACKs; recovery within milliseconds.
  • SACK: precise "got 1000-2000, 3000-4000, missing 2000-3000".

2.3 Flow Control — Receive Window

Receivers advertise rwnd. Senders only send while unacked_bytes < rwnd.

2.4 Window Scaling

TCP's 16-bit window field tops out at 65,535 bytes, but 10Gbps x 100ms RTT needs 125MB in flight. Window scaling (RFC 1323, now RFC 7323) multiplies rwnd by 2^shift, with a shift of up to 14.

sysctl net.ipv4.tcp_window_scaling
sysctl net.core.rmem_max
sysctl net.ipv4.tcp_rmem
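
The per-socket knob is SO_RCVBUF, which is capped by net.core.rmem_max; a sketch (the 16MB figure is illustrative). Note that setting it explicitly turns off the kernel's tcp_rmem autotuning for that socket, and getsockopt reports roughly double the requested value because the kernel adds bookkeeping overhead:

#include <stdio.h>
#include <sys/socket.h>

/* Request a 16MB receive buffer; the kernel clamps it to net.core.rmem_max. */
int set_rcvbuf(int fd) {
    int size = 16 * 1024 * 1024;
    socklen_t len = sizeof(size);
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) < 0)
        return -1;
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, &len);
    printf("effective rcvbuf: %d bytes\n", size);   /* roughly 2x the requested size */
    return 0;
}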

3. Congestion Control

3.1 The 1986 Congestion Collapse

UC Berkeley to LBL link dropped from 32Kbps to 40bps — 1000x degradation. Cause: retransmit storms. Van Jacobson's 1988 paper introduced congestion control.

3.2 Congestion Window (cwnd)

send = min(rwnd, cwnd)

rwnd comes from the receiver; cwnd is the sender's estimate of network capacity.

3.3 Slow Start

Initial cwnd = 10 MSS. Each ACK increments cwnd by 1 MSS, doubling per RTT.

cwnd: 10 -> 20 -> 40 -> 80 -> 160

Exponential growth in practice despite the "slow" name.

3.4 Congestion Avoidance

After ssthresh, cwnd += 1 MSS per RTT (linear).

3.5 Loss Detection

  • 3 Dup ACK (Fast Retransmit): halve cwnd.
  • Timeout: cwnd = 1 MSS, restart Slow Start.

This is AIMD (Additive Increase, Multiplicative Decrease) — the Reno core.
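
A toy simulation makes the shape of these phases visible (purely illustrative; real TCP is clocked by ACKs, and the loss points here are invented):

#include <stdio.h>

/* Toy AIMD trace: slow start doubles cwnd per RTT up to ssthresh,
 * congestion avoidance adds 1 MSS per RTT, and a simulated
 * 3-dup-ACK event at RTTs 8 and 14 halves the window. */
int main(void) {
    double cwnd = 10, ssthresh = 64;                 /* in units of MSS */
    for (int rtt = 1; rtt <= 20; rtt++) {
        if (rtt == 8 || rtt == 14) {                 /* pretend fast retransmit fired */
            ssthresh = cwnd / 2;
            cwnd = ssthresh;                         /* multiplicative decrease */
        } else if (cwnd < ssthresh) {
            cwnd *= 2;                               /* slow start */
            if (cwnd > ssthresh) cwnd = ssthresh;    /* don't overshoot the threshold */
        } else {
            cwnd += 1;                               /* additive increase */
        }
        printf("rtt=%2d  cwnd=%6.1f  ssthresh=%6.1f\n", rtt, cwnd, ssthresh);
    }
    return 0;
}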

3.6 Reno to Cubic

Reno is too conservative on long fat networks (paths with a high bandwidth-delay product). CUBIC (the Linux default since 2.6.19) grows cwnd as a cubic function of time since the last loss, remembers the window size at that loss, and probes beyond it.

3.7 BBR — Google's 2016 Revolution

Traditional algorithms assume loss = congestion. Modern networks see non-congestive loss (WiFi noise, bufferbloat).

BBR (Bottleneck Bandwidth and RTT) idea:

"Measure actual bandwidth and RTT directly; size cwnd from them."

  • Periodically probe for bandwidth.
  • Track minimum RTT (detect bufferbloat).
  • Keep cwnd near BW x RTT — minimize queuing.

3.8 BBR Results

Google's measurements on google.com and YouTube after deploying BBR:

  • 4% reduction in YouTube rebuffering.
  • Lower google.com response time.
  • Large gains in developing regions.

Linux 4.9+ includes BBR. Enable with sysctl net.ipv4.tcp_congestion_control=bbr.
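
The algorithm can also be chosen per socket with TCP_CONGESTION (the bbr module must be loaded, and non-root processes are limited to net.ipv4.tcp_allowed_congestion_control); a sketch:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Opt one socket into BBR without touching the system-wide default. */
int use_bbr(int fd) {
    const char *algo = "bbr";
    return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, algo, strlen(algo));
}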

3.9 BBR Fairness Debate

BBR v1 took more bandwidth than Cubic when coexisting — a fairness issue. Improved in BBRv2 (2019) and v3 (2023).

Algorithm   Trait             Use
Reno        AIMD, classic     Legacy
Cubic       Cubic growth      Linux default, WAN
BBR         Direct BW/RTT     High speed, non-congestive loss
DCTCP       ECN marking       Data center
CoPA        Low latency       Video conferencing

4. Nagle and Delayed ACK — The Worst Pairing

4.1 Nagle

Small packets are expensive (40-byte header for 1-byte payload = 2.5% efficiency).

Nagle: "If an unacked small packet is in flight, buffer further small data until ACK arrives."

int flag = 1;
setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));   /* disable Nagle */

Nagle helps throughput but adds latency; TCP_NODELAY (above) turns it off.

4.2 Delayed ACK

The receiver batches ACKs too: piggyback on the next outgoing data, or wait up to 200ms (the Linux delayed-ACK timer typically fires after about 40ms).
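
On Linux a receiver can suppress the delay per socket with TCP_QUICKACK; the flag is not sticky, so latency-sensitive code re-arms it after each read (a sketch):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* ACK immediately instead of waiting for the delayed-ACK timer.
 * TCP_QUICKACK resets itself, so it is set again after every read. */
ssize_t read_and_ack(int fd, void *buf, size_t len) {
    ssize_t n = read(fd, buf, len);
    int one = 1;
    setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
    return n;
}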

4.3 Nagle + Delayed ACK = 40ms Stall

A: sends the first small packet
B: defers the ACK (more data might follow)
A: wants to send the second small packet, but Nagle holds it until the ACK arrives
   -> Nagle waits for the ACK
   -> Delayed ACK waits for more data
   -> deadlock until the ~40ms delayed-ACK timer fires
A: finally sends the second packet

Real-time apps (Telnet, SSH, remote games) must set TCP_NODELAY.

4.4 TCP_CORK — The Opposite

"Buffer up, send in one shot." sendfile + TCP_CORK powers nginx static-file delivery.

int cork = 1;
setsockopt(fd, IPPROTO_TCP, TCP_CORK, &cork, sizeof(int));   /* cork: hold partial segments */
writev(fd, iov, 10);                                         /* queue headers + body */
cork = 0;
setsockopt(fd, IPPROTO_TCP, TCP_CORK, &cork, sizeof(int));   /* uncork: flush full segments */

5. TCP Fast Open — Skip the Handshake

Every new connection spends roughly 1.5 RTT (SYN, SYN+ACK, then ACK) before the request even reaches the server.

5.1 TFO (RFC 7413, 2014)

First connection: server issues a cookie. Later connections send SYN with cookie + request payload.

1st:  SYN -> SYN+ACK(cookie) -> ACK -> GET
2nd:  SYN(cookie, GET) -> SYN+ACK(data)
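
On Linux (with net.ipv4.tcp_fastopen enabled) the server opts in with TCP_FASTOPEN, whose value is the pending-TFO queue length, and a client on kernel 4.11+ can use TCP_FASTOPEN_CONNECT so the first write() after connect() rides inside the SYN; a sketch:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Server: accept data-carrying SYNs, queueing up to 16 pending TFO requests. */
int tfo_server(int listen_fd) {
    int qlen = 16;
    return setsockopt(listen_fd, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen));
}

/* Client (Linux 4.11+): defer the handshake so the next write() is carried
 * in the SYN together with the cached Fast Open cookie. */
int tfo_client(int fd) {
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN_CONNECT, &one, sizeof(one));
}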

5.2 Why It Flopped

  • Middleboxes drop non-standard SYN.
  • Server-side cookie state.
  • HTTP/2 reuse mitigates the pain.

QUIC's 0-RTT handshake sidesteps this.

6. Common Errors Decoded

6.1 Connection refused

RST to SYN — no listener on that port.

6.2 Connection reset by peer

RST during ESTABLISHED — peer crashed, SO_LINGER 0, or firewall active reset.

6.3 Broken pipe

Write to a socket the peer already closed.

6.4 Connection timed out

SYNs gone unanswered (5-7 retries, 60s+). Network black hole or dead server.

6.5 No route to host

Routing table has no path — VPN dropped, misconfigured route.
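
In C these arrive as distinct errno values from connect()/read()/write(), so handlers can branch on them; a sketch (note that write() only returns EPIPE if SIGPIPE is ignored or MSG_NOSIGNAL is used):

#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Map the errors above to the errno constants the socket calls return. */
void explain_errno(int err) {
    switch (err) {
    case ECONNREFUSED: puts("RST in reply to our SYN: nothing listening");   break;
    case ECONNRESET:   puts("RST on an established connection");             break;
    case EPIPE:        puts("write after the peer closed (broken pipe)");    break;
    case ETIMEDOUT:    puts("SYN or data retransmissions exhausted");        break;
    case EHOSTUNREACH: puts("no route to host");                             break;
    default:           printf("other: %s\n", strerror(err));
    }
}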

7. Production Tuning — Key sysctls

7.1 Backlog and Queues

net.core.somaxconn            # listen backlog cap
net.ipv4.tcp_max_syn_backlog  # SYN_RCVD queue
net.core.netdev_max_backlog   # NIC to kernel queue

nginx's listen 80 backlog=65535 only takes effect if the kernel caps (net.core.somaxconn) are raised to match, as sketched below.
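
The backlog passed to listen() is silently clamped to net.core.somaxconn, so the application value and the sysctl have to be raised together; a sketch (port and backlog are illustrative, error handling omitted):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Ask for a deep accept queue; the kernel clamps it to net.core.somaxconn. */
int make_listener(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(8080);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 65535);            /* effective backlog = min(65535, somaxconn) */
    return fd;
}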

7.2 TIME_WAIT

net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_fin_timeout=30
net.ipv4.ip_local_port_range

7.3 Keep-alive

net.ipv4.tcp_keepalive_time=600
net.ipv4.tcp_keepalive_intvl=60
net.ipv4.tcp_keepalive_probes=3

Defaults of 2 hours are too long for load-balanced services.
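
The same three knobs exist per socket, which avoids touching system-wide sysctls on shared hosts; a sketch using the 600/60/3 values above:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Probe after 10 idle minutes, then every 60s, give up after 3 failed probes. */
int enable_keepalive(int fd) {
    int on = 1, idle = 600, intvl = 60, cnt = 3;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,  sizeof(idle));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,   &cnt,   sizeof(cnt));
    return 0;
}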

7.4 Buffers

net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.ipv4.tcp_rmem="4096 131072 16777216"
net.ipv4.tcp_wmem="4096 131072 16777216"

Match BDP: 10Gbps x 50ms = 62MB.

7.5 Congestion Control

net.ipv4.tcp_congestion_control=bbr
net.core.default_qdisc=fq

8. QUIC — Leaving TCP Behind

8.1 TCP's Hard Limits

  1. Head-of-Line Blocking: one lost packet stalls all streams on the same TCP connection — HTTP/2 inherits this.
  2. Handshake cost: TCP 3-way + TLS 1.2 2-RTT = 3 RTT.
  3. Middlebox ossification: new TCP options get dropped by ISP firewalls. TFO and MPTCP failed here.
  4. Kernel coupling: TCP improvements require kernel upgrades.

8.2 QUIC's Answer — User-Space Transport over UDP

Google experiment (2012), IETF standard (RFC 9000, 2021), HTTP/3 base.

  • Built on UDP — middleboxes just see UDP.
  • User-space library — ships with the app.
  • TLS 1.3 integrated.
  • True stream multiplexing — no HoL across streams.

8.3 0-RTT Handshake

Reconnects piggyback data on the first packet using a session ticket.

First:     QUIC handshake (1 RTT)
Reconnect: data immediately (0 RTT)

8.4 Connection Migration

Connection ID survives IP changes. WiFi to 4G handover without dropping.

8.5 HTTP/3 = HTTP over QUIC

Same semantics as HTTP/2, transport swap to QUIC. Major browsers and CDNs (Cloudflare, Akamai, Fastly) support it.

8.6 QUIC Downsides

  • Higher CPU (mandatory crypto, user space).
  • Some middleboxes and networks throttle or block UDP, forcing a TCP fallback.
  • Implementation complexity.
  • Harder to observe — encrypted payloads, fewer tools.

Google and Meta measured ~10% latency improvement on mobile; desktop gains are smaller.

9. Debugging Toolkit

9.1 Inspect Connections

ss -tan
ss -tan state established
ss -tnp | grep :443
ss -s

9.2 Packet Capture

tcpdump -i any -w capture.pcap 'port 443'
wireshark capture.pcap

9.3 Congestion Control Observation

ss -ti

Sample:

ESTAB  ... cubic cwnd:10 ssthresh:7 bytes_acked:1234 bytes_received:5678 rtt:25.3/3.1 rcv_rtt:25.1 delivered:10 ...

Small cwnd with many retrans = congestion.
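
The same counters are exposed to the application through the TCP_INFO socket option, so a service can log them for its own connections; a minimal sketch (Linux-specific struct tcp_info):

#include <linux/tcp.h>       /* struct tcp_info, TCP_INFO */
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

/* Print the sender's view: cwnd, ssthresh, smoothed RTT (us), total retransmits. */
void dump_tcp_info(int fd) {
    struct tcp_info ti;
    socklen_t len = sizeof(ti);
    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
        printf("cwnd=%u ssthresh=%u rtt=%uus retrans=%u\n",
               ti.tcpi_snd_cwnd, ti.tcpi_snd_ssthresh,
               ti.tcpi_rtt, ti.tcpi_total_retrans);
}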

9.4 bpftrace

bpftrace -e 'kprobe:tcp_retransmit_skb { printf("retrans pid=%d\n", pid); }'

9.5 mtr

mtr -r -c 100 example.com

Per-hop loss + RTT for ISP diagnosis.

10. Closing — 50 Years of TCP, and What Comes Next

TCP dates to Vint Cerf and Bob Kahn's 1974 paper. Highlights:

  • 1988: Jacobson congestion control.
  • 1992: Window Scaling.
  • 1996: SACK.
  • 2006: Cubic.
  • 2016: BBR.
  • 2021: QUIC (RFC).

QUIC pulled transport into user space. Apps can evolve their transport without kernel upgrades — Facebook's mvfst, Cloudflare's quiche, and Google's Chromium QUIC stack will shape the next decade. Meanwhile TCP stays: 80%+ of traffic still runs on it.

Next post: TLS/SSL and PKI internals — cert chains, cipher suites, 0-RTT replay risk, QUIC crypto integration, and post-quantum.

References

  • RFC 9293 — Transmission Control Protocol (2022 revision).
  • Van Jacobson — "Congestion Avoidance and Control" (SIGCOMM, 1988).
  • Cardwell et al. — "BBR: Congestion-Based Congestion Control" (ACM Queue, 2016).
  • RFC 9000 — QUIC.
  • RFC 9114 — HTTP/3.
  • "Computer Networks: A Systems Approach" — Peterson & Davie.
  • Brendan Gregg — Linux Performance blog.
  • Marek Majkowski (Cloudflare) — TCP internals writing.
  • "High Performance Browser Networking" — Ilya Grigorik (O'Reilly, 2013).
