TCP/IP & Network Programming Deep Dive 2025: Sockets, HTTP/3, QUIC, DNS, TLS 1.3
Author: Youngju Kim (@fjvbn20031)
1. Why Deep Network Programming Knowledge Matters
For modern software engineers, networking is no longer something the infrastructure team handles alone. Latency between microservices determines P99 response times, HTTP/3 adoption shapes the experience of mobile users on lossy networks, and a single TLS misconfiguration can fail a security audit.
Core topics covered in this guide:
- TCP internals and state machine
- Congestion control algorithms (Cubic, BBR, BBRv2)
- Socket programming and I/O multiplexing (epoll, kqueue, io_uring)
- HTTP/1.1 to HTTP/2 to HTTP/3 (QUIC) evolution
- DNS resolution and security (DNSSEC, DoH/DoT)
- TLS 1.3 handshake and 0-RTT
- Network debugging tools (tcpdump, Wireshark, ss)
- Performance tuning (Nagle, TCP_NODELAY, kernel parameters)
2. TCP Protocol Deep Dive
2.1 3-Way Handshake
The core process for establishing a TCP connection:
Client Server
| |
|--- SYN (seq=x) ------->| (1) Client requests connection
| |
|<-- SYN-ACK ------------| (2) Server accepts + sends its own SYN
| (seq=y, ack=x+1) |
| |
|--- ACK (ack=y+1) ----->| (3) Client acknowledges server's SYN
| |
|==== Connection Open ====|
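Why 3-way? Each side must learn the other's Initial Sequence Number (ISN) and have its own ISN acknowledged. With only two messages, the server would never learn whether the client received its SYN-ACK, so it could not tell a live client from a stale or duplicate SYN.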
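The sequence and acknowledgment numbers in the diagram can be traced with a small model. This is only an illustrative sketch: the function name and the dictionary layout are assumptions made for this example, not part of any socket API.

```python
MOD = 2**32  # TCP sequence numbers are 32-bit and wrap around


def three_way_handshake(client_isn, server_isn):
    """Return the three handshake segments as dicts (illustrative model only)."""
    syn = {"flags": "SYN", "seq": client_isn}
    syn_ack = {"flags": "SYN-ACK", "seq": server_isn,
               "ack": (client_isn + 1) % MOD}
    ack = {"flags": "ACK", "seq": (client_isn + 1) % MOD,
           "ack": (server_isn + 1) % MOD}
    return [syn, syn_ack, ack]


for segment in three_way_handshake(client_isn=1000, server_isn=5000):
    print(segment)
```

Each side acknowledges the other's ISN plus one, because the SYN flag itself consumes one sequence number.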
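// Server socket creation example (error handling omitted for brevity)
int server_fd = socket(AF_INET, SOCK_STREAM, 0);
struct sockaddr_in addr;
memset(&addr, 0, sizeof(addr));            // zero padding fields before use
addr.sin_family = AF_INET;
addr.sin_addr.s_addr = htonl(INADDR_ANY);  // listen on all interfaces
addr.sin_port = htons(8080);
bind(server_fd, (struct sockaddr *)&addr, sizeof(addr));
// backlog: on Linux this caps the accept queue (completed handshakes),
// itself limited by net.core.somaxconn; the SYN queue is sized
// separately via net.ipv4.tcp_max_syn_backlog
listen(server_fd, SOMAXCONN);
// Dequeue a completed 3-way handshake from the accept queue
int client_fd = accept(server_fd, NULL, NULL);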
2.2 4-Way Termination
Connection teardown happens independently in each direction:
Client Server
| |
|--- FIN (seq=u) ------->| (1) Client: "I have no more data"
| |
|<-- ACK (ack=u+1) ------| (2) Server: "Acknowledged" (can still send)
| |
| ... server sends remaining data ...
| |
|<-- FIN (seq=w) --------| (3) Server: "I have no more data either"
| |
|--- ACK (ack=w+1) ----->| (4) Client: "Acknowledged"
| |
|=== TIME_WAIT (2MSL) ===| Client enters TIME_WAIT state
2.3 TCP State Machine
(Active-open path shown; a server instead goes CLOSED -> LISTEN -> SYN_RCVD -> ESTABLISHED.)
                CLOSED
                   |
             (active OPEN)
                   |
               SYN_SENT
                   |
          (SYN-ACK received)
                   |
              ESTABLISHED
              /         \
   (active CLOSE)     (passive CLOSE)
            /             \
      FIN_WAIT_1       CLOSE_WAIT
           |                |
      FIN_WAIT_2        LAST_ACK
           |                |
       TIME_WAIT         CLOSED
           |
        CLOSED
2.4 TIME_WAIT and SO_REUSEADDR
TIME_WAIT lasts for 2MSL, twice the Maximum Segment Lifetime. RFC 793 suggests an MSL of 2 minutes, while Linux uses a fixed TIME_WAIT of 60 seconds.
Why TIME_WAIT is needed:
- Prevents delayed packets from interfering with new connections
- Allows retransmission if the final ACK is lost
// Prevent "Address already in use" on server restart
int optval = 1;
setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR,
&optval, sizeof(optval));
// Linux also supports SO_REUSEPORT (load balancing)
setsockopt(server_fd, SOL_SOCKET, SO_REUSEPORT,
&optval, sizeof(optval));
Handling excessive TIME_WAIT:
# Check TIME_WAIT socket count (ss -s reports it as "timewait", so count the sockets directly)
ss -Htan state time-wait | wc -l
# Kernel parameter tuning (caution: side effects possible)
# tcp_tw_reuse applies only to outgoing connections and requires TCP timestamps
sysctl -w net.ipv4.tcp_tw_reuse=1
# Maximum TIME_WAIT sockets
sysctl -w net.ipv4.tcp_max_tw_buckets=262144
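To see why excessive TIME_WAIT matters for clients, a rough upper bound on the outbound connection rate can be worked out. The sketch below assumes the Linux default ephemeral port range (32768 to 60999) and a 60-second TIME_WAIT; the function name is hypothetical, and real limits also depend on tcp_tw_reuse and on how many destination addresses are involved.

```python
def max_outbound_rate(port_low=32768, port_high=60999, time_wait_s=60):
    """Upper bound on new connections per second to ONE destination ip:port
    when every closed connection parks its local port in TIME_WAIT."""
    ports = port_high - port_low + 1
    return ports / time_wait_s


print(f"about {max_outbound_rate():.0f} connections/s per destination")
```

With these assumed defaults the bound is roughly 470 connections per second, which is why busy proxies and load generators run into port exhaustion.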
3. TCP Congestion Control
3.1 Why Congestion Control Is Necessary
When all hosts send packets without limit during congestion, router buffers overflow causing packet loss, and retransmissions make congestion even worse -- a phenomenon called Congestion Collapse.
3.2 Core Algorithms
cwnd (Congestion Window)
^
| * * * *
| * * <-- Congestion Avoidance (linear increase)
| * *
| * \
| * \ Packet loss detected
| * \
| * <-- Slow Start * (cwnd halved)
| * (exponential growth) *
| * *
| * * ...
|*
+-----------------------------------------> Time
ssthresh
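Slow Start: cwnd starts at a small initial window (classically 1 MSS; modern stacks use 10 MSS per RFC 6928) and increases by 1 MSS per ACK, doubling every RTT
Congestion Avoidance: Once cwnd reaches ssthresh, growth becomes linear at about 1 MSS per RTT
Fast Retransmit: Upon receiving 3 duplicate ACKs, retransmit the missing segment immediately without waiting for the retransmission timeout
Fast Recovery: After a fast retransmit, set ssthresh to half of the current cwnd and continue from there in congestion avoidance, instead of resetting cwnd to 1 MSS as a timeout does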
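The interplay of these phases can be sketched with a simplified Reno-style model. This is an assumption-laden toy, not a kernel implementation: it works in whole MSS units, advances one RTT per step, and treats every loss as detected by duplicate ACKs. The function and parameter names are made up for the example.

```python
def simulate_cwnd(rounds, ssthresh, loss_rounds, initial_cwnd=1):
    """Return cwnd (in MSS) at the start of each RTT for a simplified
    Reno-style sender.

    loss_rounds: set of round indices in which a loss is detected via
    3 duplicate ACKs (fast retransmit / fast recovery).
    """
    cwnd = initial_cwnd
    history = []
    for rtt in range(rounds):
        history.append(cwnd)
        if rtt in loss_rounds:
            ssthresh = max(cwnd // 2, 2)    # multiplicative decrease
            cwnd = ssthresh                 # fast recovery: skip slow start
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)  # slow start: double per RTT
        else:
            cwnd += 1                       # congestion avoidance: +1 MSS per RTT
    return history


print(simulate_cwnd(rounds=12, ssthresh=16, loss_rounds={7}))
# [1, 2, 4, 8, 16, 17, 18, 19, 9, 10, 11, 12]
```

The output shows exponential growth up to ssthresh, linear growth afterwards, and a halving (rather than a reset to 1) when the loss is detected in round 7.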
3.3 Cubic vs BBR
Algorithm Comparison:
+----------+--------+-----------+-------------------------------+
| Algorithm | Type | Detection | Characteristics |
+----------+--------+-----------+-------------------------------+
| Reno | Loss | Pkt loss | Basic AIMD |
| Cubic | Loss | Pkt loss | Cubic function, Linux default |
| BBR | Model | RTT/BW | Bandwidth-delay model, Google |
| BBRv2 | Model | RTT/BW | Improved BBR, better fairness |
+----------+--------+-----------+-------------------------------+
Cubic: Linux default congestion control. Window size follows a cubic function.
# Check current congestion control algorithm
sysctl net.ipv4.tcp_congestion_control
# Output: net.ipv4.tcp_congestion_control = cubic
# Switch to BBR
sysctl -w net.ipv4.tcp_congestion_control=bbr
# List available algorithms
sysctl net.ipv4.tcp_available_congestion_control
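The cubic growth curve can be written out directly. The sketch below uses the window function and the constants C = 0.4 and beta = 0.7 from the CUBIC RFC; the function name is hypothetical, and a real implementation adds further details (TCP-friendly region, fast convergence) that are left out here.

```python
def cubic_window(t, w_max, c=0.4, beta=0.7):
    """CUBIC window (in MSS) t seconds after the last loss event.

    W(t) = C * (t - K)^3 + W_max, where K = cbrt(W_max * (1 - beta) / C)
    is the time needed to climb back to W_max.
    """
    k = (w_max * (1 - beta) / c) ** (1 / 3)
    return c * (t - k) ** 3 + w_max


for t in (0, 2, 4, 6, 8):
    print(t, round(cubic_window(t, w_max=100), 1))
```

Right after a loss (t = 0) the window is beta * W_max. It then grows quickly, flattens as it approaches the previous maximum, and probes beyond it more aggressively afterwards.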
BBR (Bottleneck Bandwidth and RTT):
- Measures actual bandwidth and RTT to determine optimal send rate (not relying on packet loss)
- Often sustains much higher throughput than Cubic on links with random, non-congestive loss (satellite, wireless)
- Downside: BBRv1 can be unfair to loss-based flows sharing the same bottleneck
# Enable BBR (kernel 4.9+)
modprobe tcp_bbr
echo "tcp_bbr" >> /etc/modules-load.d/bbr.conf
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr
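BBR's model boils down to two estimates: the bottleneck bandwidth (largest recent delivery rate) and the propagation delay (smallest recent RTT). Their product is the bandwidth-delay product (BDP), the amount of data that should be in flight. The sketch below is a simplification with hypothetical names and sample values; real BBR uses time-windowed filters and pacing gains.

```python
def bbr_bdp_bytes(delivery_rate_samples_bps, rtt_samples_s):
    """Estimate the bandwidth-delay product from recent samples.

    BtlBw  ~ max of recent delivery-rate samples (bits/s)
    RTprop ~ min of recent RTT samples (seconds)
    """
    btl_bw = max(delivery_rate_samples_bps)
    rt_prop = min(rtt_samples_s)
    return btl_bw * rt_prop / 8  # bits -> bytes


bdp = bbr_bdp_bytes([80e6, 100e6, 95e6], [0.045, 0.040, 0.052])
print(f"BDP = {bdp:.0f} bytes")
```

For a 100 Mbps bottleneck and a 40 ms minimum RTT this gives about 500,000 bytes in flight, independent of how many packets happen to be lost.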
4. UDP: When and Why
4.1 TCP vs UDP
+------------+----------------------------+----------------------------+
| Feature | TCP | UDP |
+------------+----------------------------+----------------------------+
| Connection | Connection-oriented (3-way)| Connectionless |
| Reliability| Guaranteed (retransmit) | No guarantee |
| Flow Ctrl | Yes (sliding window) | None |
| Congestion | Yes | None (app-level) |
| Header | 20-60 bytes | 8 bytes |
| Latency | Higher (handshake+retrans) | Lower |
| Use Cases | HTTP, SSH, DB | DNS, gaming, streaming |
+------------+----------------------------+----------------------------+
4.2 UDP Use Cases
Gaming: Per-frame position data -- only the latest value matters. The next frame is more important than retransmitting the previous one.
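# Simple UDP game server
import json
import socket
import struct

def parse_position(data):
    # Hypothetical wire format: two little-endian floats (x, y)
    return struct.unpack('<ff', data[:8])

def serialize_game_state(players):
    # Hypothetical format: JSON map of "ip:port" -> [x, y]
    return json.dumps({f'{ip}:{port}': pos
                       for (ip, port), pos in players.items()}).encode()

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(('0.0.0.0', 9999))
players = {}
while True:
    data, addr = sock.recvfrom(1024)
    if len(data) < 8:
        continue  # Ignore malformed datagrams
    # Update player position (loss is acceptable)
    players[addr] = parse_position(data)
    # Broadcast game state to all players
    state = serialize_game_state(players)
    for player_addr in players:
        sock.sendto(state, player_addr)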
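Because UDP may reorder datagrams as well as drop them, "only the latest value matters" usually implies a sequence number in each update. The following sketch assumes a hypothetical packet layout (32-bit sequence number plus two floats) and ignores sequence wraparound.

```python
import struct

PACKET = struct.Struct("!Iff")  # hypothetical layout: uint32 seq, float x, float y


def apply_update(latest, player, datagram):
    """Keep only the newest position per player; drop stale or reordered datagrams."""
    seq, x, y = PACKET.unpack(datagram)
    if player in latest and seq <= latest[player][0]:
        return False               # older than what we already have
    latest[player] = (seq, x, y)
    return True


latest = {}
apply_update(latest, "p1", PACKET.pack(2, 1.0, 1.0))
apply_update(latest, "p1", PACKET.pack(1, 9.0, 9.0))  # arrives late, ignored
print(latest["p1"])  # (2, 1.0, 1.0)
```

No retransmission is requested for the late packet. The receiver discards it and waits for the next frame.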
DNS: Single request-response pattern, so TCP handshake overhead is unnecessary for typical queries. DNS falls back to TCP when a response is truncated and for zone transfers.
Real-time streaming: RTP/RTCP protocols run over UDP.
5. Socket Programming
5.1 Berkeley Socket API
// Complete TCP client example
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <unistd.h>
int main(void) {
    // 1. Create socket
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0) {
        perror("socket");
        return 1;
    }
    // 2. Configure server address
    struct sockaddr_in server_addr;
    memset(&server_addr, 0, sizeof(server_addr));
    server_addr.sin_family = AF_INET;
    server_addr.sin_port = htons(80);
    if (inet_pton(AF_INET, "93.184.216.34", &server_addr.sin_addr) != 1) {
        fprintf(stderr, "invalid address\n");
        close(sock);
        return 1;
    }
    // 3. Connect (triggers 3-way handshake)
    if (connect(sock, (struct sockaddr *)&server_addr,
                sizeof(server_addr)) < 0) {
        perror("connect");
        close(sock);
        return 1;
    }
    // 4. Send HTTP request (send() may write fewer bytes than requested)
    const char *request = "GET / HTTP/1.1\r\n"
                          "Host: example.com\r\n"
                          "Connection: close\r\n\r\n";
    size_t total = strlen(request), sent = 0;
    while (sent < total) {
        ssize_t n = send(sock, request + sent, total - sent, 0);
        if (n < 0) {
            perror("send");
            close(sock);
            return 1;
        }
        sent += (size_t)n;
    }
    // 5. Receive response until the server closes the connection
    char buffer[4096];
    ssize_t bytes;
    while ((bytes = recv(sock, buffer, sizeof(buffer) - 1, 0)) > 0) {
        buffer[bytes] = '\0';
        printf("%s", buffer);
    }
    // 6. Close connection (triggers 4-way teardown)
    close(sock);
    return 0;
}
5.2 Blocking vs Non-Blocking
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
// Set non-blocking mode
int flags = fcntl(sock, F_GETFL, 0);
fcntl(sock, F_SETFL, flags | O_NONBLOCK);
// Non-blocking connect
int ret = connect(sock, (struct sockaddr *)&addr, sizeof(addr));
if (ret < 0 && errno == EINPROGRESS) {
    // Connection in progress - use poll/epoll to check completion
    struct pollfd pfd;
    pfd.fd = sock;
    pfd.events = POLLOUT;
    if (poll(&pfd, 1, 5000) > 0) {  // 5 second timeout; 0 means timed out
        int error = 0;
        socklen_t len = sizeof(error);
        // SO_ERROR is read at the SOL_SOCKET level
        getsockopt(sock, SOL_SOCKET, SO_ERROR, &error, &len);
        if (error == 0) {
            // Connection succeeded
        }
    }
}
// Non-blocking recv
char buf[1024];
ssize_t n = recv(sock, buf, sizeof(buf), 0);
if (n < 0) {
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
        // No data available yet - try again later
    } else {
        // Actual error
    }
} else if (n == 0) {
    // Peer closed the connection
}
6. I/O Multiplexing
6.1 select, poll, epoll, kqueue, io_uring Comparison
+------------+--------+-----------+----------+----------------------------+
| Mechanism | OS | Complexity| FD Limit | Characteristics |
+------------+--------+-----------+----------+----------------------------+
| select | All | O(n) | 1024 | Oldest, most portable |
| poll | All | O(n) | None | Improved select, still O(n)|
| epoll | Linux | O(1) | None | Event-driven, scalable |
| kqueue | BSD | O(1) | None | macOS/FreeBSD, feature-rich|
| io_uring | Linux | O(1) | None | 5.1+, async I/O revolution |
+------------+--------+-----------+----------+----------------------------+
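For a quick feel of readiness-based multiplexing without writing C, Python's standard selectors module wraps whichever mechanism the platform offers (epoll on Linux, kqueue on BSD/macOS). In this sketch a socketpair stands in for a real client connection, and the "client-1" label is just illustrative data attached to the registration.

```python
import selectors
import socket

sel = selectors.DefaultSelector()   # epoll on Linux, kqueue on BSD/macOS
a, b = socket.socketpair()          # connected pair, stands in for a client connection
b.setblocking(False)
sel.register(b, selectors.EVENT_READ, data="client-1")

a.sendall(b"ping")
ready = sel.select(timeout=1.0)     # list of (key, event_mask) for ready sockets
received = [(key.data, key.fileobj.recv(1024)) for key, _ in ready]
print(received)  # [('client-1', b'ping')]

sel.unregister(b)
sel.close()
a.close()
b.close()
```

The same register/select loop scales to thousands of sockets, because the kernel reports only the ones that are ready.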
6.2 epoll Deep Dive
#include <fcntl.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#define MAX_EVENTS 1024
int epoll_fd = epoll_create1(0);
// Register server socket
struct epoll_event ev;
ev.events = EPOLLIN;
ev.data.fd = server_fd;
epoll_ctl(epoll_fd, EPOLL_CTL_ADD, server_fd, &ev);
struct epoll_event events[MAX_EVENTS];
while (1) {
    // Wait for events (blocking)
    int nfds = epoll_wait(epoll_fd, events, MAX_EVENTS, -1);
    for (int i = 0; i < nfds; i++) {
        if (events[i].data.fd == server_fd) {
            // Accept new connection
            int client_fd = accept(server_fd, NULL, NULL);
            if (client_fd < 0) continue;  // e.g. client aborted before accept
            // Set non-blocking (required for Edge-Triggered mode)
            fcntl(client_fd, F_SETFL,
                  fcntl(client_fd, F_GETFL, 0) | O_NONBLOCK);
            // Register with Edge-Triggered mode
            ev.events = EPOLLIN | EPOLLET;
            ev.data.fd = client_fd;
            epoll_ctl(epoll_fd, EPOLL_CTL_ADD, client_fd, &ev);
        } else {
            // Handle client data; with EPOLLET the handler must read until EAGAIN
            handle_client_et(events[i].data.fd);
        }
    }
}
Level-Triggered vs Edge-Triggered:
- LT (default): Notifies continuously as long as data is available. Safe but may cause unnecessary syscalls.
- ET: Notifies only on state changes. Fewer wakeups, but each notification must be handled by reading from a non-blocking socket until EAGAIN.
// Reading all data in Edge-Triggered mode (fd must be non-blocking)
void handle_client_et(int fd) {
    char buf[4096];
    while (1) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n < 0) {
            if (errno == EINTR) continue;  // Interrupted by a signal: retry
            if (errno == EAGAIN || errno == EWOULDBLOCK) break;  // All data read
            perror("read");
            close(fd);
            break;
        }
        if (n == 0) {
            // Connection closed by peer
            close(fd);
            break;
        }
        process_data(buf, n);
    }
}
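The same drain-until-EAGAIN pattern looks like this in Python, where EAGAIN surfaces as BlockingIOError. The helper name and the use of a socketpair are assumptions for the sake of a self-contained example.

```python
import socket


def drain(sock, chunk_size=4096):
    """Read from a non-blocking socket until it would block.

    Returns (data, peer_closed).
    """
    chunks = []
    while True:
        try:
            chunk = sock.recv(chunk_size)
        except BlockingIOError:      # EAGAIN / EWOULDBLOCK: buffer is empty
            return b"".join(chunks), False
        if not chunk:                # recv() returned 0 bytes: peer closed
            return b"".join(chunks), True
        chunks.append(chunk)


a, b = socket.socketpair()
b.setblocking(False)
a.sendall(b"x" * 10000)
data, closed = drain(b)
print(len(data), closed)  # 10000 False
```

A single 4096-byte read would have left data in the buffer, and in Edge-Triggered mode no further event would announce it.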
6.3 io_uring (Linux 5.1+)
#include <liburing.h>
struct io_uring ring;
io_uring_queue_init(256, &ring, 0);
// Submit async read request (the read opcode needs Linux 5.6+;
// buf must stay valid until the completion arrives)
static char buf[4096];
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, client_fd, buf, sizeof(buf), 0);
io_uring_sqe_set_data(sqe, client_ctx);  // application-defined struct client_ctx *
io_uring_submit(&ring);
// Harvest completion events
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
struct client_ctx *ctx = io_uring_cqe_get_data(cqe);
int bytes_read = cqe->res;  // A negative value is -errno
// Process...
io_uring_cqe_seen(&ring, cqe);
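io_uring innovations:
- Submission and completion queues are ring buffers shared with the kernel, so many operations can be issued per syscall (or with none at all in SQPOLL mode)
- Batch submission/completion minimizes overhead
- Supports file I/O, network I/O, and timers
- Can deliver higher throughput than epoll in some benchmarks; the gain depends heavily on workload and kernel version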
7. HTTP Protocol Evolution
7.1 HTTP/1.1
HTTP/1.1 Key Features:
1. Keep-Alive (Persistent Connections)
- Multiple requests/responses over a single TCP connection
- Default in HTTP/1.1; a peer sends Connection: close to opt out
2. Pipelining
- Send multiple requests without waiting for responses
- Rarely used due to HOL Blocking in practice
3. Chunked Transfer Encoding
- Streaming responses without Content-Length
HTTP/1.1 Limitation -- Head-of-Line Blocking:
Client Server
|-- GET /a -------->|
| | (response a takes 3 seconds to generate)
|-- GET /b -------->|
| | b is ready but waits for a to finish
|<-- Response /a ---|
|<-- Response /b ---| (unnecessary delay)
In practice, browsers open about 6 parallel TCP connections per host, and sites used domain sharding (spreading assets across several hostnames) to raise that limit further.
7.2 HTTP/2
HTTP/2 Key Features:
1. Multiplexing
- Multiple streams over a single TCP connection
- Each stream is independent (solves HOL Blocking... except at TCP level)
2. HPACK Header Compression
- Static/dynamic tables eliminate header duplication
- First request: "content-type: application/json" (full)
- Subsequent: just the index number
3. Server Push
- Push CSS/JS proactively when HTML is requested
- Rarely used in practice due to cache issues
4. Binary Framing
- Binary frames instead of text
- Improved parsing efficiency
HTTP/2 Frame Structure:
+-----------------------------------------------+
| Length (24) |
+---------------+-------------------------------+
| Type (8) | Flags (8) |
+---------------+-------------------------------+
| Reserved (1) | Stream Identifier (31) |
+-----------------------------------------------+
| Frame Payload |
+-----------------------------------------------+
Stream IDs:
- Odd: client-initiated
- Even: server-initiated (Server Push)
- 0: control frames
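The fixed 9-byte frame header shown above can be decoded with a few lines of struct code. This is a minimal sketch; the function name and the returned dictionary are assumptions for illustration, and payload parsing is out of scope.

```python
import struct


def parse_frame_header(header):
    """Decode the fixed 9-byte HTTP/2 frame header."""
    if len(header) != 9:
        raise ValueError("HTTP/2 frame header is exactly 9 bytes")
    length_hi, length_lo, frame_type, flags, stream = struct.unpack("!BHBBI", header)
    return {
        "length": (length_hi << 16) | length_lo,  # 24-bit payload length
        "type": frame_type,
        "flags": flags,
        "stream_id": stream & 0x7FFFFFFF,         # drop the reserved bit
    }


# HEADERS frame (type 0x1), END_STREAM | END_HEADERS (0x5), stream 1, 13-byte payload
print(parse_frame_header(bytes.fromhex("00000d010500000001")))
```

The odd stream ID in the example marks a client-initiated stream, matching the Stream IDs list above.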
7.3 HTTP/3 (QUIC)
HTTP Version Comparison:
+----------+----------+---------+---------+----------------------------+
| Version | Transport| Encrypt | RTT | HOL Blocking |
+----------+----------+---------+---------+----------------------------+
| HTTP/1.1 | TCP | Optional| 2-3 RTT | Yes (response-level) |
| HTTP/2 | TCP | De facto| 2-3 RTT | Yes (TCP-level) |
| HTTP/3 | QUIC/UDP | Required| 1 RTT | None (stream-independent) |
| | | | (0 RTT) | |
+----------+----------+---------+---------+----------------------------+
8. QUIC Protocol Deep Dive
8.1 QUIC Architecture
Traditional Stack: QUIC Stack:
+----------+ +----------+
| HTTP/2 | | HTTP/3 |
+----------+ +----------+
| TLS 1.2+ | | QUIC | <- TLS 1.3 built-in
+----------+ +----------+
| TCP | | UDP |
+----------+ +----------+
| IP | | IP |
+----------+ +----------+
8.2 0-RTT Connection
First Connection (1-RTT):
Client Server
| |
|--- Initial ----------->| CRYPTO: TLS ClientHello + transport parameters
| |
|<-- Initial ------------| CRYPTO: TLS ServerHello
|<-- Handshake ----------| EncryptedExtensions (transport params) + certificate + Finished
| |
|--- Handshake --------->| Client Finished
| |
|=== Data Transfer ======| Connected in just 1-RTT!
Reconnection (0-RTT):
Client Server
| |
|--- Initial + 0-RTT --->| Data encrypted with keys derived from the previous session's PSK
| data |
| | Server processes data immediately
|<-- Handshake ----------|
| |
|=== Instant Exchange ===| 0-RTT!
0-RTT security caveat: Vulnerable to replay attacks. Only idempotent requests should use 0-RTT.
8.3 Connection Migration
Switching from Wi-Fi to cellular:
TCP: New IP address -> new connection needed -> 3-way handshake + TLS again
QUIC: Connection ID based -> connection survives IP change!
Client (Wi-Fi: 192.168.1.10) ---QUIC (CID: abc123)---> Server
|
(Wi-Fi drops, switch to cellular)
|
Client (Cell: 10.0.0.5) ---QUIC (CID: abc123)---> Server
Connection maintained!
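The key idea is that the server looks connections up by Connection ID rather than by the (IP, port) 4-tuple. The toy table below illustrates only that lookup; the class and field names are hypothetical, and a real QUIC stack validates the new path and rotates Connection IDs during migration.

```python
class QuicServerTable:
    """Toy demultiplexer: connections are keyed by Connection ID, not by address."""

    def __init__(self):
        self.by_cid = {}

    def on_packet(self, cid, src_addr):
        conn = self.by_cid.setdefault(cid, {"cid": cid, "paths": []})
        if src_addr not in conn["paths"]:
            conn["paths"].append(src_addr)  # a real stack would validate the new path first
        return conn


table = QuicServerTable()
wifi = table.on_packet("abc123", ("192.168.1.10", 50000))
cell = table.on_packet("abc123", ("10.0.0.5", 41000))
print(wifi is cell)  # True: same connection state after the address change
```

A TCP server keyed by the 4-tuple would treat the second packet as belonging to an unknown connection.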
8.4 Stream Multiplexing and HOL Blocking Resolution
HTTP/2 over TCP:
Stream A: [1] [2] [_] [4] ... <- Packet 3 lost
Stream B: [1] [2] [3] [4] ... <- B also blocked! (TCP guarantees order)
Stream C: [1] [2] [3] [4] ... <- C also blocked!
All streams wait until TCP retransmission completes
HTTP/3 over QUIC:
Stream A: [1] [2] [_] [4] ... <- Packet 3 lost (only A waits)
Stream B: [1] [2] [3] [4] ... <- Proceeds normally!
Stream C: [1] [2] [3] [4] ... <- Proceeds normally!
Each stream handles retransmission independently
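The difference can be reproduced with a small delivery model. This sketch is a deliberate simplification with made-up names: it only answers which stream data the receiver may hand to the application before the lost packet is retransmitted.

```python
def deliverable(packets, lost, per_stream_ordering):
    """Return {stream: [delivered seqs]} for the packets that arrived.

    packets: list of (packet_number, stream, stream_seq) in send order.
    lost: set of packet numbers that were dropped.
    per_stream_ordering: False models TCP (one ordered byte stream),
                         True models QUIC (ordering enforced per stream).
    """
    delivered = {}
    blocked = set()   # streams with a gap (QUIC)
    gap = False       # connection-wide gap (TCP)
    for number, stream, seq in packets:
        if number in lost:
            gap = True
            blocked.add(stream)
            continue
        stalled = (stream in blocked) if per_stream_ordering else gap
        if not stalled:
            delivered.setdefault(stream, []).append(seq)
    return delivered


packets = [(1, "A", 1), (2, "B", 1), (3, "A", 2),
           (4, "B", 2), (5, "A", 3), (6, "B", 3)]
print(deliverable(packets, lost={3}, per_stream_ordering=False))  # TCP
print(deliverable(packets, lost={3}, per_stream_ordering=True))   # QUIC
```

With TCP-style ordering, stream B stalls after its first chunk even though all of its packets arrived. With per-stream ordering, B receives all three chunks.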
9. DNS Deep Dive
9.1 DNS Resolution Process
User enters www.example.com:
(1) Check local cache
Browser -------> OS Resolver -------> /etc/hosts
|
(2) Cache miss
|
v
Recursive Resolver (ISP/8.8.8.8)
|
(3) Query root server
|
v
Root Server (.)
"Ask this NS for .com"
|
(4) Query TLD server
|
v
.com TLD Server
"Ask this NS for example.com"
|
(5) Query authoritative server
|
v
example.com Authoritative NS
"www.example.com = 93.184.216.34"
|
(6) Cache result and return
|
v
Browser
9.2 DNS Record Types
# A record (IPv4)
dig A example.com
# example.com. 300 IN A 93.184.216.34
# AAAA record (IPv6)
dig AAAA example.com
# CNAME (alias)
dig CNAME www.example.com
# www.example.com. 300 IN CNAME example.com.
# MX (mail)
dig MX example.com
# NS (nameserver)
dig NS example.com
# TXT (text - SPF, DKIM, etc.)
dig TXT example.com
# Trace full resolution path
dig +trace www.example.com
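What dig sends on the wire is a small binary message. The sketch below builds a single-question query in RFC 1035 wire format; the function name and the fixed query ID are assumptions for illustration (real resolvers randomize the ID), and nothing is sent over the network.

```python
import struct

QTYPE = {"A": 1, "NS": 2, "CNAME": 5, "MX": 15, "TXT": 16, "AAAA": 28}


def build_query(name, qtype="A", query_id=0x1234):
    """Build a DNS query message (RFC 1035 wire format) for one question."""
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is length-prefixed, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", QTYPE[qtype], 1)  # QCLASS 1 = IN
    return header + question


msg = build_query("example.com", "A")
print(len(msg), msg.hex())
```

The whole query for example.com is 29 bytes, which is why a single UDP datagram is normally enough.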
9.3 DNSSEC
DNSSEC provides origin authentication and integrity for DNS responses by signing records. It does not encrypt them.
Signature Chain:
Root Zone (.)
|-- KSK (Key Signing Key) signs ZSK
|-- ZSK (Zone Signing Key) signs records
|-- DS record: hash of .com KSK
|
v
.com TLD
|-- KSK, ZSK
|-- DS record: hash of example.com KSK
|
v
example.com
|-- KSK, ZSK
|-- RRSIG: digital signature of each record
|-- A record + RRSIG(A)
# Verify DNSSEC
dig +dnssec example.com
# Validate the signature chain (delv ships with BIND 9.10+; dig's +sigchase option was removed from current BIND releases)
delv example.com A
9.4 DoH / DoT
Traditional DNS: Plaintext UDP port 53 (eavesdropping possible)
DoT (DNS over TLS):
- DNS encrypted via TLS
- TCP port 853
- System-level configuration
DoH (DNS over HTTPS):
- DNS encapsulated in HTTPS
- TCP port 443 (hard to distinguish from normal HTTPS traffic)
- Browser-level configuration (Firefox, Chrome)
# DNS query via DoH (curl)
curl -H "accept: application/dns-json" \
"https://cloudflare-dns.com/dns-query?name=example.com&type=A"
10. TLS 1.3 Deep Dive
10.1 TLS 1.2 vs TLS 1.3
+----------------+------------------+------------------+
| Feature | TLS 1.2 | TLS 1.3 |
+----------------+------------------+------------------+
| Handshake | 2-RTT | 1-RTT |
| Reconnection | 1-RTT | 0-RTT |
| Key Exchange | RSA, (EC)DH(E) | (EC)DHE only |
| Encryption | Includes CBC, RC4| AEAD only |
| | | (AES-GCM, |
| | | ChaCha20-Poly) |
| Static RSA | Possible (no PFS)| Removed (PFS |
| | | required) |
| Compression | Yes (CRIME vuln) | Removed |
+----------------+------------------+------------------+
10.2 TLS 1.3 Handshake
Client Server
| |
|--- ClientHello ------------------>|
| + supported_versions |
| + key_share (ECDHE public key) |
| + signature_algorithms |
| + psk_key_exchange_modes |
| |
|<-- ServerHello --------------------|
| + key_share (server ECDHE key) |
| |
| [All subsequent comms encrypted]|
| |
|<-- EncryptedExtensions ------------|
|<-- Certificate --------------------|
|<-- CertificateVerify --------------|
|<-- Finished -----------------------|
| |
|--- Finished ---------------------->|
| |
|=== 1-RTT Handshake Complete ======|
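From application code, the handshake above is usually driven by a TLS library. As a minimal sketch using Python's standard ssl module, a client can refuse anything older than TLS 1.3. The helper names are hypothetical, and negotiated_version needs network access, so it is defined here but not called.

```python
import socket
import ssl


def tls13_context():
    """Client context that refuses anything older than TLS 1.3."""
    ctx = ssl.create_default_context()   # verifies certificates and hostnames
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    return ctx


def negotiated_version(host, port=443, timeout=5.0):
    """Connect and return the negotiated protocol string, e.g. 'TLSv1.3'."""
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with tls13_context().wrap_socket(raw, server_hostname=host) as tls:
            return tls.version()


ctx = tls13_context()
print(ctx.minimum_version)
```

If the server only supports TLS 1.2, wrap_socket raises an ssl.SSLError instead of silently downgrading.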
10.3 Certificate Transparency
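Certificate Issuance Process:
1. Domain owner requests certificate from CA
2. CA submits a precertificate to one or more CT logs (transparency!)
3. Each CT log returns an SCT (Signed Certificate Timestamp)
4. CA embeds the SCTs in the final certificate and issues it
5. Server presents the SCTs during the TLS handshake (embedded in the certificate, or via a TLS extension or OCSP stapling)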
CT Log Monitoring:
- Detects unauthorized certificate issuance for your domain
- Query all certificates for a domain at crt.sh
# Check server TLS certificate (</dev/null closes stdin so s_client exits after the handshake)
openssl s_client -connect example.com:443 -servername example.com </dev/null \
| openssl x509 -noout -text
# Test TLS 1.3 connection
openssl s_client -connect example.com:443 -tls1_3
# View certificate chain
openssl s_client -connect example.com:443 -showcerts
11. Network Debugging
11.1 tcpdump
# Filter by host and port
tcpdump -i eth0 host 10.0.0.1 and port 80
# Filter by TCP flags (initial SYN packets only, excluding SYN-ACK)
tcpdump -i eth0 'tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn'
# Capture connection setup (SYN and SYN-ACK; the final ACK has no distinguishing flag)
tcpdump -i eth0 'tcp[tcpflags] & tcp-syn != 0' -c 100
# View HTTP request/response content
tcpdump -i eth0 -A -s 0 'tcp port 80'
# Save to pcap file (analyze in Wireshark)
tcpdump -i eth0 -w capture.pcap -c 10000
# Capture DNS queries
tcpdump -i eth0 udp port 53
# Check retransmission packets (tcpdump does not label them; analyze the saved capture with tshark)
tshark -r capture.pcap -Y tcp.analysis.retransmission
11.2 ss (Socket Statistics)
# TCP connection state summary
ss -s
# LISTEN state sockets (server ports)
ss -tlnp
# ESTABLISHED connections
ss -tnp state established
# TIME_WAIT socket count
ss -Htan state time-wait | wc -l
# Connection info for a specific port
ss -tnp dst :443
# Check socket buffer sizes
ss -tnm
# Congestion control info
ss -ti dst :80
# Example output: cubic wscale:7,7 rto:204 rtt:1.5/0.5
# cwnd:10 ssthresh:7 send 77.0Mbps
11.3 traceroute / mtr
# Path tracing
traceroute example.com
# TCP traceroute (bypass firewalls)
traceroute -T -p 443 example.com
# mtr (traceroute + ping combined, real-time monitoring)
mtr --report-wide example.com
# Example output:
# HOST Loss% Snt Last Avg Best Wrst StDev
# 1. gateway.local 0.0% 10 1.2 1.3 0.9 2.1 0.4
# 2. isp-router.net 0.0% 10 5.3 5.1 4.8 5.9 0.3
# 3. core-router.isp.net 0.0% 10 12.1 11.8 11.2 13.0 0.5
# 4. cdn-edge.example.com 0.0% 10 15.2 15.0 14.5 16.1 0.5
11.4 curl Debugging
# Verbose connection info
curl -v https://example.com
# Detailed timing information
curl -w @- -o /dev/null -s https://example.com <<'EOF'
time_namelookup: %{time_namelookup}s\n
time_connect: %{time_connect}s\n
time_appconnect: %{time_appconnect}s\n
time_pretransfer: %{time_pretransfer}s\n
time_redirect: %{time_redirect}s\n
time_starttransfer: %{time_starttransfer}s\n
----------\n
time_total: %{time_total}s\n
EOF
# Force HTTP/2
curl --http2 -v https://example.com
# HTTP/3 (requires a curl build with HTTP/3 support, e.g. ngtcp2 + nghttp3; check `curl --version`)
curl --http3 -v https://example.com
# TLS handshake info
curl -v --tlsv1.3 https://example.com 2>&1 | grep -E "SSL|TLS"
12. Performance Tuning
12.1 Nagle's Algorithm and TCP_NODELAY
#include <netinet/tcp.h>  // defines TCP_NODELAY
// Nagle's algorithm: aggregates small packets for bandwidth efficiency
// Problem: introduces latency (especially for real-time applications,
// and when it interacts with delayed ACKs on the receiver)
// Disable Nagle with TCP_NODELAY
int flag = 1;
setsockopt(sock, IPPROTO_TCP, TCP_NODELAY,
           &flag, sizeof(flag));
// When to use TCP_NODELAY:
// - Real-time gaming (per-frame data)
// - Interactive protocols (SSH, telnet)
// - Frequent small message sends
// - Request-response patterns (Redis, Memcached)
// When to keep Nagle:
// - Large file transfers
// - Streaming (already large packets)
// - Bandwidth-constrained environments
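The same option is available from higher-level languages. As a small sketch, this is how it could be set and read back in Python; the socket here is never connected and exists only to show the calls.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)   # disable Nagle
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print("TCP_NODELAY enabled:", nodelay != 0)
sock.close()
```

Reading the option back is a cheap way to confirm that a framework or connection pool has not overridden the setting.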
12.2 Socket Buffer Tuning
# Check current TCP buffer settings
sysctl net.ipv4.tcp_rmem # Receive buffer (min, default, max)
sysctl net.ipv4.tcp_wmem # Send buffer (min, default, max)
# Tuning for high-bandwidth environments (10Gbps+)
sysctl -w net.ipv4.tcp_rmem="4096 1048576 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 1048576 16777216"
# Total TCP memory limit (in pages)
sysctl -w net.ipv4.tcp_mem="786432 1048576 1572864"
# Enable receive buffer auto-tuning
sysctl -w net.ipv4.tcp_moderate_rcvbuf=1
12.3 Kernel Network Parameters
# === Connection Management ===
# SYN backlog size (DDoS defense)
sysctl -w net.ipv4.tcp_max_syn_backlog=65535
# Accept queue size
sysctl -w net.core.somaxconn=65535
# SYN cookies (SYN Flood defense)
sysctl -w net.ipv4.tcp_syncookies=1
# === Timeouts ===
# FIN-WAIT-2 timeout (default 60s)
sysctl -w net.ipv4.tcp_fin_timeout=15
# Keepalive settings
sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.ipv4.tcp_keepalive_intvl=60
sysctl -w net.ipv4.tcp_keepalive_probes=3
# === File Descriptors ===
# System-wide file descriptor limit
sysctl -w fs.file-max=2097152
# Per-process limit
ulimit -n 1048576
13. Practice Quiz
Quiz 1: TCP State Transition
After a server receives a FIN from the client but still has data to send, what TCP state is the server in?
Answer: CLOSE_WAIT
When the server receives the client's FIN, it sends an ACK and transitions to CLOSE_WAIT. In this state, the server can still send data. After sending all remaining data and its own FIN, the server transitions to LAST_ACK.
If many CLOSE_WAIT sockets accumulate, it typically indicates the application is not properly calling close() -- this is a bug.
# Check CLOSE_WAIT sockets
ss -tnp state close-wait
Quiz 2: HTTP/2 vs HTTP/3 HOL Blocking
Why does a single lost TCP packet in HTTP/2 affect other streams on the same connection?
Answer: TCP's in-order delivery guarantee
HTTP/2 multiplexes multiple streams over a single TCP connection. Since TCP guarantees byte-stream ordering, when an intermediate packet is lost, all subsequent bytes (including data for other streams) must wait in the TCP receive buffer.
HTTP/3 (QUIC) solves this by managing each stream independently over UDP. A packet loss in Stream A does not affect Streams B or C.
Quiz 3: BBR vs Cubic
Why does BBR outperform Cubic in high packet loss environments (e.g., 2% loss on satellite links)?
Answer: Cubic is a loss-based algorithm that interprets packet loss as a signal of congestion, aggressively reducing cwnd. In a 2% loss environment, cwnd is constantly decreased even when there is no actual congestion.
BBR is model-based, measuring actual bandwidth (BtlBw) and minimum RTT (RTprop) to determine optimal send rate. It does not use packet loss as a congestion signal, so it effectively utilizes available bandwidth even with high loss rates.
Quiz 4: TLS 1.3 0-RTT Security
Why is TLS 1.3 0-RTT reconnection vulnerable to replay attacks, and how can this be mitigated?
Answer: 0-RTT data is encrypted with keys derived solely from the previous session's PSK. Because it is sent before the server contributes any fresh randomness (no ServerHello yet), the server cannot tell a fresh message from a copy. An attacker who captures 0-RTT packets can replay them, causing the server to process the same request again.
Mitigations:
- Only allow idempotent requests (GET, HEAD) via 0-RTT
- Implement server-side anti-replay mechanisms (e.g., record ClientHello hashes within a time window)
- Disable 0-RTT for sensitive operations like financial transactions
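The server-side anti-replay idea can be sketched as a small cache of ClientHello digests. This is an assumption-heavy illustration, not how any particular TLS library does it: the class name, window length, and use of SHA-256 are choices made for the example, and a real deployment must also reject tickets older than the window and share state across servers.

```python
import hashlib
import time


class ReplayCache:
    """Remember ClientHello digests for a short window; accept each only once."""

    def __init__(self, window_s=10.0):
        self.window_s = window_s
        self.seen = {}   # digest -> time first seen

    def accept(self, client_hello, now=None):
        now = time.monotonic() if now is None else now
        # Forget entries older than the window
        self.seen = {d: t for d, t in self.seen.items()
                     if now - t < self.window_s}
        digest = hashlib.sha256(client_hello).digest()
        if digest in self.seen:
            return False     # replay: fall back to a full 1-RTT handshake
        self.seen[digest] = now
        return True


cache = ReplayCache(window_s=10.0)
print(cache.accept(b"client-hello-bytes", now=0.0))  # True
print(cache.accept(b"client-hello-bytes", now=1.0))  # False (replayed)
```

Rejecting the 0-RTT data does not break the connection. The client simply completes a normal 1-RTT handshake and resends the request.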
Quiz 5: epoll ET Mode
What happens if you don't read all available data in epoll Edge-Triggered mode?
Answer: Edge-Triggered mode only fires events on state changes. After a readability change event, if you only read partial data and return to epoll_wait, no new event is generated for the remaining data.
Therefore, in ET mode, you must repeatedly call read() until EAGAIN is returned to consume all available data. This is why ET mode must always be used with non-blocking sockets.
Level-Triggered mode does not have this problem because it keeps generating events as long as data remains in the buffer.