Linux I/O Evolution Deep Dive — blocking, select, poll, epoll, io_uring (2025)

0. Why does a server get slow — the C10K problem revisited

In 1999, when Dan Kegel wrote "The C10K problem", handling 10,000 concurrent connections on one server seemed impossible. The conventional wisdom:

  • One connection = one thread.
  • 10,000 threads = 20GB of stack alone (2MB x 10,000).
  • Context switching costs explode.
  • Kernel data structures exhausted.

Yet in the 2020s nginx handles one million connections on a single server. What happened? This post follows 30 years of Linux I/O interface evolution — the revolutions hiding behind a single line of read().

1. Blocking I/O — a 1970s legacy

1.1 The simplest model

char buf[1024];
int n = read(fd, buf, sizeof(buf));  // blocks until data arrives
  • The kernel puts the thread on a wait queue and runs another thread.
  • Wakes it when data arrives.
  • Synchronous from the programmer's view, so code stays clean.

1.2 The problem: a server that can handle only one connection

while (1) {
    int client = accept(server_fd, ...);
    while (1) {
        int n = read(client, ...);   // stuck serving this one client
        if (n <= 0) break;           // until it disconnects
        write(client, ...);
    }
}

Can't serve multiple clients at once.

1.3 Thread/process per connection — the Apache prefork model

// master
while (1) {
    int client = accept(server_fd, ...);
    spawn_worker(client);  // thread/process per connection
}
  • Pros: intuitive logic.
  • Cons: thread creation cost, stack memory (2MB each), context-switch overhead, system collapse at 10k connections.

Apache's prefork MPM ran straight into the C10K wall.

2. Select — the first I/O multiplexing (1983)

2.1 "One thread watching many connections"

fd_set readfds;
FD_ZERO(&readfds);
FD_SET(fd1, &readfds);
FD_SET(fd2, &readfds);

int n = select(maxfd+1, &readfds, NULL, NULL, NULL);

Introduced in 4.2BSD, 1983. Idea: "watch many fds at once, tell me when any is ready."

2.2 Three fatal limits of select

Limit 1: FD_SETSIZE ceiling (1024)

fd_set is a fixed 1024-bit bitmap. Registering an fd numbered 1024 or above writes past it and corrupts memory. Raising FD_SETSIZE on Linux effectively means recompiling glibc.

Limit 2: O(n) scan

select on every call:

  1. Copies the whole fd bitmap to the kernel.
  2. Kernel walks every fd checking state.
  3. Copies result bitmap back.
  4. User walks every fd again to find which are ready.

10k connections means a 10k-entry scan on every call — cost grows linearly with connections, not with activity. Even if only 1% are active, the other 99% get scanned every time.

Limit 3: bitmap reset per call

select overwrites the fd_set in place to report results, so the set must be rebuilt before every call.
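
Limits 2 and 3 combine into a per-iteration ritual. A minimal sketch of the usual loop (fds[], nfds, and handle() are hypothetical bookkeeping):

// Limit 3: rebuild the set from scratch before every call
fd_set readfds;
FD_ZERO(&readfds);
int maxfd = -1;
for (int i = 0; i < nfds; i++) {
    FD_SET(fds[i], &readfds);
    if (fds[i] > maxfd) maxfd = fds[i];
}

select(maxfd + 1, &readfds, NULL, NULL, NULL);

// Limit 2 (user half): walk every fd again to find the ready ones
for (int i = 0; i < nfds; i++)
    if (FD_ISSET(fds[i], &readfds))
        handle(fds[i]);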

3. Poll — same problem, different packaging (1986)

3.1 struct pollfd array

struct pollfd fds[10000];
fds[0].fd = sock1; fds[0].events = POLLIN;

int n = poll(fds, 10000, timeout);

Introduced in System V. Improvements:

  • No fd count limit (as large as the array).
  • A struct array instead of a bitmap — richer per-fd event reporting (POLLHUP, POLLERR).
  • events preserved across calls — no resetting.

But O(n) scan remains. 10k connections still scan 10k per call.
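
Concretely: even if poll() reports that only a handful of fds are ready, finding them still means walking the whole array (handle() is a hypothetical handler):

int n = poll(fds, 10000, timeout);
for (int i = 0; i < 10000 && n > 0; i++) {
    if (fds[i].revents & POLLIN) {   // events survives; only revents changes
        handle(fds[i].fd);
        n--;
    }
}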

4. Nonblocking I/O — the basis for not blocking

4.1 O_NONBLOCK flag

fcntl(fd, F_SETFL, O_NONBLOCK);
int n = read(fd, buf, size);
if (n == -1 && errno == EAGAIN) {
    // no data yet, retry later
}

Returns immediately. If nothing is available, read() returns -1 with errno set to EAGAIN (or EWOULDBLOCK).

4.2 select/poll + nonblocking combination

The reactor pattern in embryo: one thread watches many fds, uses nonblocking reads to do actual I/O.
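
A minimal sketch of that embryo, assuming fds[] holds sockets already set to O_NONBLOCK and handle_data() is a hypothetical handler:

for (;;) {
    poll(fds, nfds, -1);                     // one thread waits on all fds
    for (int i = 0; i < nfds; i++) {
        if (!(fds[i].revents & POLLIN)) continue;
        char buf[4096];
        ssize_t r = read(fds[i].fd, buf, sizeof(buf));  // never blocks
        if (r > 0)
            handle_data(fds[i].fd, buf, r);
        else if (r == -1 && errno == EAGAIN)
            continue;                        // spurious readiness, wait again
        else
            close(fds[i].fd);                // EOF or error
    }
}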

5. Epoll — the Linux revolution (2002)

5.1 Context

In the early 2000s C10K was a real, daily problem (ICQ, IRC, game servers). In Linux 2.5.44 (2002), Davide Libenzi introduced epoll.

5.2 Three-call API

int epfd = epoll_create1(0);

struct epoll_event ev;
ev.events = EPOLLIN;
ev.data.fd = sock;
epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);

struct epoll_event events[64];
int n = epoll_wait(epfd, events, 64, timeout);
for (int i = 0; i < n; i++) {
    read(events[i].data.fd, ...);
}

5.3 Why O(1)

  • The kernel keeps the interest set: each fd is registered once via epoll_ctl and stored in an internal red-black tree.
  • When an event fires, the kernel appends that fd to a ready list.
  • epoll_wait returns only the ready list — you iterate only over what is actually ready.

Of 10k connections with 100 active, only 100 are walked. Compare with select/poll walking 10k every time.

5.4 Level-triggered vs Edge-triggered

Level-triggered (LT, default):

  • Keeps notifying while the condition is true. e.g. data in buffer.
  • "Even if you don't read it all, next epoll_wait will tell you again."
  • Intuitive, compatible with select/poll mental model.

Edge-triggered (ET):

  • Only notifies on state transitions.
  • Fires exactly once, at the moment the fd becomes readable.
  • Must read until EAGAIN — otherwise miss the next notification.
  • Higher performance, harder coding.
// in ET mode the fd must be drained like this
while (1) {
    int n = read(fd, buf, size);
    if (n == -1 && errno == EAGAIN) break;  // fully drained — safe to wait again
    if (n <= 0) break;                      // EOF or real error
    process(buf, n);
}

nginx uses ET for extreme performance.

5.5 EPOLLEXCLUSIVE — solving thundering herd

If multiple threads register the same listen socket with epoll, all of them wake when a connection becomes acceptable — only one accept succeeds and the rest get EAGAIN. Wasted wakeups.

Linux 4.5+ EPOLLEXCLUSIVE: "only notify one waiter per fd." nginx and HAProxy use this.
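
Registration differs by a single flag (sketch; epfd and listen_fd assumed from the earlier example):

struct epoll_event ev;
ev.events = EPOLLIN | EPOLLEXCLUSIVE;  // wake at most one waiter per event
ev.data.fd = listen_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);  // only valid with ADD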

5.6 Limits of epoll

  • Still many syscalls: each event triggers read, write. At high load the syscall overhead dominates.
  • LT extra wakeups: notified again even after processing.
  • Regular files don't fit the readiness model: epoll can't wait on disk I/O, which ends up blocking anyway.

These limits spawned io_uring.

6. kqueue — the BSD alternative

Same era, FreeBSD (2000) introduced kqueue. Similar to epoll but:

  • More event sources: filesystem events, signals, timers.
  • Unified API for everything.

macOS also uses kqueue. Cross-platform libraries (libevent, libuv) abstract over "epoll on Linux, kqueue on BSD/macOS, IOCP on Windows".
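
For comparison, the kqueue equivalent of the epoll example in section 5.2 (a sketch; note that a single kevent() call handles both registration and waiting):

#include <sys/event.h>

int kq = kqueue();

struct kevent change;
EV_SET(&change, sock, EVFILT_READ, EV_ADD, 0, 0, NULL);
kevent(kq, &change, 1, NULL, 0, NULL);           // register, like epoll_ctl

struct kevent events[64];
int n = kevent(kq, NULL, 0, events, 64, NULL);   // wait, like epoll_wait
for (int i = 0; i < n; i++)
    read((int)events[i].ident, buf, sizeof(buf));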

7. Async disk I/O — a different war from networking

7.1 Why epoll doesn't work for files

select and poll report a regular file as always ready, and epoll_ctl rejects regular files outright (EPERM). A file's data is either in the page cache or it isn't — there is no "becoming ready" state to wait for. If it isn't cached, the read blocks on disk.

7.2 POSIX AIO — a failed first try

struct aiocb cb = { .aio_fildes = fd, .aio_buf = buf, .aio_nbytes = size };
aio_read(&cb);

glibc's POSIX AIO on Linux is actually a userspace thread pool faking asynchrony — not true kernel async I/O.

7.3 Linux AIO (libaio) — limited success

io_submit, io_getevents. Real kernel AIO but:

  • Only supports O_DIRECT (bypass page cache).
  • Still blocks in many cases.
  • Rarely used in practice.

Used by MySQL InnoDB and some DBs. Never went mainstream.
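
For the record, the raw libaio flow looks like this (a sketch; fd must be opened with O_DIRECT and buf aligned to the block size):

#include <libaio.h>

io_context_t ctx = 0;
io_setup(128, &ctx);                     // create a kernel AIO context

struct iocb cb;
struct iocb *cbs[1] = { &cb };
io_prep_pread(&cb, fd, buf, size, 0);    // read `size` bytes at offset 0
io_submit(ctx, 1, cbs);                  // hand the request to the kernel

struct io_event ev;
io_getevents(ctx, 1, 1, &ev, NULL);      // block until one completion
io_destroy(ctx);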

8. io_uring — the 2019 Linux I/O revolution

8.1 Jens Axboe's vision

In 2019, Linux 5.1, block I/O maintainer Jens Axboe introduced io_uring. Core idea:

"Submit I/O and collect results without a syscall at all."

8.2 Two ring buffers

  • Submission Queue (SQ): user inserts I/O requests.
  • Completion Queue (CQ): kernel places results.

Both are mmap'd as shared memory between user and kernel:

user:
  writes request to SQ
  calls io_uring_enter() (optional)

kernel:
  reads request from SQ
  performs I/O
  writes result to CQ

user:
  reads result from CQ
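
With liburing, the whole flow is a handful of calls (a minimal sketch for one read; kernel 5.6+ for IORING_OP_READ, error handling omitted):

#include <liburing.h>

struct io_uring ring;
io_uring_queue_init(64, &ring, 0);            // mmap the SQ/CQ rings

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, size, 0);    // write the request into the SQ
io_uring_submit(&ring);                       // one syscall for the whole batch

struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);               // wait for a completion
int n = cqe->res;                             // bytes read, or -errno
io_uring_cqe_seen(&ring, cqe);                // mark the CQE consumed
io_uring_queue_exit(&ring);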

8.3 Why it is revolutionary

1. Fewer syscalls

  • Batch submit: many requests per io_uring_enter().
  • SQPOLL mode (IORING_SETUP_SQPOLL): a dedicated kernel thread polls the SQ — zero syscalls on the submit path.
  • 10x+ gains at high QPS.

2. Unified interface

Network, files, timers, signals — same API. No mixing epoll/AIO.

3. Linked SQEs

"When this succeeds, run the next". e.g. submit openat -> read -> close as one.

4. Buffer selection

Instead of preallocating buffers for thousands of connections, pick from a pool on demand.
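
A sketch of linking with liburing — the flag goes on every SQE except the last in the chain:

// read, then write, as one ordered chain: the write runs only
// if the read completes successfully
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, in_fd, buf, size, 0);
io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);    // link to the next SQE

sqe = io_uring_get_sqe(&ring);
io_uring_prep_write(sqe, out_fd, buf, size, 0);

io_uring_submit(&ring);                        // the chain goes in together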

8.4 Growth trajectory

  • 5.1 (2019): initial.
  • 5.5: accept support.
  • 5.7: signals, file open/close.
  • 5.19: zero-copy send for networking.
  • 6.x: more opcodes, multishot accept.

As of 2025, most common I/O-related Linux syscalls have io_uring equivalents.

8.5 The dark side

Many security issues found. Google ChromeOS and Android disabled io_uring in 2023. Reasons:

  • Large attack surface (many opcodes).
  • New path to kernel vulnerabilities.
  • Hard to control with existing seccomp.

Direction: seccomp extensions for io_uring, ACLs, "allowed opcode subset" policies.

9. Real architectures — the event loop pattern

9.1 Reactor pattern (Node.js, nginx, Redis)

Event Loop (single thread)
  while (true) {
    events = epoll_wait()
    for (e in events) handle(e)
  }
  • One thread watches all I/O.
  • Short handlers run on events — back to waiting.
  • Handlers must not block (would stall the loop).
  • CPU-heavy work offloaded to worker threads.

9.2 Node.js structure

V8 JavaScript Runtime
libuv (cross-platform)
  epoll / kqueue / IOCP
  Thread Pool
OS Kernel
  • I/O goes through libuv + epoll -> JavaScript callback.
  • DNS, file reads, crypto use a thread pool (default 4).
  • CPU-heavy work via Worker threads (v10+).

9.3 nginx master-worker

Master (root, config)
  Worker 0 (epoll loop, tens of thousands of conns)
  Worker 1
  Worker N (usually CPU core count)
  • Each worker has its own event loop.
  • Shared listen socket (SO_REUSEPORT) — kernel distributes.
  • Master-worker split enables zero-downtime reload.

9.4 Redis — the beauty of single-threaded

Redis uses one main thread for all commands:

  • epoll watches thousands of connections.
  • All memory ops — microsecond command latency.
  • No locks, no races, fewer bugs.

Since 6.0, I/O threading: network read/write across multiple threads, command execution still single-threaded. Helps when the bottleneck is I/O, not CPU.

9.5 io_uring-based modern architectures

  • ScyllaDB: built async-first on the Seastar framework, io_uring-backed on modern kernels. Cassandra-compatible and advertised as up to 10x faster.
  • QEMU/KVM: virtual disk I/O via io_uring — 40% speedup.
  • Ceph: io_uring in backend storage.
  • nginx experimental: io_uring plugin.

10. Reactor vs Proactor — two async philosophies

10.1 Reactor (notify-based)

  • "Tell me when ready, I'll read."
  • epoll, kqueue style.
  • User manages buffers.

10.2 Proactor (completion-based)

  • "Read here and tell me when done."
  • Windows IOCP, io_uring style.
  • Kernel writes directly to the buffer.

10.3 Why Proactor is faster

Reactor: "readable" -> read() syscall -> copy -> process. Proactor: kernel completes copy in background -> user processes directly.

One fewer syscall. At high traffic this is decisive.

Windows has had IOCP-based proactor since NT 3.5 in 1994. Linux was long reactor-only via epoll; in 2019 it joined the proactor camp with io_uring.

11. Zero-copy in the network stack

11.1 sendfile vs read+write

For file -> socket:

normal:
  read(file)  : disk -> kernel -> user buffer (copy 1)
  write(sock) : user buffer -> kernel -> NIC (copy 2)

sendfile:
  disk -> kernel -> NIC (zero copies, direct DMA)

This is why nginx and Apache serve static files with sendfile: 2x faster with 10x less CPU.
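
The call itself (sketch; file_fd, sock_fd, file_size assumed):

#include <sys/sendfile.h>

off_t offset = 0;
ssize_t sent = sendfile(sock_fd, file_fd, &offset, file_size);
// out_fd is the socket, in_fd is the file; data never enters user space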

11.2 splice, tee, vmsplice

splice moves data between fds through a pipe, no user-space copy.

splice(fd_in, NULL, pipefd[1], NULL, size, SPLICE_F_MORE);
splice(pipefd[0], NULL, fd_out, NULL, size, SPLICE_F_MORE);

11.3 MSG_ZEROCOPY

Linux 4.14+ send(fd, buf, size, MSG_ZEROCOPY):

  • DMA direct from user buffer to NIC.
  • Must not touch the buffer until done -> completion via errqueue.
  • ~30% gain on large transfers.
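
Using it takes a one-time opt-in per socket, and completions are reaped from the error queue (a sketch):

int one = 1;
setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));  // opt in once

send(fd, buf, size, MSG_ZEROCOPY);   // kernel pins buf; don't reuse it yet

// the completion notification arrives on the socket's error queue
char ctrl[128];
struct msghdr msg = { .msg_control = ctrl, .msg_controllen = sizeof(ctrl) };
recvmsg(fd, &msg, MSG_ERRQUEUE);     // real code parses the cmsg range here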

11.4 io_uring + zero-copy

io_uring_prep_send_zc(sqe, fd, buf, size, 0, 0);

Same effect as MSG_ZEROCOPY inside io_uring. Decisive for CDNs and video servers pushing large responses.

12. Observation and tuning — in practice

12.1 Connection limits

ulimit -n              # fd ceiling (often 1024 or 1M)
ulimit -n 1000000
/etc/security/limits.conf

Also:

sysctl fs.file-max
sysctl net.core.somaxconn
sysctl net.ipv4.ip_local_port_range
sysctl net.ipv4.tcp_tw_reuse

12.2 TCP buffer sizes

sysctl net.core.rmem_max
sysctl net.core.wmem_max
sysctl net.ipv4.tcp_rmem

When the BDP (bandwidth-delay product) exceeds the default buffer size, tuning is required: 10Gbps x 100ms = 125MB in flight, while the ~208KB default is nowhere near enough.
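
Illustrative settings for that 10Gbps x 100ms path (the values are examples, not universal recommendations):

# raise socket buffer ceilings to cover a ~125MB BDP
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
sysctl -w net.ipv4.tcp_rmem="4096 131072 134217728"
sysctl -w net.ipv4.tcp_wmem="4096 131072 134217728"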

12.3 Observation tools

  • ss -antp: current connections (replaces netstat).
  • iftop, nload: real-time network.
  • tcpdump / wireshark: packet dump.
  • bpftrace: kernel-internal tracing.
  • perf trace: syscall profiling.

12.4 io_uring adoption strategy

  • Gradual migration: io_uring for hotspots, epoll elsewhere.
  • Kernel 5.15+ recommended: earlier versions were buggy.
  • Security: restrict opcodes with seccomp.
  • Libraries: liburing (Jens Axboe official), tokio-uring (Rust), io_uring-rs.

13. Closing — lessons from 30 years of I/O evolution

The journey from a single read() in the 1970s:

  • 1983 select: one thread, many fds.
  • 1986 poll: no limit, still O(n).
  • 1994 Windows IOCP: first proactor.
  • 2000 FreeBSD kqueue: unified events.
  • 2002 Linux epoll: O(1) events.
  • 2019 Linux io_uring: syscall-free I/O.

Each generation was born from the prior one's limits. Candidates for the next revolution:

  • DPDK / XDP: bypassing (or short-circuiting) the kernel network stack for 10Gbps+ line rate.
  • Userspace TCP: kernel bypass for microsecond latency.
  • RDMA: CPU-bypass memory access.
  • Smart NIC: offload I/O to the NIC.

No "final answer" — continuous change. But the core principles — "syscalls are expensive", "copies are expensive", "as the watched set grows, it must be O(1)" — are the same as 50 years ago.

Next time: the network stack itself — TCP state machine, congestion control (Cubic vs BBR), nagle, delayed ack, Fast Open, and why QUIC built a new stack on UDP.

References

  • Dan Kegel — "The C10K problem" (1999-2014).
  • Davide Libenzi — "Improving (network) I/O performance..." epoll proposal (2002).
  • Jens Axboe — "Efficient IO with io_uring" (2019).
  • Jens Axboe — "Ringing in a new asynchronous I/O API" (LWN.net, 2019).
  • Linux kernel source: fs/eventpoll.c, fs/io_uring.c, io_uring/.
  • liburing: https://github.com/axboe/liburing
  • "The Secret To 10 Million Concurrent Connections" — Robert Graham.
  • Cindy Sridharan — "The method to epoll's madness".
  • ScyllaDB Engineering Blog — Seastar / io_uring series.
  • "What Every Systems Programmer Should Know About Concurrency" (PDF) — Matt Kline.
