Linux I/O Evolution Deep Dive — blocking, select, poll, epoll, io_uring (2025)

0. Why does a server get slow — the C10K problem revisited

In 1999, when Dan Kegel wrote "The C10K problem", handling 10,000 concurrent connections on one server seemed impossible. The conventional wisdom:

  • One connection = one thread.
  • 10,000 threads = 20GB of stack alone (2MB x 10,000).
  • Context switching costs explode.
  • Kernel data structures exhausted.

Yet in the 2020s nginx handles one million connections on a single server. What happened? This post follows 30 years of Linux I/O interface evolution — the revolutions hiding behind a single line of read().

1. Blocking I/O — a 1970s legacy

1.1 The simplest model

char buf[1024];
int n = read(fd, buf, sizeof(buf));  // blocks until data arrives
  • The kernel puts the thread on a wait queue and runs another thread.
  • Wakes it when data arrives.
  • Synchronous from the programmer's view, so code stays clean.

1.2 The problem: a server that can handle only one connection

while (1) {
    int client = accept(server_fd, ...);
    while (1) {
        int n = read(client, ...);   // stuck serving this one client
        if (n <= 0) break;           // until it disconnects
        write(client, ...);
    }
}

Can't serve multiple clients at once.

1.3 Thread/process per connection — the Apache prefork model

// master
while (1) {
    int client = accept(server_fd, ...);
    spawn_worker(client);  // thread/process per connection
}
  • Pros: intuitive logic.
  • Cons: thread creation cost, stack memory (2MB each), context-switch overhead, system collapse at 10k connections.

Apache's prefork MPM ran straight into the C10K wall.

2. Select — the first I/O multiplexing (1983)

2.1 "One thread watching many connections"

fd_set readfds;
FD_ZERO(&readfds);
FD_SET(fd1, &readfds);
FD_SET(fd2, &readfds);

int n = select(maxfd+1, &readfds, NULL, NULL, NULL);

Introduced in 4.2BSD, 1983. Idea: "watch many fds at once, tell me when any is ready."

2.2 Three fatal limits of select

Limit 1: FD_SETSIZE ceiling (1024)

fd_set is a fixed 1024-bit bitmap. Registering an fd numbered 1024 or above writes past it and corrupts memory. Raising FD_SETSIZE on Linux effectively means recompiling glibc.

Limit 2: O(n) scan

select on every call:

  1. Copies the whole fd bitmap to the kernel.
  2. Kernel walks every fd checking state.
  3. Copies result bitmap back.
  4. User walks every fd again to find which are ready.

10k connections means a 10k-entry scan on every call — cost grows linearly with connections, not with activity. Even if only 1% are active, the other 99% get scanned every time.

Limit 3: bitmap reset per call

select overwrites the fd_set in place to report results, so the set must be rebuilt before every call.
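
Limits 2 and 3 combine into a per-iteration ritual. A minimal sketch of the usual loop (fds[], nfds, and handle() are hypothetical bookkeeping):

// Limit 3: rebuild the set from scratch before every call
fd_set readfds;
FD_ZERO(&readfds);
int maxfd = -1;
for (int i = 0; i < nfds; i++) {
    FD_SET(fds[i], &readfds);
    if (fds[i] > maxfd) maxfd = fds[i];
}

select(maxfd + 1, &readfds, NULL, NULL, NULL);

// Limit 2 (user half): walk every fd again to find the ready ones
for (int i = 0; i < nfds; i++)
    if (FD_ISSET(fds[i], &readfds))
        handle(fds[i]);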

3. Poll — same problem, different packaging (1986)

3.1 struct pollfd array

struct pollfd fds[10000];
fds[0].fd = sock1; fds[0].events = POLLIN;

int n = poll(fds, 10000, timeout);

Introduced in System V. Improvements:

  • No fd count limit (as large as the array).
  • A struct array instead of a bitmap — richer per-fd event reporting (POLLHUP, POLLERR).
  • events preserved across calls — no resetting.

But O(n) scan remains. 10k connections still scan 10k per call.
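
Concretely: even if poll() reports that only a handful of fds are ready, finding them still means walking the whole array (handle() is a hypothetical handler):

int n = poll(fds, 10000, timeout);
for (int i = 0; i < 10000 && n > 0; i++) {
    if (fds[i].revents & POLLIN) {   // events survives; only revents changes
        handle(fds[i].fd);
        n--;
    }
}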

4. Nonblocking I/O — the basis for not blocking

4.1 O_NONBLOCK flag

fcntl(fd, F_SETFL, O_NONBLOCK);
int n = read(fd, buf, size);
if (n == -1 && errno == EAGAIN) {
    // no data yet, retry later
}

Returns immediately. If nothing is available, read() returns -1 with errno set to EAGAIN (or EWOULDBLOCK).

4.2 select/poll + nonblocking combination

The reactor pattern in embryo: one thread watches many fds, uses nonblocking reads to do actual I/O.
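
A minimal sketch of that embryo, assuming fds[] holds sockets already set to O_NONBLOCK and handle_data() is a hypothetical handler:

for (;;) {
    poll(fds, nfds, -1);                     // one thread waits on all fds
    for (int i = 0; i < nfds; i++) {
        if (!(fds[i].revents & POLLIN)) continue;
        char buf[4096];
        ssize_t r = read(fds[i].fd, buf, sizeof(buf));  // never blocks
        if (r > 0)
            handle_data(fds[i].fd, buf, r);
        else if (r == -1 && errno == EAGAIN)
            continue;                        // spurious readiness, wait again
        else
            close(fds[i].fd);                // EOF or error
    }
}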

5. Epoll — the Linux revolution (2002)

5.1 Context

In the early 2000s C10K was a real, daily problem (ICQ, IRC, game servers). In Linux 2.5.44 (2002), Davide Libenzi introduced epoll.

5.2 Three-call API

int epfd = epoll_create1(0);

struct epoll_event ev;
ev.events = EPOLLIN;
ev.data.fd = sock;
epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);

struct epoll_event events[64];
int n = epoll_wait(epfd, events, 64, timeout);
for (int i = 0; i < n; i++) {
    read(events[i].data.fd, ...);
}

5.3 Why O(1)

  • The kernel keeps the interest set: each fd is registered once via epoll_ctl and stored in an internal red-black tree.
  • When an event fires, the kernel appends that fd to a ready list.
  • epoll_wait returns only the ready list — you iterate only over what is actually ready.

Of 10k connections with 100 active, only 100 are walked. Compare with select/poll walking 10k every time.

5.4 Level-triggered vs Edge-triggered

Level-triggered (LT, default):

  • Keeps notifying while the condition is true. e.g. data in buffer.
  • "Even if you don't read it all, next epoll_wait will tell you again."
  • Intuitive, compatible with select/poll mental model.

Edge-triggered (ET):

  • Only notifies on state transitions.
  • Fires exactly once, at the moment the fd becomes readable.
  • Must read until EAGAIN — otherwise miss the next notification.
  • Higher performance, harder coding.
// in ET mode the fd must be drained like this
while (1) {
    int n = read(fd, buf, size);
    if (n == -1 && errno == EAGAIN) break;  // fully drained — safe to wait again
    if (n <= 0) break;                      // EOF or real error
    process(buf, n);
}

nginx uses ET for extreme performance.

5.5 EPOLLEXCLUSIVE — solving thundering herd

If multiple threads register the same listen socket with epoll, all of them wake when a connection becomes acceptable — only one accept succeeds and the rest get EAGAIN. Wasted wakeups.

Linux 4.5+ EPOLLEXCLUSIVE: "only notify one waiter per fd." nginx and HAProxy use this.
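
Registration differs by a single flag (sketch; epfd and listen_fd assumed from the earlier example):

struct epoll_event ev;
ev.events = EPOLLIN | EPOLLEXCLUSIVE;  // wake at most one waiter per event
ev.data.fd = listen_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);  // only valid with ADD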

5.6 Limits of epoll

  • Still many syscalls: each event triggers read, write. At high load the syscall overhead dominates.
  • LT extra wakeups: notified again even after processing.
  • Regular files don't fit the readiness model: epoll can't wait on disk I/O, which ends up blocking anyway.

These limits spawned io_uring.

6. kqueue — the BSD alternative

Same era, FreeBSD (2000) introduced kqueue. Similar to epoll but:

  • More event sources: filesystem events, signals, timers.
  • Unified API for everything.

macOS also uses kqueue. Cross-platform libraries (libevent, libuv) abstract over "epoll on Linux, kqueue on BSD/macOS, IOCP on Windows".
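
For comparison, the kqueue equivalent of the epoll example in section 5.2 (a sketch; note that a single kevent() call handles both registration and waiting):

#include <sys/event.h>

int kq = kqueue();

struct kevent change;
EV_SET(&change, sock, EVFILT_READ, EV_ADD, 0, 0, NULL);
kevent(kq, &change, 1, NULL, 0, NULL);           // register, like epoll_ctl

struct kevent events[64];
int n = kevent(kq, NULL, 0, events, 64, NULL);   // wait, like epoll_wait
for (int i = 0; i < n; i++)
    read((int)events[i].ident, buf, sizeof(buf));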

7. Async disk I/O — a different war from networking

7.1 Why epoll doesn't work for files

select and poll report a regular file as always ready, and epoll_ctl rejects regular files outright (EPERM). A file's data is either in the page cache or it isn't — there is no "becoming ready" state to wait for. If it isn't cached, the read blocks on disk.

7.2 POSIX AIO — a failed first try

struct aiocb cb = { .aio_fildes = fd, .aio_buf = buf, .aio_nbytes = size };
aio_read(&cb);

glibc's POSIX AIO on Linux is actually a userspace thread pool faking asynchrony — not true kernel async I/O.

7.3 Linux AIO (libaio) — limited success

io_submit, io_getevents. Real kernel AIO but:

  • Only supports O_DIRECT (bypass page cache).
  • Still blocks in many cases.
  • Rarely used in practice.

Used by MySQL InnoDB and some DBs. Never went mainstream.
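
For the record, the raw libaio flow looks like this (a sketch; fd must be opened with O_DIRECT and buf aligned to the block size):

#include <libaio.h>

io_context_t ctx = 0;
io_setup(128, &ctx);                     // create a kernel AIO context

struct iocb cb;
struct iocb *cbs[1] = { &cb };
io_prep_pread(&cb, fd, buf, size, 0);    // read `size` bytes at offset 0
io_submit(ctx, 1, cbs);                  // hand the request to the kernel

struct io_event ev;
io_getevents(ctx, 1, 1, &ev, NULL);      // block until one completion
io_destroy(ctx);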

8. io_uring — the 2019 Linux I/O revolution

8.1 Jens Axboe's vision

In 2019, Linux 5.1, block I/O maintainer Jens Axboe introduced io_uring. Core idea:

"Submit I/O and collect results without a syscall at all."

8.2 Two ring buffers

  • Submission Queue (SQ): user inserts I/O requests.
  • Completion Queue (CQ): kernel places results.

Both are mmap'd as shared memory between user and kernel:

user:
  writes request to SQ
  calls io_uring_enter() (optional)

kernel:
  reads request from SQ
  performs I/O
  writes result to CQ

user:
  reads result from CQ
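
With liburing, the whole flow is a handful of calls (a minimal sketch for one read; kernel 5.6+ for IORING_OP_READ, error handling omitted):

#include <liburing.h>

struct io_uring ring;
io_uring_queue_init(64, &ring, 0);            // mmap the SQ/CQ rings

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, size, 0);    // write the request into the SQ
io_uring_submit(&ring);                       // one syscall for the whole batch

struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);               // wait for a completion
int n = cqe->res;                             // bytes read, or -errno
io_uring_cqe_seen(&ring, cqe);                // mark the CQE consumed
io_uring_queue_exit(&ring);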

8.3 Why it is revolutionary

1. Fewer syscalls

  • Batch submit: many requests per io_uring_enter().
  • SQPOLL mode (IORING_SETUP_SQPOLL): a dedicated kernel thread polls the SQ — zero syscalls on the submit path.
  • 10x+ gains at high QPS.

2. Unified interface

Network, files, timers, signals — same API. No mixing epoll/AIO.

3. Linked SQEs

"When this succeeds, run the next". e.g. submit openat -> read -> close as one.

4. Buffer selection

Instead of preallocating buffers for thousands of connections, pick from a pool on demand.
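
A sketch of linking with liburing — the flag goes on every SQE except the last in the chain:

// read, then write, as one ordered chain: the write runs only
// if the read completes successfully
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, in_fd, buf, size, 0);
io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);    // link to the next SQE

sqe = io_uring_get_sqe(&ring);
io_uring_prep_write(sqe, out_fd, buf, size, 0);

io_uring_submit(&ring);                        // the chain goes in together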

8.4 Growth trajectory

  • 5.1 (2019): initial.
  • 5.5: accept support.
  • 5.7: signals, file open/close.
  • 5.19: zero-copy send for networking.
  • 6.x: more opcodes, multishot accept.

As of 2025, most common I/O-related Linux syscalls have io_uring equivalents.

8.5 The dark side

Many security issues found. Google ChromeOS and Android disabled io_uring in 2023. Reasons:

  • Large attack surface (many opcodes).
  • New path to kernel vulnerabilities.
  • Hard to control with existing seccomp.

Direction: seccomp extensions for io_uring, ACLs, "allowed opcode subset" policies.

9. Real architectures — the event loop pattern

9.1 Reactor pattern (Node.js, nginx, Redis)

Event Loop (single thread)
  while (true) {
    events = epoll_wait()
    for (e in events) handle(e)
  }
  • One thread watches all I/O.
  • Short handlers run on events — back to waiting.
  • Handlers must not block (would stall the loop).
  • CPU-heavy work offloaded to worker threads.

9.2 Node.js structure

V8 JavaScript Runtime
libuv (cross-platform)
  epoll / kqueue / IOCP
  Thread Pool
OS Kernel
  • I/O goes through libuv + epoll -> JavaScript callback.
  • DNS, file reads, crypto use a thread pool (default 4).
  • CPU-heavy work via Worker threads (v10+).

9.3 nginx master-worker

Master (root, config)
  Worker 0 (epoll loop, tens of thousands of conns)
  Worker 1
  Worker N (usually CPU core count)
  • Each worker has its own event loop.
  • Shared listen socket (SO_REUSEPORT) — kernel distributes.
  • Master-worker split enables zero-downtime reload.

9.4 Redis — the beauty of single-threaded

Redis uses one main thread for all commands:

  • epoll watches thousands of connections.
  • All memory ops — microsecond command latency.
  • No locks, no races, fewer bugs.

Since 6.0, I/O threading: network read/write across multiple threads, command execution still single-threaded. Helps when the bottleneck is I/O, not CPU.

9.5 io_uring-based modern architectures

  • ScyllaDB: built async-first on the Seastar framework, io_uring-backed on modern kernels. Cassandra-compatible and advertised as up to 10x faster.
  • QEMU/KVM: virtual disk I/O via io_uring — 40% speedup.
  • Ceph: io_uring in backend storage.
  • nginx experimental: io_uring plugin.

10. Reactor vs Proactor — two async philosophies

10.1 Reactor (notify-based)

  • "Tell me when ready, I'll read."
  • epoll, kqueue style.
  • User manages buffers.

10.2 Proactor (completion-based)

  • "Read here and tell me when done."
  • Windows IOCP, io_uring style.
  • Kernel writes directly to the buffer.

10.3 Why Proactor is faster

Reactor: "readable" -> read() syscall -> copy -> process. Proactor: kernel completes copy in background -> user processes directly.

One fewer syscall. At high traffic this is decisive.

Windows has had IOCP-based proactor since NT 3.5 in 1994. Linux was long reactor-only via epoll; in 2019 it joined the proactor camp with io_uring.

11. Zero-copy in the network stack

11.1 sendfile vs read+write

For file -> socket:

normal:
  read(file)  : disk -> kernel -> user buffer (copy 1)
  write(sock) : user buffer -> kernel -> NIC (copy 2)

sendfile:
  disk -> kernel -> NIC (zero copies, direct DMA)

This is why nginx and Apache serve static files with sendfile: 2x faster with 10x less CPU.
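
The call itself (sketch; file_fd, sock_fd, file_size assumed):

#include <sys/sendfile.h>

off_t offset = 0;
ssize_t sent = sendfile(sock_fd, file_fd, &offset, file_size);
// out_fd is the socket, in_fd is the file; data never enters user space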

11.2 splice, tee, vmsplice

splice moves data between fds through a pipe, no user-space copy.

splice(fd_in, NULL, pipefd[1], NULL, size, SPLICE_F_MORE);
splice(pipefd[0], NULL, fd_out, NULL, size, SPLICE_F_MORE);

11.3 MSG_ZEROCOPY

Linux 4.14+ send(fd, buf, size, MSG_ZEROCOPY):

  • DMA direct from user buffer to NIC.
  • Must not touch the buffer until done -> completion via errqueue.
  • ~30% gain on large transfers.
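
Using it takes a one-time opt-in per socket, and completions are reaped from the error queue (a sketch):

int one = 1;
setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));  // opt in once

send(fd, buf, size, MSG_ZEROCOPY);   // kernel pins buf; don't reuse it yet

// the completion notification arrives on the socket's error queue
char ctrl[128];
struct msghdr msg = { .msg_control = ctrl, .msg_controllen = sizeof(ctrl) };
recvmsg(fd, &msg, MSG_ERRQUEUE);     // real code parses the cmsg range here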

11.4 io_uring + zero-copy

io_uring_prep_send_zc(sqe, fd, buf, size, 0, 0);

Same effect as MSG_ZEROCOPY inside io_uring. Decisive for CDNs and video servers pushing large responses.

12. Observation and tuning — in practice

12.1 Connection limits

ulimit -n              # fd ceiling (often 1024 or 1M)
ulimit -n 1000000
/etc/security/limits.conf

Also:

sysctl fs.file-max
sysctl net.core.somaxconn
sysctl net.ipv4.ip_local_port_range
sysctl net.ipv4.tcp_tw_reuse

12.2 TCP buffer sizes

sysctl net.core.rmem_max
sysctl net.core.wmem_max
sysctl net.ipv4.tcp_rmem

When the BDP (bandwidth-delay product) exceeds the default buffer size, tuning is required: 10Gbps x 100ms = 125MB in flight, while the ~208KB default is nowhere near enough.
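
Illustrative settings for that 10Gbps x 100ms path (the values are examples, not universal recommendations):

# raise socket buffer ceilings to cover a ~125MB BDP
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
sysctl -w net.ipv4.tcp_rmem="4096 131072 134217728"
sysctl -w net.ipv4.tcp_wmem="4096 131072 134217728"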

12.3 Observation tools

  • ss -antp: current connections (replaces netstat).
  • iftop, nload: real-time network.
  • tcpdump / wireshark: packet dump.
  • bpftrace: kernel-internal tracing.
  • perf trace: syscall profiling.

12.4 io_uring adoption strategy

  • Gradual migration: io_uring for hotspots, epoll elsewhere.
  • Kernel 5.15+ recommended: earlier versions were buggy.
  • Security: restrict opcodes with seccomp.
  • Libraries: liburing (Jens Axboe official), tokio-uring (Rust), io_uring-rs.

13. Closing — lessons from 30 years of I/O evolution

The journey from a single read() in the 1970s:

  • 1983 select: one thread, many fds.
  • 1986 poll: no limit, still O(n).
  • 1994 Windows IOCP: first proactor.
  • 2000 FreeBSD kqueue: unified events.
  • 2002 Linux epoll: O(1) events.
  • 2019 Linux io_uring: syscall-free I/O.

Each generation was born from the prior one's limits. Candidates for the next revolution:

  • DPDK / XDP: bypassing (or short-circuiting) the kernel network stack for 10Gbps+ line rate.
  • Userspace TCP: kernel bypass for microsecond latency.
  • RDMA: CPU-bypass memory access.
  • Smart NIC: offload I/O to the NIC.

No "final answer" — continuous change. But the core principles — "syscalls are expensive", "copies are expensive", "as the watched set grows, it must be O(1)" — are the same as 50 years ago.

Next time: the network stack itself — TCP state machine, congestion control (Cubic vs BBR), nagle, delayed ack, Fast Open, and why QUIC built a new stack on UDP.

References

  • Dan Kegel — "The C10K problem" (1999-2014).
  • Davide Libenzi — "Improving (network) I/O performance..." epoll proposal (2002).
  • Jens Axboe — "Efficient IO with io_uring" (2019).
  • Jens Axboe — "Ringing in a new asynchronous I/O API" (LWN.net, 2019).
  • Linux kernel source: fs/eventpoll.c, fs/io_uring.c, io_uring/.
  • liburing: https://github.com/axboe/liburing
  • "The Secret To 10 Million Concurrent Connections" — Robert Graham.
  • Cindy Sridharan — "The method to epoll's madness".
  • ScyllaDB Engineering Blog — Seastar / io_uring series.
  • "What Every Systems Programmer Should Know About Concurrency" (PDF) — Matt Kline.
