Linux I/O Evolution Deep Dive — blocking, select, poll, epoll, io_uring (2025)
Author: Youngju Kim (@fjvbn20031)
0. Why does a server get slow — the C10K problem revisited
In 1999, when Dan Kegel wrote "The C10K problem", handling 10,000 concurrent connections on one server seemed impossible. The conventional wisdom:
- One connection = one thread.
- 10,000 threads = 20GB of stack alone (2MB x 10,000).
- Context switching costs explode.
- Kernel data structures exhausted.
Yet in the 2020s nginx handles one million connections on a single server. What happened? This post follows 30 years of Linux I/O interface evolution — the revolutions hiding behind a single line of read().
1. Blocking I/O — a 1970s legacy
1.1 The simplest model
char buf[1024];
int n = read(fd, buf, sizeof(buf)); // blocks until data arrives
- The kernel puts the thread on a wait queue and runs another thread.
- Wakes it when data arrives.
- Synchronous from the programmer's view, so code stays clean.
1.2 The problem: server handling only one connection
while (1) {
    int client = accept(server_fd, ...);
    while (1) {                     // serve this one client forever
        int n = read(client, ...);  // blocks here — no other client can be accepted
        write(client, ...);
    }
}
Can't serve multiple clients at once.
1.3 Thread pool — the Apache prefork model
// master
while (1) {
    int client = accept(server_fd, ...);
    spawn_worker(client); // thread/process per connection
}
- Pros: intuitive logic.
- Cons: thread creation cost, stack memory (2MB each), context-switch overhead, system collapse at 10k connections.
Apache's prefork MPM ran straight into the C10K wall.
2. Select — the first I/O multiplexing (1983)
2.1 "One thread watching many connections"
fd_set readfds;
FD_ZERO(&readfds);
FD_SET(fd1, &readfds);
FD_SET(fd2, &readfds);
int n = select(maxfd+1, &readfds, NULL, NULL, NULL);
Introduced in 4.2BSD, 1983. Idea: "watch many fds at once, tell me when any is ready."
2.2 Three fatal limits of select
Limit 1: FD_SETSIZE ceiling (1024)
fd_set is a fixed 1024-bit array. Calling FD_SET with an fd of 1024 or higher writes past the end of the structure — a buffer overflow. Raising the limit on Linux effectively requires recompiling glibc.
Limit 2: O(n) scan
select on every call:
- Copies the whole fd bitmap to the kernel.
- Kernel walks every fd checking state.
- Copies result bitmap back.
- User walks every fd again to find which are ready.
With 10k connections, that is a 10k-entry scan on every call — linear slowdown. Even with only 1% of connections active, the other 99% get scanned every time.
Limit 3: bitmap reset per call
select overwrites fd_set on return — must be reset before every call.
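Limits 2 and 3 show up directly in the canonical select loop — a minimal sketch, assuming hypothetical conns[] / nconns bookkeeping and a handle() callback:

fd_set readfds;
while (1) {
    FD_ZERO(&readfds);                        // limit 3: rebuild the set every iteration
    int maxfd = 0;
    for (int i = 0; i < nconns; i++) {        // O(n) setup
        FD_SET(conns[i], &readfds);
        if (conns[i] > maxfd) maxfd = conns[i];
    }
    select(maxfd + 1, &readfds, NULL, NULL, NULL);
    for (int i = 0; i < nconns; i++)          // limit 2: O(n) result scan
        if (FD_ISSET(conns[i], &readfds))
            handle(conns[i]);                 // only a handful are actually ready
}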
3. Poll — same problem, different packaging (1986)
3.1 struct pollfd array
struct pollfd fds[10000];
fds[0].fd = sock1; fds[0].events = POLLIN;
int n = poll(fds, 10000, timeout);
Introduced in System V (1986). Improvements:
- No hard fd count limit — only as large as the array you allocate.
- A struct array instead of a bitmap, with a separate revents field reporting richer conditions (POLLHUP, POLLERR).
- events is preserved across calls — no resetting.
But O(n) scan remains. 10k connections still scan 10k per call.
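The result scan is the same story — a minimal sketch, again assuming a hypothetical handle() callback:

int n = poll(fds, 10000, timeout);
for (int i = 0; i < 10000; i++)   // walks all 10k entries, ready or not
    if (fds[i].revents & POLLIN)
        handle(fds[i].fd);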
4. Nonblocking I/O — the basis for not blocking
4.1 O_NONBLOCK flag
fcntl(fd, F_SETFL, O_NONBLOCK);
int n = read(fd, buf, size);
if (n == -1 && errno == EAGAIN) {
// no data yet, retry later
}
Returns immediately. If nothing is available, returns EAGAIN (or EWOULDBLOCK).
4.2 select/poll + nonblocking combination
The reactor pattern in embryo: one thread watches many fds, uses nonblocking reads to do actual I/O.
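A minimal sketch of that shape — handle_data() is a hypothetical application callback, and fd is assumed to be a socket registered with select/poll:

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

// once per connection: make sure this socket can never block the loop
fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

// when select/poll reports fd readable:
char buf[4096];
ssize_t n;
while ((n = read(fd, buf, sizeof(buf))) > 0)
    handle_data(fd, buf, n);                   // hypothetical handler
if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
    /* drained for now — go back to waiting */
} else {
    close(fd);                                 // EOF or real error
}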
5. Epoll — the Linux revolution (2002)
5.1 Context
In the early 2000s C10K was a real, everyday problem (ICQ, IRC, game servers). In Linux 2.5.44 (2002), Davide Libenzi introduced epoll.
5.2 Three-call API
int epfd = epoll_create1(0);
struct epoll_event ev;
ev.events = EPOLLIN;
ev.data.fd = sock;
epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);
struct epoll_event events[64];
int n = epoll_wait(epfd, events, 64, timeout);
for (int i = 0; i < n; i++) {
    read(events[i].data.fd, ...);
}
5.3 Why O(1)
- The kernel permanently stores the registered fd set in an internal red-black tree (added once via epoll_ctl).
- When an event fires, the kernel appends that fd to a ready list.
- epoll_wait returns only the ready list — you iterate only what's ready.
Of 10k connections with 100 active, only 100 are walked. Compare with select/poll walking 10k every time.
5.4 Level-triggered vs Edge-triggered
Level-triggered (LT, default):
- Keeps notifying while the condition is true. e.g. data in buffer.
- "Even if you don't read it all, next epoll_wait will tell you again."
- Intuitive, compatible with select/poll mental model.
Edge-triggered (ET):
- Only notifies on state transitions.
- Exactly once "as it becomes readable".
- Must read until EAGAIN — otherwise miss the next notification.
- Higher performance, harder coding.
// ET mode must drain the socket like this
while (1) {
    int n = read(fd, buf, size);
    if (n == -1 && errno == EAGAIN) break; // fully drained — safe to wait again
    if (n <= 0) break;                     // EOF or real error
    process(buf, n);
}
nginx uses ET for extreme performance.
5.5 EPOLLEXCLUSIVE — solving thundering herd
If multiple threads register the same listen socket to epoll, all wake when accept is possible — only one succeeds, the rest get EAGAIN. Waste.
Linux 4.5+ EPOLLEXCLUSIVE: "only notify one waiter per fd." nginx and HAProxy use this.
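Opting in is a single flag at registration time — a minimal sketch:

struct epoll_event ev;
ev.events = EPOLLIN | EPOLLEXCLUSIVE;   // wake at most one waiter per event
ev.data.fd = listen_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);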
5.6 Limits of epoll
- Still many syscalls: each event triggers its own read/write. At high load the syscall overhead dominates.
- LT extra wakeups: notified again even after processing.
- Regular files are always ready: epoll is useless for disk I/O, which ends up blocking.
These limits spawned io_uring.
6. kqueue — the BSD alternative
In the same era, FreeBSD introduced kqueue (2000). Similar to epoll, but:
- More event sources: filesystem events, signals, timers.
- Unified API for everything.
macOS also uses kqueue. Cross-platform libraries (libevent, libuv) abstract over "epoll on Linux, kqueue on BSD/macOS, IOCP on Windows".
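A minimal registration-and-wait sketch, assuming an existing socket sock — note that one kevent() call can both apply changes and collect events:

#include <sys/event.h>

int kq = kqueue();
struct kevent change, event;
EV_SET(&change, sock, EVFILT_READ, EV_ADD, 0, 0, NULL);
// one call registers the change and waits for up to one event
int n = kevent(kq, &change, 1, &event, 1, NULL);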
7. Async disk I/O — a different war from networking
7.1 Why epoll doesn't work for files
epoll_ctl on a regular file always returns "ready". Files are either in the page cache or not — there's no "becoming ready" state. If not cached, you block on disk.
7.2 POSIX AIO — a failed first try
struct aiocb cb = { .aio_fildes = fd, .aio_buf = buf, .aio_nbytes = size };
aio_read(&cb);
On Linux, glibc implements POSIX AIO as a user-space thread pool faking asynchrony — not true kernel async I/O.
7.3 Linux AIO (libaio) — limited success
io_submit, io_getevents. Real kernel AIO but:
- Only truly asynchronous with O_DIRECT (bypassing the page cache).
- Still blocks in many cases.
- Rarely used in practice.
Used by MySQL InnoDB and some DBs. Never went mainstream.
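For reference, a minimal libaio sketch — assuming fd was opened with O_DIRECT and buf is suitably aligned:

#include <libaio.h>

io_context_t ctx = 0;
io_setup(32, &ctx);                    // create a kernel AIO context

struct iocb cb, *cbs[1] = { &cb };
io_prep_pread(&cb, fd, buf, size, 0);  // async read at offset 0
io_submit(ctx, 1, cbs);

struct io_event events[1];
io_getevents(ctx, 1, 1, events, NULL); // wait for the completion
io_destroy(ctx);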
8. io_uring — the 2019 Linux I/O revolution
8.1 Jens Axboe's vision
In Linux 5.1 (2019), block-layer maintainer Jens Axboe introduced io_uring. Core idea:
"Submit I/O and collect results without a syscall at all."
8.2 Two ring buffers
- Submission Queue (SQ): user inserts I/O requests.
- Completion Queue (CQ): kernel places results.
Both are mmap'd as shared memory between user and kernel:
user:    writes request into SQ
         calls io_uring_enter() (optional)
kernel:  reads request from SQ
         performs I/O
         writes result into CQ
user:    reads result from CQ
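In practice the rings are driven through the liburing helper library. A minimal sketch of a single async read — fd, buf, and size are assumed to exist:

#include <liburing.h>

struct io_uring ring;
io_uring_queue_init(8, &ring, 0);            // set up the SQ/CQ rings

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, size, 0);   // describe the request
io_uring_submit(&ring);                      // hand SQ entries to the kernel

struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);              // wait for a completion
int res = cqe->res;                          // bytes read, or -errno
io_uring_cqe_seen(&ring, cqe);               // mark the CQE consumed
io_uring_queue_exit(&ring);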
8.3 Why it is revolutionary
1. Fewer syscalls
- Batch submit: many requests per io_uring_enter().
- SQPOLL mode (IORING_SETUP_SQPOLL): a kernel thread polls the SQ — zero syscalls on the submission path.
- 10x+ gains at high QPS.
2. Unified interface
Network, files, timers, signals — same API. No mixing epoll/AIO.
3. Linked SQEs
"When this succeeds, run the next". e.g. submit openat -> read -> close as one.
4. Buffer selection
Instead of preallocating buffers for thousands of connections, pick from a pool on demand.
8.4 Growth trajectory
- 5.1 (2019): initial.
- 5.5: accept support.
- 5.7: signals, file open/close.
- 5.19: zero-copy send for networking.
- 6.x: more opcodes, multishot accept.
As of 2025, a large share of common Linux syscalls are exposed as io_uring operations.
8.5 The dark side
Many security issues found. Google ChromeOS and Android disabled io_uring in 2023. Reasons:
- Large attack surface (many opcodes).
- New path to kernel vulnerabilities.
- Hard to control with existing seccomp.
Direction: seccomp extensions for io_uring, ACLs, "allowed opcode subset" policies.
9. Real architectures — the event loop pattern
9.1 Reactor pattern (Node.js, nginx, Redis)
Event Loop (single thread):
while (true) {
    events = epoll_wait()
    for (e in events) handle(e)
}
- One thread watches all I/O.
- Short handlers run on events — back to waiting.
- Handlers must not block (would stall the loop).
- CPU-heavy work offloaded to worker threads.
9.2 Node.js structure
V8 JavaScript Runtime
   |
libuv (cross-platform)
   |-- epoll / kqueue / IOCP
   |-- Thread Pool
   |
OS Kernel
- I/O goes through libuv + epoll -> JavaScript callback.
- DNS, file reads, crypto use a thread pool (default 4).
- CPU-heavy work via Worker threads (v10+).
9.3 nginx master-worker
Master (root, reads config)
   |-- Worker 0 (epoll loop, tens of thousands of conns)
   |-- Worker 1
   |-- Worker N (usually one per CPU core)
- Each worker has its own event loop.
- Shared listen socket (SO_REUSEPORT) — the kernel distributes connections across workers.
- The master-worker split enables zero-downtime reload.
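The shared listen socket is one option set before bind() — a minimal sketch:

int opt = 1;
setsockopt(listen_fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt));
// each worker binds its own socket to the same port;
// the kernel load-balances incoming connections between them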
9.4 Redis — the beauty of single-threaded
Redis uses one main thread for all commands:
- epoll watches thousands of connections.
- All operations are in-memory — microsecond command latency.
- No locks, no races, fewer bugs.
Since 6.0, I/O threading: network read/write across multiple threads, command execution still single-threaded. Helps when the bottleneck is I/O, not CPU.
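Enabling it is a config switch — a minimal sketch of the relevant redis.conf lines:

# redis.conf (6.0+): thread network reads/writes,
# command execution stays single-threaded
io-threads 4
io-threads-do-reads yes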
9.5 io_uring-based modern architectures
- ScyllaDB: fully asynchronous by design (Seastar framework), with io_uring support. Cassandra-compatible, claiming up to 10x the throughput.
- QEMU/KVM: virtual disk I/O via io_uring — 40% speedup.
- Ceph: io_uring in backend storage.
- nginx: experimental io_uring support.
10. Reactor vs Proactor — two async philosophies
10.1 Reactor (notify-based)
- "Tell me when ready, I'll read."
- epoll, kqueue style.
- User manages buffers.
10.2 Proactor (completion-based)
- "Read here and tell me when done."
- Windows IOCP, io_uring style.
- Kernel writes directly to the buffer.
10.3 Why Proactor is faster
Reactor: "readable" -> read() syscall -> copy -> process.
Proactor: kernel completes copy in background -> user processes directly.
One fewer syscall. At high traffic this is decisive.
Windows has had IOCP-based proactor since NT 3.5 in 1994. Linux was long reactor-only via epoll; in 2019 it joined the proactor camp with io_uring.
11. Zero-copy in the network stack
11.1 sendfile vs read+write
For file -> socket:
normal:
    read(file)  : disk -> kernel -> user buffer (CPU copy 1)
    write(sock) : user buffer -> kernel -> NIC  (CPU copy 2)
sendfile:
    disk -> kernel -> NIC (no user-space copy; with scatter-gather DMA, no CPU copy at all)
This is why nginx and Apache use sendfile for static files — 2x faster with roughly 10x less CPU.
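The call itself is one line — a minimal sketch, assuming sock_fd, file_fd, and count:

#include <sys/sendfile.h>

off_t offset = 0;
// the copy happens entirely inside the kernel
ssize_t sent = sendfile(sock_fd, file_fd, &offset, count);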
11.2 splice, tee, vmsplice
splice moves data between fds through a pipe, no user-space copy.
splice(fd_in, NULL, pipefd[1], NULL, size, SPLICE_F_MORE);  // source -> pipe (in kernel)
splice(pipefd[0], NULL, fd_out, NULL, size, SPLICE_F_MORE); // pipe -> destination (in kernel)
11.3 MSG_ZEROCOPY
Linux 4.14+ send(fd, buf, size, MSG_ZEROCOPY):
- Requires enabling SO_ZEROCOPY on the socket first.
- DMA direct from the user buffer to the NIC.
- Must not touch the buffer until completion — notified via the socket error queue (MSG_ERRQUEUE).
- ~30% gain on large transfers.
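A minimal sketch of the send path (completion handling via the error queue omitted):

int one = 1;
setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)); // opt in once
ssize_t n = send(fd, buf, size, MSG_ZEROCOPY);
// buf must stay untouched until the completion notification
// arrives on the error queue (recvmsg with MSG_ERRQUEUE)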
11.4 io_uring + zero-copy
io_uring_prep_send_zc(sqe, fd, buf, size, 0, 0);
Same effect as MSG_ZEROCOPY inside io_uring. Decisive for CDNs and video servers pushing large responses.
12. Observation and tuning — in practice
12.1 Connection limits
ulimit -n          # per-process fd ceiling (often just 1024 by default)
ulimit -n 1000000  # raise for the current shell
For persistent limits, set nofile entries in /etc/security/limits.conf.
Also:
sysctl fs.file-max                   # system-wide fd limit
sysctl net.core.somaxconn            # accept backlog ceiling
sysctl net.ipv4.ip_local_port_range  # ephemeral port range
sysctl net.ipv4.tcp_tw_reuse         # reuse TIME_WAIT sockets
12.2 TCP buffer sizes
sysctl net.core.rmem_max   # max socket receive buffer
sysctl net.core.wmem_max   # max socket send buffer
sysctl net.ipv4.tcp_rmem   # min/default/max TCP receive buffer
When the bandwidth-delay product (BDP) exceeds the defaults, tuning is required: 10Gbps x 100ms = 125MB of in-flight data, while the default ~208KB ceiling is nowhere near enough.
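One way to raise the ceilings for that 10Gbps x 100ms path — the values are illustrative, not a recommendation:

# allow socket buffers up to 128MB
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"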
12.3 Observation tools
- ss -antp: current connections (replaces netstat).
- iftop, nload: real-time network.
- tcpdump / wireshark: packet dump.
- bpftrace: kernel-internal tracing.
- perf trace: syscall profiling.
12.4 io_uring adoption strategy
- Gradual migration: io_uring for hotspots, epoll elsewhere.
- Kernel 5.15+ recommended: earlier versions were buggy.
- Security: restrict opcodes with seccomp.
- Libraries: liburing (Jens Axboe official), tokio-uring (Rust), io_uring-rs.
13. Closing — lessons from 30 years of I/O evolution
The journey from a single read() in the 1970s:
- 1983 select: one thread, many fds.
- 1986 poll: no limit, still O(n).
- 1994 Windows IOCP: first proactor.
- 2000 FreeBSD kqueue: unified events.
- 2002 Linux epoll: O(1) events.
- 2019 Linux io_uring: syscall-free I/O.
Each generation was born from the prior one's limits. Candidates for the next revolution:
- DPDK / XDP: bypassing (DPDK) or short-circuiting (XDP) the kernel network stack for 10Gbps+ line rates.
- Userspace TCP: kernel bypass for microsecond latency.
- RDMA: CPU-bypass memory access.
- Smart NIC: offload I/O to the NIC.
No "final answer" — continuous change. But the core principles — "syscalls are expensive", "copies are expensive", "as the watched set grows, it must be O(1)" — are the same as 50 years ago.
Next time: the network stack itself — the TCP state machine, congestion control (CUBIC vs BBR), Nagle's algorithm, delayed ACK, TCP Fast Open, and why QUIC built a new stack on UDP.
References
- Dan Kegel — "The C10K problem" (1999-2014).
- Davide Libenzi — "Improving (network) I/O performance..." epoll proposal (2002).
- Jens Axboe — "Efficient IO with io_uring" (2019).
- Jens Axboe — "Ringing in a new asynchronous I/O API" (LWN.net, 2019).
- Linux kernel source: fs/eventpoll.c, fs/io_uring.c, io_uring/.
- liburing: https://github.com/axboe/liburing
- "The Secret To 10 Million Concurrent Connections" — Robert Graham.
- Cindy Sridharan — "The method to epoll's madness".
- ScyllaDB Engineering Blog — Seastar / io_uring series.
- "What Every Systems Programmer Should Know About Concurrency" (PDF) — Matt Kline.