Complete Guide to Async I/O Models 2025: epoll, io_uring, Reactor/Proactor, and async/await Internals
TL;DR

  • Async I/O is the foundation of modern high-performance servers: Nginx, Node.js, and Redis are all built on it.
  • Evolution: select → poll → epoll/kqueue → io_uring.
  • Reactor vs Proactor: event readiness notification vs operation-completion notification.
  • io_uring: the Linux 5.1+ revolution. More efficient than epoll, and truly async.
  • async/await: the compiler rewrites your code as a state machine.

1. Evolution of I/O Models

1.1 Five I/O Models

POSIX classification:

  1. Blocking I/O: wait until the operation completes.
  2. Non-blocking I/O: return immediately, poll for completion.
  3. I/O Multiplexing: select/poll/epoll.
  4. Signal-driven I/O: notified via signals.
  5. Asynchronous I/O: true async (POSIX AIO, io_uring).

1.2 Blocking I/O — The Simplest

data = sock.recv(1024)  # blocks until data arrives
process(data)

Problems:

  • One thread = one connection.
  • 10K connections = 10K threads (memory blow-up).
  • The C10K problem.

1.3 Multi-Threaded as the Fix?

import threading

def handle_client(sock):
    data = sock.recv(1024)
    process(data)

while True:
    client, addr = server.accept()
    threading.Thread(target=handle_client, args=(client,)).start()

Limits:

  • Per-thread memory (1–8 MB).
  • Context-switching cost.
  • 10K threads = death.

1.4 Non-blocking I/O

sock.setblocking(False)
try:
    data = sock.recv(1024)
except BlockingIOError:
    pass  # no data — do something else

Problem: infinite-loop polling pins CPU at 100%.

1.5 Enter I/O Multiplexing

Idea: one syscall watching many fds at once.

import select

ready, _, _ = select.select([sock1, sock2, sock3], [], [])
for sock in ready:
    data = sock.recv(1024)

A single thread can now handle thousands of connections.


2. select and poll

2.1 select (1983)

fd_set readfds;
FD_ZERO(&readfds);
FD_SET(sock, &readfds);

int n = select(max_fd + 1, &readfds, NULL, NULL, &timeout);

if (FD_ISSET(sock, &readfds)) {
    // readable
}

Problems:

  • FD_SETSIZE limit (typically 1024).
  • O(n) scan: every call inspects every fd.
  • fd_set copy: user space ↔ kernel on each call.

2.2 poll (1986)

struct pollfd fds[1024];
fds[0].fd = sock;
fds[0].events = POLLIN;

int n = poll(fds, 1024, timeout);

if (fds[0].revents & POLLIN) {
    // readable
}

Improvements:

  • No hard fd-count limit.
  • Dynamically sized pollfd array instead of a fixed fd_set.

Still problematic:

  • O(n) scan.
  • The fd list is passed on every call.

2.3 Limits of select/poll

10,000 connections with only 100 active:

  • select/poll scan all 10,000 every time.
  • Wasted CPU.

A new mechanism was needed.


3. epoll — The Linux Revolution

3.1 Arrival (2002, Linux 2.5.44)

int epfd = epoll_create1(0);

struct epoll_event ev;
ev.events = EPOLLIN;
ev.data.fd = sock;
epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);

struct epoll_event events[64];
int n = epoll_wait(epfd, events, 64, -1);

for (int i = 0; i < n; i++) {
    int fd = events[i].data.fd;
    // handle
}

3.2 Why epoll Wins

1. O(1) event detection:

  • The kernel returns only the ready fds.
  • With 10K connections and 100 active, epoll_wait returns just those 100.

2. Register once:

  • epoll_ctl registers each fd a single time.
  • No need to re-pass the full fd list on every call.

3. Trigger modes:

  • Level-Triggered (LT): keeps notifying while data is present (default).
  • Edge-Triggered (ET): notifies only on state change (used by Nginx).
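
In Python, the register-once model shows up directly in the standard selectors module, which picks epoll on Linux (kqueue on BSD/macOS). A minimal echo-server sketch — the port and echo logic are placeholders:

import selectors
import socket

sel = selectors.DefaultSelector()  # epoll on Linux, kqueue on BSD/macOS

server = socket.socket()
server.bind(("0.0.0.0", 8080))
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ)  # register once, like epoll_ctl

while True:
    # Returns only the ready fds, like epoll_wait — no O(n) scan
    for key, mask in sel.select():
        if key.fileobj is server:
            conn, _ = server.accept()
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ)
        else:
            data = key.fileobj.recv(1024)
            if data:
                key.fileobj.send(data)  # naive echo; may short-write
            else:
                sel.unregister(key.fileobj)
                key.fileobj.close()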

3.3 ET vs LT

LT (Level-Triggered):

// 16KB available, you read 8KB → next epoll_wait still notifies

Easier to use, slightly slower.

ET (Edge-Triggered):

// Only notified when new data arrives → must drain fully
while (true) {
    ssize_t n = recv(fd, buf, sizeof(buf), 0);
    if (n < 0 && errno == EAGAIN) break;   // fully drained
    if (n <= 0) { close(fd); break; }      // error or peer closed
    process(buf, n);
}

Faster, but easy to lose events if mishandled.

3.4 Systems Built on epoll

  • Nginx — core.
  • Node.js (via libuv).
  • Redis.
  • HAProxy.
  • Memcached.
  • Nearly every high-performance Linux server.

3.5 Equivalents on Other OSes

OS           API
Linux        epoll
macOS/BSD    kqueue
Windows      IOCP
Solaris      /dev/poll (deprecated), event ports

4. kqueue (BSD/macOS)

4.1 More Powerful Than epoll

int kq = kqueue();

struct kevent change;
EV_SET(&change, sock, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, NULL);
kevent(kq, &change, 1, NULL, 0, NULL);

struct kevent events[64];
int n = kevent(kq, NULL, 0, events, 64, NULL);

Richer than epoll:

  • File changes (EVFILT_VNODE).
  • Signals (EVFILT_SIGNAL).
  • Timers (EVFILT_TIMER).
  • Process exit (EVFILT_PROC).
  • Disk I/O, and more.

Downside: smaller user base than epoll.
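
On macOS/BSD, Python exposes kqueue directly through the select module, so the extra filters are easy to try. A minimal sketch of a kernel timer via EVFILT_TIMER (the 1000 ms period is an arbitrary choice):

import select

kq = select.kqueue()

# Register a kernel timer that fires every 1000 ms (EVFILT_TIMER)
timer = select.kevent(
    1,                                  # arbitrary identifier
    filter=select.KQ_FILTER_TIMER,
    flags=select.KQ_EV_ADD | select.KQ_EV_ENABLE,
    data=1000,                          # period in milliseconds
)
kq.control([timer], 0)                  # apply changes, fetch no events

while True:
    for ev in kq.control([], 1, None):  # block until an event fires
        print("timer fired, ident =", ev.ident)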


5. io_uring — The New Revolution

5.1 Limits of epoll

Even epoll has limits:

  • Synchronous I/O: you still call read/write yourself after epoll_wait.
  • Syscall cost: a user ↔ kernel transition on every call.
  • Not truly async: read/write can still block, and epoll cannot watch regular files at all.

5.2 io_uring (2019, Linux 5.1)

Built by Jens Axboe (Linux kernel developer). True async I/O.

Core ideas:

  • Two ring buffers (Submission, Completion).
  • Shared memory between user space and kernel.
  • Near-zero syscalls.

[User Space]                    [Kernel Space]
[Submission Queue (SQ)] ────→ [Worker]
[Completion Queue (CQ)] ←──── [I/O complete]

5.3 Example Usage

#include <liburing.h>   // liburing helper library

struct io_uring ring;
io_uring_queue_init(256, &ring, 0);  // queue depth 256

// Submit work
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
io_uring_submit(&ring);

// Wait for completion
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
// cqe->res = bytes read (or a negative errno on failure)
io_uring_cqe_seen(&ring, cqe);

5.4 Advantages of io_uring

1. Truly async:

  • Disk I/O no longer blocks.
  • Unified model for network and disk.

2. Fewer syscalls:

  • Batch submission.
  • Some modes allow zero syscalls.

3. Many operations:

  • read, write, send, recv.
  • accept, connect.
  • fsync, fallocate.
  • splice, tee.
  • statx, openat, close.

4. Faster:

  • 30–50% improvement over epoll on certain workloads.

5.5 io_uring in the Wild

  • Nginx 1.21+: optional support.
  • PostgreSQL 17: partial adoption.
  • Tokio (Rust): io-uring backend.
  • ScyllaDB: core technology.
  • Cloud Hypervisor.

5.6 Future of io_uring

  • eBPF integration.
  • Direct NVMe access.
  • GPU/FPGA I/O.
  • All syscalls async.

io_uring is the future of Linux I/O.


6. Reactor vs Proactor

6.1 Reactor Pattern

[Event Loop]
   ↓ epoll_wait
[Event: fd=5 readable]
[Handler]
[read(5, ...)]  ← the user calls read directly

Characteristics:

  • Event-readiness notification.
  • The user calls read/write.
  • Built on epoll, kqueue.

Examples: Nginx, Node.js, Redis.
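
A toy reactor makes the pattern concrete: the loop waits for readiness, then dispatches to a handler that performs the read itself. A minimal sketch using Python's selectors module (the handler wiring is simplified):

import selectors

class Reactor:
    def __init__(self):
        self.sel = selectors.DefaultSelector()

    def register(self, fileobj, handler):
        # Store the handler; it is invoked when the fd is ready
        self.sel.register(fileobj, selectors.EVENT_READ, data=handler)

    def run(self):
        while True:
            for key, _ in self.sel.select():   # "fd is ready"
                handler = key.data
                handler(key.fileobj)           # handler calls recv() itself

# Usage: reactor.register(sock, lambda s: process(s.recv(1024)))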

6.2 Proactor Pattern

[User]"read 1024 bytes from this fd"
[Kernel]async work
[Completion: data ready]
[Handler] → user receives data

Characteristics:

  • Delegate the operation itself to the kernel.
  • Notified on completion.
  • Built on IOCP, io_uring.

Examples: Boost.ASIO, Windows IOCP, io_uring.
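
Python's asyncio ships both styles: SelectorEventLoop (reactor, epoll/kqueue) and, on Windows, ProactorEventLoop on top of IOCP — the default there since Python 3.8. A minimal sketch:

import sys
import asyncio

if sys.platform == "win32":
    # IOCP-backed Proactor loop (already the default since Python 3.8)
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

async def main():
    # The same async/await code runs on either loop; the readiness vs
    # completion distinction is hidden inside the event loop.
    await asyncio.sleep(0.1)

asyncio.run(main())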

6.3 Comparison

             Reactor                 Proactor
Event        "ready"                 "completed"
Read         user calls it           kernel handles it
OS support   broad (epoll, kqueue)   limited (io_uring, IOCP)
Abstraction  simple                  complex
Performance  good                    better (in theory)

6.4 ASIO's Unified Model

Boost.ASIO (C++) exposes Reactor and Proactor through the same interface:

async_read(socket, buffer, [](error_code ec, size_t bytes) {
    // called on completion
});

Internally: epoll on Linux, IOCP on Windows, io_uring on recent builds.


7. async/await Internals

7.1 What async/await Actually Does

async def fetch_data():
    data = await http_request("https://api.example.com")
    return process(data)

There is no magic. The compiler rewrites this as a state machine.

7.2 The Rewrite (Pseudo-code)

class FetchDataStateMachine:
    state = 0

    def resume(self):
        if self.state == 0:
            self.future = http_request("https://api.example.com")
            self.future.set_callback(self.resume)
            self.state = 1
            return  # suspend

        if self.state == 1:
            data = self.future.result()
            return process(data)

Core idea:

  • Suspend the function at each await.
  • Resume when the result is ready.
  • Return control to the event loop in between.

7.3 The Event Loop

# Conceptually, asyncio's event loop runs something like this:

while True:
    # 1. Run tasks on the ready queue
    while ready_queue:
        task = ready_queue.pop()
        task.run()

    # 2. Wait for I/O events via epoll_wait
    events = epoll.wait(timeout)

    # 3. Move waiters back onto the ready queue
    for event in events:
        task = pending[event.fd]
        ready_queue.append(task)

7.4 Python asyncio

import asyncio

async def main():
    # Concurrent execution
    results = await asyncio.gather(
        fetch_user(1),
        fetch_user(2),
        fetch_user(3),
    )
    return results

asyncio.run(main())

Internals:

  • The selectors module (abstracts epoll/kqueue/select).
  • Event loop.
  • Tasks and Futures.
  • Coroutines.

7.5 JavaScript (Node.js)

async function main() {
    const data = await fetch("https://api.example.com")
    const json = await data.json()
    console.log(json)
}

Internals:

  • libuv (C library).
  • Event loop.
  • epoll/kqueue/IOCP.

7.6 Rust Tokio

#[tokio::main]
async fn main() {
    let data = fetch("https://api.example.com").await;
    println!("{:?}", data);
}

Internals:

  • mio (abstracts epoll/kqueue).
  • Optional io-uring.
  • M:N scheduling across multiple OS threads.
  • Work-stealing.

8. Single-Threaded vs Multi-Threaded

8.1 Single-Threaded (Node.js, Redis, Nginx)

[1 thread]
[Event Loop] (epoll)
[Callback 1] [Callback 2] [Callback 3]

Pros:

  • No locks (no race conditions).
  • Simple.
  • Memory-efficient.

Cons:

  • CPU work blocks the loop.
  • Only one core is used.
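
The first caveat is easy to demonstrate: one synchronous call starves every other task on the loop. A minimal sketch, where a 2-second time.sleep stands in for CPU-bound work:

import asyncio
import time

async def ticker():
    while True:
        print("tick")
        await asyncio.sleep(0.5)

async def blocking_task():
    time.sleep(2)   # synchronous: the whole loop stalls, ticks stop
    # Fix: await asyncio.to_thread(time.sleep, 2) pushes it off the loop

async def main():
    t = asyncio.create_task(ticker())
    await asyncio.sleep(1)
    await blocking_task()   # no ticks print during these 2 seconds
    await asyncio.sleep(1)
    t.cancel()

asyncio.run(main())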

8.2 Multi-Threaded + Event Loops

[N threads (one per CPU core)]
[Event Loop per thread]
[Work-stealing]

Examples: Tokio (Rust), Akka (JVM), Kotlin Coroutines.

Pros:

  • Multi-core utilization.
  • Hundreds of thousands of concurrent tasks in a single process.

Cons:

  • Synchronization required.
  • Harder to debug.

8.3 SO_REUSEPORT

Linux 3.9+ feature:

int one = 1;
setsockopt(sock, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

Multiple processes can listen on the same port:

  • The kernel load-balances automatically.
  • Each process gets its own epoll instance.
  • Used by Nginx and HAProxy.

[Port 80]
   ├─ [Worker 1] (epoll)
   ├─ [Worker 2] (epoll)
   └─ [Worker 3] (epoll)
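
The same pattern in Python: each forked worker binds its own socket with SO_REUSEPORT and runs its own accept loop, letting the kernel spread connections. A minimal sketch (Linux 3.9+; the worker count and port are placeholders):

import os
import socket

NUM_WORKERS = 3

def worker():
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("0.0.0.0", 8080))   # every worker binds the same port
    sock.listen()
    while True:
        conn, _ = sock.accept()    # kernel load-balances across workers
        conn.sendall(b"handled by pid %d\n" % os.getpid())
        conn.close()

for _ in range(NUM_WORKERS):
    if os.fork() == 0:             # child: run the accept loop
        worker()
        os._exit(0)

os.wait()                          # parent waits; workers run forever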

9. Performance Comparison

9.1 Handling 10K Connections

Model                   Memory    Throughput
Thread per connection   ~10 GB    low
select                  low       very low
poll                    low       low
epoll                   low       high
io_uring                low       very high

9.2 Real Benchmarks

HTTP server (10K connections):

Apache (prefork):    5,000 req/s
Apache (worker):     20,000 req/s
Nginx (epoll):       100,000 req/s
Nginx (io_uring):    150,000 req/s

ScyllaDB (io_uring):

  • 10x+ throughput over Cassandra.
  • 1M+ ops/s on a single node.

9.3 Modern Storage Engines

ScyllaDB's secret sauce:

  • io_uring + DPDK.
  • One shard per core.
  • Lock-free.
  • Userspace TCP stack.

10. In Practice — Building a High-Performance Server

10.1 Python (asyncio)

import asyncio

async def handle_client(reader, writer):
    while True:
        data = await reader.read(1024)
        if not data:
            break
        writer.write(data)
        await writer.drain()
    writer.close()

async def main():
    server = await asyncio.start_server(handle_client, '0.0.0.0', 8080)
    async with server:
        await server.serve_forever()

asyncio.run(main())

Internally uses selectors → epoll.

10.2 Rust (Tokio)

use tokio::net::TcpListener;
use tokio::io::{AsyncReadExt, AsyncWriteExt};

#[tokio::main]
async fn main() {
    let listener = TcpListener::bind("0.0.0.0:8080").await.unwrap();

    loop {
        let (mut socket, _) = listener.accept().await.unwrap();

        tokio::spawn(async move {
            let mut buf = [0; 1024];
            loop {
                let n = socket.read(&mut buf).await.unwrap();
                if n == 0 { return }
                socket.write_all(&buf[..n]).await.unwrap();
            }
        });
    }
}

Tokio handles:

  • mio (epoll).
  • M:N scheduling.
  • Work-stealing.

10.3 Go

package main

import "net"

func main() {
    listener, _ := net.Listen("tcp", ":8080")

    for {
        conn, _ := listener.Accept()
        go handle(conn)
    }
}

func handle(conn net.Conn) {
    buf := make([]byte, 1024)
    for {
        n, err := conn.Read(buf)
        if err != nil { return }
        conn.Write(buf[:n])
    }
}

The Go runtime handles:

  • netpoll (epoll/kqueue).
  • Goroutine scheduling.
  • Channel integration.

10.4 Comparison

                 Python asyncio     Rust Tokio            Go
Syntax           async/await        async/await           go func()
Performance      moderate           top                   very good
Learning curve   easy               steep                 easy
Memory           moderate           very low              low
Best for         fast prototyping   maximum performance   balance

Quiz

1. What is the core difference between select and epoll?

Answer: select scans every fd on every call (O(n)), is limited by FD_SETSIZE (typically 1024), and copies the fd_set between user and kernel on every call. epoll returns only fds with events (O(1)), has no fd-count limit, and registers fds once via epoll_ctl. With 10K connections and 100 active: select scans 10K every time, epoll returns just 100. That is why Nginx, Node.js, and Redis are all built on epoll.

2. Why is io_uring better than epoll?

Answer: epoll still has limits — synchronous read/write (you still call them after epoll_wait), syscall cost, and disk I/O that can still block. io_uring: (1) two ring buffers (SQ/CQ) plus shared memory, (2) near-zero syscalls (batch submission), (3) truly async (disk I/O no longer blocks), (4) supports not just read/write but also accept, connect, fsync, and more. 30–50% faster. Used by ScyllaDB and PostgreSQL 17. The future of Linux I/O.

3. What is the difference between the Reactor and Proactor patterns?

Answer: Reactor (epoll-based): the event loop signals "fd is ready" and the user calls read. Proactor (io_uring / IOCP-based): the user requests "read this data for me", the kernel performs the work, and signals on completion. Reactor makes the user do the work; Proactor makes the kernel do it. Boost.ASIO abstracts both patterns behind the same interface.

4. How does async/await actually work?

Answer: No magic. The compiler rewrites the function as a state machine. At each await the function suspends and registers a callback on the future/promise. When the result is ready, it resumes. The event loop (1) runs tasks on the ready queue, (2) waits for I/O via epoll_wait, (3) moves completed tasks back onto the ready queue. Python asyncio, JS/Node.js, Rust Tokio, and Go goroutines all follow this pattern.

5. What is SO_REUSEPORT and how is it used?

Answer: A Linux 3.9+ feature. Multiple processes can listen on the same port simultaneously. The kernel distributes new connections automatically. Each process has its own epoll instance, giving you true lock-free parallelism. Used by Nginx and HAProxy — each worker process listens on the same port, and the kernel balances between them. It sidesteps single-process limits. One worker per CPU core is the common pattern.

