Linux I/O Evolution Deep Dive — blocking, select, poll, epoll, io_uring (2025)

0. Why does a server get slow — the C10K problem revisited

In 1999, when Dan Kegel wrote "The C10K problem", handling 10,000 concurrent connections on one server seemed impossible. The conventional wisdom:

One connection = one thread.
10,000 threads = roughly 20GB of stack address space alone (2MB x 10,000).
Context switching costs explode.
Kernel data structures exhausted.

Yet in the 2020s nginx handles one million connections on a single server. What happened? This post follows 30 years of Linux I/O interface evolution — the revolutions hiding behind a single line of read().

1. Blocking I/O — a 1970s legacy

1.1 The simplest model

char buf[1024];
int n = read(fd, buf, sizeof(buf));  // blocks until data arrives

The kernel puts the thread on a wait queue and runs another thread.
Wakes it when data arrives.
Synchronous from the programmer's view, so code stays clean.

1.2 The problem: server handling only one connection

while (1) {
    int client = accept(server_fd, ...);
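    while (1) {
        int n = read(client, ...);  // blocks; only this client is served
        if (n <= 0) break;          // peer closed or error
        write(client, ...);
    }
    close(client);
}

While one client is being served, every other client waits. The fix: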

1.3 Thread or process per connection — the Apache prefork model

// master
while (1) {
    int client = accept(server_fd, ...);
    spawn_worker(client);  // thread/process per connection
}

Pros: intuitive logic.
Cons: thread creation cost, stack memory (2MB each), context-switch overhead, system collapse at 10k connections.

Apache's prefork MPM (one process per connection) follows this model, and it ran straight into the C10K wall.

2. Select — the first I/O multiplexing (1983)

2.1 "One thread watching many connections"

fd_set readfds;
FD_ZERO(&readfds);
FD_SET(fd1, &readfds);
FD_SET(fd2, &readfds);

int n = select(maxfd+1, &readfds, NULL, NULL, NULL);

Introduced in 4.2BSD, 1983. Idea: "watch many fds at once, tell me when any is ready."
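The call above can be wrapped into a small readiness check. This is a minimal sketch, not a production event loop: the helper name readable_now and its ready[] output array are made up for illustration, and a zero timeout is used so the call never blocks.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper: reports which of fds[0..nfds) are readable right now.
 * Stores 1 or 0 per fd in ready[] and returns the number of ready fds,
 * or -1 if select() fails. */
int readable_now(const int *fds, int nfds, int *ready)
{
    fd_set readfds;
    int maxfd = -1;

    FD_ZERO(&readfds);                    /* the set is rebuilt on every call */
    for (int i = 0; i < nfds; i++) {
        /* fd_set only has room for fds 0..FD_SETSIZE-1 */
        assert(fds[i] >= 0 && fds[i] < FD_SETSIZE);
        FD_SET(fds[i], &readfds);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    struct timeval tv = { 0, 0 };         /* zero timeout: check and return */
    int n = select(maxfd + 1, &readfds, NULL, NULL, &tv);
    if (n < 0)
        return -1;

    /* select() overwrote readfds; only the ready fds are still set */
    for (int i = 0; i < nfds; i++)
        ready[i] = FD_ISSET(fds[i], &readfds) ? 1 : 0;
    return n;
}
```

With two pipes where only the second has a byte written to it, the helper returns 1 and marks only the second read end as ready.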

2.2 Three fatal limits of select

Limit 1: FD_SETSIZE ceiling (1024)

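fd_set is a fixed 1024-bit bitmap, so it can only describe fds 0 through 1023. Calling FD_SET with fd 1024 or higher is undefined behavior: it writes past the end of the bitmap, which typically corrupts the stack. glibc hard-codes FD_SETSIZE, so raising it on Linux effectively means rebuilding glibc.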

Limit 2: O(n) scan

select on every call:

Copies the whole fd bitmap to the kernel.
Kernel walks every fd checking state.
Copies result bitmap back.
User walks every fd again to find which are ready.

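With 10k connections, every call scans all 10k fds, so the cost grows with the number of watched connections rather than the number of active ones. Even if only 1% are active, the other 99% are still checked on every call.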

Limit 3: bitmap reset per call

select overwrites fd_set on return — must be reset before every call.

3. Poll — same problem, different packaging (1986)

3.1 struct pollfd array

struct pollfd fds[10000];
fds[0].fd = sock1; fds[0].events = POLLIN;

int n = poll(fds, 10000, timeout);

Introduced in System V. Improvements:

No fixed fd count limit (as large as the array, bounded only by the process fd limit).
Struct array instead of bitmap, with meaningful status bits (POLLHUP, POLLERR).
events is preserved across calls (results come back in revents), so no resetting.

But O(n) scan remains. 10k connections still scan 10k per call.
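To see the events/revents split and the extra status bits in action, here is a minimal sketch. The helper name poll_one is hypothetical; it polls a single fd and hands back the revents mask.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>   /* pipe(), write(), read(), close() for the usage below */

/* Hypothetical helper: polls one fd and returns its revents mask,
 * 0 if the timeout expired with nothing to report, or -1 if poll() failed. */
int poll_one(int fd, short events, int timeout_ms)
{
    assert(fd >= 0);
    struct pollfd p = { .fd = fd, .events = events, .revents = 0 };
    int n = poll(&p, 1, timeout_ms);
    if (n < 0)
        return -1;
    return p.revents;   /* the request in p.events was left untouched */
}
```

On Linux, polling the read end of an empty pipe yields 0, after a write it yields POLLIN, and once the data is consumed and the write end is closed it yields POLLHUP, a condition select cannot express separately.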

4. Nonblocking I/O — the basis for not blocking

4.1 O_NONBLOCK flag

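int flags = fcntl(fd, F_GETFL, 0);
fcntl(fd, F_SETFL, flags | O_NONBLOCK);  // keep the flags already set
ssize_t n = read(fd, buf, size);
if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
    // no data yet, retry later
}

Returns immediately. If nothing is available, read fails with errno set to EAGAIN (or EWOULDBLOCK).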

4.2 select/poll + nonblocking combination

The reactor pattern in embryo: one thread watches many fds, uses nonblocking reads to do actual I/O.
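The nonblocking half of that combination can be sketched as two helpers. The names set_nonblocking and drain are illustrative, not from any library: the first adds O_NONBLOCK without clobbering other flags, the second reads whatever is available and stops at EAGAIN instead of blocking.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Adds O_NONBLOCK to the fd's existing status flags. Returns 0 or -1. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Reads everything currently available into buf (up to cap bytes).
 * Stops at EAGAIN/EWOULDBLOCK (nothing more right now) or at EOF.
 * Returns the number of bytes read, or -1 on a real error. */
ssize_t drain(int fd, char *buf, size_t cap)
{
    assert(buf != NULL);
    size_t total = 0;
    while (total < cap) {
        ssize_t n = read(fd, buf + total, cap - total);
        if (n > 0) {
            total += (size_t)n;
            continue;
        }
        if (n == 0)
            break;                                  /* EOF */
        if (errno == EINTR)
            continue;                               /* interrupted, retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;                                  /* drained for now */
        return -1;
    }
    return (ssize_t)total;
}
```

In a select or poll loop, drain would be called on each fd reported as readable; because the fd is nonblocking, a spurious or stale readiness report costs one failed read rather than a stalled loop.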

5. Epoll — the Linux revolution (2002)

5.1 Context

By the early 2000s C10K was a real problem (ICQ, IRC, game servers). Davide Libenzi's epoll was merged in Linux 2.5.44 (2002).

5.2 Three-call API

int epfd = epoll_create1(0);                    // 1. create the instance

struct epoll_event ev;
ev.events = EPOLLIN;
ev.data.fd = sock;
epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);      // 2. register the fd (once)

struct epoll_event events[64];
int n = epoll_wait(epfd, events, 64, timeout);  // 3. wait for events
for (int i = 0; i < n; i++) {
    read(events[i].data.fd, ...);               // only ready fds come back
}

5.3 Why O(1)

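Kernel keeps the fd set between calls (registered once via epoll_ctl and stored in a red-black tree).
When an fd becomes ready, a kernel callback appends it to the ready list.
epoll_wait returns only the ready list, so you iterate only what's ready.

Strictly speaking, epoll_wait costs O(number of ready fds) rather than O(1); the point is that it no longer depends on how many fds are watched. Of 10k connections with 100 active, only 100 are walked, where select/poll walk all 10k on every call.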
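A minimal sketch of the register-once, wait-many pattern follows. The helper names epoll_watch and epoll_ready are invented for this example; the 64-entry event buffer simply mirrors the snippet in 5.2.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Creates an epoll instance and registers fds[0..nfds) for EPOLLIN.
 * Returns the epoll fd, or -1 on failure. */
int epoll_watch(const int *fds, int nfds)
{
    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;
    for (int i = 0; i < nfds; i++) {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[i] };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev) < 0) {
            close(epfd);
            return -1;
        }
    }
    return epfd;
}

/* Copies up to cap ready fds into out[] and returns how many there were
 * (-1 if epoll_wait failed). Only ready fds are visited. */
int epoll_ready(int epfd, int *out, int cap, int timeout_ms)
{
    struct epoll_event events[64];
    assert(cap > 0);
    if (cap > 64)
        cap = 64;
    int n = epoll_wait(epfd, events, cap, timeout_ms);
    for (int i = 0; i < n; i++)
        out[i] = events[i].data.fd;
    return n;
}
```

With three pipes registered and a byte written to only one of them, epoll_ready returns a single entry: the read end of that pipe. The two idle fds never appear in the result, which is the whole difference from select and poll.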

5.4 Level-triggered vs Edge-triggered

Level-triggered (LT, default):

Keeps notifying while the condition is true. e.g. data in buffer.
"Even if you don't read it all, next epoll_wait will tell you again."
Intuitive, compatible with select/poll mental model.

Edge-triggered (ET):

Only notifies on state transitions.
Exactly once "as it becomes readable".
You must read until EAGAIN; otherwise data left in the buffer produces no further notification.
Higher performance, harder coding.

// must be written this way in ET mode (fd is nonblocking)
while (1) {
    ssize_t n = read(fd, buf, size);
    if (n == -1 && errno == EAGAIN) break;  // drained: wait for the next edge
    if (n <= 0) break;                      // error, or EOF when n == 0
    process(buf, n);
}

nginx uses ET for extreme performance.
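The difference between the two modes is easy to observe. The sketch below uses a hypothetical helper, wait_twice, that registers an fd, calls epoll_wait twice without reading in between, and records how many events each call returned.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Registers fd for EPOLLIN (plus extra, e.g. EPOLLET) on a fresh epoll
 * instance, then calls epoll_wait twice WITHOUT reading in between.
 * counts[i] receives the return value of the i-th epoll_wait.
 * Returns 0 on success, -1 if setup failed. */
int wait_twice(int fd, uint32_t extra, int counts[2])
{
    assert(counts != NULL);
    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;

    struct epoll_event ev = { .events = EPOLLIN | extra, .data.fd = fd };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(epfd);
        return -1;
    }

    for (int i = 0; i < 2; i++) {
        struct epoll_event out;
        counts[i] = epoll_wait(epfd, &out, 1, 0);   /* timeout 0: never block */
    }
    close(epfd);
    return 0;
}
```

For a pipe that already holds unread data, level-triggered mode (extra = 0) reports the fd on both calls, while EPOLLET reports it on the first call only. That silent second call is why ET code has to drain until EAGAIN.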

5.5 EPOLLEXCLUSIVE — solving thundering herd

If several threads or processes each add the same listen socket to their own epoll instance, all of them wake when a connection arrives. Only one accept succeeds and the rest get EAGAIN, which is wasted work.

Linux 4.5+ EPOLLEXCLUSIVE: "wake one or more waiters on this fd instead of all of them." nginx uses this.

5.6 Limits of epoll

Still many syscalls: every event is followed by its own read or write call. At high load the syscall overhead dominates.
LT extra wakeups: an fd is reported again on every epoll_wait until it is fully drained.
Regular files are not pollable: epoll cannot help with disk I/O, which ends up blocking.

These limits spawned io_uring.

6. kqueue — the BSD alternative

Same era, FreeBSD (2000) introduced kqueue. Similar to epoll but:

More event sources: filesystem events, signals, timers.
Unified API for everything.

macOS also uses kqueue. Cross-platform libraries (libevent, libuv) abstract over "epoll on Linux, kqueue on BSD/macOS, IOCP on Windows".

7. Async disk I/O — a different war from networking

7.1 Why epoll doesn't work for files

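On Linux, epoll_ctl rejects a regular file with EPERM, while select and poll always report it as "ready". Either way, readiness notification is meaningless for files: data is either in the page cache or not, and there is no "becoming ready" state. If it is not cached, the read blocks on disk.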
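On Linux the mismatch is visible at registration time: epoll_ctl refuses a regular file with EPERM. A small check, with a made-up helper name epoll_add_errno:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>   /* mkstemp() for the usage below */
#include <sys/epoll.h>
#include <unistd.h>

/* Tries to register fd for EPOLLIN on a throwaway epoll instance.
 * Returns 0 if registration worked, otherwise the errno that was reported. */
int epoll_add_errno(int fd)
{
    assert(fd >= 0);
    int epfd = epoll_create1(0);
    if (epfd < 0)
        return errno;

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    int rc = epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    int err = (rc == 0) ? 0 : errno;

    close(epfd);
    return err;
}
```

Called on the read end of a pipe this returns 0; called on a file created with mkstemp it returns EPERM, because regular files do not support polling at all.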

7.2 POSIX AIO — a failed first try

struct aiocb cb = { .aio_fildes = fd, .aio_buf = buf, .aio_nbytes = size };
aio_read(&cb);

Linux's glibc POSIX AIO is actually a userspace thread pool faking it. Not true kernel async.

7.3 Linux AIO (libaio) — limited success

io_submit, io_getevents. Real kernel AIO but:

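Only truly asynchronous with O_DIRECT (bypassing the page cache); buffered I/O falls back to blocking inside io_submit.
Can still block in many other cases, such as metadata updates.
Rarely used in practice.

Used by MySQL InnoDB and some other databases. It never went mainstream for general-purpose servers.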

8. io_uring — the 2019 Linux I/O revolution

8.1 Jens Axboe's vision

In 2019, Linux 5.1, block I/O maintainer Jens Axboe introduced io_uring. Core idea:

"Submit I/O and collect results without a syscall at all."

8.2 Two ring buffers

Submission Queue (SQ): user inserts I/O requests.
Completion Queue (CQ): kernel places results.

Both are mmap'd as shared memory between user and kernel:

user:
  writes request to SQ
  calls io_uring_enter() (optional)

kernel:
  reads request from SQ
  performs I/O
  writes result to CQ

user:
  reads result from CQ

8.3 Why it is revolutionary

1. Fewer syscalls

Batch submit: many requests per io_uring_enter().
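SQPOLL mode (IORING_SETUP_SQPOLL): a kernel thread polls the SQ, so submission needs no syscall while that thread is awake.
Gains of 10x or more are claimed at high QPS, though the actual figure depends heavily on the workload.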

2. Unified interface

Network, files, timers, and signals (via signalfd) all go through the same API. No mixing epoll/AIO.

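3. Linked SQEs

"When this succeeds, run the next" (the IOSQE_IO_LINK flag). e.g. submit openat -> read -> close as one chain; passing the newly opened fd along the chain relies on direct descriptors in newer kernels.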

4. Buffer selection

Instead of preallocating buffers for thousands of connections, pick from a pool on demand.

8.4 Growth trajectory

5.1 (2019): initial release (readv/writev, fsync, poll).
5.5: accept and connect.
5.6: openat, close, read/write, send/recv.
5.7: splice, provided buffers.
5.19: multishot accept.
6.0: zero-copy send for networking.

As of 2025 a large share of I/O-related syscalls have an io_uring equivalent.

8.5 The dark side

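Many security issues have been found. In 2023 Google reported that it had disabled io_uring in ChromeOS and restricted it on Android. Reasons:

Large attack surface (many opcodes).
New path to kernel vulnerabilities.
Hard to control with existing seccomp: seccomp filters syscalls, but io_uring operations travel through shared memory rather than as individual syscalls.

Direction: seccomp extensions for io_uring, ACLs, per-ring restrictions (IORING_REGISTER_RESTRICTIONS, Linux 5.10+), the kernel.io_uring_disabled sysctl (Linux 6.6+), and "allowed opcode subset" policies.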

9. Real architectures — the event loop pattern

9.1 Reactor pattern (Node.js, nginx, Redis)

Event Loop (single thread)
  while (true) {
    events = epoll_wait()
    for (e in events) handle(e)
  }

One thread watches all I/O.
Short handlers run on events — back to waiting.
Handlers must not block (would stall the loop).
CPU-heavy work offloaded to worker threads.

9.2 Node.js structure

V8 JavaScript Runtime
libuv (cross-platform)
  epoll / kqueue / IOCP
  Thread Pool
OS Kernel

I/O goes through libuv + epoll -> JavaScript callback.
DNS, file reads, crypto use a thread pool (default 4).
CPU-heavy work via Worker threads (v10+).

9.3 nginx master-worker

Master (root, config)
  Worker 0 (epoll loop, tens of thousands of conns)
  Worker 1
  Worker N (usually CPU core count)

Each worker has its own event loop.
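Listen sockets are shared: by default workers inherit the same sockets from the master; with the reuseport option (SO_REUSEPORT) each worker gets its own socket and the kernel distributes connections.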
Master-worker split enables zero-downtime reload.

9.4 Redis — the beauty of single-threaded

Redis uses one main thread for all commands:

epoll watches thousands of connections.
All memory ops — microsecond command latency.
No locks, no races, fewer bugs.

Since 6.0, I/O threading: socket reads and writes are spread across multiple threads while command execution stays single-threaded. Helps when network I/O, not command execution, is what saturates the main thread.

9.5 io_uring-based modern architectures

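ScyllaDB: built on the Seastar framework, which started on Linux AIO and later gained an io_uring backend. Cassandra-compatible, with vendor-reported throughput of up to 10x.
QEMU/KVM: virtual disk I/O via io_uring (aio=io_uring); the quoted 40% speedup is workload-dependent.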
Ceph: io_uring in backend storage.
nginx experimental: io_uring plugin.

10. Reactor vs Proactor — two async philosophies

10.1 Reactor (notify-based)

"Tell me when ready, I'll read."
epoll, kqueue style.
User manages buffers.

10.2 Proactor (completion-based)

"Read here and tell me when done."
Windows IOCP, io_uring style.
Kernel writes directly to the buffer.

10.3 Why Proactor is faster

Reactor: "readable" -> read() syscall -> copy -> process. Proactor: kernel completes copy in background -> user processes directly.

One fewer syscall. At high traffic this is decisive.

Windows has had IOCP-based proactor since NT 3.5 in 1994. Linux was long reactor-only via epoll; in 2019 it joined the proactor camp with io_uring.

11. Zero-copy in the network stack

11.1 sendfile vs read+write

For file -> socket:

normal:
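  read(file)  : disk -> page cache -> user buffer (CPU copy 1)
  write(sock) : user buffer -> socket buffer -> NIC (CPU copy 2)

sendfile:
  disk -> page cache -> NIC (no user-space copy; with scatter-gather DMA the CPU copies nothing)

This is why nginx and Apache use sendfile for static files. As a rough rule of thumb it can be about 2x faster with far less CPU, depending on file size and hardware.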
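A minimal sketch of the sendfile path on Linux. The helper send_whole_file and the demo_sendfile usage function are illustrative names; the demo uses a temp file and a Unix socketpair so it runs without a network.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Sends the first count bytes of in_fd to out_fd without passing them
 * through a user-space buffer. Returns bytes sent, or -1 on error. */
ssize_t send_whole_file(int out_fd, int in_fd, size_t count)
{
    assert(out_fd >= 0 && in_fd >= 0);
    off_t offset = 0;                 /* explicit offset: file position is not used */
    size_t left = count;
    while (left > 0) {
        ssize_t n = sendfile(out_fd, in_fd, &offset, left);
        if (n > 0) {
            left -= (size_t)n;
            continue;
        }
        if (n == 0)
            break;                    /* reached end of file early */
        if (errno == EINTR)
            continue;
        return -1;
    }
    return (ssize_t)(count - left);
}

/* Usage sketch: pushes an 11-byte temp file into one end of a socketpair and
 * returns how many bytes arrived intact at the other end (-1 on failure). */
int demo_sendfile(void)
{
    char path[] = "/tmp/sendfile_demo_XXXXXX";
    const char msg[] = "hello world";
    char buf[32];
    int sv[2], result = -1;

    int in_fd = mkstemp(path);
    if (in_fd < 0)
        return -1;
    unlink(path);                     /* the file lives until in_fd is closed */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        close(in_fd);
        return -1;
    }

    size_t len = strlen(msg);
    if (write(in_fd, msg, len) == (ssize_t)len &&
        send_whole_file(sv[0], in_fd, len) == (ssize_t)len) {
        ssize_t r = read(sv[1], buf, sizeof buf);
        if (r == (ssize_t)len && memcmp(buf, msg, len) == 0)
            result = (int)r;
    }
    close(in_fd);
    close(sv[0]);
    close(sv[1]);
    return result;
}
```

The loop matters because sendfile may transfer fewer bytes than requested; a real server on a nonblocking socket would additionally handle EAGAIN by going back to its event loop.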

11.2 splice, tee, vmsplice

splice moves data between fds through a pipe, no user-space copy.

splice(fd_in, NULL, pipefd[1], NULL, size, SPLICE_F_MORE);
splice(pipefd[0], NULL, fd_out, NULL, size, SPLICE_F_MORE);

11.3 MSG_ZEROCOPY

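Linux 4.14+ send(fd, buf, size, MSG_ZEROCOPY), after enabling SO_ZEROCOPY on the socket with setsockopt:

The NIC DMAs directly from the pinned user buffer.
Must not modify the buffer until the kernel reports completion on the socket error queue (MSG_ERRQUEUE).
Pays off only on large sends (roughly 10KB and up); gains around 30% are cited, and figures vary by workload.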

11.4 io_uring + zero-copy

io_uring_prep_send_zc(sqe, fd, buf, size, 0, 0);

Same effect as MSG_ZEROCOPY inside io_uring (Linux 6.0+). Decisive for CDNs and video servers pushing large responses.

12. Observation and tuning — in practice

12.1 Connection limits

ulimit -n              # show the fd ceiling (often 1024 or 1M)
ulimit -n 1000000      # raise it for the current shell
/etc/security/limits.conf  # persistent setting

Also:

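sysctl fs.file-max                   # system-wide fd limit
sysctl net.core.somaxconn            # listen backlog ceiling
sysctl net.ipv4.ip_local_port_range  # usable ephemeral port range
sysctl net.ipv4.tcp_tw_reuse         # reuse of TIME_WAIT sockets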

12.2 TCP buffer sizes

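sysctl net.core.rmem_max        # max receive buffer
sysctl net.core.wmem_max        # max send buffer
sysctl net.ipv4.tcp_rmem        # autotuning range (min, default, max)

If the BDP (Bandwidth-Delay Product) exceeds the buffer size, the buffers need tuning. 10Gbps x 100ms = 125MB, so the 208KB default is far too small.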
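The arithmetic is bandwidth in bits per second, divided by 8, times the round-trip time in seconds. A small helper (the name bdp_bytes is made up here) makes it easy to plug in other links:

```c
#include <assert.h>
#include <stdint.h>

/* Bandwidth-delay product in bytes for a link of bits_per_sec
 * with a round-trip time of rtt_ms milliseconds. */
uint64_t bdp_bytes(uint64_t bits_per_sec, uint64_t rtt_ms)
{
    uint64_t bytes_per_sec = bits_per_sec / 8;
    /* guard against overflow in the multiplication below */
    assert(rtt_ms == 0 || bytes_per_sec <= UINT64_MAX / rtt_ms);
    return bytes_per_sec * rtt_ms / 1000;
}
```

bdp_bytes(10000000000, 100) gives 125000000 bytes, the 125MB from the text; a 1Gbps link with a 20ms RTT needs 2500000 bytes, still more than ten times the 208KB default.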

12.3 Observation tools

ss -antp: current connections (replaces netstat).
iftop, nload: real-time network.
tcpdump / wireshark: packet dump.
bpftrace: kernel-internal tracing.
perf trace: syscall profiling.

12.4 io_uring adoption strategy

Gradual migration: io_uring for hotspots, epoll elsewhere.
Kernel 5.15+ recommended: earlier versions were buggy.
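Security: seccomp can only block the io_uring syscalls wholesale; to allow a subset of opcodes, use per-ring restrictions (IORING_REGISTER_RESTRICTIONS).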
Libraries: liburing (Jens Axboe official), tokio-uring (Rust), io_uring-rs.

13. Closing — lessons from 30 years of I/O evolution

The journey from a single read() in the 1970s:

1983 select: one thread, many fds.
1986 poll: no limit, still O(n).
1994 Windows IOCP: first proactor.
2000 FreeBSD kqueue: unified events.
2002 Linux epoll: O(1) events.
2019 Linux io_uring: syscall-free I/O.

Each generation was born from the prior one's limits. Candidates for the next revolution:

DPDK / XDP: bypass the kernel network stack (DPDK) or process packets before it (XDP), 10Gbps+ line rate.
Userspace TCP: kernel bypass for microsecond latency.
RDMA: CPU-bypass memory access.
Smart NIC: offload I/O to the NIC.

No "final answer" — continuous change. But the core principles — "syscalls are expensive", "copies are expensive", "as the watched set grows, it must be O(1)" — are the same as 50 years ago.

Next time: the network stack itself, covering the TCP state machine, congestion control (CUBIC vs BBR), Nagle's algorithm, delayed ACK, TCP Fast Open, and why QUIC built a new stack on UDP.

References

Dan Kegel — "The C10K problem" (1999-2014).
Davide Libenzi — "Improving (network) I/O performance..." epoll proposal (2002).
Jens Axboe — "Efficient IO with io_uring" (2019).
Jens Axboe — "Ringing in a new asynchronous I/O API" (LWN.net, 2019).
Linux kernel source: fs/eventpoll.c, fs/io_uring.c, io_uring/.
liburing: https://github.com/axboe/liburing
"The Secret To 10 Million Concurrent Connections" — Robert Graham.
Cindy Sridharan — "The method to epoll's madness".
ScyllaDB Engineering Blog — Seastar / io_uring series.
"What Every Systems Programmer Should Know About Concurrency" (PDF) — Matt Kline.