Rust Tokio Runtime Deep Dive — Future, Waker, Work-Stealing, Pin, Zero-Cost Async (2025)
TL;DR

  • Rust async fn is compiled into a state machine struct at compile time. No runtime stack — a zero-cost abstraction.
  • The core method of the Future trait is poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Output>. It returns Ready when done, otherwise Pending after storing the waker.
  • A Waker is a callback meaning "poll this task again." When I/O arrives or a timer fires, the runtime calls the waker to reschedule the task.
  • Tokio is a three-layer stack: reactor (mio = epoll/kqueue/IOCP) + scheduler (work-stealing) + drivers (I/O, timer). #[tokio::main] wires them up.
  • Work-stealing scheduler: like Go, each worker has a local run queue. Tokio adds a LIFO slot (most recently spawned task) for better cache locality.
  • Pin makes self-referential structs safe. The state machines generated for async fn frequently reference their own fields.
  • Cooperative scheduling: Tokio yields only when you await. CPU-bound work must use tokio::task::yield_now() or spawn_blocking.
  • Function coloring: async and sync functions cannot directly call each other — Rust async's philosophical cost.

1. Why Rust async is peculiar

1.1 Two approaches

Stackful (goroutine style): each task has its own stack, can suspend/resume anywhere. Go, Erlang, Java virtual threads.

Stackless (state machine style): each task is a state machine struct; suspend points are decided at compile time; no runtime stack. Rust, C#, JS, Python asyncio.

Rust picked the latter because:

  1. It must run on embedded targets with no runtime.
  2. Zero-cost: pay only for what you use.
  3. No dynamic stack allocation means deterministic memory use.

1.2 Zero-cost abstraction

Rust async pushes this to the limit: an async fn compiles to a single struct; suspend points become enum variants; no heap allocation unless you Box the Future. With good inlining the result is close to hand-written event-loop C.

1.3 The cost

  • Function coloring — sync and async do not mix cleanly.
  • Pin complexity — because of self-referential state.
  • Runtime dependency — a Future does nothing unless polled; you need an executor like Tokio.
  • Hard to debug — stack traces point into state machine internals.

2. The Future trait and state machines

2.1 Future is simple

pub trait Future {
    type Output;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
}

pub enum Poll<T> {
    Ready(T),
    Pending,
}
  • A Future is "a computation that eventually yields a value."
  • poll() means "make as much progress as you can right now."
  • Ready(value): done. Pending: try me again after the waker fires.

2.2 async fn produces a Future

async fn hello() -> u32 { 42 }
// ≈
fn hello() -> impl Future<Output = u32> {
    async move { 42 }
}

The async move block becomes a compiler-generated state machine struct.

2.3 Anatomy of the state machine

async fn example() -> u32 {
    let x = read_value().await;   // suspend point 1
    let y = compute(x).await;     // suspend point 2
    x + y
}

Conceptually it becomes:

// Pseudocode — `impl Trait` is not legal in enum fields; the real machine
// embeds the concrete sub-future types.
enum ExampleState {
    Start,
    Awaiting1 { fut: impl Future<Output = u32> },
    Awaiting2 { x: u32, fut: impl Future<Output = u32> },
    Done,
}

Observations:

  1. Suspend points become enum variants.
  2. poll() makes as much progress as possible per call.
  3. .await desugars to match poll() { Pending => return Pending, Ready(v) => v }.
  4. Locals that must survive across an .await live as struct fields.
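To make the desugaring concrete, here is a hand-written, std-only approximation of such a state machine, driven with a hand-rolled no-op waker. The enum layout and the canned values 40/2 are illustrative, not what rustc actually emits:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Hand-written stand-in for a machine with one suspend point;
// `x` survives across the suspension as an enum payload.
enum Example {
    Start,
    Suspended { x: u32 }, // parked at the .await
    Done,
}

impl Future for Example {
    type Output = u32;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        let this = self.get_mut(); // fine: Example is Unpin
        match *this {
            Example::Start => {
                let x = 40; // pretend sub-future 1 was immediately Ready(40)
                *this = Example::Suspended { x };
                cx.waker().wake_by_ref(); // ask the executor to poll again
                Poll::Pending
            }
            Example::Suspended { x } => {
                *this = Example::Done;
                Poll::Ready(x + 2) // pretend sub-future 2 yielded 2
            }
            Example::Done => panic!("polled after completion"),
        }
    }
}

// Minimal no-op waker, built straight from the RawWaker vtable.
fn noop_waker() -> Waker {
    fn clone(p: *const ()) -> RawWaker {
        RawWaker::new(p, &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

fn run_example() -> u32 {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut fut = Example::Start;
    let mut fut = Pin::new(&mut fut); // Example is Unpin, so Pin::new works
    loop {
        // first poll returns Pending, second returns Ready(42)
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
    }
}

fn main() {
    assert_eq!(run_example(), 42);
    println!("example resolved to 42");
}
```

The loop in run_example is what an executor amounts to at its core; a real one would park between polls instead of spinning.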

2.4 State machine size

The struct's size is roughly the size of the largest set of locals live across any single suspend point.

async fn big() {
    let buf = [0u8; 1024 * 1024];  // 1 MB stack array!
    do_io().await;
    println!("{}", buf[0]);
}

This Future is over 1 MB — the infamous "async fn size explosion" trap. Fixes: Box::pin the Future so the megabyte lives on the heap instead of inflating every enclosing future, or keep large buffers behind Vec/Box so only a pointer crosses the await.
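The trap can be observed directly with std::mem::size_of_val — no runtime needed, since futures are inert until polled. In this sketch, std::future::ready(()) stands in for do_io():

```rust
use std::mem::size_of_val;

// The 1 MB array is live across the await, so it becomes a field
// of the generated future.
async fn big_stack() -> u8 {
    let buf = [0u8; 1024 * 1024];
    std::future::ready(()).await; // stand-in for do_io().await
    buf[0]
}

// Only the 24-byte Vec header crosses the await; the megabyte is on the heap.
async fn big_heap() -> u8 {
    let buf = vec![0u8; 1024 * 1024];
    std::future::ready(()).await;
    buf[0]
}

fn main() {
    let f1 = big_stack();
    let f2 = big_heap();
    assert!(size_of_val(&f1) >= 1024 * 1024); // the future itself is ~1 MB
    assert!(size_of_val(&f2) < 256);          // a few dozen bytes
    println!(
        "stack version: {} bytes, heap version: {} bytes",
        size_of_val(&f1),
        size_of_val(&f2)
    );
}
```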


3. Waker — how resumption works

3.1 Why Waker

If poll returns Pending, when should the runtime poll again? Busy-looping wastes CPU; periodic polling does not scale. Solution: the Future says "I am ready now." That signal is the Waker.

3.2 Mechanism

pub struct Waker { waker: RawWaker }
impl Waker {
    pub fn wake(self);
    pub fn wake_by_ref(&self);
    pub fn clone(&self) -> Waker;
}

Each poll() receives a Context from which you extract the waker:

fn poll(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Self::Output> {
    if self.io_ready() {
        Poll::Ready(value)
    } else {
        self.register_io(cx.waker().clone());
        Poll::Pending
    }
}

When I/O is ready the driver calls waker.wake() and the runtime reschedules the task.
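The whole poll → Pending → wake cycle fits in a few dozen lines of std-only Rust. The sketch below replaces Tokio's run queues with thread park/unpark and fakes the I/O event with a sleeping helper thread; the names ThreadWaker, block_on, and TimerOnce are ours, not Tokio's:

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread;
use std::time::Duration;

// A waker that unparks the executor thread. Tokio's wakers instead push
// the task back onto a run queue, but the contract is identical.
struct ThreadWaker(thread::Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

// Toy single-future executor: poll, and if Pending, sleep until a wake().
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(v) => return v,
            Poll::Pending => thread::park(), // blocked until wake() unparks us
        }
    }
}

// A future whose "I/O" completes on a helper thread, which then calls wake():
// Pending -> store waker -> event -> wake -> re-poll.
struct TimerOnce {
    started: bool,
    done: Arc<AtomicBool>,
}

impl Future for TimerOnce {
    type Output = u32;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        let this = self.get_mut(); // TimerOnce is Unpin
        if this.done.load(Ordering::Acquire) {
            return Poll::Ready(42);
        }
        if !this.started {
            this.started = true;
            let done = this.done.clone();
            let waker = cx.waker().clone(); // the Future stores the waker
            thread::spawn(move || {
                thread::sleep(Duration::from_millis(10)); // pretend I/O latency
                done.store(true, Ordering::Release);
                waker.wake(); // "event fired: poll me again"
            });
        }
        Poll::Pending
    }
}

fn main() {
    let fut = TimerOnce { started: false, done: Arc::new(AtomicBool::new(false)) };
    assert_eq!(block_on(fut), 42);
    println!("woken and completed");
}
```

Everything the article describes is here in miniature: the first poll registers the waker and returns Pending, the "driver" (helper thread) fires the waker, and the executor re-polls to completion.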

3.3 RawWaker vtable

pub struct RawWakerVTable {
    clone: unsafe fn(*const ()) -> RawWaker,
    wake: unsafe fn(*const ()),
    wake_by_ref: unsafe fn(*const ()),
    drop: unsafe fn(*const ()),
}

Dynamic dispatch keeps the Future independent of the runtime. Tokio, async-std, smol, embedded runtimes all ship their own implementation.

3.4 Cost

clone() is typically Arc::clone; wake() enqueues the task (or no-ops if already queued). Tokio skips re-registering when the waker already corresponds to a task on the current worker.


4. Pin and self-referential structs

4.1 The problem

async fn example() {
    let data = vec![1, 2, 3];
    let ptr = &data[0];
    some_future().await;
    println!("{}", *ptr);
}

The generated state machine holds data and a pointer into it. If the struct moves in memory, the pointer dangles — use-after-free.

4.2 Pin's role

Pin<P> is a type-level promise that the pointee never moves again. Future::poll takes Pin<&mut Self> so the runtime cannot move the Future after polling begins — making self-references safe.

4.3 Creating Pins

let fut = Box::pin(my_async_fn());    // heap-pinned
let fut = std::pin::pin!(my_async_fn()); // stack-pinned (1.68+)
tokio::spawn(fut); // internally Box::pin

4.4 Unpin

Most types are Unpin (auto trait) — Pin is a no-op. State machines from async fn are !Unpin because of potential self-references, which is why spawning requires pinning.
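A quick std-only probe makes the distinction tangible (the commented-out line genuinely fails to compile):

```rust
// Compile-time probe: calling this only type-checks for Unpin types.
fn assert_unpin<T: Unpin>(_: &T) {}

fn main() {
    assert_unpin(&42u32);         // ordinary types are Unpin automatically
    assert_unpin(&String::new()); // so is almost everything written by hand

    let fut = async { 1u32 };
    // assert_unpin(&fut);        // rejected: compiler-generated futures are !Unpin

    let boxed = Box::pin(fut);    // Pin<Box<F>>: the future can no longer move,
    assert_unpin(&boxed);         // and the *handle* itself is Unpin again
    println!("unpin checks passed");
}
```

This is why awaiting a pinned box works in ordinary code: Pin<Box<F>> is itself Unpin, so it composes like any other value while the underlying state machine stays put.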

4.5 Limits

Pin is widely criticized for complexity, poor teaching resources, and viral Pin<&mut Self> APIs. pin-project and the pin! macro help but the core is unavoidable.


5. Tokio runtime architecture

5.1 Three layers

User async code  (async fn, .await)
Scheduler        (tasks, workers, queues, park/unpark)
Driver           (I/O = mio, time = hashed wheel)
OS               (epoll / kqueue / IOCP)

5.2 #[tokio::main]

Expands to:

fn main() {
    tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()
        .unwrap()
        .block_on(async { do_something().await })
}

5.3 Flavors

Multi-threaded (default): N workers, work-stealing, tasks require Send + 'static.

Current-thread: one worker, no Send bound — ideal for CLIs, embedded, tests.


6. Tokio scheduler internals

6.1 Worker and task

struct Worker {
    local: LocalQueue<Task>,     // ring buffer, cap 256
    lifo_slot: Option<Task>,     // most recent spawn
    remote: Arc<Mutex<RemoteQueue<Task>>>,
    park: Parker,
}

6.2 LIFO slot — cache locality

The idea: a freshly spawned task is most likely to hit the L1 cache of the spawning worker. Great for request/response fan-out. But it is unfair — Tokio periodically bypasses the LIFO slot in favour of the FIFO queue.

6.3 Pick order

loop {
    // 1. LIFO slot
    // 2. Local queue
    // 3. Global queue (every N ticks)
    // 4. Poll the I/O driver
    // 5. Steal from another worker
    // 6. Park
}

6.4 Work-stealing

Steals half of the victim's local queue, runs the first, pushes the rest locally. Same "steal half" policy as the Go runtime.

6.5 Task lifecycle

Spawn → Poll → Pending (register waker) → Wake (event) → Poll again → Ready (result surfaces through JoinHandle).

6.6 JoinHandle

A Future wrapping the task's result. Task and handle share an Arc; memory is freed only when both drop.


7. I/O driver — mio and epoll

7.1 mio

Platform abstraction: epoll on Linux, kqueue on BSD/macOS, IOCP on Windows.

7.2 Non-blocking I/O

Every fd is non-blocking. A TCP connect pseudo-sequence:

  1. socket(), fcntl(O_NONBLOCK)
  2. connect() returns EINPROGRESS
  3. register WRITE readiness in epoll
  4. poll() returns Pending
  5. epoll_wait → writable → call waker
  6. poll completes

7.3 PollEvented

fn poll_read(self: Pin<&mut Self>, cx: &mut Context, buf: &mut [u8])
    -> Poll<io::Result<usize>>
{
    match self.io.read(buf) {
        Ok(n) => Poll::Ready(Ok(n)),
        Err(e) if e.kind() == WouldBlock => {
            self.register_readiness(cx, Readiness::READ);
            Poll::Pending
        }
        Err(e) => Poll::Ready(Err(e)),
    }
}

7.4 When does the driver run?

A worker with no runnable tasks calls epoll_wait itself — no dedicated I/O thread. Idle workers naturally cover I/O polling, which is good for cache and power.


8. Timer driver — hashed wheel

Tokio uses a hierarchical hashed timer wheel (like Netty and the Linux kernel):

Level 0: 64 slots x 1 ms
Level 1: 64 slots x 64 ms
Level 2: 64 slots x 4 s
Level 3: 64 slots x 4 m
Level 4: 64 slots x 4 h

Insert: O(1). Expire: O(1) amortized. Millions of timers remain cheap. Each revolution "cascades" to a finer level.
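The O(1) insert is just shift-and-mask arithmetic. The sketch below illustrates the indexing idea for 64-slot levels and a 1 ms tick; it is not Tokio's actual code (the real wheel has its own offsets and cascading rules):

```rust
const SLOT_BITS: u32 = 6; // 64 slots per level

// Level l has slots 64^l ticks wide; a deadline `ticks` away belongs on
// the level whose slots are just coarse enough to contain it.
fn level(ticks: u64) -> u32 {
    let significant = 63 - ticks.max(1).leading_zeros(); // bit length - 1
    significant / SLOT_BITS
}

// Within its level, the slot index is the deadline shifted down to that
// level's granularity, masked to 64 slots.
fn slot(ticks: u64) -> u64 {
    (ticks >> (level(ticks) * SLOT_BITS)) & 63
}

fn main() {
    assert_eq!(level(1), 0);     // 1 ms out  -> finest level (1 ms slots)
    assert_eq!(level(64), 1);    // 64 ms out -> level 1 (64 ms slots)
    assert_eq!(level(5_000), 2); // ~5 s out  -> level 2 (~4 s slots)
    assert_eq!(slot(130), 2);    // 130 ms    -> level 1, slot 130 / 64 = 2
    println!("insert is a shift and a mask");
}
```

No searching, no per-timer heap operations: inserting or expiring a timer touches one slot computed in constant time.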


9. spawn vs spawn_blocking vs yield_now

9.1 tokio::spawn

Schedules an async task; needs Send + 'static on a multi-thread runtime; hogs a worker if it never awaits.

9.2 CPU-bound trap

tokio::spawn(async {
    for i in 0..1_000_000_000 { hash(i); }
});

No .await means no yielding. Throughput drops to (N-1)/N per stuck worker. Fixes:

// Manual yielding
for i in 0..1_000_000_000 {
    hash(i);
    if i % 1000 == 0 { tokio::task::yield_now().await; }
}

// Or isolate on blocking pool
tokio::task::spawn_blocking(|| expensive_computation()).await.unwrap();

spawn_blocking uses a separate thread pool (default cap 512).

9.3 Rules of thumb

  • Short CPU work (< 100 µs): inline.
  • Long CPU work: spawn_blocking.
  • Blocking I/O (file, sync DB drivers): spawn_blocking.
  • Chunked parallel CPU: rayon + spawn_blocking or yield_now.

10. Function coloring

10.1 Symptom

fn sync_fn() {}
async fn async_fn() {}

async fn caller1() { sync_fn(); async_fn().await; } // OK

fn caller2() {
    async_fn();        // returns a Future, never runs
    async_fn().await;  // compile error
}

Sync must enter a runtime to drive async:

fn caller3() {
    // Handle::current() panics outside a runtime; build one instead
    let rt = tokio::runtime::Runtime::new().unwrap();
    rt.block_on(async_fn());
}

10.2 Bob Nystrom's take

From "What Color is Your Function?" (2015): JS, C#, Python, Rust all share this divide. Sync cannot easily call async; libraries pick a color and the ecosystem splits.

10.3 Why Go has no coloring

Every Go function is implicitly async. The runtime is always there; blocking calls are handled by the scheduler. Cost: Go binaries ship with a ~2 MB runtime, unsuitable for embedded.

10.4 Mitigations

Ship sync wrappers (reqwest::blocking), use runtime adapters, or keep sync and async sharply separated and bridge at the boundary.

10.5 block_on danger

tokio::spawn(async {
    let rt = tokio::runtime::Handle::current();
    rt.block_on(another_async()); // panics: cannot block a runtime thread
});

Handle::block_on refuses to run inside an async context and panics; schemes that sneak past the check risk deadlock, because the blocked worker may be the very one that must poll another_async. Use tokio::task::block_in_place to hand the worker's other tasks to the rest of the pool while the current thread blocks.


11. Channels

11.1 mpsc

let (tx, mut rx) = tokio::sync::mpsc::channel::<String>(32);
tokio::spawn(async move {
    for i in 0..10 { tx.send(format!("msg {}", i)).await.unwrap(); }
});
while let Some(msg) = rx.recv().await { println!("{}", msg); }

Bounded, multi-producer, single-consumer. send().await suspends the sender while the buffer is full — backpressure for free.

11.2 oneshot

let (tx, rx) = tokio::sync::oneshot::channel::<u32>();
tokio::spawn(async move { tx.send(42).unwrap(); });
let value = rx.await.unwrap();

Single-value, ideal for request/response.

11.3 broadcast

Fan-out: every subscribed receiver sees every value. Slow receivers may lag.

11.4 watch

Latest-value pattern for configuration updates — older values may be dropped.

11.5 select!

tokio::select! {
    val = rx.recv() => { /* ... */ }
    _ = tokio::time::sleep(Duration::from_secs(1)) => { /* timeout */ }
    _ = shutdown.recv() => { /* stop */ }
}

Runs the first branch to become Ready; drops the rest.

11.6 Cancellation safety

Dropped branches lose intermediate state. AsyncReadExt::read_exact is not cancel-safe (partially read bytes vanish). mpsc::Receiver::recv is (the message stays queued).


12. Performance tuning

12.1 tokio-console

tokio = { version = "1", features = ["full", "tracing"] }
console-subscriber = "0.4"
fn main() { console_subscriber::init(); /* ... */ }

Shows every task, wake frequency, avg poll time, blocking tasks.

12.2 tracing

#[tracing::instrument]
async fn handle_request(req: Request) -> Response { /* ... */ }

Structured logging designed for async — spans survive across .await.

12.3 Flamegraph

cargo flamegraph --bin myapp

12.4 Knobs

Builder::new_multi_thread()
    .worker_threads(8)
    .max_blocking_threads(1024)
    .build()

Also box large Futures and prefer join_all to nested select! when waiting on many Futures.

12.5 Common anti-patterns

  • std::sync::Mutex held across .await — use tokio::sync::Mutex.
  • Long CPU work inside tokio::spawn — offload via spawn_blocking or rayon.
  • Hand-rolled Arc<Mutex<Vec<Task>>> — use tokio::task::JoinSet.

13. Tokio vs Go

Aspect           Rust Tokio                   Go
Model            Stackless state machine      Stackful goroutine
Stack            None (struct fields)         2 KB dynamic
Size overhead    Future size (KB to MB)       2 KB
Scheduler        Work-stealing + LIFO slot    Work-stealing + runnext
Preemption       Cooperative only             Async (SIGURG, 1.14+)
CPU-bound        Can stall a worker           Auto-preempted
Runtime          Tens of KB                   ~2 MB
Embedded         Yes (no_std runtimes)        Practically no
Learning curve   Steep (Pin, lifetimes)       Gentle

Go trades size for simplicity; Rust trades simplicity for zero-cost abstraction and embeddability.


14. Ecosystem

Runtimes: Tokio (de facto), async-std, smol, embassy (no_std), glommio (io_uring per-core).

Libraries: hyper/axum, reqwest, sqlx, sea-orm, tonic, tracing/opentelemetry, tower.

async fn in traits: historically required #[async_trait] with boxed Futures. Rust 1.75 stabilized native async fn in traits — though dyn Trait support is still evolving.


15. 2025 updates

  • Native async fn in traits since 1.75 (no dyn yet without workarounds).
  • Return Type Notation (RTN) lets you bound the Future returned by a trait method.
  • Borrow checker (Polonius, NLL) continues to cut false positives in async code.
  • Rust 2024 edition reserves the gen keyword for generator blocks (still unstable).
  • Momentum around io_uring-based per-core runtimes like glommio.

16. Production scenarios

16.1 Slowdown after N requests

  • Connection leaks — check TIME_WAIT/CLOSE_WAIT.
  • Hot mutex — tokio-console shows many tasks queued on lock().await.
  • Worker saturation — top -H shows pinned worker threads; offload CPU work.

16.2 spawn_blocking pool exhausted

Default cap is 512. Either raise max_blocking_threads, replace a blocking library with an async one, or front-load the work into a bounded queue.

16.3 "20% CPU but no throughput"

Likely I/O waiting. Use strace / perf trace. Lots of futex time = lock contention.


17. Learning roadmap

  • Async Book, Tokio tutorial, Jon Gjengset "Crust of Rust" videos.
  • "Rust for Rustaceans", "Zero to Production in Rust", "Asynchronous Programming in Rust".
  • Tokio source under tokio/src/runtime/scheduler/multi_thread/.

18. Cheat sheet

Future::poll(Pin<&mut Self>, &mut Context) -> Poll<Output>
async fn       -> state machine struct
.await         -> match poll { Pending => return; .. }
Waker          -> "poll me again" callback
Pin            -> "will not move" type guarantee
Tokio          -> worker + LIFO + local + global queues
                  + I/O driver polling + work-stealing
Spawning       -> spawn / spawn_blocking / block_in_place
Channels       -> mpsc, oneshot, broadcast, watch
Avoid          -> std Mutex across await, long CPU work,
                  non cancel-safe branches in select!
Tools          -> tokio-console, tracing, cargo flamegraph

19. Quiz

Q1. What does async fn become at compile time?

A compiler-generated state machine struct. Each .await becomes an enum variant; locals that cross suspend points become fields. It implements Future::poll as a big match. No runtime stack — zero-cost.

Q2. What is a Waker for?

A callback meaning "poll this task again." When poll returns Pending, the Future stores the current context's waker and registers it with an I/O driver, timer or channel. Later the event fires waker.wake() and the runtime re-queues the task.

Q3. Why Pin?

Because async state machines can be self-referential (a field that borrows another field must stay valid across suspension). Moving such a struct would dangle the internal pointer. Pin<&mut T> promises the value never moves again, making self-references sound.

Q4. CPU-bound loop inside tokio::spawn?

It monopolises one worker because Tokio is cooperative. Throughput drops to (N-1)/N while the loop runs. Fix with periodic tokio::task::yield_now().await or by moving the work to spawn_blocking.

Q5. Goroutine vs Future at the core?

Go is stackful (2 KB dynamic stack, suspend anywhere, auto preemption). Rust is stackless (state machine struct, suspend only at .await, cooperative). Rust pays with function coloring and Pin; Go pays with a mandatory runtime.

Q6. What is select! cancellation safety?

select! drops the non-winning Futures. If a dropped Future held intermediate state (bytes already read, for example) it is lost. Cancel-safe APIs survive the drop; mpsc::Receiver::recv is cancel-safe, AsyncReadExt::read_exact is not.

Q7. How did Go avoid function coloring?

By making every function implicitly async: the runtime is always there, any function can become a goroutine, blocking I/O is rewired by the scheduler. The cost is a ~2 MB mandatory runtime — no embedded or kernel targets. Rust chose the opposite trade-off.


If you liked this, see also:

  • "Go Runtime and GMP Scheduler Deep Dive" — the stackful counterpart.
  • "Python GIL and CPython internals" — another alternative.
  • "eBPF Deep Dive" — the kernel programming revolution.
  • "io_uring and Async I/O models" — the future of Tokio.
