Skip to content

필사 모드: The Evolution of the Linux CPU Scheduler — From CFS to EEVDF

English
0%
정확도 0%
💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.
원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

Introduction

"CPU utilization is 60 percent, so why does response time spike?" — anyone who has run production systems has faced this question. The answer usually lies in the scheduler. Even with CPU to spare, if your thread spends a long time waiting on the run queue (scheduling delay), the user experiences slowness. Throttling caused by Kubernetes CPU limits, noisy neighbors, and context switch storms are the usual suspects that barely show up on utilization graphs.

In this post we start from the essence of the problem a CPU scheduler must solve, walk through the principles and limits of CFS — the default for over 15 years — and then EEVDF, which became the default scheduler in kernel 6.6. In the second half we cover the connection between cgroup CPU control and Kubernetes, isolation for low-latency workloads, and a step-by-step real-world scenario improving API server tail latency.

The Essence of Scheduling — Three Rabbits to Chase

A CPU scheduler fundamentally negotiates a trade-off among three goals.

Fairness

"everyone gets CPU proportional to weight"

/\

/ \

/ \

/ ?? \

/ \

/----------\

Latency Throughput

"a woken task "minimize context switches,

runs soon" keep caches warm"

- Fairness: tasks of equal priority should receive equal CPU time; different nice values should yield proportional shares.

- Latency: a task that wakes from I/O (a keystroke, a network reply) should get the CPU quickly for good perceived responsiveness.

- Throughput: switching tasks too often burns context switch cost and pollutes caches, reducing total work done.

Reducing latency demands frequent preemption; raising throughput demands long uninterrupted runs. The two collide head-on. The transition from CFS to EEVDF is the story of resolving this collision by keeping fairness while making latency requirements a first-class citizen.

How CFS Works — vruntime and the Red-Black Tree

CFS (Completely Fair Scheduler) had been the default scheduler since kernel 2.6.23 in 2007. Its core idea is a simulation of an "ideal perfectly fair CPU." With N tasks, each should ideally receive 1/N of the CPU simultaneously; since a real CPU runs one task at a time, CFS approximates the ideal by tracking how much CPU time each task has received and always running the one that has received the least.

That "time received so far" is vruntime (virtual runtime).

Run queue red-black tree (sorted by vruntime)

[ T3: 220ms ]

/ \

[ T1: 180ms ] [ T5: 300ms ]

/ \ \

[ T7: 150ms ] [ T2: 200ms ] [ T4: 350ms ]

^

|

Leftmost node = minimum vruntime = next to run

(cached O(1); insert/delete are O(log N))

The operating rules:

1. Runnable tasks are sorted in a red-black tree keyed by vruntime.

2. The scheduler always picks the leftmost task (smallest vruntime).

3. While a task runs its vruntime grows; once large enough, it drifts to the right of the tree and others get their turn.

4. A task that slept and woke up tends to have small vruntime (it received less) and so runs soon. To prevent monopolizing the tree with an ancient value, its vruntime is clamped near the current minimum.

Nice values (priority) are reflected in the rate at which vruntime grows.

vruntime increment = actual runtime x (base weight 1024 / task weight)

nice 0 -> weight 1024 (real time accrues unchanged)

nice -5 -> weight 3121 (vruntime grows at ~1/3 speed -> 3x CPU)

nice 5 -> weight 335 (vruntime grows ~3x faster -> 1/3 CPU)

The weight table is designed so one nice step changes

CPU share by roughly a factor of 1.25.

A high-priority task accrues vruntime slowly, stays on the left of the tree longer, and thus receives more CPU. An elegant design.

The Limits of CFS — Fair but Latency-Blind

CFS is fair in the long run, but offered weak guarantees about when a task runs in the short term.

First, there was no way to express a latency requirement. If an audio processing thread and a batch compile job both run at nice 0, CFS treats them identically. The audio thread does not need lots of CPU; it needs CPU often and on time. Nice adjusts quantity, not timing.

Second, preemption decisions relied on heuristics: wakeup granularity, minimum slice lengths, and other tuning knobs whose good values differed per workload, fueling endless "magic constant" debates.

The latency-nice patch series tried to bolt latency priorities onto CFS for years, but layering it over the CFS structure never came out clean. Eventually the community chose to replace the algorithm itself.

EEVDF — The New Default Scheduler in Kernel 6.6

EEVDF (Earliest Eligible Virtual Deadline First) is an algorithm proposed in a 1995 academic paper, implemented in the kernel by Peter Zijlstra, and since 6.6 (October 2023) it replaces the internal algorithm of the fair class previously provided by CFS. From the outside it is still SCHED_OTHER/SCHED_NORMAL; the internal pick logic changed.

There are two core concepts.

First, eligibility. The scheduler tracks each task share received versus deserved; a task that has already received more than its share (negative lag) is temporarily not eligible to be picked. This guards fairness.

Second, the virtual deadline. Among eligible tasks, the one with the earliest virtual deadline is picked. The virtual deadline is roughly "now plus the task time slice scaled by its weight."

EEVDF pick logic:

1) eligibility filter: only tasks with lag >= 0 are candidates

(tasks that ran beyond their share are excluded -> fairness)

2) deadline pick: among candidates, run the task whose

virtual deadline is earliest

task requesting a short slice -> nearer deadline

-> picked soon and often (low latency)

task requesting a long slice -> farther deadline

-> picked less often, runs longer

(high throughput)

=> total CPU received stays weight-proportional, while each task

can choose "small and frequent" vs "large and rare" slices

This brings an important advance. Because slice length now carries meaning, there is a foundation for expressing a latency need — "give me the same total CPU but run me more often" — through the slice request of sched_setattr (since kernel 6.12 a slice hint can be set via the sched_runtime field). Another win: several CFS-era heuristic knobs were removed and replaced by choices with algorithmic grounding.

Comparing CFS and EEVDF:

| Aspect | CFS | EEVDF |

| --- | --- | --- |

| Pick criterion | minimum vruntime | earliest virtual deadline among eligible |

| Fairness | via vruntime convergence | via lag tracking (more rigorous) |

| Latency expression | none (nice adjusts quantity only) | expressible via slice length |

| Preemption decision | wakeup granularity heuristics | decided by deadline comparison |

| Tuning knobs | many (magic constant debates) | greatly reduced |

| Theoretical basis | intuitive fair queueing | latency bound proofs in the 1995 paper |

The operational takeaways: woken latency-sensitive tasks run more predictably and sooner, and upgrading the kernel across 6.6 subtly changes scheduler behavior — run performance regression tests.

Inspect scheduler features of the running kernel

uname -r # 6.6 or later means EEVDF

cat /sys/kernel/debug/sched/features | tr ' ' '\n' | head

Per-task scheduling statistics

cat /proc/1234/sched | head -20 # vruntime, slice, wait time, ...

The Scheduling Class Hierarchy — fair Is One of Several

EEVDF/CFS is actually only part of the picture. The Linux scheduler is a hierarchy of classes with strict priority: if a higher class has a runnable task, lower classes get no CPU at all.

Higher priority

+------------------+

| stop | kernel-internal (CPU hotplug etc.)

+------------------+

| deadline (DL) | SCHED_DEADLINE: runtime/deadline/period guarantees

+------------------+

| realtime (RT) | SCHED_FIFO / SCHED_RR: static priorities 1-99

+------------------+

| fair (EEVDF) | SCHED_OTHER / SCHED_BATCH: all normal tasks

+------------------+

| idle | SCHED_IDLE: only when nothing else runs

+------------------+

Lower priority

Inspect and change class/priority of a task

chrt -p 1234 # current policy

chrt -f -p 50 1234 # SCHED_FIFO priority 50

chrt -d --sched-runtime 5000000 --sched-deadline 10000000 \

--sched-period 10000000 -p 0 1234 # deadline class

Safety valve preventing RT tasks from monopolizing the CPU

sysctl kernel.sched_rt_runtime_us # default 950000 (95% per second)

Practical caution: a runaway SCHED_FIFO thread at priority 99 can freeze the whole system. When using RT, always verify the sched_rt_runtime_us safety valve and, if possible, run a watchdog.

cgroup CPU Control — The Kubernetes Connection

In the container era, scheduling control centers on cgroups, not individual tasks. The cgroup v2 CPU controller provides two axes:

| Interface | Meaning | Kubernetes mapping |

| --- | --- | --- |

| cpu.weight (1-10000, default 100) | relative share under contention | derived from requests.cpu |

| cpu.max (quota period) | absolute ceiling per period | derived from limits.cpu |

| cpu.stat | usage and throttle statistics | the key to throttling diagnosis |

cpu.weight is a ratio that matters only under contention; cpu.max is an absolute ceiling enforced even when the CPU is otherwise idle. This difference is operationally decisive.

Hands-on cgroup v2 CPU control

mkdir /sys/fs/cgroup/demo

echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control

Weight 200 (2x versus a default-100 sibling under contention)

echo 200 > /sys/fs/cgroup/demo/cpu.weight

50ms per 100ms period = ceiling of 0.5 CPU

echo "50000 100000" > /sys/fs/cgroup/demo/cpu.max

Move the current shell into this cgroup and load-test

echo $$ > /sys/fs/cgroup/demo/cgroup.procs

Kubernetes CPU limits are implemented as the cpu.max quota (the mechanism formerly known as CFS bandwidth control, which behaves identically under EEVDF). The problem is throttling. A multithreaded application running 8 threads concurrently can burn through the quota of a 100ms period within tens of milliseconds and then everything freezes for the rest of the period. This is the classic pattern of low average CPU utilization with spiky p99 latency.

Throttling diagnosis: cpu.stat of the container cgroup

cat /sys/fs/cgroup/kubepods.slice/.../cpu.stat

nr_periods -- elapsed enforcement periods

nr_throttled -- periods in which the quota ran out

throttled_usec -- cumulative stalled time

even a 1% nr_throttled / nr_periods ratio affects tail latency

There are three response options: raise or remove limits and rely on weight (requests) based protection; align application worker counts with the quota (for example GOMAXPROCS in Go, ActiveProcessorCount in Java); or grant dedicated cores to latency-sensitive pods via Guaranteed QoS plus the static CPU manager.

CPU Affinity and Isolation — The Low-Latency Recipe

Workloads where microseconds matter (trading, real-time media, packet processing) do not hope the scheduler behaves; they carve out cores entirely.

1) Boot parameters isolating CPUs 2-5 from normal scheduling

(append to GRUB_CMDLINE_LINUX)

isolcpus=2-5 nohz_full=2-5 rcu_nocbs=2-5

isolcpus: excluded from scheduler load balancing

nohz_full: stop the timer tick when one task runs on the CPU

rcu_nocbs: offload RCU callbacks to other CPUs

2) Keep IRQs away from the isolated cores

echo 3 > /proc/irq/default_smp_affinity # CPUs 0,1 only (mask 0b0011)

3) Pin the workload onto isolated cores

taskset -c 2-5 ./latency-critical-app

or change at runtime

taskset -cp 2-5 1234

4) Verify the tick has stopped on isolated cores

cat /proc/interrupts | grep -i "local timer"

On recent kernels you can use isolated cpuset partitions instead of boot parameters.

Dynamic isolated partition via cgroup v2 cpuset

mkdir /sys/fs/cgroup/rt-part

echo "+cpuset" > /sys/fs/cgroup/cgroup.subtree_control

echo 2-5 > /sys/fs/cgroup/rt-part/cpuset.cpus

echo isolated > /sys/fs/cgroup/rt-part/cpuset.cpus.partition

Isolation is not free. Isolated cores are unavailable to normal workloads, lowering overall utilization, and a bad configuration leaves isolated cores idle while the rest overload. It is a tool to use only when justified by measurement.

Observing Context Switches and Run Queue Latency

Measuring scheduling problems revolves around two metrics: context switch frequency and the time spent waiting on the run queue (scheduling delay).

System-wide context switch rate

vmstat 1 # the cs column

Voluntary/involuntary switches per process

pidstat -w -p 1234 1

voluntary: yielded on its own, e.g. I/O wait (normal pattern)

nonvoluntary: slice exhausted/preempted (contention signal)

Precise recording/analysis with perf sched

perf sched record -- sleep 10

perf sched latency --sort max # max scheduling delay per task

perf sched timehist # timeline: wait time, sch delay columns

Run queue latency histogram with eBPF (BCC runqlat)

runqlat 10 1

or a bpftrace one-liner

bpftrace -e 'tracepoint:sched:sched_wakeup { @qt[args->pid] = nsecs; }

tracepoint:sched:sched_switch /@qt[args->next_pid]/

{ @lat = hist(nsecs - @qt[args->next_pid]); delete(@qt[args->next_pid]); }'

Interpretation in brief: if run queue latency p99 exceeds several milliseconds while CPU utilization is modest, suspect cgroup throttling or core imbalance; if utilization is also high, it is genuine CPU shortage — add capacity or move workloads. An explosion of involuntary switches means many threads fighting over the same cores; review affinity and worker counts.

NUMA Balancing — Scheduling That Considers Memory

On multi-socket servers, the scheduler decides not only which CPU but also near which memory a task should run. Remote NUMA node memory access is tens of percent slower than local access.

Inspect NUMA topology

numactl --hardware

lscpu | grep -i numa

Automatic NUMA balancing (moves tasks/memory closer based on page faults)

sysctl kernel.numa_balancing

Per-task NUMA statistics

cat /proc/1234/numa_maps | head

numastat -p 1234 # memory distribution per node

Explicit placement (large-memory workloads like databases)

numactl --cpunodebind=0 --membind=0 ./db-server

Automatic NUMA balancing benefits most workloads, but its page-fault-based sampling has a cost; large-memory databases often disable it and place explicitly. Again, measure first.

Real-World Scenario — Improving API Server Tail Latency

Situation: a Go API server on Kubernetes. Mean response 5ms, but p99 spikes to 150ms. Node CPU utilization is 55 percent. We diagnose step by step.

Step 1, check throttling first. Low utilization with latency spikes makes it suspect number one.

Find the pod cgroup path and inspect cpu.stat

kubectl exec api-pod -- cat /sys/fs/cgroup/cpu.stat

assume nr_throttled was 8% of nr_periods -> prime suspect confirmed

Step 2, interpret the cause. The limit was 2 CPUs, but the Go runtime had GOMAXPROCS set to 16, the node core count. Sixteen workers running concurrently exhaust the 200ms quota (100ms period x 2) almost instantly, then everything stalls until the period ends.

Step 3, apply remedies one at a time and measure.

Remedy A: align GOMAXPROCS with the quota (Go 1.25+ is container-aware

automatically; earlier versions use the automaxprocs library or env)

kubectl set env deployment/api GOMAXPROCS=2

Remedy B: if throttling remains, consider raising or removing limits

(keep requests -> cpu.weight protection stays intact)

Remedy C: if node-level contention is suspected, check neighbor

impact with runqlat

runqlat -p $(pgrep api-server) 10 1

Step 4, record the results. In this pattern, remedy A alone often drives nr_throttled to near zero and stabilizes p99. Any remaining tail may come from GC or networking; narrow the next segment with the same approach.

Key lesson: limit-based throttling is a problem that is invisible in averages and visible only in tails, and nr_throttled in cpu.stat is the most honest indicator.

One Step Further — PSI and sched_ext

Finally, two modern tools for scheduling observation and experimentation.

First, PSI (Pressure Stall Information). Measuring run queue latency directly with eBPF is precise, but PSI gives you "the share of time tasks were runnable yet waiting for CPU," aggregated continuously by the kernel and exposed as files. It is also available per cgroup, making it a highly practical summary metric of per-pod CPU pressure.

System-wide CPU pressure

cat /proc/pressure/cpu

some avg10=4.21 avg60=2.10 ... meaning that over the last 10

seconds, some tasks spent 4.21% of the time waiting for CPU

Per-cgroup (pod) CPU pressure

cat /sys/fs/cgroup/kubepods.slice/cpu.pressure

Memory/IO pressure share the same format

(useful to disambiguate multi-resource bottlenecks)

cat /proc/pressure/memory /proc/pressure/io

The combination "CPU utilization is low but the some line of cpu.pressure is high" is a strong signal of throttling or core imbalance. Alerting on pressure rather than utilization correlates far better with tail latency.

Second, sched_ext. Merged in kernel 6.12, the extensible scheduler class (SCHED_EXT) is a framework that lets you write the scheduling policy itself as an eBPF program and load/replace it at runtime.

| Aspect | fair (EEVDF) | sched_ext |

| --- | --- | --- |

| Who decides policy | built-in kernel algorithm | a BPF program you write |

| How to swap | kernel build/reboot | runtime load/unload |

| Safety on failure | not applicable | watchdog reverts to the default scheduler |

| Best suited for | general-purpose default | gaming/latency niches, experiments, bespoke workloads |

You can experiment with policies specialized for game workloads or tuned to the cache locality of a specific service without rebuilding the kernel, and thanks to the BPF verifier and the watchdog, failures do not take the system down. You will probably not change the default on general-purpose servers any time soon, but it is worth knowing that "fit the scheduler to the workload" is now an available option.

Pitfalls and Anti-patterns

- Trying to solve latency problems with nice values. Nice distributes CPU quantity; it does not guarantee response timing. For latency, look at EEVDF slices, RT classes, or isolation.

- Putting CPU limits on every pod as a "safety net." Throttling kicks in even without contention and hurts tail latency. Understand the distinct roles of weight (requests) protection and limits ceilings.

- Letting in-container runtimes see host core counts (GOMAXPROCS, Java thread pools). Mismatch between quota and worker count is the most frequent throttling cause.

- Using SCHED_FIFO without a watchdog. A runaway makes the system hang.

- Over-allocating isolcpus. Isolated cores cannot be used by normal tasks even when idle.

- Passing a major kernel upgrade (especially across 6.6) without scheduler performance testing.

- Watching only average CPU utilization. Without run queue latency and throttle counters, tail problems stay invisible.

Operations Checklist

- [ ] Do you know your kernel version and scheduler (EEVDF on 6.6+)?

- [ ] Are nr_throttled/throttled_usec from cpu.stat collected as pod metrics?

- [ ] Are tools ready to measure run queue latency (runqlat/perf sched)?

- [ ] Do container runtime worker counts match the CPU quota?

- [ ] Are the criteria for which pods get limits documented?

- [ ] When using RT classes, did you verify the sched_rt_runtime_us safety valve?

- [ ] Is low-latency isolation (dedicated cores, IRQ affinity) justified by measurement?

- [ ] Are you aware of NUMA topology and automatic balancing settings?

- [ ] Is there alerting on involuntary context switch explosions?

- [ ] One scheduler-related change at a time, recorded with before/after p99?

Closing

CFS endured 15 years on a simple and powerful principle — fairness for all — and EEVDF brought the dimension of latency, the question of when a task runs, into the algorithm while preserving that fairness. For operators, the scheduler is not a black box. Looking through the three windows of cpu.stat, runqlat, and perf sched, the answer to the mystery of "CPU to spare, yet slow" is usually right there. In the next post we dissect the isolation technologies that pair with the scheduler: cgroups and namespaces, the true substance of containers.

References

- CFS design document: https://docs.kernel.org/scheduler/sched-design-CFS.html

- EEVDF scheduler documentation: https://docs.kernel.org/scheduler/sched-eevdf.html

- LWN: An EEVDF CPU scheduler for Linux: https://lwn.net/Articles/925371/

- CFS bandwidth control (the mechanism behind cpu.max): https://docs.kernel.org/scheduler/sched-bwc.html

- SCHED_DEADLINE documentation: https://docs.kernel.org/scheduler/sched-deadline.html

- cgroup v2 official documentation: https://docs.kernel.org/admin-guide/cgroup-v2.html

- sched(7) man page — scheduling policies overview: https://man7.org/linux/man-pages/man7/sched.7.html

- NO_HZ (tickless) documentation: https://docs.kernel.org/timers/no_hz.html

- perf official wiki: https://perf.wiki.kernel.org/

- BCC tools collection (runqlat and friends): https://github.com/iovisor/bcc

- Kubernetes container resource management: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

- LWN: Completing the EEVDF scheduler: https://lwn.net/Articles/969062/

현재 단락 (1/196)

"CPU utilization is 60 percent, so why does response time spike?" — anyone who has run production sy...

작성 글자: 0원문 글자: 18,024작성 단락: 0/196