Hacking the Linux Memory Hierarchy: swap, zram, and the Wild Idea of Swapping to VRAM

Introduction — Swap to Your Spare VRAM?

A project called nbd-vram recently hit the Hacker News front page and kicked off quite a debate. The idea is simple and provocative. Modern GPUs ship with 16GB or 24GB of VRAM, and when you are not gaming or running LLM inference, most of it sits idle. So why not expose that fast memory as an NBD (Network Block Device) and use it as a Linux swap device? You get a swap tier far faster than SSD, essentially for free.

The comment section split in two. The excited camp ran the numbers — "PCIe 4.0 x16 is 32GB per second in theory, that is genuinely 5x faster swap than NVMe" — while the cynics countered with "just buy more RAM with that money." But the real value of this debate lies elsewhere. To judge whether this contrarian idea is reasonable, you need to properly understand what swap is, how Linux tiers its memory, and where zram and zswap fit in.

Using nbd-vram as the hook, this post covers Linux memory management fundamentals, swap tuning, compressed memory (zram/zswap), OOM handling, cgroup v2 memory control, and the container-environment traps — all in one pass.

Linux Memory Management Basics — Starting with the Swap Misconception

Page Cache and Anonymous Pages

The memory pages Linux manages come in two broad kinds.

    ┌──────────────────────── Physical RAM ─────────────────────────┐
    │                                                               │
    │  File-backed pages (page cache)    Anonymous pages            │
    │  ─ executables, libraries          ─ heap (malloc), stacks    │
    │  ─ cached file contents            ─ shared memory            │
    │  → original exists on disk         → no original on disk      │
    │  → reclaim: just drop them         → reclaim: must write to   │
    │    (writeback if dirty)              swap first               │
    └───────────────────────────────────────────────────────────────┘

The key insight: file-backed pages have an original copy on disk, so under memory pressure the kernel can simply drop them (writing dirty ones back first). Anonymous pages have no original — to evict them, their contents must be stored somewhere. That somewhere is swap.

The Myth That "Swap Makes Things Slow"

Contrary to popular belief, a system without swap is not faster. Without swap, the kernel has no way to reclaim anonymous pages under pressure, so it keeps dropping file-backed pages instead, including your code and libraries. The system then repeatedly re-reads executable code it just discarded, which is the infamous thrashing. Sending cold anonymous pages (for example, heap data allocated during startup and never touched again) to swap, and keeping hot page cache in their place, is often a net win for overall performance.

In short, swap is not "an emergency exit for when memory runs out" but "a mechanism that widens the kernel's reclaim options." Kernel developer Chris Down's essay "In defence of swap" explains this perspective beautifully.

swappiness — A Balance Between What and What

vm.swappiness is a value from 0 to 200 (0 to 100 before kernel 5.8, which is why older docs give that range) that sets the balance between reclaiming anonymous pages and reclaiming file pages. It is not "how eagerly to start swapping."

    # Check the current value
    cat /proc/sys/vm/swappiness

    # Change temporarily
    sudo sysctl vm.swappiness=100

    # Make it permanent
    echo 'vm.swappiness = 100' | sudo tee /etc/sysctl.d/99-swap.conf
    sudo sysctl --system

The intuition for the values:

| Value | Meaning | Suitable for |
| --- | --- | --- |
| 0-10 | strongly avoid reclaiming anonymous pages | very slow HDD swap, dedicated DB servers |
| 60 | default, prefers reclaiming file pages | general servers |
| 100 | treat anonymous and file pages equally | fast swap like zram/zswap |
| 101-200 | prefer reclaiming anonymous pages | very fast swap + heavy page cache demand |

With swap close to RAM speed — as with zram — raising swappiness to 100 or beyond is standard practice, because the cost model has changed.
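
As a rough illustration of that cost model, the sketch below prints the starting anonymous-to-file weighting that a given swappiness value implies. It assumes the weighting is simply swappiness : (200 - swappiness), in line with the kernel documentation's description of the value as a relative IO cost. The helper name swappiness_split is made up for this example, and the real reclaim code also adjusts for recently observed reclaim cost, which this ignores.

```shell
# Sketch: starting anon/file weighting implied by vm.swappiness.
# Assumption: weights are swappiness : (200 - swappiness).
swappiness_split() {
    s=$1
    if [ "$s" -lt 0 ] || [ "$s" -gt 200 ]; then
        echo "swappiness must be between 0 and 200" >&2
        return 1
    fi
    echo "anon=$s file=$((200 - s))"
}

# Print the split for the values discussed in the table.
for value in 10 60 100 180; do
    printf 'swappiness=%-3s -> %s\n' "$value" "$(swappiness_split "$value")"
done
```

At 100 the two weights are equal, which is the sense in which that value treats anonymous and file pages the same.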

zram vs zswap — Sorting Out the Compressed Memory Tiers

Both technologies "extend RAM through compression," but their architectures differ. They are easy to confuse, so here is a picture.

    [zram]                          [zswap]

    anonymous page reclaim          anonymous page reclaim
            │                               │
            ▼                               ▼
    zram block device               zswap compressed pool (RAM cache)
    (= compressed swap in RAM)              │ when full / pages go cold
            │                               ▼
            ▼                       real swap device
    done. (no disk behind it*)      (writeback to SSD/HDD)

    * a backing device can be configured via the writeback option

| Aspect | zram | zswap |
| --- | --- | --- |
| Form | compressed RAM block device | compression cache in front of swap |
| Needs disk swap | no (standalone) | yes (real swap behind it) |
| Cold pages | stay resident in RAM (compressed) | evicted to disk (LRU) |
| Incompressible pages | stored uncompressed in RAM | bypass to disk |
| Best fit | machines without disk swap, laptops | servers that already have swap disks |
| Notable users | ChromeOS, Android, Fedora default | some server distributions |

zram in Practice

The zram-generator approach that Fedora adopted by default is the cleanest.

    # /etc/systemd/zram-generator.conf
    [zram0]
    # Half of RAM, capped at 8GiB
    zram-size = min(ram / 2, 8192)
    compression-algorithm = zstd
    swap-priority = 100

For manual setup:

    # Load the module and configure the device
    sudo modprobe zram num_devices=1
    echo zstd | sudo tee /sys/block/zram0/comp_algorithm
    echo 8G | sudo tee /sys/block/zram0/disksize
    sudo mkswap /dev/zram0
    sudo swapon -p 100 /dev/zram0

    # Verify: note the compression ratio (DATA vs COMPR)
    zramctl
    # NAME       ALGORITHM DISKSIZE DATA COMPR TOTAL STREAMS MOUNTPOINT
    # /dev/zram0 zstd            8G 1.2G  310M  330M       8 [SWAP]

The two main algorithm choices are lzo-rle (fast, lower ratio) and zstd (slightly slower, better ratio). On a recent CPU, zstd usually wins. Compression ratios of 2x to 4x are typical depending on the workload.
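
zramctl is the easy way to read the ratio, but the same numbers are exposed in /sys/block/zram0/mm_stat, whose first two fields are the original and compressed data sizes in bytes. A minimal sketch of the calculation follows; zram_ratio is a hypothetical helper, and the sample line is made up to roughly match the zramctl figures quoted earlier.

```shell
# zram_ratio: read one mm_stat-style line on stdin and print the compression ratio.
# Field 1 = original (uncompressed) bytes, field 2 = compressed bytes.
zram_ratio() {
    awk '{
        if ($2 == 0) { print "no data stored yet"; exit }
        printf "stored=%.0f MiB compressed=%.0f MiB ratio=%.2fx\n", $1 / 1048576, $2 / 1048576, $1 / $2
    }'
}

# Sample values (illustrative): about 1.2 GiB stored in about 310 MiB.
echo "1288490188 325058560 346030080 0 346030080 0 0 0" | zram_ratio
```

On a live system the input would be `zram_ratio < /sys/block/zram0/mm_stat`. Comparing the ratio across algorithms on your own workload is more informative than any generic figure.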

zswap in Practice

zswap is built into the kernel — you just turn it on. But it only makes sense with a real swap device behind it.

    # Enable at boot: add to the kernel command line
    zswap.enabled=1 zswap.compressor=zstd zswap.max_pool_percent=20

    # Toggle at runtime
    echo 1 | sudo tee /sys/module/zswap/parameters/enabled
    echo zstd | sudo tee /sys/module/zswap/parameters/compressor
    echo 20 | sudo tee /sys/module/zswap/parameters/max_pool_percent

    # Check statistics (debugfs)
    sudo grep -r . /sys/kernel/debug/zswap/

The decision rule is simple. Machines where you cannot or do not want a disk swap (laptops worried about flash wear, embedded devices) take zram; servers that already have SSD swap and want to cut swap IO take zswap. Running both at once means compressing twice — an anti-pattern to avoid.

NBD — How Network Block Devices Work

To understand nbd-vram you need NBD. NBD is a simple protocol that "forwards block device read/write requests to a server across a socket."

    ┌─────────── client ───────────────┐      ┌────── server process ─────┐
    │ app → /dev/nbd0 (block device)   │      │ receives requests and     │
    │         │                        │      │ applies them to its       │
    │  nbd kernel module               │ sock │ "storage"                 │
    │         └──── read/write ────────┼─────▶│                           │
    │                                  │      │ the storage can be        │
    │                                  │      │ anything: a file, RAM,    │
    │                                  │      │ a remote disk,            │
    │                                  │      │ ...or GPU VRAM            │
    └──────────────────────────────────┘      └───────────────────────────┘

The crucial point: the client neither knows nor cares where the server stores the data. nbd-vram implements this server with CUDA/OpenCL, turning block writes into cudaMemcpy calls into a VRAM buffer — a "VRAM disk." Connecting over a Unix socket on the same host minimizes the networking overhead too.

    # Conceptual flow (per the nbd-vram README)

    # 1. Start the nbd server backed by VRAM (e.g., allocate 8GB)
    ./nbd-vram-server --size 8G --socket /run/nbd-vram.sock &

    # 2. Connect the client → /dev/nbd0 appears
    sudo nbd-client -unix /run/nbd-vram.sock /dev/nbd0

    # 3. Use it as swap
    sudo mkswap /dev/nbd0
    sudo swapon -p 50 /dev/nbd0   # higher priority than the SSD swap

The Performance Arithmetic of VRAM Swap — PCIe Bandwidth Math

Whether this idea is reasonable can be judged with numbers. Compare rough bandwidth and latency across the memory tiers:

| Tier | Bandwidth (approx.) | Latency (approx.) |
| --- | --- | --- |
| DDR5 RAM | 50-80 GB/s | 80-100 ns |
| GPU VRAM (via PCIe 4.0 x16) | 32 GB/s theoretical, 20-25 GB/s effective | microseconds |
| NVMe SSD (PCIe 4.0 x4) | 5-7 GB/s | tens of microseconds |
| SATA SSD | 0.55 GB/s | hundreds of microseconds |
| zram (zstd, in RAM) | CPU-bound, several GB/s | compress/decompress cost |

The arithmetic is clear. VRAM access over PCIe 4.0 x16 is 3-5x faster than NVMe swap in bandwidth (20-25 GB/s effective against 5-7 GB/s) and roughly an order of magnitude lower in latency. With PCIe 5.0 x16 (64 GB/s theoretical) the gap widens further. What you must not forget, though, is the overhead of the NBD path: page fault → block layer → nbd kernel module → userspace server → CUDA copy. Every hop involves context switches and copies, and swap traffic is mostly 4 KiB pages, so per-request overhead rather than link bandwidth sets the real speed and you never reach the theoretical figure. Even so, the conclusion "faster swap than NVMe" largely holds.
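
The bandwidth figures can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch, not a measurement: pcie_gbps and page_transfer_ns are hypothetical helper names, the link calculation assumes 128b/130b encoding and ignores packet overhead, and the per-page number covers only the raw transfer, not any of the software hops.

```shell
# Theoretical PCIe bandwidth in GB/s.
# $1 = transfer rate per lane in GT/s (16 for PCIe 4.0, 32 for PCIe 5.0), $2 = lane count.
# Assumes 128b/130b encoding and ignores packet overhead.
pcie_gbps() {
    awk -v gt="$1" -v lanes="$2" 'BEGIN { printf "%.1f\n", gt * lanes * 128 / 130 / 8 }'
}

# Nanoseconds needed to move one 4 KiB page at a given bandwidth.
# $1 = bandwidth in GB/s; 1 GB/s equals 1 byte per nanosecond.
page_transfer_ns() {
    awk -v gbps="$1" 'BEGIN { printf "%.0f\n", 4096 / gbps }'
}

echo "PCIe 4.0 x16 theoretical: $(pcie_gbps 16 16) GB/s"
for bw in 32 25 7 0.55; do
    echo "4 KiB page at $bw GB/s: $(page_transfer_ns "$bw") ns"
done
```

Even at NVMe bandwidth the raw transfer of a single page takes well under a microsecond, which suggests that for swap the software path and device latency matter more than the headline GB/s.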

When It Makes Sense, and When It Does Not

It makes sense when:

- Your desktop has a gaming GPU whose VRAM sits idle during weekday work hours

- You need a temporary large-memory job on a machine where adding RAM is physically impossible (slot limits, laptops)

- You want to experiment and learn — it is a superb teaching aid that covers the block layer, NBD, and GPU memory all at once

It should be avoided when:

- You run real GPU workloads (LLM inference, gaming) in parallel — VRAM contention plus the risk of losing the swap device

- It is a production server — a GPU driver or process failure becomes a swap device failure, and when anonymous pages in swap are lost, the kernel has no choice but to kill the affected processes

- It is a laptop that suspends frequently — VRAM contents are not guaranteed to survive suspend

In short, the HN consensus is sound: "fun, educational, practical in narrow situations — but if you can add RAM, buy RAM."

OOM — the Kernel's Last Resort and Userspace Helpers

The Problem with the Kernel OOM Killer

When memory truly runs out, the kernel OOM killer terminates the process with the highest oom_score. The problem is that it fires far too late. The kernel OOM killer only acts once reclaim has failed to free enough memory for an allocation, and in the minutes leading up to that, the system is usually frozen in thrashing.

    # Check per-process OOM scores
    cat /proc/1234/oom_score

    # Protect an important process from OOM (-1000 means exempt)
    echo -500 | sudo tee /proc/1234/oom_score_adj

earlyoom and systemd-oomd

This is why userspace daemons emerged to intervene "earlier and smarter."

    # earlyoom: preemptive kill when free memory/swap drops below thresholds
    sudo apt install earlyoom

    # Fire below 10% memory and 5% swap; prefer killing certain processes
    sudo tee /etc/default/earlyoom <<'EOF'
    EARLYOOM_ARGS="-m 10 -s 5 --avoid '(^|/)(sshd|systemd)$' --prefer '(^|/)(chrome|java)$'"
    EOF

    sudo systemctl enable --now earlyoom

systemd-oomd goes one step further and watches PSI (Pressure Stall Information). Instead of "what percentage of memory is left," it acts on "how long processes are stalling because of memory," and it kills whole cgroups at a time.

    # Read PSI directly. some: at least one task stalled, full: all non-idle tasks stalled
    cat /proc/pressure/memory
    # some avg10=0.00 avg60=0.12 avg300=0.05 total=8123456
    # full avg10=0.00 avg60=0.03 avg300=0.01 total=2345678

    # systemd-oomd status
    oomctl
    systemctl status systemd-oomd
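
To act on PSI from a script, the avgN fields have to be pulled out of that text. The sketch below shows one way to do it with awk; psi_field and psi_exceeds are hypothetical helpers, and the 10 percent threshold is an arbitrary example value, not a recommendation from any of the tools above.

```shell
# psi_field KIND FIELD: print one value from PSI text read on stdin.
# KIND is "some" or "full"; FIELD is avg10, avg60, avg300 or total.
psi_field() {
    awk -v kind="$1" -v key="$2" '$1 == kind {
        for (i = 2; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == key) print kv[2]
        }
    }'
}

# psi_exceeds VALUE LIMIT: succeed when VALUE is greater than LIMIT (decimal compare).
psi_exceeds() {
    awk -v v="$1" -v limit="$2" 'BEGIN { exit !(v > limit) }'
}

# Sample text in the /proc/pressure/memory format.
sample='some avg10=0.00 avg60=0.12 avg300=0.05 total=8123456
full avg10=0.00 avg60=0.03 avg300=0.01 total=2345678'

full60=$(printf '%s\n' "$sample" | psi_field full avg60)
if psi_exceeds "$full60" 10; then
    echo "memory pressure is high (full avg60=$full60)"
else
    echo "memory pressure is fine (full avg60=$full60)"
fi
```

On a real machine the input would be `psi_field full avg60 < /proc/pressure/memory`.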

On a desktop, I strongly recommend enabling either earlyoom or systemd-oomd. It prevents the "one browser tab takes the whole system hostage" scenario.

cgroup v2 — Memory Control per Process Group

A more precise tool than system-wide tuning is the cgroup v2 memory controller. There are four key knobs:

| Knob | Meaning | Behavior |
| --- | --- | --- |
| memory.min | hard protection (usage below this is never reclaimed) | other cgroups are reclaimed first; OOM if nothing else can be freed |
| memory.low | best-effort protection | reclaimed below this only when no unprotected memory is left elsewhere |
| memory.high | soft ceiling | above it: throttle + aggressive reclaim (no kill) |
| memory.max | hard ceiling | above it: reclaim, then OOM kill inside the cgroup |

memory.high is especially useful. On reaching the limit it slows allocations and forces reclaim instead of killing, which fits the "this batch job eats too much memory but I do not want it dead" situation perfectly.

    # Cleanest: apply to a systemd service
    sudo systemctl set-property batch-job.service MemoryHigh=4G MemoryMax=6G

    # Run a one-off command inside a constrained cgroup
    systemd-run --scope -p MemoryHigh=2G -p MemoryMax=3G ./heavy-script.sh

    # Per-cgroup memory usage
    systemd-cgtop -m
    cat /sys/fs/cgroup/system.slice/batch-job.service/memory.current

Swap can be controlled per cgroup too. Setting memory.swap.max to 0 bans swap for just that group, enabling policies like "no swap for the DB, swap allowed for batch jobs."

Memory Traps in Container Environments

In containers, the concepts above get subtly twisted. Three traps people hit constantly.

**Trap 1 — free inside a container shows host numbers.** Run free -h inside a container and you see the host's total memory, not the cgroup limit. The classic failure: the application concludes "plenty of memory," grows its caches, and gets OOM-killed. Container-aware runtimes like the JVM and .NET read cgroup limits, but every script that parses free directly is wrong.

    # How to see the real limits inside a container
    cat /sys/fs/cgroup/memory.max       # hard limit
    cat /sys/fs/cgroup/memory.current   # current usage
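
A script that wants the limit as a number has to handle one wrinkle: an unlimited cgroup reports the literal string max instead of a byte count. A minimal sketch, assuming the cgroup v2 layout shown above; cgroup_limit_mib is a hypothetical helper, and the demo runs against a temporary file so it works outside a container too.

```shell
# cgroup_limit_mib [FILE]: print a memory.max-style limit in MiB, or "unlimited".
# FILE defaults to the cgroup v2 path seen from inside a container.
cgroup_limit_mib() {
    file=${1:-/sys/fs/cgroup/memory.max}
    value=$(cat "$file") || return 1
    if [ "$value" = "max" ]; then
        echo "unlimited"
    else
        echo "$((value / 1048576)) MiB"
    fi
}

# Demo against a temporary file so the sketch runs anywhere.
demo=$(mktemp)
echo 536870912 > "$demo"
cgroup_limit_mib "$demo"
echo max > "$demo"
cgroup_limit_mib "$demo"
rm -f "$demo"
```

Inside a container, calling cgroup_limit_mib with no argument reads the real limit, which is the number a cache-sizing script should use instead of the output of free.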

**Trap 2 — page cache counts against container memory.** If a container reads a lot of files, the page cache shows up in memory.current. This is usually the answer to "my app only uses 1GB, why am I near the limit?" The cache is reclaimed under pressure so it is usually harmless, but if you do not know how much reclaimable cache sits below the memory.max OOM line, you will misread your monitoring graphs. Look at working-set metrics (container_memory_working_set_bytes in Kubernetes).
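
The working-set figure is commonly derived as total usage minus inactive file cache. The sketch below reproduces that subtraction from the raw cgroup v2 files. It assumes this is how container_memory_working_set_bytes is computed (memory.current minus the inactive_file line of memory.stat); cgroup_working_set is a hypothetical helper name, and the demo numbers are invented.

```shell
# cgroup_working_set [DIR]: print memory.current minus inactive_file, in bytes.
# DIR is a cgroup v2 directory containing memory.current and memory.stat.
cgroup_working_set() {
    dir=${1:-/sys/fs/cgroup}
    current=$(cat "$dir/memory.current") || return 1
    inactive=$(awk '$1 == "inactive_file" { print $2 }' "$dir/memory.stat") || return 1
    ws=$((current - ${inactive:-0}))
    if [ "$ws" -lt 0 ]; then ws=0; fi   # never report a negative working set
    echo "$ws"
}

# Demo with invented numbers: 1.5 GiB in use, 384 MiB of it inactive file cache.
demo_dir=$(mktemp -d)
echo 1610612736 > "$demo_dir/memory.current"
printf 'anon 1073741824\nfile 536870912\ninactive_file 402653184\nactive_file 134217728\n' \
    > "$demo_dir/memory.stat"
echo "working set: $(cgroup_working_set "$demo_dir") bytes"
rm -rf "$demo_dir"
```

The gap between memory.current and this value is the cache the kernel can most easily give back before the cgroup gets anywhere near an OOM kill.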

**Trap 3 — Kubernetes kept swap off (for a long time).** Kubernetes has disabled node swap by default, and only since 1.28 has NodeSwap entered beta. Real-world leftovers persist: kubelet refusing to start because it detects swap after you enabled zram on the host, and confusion because requests/limits accounting does not consider swap. If you want compressed memory on container nodes, review the kubelet settings (failSwapOn=false, swapBehavior) and consider whether you need an approach that does not look like swap at all.

Observability — Reading /proc/meminfo and smem

Measurement precedes tuning. At minimum you should be able to read these:

    $ grep -E 'MemTotal|MemAvailable|Buffers|^Cached|SwapTotal|SwapFree|Dirty|AnonPages' /proc/meminfo
    MemTotal:       32768000 kB
    MemAvailable:   18234560 kB   # ← look at this, not free
    Buffers:          512000 kB
    Cached:          9876000 kB   # page cache
    SwapTotal:       8388604 kB
    SwapFree:        7340032 kB
    Dirty:             12340 kB   # dirty pages awaiting writeback
    AnonPages:      10240000 kB   # total anonymous pages

A small MemFree does not mean memory shortage. Page cache is reclaimable at any time, so the real signal is MemAvailable — "how much can be handed to a new workload right now."
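
For alerting, MemAvailable is easier to reason about as a share of MemTotal. A small sketch of that calculation; mem_available_pct is a hypothetical helper, the MemTotal and MemAvailable values are the ones from the listing above, and the MemFree line is invented to show that it plays no part in the result.

```shell
# mem_available_pct: read /proc/meminfo-format text on stdin and
# print MemAvailable as a percentage of MemTotal.
mem_available_pct() {
    awk '$1 == "MemTotal:" { total = $2 }
         $1 == "MemAvailable:" { avail = $2 }
         END {
             if (total <= 0) exit 1
             printf "%.1f\n", 100 * avail / total
         }'
}

# Sample input; on a real system use: mem_available_pct < /proc/meminfo
mem_available_pct <<'EOF'
MemTotal:       32768000 kB
MemFree:         1024000 kB
MemAvailable:   18234560 kB
EOF
```

With these numbers more than half of RAM is still available even though MemFree is only about 3 percent of the total, which is exactly the misreading the paragraph above warns about.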

For per-process memory, avoid the RSS trap (shared libraries double-counted) and look at PSS instead.

    # smem: top processes by PSS (proportionally attributed usage)
    sudo smem -rs pss | head
    #  PID User Command            Swap     USS     PSS     RSS
    # 2143 app  java -Xmx4g ...       0 3145728 3167234 3210240
    # 1021 app  postgres: writer      0  102400  145000  512000

    # Who is using swap, per process
    sudo smem -rs swap | head

    # System-wide swap in/out trend (si/so columns)
    vmstat 1 5

If vmstat shows persistently nonzero si/so, suspect swap thrashing, then confirm the actual latency impact with PSI (/proc/pressure/memory). That order is the standard playbook.

Benchmark Methodology — If You Want to Validate VRAM Swap

Here is the methodology for evaluating experimental setups like nbd-vram. The core principle: swap performance is measured by latency distribution, not throughput.

    # 1. Baseline: raw device performance (fio, 4K random, mimics swap IO patterns)
    #    Destructive: run it only while /dev/nbd0 is not active as swap.
    #    iodepth only takes effect with an asynchronous engine such as libaio.
    sudo fio --name=swaptest --filename=/dev/nbd0 --rw=randrw --bs=4k \
        --ioengine=libaio --iodepth=32 --runtime=60 --time_based --direct=1 --group_reporting

    # 2. Real swap pressure: force a working set bigger than RAM
    stress-ng --vm 4 --vm-bytes 120% --vm-method flip --metrics-brief -t 120

    # 3. Observe: record swap in/out and PSI during the load
    vmstat 1 > vmstat.log &
    while true; do cat /proc/pressure/memory >> psi.log; sleep 1; done &

    # 4. Perceived metric: latency of unrelated work during the load (e.g., shell response)
    time ls -R /usr/share > /dev/null

Keep at least three comparison groups: NVMe swap alone, zram alone, and the experimental setup (VRAM swap). Run the same stress-ng scenario in each, and compare p99 latency and PSI full time rather than averages, because that is what matches real-world feel. One more thing: to normalize cache state between runs, drop the page cache before measuring.

    # Drop caches before measurement (never use outside benchmarking)
    sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
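
Comparing p99 rather than averages needs a percentile over the collected samples. The helper below is a minimal nearest-rank sketch for a file with one latency value per line; percentile is a hypothetical name, the sample values are invented, and how you collect the samples (for example by timing step 4 repeatedly) is left to your setup.

```shell
# percentile P: read one number per line on stdin and print the P-th
# percentile (P is an integer from 1 to 100) using the nearest-rank method.
percentile() {
    sort -n | awk -v p="$1" '{ v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
            if (rank < 1) rank = 1
            print v[rank]
        }'
}

# Invented sample: shell response times in milliseconds under load.
samples='12
15
11
240
14
13
16
12
1900
15'

echo "p50: $(printf '%s\n' "$samples" | percentile 50) ms"
echo "p99: $(printf '%s\n' "$samples" | percentile 99) ms"
```

In this sample the median looks healthy while the p99 exposes the multi-second stall, which is the kind of difference an average would blur.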

Appendix — Quick Disk Swap Recipes

Before zram/zswap there is the fundamental: creating a swap file. It can be added instantly with no partition surgery, which makes it especially useful in emergencies.

    # Create a 4GB swap file (ext4/xfs)
    sudo fallocate -l 4G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile

    # Activate automatically at boot
    echo '/swapfile none swap defaults 0 0' | sudo tee -a /etc/fstab

    # On btrfs, use the dedicated command (handles CoW/compression attributes)
    sudo btrfs filesystem mkswapfile --size 4g /swap/swapfile

    # Check priorities: higher numbers are used first
    swapon --show
    cat /proc/swaps

With multiple swap devices, the kernel uses the highest-priority one first. Setting zram to 100 and disk swap to 10 creates a natural waterfall: "fast tier first, spill to disk."
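
To see the waterfall order at a glance, the priority column of /proc/swaps can be sorted in descending order. A small sketch; swap_order is a hypothetical helper and the sample table is invented, using the priorities mentioned in this post (zram 100, nbd-vram 50, disk 10).

```shell
# swap_order: read /proc/swaps-format text on stdin and print
# "priority device" pairs, highest priority (used first) at the top.
swap_order() {
    awk 'NR > 1 { print $5, $1 }' | sort -k1,1nr
}

# Invented sample in the /proc/swaps layout.
swap_order <<'EOF'
Filename     Type       Size     Used  Priority
/swapfile    file       4194300  0     10
/dev/zram0   partition  8388604  1024  100
/dev/nbd0    partition  8388604  0     50
EOF
```

On a live system, `swap_order < /proc/swaps` shows the order in which the kernel will fill the devices.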

Recommended Configurations at a Glance

By environment:

| Environment | Recommended setup |
| --- | --- |
| Desktop/laptop (16GB or less) | zram (half of RAM, zstd) + swappiness 100-180 + earlyoom |
| Desktop (32GB or more) | zram 4-8GB + systemd-oomd, small disk swap |
| General server (with SSD) | zswap + NVMe swap partition + per-service memory.high caps |
| DB server | keep small swap + memory.swap.max=0 for the DB cgroup only + low swappiness |
| Kubernetes node | default to no swap with precise requests/limits; adopt NodeSwap carefully |
| Workstation with an idle GPU | (for fun) nbd-vram swap, at lower priority than zram |

Closing Thoughts

nbd-vram is, in a sense, a toy. But like all good toys, playing with it teaches you serious things: that swap is not evil but a reclaim option; that compressed memory (zram/zswap) is already the default in mainstream distributions; that OOM should not be left to the kernel alone but handled preemptively with PSI; and that memory observability in the container era starts with reading cgroup files directly.

In 2026, AI workloads have made us more sensitive to the memory hierarchy than ever. In an era where the boundary between VRAM and RAM is blurring (with unified memory architectures spreading), a project that turns GPU memory into swap through nothing but the block device abstraction is also a fine showcase of the elegance of Linux abstractions. Give it a spin some weekend. Just not in production.

References

- nbd-vram (GitHub): https://github.com/c0dejedi/nbd-vram

- Hacker News discussion: https://news.ycombinator.com/

- Kernel docs — zram: https://docs.kernel.org/admin-guide/blockdev/zram.html

- Kernel docs — zswap: https://docs.kernel.org/admin-guide/mm/zswap.html

- Kernel docs — cgroup v2: https://docs.kernel.org/admin-guide/cgroup-v2.html

- Kernel docs — sysctl/vm (swappiness etc.): https://docs.kernel.org/admin-guide/sysctl/vm.html

- Chris Down — In defence of swap: https://chrisdown.name/2018/01/02/in-defence-of-swap.html

- PSI (Pressure Stall Information) docs: https://docs.kernel.org/accounting/psi.html

- earlyoom (GitHub): https://github.com/rfjakob/earlyoom

- systemd-oomd manual: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html

- Kubernetes — swap memory management blog: https://kubernetes.io/blog/2023/08/24/swap-linux-beta/

- GeekNews: https://news.hada.io/
