Hacking the Linux Memory Hierarchy — swap, zram, and the Wild Idea of Swapping to VRAM

Introduction — Swap to Your Spare VRAM?

A project called nbd-vram recently hit the Hacker News front page and kicked off quite a debate. The idea is simple and provocative. Modern GPUs ship with 16GB or 24GB of VRAM, and when you are not gaming or running LLM inference, most of it sits idle. So why not expose that fast memory as an NBD (Network Block Device) and use it as a Linux swap device? You get a swap tier far faster than SSD, essentially for free.

The comment section split in two. The excited camp ran the numbers — "PCIe 4.0 x16 is 32GB per second in theory, that is genuinely 5x faster swap than NVMe" — while the cynics countered with "just buy more RAM with that money." But the real value of this debate lies elsewhere. To judge whether this contrarian idea is reasonable, you need to properly understand what swap is, how Linux tiers its memory, and where zram and zswap fit in.

Using nbd-vram as the hook, this post covers Linux memory management fundamentals, swap tuning, compressed memory (zram/zswap), OOM handling, cgroup v2 memory control, and the container-environment traps — all in one pass.

Linux Memory Management Basics — Starting with the Swap Misconception

Page Cache and Anonymous Pages

The memory pages Linux manages come in two broad kinds.

┌──────────────────────── Physical RAM ─────────────────────────┐
│                                                               │
│  File-backed pages (page cache)    Anonymous pages            │
│  ─ executables, libraries          ─ heap (malloc), stacks    │
│  ─ cached file contents            ─ shared memory            │
│  → original exists on disk         → no original on disk      │
│  → reclaim: just drop them         → reclaim: must write to   │
│    (writeback if dirty)              swap first               │
└───────────────────────────────────────────────────────────────┘

The key insight: file-backed pages have an original copy on disk, so under memory pressure the kernel can simply drop them (writing dirty ones back first). Anonymous pages have no original — to evict them, their contents must be stored somewhere. That somewhere is swap.

The Myth That "Swap Makes Things Slow"

Contrary to popular belief, a system without swap is not automatically faster. Without swap, the kernel has no way to reclaim anonymous pages under pressure, so it keeps dropping file-backed pages instead, including your code and libraries. The system then repeatedly re-reads executable code it just discarded, which is the infamous thrashing. Sending cold anonymous pages (heap data that was allocated during startup and never touched again, for example) to swap, and keeping hot page cache in their place, is often a net win for overall performance.

In short, swap is not "an emergency exit for when memory runs out" but "a mechanism that widens the kernel's reclaim options." Kernel developer Chris Down's essay "In defence of swap" explains this perspective beautifully.

swappiness — A Balance Between What and What

vm.swappiness is a value from 0 to 200 (the ceiling was 100 before kernel 5.8, which is why older docs say 0 to 100) that sets the balance between reclaiming anonymous pages and reclaiming file pages. It is not "how eagerly to start swapping."

# Check the current value
cat /proc/sys/vm/swappiness

# Change temporarily
sudo sysctl vm.swappiness=100

# Make it permanent
echo 'vm.swappiness = 100' | sudo tee /etc/sysctl.d/99-swap.conf
sudo sysctl --system

The intuition for the values:

| Value | Meaning | Suitable for |
|---|---|---|
| 0-10 | strongly avoid reclaiming anonymous pages | very slow HDD swap, dedicated DB servers |
| 60 | default, prefers reclaiming file pages | general servers |
| 100 | treat anonymous and file pages equally | fast swap like zram/zswap |
| 100-200 | prefer reclaiming anonymous pages | very fast swap + heavy page cache demand |

With swap close to RAM speed — as with zram — raising swappiness to 100 or beyond is standard practice, because the cost model has changed.
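
To make the "balance" reading concrete, here is a minimal sketch of the weighting. It assumes the simplified model in which anonymous pages get a weight equal to swappiness and file pages get 200 minus swappiness; the real reclaim code also factors in the observed cost of reclaiming each kind, so treat the output as intuition rather than a prediction. The function name swappiness_weights is made up for this example.

```shell
# Sketch: relative reclaim weights implied by a swappiness value.
# Assumption: anon weight = swappiness, file weight = 200 - swappiness.
swappiness_weights() {
  s=$1
  if [ "$s" -lt 0 ] || [ "$s" -gt 200 ]; then
    echo "swappiness must be between 0 and 200" >&2
    return 1
  fi
  echo "anon=$s file=$((200 - s))"
}

# Print the weights for the values discussed in the table
for v in 0 60 100 200; do
  swappiness_weights "$v"
done
```

At 60 the file side carries more than twice the weight of the anonymous side (140 versus 60), at 100 the two are even, and at 200 the file weight drops to zero.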

zram vs zswap — Sorting Out the Compressed Memory Tiers

Both technologies "extend RAM through compression," but their architectures differ. They are easy to confuse, so here is a picture.

[zram]                              [zswap]
                                  
 anonymous page reclaim              anonymous page reclaim
      │                                  │
      ▼                                  ▼
 zram block device                   zswap compressed pool (RAM cache)
 (= compressed swap in RAM)              │ when full / pages go cold
      │                                  ▼
      ▼                              real swap device
   done. (no disk behind it*)        (writeback to SSD/HDD)
                                  
 * a backing device can be configured via the writeback option

| Aspect | zram | zswap |
|---|---|---|
| Form | compressed RAM block device | compression cache in front of swap |
| Needs disk swap | no (standalone) | yes (real swap behind it) |
| Cold pages | stay resident in RAM | evicted to disk (LRU) |
| Incompressible pages | stay in RAM | bypass to disk |
| Best fit | machines without disk swap, laptops | servers that already have swap disks |
| Notable users | ChromeOS, Android, Fedora default | some server distributions |

zram in Practice

The zram-generator approach that Fedora adopted by default is the cleanest.

# /etc/systemd/zram-generator.conf
[zram0]
# Half of RAM, capped at 8GiB
zram-size = min(ram / 2, 8192)
compression-algorithm = zstd
swap-priority = 100

For manual setup:

# Load the module and configure the device
sudo modprobe zram num_devices=1
echo zstd | sudo tee /sys/block/zram0/comp_algorithm
echo 8G   | sudo tee /sys/block/zram0/disksize
sudo mkswap /dev/zram0
sudo swapon -p 100 /dev/zram0

# Verify — note the compression ratio (COMPR vs DATA)
zramctl
NAME       ALGORITHM DISKSIZE  DATA COMPR TOTAL STREAMS MOUNTPOINT
/dev/zram0 zstd            8G  1.2G  310M  330M       8 [SWAP]

The two main algorithm choices are lzo-rle (fast, lower ratio) and zstd (slightly slower, better ratio). On a recent CPU, zstd usually wins. Compression ratios of 2x to 4x are typical depending on the workload.
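
If you want the ratio as a number rather than eyeballing zramctl, the raw counters live in /sys/block/zram0/mm_stat. The sketch below relies on the first three fields of that file being the original data size, the compressed data size, and the total memory used (all in bytes), as described in the kernel's zram documentation. The helper name zram_ratio is an assumption, and the sample line is just the zramctl figures above converted to bytes.

```shell
# Sketch: compute zram compression ratios from one mm_stat line on stdin.
# Fields used: $1 = orig_data_size, $2 = compr_data_size, $3 = mem_used_total (bytes).
zram_ratio() {
  awk '{
    if ($2 == 0 || $3 == 0) { print "no data stored yet"; exit }
    # data/compr ignores allocator overhead; data/total includes it
    printf "data/compr=%.2f data/total=%.2f\n", $1 / $2, $1 / $3
  }'
}

# Sample line: 1.2G of data, 310M compressed, 330M total, expressed in bytes
echo "1288490189 325058560 346030080 0 346030080 0 0 0" | zram_ratio
```

On a live system the equivalent call would be `zram_ratio < /sys/block/zram0/mm_stat`. The data/total figure is the one that tells you how much RAM the device really costs.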

zswap in Practice

zswap is built into the kernel — you just turn it on. But it only makes sense with a real swap device behind it.

# Enable at boot — add to the kernel command line
# zswap.enabled=1 zswap.compressor=zstd zswap.max_pool_percent=20

# Toggle at runtime
echo 1    | sudo tee /sys/module/zswap/parameters/enabled
echo zstd | sudo tee /sys/module/zswap/parameters/compressor
echo 20   | sudo tee /sys/module/zswap/parameters/max_pool_percent

# Check statistics (debugfs)
sudo grep -r . /sys/kernel/debug/zswap/

The decision rule is simple. Machines where you cannot or do not want a disk swap (laptops worried about flash wear, embedded devices) take zram; servers that already have SSD swap and want to cut swap IO take zswap. Running both at once means compressing twice — an anti-pattern to avoid.

NBD — How Network Block Devices Work

To understand nbd-vram you need NBD. NBD is a simple protocol that "forwards block device read/write requests to a server across a socket."

 ┌─────────── client ───────────────┐      ┌────── server process ─────┐
 │  app → /dev/nbd0 (block device)  │      │  receives requests and    │
 │         │                        │      │  applies them to its      │
 │     nbd kernel module            │ sock │  "storage"                │
 │         └──── read/write ────────┼─────▶│                           │
 │                                  │      │  the storage can be       │
 │                                  │      │  anything: a file, RAM,   │
 │                                  │      │  a remote disk,           │
 │                                  │      │  ...or GPU VRAM           │
 └──────────────────────────────────┘      └───────────────────────────┘

The crucial point: the client neither knows nor cares where the server stores the data. nbd-vram implements this server with CUDA/OpenCL, turning block writes into cudaMemcpy calls into a VRAM buffer — a "VRAM disk." Connecting over a Unix socket on the same host minimizes the networking overhead too.

# Conceptual flow (per the nbd-vram README)
# 1. Start the nbd server backed by VRAM (e.g., allocate 8GB)
./nbd-vram-server --size 8G --socket /run/nbd-vram.sock &

# 2. Connect the client → /dev/nbd0 appears
sudo nbd-client -unix /run/nbd-vram.sock /dev/nbd0

# 3. Use it as swap
sudo mkswap /dev/nbd0
sudo swapon -p 50 /dev/nbd0   # higher priority than the SSD swap

The Performance Arithmetic of VRAM Swap — PCIe Bandwidth Math

Whether this idea is reasonable can be judged with numbers. Compare rough bandwidth and latency across the memory tiers:

| Tier | Bandwidth (approx.) | Latency (approx.) |
|---|---|---|
| DDR5 RAM | 50-80 GB/s | 80-100 ns |
| GPU VRAM (via PCIe 4.0 x16) | 32 GB/s theoretical, 20-25 GB/s effective | microseconds |
| NVMe SSD (PCIe 4.0 x4) | 5-7 GB/s | tens of microseconds |
| SATA SSD | 0.55 GB/s | hundreds of microseconds |
| zram (zstd, in RAM) | CPU-bound, several GB/s | compress/decompress cost |

The arithmetic looks favorable. Going by the table, VRAM over PCIe 4.0 x16 offers roughly 3-5x the bandwidth of NVMe swap (20-25 GB/s effective versus 5-7 GB/s) and about an order of magnitude lower latency. With PCIe 5.0 x16 (64 GB/s theoretical) the gap widens further. What you must not forget, though, is the overhead of the NBD path: page fault → block layer → nbd kernel module → userspace server → CUDA copy. Every hop involves context switches and copies, so you never reach the theoretical bandwidth, and the added per-request latency eats into the raw PCIe advantage. Even so, the conclusion "faster swap than NVMe" largely holds.
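
The link numbers are easy to reproduce. This sketch assumes the standard per-lane signalling rates (8, 16, and 32 GT/s for PCIe 3.0, 4.0, and 5.0) and 128b/130b encoding, and it counts 1 GB as 10^9 bytes. The function names pcie_gbps and page_transfer_us are invented for this example, and the 20 GB/s used at the end is the "effective" figure from the table, not a measurement.

```shell
# Sketch: theoretical one-direction PCIe bandwidth in GB/s.
# Usage: pcie_gbps GENERATION LANES
pcie_gbps() {
  awk -v gen="$1" -v lanes="$2" 'BEGIN {
    rate[3] = 8; rate[4] = 16; rate[5] = 32        # GT/s per lane
    if (!(gen in rate)) { print "unsupported generation" > "/dev/stderr"; exit 1 }
    # 128b/130b encoding, then bits to bytes
    printf "%.1f\n", rate[gen] * lanes * 128 / 130 / 8
  }'
}

# Sketch: time in microseconds to move one 4 KiB page at a given GB/s.
page_transfer_us() {
  awk -v bw="$1" 'BEGIN { printf "%.2f\n", 4096 / (bw * 1000) }'
}

echo "PCIe 4.0 x16:        $(pcie_gbps 4 16) GB/s"
echo "PCIe 5.0 x16:        $(pcie_gbps 5 16) GB/s"
echo "PCIe 4.0 x4 (NVMe):  $(pcie_gbps 4 4) GB/s"
echo "4 KiB page at 20 GB/s: $(page_transfer_us 20) us"
```

The last line is the interesting one: moving a single page across the link takes a fraction of a microsecond, so for swap-sized requests the software path around the copy, not the copy itself, dominates.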

When It Makes Sense, and When It Does Not

It makes sense when:

  • Your desktop has a gaming GPU whose VRAM sits idle during weekday work hours
  • You need a temporary large-memory job on a machine where adding RAM is physically impossible (slot limits, laptops)
  • You want to experiment and learn — it is a superb teaching aid that covers the block layer, NBD, and GPU memory all at once

It should be avoided when:

  • You run real GPU workloads (LLM inference, gaming) in parallel — VRAM contention plus the risk of losing the swap device
  • It is a production server — a GPU driver or process failure becomes a swap device failure, and when anonymous pages in swap are lost, the kernel has no choice but to kill the affected processes
  • It is a laptop that suspends frequently — VRAM contents are not guaranteed to survive suspend

In short, the HN consensus is sound: "fun, educational, practical in narrow situations — but if you can add RAM, buy RAM."

OOM — the Kernel's Last Resort and Userspace Helpers

The Problem with the Kernel OOM Killer

When memory truly runs out, the kernel OOM killer terminates the process with the highest oom_score. The problem is that it fires far too late. It only acts once reclaim has completely failed to free memory for an allocation, and in the minutes leading up to that point the system is usually frozen in thrashing.

# Check per-process OOM scores
cat /proc/1234/oom_score

# Make an important process a less likely OOM victim
# (range is -1000 to 1000; -1000 exempts it entirely)
echo -500 | sudo tee /proc/1234/oom_score_adj

earlyoom and systemd-oomd

This is why userspace daemons emerged to intervene "earlier and smarter."

# earlyoom — preemptive kill when free memory/swap drops below thresholds
sudo apt install earlyoom
# Fire below 10% memory and 5% swap; prefer killing certain processes
sudo tee /etc/default/earlyoom <<'EOF'
EARLYOOM_ARGS="-m 10 -s 5 --avoid '(^|/)(sshd|systemd)$' --prefer '(^|/)(chrome|java)$'"
EOF
sudo systemctl enable --now earlyoom

systemd-oomd goes one step further and watches PSI (Pressure Stall Information). Instead of "what percentage of memory is left," it acts on "how long processes are stalling because of memory," and it kills whole cgroups at a time.

# Read PSI directly — some: some tasks stalled, full: all stalled
cat /proc/pressure/memory
some avg10=0.00 avg60=0.12 avg300=0.05 total=8123456
full avg10=0.00 avg60=0.03 avg300=0.01 total=2345678

# systemd-oomd status
oomctl
systemctl status systemd-oomd

On a desktop, I strongly recommend enabling either earlyoom or systemd-oomd. It prevents the "one browser tab takes the whole system hostage" scenario.
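
If you want to act on PSI yourself, for example from a cron job or a monitoring hook, the parsing is trivial. The following is a minimal sketch, not how systemd-oomd is implemented: it extracts the avg10 value of the "full" line and compares it against a threshold. The helper names and the 10 percent threshold are assumptions chosen for illustration.

```shell
# Sketch: print the avg10 value of the "full" line from PSI text on stdin.
psi_full_avg10() {
  awk '$1 == "full" {
    for (i = 2; i <= NF; i++) {
      split($i, kv, "=")
      if (kv[1] == "avg10") print kv[2]
    }
  }'
}

# Sketch: exit 0 when VALUE >= THRESHOLD (both are percentages).
psi_over_threshold() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 >= t + 0) }'
}

# Demo on the sample output shown above
sample='some avg10=0.00 avg60=0.12 avg300=0.05 total=8123456
full avg10=0.00 avg60=0.03 avg300=0.01 total=2345678'
val=$(printf '%s\n' "$sample" | psi_full_avg10)
if psi_over_threshold "$val" 10; then
  echo "memory pressure high (full avg10=$val)"
else
  echo "memory pressure ok (full avg10=$val)"
fi
```

On a live system, replace the sample with `psi_full_avg10 < /proc/pressure/memory`.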

cgroup v2 — Memory Control per Process Group

A more precise tool than system-wide tuning is the cgroup v2 memory controller. There are four key knobs:

| Knob | Meaning | Behavior at the threshold |
|---|---|---|
| memory.min | absolute guarantee (exempt from reclaim) | other cgroups yield |
| memory.low | soft protection | protected only when there is slack |
| memory.high | soft ceiling | throttle + aggressive reclaim (no kill) |
| memory.max | hard ceiling | OOM kill on excess |

memory.high is especially useful. On reaching the limit it slows allocations and forces reclaim instead of killing, which fits the "this batch job eats too much memory but I do not want it dead" situation perfectly.

# Cleanest: apply to a systemd service
sudo systemctl set-property batch-job.service MemoryHigh=4G MemoryMax=6G

# Run a one-off command inside a constrained cgroup
systemd-run --scope -p MemoryHigh=2G -p MemoryMax=3G ./heavy-script.sh

# Per-cgroup memory usage
systemd-cgtop -m
cat /sys/fs/cgroup/system.slice/batch-job.service/memory.current

Swap can be controlled per cgroup too. Setting memory.swap.max to 0 bans swap for just that group, enabling policies like "no swap for the DB, swap allowed for batch jobs."

Memory Traps in Container Environments

In containers, the concepts above get subtly twisted. Three traps people hit constantly.

Trap 1 — free inside a container shows host numbers. Run free -h inside a container and you see the host's total memory, not the cgroup limit. The classic failure: the application concludes "plenty of memory," grows its caches, and gets OOM-killed. Container-aware runtimes like the JVM and .NET read cgroup limits, but every script that parses free directly is wrong.

# How to see the real limits inside a container
cat /sys/fs/cgroup/memory.max        # hard limit
cat /sys/fs/cgroup/memory.current    # current usage

Trap 2 — page cache counts against container memory. If a container reads a lot of files, the page cache shows up in memory.current. This is usually the answer to "my app only uses 1GB, why am I near the limit?" The cache is reclaimed under pressure so it is usually harmless, but if you do not know how much reclaimable cache sits below the memory.max OOM line, you will misread your monitoring graphs. Look at working-set metrics (container_memory_working_set_bytes in Kubernetes).
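
To see the split for a single cgroup without a metrics stack, you can compute a working set by hand. This sketch assumes cgroup v2 and subtracts inactive_file (from memory.stat) from memory.current, which to my knowledge is the same arithmetic behind container_memory_working_set_bytes. The function name cgroup_working_set is made up, and the directory argument defaults to the path a container usually sees.

```shell
# Sketch: working set = memory.current - inactive_file, for one cgroup v2 directory.
# Usage: cgroup_working_set [CGROUP_DIR]   (defaults to /sys/fs/cgroup)
cgroup_working_set() {
  dir=${1:-/sys/fs/cgroup}
  current=$(cat "$dir/memory.current") || return 1
  inactive=$(awk '$1 == "inactive_file" { print $2 }' "$dir/memory.stat") || return 1
  awk -v c="$current" -v i="$inactive" 'BEGIN {
    ws = c - i
    if (ws < 0) ws = 0                      # clamp, the two reads are not atomic
    printf "current=%dMiB inactive_file=%dMiB working_set=%dMiB\n",
           c / 1048576, i / 1048576, ws / 1048576
  }'
}
```

Inside a container, run `cgroup_working_set` with no arguments. On a host, point it at a service directory such as /sys/fs/cgroup/system.slice/batch-job.service. A large gap between current and working_set means most of the usage is cache that can be dropped under pressure.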

Trap 3 — Kubernetes kept swap off (for a long time). Kubernetes has disabled node swap by default, and only since 1.28 has NodeSwap entered beta. Real-world leftovers persist: kubelet refusing to start because it detects swap after you enabled zram on the host, and confusion because requests/limits accounting does not consider swap. If you want compressed memory on container nodes, review the kubelet settings (failSwapOn=false, swapBehavior) and consider whether you need an approach that does not look like swap at all.

Observability — Reading /proc/meminfo and smem

Measurement precedes tuning. At minimum you should be able to read these:

$ grep -E 'MemTotal|MemAvailable|Buffers|^Cached|SwapTotal|SwapFree|Dirty|AnonPages' /proc/meminfo
MemTotal:       32768000 kB
MemAvailable:   18234560 kB   # ← look at this, not MemFree
Buffers:          512000 kB
Cached:          9876000 kB   # page cache
SwapTotal:       8388604 kB
SwapFree:        7340032 kB
Dirty:             12340 kB   # dirty pages awaiting writeback
AnonPages:      10240000 kB   # total anonymous pages

A small MemFree does not mean memory shortage. Page cache is reclaimable at any time, so the real signal is MemAvailable — "how much can be handed to a new workload right now."
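
For alerting, a percentage is usually more convenient than raw kilobytes. A minimal sketch, assuming the standard /proc/meminfo layout shown above; the helper name mem_available_pct is invented for this example.

```shell
# Sketch: MemAvailable as a percentage of MemTotal, from meminfo text on stdin.
mem_available_pct() {
  awk '$1 == "MemTotal:"     { total = $2 }
       $1 == "MemAvailable:" { avail = $2 }
       END { if (total > 0) printf "%.1f\n", avail * 100 / total }'
}

# Demo with the two values from the sample output above
printf 'MemTotal: 32768000 kB\nMemAvailable: 18234560 kB\n' | mem_available_pct
```

With the sample values this prints 55.6. On a live system, use `mem_available_pct < /proc/meminfo`.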

For per-process memory, avoid the RSS trap (shared libraries double-counted) and look at PSS instead.

# smem — top processes by PSS (proportionally attributed usage)
sudo smem -rs pss | head
  PID User     Command                         Swap      USS      PSS      RSS
 2143 app      java -Xmx4g ...                    0  3145728  3167234  3210240
 1021 app      postgres: writer                   0   102400   145000   512000

# Who is using swap, per process
sudo smem -rs swap | head

# System-wide swap in/out trend (si/so columns)
vmstat 1 5

If vmstat shows persistently nonzero si/so, suspect swap thrashing, then confirm the actual latency impact with PSI (/proc/pressure/memory). That order is the standard playbook.
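
"Persistently nonzero" can be turned into a number. The sketch below counts how many vmstat samples show any swap traffic. It locates the si and so columns from the header row instead of hard-coding positions, and it can skip leading rows, because the first row vmstat prints is an average since boot. The function name swap_active_samples and the sample rows are illustrative.

```shell
# Sketch: count vmstat samples with swap-in or swap-out activity.
# Usage: vmstat 1 10 | swap_active_samples [ROWS_TO_SKIP]
swap_active_samples() {
  awk -v skip="${1:-0}" '
    $1 == "r" {                              # column header row
      for (i = 1; i <= NF; i++) {
        if ($i == "si") si = i
        if ($i == "so") so = i
      }
      next
    }
    si && so && $1 ~ /^[0-9]+$/ {            # data row
      if (++row <= skip) next
      n++
      if ($si + $so > 0) active++
    }
    END { printf "%d/%d\n", active, n }'
}

# Demo on two made-up samples: one idle, one swapping
swap_active_samples <<'EOF'
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0 104857 512000  51200 987600    0    0    12    34  100  200  5  2 92  1  0
 2  1 204857 112000  51200 487600  820 1500   900  1600  900 2200 10 15 40 35  0
EOF
```

The demo prints 1/2. In practice, `vmstat 1 60 | swap_active_samples 1` gives the share of the last minute that saw swap traffic, which you can then line up against the PSI numbers.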

Benchmark Methodology — If You Want to Validate VRAM Swap

Here is the methodology for evaluating experimental setups like nbd-vram. The core principle: swap performance is measured by latency distribution, not throughput.

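# 1. Baseline: raw device performance (fio, 4K random, mimics swap IO patterns)
#    WARNING: this overwrites the device. Run it before mkswap/swapon, never on active swap.
#    An async ioengine is required for --iodepth to have any effect.
sudo fio --name=swaptest --filename=/dev/nbd0 --rw=randrw --bs=4k \
  --ioengine=libaio --iodepth=32 --runtime=60 --time_based --direct=1 --group_reporting

# 2. Observe: start recording swap in/out and PSI before the load begins
vmstat 1 > vmstat.log &
VMSTAT_PID=$!
while true; do cat /proc/pressure/memory >> psi.log; sleep 1; done &
PSI_PID=$!

# 3. Real swap pressure: force a working set bigger than RAM
stress-ng --vm 4 --vm-bytes 120% --vm-method flip --metrics-brief -t 120

# 4. Perceived metric: from a second terminal, while step 3 is running,
#    time some unrelated work (e.g., shell response)
time ls -R /usr/share > /dev/null

# 5. Stop the recorders once the load has finished
kill "$VMSTAT_PID" "$PSI_PID"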

Keep at least three comparison groups: NVMe swap alone, zram alone, and the experimental setup (VRAM swap). Run the same stress-ng scenario in each, and compare p99 latency and PSI full time rather than averages, because the tail is what matches real-world feel. One more thing: to normalize cache state between runs, drop the page cache before each measurement.

# Drop caches before measurement (never use outside benchmarking)
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
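
For the p99 comparison itself, a small helper is enough. The sketch below assumes you have collected one latency sample per line (for instance, repeated timings of the ls probe from step 4) and uses the nearest-rank definition of a percentile. The percentile function and the sample numbers are illustrative, not measurements.

```shell
# Sketch: nearest-rank percentile of numbers read from stdin, one per line.
# Usage: percentile P < samples.txt
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = p * NR / 100
      idx = int(rank); if (idx < rank) idx++     # ceiling
      if (idx < 1) idx = 1
      print v[idx]
    }'
}

# Demo: ten made-up latencies in milliseconds, one of them a long stall
samples='2
2
3
2
2
3
2
2
2
250'
echo "p50: $(printf '%s\n' "$samples" | percentile 50) ms"
echo "p99: $(printf '%s\n' "$samples" | percentile 99) ms"
```

The demo prints a p50 of 2 ms and a p99 of 250 ms. The mean of the same data is 27 ms, a value that describes none of the ten requests, which is why the comparison should be made on the tail.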

Appendix — Quick Disk Swap Recipes

Before zram/zswap there is the fundamental: creating a swap file. It can be added instantly with no partition surgery, which makes it especially useful in emergencies.

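# Create a 4GB swap file (ext4/xfs)
# If swapon complains that the file has holes, recreate it with dd instead:
#   sudo dd if=/dev/zero of=/swapfile bs=1M count=4096
sudo fallocate -l 4G /swapfile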
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Activate automatically at boot
echo '/swapfile none swap defaults 0 0' | sudo tee -a /etc/fstab

# On btrfs, use the dedicated command (handles CoW/compression attributes)
sudo btrfs filesystem mkswapfile --size 4g /swap/swapfile

# Check priorities — higher numbers are used first
swapon --show
cat /proc/swaps

With multiple swap devices, the kernel uses the highest-priority one first. Setting zram to 100 and disk swap to 10 creates a natural waterfall: "fast tier first, spill to disk."

By environment:

| Environment | Recommended setup |
|---|---|
| Desktop/laptop (16GB or less) | zram (half of RAM, zstd) + swappiness 100-180 + earlyoom |
| Desktop (32GB or more) | zram 4-8GB + systemd-oomd, small disk swap |
| General server (with SSD) | zswap + NVMe swap partition + per-service memory.high caps |
| DB server | keep small swap + memory.swap.max=0 for the DB cgroup only + low swappiness |
| Kubernetes node | default to no swap with precise requests/limits; adopt NodeSwap carefully |
| Workstation with an idle GPU | (for fun) nbd-vram swap, at lower priority than zram |

Closing Thoughts

nbd-vram is, in a sense, a toy. But like all good toys, playing with it teaches you serious things: that swap is not evil but a reclaim option; that compressed memory (zram/zswap) is already the default in mainstream distributions; that OOM should not be left to the kernel alone but handled preemptively with PSI; and that memory observability in the container era starts with reading cgroup files directly.

In 2026, AI workloads have made us more sensitive to the memory hierarchy than ever. In an era where the boundary between VRAM and RAM is blurring (with unified memory architectures spreading), a project that turns GPU memory into swap through nothing but the block device abstraction is also a fine showcase of the elegance of Linux abstractions. Give it a spin some weekend. Just not in production.
