Local LLM Inference Optimization — From Quantization to Breaking the VRAM Ceiling

Introduction — Why Local LLMs Are Hot Again
Hardware Selection — Think VRAM First
Quantization Formats — GGUF, 4-bit, AWQ, GPTQ
- GGUF — The Standard of the llama.cpp Ecosystem
- AWQ and GPTQ — Quantization for the GPU Serving Camp
Inference Engine Comparison — llama.cpp vs vLLM vs Ollama
KV Cache and the Memory Arithmetic of Context Length
A Realistic Build Guide per VRAM Budget
Offloading Strategies — Fighting a VRAM Shortage
- The Orthodox Route: Per-Layer CPU Offloading
- The Contrarian Route: Spare VRAM as System Swap — nbd-vram
Serving Configuration Examples
Multi-GPU — When One Card Is Not Enough
Performance Measurement Methodology — Numbers, Not Vibes
Common Pitfalls
Closing
References

Introduction — Why Local LLMs Are Hot Again

In the first half of 2026, the local LLM community is buzzing again. On Hacker News, contrarian projects like nbd-vram — which exposes spare GPU VRAM as a Linux swap device — made the front page, and local inference optimization posts keep appearing on GeekNews. Several currents have converged.

First, privacy. Now that we feed code, documents, and even journals into LLMs, resistance to sending sensitive data to external APIs has grown. The June 2026 npm supply-chain attack that reached Red Hat Cloud Services, and the one-click GitHub token theft via a VSCode extension bug, both reinforced the sentiment: my data stays on my machine.

Second, cost. With AI coding agents now ubiquitous, token consumption has exploded. In workflows where agents run autonomously for hours, the API bill becomes impossible to ignore, creating demand to route repetitive auxiliary work to local models.

Third, big-tech fatigue. As the surge in no-AI search traffic on DuckDuckGo and the Gmail exodus show, there is a clear backlash against everything converging into cloud AI. Local LLMs are the technical outlet for that sentiment.

Conveniently, the hardware and software are ready: consumer hardware with dramatically larger unified memory, a mature quantization ecosystem, and two pillars of inference engines in llama.cpp and vLLM. This post draws the full map of local inference optimization.

Hardware Selection — Think VRAM First

The first principle of local LLM hardware is simple: memory capacity and bandwidth come before compute speed. If the model does not fit in memory, the fastest GPU is useless; once it fits, token generation speed is mostly determined by memory bandwidth.

First-order approximation of generation speed:

  tokens/sec ~= memory bandwidth (GB/s) / model size (GB)

Example: a 5GB model after quantization
  - 100 GB/s bandwidth (dual-channel DDR5 CPU) -> ~20 tokens/sec
  - 400 GB/s bandwidth (Apple M-series Max)    -> ~80 tokens/sec
  - 1000 GB/s bandwidth (high-end dGPU)        -> ~200 tokens/sec

The choice splits into two broad paths.

Aspect	Unified memory (Apple Silicon, Strix Halo class)	Discrete GPU (RTX class)
Memory capacity	Large (64GB to 512GB)	Small (16GB to 32GB typical)
Memory bandwidth	Medium (200 to 800 GB/s)	High (800 to 1700 GB/s)
Loading large models	70B+ feasible with quantization	Hard on a single card
Generation speed	Medium	Fast
Prompt processing speed	Relatively slow	Very fast
Power / noise	Low	High
Expandability	None (fixed at purchase)	Expandable with multi-GPU

In short: unified memory if you want large models at decent speed, discrete GPU if you want mid-size models at top speed. For workloads that frequently ingest long documents (RAG, codebase analysis), the fast prompt processing of a dGPU pays off; for conversation-heavy use, the capacity advantage of unified memory wins.

Back-of-the-envelope model memory:

Model weight memory (approximate):

  fp16:      parameter count x 2 bytes
  8bit (Q8): parameter count x 1 byte + small overhead
  4bit (Q4): parameter count x 0.5 bytes + small overhead

Example: 8B model  -> fp16 16GB / Q8 8.5GB / Q4 4.7GB
Example: 70B model -> fp16 140GB / Q8 75GB / Q4 40GB

KV cache and activations come on top (see below)

Quantization Formats — GGUF, 4-bit, AWQ, GPTQ

Quantization represents weights at lower precision to cut memory and bandwidth requirements. For local LLMs it is effectively mandatory.

GGUF — The Standard of the llama.cpp Ecosystem

GGUF is the single-file format used by llama.cpp. Weights, tokenizer, and metadata live in one file, making distribution trivial. Quantization levels are distinguished by filename suffixes.

Quantization	Bit level	8B model size	Quality impact
Q8_0	8-bit	~8.5GB	Near lossless
Q6_K	~6-bit	~6.6GB	Very small
Q5_K_M	~5-bit	~5.7GB	Small
Q4_K_M	~4-bit	~4.9GB	Hard to notice (recommended default)
Q3_K_M	~3-bit	~4.0GB	Noticeable on some tasks
Q2_K	~2-bit	~3.2GB	Clear degradation, emergencies only

The community rule of thumb is clear: for the same memory, a bigger model at 4-bit usually beats a smaller model at high precision. With an 8GB budget, a 14B Q4 often outperforms an 8B Q8.

AWQ and GPTQ — Quantization for the GPU Serving Camp

In GPU serving engines like vLLM, the AWQ and GPTQ families dominate.

GPTQ: post-training quantization that minimizes per-layer quantization error using a calibration dataset. Good quality at 4-bit.
AWQ: observes activation distributions and protects the important weight channels during quantization. Low calibration dependence and strong quality retention made it the default choice for 4-bit GPU serving.
FP8: 8-bit floating point with hardware support on recent GPUs. Rapidly becoming the standard for throughput serving.

In summary: start from GGUF Q4_K_M if you run llama.cpp, and from AWQ 4-bit or FP8 if you run vLLM.

Inference Engine Comparison — llama.cpp vs vLLM vs Ollama

Aspect	llama.cpp	vLLM	Ollama
Main target	Personal machines, edge	GPU servers, teams/production	Personal machines (convenience first)
Hardware	CPU, Apple Silicon, GPU — all	NVIDIA/AMD GPU focused	Same as llama.cpp underneath
Model format	GGUF	HF safetensors, AWQ, GPTQ, FP8	GGUF (llama.cpp inside)
Concurrency	Limited (a few slots)	Strong (continuous batching, PagedAttention)	Limited
CPU offloading	Strong (per-layer)	Limited	Supported (automatic)
API	Built-in OpenAI-compatible server	Built-in OpenAI-compatible server	Own API + OpenAI-compatible
Operational difficulty	Medium (build flags to learn)	Medium to high	Very low
Best suited for	Power users who enjoy tuning	Teams needing throughput	Beginners who want to try now

Reduced to one line each:

Ollama: I want to start within five minutes. Fine-tuning the setup comes later.
llama.cpp: I want to squeeze the last drop out of my hardware. I need offloading because VRAM is short.
vLLM: I need a server hit concurrently by multiple users or agents, and the GPU can hold the whole model.

Running both on the same machine for different purposes is common: Ollama for casual daily use, vLLM when the team demo needs throughput.

KV Cache and the Memory Arithmetic of Context Length

The wall you hit most often with local LLMs is not model weights but the KV cache. The longer the context, the more attention keys and values must be stored per token — and that eats VRAM.

The formula:

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """KV cache memory (x2 because K and V are two tensors).
    bytes_per_elem: fp16/bf16=2, q8=1, q4=0.5
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Example 1: 8B-class model (32 layers, 8 KV heads, head_dim 128, fp16)
per_token = kv_cache_bytes(32, 8, 128, 1)
print(per_token)            # 131072 bytes = 128KB per token

print(kv_cache_bytes(32, 8, 128, 8192) / 2**30)    # 8k context: 1.0GB
print(kv_cache_bytes(32, 8, 128, 32768) / 2**30)   # 32k context: 4.0GB
print(kv_cache_bytes(32, 8, 128, 131072) / 2**30)  # 128k context: 16.0GB

# Example 2: 70B-class model (80 layers, 8 KV heads, head_dim 128, fp16)
print(kv_cache_bytes(80, 8, 128, 32768) / 2**30)   # 32k context: 10.0GB

What this arithmetic teaches in practice:

An 8B model at Q4 weighs barely 5GB, but the moment you use 128k context, the KV cache hits 16GB — three times the weights. This is why a model loads fine but dies when you extend the context.
KV cache grows linearly with concurrent users (batch). This is why a local server with multiple agents attached must set conservative context limits.
Three countermeasures exist: KV cache quantization (q8/q4 instead of fp16 for a half to a quarter of the memory), picking models with strong GQA (fewer KV heads is better), and explicit context limits.

Quantizing the KV cache in llama.cpp:

# KV cache at q8: half the memory, negligible quality impact
llama-server -m model-q4_k_m.gguf \
  --ctx-size 32768 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0

A Realistic Build Guide per VRAM Budget

Combining weight and KV cache arithmetic yields realistic configurations per memory budget.

Memory budget	Sensible setup	Context headroom	Use case feel
8GB	7-8B model Q4	~8k	Light assistance, summaries, classification
12GB	8B Q6 or 14B Q4	8k-16k	The floor for daily coding assistance
16GB	14B Q4 + KV q8	16k-32k	The practical zone for coding/docs work
24GB	32B Q4 or 14B high precision	32k	First tier where local alone suffices
48GB (2 cards or unified)	70B-class Q4	16k-32k	Sharply reduces API dependence
96GB+ (unified)	70B high precision, large MoE Q4	64k+	Endgame local workstation

Two corrections apply. First, the context headroom column assumes the KV cache is quantized to q8. Second, MoE models have far fewer active parameters than their total, so generation is much faster — on unified-memory machines a large MoE at Q4 often feels better than a dense model of the same class.

Model family selection, simplified: for coding assistance prefer code-specialized fine-tunes; if multilingual quality matters, prefer families with strong multilingual tokenizers; for agent use, shortlist models with official tool-calling support — and always compare directly on 10 to 20 samples of your own work. Leaderboard rankings are reference material, nothing more.

Offloading Strategies — Fighting a VRAM Shortage

The Orthodox Route: Per-Layer CPU Offloading

The signature feature of llama.cpp: load only some model layers onto the GPU and run the rest from CPU memory.

# A 70B Q4 model (~40GB) on a 24GB GPU:
# 48 of 80 layers on GPU, the rest on CPU
llama-server -m llama-70b-q4_k_m.gguf \
  --n-gpu-layers 48 \
  --ctx-size 8192 \
  --threads 16

The rule of thumb is not that speed scales with the fraction of layers on GPU, but that the layers left on the CPU dominate total speed. Offload half and you converge to CPU speed no matter how fast the GPU is. Still, the value of running a model that otherwise would not run at all is enormous. For MoE models, the option to send only expert weights to the CPU works well — only the active experts are used at any moment, so the perceived slowdown is small.

# MoE model: shared weights (attention etc.) on GPU, expert FFNs on CPU
llama-server -m moe-model-q4.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe 30

The Contrarian Route: Spare VRAM as System Swap — nbd-vram

nbd-vram, which lit up Hacker News in 2026, flips the premise. Instead of borrowing system memory because VRAM is short, it exposes idle VRAM as a Linux block device usable as swap or a ramdisk. Using the network block device (NBD) protocol, GPU memory is wrapped so the kernel sees it as just a very fast disk.

Usual thinking:  model bigger than VRAM -> offload to RAM/disk
nbd-vram:        RAM is short           -> use spare VRAM as swap

+--------+      NBD protocol      +-----------------+
| Linux  | <-------------------> | GPU VRAM        |
| kernel |    (block device)     | (CUDA buffers)  |
| swap   |                       |                 |
+--------+                       +-----------------+

Going through PCIe makes it slower than real RAM but much faster than NVMe swap. Many found it surprisingly practical for memory-hungry jobs (large data processing, compilation) on workstations where a gaming GPU sits idle. Combined with local LLMs, you can even get the curious setup where pages of a giant CPU-inferenced model land in VRAM swap. More important than the technical fit is the attitude the project demonstrates: VRAM, RAM, and disk are not fixed roles but memory tiers of different speeds — and we decide the combination.

Serving Configuration Examples

llama.cpp Server — For Single-Machine Power Users

# Build (CUDA example)
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Launch the OpenAI-compatible server
./build/bin/llama-server \
  -m models/qwen2.5-14b-q4_k_m.gguf \
  --host 0.0.0.0 --port 8080 \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --flash-attn \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --parallel 2

Ollama — The Fastest Starting Point

# After install, pulling and running a model is one line each
ollama pull qwen2.5:14b
ollama run qwen2.5:14b

# Custom settings via a Modelfile
cat > Modelfile <<'EOF'
FROM qwen2.5:14b
PARAMETER num_ctx 16384
PARAMETER temperature 0.7
SYSTEM You are a coding assistant that answers concisely.
EOF
ollama create my-coder -f Modelfile
ollama run my-coder

Ollama uses llama.cpp internally, so performance characteristics are similar, but its defaults are conservative. In particular the default num_ctx is small, creating the trap of silently truncated context in long conversations — raise it explicitly.

vLLM — For Team Servers with Concurrent Requests

# docker-compose.yml
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      - ./models:/root/.cache/huggingface
    command: >
      --model Qwen/Qwen2.5-14B-Instruct-AWQ
      --quantization awq
      --max-model-len 16384
      --gpu-memory-utilization 0.92
      --max-num-seqs 16

The key vLLM flags mean:

gpu-memory-utilization: the fraction of total VRAM vLLM claims. Everything left after weights becomes the KV cache pool.
max-model-len: maximum sequence length. Directly tied to KV cache arithmetic, so set it conservatively for your workload.
max-num-seqs: cap on concurrently processed sequences. The dial trading throughput against per-request latency.

Multi-GPU — When One Card Is Not Enough

Running 70B on two 24GB cards is now commonplace. There are two approaches.

Approach	Principle	Pros	Cons
Tensor parallel (TP)	Cards share each layer's matrices	Also reduces latency	Frequent inter-card traffic, identical cards recommended
Pipeline parallel (PP)	Layers split into stages	Mixed card models possible	Single-request latency not improved

# vLLM: 2-way tensor parallel
vllm serve Qwen/Qwen2.5-72B-Instruct-AWQ \
  --quantization awq \
  --tensor-parallel-size 2 \
  --max-model-len 8192

# llama.cpp: per-card split ratio (mixed-card example)
llama-server -m llama-70b-q4_k_m.gguf \
  --n-gpu-layers 99 \
  --split-mode layer \
  --tensor-split 60,40

On consumer motherboards, PCIe lanes easily become the bottleneck. Tensor parallelism is communication-heavy and wants x8/x8 or better; with fewer lanes, pipeline (layer) splitting is the safer bet.

Performance Measurement Methodology — Numbers, Not Vibes

Local LLM tuning without measurement is superstition. Four standard metrics:

TTFT  (Time To First Token)  : time to the first token; reflects prompt processing
TPOT  (Time Per Output Token): time per generated token; inverse of generation speed
pp speed (prompt processing) : prompt tokens/sec; dominates long-input workloads
tg speed (token generation)  : generated tokens/sec; dominates chat workloads

llama.cpp ships a dedicated benchmark tool:

# Measure a 512-token prompt + 128-token generation scenario
./build/bin/llama-bench \
  -m models/qwen2.5-14b-q4_k_m.gguf \
  -p 512 -n 128 \
  -ngl 99

# Example output columns: config, pp512, tg128 tokens/sec
# Sweep quantization level, ngl, and thread count to build a table:
# that is your machine's quality-speed curve

Load-test the whole server while varying concurrency:

# vLLM built-in benchmark: TTFT/TPOT across request rates
vllm bench serve \
  --backend openai \
  --base-url http://localhost:8000 \
  --model Qwen/Qwen2.5-14B-Instruct-AWQ \
  --num-prompts 200 \
  --request-rate 4

Three principles for measurement:

Warm up first: the first request mixes in model loading and kernel compilation and is always slow.
Match your own workload distribution: a short-prompt benchmark says nothing about RAG performance. Match input/output length distributions to reality.
p95, not averages: in agent pipelines the slowest call delays everything. Watch the tail latency.

Common Pitfalls

Finally, the traps that come up repeatedly in the community.

Forgetting the context limit: default context sizes are short. As conversations grow, the beginning silently falls off and the model suddenly seems to have become stupid. Set ctx-size explicitly and budget the corresponding KV cache memory.
Subtle quality collapse from over-quantization: Q2 to Q3 quantization looks fine on short answers but falls apart on long reasoning chains and code generation. Compare quantization levels on an eval set from your own tasks.
Benchmark-vs-feel divergence: high tg tokens/sec means little if pp is slow — RAG will still feel sluggish. Always read both numbers together.
Neglected sampling parameters: every model has its own recommended temperature and top-p. Many people run defaults and blame the model.
Ignoring power and heat: for 24/7 servers, slightly lowering the power limit trades a few percent of tokens/sec for major reductions in power and heat.
Neglected security: expose an unauthenticated server on 0.0.0.0 and anyone on the network can use your GPU. Set at least an API key and firewall rules.
Model files of unknown origin: anyone can upload quantized models. In an era of routine supply-chain attacks, using well-downloaded repositories from verified uploaders and checking hashes is just as valid locally.
Not tracking updates: llama.cpp and vLLM improve noticeably every few weeks. Conclusions drawn from a six-month-old benchmark may already be stale.

Closing

Local LLM inference optimization ultimately converges on one question: how do I allocate quality, speed, and context length within my memory budget? Quantization shrinks the weight budget, KV cache quantization and context limits govern the cache budget, and offloading borrows budget from elsewhere. Contrarian hacks like nbd-vram show that the partitions in that ledger are far more fluid than they appear.

If you start today, the recommended path:

Pull a 14B Q4 model with Ollama and simply use it on your own work for a few days
When context or speed frustrations appear, drop down to llama.cpp and tune KV quantization and offloading
When concurrent users arrive, move up to vLLM and collect the throughput of continuous batching
At every step, compare before and after with llama-bench and load tests — in numbers

One more thing: this field moves fast. The optimal setup of today can be stale knowledge next quarter, so automating your measurement scripts and rerunning them periodically is the asset that lasts longest.

Cloud APIs and local models are not substitutes but a division of labor. Sensitive data, repetitive tasks, and offline environments go local; moments that demand top quality go to the API. Being able to draw that boundary with numbers from your own workload — that is the point of all the measuring and arithmetic in this post.

References

nbd-vram (VRAM as a block device): https://github.com/c0dejedi/nbd-vram
llama.cpp repository: https://github.com/ggml-org/llama.cpp
vLLM documentation: https://docs.vllm.ai/
Ollama: https://ollama.com/
GGUF format specification: https://github.com/ggml-org/ggml/blob/master/docs/gguf.md
AWQ paper: https://arxiv.org/abs/2306.00978
GPTQ paper: https://arxiv.org/abs/2210.17323
PagedAttention (vLLM) paper: https://arxiv.org/abs/2309.06180
llama.cpp server docs: https://github.com/ggml-org/llama.cpp/tree/master/tools/server
Local LLM discussions on Hacker News: https://news.ycombinator.com/
GeekNews: https://news.hada.io/