- Published on
Local LLM Inference Optimization — From Quantization to Breaking the VRAM Ceiling
- Authors

- Name
- Youngju Kim
- @fjvbn20031
- Introduction — Why Local LLMs Are Hot Again
- Hardware Selection — Think VRAM First
- Quantization Formats — GGUF, 4-bit, AWQ, GPTQ
- Inference Engine Comparison — llama.cpp vs vLLM vs Ollama
- KV Cache and the Memory Arithmetic of Context Length
- A Realistic Build Guide per VRAM Budget
- Offloading Strategies — Fighting a VRAM Shortage
- Serving Configuration Examples
- Multi-GPU — When One Card Is Not Enough
- Performance Measurement Methodology — Numbers, Not Vibes
- Common Pitfalls
- Closing
- References
Introduction — Why Local LLMs Are Hot Again
In the first half of 2026, the local LLM community is buzzing again. On Hacker News, contrarian projects like nbd-vram — which exposes spare GPU VRAM as a Linux swap device — made the front page, and local inference optimization posts keep appearing on GeekNews. Several currents have converged.
First, privacy. Now that we feed code, documents, and even journals into LLMs, resistance to sending sensitive data to external APIs has grown. The June 2026 npm supply-chain attack that reached Red Hat Cloud Services, and the one-click GitHub token theft via a VSCode extension bug, both reinforced the sentiment: my data stays on my machine.
Second, cost. With AI coding agents now ubiquitous, token consumption has exploded. In workflows where agents run autonomously for hours, the API bill becomes impossible to ignore, creating demand to route repetitive auxiliary work to local models.
Third, big-tech fatigue. As the surge in no-AI search traffic on DuckDuckGo and the Gmail exodus show, there is a clear backlash against everything converging into cloud AI. Local LLMs are the technical outlet for that sentiment.
Conveniently, the hardware and software are ready: consumer hardware with dramatically larger unified memory, a mature quantization ecosystem, and two pillars of inference engines in llama.cpp and vLLM. This post draws the full map of local inference optimization.
Hardware Selection — Think VRAM First
The first principle of local LLM hardware is simple: memory capacity and bandwidth come before compute speed. If the model does not fit in memory, the fastest GPU is useless; once it fits, token generation speed is mostly determined by memory bandwidth.
First-order approximation of generation speed:
tokens/sec ~= memory bandwidth (GB/s) / model size (GB)
Example: a 5GB model after quantization
- 100 GB/s bandwidth (dual-channel DDR5 CPU) -> ~20 tokens/sec
- 400 GB/s bandwidth (Apple M-series Max) -> ~80 tokens/sec
- 1000 GB/s bandwidth (high-end dGPU) -> ~200 tokens/sec
The choice splits into two broad paths.
| Aspect | Unified memory (Apple Silicon, Strix Halo class) | Discrete GPU (RTX class) |
|---|---|---|
| Memory capacity | Large (64GB to 512GB) | Small (16GB to 32GB typical) |
| Memory bandwidth | Medium (200 to 800 GB/s) | High (800 to 1700 GB/s) |
| Loading large models | 70B+ feasible with quantization | Hard on a single card |
| Generation speed | Medium | Fast |
| Prompt processing speed | Relatively slow | Very fast |
| Power / noise | Low | High |
| Expandability | None (fixed at purchase) | Expandable with multi-GPU |
In short: unified memory if you want large models at decent speed, discrete GPU if you want mid-size models at top speed. For workloads that frequently ingest long documents (RAG, codebase analysis), the fast prompt processing of a dGPU pays off; for conversation-heavy use, the capacity advantage of unified memory wins.
Back-of-the-envelope model memory:
Model weight memory (approximate):
fp16: parameter count x 2 bytes
8bit (Q8): parameter count x 1 byte + small overhead
4bit (Q4): parameter count x 0.5 bytes + small overhead
Example: 8B model -> fp16 16GB / Q8 8.5GB / Q4 4.7GB
Example: 70B model -> fp16 140GB / Q8 75GB / Q4 40GB
KV cache and activations come on top (see below)
Quantization Formats — GGUF, 4-bit, AWQ, GPTQ
Quantization represents weights at lower precision to cut memory and bandwidth requirements. For local LLMs it is effectively mandatory.
GGUF — The Standard of the llama.cpp Ecosystem
GGUF is the single-file format used by llama.cpp. Weights, tokenizer, and metadata live in one file, making distribution trivial. Quantization levels are distinguished by filename suffixes.
| Quantization | Bit level | 8B model size | Quality impact |
|---|---|---|---|
| Q8_0 | 8-bit | ~8.5GB | Near lossless |
| Q6_K | ~6-bit | ~6.6GB | Very small |
| Q5_K_M | ~5-bit | ~5.7GB | Small |
| Q4_K_M | ~4-bit | ~4.9GB | Hard to notice (recommended default) |
| Q3_K_M | ~3-bit | ~4.0GB | Noticeable on some tasks |
| Q2_K | ~2-bit | ~3.2GB | Clear degradation, emergencies only |
The community rule of thumb is clear: for the same memory, a bigger model at 4-bit usually beats a smaller model at high precision. With an 8GB budget, a 14B Q4 often outperforms an 8B Q8.
AWQ and GPTQ — Quantization for the GPU Serving Camp
In GPU serving engines like vLLM, the AWQ and GPTQ families dominate.
- GPTQ: post-training quantization that minimizes per-layer quantization error using a calibration dataset. Good quality at 4-bit.
- AWQ: observes activation distributions and protects the important weight channels during quantization. Low calibration dependence and strong quality retention made it the default choice for 4-bit GPU serving.
- FP8: 8-bit floating point with hardware support on recent GPUs. Rapidly becoming the standard for throughput serving.
In summary: start from GGUF Q4_K_M if you run llama.cpp, and from AWQ 4-bit or FP8 if you run vLLM.
Inference Engine Comparison — llama.cpp vs vLLM vs Ollama
| Aspect | llama.cpp | vLLM | Ollama |
|---|---|---|---|
| Main target | Personal machines, edge | GPU servers, teams/production | Personal machines (convenience first) |
| Hardware | CPU, Apple Silicon, GPU — all | NVIDIA/AMD GPU focused | Same as llama.cpp underneath |
| Model format | GGUF | HF safetensors, AWQ, GPTQ, FP8 | GGUF (llama.cpp inside) |
| Concurrency | Limited (a few slots) | Strong (continuous batching, PagedAttention) | Limited |
| CPU offloading | Strong (per-layer) | Limited | Supported (automatic) |
| API | Built-in OpenAI-compatible server | Built-in OpenAI-compatible server | Own API + OpenAI-compatible |
| Operational difficulty | Medium (build flags to learn) | Medium to high | Very low |
| Best suited for | Power users who enjoy tuning | Teams needing throughput | Beginners who want to try now |
Reduced to one line each:
- Ollama: I want to start within five minutes. Fine-tuning the setup comes later.
- llama.cpp: I want to squeeze the last drop out of my hardware. I need offloading because VRAM is short.
- vLLM: I need a server hit concurrently by multiple users or agents, and the GPU can hold the whole model.
Running both on the same machine for different purposes is common: Ollama for casual daily use, vLLM when the team demo needs throughput.
KV Cache and the Memory Arithmetic of Context Length
The wall you hit most often with local LLMs is not model weights but the KV cache. The longer the context, the more attention keys and values must be stored per token — and that eats VRAM.
The formula:
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
batch=1, bytes_per_elem=2):
"""KV cache memory (x2 because K and V are two tensors).
bytes_per_elem: fp16/bf16=2, q8=1, q4=0.5
"""
return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem
# Example 1: 8B-class model (32 layers, 8 KV heads, head_dim 128, fp16)
per_token = kv_cache_bytes(32, 8, 128, 1)
print(per_token) # 131072 bytes = 128KB per token
print(kv_cache_bytes(32, 8, 128, 8192) / 2**30) # 8k context: 1.0GB
print(kv_cache_bytes(32, 8, 128, 32768) / 2**30) # 32k context: 4.0GB
print(kv_cache_bytes(32, 8, 128, 131072) / 2**30) # 128k context: 16.0GB
# Example 2: 70B-class model (80 layers, 8 KV heads, head_dim 128, fp16)
print(kv_cache_bytes(80, 8, 128, 32768) / 2**30) # 32k context: 10.0GB
What this arithmetic teaches in practice:
- An 8B model at Q4 weighs barely 5GB, but the moment you use 128k context, the KV cache hits 16GB — three times the weights. This is why a model loads fine but dies when you extend the context.
- KV cache grows linearly with concurrent users (batch). This is why a local server with multiple agents attached must set conservative context limits.
- Three countermeasures exist: KV cache quantization (q8/q4 instead of fp16 for a half to a quarter of the memory), picking models with strong GQA (fewer KV heads is better), and explicit context limits.
Quantizing the KV cache in llama.cpp:
# KV cache at q8: half the memory, negligible quality impact
llama-server -m model-q4_k_m.gguf \
--ctx-size 32768 \
--cache-type-k q8_0 \
--cache-type-v q8_0
A Realistic Build Guide per VRAM Budget
Combining weight and KV cache arithmetic yields realistic configurations per memory budget.
| Memory budget | Sensible setup | Context headroom | Use case feel |
|---|---|---|---|
| 8GB | 7-8B model Q4 | ~8k | Light assistance, summaries, classification |
| 12GB | 8B Q6 or 14B Q4 | 8k-16k | The floor for daily coding assistance |
| 16GB | 14B Q4 + KV q8 | 16k-32k | The practical zone for coding/docs work |
| 24GB | 32B Q4 or 14B high precision | 32k | First tier where local alone suffices |
| 48GB (2 cards or unified) | 70B-class Q4 | 16k-32k | Sharply reduces API dependence |
| 96GB+ (unified) | 70B high precision, large MoE Q4 | 64k+ | Endgame local workstation |
Two corrections apply. First, the context headroom column assumes the KV cache is quantized to q8. Second, MoE models have far fewer active parameters than their total, so generation is much faster — on unified-memory machines a large MoE at Q4 often feels better than a dense model of the same class.
Model family selection, simplified: for coding assistance prefer code-specialized fine-tunes; if multilingual quality matters, prefer families with strong multilingual tokenizers; for agent use, shortlist models with official tool-calling support — and always compare directly on 10 to 20 samples of your own work. Leaderboard rankings are reference material, nothing more.
Offloading Strategies — Fighting a VRAM Shortage
The Orthodox Route: Per-Layer CPU Offloading
The signature feature of llama.cpp: load only some model layers onto the GPU and run the rest from CPU memory.
# A 70B Q4 model (~40GB) on a 24GB GPU:
# 48 of 80 layers on GPU, the rest on CPU
llama-server -m llama-70b-q4_k_m.gguf \
--n-gpu-layers 48 \
--ctx-size 8192 \
--threads 16
The rule of thumb is not that speed scales with the fraction of layers on GPU, but that the layers left on the CPU dominate total speed. Offload half and you converge to CPU speed no matter how fast the GPU is. Still, the value of running a model that otherwise would not run at all is enormous. For MoE models, the option to send only expert weights to the CPU works well — only the active experts are used at any moment, so the perceived slowdown is small.
# MoE model: shared weights (attention etc.) on GPU, expert FFNs on CPU
llama-server -m moe-model-q4.gguf \
--n-gpu-layers 99 \
--n-cpu-moe 30
The Contrarian Route: Spare VRAM as System Swap — nbd-vram
nbd-vram, which lit up Hacker News in 2026, flips the premise. Instead of borrowing system memory because VRAM is short, it exposes idle VRAM as a Linux block device usable as swap or a ramdisk. Using the network block device (NBD) protocol, GPU memory is wrapped so the kernel sees it as just a very fast disk.
Usual thinking: model bigger than VRAM -> offload to RAM/disk
nbd-vram: RAM is short -> use spare VRAM as swap
+--------+ NBD protocol +-----------------+
| Linux | <-------------------> | GPU VRAM |
| kernel | (block device) | (CUDA buffers) |
| swap | | |
+--------+ +-----------------+
Going through PCIe makes it slower than real RAM but much faster than NVMe swap. Many found it surprisingly practical for memory-hungry jobs (large data processing, compilation) on workstations where a gaming GPU sits idle. Combined with local LLMs, you can even get the curious setup where pages of a giant CPU-inferenced model land in VRAM swap. More important than the technical fit is the attitude the project demonstrates: VRAM, RAM, and disk are not fixed roles but memory tiers of different speeds — and we decide the combination.
Serving Configuration Examples
llama.cpp Server — For Single-Machine Power Users
# Build (CUDA example)
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# Launch the OpenAI-compatible server
./build/bin/llama-server \
-m models/qwen2.5-14b-q4_k_m.gguf \
--host 0.0.0.0 --port 8080 \
--ctx-size 16384 \
--n-gpu-layers 99 \
--flash-attn \
--cache-type-k q8_0 --cache-type-v q8_0 \
--parallel 2
Ollama — The Fastest Starting Point
# After install, pulling and running a model is one line each
ollama pull qwen2.5:14b
ollama run qwen2.5:14b
# Custom settings via a Modelfile
cat > Modelfile <<'EOF'
FROM qwen2.5:14b
PARAMETER num_ctx 16384
PARAMETER temperature 0.7
SYSTEM You are a coding assistant that answers concisely.
EOF
ollama create my-coder -f Modelfile
ollama run my-coder
Ollama uses llama.cpp internally, so performance characteristics are similar, but its defaults are conservative. In particular the default num_ctx is small, creating the trap of silently truncated context in long conversations — raise it explicitly.
vLLM — For Team Servers with Concurrent Requests
# docker-compose.yml
services:
vllm:
image: vllm/vllm-openai:latest
ports:
- "8000:8000"
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
volumes:
- ./models:/root/.cache/huggingface
command: >
--model Qwen/Qwen2.5-14B-Instruct-AWQ
--quantization awq
--max-model-len 16384
--gpu-memory-utilization 0.92
--max-num-seqs 16
The key vLLM flags mean:
- gpu-memory-utilization: the fraction of total VRAM vLLM claims. Everything left after weights becomes the KV cache pool.
- max-model-len: maximum sequence length. Directly tied to KV cache arithmetic, so set it conservatively for your workload.
- max-num-seqs: cap on concurrently processed sequences. The dial trading throughput against per-request latency.
Multi-GPU — When One Card Is Not Enough
Running 70B on two 24GB cards is now commonplace. There are two approaches.
| Approach | Principle | Pros | Cons |
|---|---|---|---|
| Tensor parallel (TP) | Cards share each layer's matrices | Also reduces latency | Frequent inter-card traffic, identical cards recommended |
| Pipeline parallel (PP) | Layers split into stages | Mixed card models possible | Single-request latency not improved |
# vLLM: 2-way tensor parallel
vllm serve Qwen/Qwen2.5-72B-Instruct-AWQ \
--quantization awq \
--tensor-parallel-size 2 \
--max-model-len 8192
# llama.cpp: per-card split ratio (mixed-card example)
llama-server -m llama-70b-q4_k_m.gguf \
--n-gpu-layers 99 \
--split-mode layer \
--tensor-split 60,40
On consumer motherboards, PCIe lanes easily become the bottleneck. Tensor parallelism is communication-heavy and wants x8/x8 or better; with fewer lanes, pipeline (layer) splitting is the safer bet.
Performance Measurement Methodology — Numbers, Not Vibes
Local LLM tuning without measurement is superstition. Four standard metrics:
TTFT (Time To First Token) : time to the first token; reflects prompt processing
TPOT (Time Per Output Token): time per generated token; inverse of generation speed
pp speed (prompt processing) : prompt tokens/sec; dominates long-input workloads
tg speed (token generation) : generated tokens/sec; dominates chat workloads
llama.cpp ships a dedicated benchmark tool:
# Measure a 512-token prompt + 128-token generation scenario
./build/bin/llama-bench \
-m models/qwen2.5-14b-q4_k_m.gguf \
-p 512 -n 128 \
-ngl 99
# Example output columns: config, pp512, tg128 tokens/sec
# Sweep quantization level, ngl, and thread count to build a table:
# that is your machine's quality-speed curve
Load-test the whole server while varying concurrency:
# vLLM built-in benchmark: TTFT/TPOT across request rates
vllm bench serve \
--backend openai \
--base-url http://localhost:8000 \
--model Qwen/Qwen2.5-14B-Instruct-AWQ \
--num-prompts 200 \
--request-rate 4
Three principles for measurement:
- Warm up first: the first request mixes in model loading and kernel compilation and is always slow.
- Match your own workload distribution: a short-prompt benchmark says nothing about RAG performance. Match input/output length distributions to reality.
- p95, not averages: in agent pipelines the slowest call delays everything. Watch the tail latency.
Common Pitfalls
Finally, the traps that come up repeatedly in the community.
- Forgetting the context limit: default context sizes are short. As conversations grow, the beginning silently falls off and the model suddenly seems to have become stupid. Set ctx-size explicitly and budget the corresponding KV cache memory.
- Subtle quality collapse from over-quantization: Q2 to Q3 quantization looks fine on short answers but falls apart on long reasoning chains and code generation. Compare quantization levels on an eval set from your own tasks.
- Benchmark-vs-feel divergence: high tg tokens/sec means little if pp is slow — RAG will still feel sluggish. Always read both numbers together.
- Neglected sampling parameters: every model has its own recommended temperature and top-p. Many people run defaults and blame the model.
- Ignoring power and heat: for 24/7 servers, slightly lowering the power limit trades a few percent of tokens/sec for major reductions in power and heat.
- Neglected security: expose an unauthenticated server on 0.0.0.0 and anyone on the network can use your GPU. Set at least an API key and firewall rules.
- Model files of unknown origin: anyone can upload quantized models. In an era of routine supply-chain attacks, using well-downloaded repositories from verified uploaders and checking hashes is just as valid locally.
- Not tracking updates: llama.cpp and vLLM improve noticeably every few weeks. Conclusions drawn from a six-month-old benchmark may already be stale.
Closing
Local LLM inference optimization ultimately converges on one question: how do I allocate quality, speed, and context length within my memory budget? Quantization shrinks the weight budget, KV cache quantization and context limits govern the cache budget, and offloading borrows budget from elsewhere. Contrarian hacks like nbd-vram show that the partitions in that ledger are far more fluid than they appear.
If you start today, the recommended path:
- Pull a 14B Q4 model with Ollama and simply use it on your own work for a few days
- When context or speed frustrations appear, drop down to llama.cpp and tune KV quantization and offloading
- When concurrent users arrive, move up to vLLM and collect the throughput of continuous batching
- At every step, compare before and after with llama-bench and load tests — in numbers
One more thing: this field moves fast. The optimal setup of today can be stale knowledge next quarter, so automating your measurement scripts and rerunning them periodically is the asset that lasts longest.
Cloud APIs and local models are not substitutes but a division of labor. Sensitive data, repetitive tasks, and offline environments go local; moments that demand top quality go to the API. Being able to draw that boundary with numbers from your own workload — that is the point of all the measuring and arithmetic in this post.
References
- nbd-vram (VRAM as a block device): https://github.com/c0dejedi/nbd-vram
- llama.cpp repository: https://github.com/ggml-org/llama.cpp
- vLLM documentation: https://docs.vllm.ai/
- Ollama: https://ollama.com/
- GGUF format specification: https://github.com/ggml-org/ggml/blob/master/docs/gguf.md
- AWQ paper: https://arxiv.org/abs/2306.00978
- GPTQ paper: https://arxiv.org/abs/2210.17323
- PagedAttention (vLLM) paper: https://arxiv.org/abs/2309.06180
- llama.cpp server docs: https://github.com/ggml-org/llama.cpp/tree/master/tools/server
- Local LLM discussions on Hacker News: https://news.ycombinator.com/
- GeekNews: https://news.hada.io/