Local AI & On-Device LLMs 2026 — Ollama · LM Studio · Jan · Msty · Open WebUI · GPT4All · AnythingLLM · Faraday Deep Dive

Chapter 1 · Why Local AI Matters in 2026

Three years ago, "local LLMs" meant quantizing a 7B model to 4-bit, jamming it into an RTX 3090, and getting roughly half of GPT-3.5 quality. The landscape in May 2026 is unrecognizable.

An M4 Max MacBook Pro 128GB runs Llama 4 Scout 109B MoE at 24 tokens/second
An RTX 5090 32GB handles DeepSeek R1 Distill 32B at 12 tokens/second
An iPhone 16 Pro invokes Apple Intelligence's 3B model automatically at the OS level
A Snapdragon X Elite notebook runs Phi Silica (about 3.3B parameters) on the NPU

There are four crisp reasons local AI matters.

Privacy — inputs never leave the building, which greatly simplifies compliance with GDPR, HIPAA, Korea's PIPA, and Japan's APPI (it does not remove the obligations themselves)
Cost — no API bill. Just electricity (negligible on a laptop)
Offline — planes, subways, cafe Wi-Fi — works without internet
Experimentation — try a freshly released model within 5 minutes. Fine-tuning, LoRA, and RAG cost nothing beyond your own hardware

This guide covers everything a developer should know to run LLMs on desktop / laptop / mobile as of May 2026. Runtimes, GUIs, backends, quantization formats, model recommendations, and ops know-how.

Chapter 2 · Hardware — The VRAM and Unified Memory Era

The first gate for a local LLM is memory. Rough guidelines:

Model size	Precision	VRAM/RAM	Notes
3B	INT4	4GB	Mobile / low-end notebooks
7B	INT4 (Q4_K_M)	8GB	RTX 3060, M1/M2 8GB
7B	INT8	12GB	RTX 3060 12GB, M2 16GB
13B	INT4	12-14GB	RTX 4070, M2 24GB
32B	INT4	22-24GB	RTX 4090, M3 Max 36GB
70B	INT4	42-48GB	Dual RTX 5090, M2 Ultra 64GB
70B	INT8	80GB+	A100 80GB, M3 Ultra 192GB
405B	INT4	240GB+	Multi-GPU node, M3 Ultra 192GB pair
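The figures in the table follow from simple arithmetic: parameter count times bits per weight, plus headroom for the KV cache and runtime buffers. The sketch below is a rough estimator, not a measurement. The flat 20% overhead is an assumption chosen for illustration, and real quantized files store scales next to the weights, so they run somewhat larger than the nominal bit width suggests.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def estimate_memory_gb(params_billion: float, bits_per_weight: float,
                       overhead: float = 0.20) -> float:
    """Weights plus a flat allowance for KV cache and buffers.

    The 20% default is an illustrative assumption; long contexts need more.
    """
    return weight_gb(params_billion, bits_per_weight) * (1 + overhead)


if __name__ == "__main__":
    rows = [("7B INT4", 7, 4), ("32B INT4", 32, 4),
            ("70B INT4", 70, 4), ("70B INT8", 70, 8)]
    for name, params, bits in rows:
        print(f"{name}: ~{estimate_memory_gb(params, bits):.0f} GB")
```

Treat the result as a floor when shopping for hardware: context length, batch size, and the runtime's own buffers all push real usage upward.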

NVIDIA vs Apple Silicon

NVIDIA wins on raw memory bandwidth and compute. Token generation is mostly memory-bandwidth-bound, so fast GDDR translates directly into low per-token latency. An RTX 5090 with 32GB GDDR7 has the shortest token latency for 32B-class models.

Apple Silicon's weapon is unified memory. The M3 Ultra Mac Studio has 192GB UMA and can run a 70B model at 16-bit. The NVIDIA equivalent needs two H100 80GB cards (pricing isn't even comparable).

M4 Max 128GB — up to 109B MoE — about USD 7,000
M3 Ultra 192GB — 70B BF16 — about USD 9,500
RTX 5090 32GB — 32B Q4 — about USD 2,200 + the rest of the system

Decision: pick a Mac if you frequently run 70B+ models on a laptop; pick NVIDIA if you stay at 32B or smaller, want the best cost-performance, and also game on the machine.
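Single-user token generation is usually limited by how fast the weights can be streamed from memory, so a back-of-envelope ceiling is bandwidth divided by the bytes read per token. The sketch below assumes every generated token reads all active weights exactly once and ignores KV-cache traffic and compute. The bandwidth values in the demo (about 546 GB/s for a top M4 Max, about 1,792 GB/s for an RTX 5090) are approximate published specs used as illustrative inputs.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, active_params_billion: float,
                       bits_per_weight: float) -> float:
    """Upper bound on tokens/second for a bandwidth-bound decoder.

    For an MoE model pass the *active* parameter count, since only the
    routed experts are read for each token.
    """
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    # Illustrative inputs only; measured speeds land well below these ceilings.
    print(f"M4 Max, 17B active @ 4-bit: {decode_ceiling_tps(546, 17, 4):.0f} tok/s ceiling")
    print(f"RTX 5090, 32B dense @ 4-bit: {decode_ceiling_tps(1792, 32, 4):.0f} tok/s ceiling")
```

The same arithmetic explains why a 109B MoE with 17B active parameters can feel like a mid-size model on a Mac: it needs the memory of the full model but only the bandwidth of the active slice.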

Chapter 3 · Ollama — The Most-Loved Local Runtime

Ollama came out of Y Combinator (W21 batch). MIT-licensed, it layers a CLI, a REST API, and a model registry on top of llama.cpp. GitHub stars as of May 2026: 145,000+.

Install and first run

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Run the daemon
ollama serve

# Pull and run a model
ollama run llama3.3:70b-instruct-q4_K_M

# 7B, snappy
ollama run qwen2.5:7b-instruct

One command, ollama run, downloads the quantized weights, starts the inference server, and opens a chat. What other runtimes do in five steps, Ollama does in one.

Modelfile — a Dockerfile for models

FROM llama3.3:70b-instruct-q4_K_M
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
SYSTEM """
You are a Korean data engineering assistant. Prefer SQL and PySpark.
"""

ollama create yj-de -f Modelfile
ollama run yj-de

You can package a system prompt + params as a "model". Great when distributing a standard prompt across a team.

Ollama REST API

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.3:70b-instruct-q4_K_M",
  "messages": [{"role": "user", "content": "Explain Linux memory cache policy"}],
  "stream": false
}'
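With "stream": true the same endpoint returns newline-delimited JSON, one object per fragment, ending with an object whose done field is true. The sketch below uses only the standard library and separates parsing from the network call so the parser can be exercised without a running server. The function names are mine, and the URL assumes the default port from the curl example.

```python
import json
import urllib.request
from typing import Iterable, Iterator

OLLAMA_URL = "http://localhost:11434/api/chat"  # default port, as in the curl example


def iter_chunks(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited JSON chat stream."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank keep-alive lines
        event = json.loads(raw)
        if event.get("done"):
            break  # final object carries stats, not text
        yield event.get("message", {}).get("content", "")


def chat_stream(model: str, prompt: str) -> Iterator[str]:
    """POST a streaming chat request and yield fragments as they arrive."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        # An HTTP response object iterates line by line.
        yield from iter_chunks(resp)
```

With the daemon running, looping over chat_stream("qwen2.5:7b-instruct", "hi") and printing each piece with end="" reproduces the typing effect of the CLI.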

An OpenAI-compatible endpoint is also exposed, so LangChain, LlamaIndex, and the OpenAI SDK only need a base URL swap.

from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="qwen2.5:14b-instruct",
    messages=[{"role": "user", "content": "hi"}]
)

Ollama model registry

One line via ollama pull. Important May 2026 tags (tag names change over time, so confirm them in the Ollama library before scripting against them).

ollama pull llama3.3:70b-instruct-q4_K_M
ollama pull deepseek-r1:32b
ollama pull qwen3:14b
ollama pull phi4:14b
ollama pull gemma3:27b
ollama pull mistral-small:22b
ollama pull mixtral:8x7b
ollama pull deepseek-coder-v2:16b
ollama pull minicpm3:4b
ollama pull llava:34b

Ollama's limits

Lean GUI — you'll want a separate client (Open WebUI, Msty, etc.)
Multi-GPU distribution is limited (vLLM dominates here)
No fine-tuning tooling — you still need unsloth / axolotl
Memory management is rough — load two models at once and OOM is common

Even so, "run a local LLM in 5 minutes" has had the same answer for three years: Ollama.

Chapter 4 · LM Studio — GUI-First Desktop

LM Studio is a desktop app from Element Labs (San Francisco). Free, closed-source. macOS / Windows / Linux.

Strengths

Model browser — Hugging Face search inside the app. Model card, quant options, memory estimate — all on one screen
Chat UI — multi-session, prompt templates, stop / regenerate
Local server — exposes an OpenAI-compatible API in one click
MLX acceleration — auto-selects MLX on Apple Silicon (30-50% faster than llama.cpp)
Hardware profiler — GPU/CPU split as a slider

Scenario

Best fit for someone who frequently runs two models on a laptop to compare them. With the Ollama CLI you re-run ollama run for each model. With LM Studio you toggle between them inside one graphical session.

Weaknesses

Closed source — corporate adoption requires extra security review
Model directory isn't shared with Ollama — you re-download
Apple Silicon only on macOS; Intel Mac builds are retired
Linux builds often trail by one or two releases

Chapter 5 · Jan — Truly Open-Source Desktop

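Jan is a fully open-source desktop LLM app from Menlo Research (formerly Homebrew Computer Company). It launched under AGPL-3.0 on Electron + TypeScript and moved to Apache-2.0 and Tauri during 2025. GitHub stars as of May 2026: 28,000+.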

What stands out

Plugin marketplace — features toggle as modules (RAG, web search, code interpreter)
Multiple backends — llama.cpp, MLX, TensorRT, vLLM — all selectable in one app
Cloud + local — drop in OpenAI / Anthropic / Mistral / Groq keys, mix them in the same UI — "today Claude, yesterday local"
Data sovereignty — every chat log is local SQLite; analysis and export are free

When to pick

"I need a ChatGPT-style desktop UI but won't depend on OpenAI"
"Compare local and cloud in one screen"
"Enterprise — corporate policy bans closed-source desktop apps"

Jan API

Jan also exposes an OpenAI-compatible API.

# Default port
curl http://localhost:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3-70b-q4",
    "messages": [{"role": "user", "content": "hello"}]
  }'

Chapter 6 · Msty — The Closed-Source Standout

Msty is a solo-developer desktop app. Free for personal, paid team license. macOS / Windows / Linux. Closed source but very highly rated.

Differentiators

Branch chat — fork off any message and generate two answers in parallel. Comparison is dramatically faster
Knowledge Stacks — drag a folder / PDF / URL and RAG happens automatically. No separate setup like AnythingLLM
Workspaces — isolate chats / models / RAG per project. Lightroom's catalog metaphor
Simultaneous local + cloud — fan one prompt out to Claude / GPT / local Llama

Pricing

Personal — free
Pro (personal) USD 99/year — unlimited workspaces, cloud sync
Team — USD 159/seat/year

Where LM Studio is "model browser + chat," Msty positions as a "research / knowledge workbench."

Chapter 7 · Open WebUI — Self-Hosted ChatGPT

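Open WebUI (formerly Ollama WebUI) is a self-hostable ChatGPT clone started by Timothy Jaeryang Baek. Python (FastAPI) + Svelte. It began under MIT; releases since 2025 use a BSD-3-based Open WebUI License with a branding clause, so check the terms before rebranding it. GitHub stars: 78,000+.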

Why it's popular

Auto-detects Ollama — if Ollama runs on the host, models appear automatically
Multi-user — login, permissions, groups, per-model access
Built-in RAG — upload docs → vector search → context injection
Voice I/O — Whisper (STT) + Piper / Cartesia / ElevenLabs (TTS)
Function calling (Tools) — JS/Python functions invoked by the model
Pipelines — middleware pattern for logging, filtering, multi-model routing

One-line Docker install

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Browse to http://localhost:3000 — near-identical UI to ChatGPT. Drop this on the company GPU server and the whole company uses it. No data leaves the building.

Ops tips

Switch the backend to Postgres + Redis for multi-node scaling
If Ollama is on the same host: OLLAMA_BASE_URL=http://host.docker.internal:11434
vLLM and LM Studio are also OpenAI-compatible — same wiring

Chapter 8 · LibreChat — Multi-Provider Chat

LibreChat leans on cloud integration more than Open WebUI. OpenAI, Anthropic, Google, Mistral, Ollama, vLLM, llama.cpp — all in one screen.

Features

Plugin system (DALL-E, Wolfram, Zapier)
Compare mode — fan a prompt to N models simultaneously
Assistants API compatibility
Full i18n (Korean / Japanese / Chinese included)

When to use

"The company mixes cloud and local. I don't want two UIs"
"Internal unified UI in place of ChatGPT Pro"
"Enterprise SSO / SAML needed"

Chapter 9 · GPT4All — Nomic's Local LLM

GPT4All is run by Nomic AI (known for Atlas embedding visualization). Desktop app + Python SDK. MIT.

from gpt4all import GPT4All
model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")
resp = model.generate("Why use a local LLM?", max_tokens=200)
print(resp)

Strengths

CPU-first design — works decently without a GPU
LocalDocs — folder RAG out of the box
Desktop + SDK integration — RAG collections built in the desktop GUI are accessible from Python

Weaknesses

New model support lags Ollama (no Llama 4 as of May 2026)
5-10% slower than direct llama.cpp

Chapter 10 · AnythingLLM — Local RAG Powerhouse

AnythingLLM is a full-stack RAG desktop/Docker app from Mintplex Labs (Boston). MIT, Node.js + React. Desktop and Docker self-host builds both ship.

Core components

Workspaces — a bundle of docs, chats, embeddings, model config
Agents — function calls, web search, code execution
Multiple LLM backends — Ollama / LM Studio / OpenAI / Anthropic / Mistral / Together
Embedding backends — sentence-transformers, OpenAI, Cohere, Ollama nomic-embed
Built-in vector DB — LanceDB by default; Chroma / Pinecone / Weaviate / Qdrant optional
Document connectors — PDF, DOCX, MD, GitHub repos, Confluence, Notion, web crawler

Scenario — internal wiki bot

1. Run AnythingLLM Docker
2. Create workspace "engineering-wiki"
3. Connect the Confluence connector and index (auto re-index every 24h)
4. Set the model to qwen2.5:14b via Ollama
5. Call via Slack bot or Open WebUI API

One of the fastest paths to corporate full-stack RAG.

Chapter 11 · PrivateGPT, Khoj, Reor — Specialized Tools

PrivateGPT

PrivateGPT was started by Iván Martínez. Python-based. The goal: 100% local RAG with zero external API. Common in security / regulated industries. Somewhat heavy (model + embedding + vector DB in one process).

Khoj

Khoj is a "personal AI assistant" from Khoj Inc. It indexes notes (Obsidian, Notion), email, calendar — and chats over them.

macOS / Windows / Linux desktop
iOS / Android apps
Self-host Docker option

Reor

Reor is an "AI-native notes app." Markdown notes similar to Obsidian, but automatic embeddings connect every note semantically. All inference and embedding is local.

Chapter 12 · Faraday, Pinokio, Chatbox, Page Assist

Faraday (legacy)

Faraday.dev was a desktop app centered on character chat. Effectively dormant as of May 2026; users migrated to SillyTavern and AI Horde. Mentioned for historical context.

Pinokio

Pinokio is "a package manager for AI scripts." One-click install/run for ComfyUI, AUTOMATIC1111, Whisper, Bark. JSON-based recipe system.

Use cases

Try image / voice / video tools quickly
Share a ComfyUI workflow with a friend
Automate demo environment setup

Chatbox

Chatbox is a multi-platform chat UI. iOS, Android, macOS, Windows, Linux, Web. OpenAI / Claude / Gemini / Ollama backends. Closed source but popular for travel because of strong mobile support.

Page Assist

Page Assist is a Chrome extension that lets you ask Ollama about the current page. Side-panel chat, context-menu summarize. Light RAG.

Chapter 13 · Backend Engines — llama.cpp / MLX / vLLM / TensorRT

llama.cpp

Georgi Gerganov's 2023 C++ inference engine. The foundation of Ollama, LM Studio, Jan, GPT4All. Supports CPU and GPU (CUDA, Metal, ROCm, Vulkan, SYCL).

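# Build from source (CMake; the old Makefile build with LLAMA_* flags is gone)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build                   # macOS: Metal is enabled by default
cmake -B build -DGGML_CUDA=ON    # Linux NVIDIA (use instead of the line above)
cmake --build build --config Release -j 8

# Run (binaries land in build/bin)
./build/bin/llama-cli -m models/qwen2.5-14b-instruct-q4_k_m.gguf -p "hi"
./build/bin/llama-server -m models/llama-3.3-70b-q4_k_m.gguf --port 8080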

Build-from-source is 10-20% faster than Ollama and exposes more flags. Downside: manual model download/management.

MLX-LM

Apple Silicon only. MLX is a NumPy-style tensor library from Apple's ML research team. MLX-LM is the LLM inference layer on top.

pip install mlx-lm
mlx_lm.generate --model mlx-community/Llama-3.3-70B-Instruct-4bit --prompt "hi"
mlx_lm.server --model mlx-community/Qwen2.5-14B-Instruct-4bit --port 8080

30-50% faster than llama.cpp's Metal backend on M3/M4. That's why LM Studio auto-selects MLX. Downside: Apple Silicon only, no NVIDIA / AMD.

vLLM / SGLang / TGI

Server-class. Serves one model to many concurrent requests (PagedAttention, continuous batching). Overkill for a single-user laptop, but the right answer for "internal LLM serving ten people." Covered in depth in a separate post; brief mention here.

pip install vllm
vllm serve Qwen/Qwen2.5-14B-Instruct --port 8080

TensorRT-LLM

NVIDIA only. CUDA-optimized inference. Maximum throughput on H100 / B200 / RTX 5090. Build process is complex but the throughput is unbeaten in production.

Llamafile

Mozilla's Llamafile bundles llama.cpp + a model into a single executable. Same file runs on macOS, Linux, Windows. Great for multi-OS demos and air-gapped environments.

chmod +x llava-v1.5-7b-q4.llamafile
./llava-v1.5-7b-q4.llamafile --server

Chapter 14 · Quantization Formats — GGUF / AWQ / GPTQ / EXL / MXFP4 / BitNet

Source models are typically BF16 (2 bytes/param). A 7B model is 14GB. Too heavy for a laptop. Quantization trades precision for footprint.

GGUF (llama.cpp standard)

Q2_K (smallest, low quality, rarely used)
Q3_K_M (3-bit, 7B becomes 3GB — mobile)
Q4_K_M (4-bit, the sweet spot, most-used)
Q5_K_M (5-bit, better quality)
Q6_K (6-bit, near-BF16)
Q8_0 (8-bit, virtually no quality difference vs BF16, half the memory)
FP16 / BF16 (not quantized, original)

Q4_K_M shrinks a 7B to about 4.5GB and costs 2-3% perplexity. Overwhelming default.
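To make the precision-for-footprint trade concrete, here is a simplified block quantizer in the spirit of GGUF's basic 4-bit formats: each block of 32 weights shares one 16-bit scale and stores 4-bit integers. This is a teaching sketch, not the actual Q4_K_M layout, which uses nested super-blocks and keeps some tensors at higher precision (one reason real Q4_K_M files come out larger than 4.5 bits per weight would suggest).

```python
import random

BLOCK = 32  # weights per block; an assumption modeled on the simplest GGUF 4-bit format


def quantize_block(weights):
    """Map one block of floats to signed 4-bit ints plus a single shared scale."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 7 if absmax else 1.0  # largest magnitude maps to +/-7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, q


def dequantize_block(scale, q):
    """Recover approximate floats from the integers and the block scale."""
    return [scale * v for v in q]


def bits_per_weight(block=BLOCK, scale_bits=16, weight_bits=4):
    """Effective storage cost once the per-block scale is amortized."""
    return weight_bits + scale_bits / block


def max_error(weights):
    """Largest absolute round-trip error within one block."""
    scale, q = quantize_block(weights)
    restored = dequantize_block(scale, q)
    return max(abs(a - b) for a, b in zip(weights, restored))


if __name__ == "__main__":
    rng = random.Random(0)
    block = [rng.gauss(0, 0.02) for _ in range(BLOCK)]
    print(f"bits/weight: {bits_per_weight():.2f}")
    print(f"7B model at that rate: ~{7e9 * bits_per_weight() / 8 / 1e9:.1f} GB")
    print(f"max abs error in one block: {max_error(block):.5f}")
```

The round-trip error is bounded by half the block scale, which is why a single outlier weight hurts: it inflates the scale for all 32 neighbors. The K-quant and activation-aware schemes described in this chapter exist largely to soften that effect.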

AWQ (Activation-aware Weight Quantization)

Common with vLLM and TGI. Faster than GPTQ at similar quality. 4-bit is standard.

GPTQ

Older. Quantize with AutoGPTQ. 4-bit standard. Gradually losing ground to AWQ.

EXL2 / EXL3

ExLlamaV2/V3. NVIDIA RTX specialized. Mixes 4 / 6 / 8-bit within a model — under 1% perplexity hit. ExLlamaV3 shipped late 2025 with better quantization efficiency.

MXFP4

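MXFP4 (Microscaling FP4) is defined in the Open Compute Project's MX specification (2023); OpenAI brought it into the mainstream in 2025 by shipping gpt-oss weights in MXFP4. Hardware-accelerated on NVIDIA Blackwell (B200, RTX 5090). Better quality than INT4 at roughly a quarter of the BF16 footprint.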

BitNet (1.58-bit)

Microsoft research. Weights are -1, 0, +1. Almost no multiplications at inference — very fast. BitNet b1.58 3B and 7B released on Hugging Face in 2026. Experimental but huge potential for embedded / mobile.
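The "almost no multiplications" claim is easy to see in code. The sketch below quantizes a weight row to -1, 0, +1 using an absmean scale (the scheme described for BitNet b1.58, simplified here to one row and plain Python lists) and then computes a dot product with additions and subtractions only. A single multiply by the scale is still needed per output.

```python
def absmean_ternary(weights):
    """Quantize a weight row to {-1, 0, +1} with a shared absmean scale."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return scale, ternary


def ternary_dot(ternary, activations):
    """Dot product against ternary weights using only adds and subtracts."""
    total = 0.0
    for t, x in zip(ternary, activations):
        if t == 1:
            total += x
        elif t == -1:
            total -= x
        # t == 0 contributes nothing, so the activation is skipped entirely
    return total


def ternary_linear(weights, activations):
    """One output neuron: add/subtract pass, then one multiply by the scale."""
    scale, ternary = absmean_ternary(weights)
    return scale * ternary_dot(ternary, activations)
```

On hardware this turns the inner loop of a matrix multiply into integer accumulation, which is where the speed and energy savings for embedded targets come from.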

Which to pick

Desktop / laptop, Ollama / llama.cpp → GGUF Q4_K_M
vLLM server, NVIDIA → AWQ
Single NVIDIA, max efficiency → EXL3
Apple Silicon → MLX 4-bit

Chapter 15 · Recommended Local Models — May 2026

General — Llama 4 Scout 109B MoE

Meta's Llama 4 Scout. 16-expert MoE with 109B total and 17B active params, so inference cost is 17B-class while quality is near 70B. 24 tokens/sec on M4 Max 128GB. Context window up to 10M tokens.

General (practical) — Llama 3.3 70B

Llama 3.3 70B Instruct. The 70B-class standard. GPT-4 Turbo level. 42GB at Q4_K_M. Dual RTX 5090 or M2 Ultra 64GB.

Reasoning — DeepSeek R1 Distill 32B

DeepSeek R1's Llama / Qwen distill series. 32B Q4 = single RTX 4090. o1-mini-class reasoning. Strong on math, code, logic.

ollama pull deepseek-r1:32b
ollama pull deepseek-r1:7b   # for laptops

Multilingual — Qwen 3 14B

Alibaba Qwen 3. Strong in Korean / Chinese / Japanese / English. Often outperforms Llama on Korean text. 14B Q4_K_M on a single RTX 4070 (12GB).

Small-model champion — Phi-4 14B

Microsoft Phi-4. The "data curation is the answer" result. 14B with 70B-class benchmarks. The cost-perf winner for laptops.

Very small — Gemma 3 12B / 4B

Google Gemma 3. 12B / 4B / 1B lineup. Mobile / embedded / notebook. The 4B model sits below the 7B class in size yet performs comparably.

Light + multilingual — MiniCPM 3.0 4B

OpenBMB's MiniCPM 3.0. 4B competitive with 8B models. Optimized for mobile / edge.

Code — DeepSeek Coder V2 Lite 16B

DeepSeek Coder V2. 16B MoE (2.4B active). 10GB at Q4. Popular as Continue.dev / Cline backend.

Multimodal — LLaVA 34B, Qwen2-VL 7B, Pixtral 12B

Image + text. LLaVA is the standard, Qwen2-VL is multilingual-strong, Pixtral is Mistral's vision model.

ollama pull llava:34b
ollama pull qwen2-vl:7b

Chapter 16 · Voice Mode — STT + LLM + TTS

STT (speech → text)

OpenAI Whisper — the standard. base / small / medium / large-v3. large-v3 needs roughly 10GB of VRAM in FP16 (considerably less with quantized runtimes such as faster-whisper).
faster-whisper — CTranslate2 backend. Fast on CPU and GPU.
whisper.cpp — C++ port, Apple Silicon Metal accelerated.
Distil-Whisper — Whisper distillation, 6x faster.

TTS (text → speech)

Piper — Rhasspy project. Fast on CPU, Korean voices available.
Coqui XTTS v2 — multilingual + voice cloning. (Coqui dissolved 2024, models remain)
F5-TTS — released 2025. English / Chinese naturalness near top tier. Voice cloning.
Kokoro — tiny (82M) English TTS. Real-time on a notebook CPU.
Cartesia Sonic — commercial API but extremely fast.

Open WebUI voice integration

Settings → Audio
  STT: faster-whisper (local) or Whisper API
  TTS: Piper (local), Kokoro (local), ElevenLabs (cloud)

Tap the mic and the STT → LLM → TTS pipeline runs, which gives you hands-free, ChatGPT-style voice chat (for example, while driving).

Chapter 17 · Code Assistant — Continue.dev + Ollama

Continue.dev

Continue.dev is a VSCode / JetBrains extension. A Cursor / Copilot alternative. Model backend is free choice — local Ollama works.

// ~/.continue/config.json
{
  "models": [
    {
      "title": "Local Coder",
      "provider": "ollama",
      "model": "deepseek-coder-v2:16b",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Tab",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}

Tab autocomplete uses Qwen2.5-Coder 7B (fast); chat uses DeepSeek Coder V2 16B (quality). 100% local, zero API cost, code stays on-device.

Cline + Ollama

Cline (formerly Claude Dev) is agentic. File read/write, command exec, Plan/Act mode. The Ollama backend works, but 70B+ reasoning models are recommended because agent loops are demanding.

aider

aider is a terminal pair programmer. Git-based. Ollama backend.

aider --model ollama/qwen2.5-coder:32b

Chapter 18 · Apple Intelligence — OS-Level On-Device

Apple Intelligence is GA on iOS 18, iPadOS 18, macOS 15 Sequoia, visionOS 2. Two pieces.

On-device 3B model — runs on Apple Silicon NPU. Notification summaries, Mail reply suggestions, text refinement, Image Playground.
Private Cloud Compute (PCC) — for bigger workloads, offloaded to Apple Silicon servers. Logs are not persisted; only attested code runs (source published to external security researchers).

Foundation Models framework

import FoundationModels
let session = LanguageModelSession()
let resp = try await session.respond(to: "Summarize my note in 3 lines")

Available on iOS 26+ / iPadOS 26+ / macOS 26+ on Apple Intelligence-capable devices (the framework was introduced at WWDC 2025). 3B-bound but free and unlimited.

Limits

English-first; Korean / Japanese GA in stages through 2025
3B is too small for complex tasks — hence the PCC handoff
Requires iPhone 15 Pro or M1 or later

Chapter 19 · Phi Silica — On-Device AI for Windows 11

Microsoft ships Phi Silica, a roughly 3.3B-parameter model, on the NPU of Snapdragon X Elite / Intel Core Ultra / AMD Ryzen AI. Standard on Copilot+ PCs since Windows 11 24H2.

Capabilities

Summarize / rewrite / translate
Code assist (Visual Studio integration)
Image generation (Cocreator)
Search (Recall — semantic search over captured screens)

Recall was paused over security concerns in 2024 and re-shipped in 2025 with opt-in + E2E encryption.

Developer API

The Microsoft.Windows.AI.Generative namespace in Windows Copilot Runtime. Callable from C# / Rust / C++.

Chapter 20 · Gemini Nano — Android and Chrome

Gemini Nano is Google's smallest Gemini variant. Available on Pixel 8 Pro and up, parts of Galaxy S24+, and Chrome desktop (Canary / Beta + partial stable as of May 2026).

Chrome Built-in AI

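// Prompt API. Availability differs by Chrome channel and version, so check the Chrome docs.
// Recent builds expose a global LanguageModel; early trials used ai.languageModel.
const session = await LanguageModel.create({
  initialPrompts: [{ role: "system", content: "You are a summarization expert." }],
})
const summary = await session.prompt("Summarize this article in 3 lines: ...")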

An LLM inside the browser. Zero network calls, zero cost. Web apps can finally lean on an "offline LLM."

Android AICore

val generativeModel = GenerativeModel(modelName = "gemini-nano")
val response = generativeModel.generateContent("summarize")

Chapter 21 · Korean Local AI Ecosystem

Lablup Backend.AI

Lablup's Backend.AI is an LLM training / inference platform. Manages vLLM, Triton, TensorRT on in-house GPU clusters. Many SOE / large-enterprise deployments in Korea in 2026.

Upstage Solar

Upstage's Solar comes in 10.7B / Pro / Mini variants. Solar Mini 2.4B runs locally on laptops — registered in Ollama.

ollama pull upstage/solar-pro-preview

Naver Cloud HyperCLOVA X

Naver's HyperCLOVA X SEED 3B is open-weight (released 2025). Korean-specialized. Registered on Hugging Face — convertible for llama.cpp / Ollama.

KT, SKT, LG

KT Mi:dm, SKT A.X 4.0 — proprietary 7B models (some weights open)
LG AI Research EXAONE 3.5 — 2.4B / 7.8B / 32B. Non-commercial license but free for research

ollama pull exaone3.5:7.8b

Chapter 22 · Japanese Local AI Ecosystem

ELYZA

ELYZA (University of Tokyo spinout). Llama-based Japanese fine-tunes. Llama-3-ELYZA-JP-8B is available for Ollama.

Rinna

Rinna. MS Japan spinout. Japanese GPT, BERT, Llama tunes. Also voice synth / recognition.

Stockmark

Stockmark-100B. Japanese 100B model, business-domain-specialized. Partial weights public.

PFN PLaMo

Preferred Networks's PLaMo. 13B / 100B. PLaMo Lite is open-weight — laptop-local feasible.

CyberAgent CALM

CyberAgent's CALM3 22B. Japanese + dialogue-tuned. Single RTX 4090 at Q4.

Chapter 23 · Ops Know-How — N Models on One GPU

Loading two models on the same GPU often OOMs. Three remedies.

1. Hot-swap (Ollama default)

Ollama's keep_alive setting controls how long a model stays in memory after its last request.

# Unload 30 seconds after last use
ollama run qwen2.5:7b --keepalive 30s

# Keep loaded indefinitely (any negative duration)
ollama run llama3.3:70b --keepalive=-1m
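The same control is available per request through the REST API, which is handy when a script needs to free VRAM before loading a different model. A minimal sketch, assuming the default port: a request with a model name, no prompt, and a keep_alive field changes residency only (0 unloads immediately, -1 pins the model). The helper names are mine.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:11434/api/generate"  # default Ollama port


def keep_alive_payload(model: str, keep_alive) -> bytes:
    """Body for a prompt-less request that only changes model residency.

    keep_alive: a duration string such as "30s", 0 to unload now, -1 to pin.
    """
    return json.dumps({"model": model, "keep_alive": keep_alive}).encode()


def set_residency(model: str, keep_alive) -> int:
    """Send the request and return the HTTP status code."""
    req = urllib.request.Request(
        GENERATE_URL,
        data=keep_alive_payload(model, keep_alive),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Calling set_residency("llama3.3:70b", 0) before starting a second large model is a simple way to avoid the out-of-memory failure this chapter opens with.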

2. Model router

If different services need different models, route via LiteLLM or a similar self-hosted, OpenRouter-style gateway.

# litellm config.yaml
model_list:
  - model_name: chat
    litellm_params:
      model: ollama/qwen2.5:14b
      api_base: http://localhost:11434
  - model_name: code
    litellm_params:
      model: ollama/deepseek-coder-v2:16b
      api_base: http://localhost:11434

3. vLLM continuous batching

When many users hit at once, vLLM combines continuous batching with PagedAttention to serve N concurrent requests against one model. Ten people can chat against a single 70B.
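Why continuous batching helps can be shown with a toy step counter. The sketch below compares static batching, where a batch runs until its longest request finishes, with continuous batching, where a finished request's slot is refilled at the next step. It counts decode steps only and ignores prefill, KV-cache memory, and everything else vLLM's real scheduler handles, so read it as an illustration of the idea rather than a model of vLLM.

```python
def static_batching_steps(lengths, batch_size):
    """Each fixed batch occupies the GPU until its longest request is done."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every decode step."""
    queue = list(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1  # one decode step emits one token per active request
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


if __name__ == "__main__":
    # Hypothetical output lengths (tokens) for six requests, two slots.
    lengths = [2, 8, 2, 2, 8, 2]
    print("static:", static_batching_steps(lengths, 2))
    print("continuous:", continuous_batching_steps(lengths, 2))
```

The gap widens as output lengths become more uneven, which is exactly the situation in chat workloads where one user asks for a sentence and another for an essay.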

Chapter 24 · RAG Patterns — Local Embedding

Embedding models (local)

nomic-embed-text — 768-dim, English SOTA-ish, registered in Ollama
mxbai-embed-large — 1024-dim, better quality, slightly slower
bge-m3 — multilingual strong (Korean / Japanese / Chinese)
multilingual-e5-large — multilingual / notebook-friendly

ollama pull nomic-embed-text
ollama pull mxbai-embed-large
ollama pull bge-m3

Local vector DBs

LanceDB — embedded, on-disk, single file. AnythingLLM default.
ChromaDB — Python lib + server mode
Qdrant — Rust server, very fast
Weaviate — full-stack
Milvus — large-scale

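import lancedb
import ollama  # pip install lancedb ollama; run `ollama pull nomic-embed-text` first

def embed(text):
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

db = lancedb.connect("./data")
# The schema is inferred from the rows; overwrite makes the script re-runnable
table = db.create_table("docs", data=[{"vector": embed("text"), "text": "text"}], mode="overwrite")
hits = table.search(embed("query")).limit(5).to_pandas()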
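Stripped of the database, the retrieval step of RAG is just "embed, rank by similarity, paste the winners into the prompt." The dependency-free sketch below shows that flow. The toy_embed function is a deliberate stand-in (bag-of-words counts) so the example runs anywhere; in practice you would swap in one of the embedding models listed above. All function names here are hypothetical.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: lowercase bag-of-words counts. Replace with a real model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, toy_embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Inject the retrieved context ahead of the user's question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

A vector database replaces the sorted call with an approximate nearest-neighbor index, which matters once the corpus no longer fits a linear scan.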

Chapter 25 · Security and Compliance

"Local equals safe?" — No

Local LLMs remove some cloud risks but introduce new ones.

Prompt injection — hidden "ignore previous instructions" in documents → identical risk locally
Data leakage — RAG may pull docs the user has no right to
Model integrity — a Hugging Face download might be backdoored — use official channels only
Fine-tune leakage — weights tuned on company data may leak PII
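The data-leakage risk has a straightforward structural fix: filter retrieved chunks by the caller's permissions before anything reaches the prompt. The sketch below is a hypothetical illustration of that pattern, not the access-control mechanism of any specific tool; the Chunk type and group names are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A retrieved passage plus the groups allowed to read its source document."""
    text: str
    allowed_groups: set = field(default_factory=set)


def authorized(chunks, user_groups):
    """Keep only chunks that share at least one group with the user."""
    user_groups = set(user_groups)
    return [c for c in chunks if c.allowed_groups & user_groups]


def context_for(chunks, user_groups):
    """Build the context string from permitted chunks only."""
    return "\n".join(c.text for c in authorized(chunks, user_groups))
```

Filtering after retrieval and before prompt assembly means the model never sees restricted text, so no prompt trick can make it reveal that text.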

Ops guide

Source models only from the official org (Meta, Microsoft, Google, Alibaba, DeepSeek HF orgs)
Verify hashes after download
Internal RAG needs access control (AnythingLLM workspace-scoped)
Logging and audit — pipe Open WebUI admin logs into your SIEM
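The "verify hashes" step can be scripted in a few lines. The sketch below streams a file through SHA-256 so multi-gigabyte weights never need to fit in RAM, then compares against the checksum published by the model's source. Where that expected value comes from (for example, a model repository's file details page) is up to your process.

```python
import hashlib
import hmac


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """True when the file's SHA-256 matches the published checksum."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

Running verify on every download in CI, and refusing to load a model when it returns False, turns the guideline into an enforced control.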

Compliance mapping

Regulation	Cloud LLM	Local LLM
GDPR	Transfer requires DPA	No third-party transfer; GDPR still governs the processing itself
HIPAA	BAA required	Own infrastructure — full control
Korea PIPA	Overseas transfer consent	Domestic processing — simple
Japan APPI	Consent + safeguards	Same, lower external risk
Korea FSI	Cloud security cert mandatory	Self-controlled infra

Chapter 26 · Conclusion — Local AI as 2026 Table Stakes

Local LLMs were a hobby in 2023, an experiment in 2024, an option in 2025. In 2026 they are a developer's basic skill.

One laptop + Ollama + Continue.dev → the API bill drops and code stays on the machine
In-house GPU server + Open WebUI + AnythingLLM → self-run company ChatGPT
iPhone + Apple Intelligence → handled by the OS
Personal notes + Reor / Khoj → semantic search over all notes

A 5-minute workflow to try right now.

# 1. Install Ollama
brew install ollama

# 2. Pull a model
ollama pull qwen2.5:14b-instruct

# 3. Chat
ollama run qwen2.5:14b-instruct

# 4. Spin up Open WebUI (if Docker is on the box)
docker run -d -p 3000:8080 \
  -v open-webui:/app/backend/data \
  --add-host=host.docker.internal:host-gateway \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000 in a browser and your own ChatGPT runs on a laptop. No data leaves, cost is electricity, and the plane Wi-Fi being down doesn't stop you. That is the May-2026 landscape.

Chapter 27 · References

Ollama official — https://ollama.com/
Ollama model library — https://ollama.com/library
LM Studio — https://lmstudio.ai/
Jan — https://jan.ai/
Msty — https://msty.app/
GPT4All — https://gpt4all.io/
Open WebUI — https://openwebui.com/
LibreChat — https://www.librechat.ai/
AnythingLLM — https://anythingllm.com/
PrivateGPT — https://privategpt.dev/
Khoj — https://khoj.dev/
Reor — https://reor.app/
Pinokio — https://pinokio.computer/
Chatbox — https://chatboxai.app/
llama.cpp — https://github.com/ggml-org/llama.cpp
MLX-LM — https://github.com/ml-explore/mlx-lm
Llamafile — https://github.com/Mozilla-Ocho/llamafile
Continue.dev — https://www.continue.dev/
Cline — https://cline.bot/
aider — https://aider.chat/
Hugging Face — https://huggingface.co/
Apple Intelligence — https://www.apple.com/apple-intelligence/
Microsoft Phi Silica — https://learn.microsoft.com/en-us/windows/ai/
Chrome Built-in AI — https://developer.chrome.com/docs/ai
Lablup Backend.AI — https://www.lablup.com/
Upstage Solar — https://www.upstage.ai/
LG EXAONE — https://www.lgresearch.ai/
ELYZA — https://elyza.ai/
Preferred Networks PLaMo — https://www.preferred.jp/
CyberAgent CALM — https://www.cyberagent.co.jp/