Local AI & On-Device LLMs 2026 — Ollama · LM Studio · Jan · Msty · Open WebUI · GPT4All · AnythingLLM · Faraday Deep Dive
Author: Youngju Kim (@fjvbn20031)
Chapter 1 · Why Local AI Matters in 2026
Three years ago, "local LLMs" meant quantizing a 7B model to 4-bit, jamming it into an RTX 3090, and getting roughly half of GPT-3.5 quality. The landscape in May 2026 is unrecognizable.
- An M4 Max MacBook Pro 128GB runs Llama 4 Scout 109B MoE at 24 tokens/second
- An RTX 5090 32GB handles DeepSeek R1 Distill 32B at 12 tokens/second
- iPhone 16 Pro triggers Apple Intelligence's 3B model automatically at OS level
- A Snapdragon X Elite notebook runs Phi Silica 3.8B on the NPU
There are four crisp reasons local AI matters.
- Privacy — inputs never leave the building, which simplifies compliance with GDPR, HIPAA, Korea's PIPA, and Japan's APPI
- Cost — no API bill. Just electricity (negligible on a laptop)
- Offline — planes, subways, cafe Wi-Fi — works without internet
- Experimentation — try a freshly released model within 5 minutes. Fine-tuning, LoRA, and RAG cost nothing beyond your own hardware
This guide covers everything a developer should know to run LLMs on desktop, laptop, and mobile as of May 2026: runtimes, GUIs, backends, quantization formats, model recommendations, and ops know-how.
Chapter 2 · Hardware — The VRAM and Unified Memory Era
The first gate for local LLMs is memory. Rough guidelines:
| Model size | Precision | VRAM/RAM | Notes |
|---|---|---|---|
| 3B | INT4 | 4GB | Mobile / low-end notebooks |
| 7B | INT4 (Q4_K_M) | 8GB | RTX 3060, M1/M2 8GB |
| 7B | INT8 | 12GB | RTX 3060 12GB, M2 16GB |
| 13B | INT4 | 12-14GB | RTX 4070, M2 24GB |
| 32B | INT4 | 22-24GB | RTX 4090, M3 Max 36GB |
| 70B | INT4 | 42-48GB | Dual RTX 5090, M2 Ultra 64GB |
| 70B | INT8 | 80GB+ | A100 80GB, M3 Ultra 192GB |
| 405B | INT4 | 240GB+ | Multi-GPU node, M3 Ultra 192GB pair |
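The figures in the table follow from simple arithmetic: parameter count times bits per weight, plus headroom for the KV cache and runtime buffers. A rough sketch, assuming a flat 20% overhead factor (an illustrative assumption, not a measured value; real overhead grows with context length). The table rounds up to the nearest common hardware tier.

```python
def estimate_memory_gb(params_billion: float, bits_per_weight: float,
                       overhead: float = 1.2) -> float:
    """Rough memory need in GB: weights plus a flat overhead factor.

    overhead=1.2 is an illustrative assumption covering KV cache and
    runtime buffers; long contexts need more.
    """
    weight_gb = params_billion * bits_per_weight / 8  # billions of params * bytes each
    return round(weight_gb * overhead, 1)


for name, params, bits in [("7B INT4", 7, 4), ("32B INT4", 32, 4),
                           ("70B INT4", 70, 4), ("70B INT8", 70, 8)]:
    print(f"{name}: ~{estimate_memory_gb(params, bits)} GB")
```

Pass the total parameter count for MoE models: every expert has to be resident in memory even though only some are active per token.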
NVIDIA vs Apple Silicon
NVIDIA wins on raw memory bandwidth. Single-stream token generation is bound by how fast the weights stream out of GPU memory, so fast GDDR keeps per-token latency low. The RTX 5090 with 32GB GDDR7 has the shortest token latency for 32B-class models.
Apple Silicon's weapon is unified memory. The M3 Ultra Mac Studio has 192GB UMA and can run a 70B model at 16-bit. The NVIDIA equivalent needs two H100 80GB cards (pricing isn't even comparable).
- M4 Max 128GB — up to 109B MoE — about USD 7,000
- M3 Ultra 192GB — 70B BF16 — about USD 9,500
- RTX 5090 32GB — 32B Q4 — about USD 2,200 + the rest of the system
Decision: choose a Mac if you frequently run 70B+ models on a laptop; choose NVIDIA if you run 32B or smaller, want the best cost-performance, and also game.
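A common rule of thumb helps when comparing the two camps: during single-stream decoding every active weight is read once per token, so memory bandwidth divided by bytes read per token gives a ceiling on tokens per second. The sketch below applies that rule; the bandwidth numbers are assumed spec-sheet figures supplied for illustration, not measurements, and the function name is hypothetical.

```python
def max_tokens_per_second(bandwidth_gb_s: float, active_params_billion: float,
                          bits_per_weight: float) -> float:
    """Upper bound on single-stream decode speed.

    Each generated token reads every active weight once, so the ceiling
    is memory bandwidth divided by bytes read per token.
    """
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Assumed bandwidths: ~1792 GB/s (high-end GDDR7 GPU), ~546 GB/s (high-end laptop UMA).
print(round(max_tokens_per_second(1792, 32, 4)))  # 112, dense 32B at 4-bit
print(round(max_tokens_per_second(546, 17, 4)))   # 64, MoE with 17B active at 4-bit
```

Measured throughput lands below this ceiling because of KV-cache reads, kernel overhead, and sampling, but the ratio between machines is a useful first estimate.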
Chapter 3 · Ollama — The Most-Loved Local Runtime
Ollama came out of Y Combinator (W21). MIT license; CLI / REST API + a model registry on top of llama.cpp. GitHub stars as of May 2026: 145,000+.
Install and first run
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Run the daemon
ollama serve
# Pull and run a model
ollama run llama3.3:70b-instruct-q4_K_M
# 7B, snappy
ollama run qwen2.5:7b-instruct
One command, ollama run, downloads the quantized weights, starts inference, and opens a chat. What other runtimes do in five steps, Ollama does in one.
Modelfile — a Dockerfile for models
FROM llama3.3:70b-instruct-q4_K_M
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
SYSTEM """
You are a Korean data engineering assistant. Prefer SQL and PySpark.
"""
ollama create yj-de -f Modelfile
ollama run yj-de
You can package a system prompt + params as a "model". Great when distributing a standard prompt across a team.
Ollama REST API
curl http://localhost:11434/api/chat -d '{
"model": "llama3.3:70b-instruct-q4_K_M",
"messages": [{"role": "user", "content": "Explain Linux memory cache policy"}],
"stream": false
}'
OpenAI-compatible mode is also exposed, so LangChain, LlamaIndex, and the OpenAI SDK just need a base URL swap.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
model="qwen2.5:14b-instruct",
messages=[{"role": "user", "content": "hi"}]
)
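With "stream": true (the default for the native /api/chat endpoint), Ollama answers with newline-delimited JSON chunks: each carries a text fragment in message.content, and the last one has done set to true. A minimal sketch of reassembling such a stream follows. The helper name collect_stream is hypothetical, and the sample lines are hand-written for illustration rather than captured server output.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str]) -> str:
    """Join the message fragments of an NDJSON chat stream into one reply."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # final chunk: no more text follows
    return "".join(parts)


# Hand-written sample shaped like a streamed response.
sample = [
    '{"message": {"role": "assistant", "content": "Page "}, "done": false}',
    '{"message": {"role": "assistant", "content": "cache"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(collect_stream(sample))  # Page cache
```

In real use the lines would come from iterating over the HTTP response body and decoding each line as UTF-8.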
Ollama model registry
Every model is one ollama pull away. Notable tags as of May 2026 (tag names change, so confirm each against the Ollama library):
ollama pull llama3.3:70b-instruct-q4_K_M
ollama pull deepseek-r1:32b-distill-q4_K_M
ollama pull qwen3:14b-instruct
ollama pull phi4:14b
ollama pull gemma3:27b-instruct
ollama pull mistral-small:22b
ollama pull mixtral:8x7b-instruct-q4_K_M
ollama pull deepseek-coder-v2:16b-lite-instruct
ollama pull minicpm3:4b
ollama pull llava:34b
Ollama's limits
- Lean GUI — you'll want a separate client (Open WebUI, Msty, etc.)
- Multi-GPU distribution is limited (vLLM dominates here)
- No fine-tuning tooling — you still need unsloth / axolotl
- Memory management is rough — load two models at once and OOM is common
Even so, "run a local LLM in 5 minutes" has had the same answer for three years: Ollama.
Chapter 4 · LM Studio — GUI-First Desktop
LM Studio is a desktop app from Element Labs (San Francisco). Free, closed-source. macOS / Windows / Linux.
Strengths
- Model browser — Hugging Face search inside the app. Model card, quant options, memory estimate — all on one screen
- Chat UI — multi-session, prompt templates, stop / regenerate
- Local server — exposes an OpenAI-compatible API in one click
- MLX acceleration — auto-selects MLX on Apple Silicon (30-50% faster than llama.cpp)
- Hardware profiler — GPU/CPU split as a slider
Scenario
Best fit for someone who frequently runs two models on a laptop to compare them. With Ollama's CLI you rerun ollama run for every switch; with LM Studio you toggle between models inside one graphical session.
Weaknesses
- Closed source — corporate adoption requires extra security review
- Model directory isn't shared with Ollama — you re-download
- Apple Silicon only on macOS; Intel Mac builds are retired
- Linux builds often trail by one or two releases
Chapter 5 · Jan — Truly Open-Source Desktop
Jan is a 100% open-source (AGPL-3.0) desktop LLM app from Homebrew Research. Electron + TypeScript. GitHub stars as of May 2026: 28,000+.
What stands out
- Plugin marketplace — features toggle as modules (RAG, web search, code interpreter)
- Multiple backends — llama.cpp, MLX, TensorRT, vLLM — all selectable in one app
- Cloud + local — drop in OpenAI / Anthropic / Mistral / Groq keys, mix them in the same UI — "today Claude, yesterday local"
- Data sovereignty — every chat log is local SQLite; analysis and export are free
When to pick
- "I need a ChatGPT-style desktop UI but won't depend on OpenAI"
- "Compare local and cloud in one screen"
- "Enterprise — corporate policy bans closed-source desktop apps"
Jan API
Jan also exposes an OpenAI-compatible API.
# Default port
curl http://localhost:1337/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.3-70b-q4",
"messages": [{"role": "user", "content": "hello"}]
}'
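Because Jan and Ollama both speak the OpenAI-style protocol, switching between them reduces to swapping a base URL. A small standard-library sketch of that idea; the two base URLs come from the examples in this guide, and chat_request is a hypothetical helper that builds the request without sending it.

```python
import json
import urllib.request

# Base URLs taken from the examples in this guide.
BACKENDS = {
    "ollama": "http://localhost:11434/v1",
    "jan": "http://localhost:1337/v1",
}


def chat_request(backend: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{BACKENDS[backend]}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = chat_request("jan", "llama3.3-70b-q4", "hello")
print(req.full_url)  # http://localhost:1337/v1/chat/completions
```

Once the chosen server is running, urllib.request.urlopen(req) sends the request; only the dictionary key changes when you move between runtimes.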
Chapter 6 · Msty — The Closed-Source Standout
Msty is a solo-developer desktop app. Free for personal, paid team license. macOS / Windows / Linux. Closed source but very highly rated.
Differentiators
- Branch chat — fork off any message and generate two answers in parallel. Comparison is dramatically faster
- Knowledge Stacks — drag a folder / PDF / URL and RAG happens automatically. No separate setup like AnythingLLM
- Workspaces — isolate chats / models / RAG per project. Lightroom's catalog metaphor
- Simultaneous local + cloud — fan one prompt out to Claude / GPT / local Llama
Pricing
- Personal — free
- Pro (personal) USD 99/year — unlimited workspaces, cloud sync
- Team — USD 159/seat/year
Where LM Studio is "model browser + chat," Msty positions as a "research / knowledge workbench."
Chapter 7 · Open WebUI — Self-Hosted ChatGPT
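Open WebUI (formerly Ollama WebUI) is a self-hostable ChatGPT clone started by Tim Jaeryang Baek. Originally MIT; releases since 2025 use a BSD-3-based Open WebUI License with a branding clause, so read it before rebranding. Python (FastAPI) + Svelte. GitHub stars: 78,000+.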
Why it's popular
- Auto-detects Ollama — if Ollama runs on the host, models appear automatically
- Multi-user — login, permissions, groups, per-model access
- Built-in RAG — upload docs → vector search → context injection
- Voice I/O — Whisper (STT) + Piper / Cartesia / ElevenLabs (TTS)
- Function calling (Tools) — JS/Python functions invoked by the model
- Pipelines — middleware pattern for logging, filtering, multi-model routing
- One-line Docker install
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Browse to http://localhost:3000 — near-identical UI to ChatGPT. Drop this on the company GPU server and the whole company uses it. No data leaves the building.
Ops tips
- Switch the backend to Postgres + Redis for multi-node scaling
- If Ollama is on the same host:
OLLAMA_BASE_URL=http://host.docker.internal:11434
- vLLM and LM Studio are also OpenAI-compatible — same wiring
Chapter 8 · LibreChat — Multi-Provider Chat
LibreChat leans on cloud integration more than Open WebUI. OpenAI, Anthropic, Google, Mistral, Ollama, vLLM, llama.cpp — all in one screen.
Features
- Plugin system (DALL-E, Wolfram, Zapier)
- Compare mode — fan a prompt to N models simultaneously
- Assistants API compatibility
- Full i18n (Korean / Japanese / Chinese included)
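Compare mode boils down to fanning one prompt out to several backends concurrently and collecting the replies side by side. The sketch below shows that pattern with a thread pool. It is not LibreChat's implementation, and the two stub callables are placeholders for real chat-endpoint calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def compare(prompt: str, models: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Send one prompt to every model concurrently and collect replies by name."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in models.items()}
        return {name: fut.result() for name, fut in futures.items()}


# Stand-in callables; real ones would call a local or cloud chat endpoint.
stubs = {
    "local-llama": lambda p: f"[llama] {p.upper()}",
    "cloud-model": lambda p: f"[cloud] {p[::-1]}",
}
for name, reply in compare("hello", stubs).items():
    print(name, "->", reply)
```

Threads suit this job because each call spends its time waiting on network or GPU I/O rather than on Python code.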
When to use
- "The company mixes cloud and local. I don't want two UIs"
- "Internal unified UI in place of ChatGPT Pro"
- "Enterprise SSO / SAML needed"
Chapter 9 · GPT4All — Nomic's Local LLM
GPT4All is run by Nomic AI (known for Atlas embedding visualization). Desktop app + Python SDK. MIT.
from gpt4all import GPT4All
model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")
resp = model.generate("Why use a local LLM?", max_tokens=200)
print(resp)
Strengths
- CPU-first design — works decently without a GPU
- LocalDocs — folder RAG out of the box
- Desktop + SDK integration — RAG collections built in the desktop GUI are accessible from Python
Weaknesses
- New model support lags Ollama (no Llama 4 as of May 2026)
- 5-10% slower than direct llama.cpp
Chapter 10 · AnythingLLM — Local RAG Powerhouse
AnythingLLM is a full-stack RAG desktop/Docker app from Mintplex Labs (Boston). MIT, Node.js + React. Desktop and Docker self-host builds both ship.
Core components
- Workspaces — a bundle of docs, chats, embeddings, model config
- Agents — function calls, web search, code execution
- Multiple LLM backends — Ollama / LM Studio / OpenAI / Anthropic / Mistral / Together
- Embedding backends — sentence-transformers, OpenAI, Cohere, Ollama nomic-embed
- Built-in vector DB — LanceDB by default; Chroma / Pinecone / Weaviate / Qdrant optional
- Document connectors — PDF, DOCX, MD, GitHub repos, Confluence, Notion, web crawler
Scenario — internal wiki bot
1. Run AnythingLLM Docker
2. Create workspace "engineering-wiki"
3. Connect the Confluence connector and index (auto re-index every 24h)
4. Set the model to qwen2.5:14b via Ollama
5. Call via Slack bot or Open WebUI API
One of the fastest paths to corporate full-stack RAG.
Chapter 11 · PrivateGPT, Khoj, Reor — Specialized Tools
PrivateGPT
PrivateGPT was started by Iván Martínez. Python-based. The goal: 100% local RAG with zero external API. Common in security / regulated industries. Somewhat heavy (model + embedding + vector DB in one process).
Khoj
Khoj is a "personal AI assistant" from Khoj Inc. It indexes notes (Obsidian, Notion), email, calendar — and chats over them.
- macOS / Windows / Linux desktop
- iOS / Android apps
- Self-host Docker option
Reor
Reor is an "AI-native notes app." Markdown notes similar to Obsidian, but automatic embeddings connect every note semantically. All inference and embedding is local.
Chapter 12 · Faraday, Pinokio, Chatbox
Faraday (legacy)
Faraday.dev was a desktop app centered on character chat. Effectively dormant as of May 2026; users migrated to SillyTavern and AI Horde. Mentioned for historical context.
Pinokio
Pinokio is "a package manager for AI scripts." One-click install/run for ComfyUI, AUTOMATIC1111, Whisper, Bark. JSON-based recipe system.
Use cases:
- Try image / voice / video tools quickly
- Share a ComfyUI workflow with a friend
- Automate demo environment setup
Chatbox
Chatbox is a multi-platform chat UI. iOS, Android, macOS, Windows, Linux, Web. OpenAI / Claude / Gemini / Ollama backends. Closed source but popular for travel because of strong mobile support.
Page Assist
Page Assist is a Chrome extension that lets you ask Ollama about the current page. Side-panel chat, context-menu summarize. Light RAG.
Chapter 13 · Backend Engines — llama.cpp / MLX / vLLM / TensorRT
llama.cpp
Georgi Gerganov's 2023 C++ inference engine. The foundation of Ollama, LM Studio, Jan, GPT4All. Supports CPU and GPU (CUDA, Metal, ROCm, Vulkan, SYCL).
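# Build from source (CMake; the old Makefile targets were removed)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build                  # macOS: Metal is enabled by default
cmake -B build -DGGML_CUDA=ON   # Linux + NVIDIA: use this configure line instead
cmake --build build --config Release -j 8
# Run (binaries are written to build/bin)
./build/bin/llama-cli -m models/qwen2.5-14b-instruct-q4_k_m.gguf -p "hi"
./build/bin/llama-server -m models/llama-3.3-70b-q4_k_m.gguf --port 8080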
Build-from-source is 10-20% faster than Ollama and exposes more flags. Downside: manual model download/management.
MLX-LM
Apple Silicon only. MLX is a NumPy-style tensor library from Apple's ML research team. MLX-LM is the LLM inference layer on top.
pip install mlx-lm
mlx_lm.generate --model mlx-community/Llama-3.3-70B-Instruct-4bit --prompt "hi"
mlx_lm.server --model mlx-community/Qwen2.5-14B-Instruct-4bit --port 8080
30-50% faster than llama.cpp's Metal backend on M3/M4. That's why LM Studio auto-selects MLX. Downside: Apple Silicon only, no NVIDIA / AMD.
vLLM / SGLang / TGI
Server-class. Serves one model to many concurrent requests (PagedAttention, continuous batching). Overkill for a single-user laptop, but the right answer for "internal LLM serving ten people." Covered in depth in a separate post; brief mention here.
pip install vllm
vllm serve Qwen/Qwen2.5-14B-Instruct --port 8080
TensorRT-LLM
NVIDIA only. CUDA-optimized inference. Maximum throughput on H100 / B200 / RTX 5090. Build process is complex but the throughput is unbeaten in production.
Llamafile
Mozilla's Llamafile bundles llama.cpp + a model into a single executable. Same file runs on macOS, Linux, Windows. Great for multi-OS demos and air-gapped environments.
chmod +x llava-v1.5-7b-q4.llamafile
./llava-v1.5-7b-q4.llamafile --server
Chapter 14 · Quantization Formats — GGUF / AWQ / GPTQ / EXL / MXFP4 / BitNet
Source models are typically BF16 (2 bytes/param). A 7B model is 14GB. Too heavy for a laptop. Quantization trades precision for footprint.
GGUF (llama.cpp standard)
- Q2_K (smallest, low quality, rarely used)
- Q3_K_M (3-bit, 7B becomes 3GB — mobile)
- Q4_K_M (4-bit, the sweet spot, most-used)
- Q5_K_M (5-bit, better quality)
- Q6_K (6-bit, near-BF16)
- Q8_0 (8-bit, virtually no quality difference vs BF16, half the memory)
- FP16 / BF16 (not quantized, original)
Q4_K_M shrinks a 7B to about 4.5GB and costs 2-3% perplexity. Overwhelming default.
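To see where both the footprint savings and the small quality loss come from, here is a deliberately simplified 4-bit block quantizer. It is not the Q4_K_M algorithm (k-quants use nested block scales and mixed precision); it only illustrates the core idea of storing small integers plus one scale per block. Function names are hypothetical.

```python
from typing import List, Tuple


def quantize_block(weights: List[float]) -> Tuple[float, List[int]]:
    """Symmetric 4-bit quantization: one float scale plus integers in [-7, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs else 1.0
    return scale, [max(-7, min(7, round(w / scale))) for w in weights]


def dequantize_block(scale: float, q: List[int]) -> List[float]:
    """Recover approximate weights from the scale and the 4-bit integers."""
    return [scale * v for v in q]


block = [0.12, -0.70, 0.33, 0.06, -0.41, 0.68, -0.02, 0.27]
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
worst = max(abs(a - b) for a, b in zip(block, restored))
print(q)                           # small integers, 4 bits each
print(f"max error: {worst:.3f}")   # never more than scale / 2
```

Each weight shrinks from 16 bits to 4 bits plus a share of the per-block scale, and the rounding error per weight is bounded by half the scale. That bounded error is what shows up as a small perplexity increase.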
AWQ (Activation-aware Weight Quantization)
Common with vLLM and TGI. Faster than GPTQ at similar quality. 4-bit is standard.
GPTQ
Older. Quantize with AutoGPTQ. 4-bit standard. Gradually losing ground to AWQ.
EXL2 / EXL3
ExLlamaV2/V3. NVIDIA RTX specialized. Mixes 4 / 6 / 8-bit within a model — under 1% perplexity hit. ExLlamaV3 shipped late 2025 with better quantization efficiency.
MXFP4
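Microscaling FP4, defined in the Open Compute Project's Microscaling Formats (MX) specification in 2023 and brought into the mainstream when OpenAI shipped gpt-oss with MXFP4 weights in 2025. Hardware-accelerated on NVIDIA Blackwell (B200, RTX 5090). Better quality than INT4 at roughly a quarter of the BF16 footprint.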
BitNet (1.58-bit)
Microsoft research. Weights are -1, 0, +1. Almost no multiplications at inference — very fast. BitNet b1.58 3B and 7B released on Hugging Face in 2026. Experimental but huge potential for embedded / mobile.
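The "almost no multiplications" point can be illustrated with a toy sketch. It assumes the absmean quantizer described for BitNet b1.58 (divide by the mean absolute weight, round, clip to -1..1). The function names are hypothetical, and real kernels pack the weights and use integer activations.

```python
from typing import List, Tuple


def ternarize(weights: List[float]) -> Tuple[float, List[int]]:
    """Absmean quantization: scale by mean |w|, round, clip to {-1, 0, +1}."""
    gamma = sum(abs(w) for w in weights) / len(weights) or 1.0
    return gamma, [max(-1, min(1, round(w / gamma))) for w in weights]


def ternary_dot(q: List[int], x: List[float]) -> float:
    """Dot product with ternary weights: only additions and subtractions."""
    total = 0.0
    for w, v in zip(q, x):
        if w == 1:
            total += v
        elif w == -1:
            total -= v
    return total


gamma, q = ternarize([0.9, -0.1, -1.2, 0.4])
y = gamma * ternary_dot(q, [1.0, 2.0, 3.0, 4.0])  # a single multiply, for the scale
print(q, round(y, 3))
```

The inner loop never multiplies; the only multiplication left is one rescale per output value, which is why ternary weights are attractive for low-power hardware.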
Which to pick
- Desktop / laptop, Ollama / llama.cpp → GGUF Q4_K_M
- vLLM server, NVIDIA → AWQ
- Single NVIDIA, max efficiency → EXL3
- Apple Silicon → MLX 4-bit
Chapter 15 · Recommended Local Models — May 2026
General — Llama 4 Scout 109B MoE
Meta's Llama 4 Scout. 109B total parameters across 16 experts, with 17B active per token, so compute cost is 17B-class while quality is near 70B (memory still has to hold all 109B). 24 tokens/sec on M4 Max 128GB. Context up to 10M tokens.
General (practical) — Llama 3.3 70B
Llama 3.3 70B Instruct. The 70B-class standard. GPT-4 Turbo level. 42GB at Q4_K_M. Dual RTX 5090 or M2 Ultra 64GB.
Reasoning — DeepSeek R1 Distill 32B
DeepSeek R1's Llama / Qwen distill series. 32B Q4 = single RTX 4090. o1-mini-class reasoning. Strong on math, code, logic.
ollama pull deepseek-r1:32b
ollama pull deepseek-r1:7b # for laptops
Multilingual — Qwen 3 14B
Alibaba Qwen 3. Strong in Korean / Chinese / Japanese / English. Often outperforms Llama on Korean text. 14B Q4_K_M on a single RTX 4070 (12GB).
Small-model champion — Phi-4 14B
Microsoft Phi-4. The "data curation is the answer" result. 14B with 70B-class benchmarks. The cost-perf winner for laptops.
Very small — Gemma 3 12B / 4B
Google Gemma 3. 27B / 12B / 4B / 1B lineup. Mobile / embedded / notebook. The 4B is smaller than the 7B class with comparable performance.
Light + multilingual — MiniCPM 3.0 4B
OpenBMB's MiniCPM 3.0. 4B competitive with 8B models. Optimized for mobile / edge.
Code — DeepSeek Coder V2 Lite 16B
DeepSeek Coder V2. 16B MoE (2.4B active). 10GB at Q4. Popular as Continue.dev / Cline backend.
Multimodal — LLaVA 34B, Qwen2-VL 7B, Pixtral 12B
Image + text. LLaVA is the standard, Qwen2-VL is multilingual-strong, Pixtral is Mistral's vision model.
ollama pull llava:34b
ollama pull qwen2-vl:7b
Chapter 16 · Voice Mode — STT + LLM + TTS
STT (speech → text)
- OpenAI Whisper — the standard. base / small / medium / large-v3. large-v3 needs 4GB GPU.
- faster-whisper — CTranslate2 backend. Fast on CPU and GPU.
- whisper.cpp — C++ port, Apple Silicon Metal accelerated.
- Distil-Whisper — Whisper distillation, 6x faster.
TTS (text → speech)
- Piper — Rhasspy project. Fast on CPU, Korean voices available.
- Coqui XTTS v2 — multilingual + voice cloning. (Coqui dissolved 2024, models remain)
- F5-TTS — released late 2024. English / Chinese naturalness near top tier. Voice cloning.
- Kokoro — tiny (82M) English TTS. Real-time on a notebook CPU.
- Cartesia Sonic — commercial API but extremely fast.
Open WebUI voice integration
Settings → Audio
STT: faster-whisper (local) or Whisper API
TTS: Piper (local), Kokoro (local), ElevenLabs (cloud)
Tap the mic, and the STT → LLM → TTS pipeline runs. You can talk to it hands-free, ChatGPT-style, while driving.
Chapter 17 · Code Assistant — Continue.dev + Ollama
Continue.dev
Continue.dev is a VSCode / JetBrains extension. A Cursor / Copilot alternative. Model backend is free choice — local Ollama works.
// ~/.continue/config.json
{
"models": [
{
"title": "Local Coder",
"provider": "ollama",
"model": "deepseek-coder-v2:16b-lite-instruct",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "Tab",
"provider": "ollama",
"model": "qwen2.5-coder:7b"
}
}
Tab autocomplete uses Qwen2.5-Coder 7B (fast); chat uses DeepSeek Coder V2 16B (quality). 100% local, zero API cost, code stays on-device.
Cline + Ollama
Cline (formerly Claude Dev) is agentic: file read/write, command execution, Plan/Act mode. The Ollama backend works, but 70B+ reasoning models are recommended because agent loops are demanding.
aider
aider is a terminal pair programmer. Git-based. Ollama backend.
aider --model ollama/qwen2.5-coder:32b
Chapter 18 · Apple Intelligence — OS-Level On-Device
Apple Intelligence is GA on iOS 18, iPadOS 18, macOS 15 Sequoia, visionOS 2. Two pieces.
- On-device 3B model — runs on Apple Silicon NPU. Notification summaries, Mail reply suggestions, text refinement, Image Playground.
- Private Cloud Compute (PCC) — for bigger workloads, offloaded to Apple Silicon servers. Logs are not persisted; only attested code runs (source published to external security researchers).
Foundation Models framework
import FoundationModels
let session = LanguageModelSession()
let resp = try await session.respond(to: "Summarize my note in 3 lines")
Available on iOS 26+ / macOS 26+ with Apple Intelligence enabled. 3B-bound but free and unlimited.
Limits
- English-first; Korean / Japanese GA in stages through 2025
- 3B is too small for complex tasks — hence the PCC handoff
- Requires iPhone 15 Pro or M1 or later
Chapter 19 · Phi Silica — On-Device AI for Windows 11
Microsoft ships Phi Silica — a 3.8B model — on the NPU of Snapdragon X Elite / Intel Core Ultra / AMD Ryzen AI. Standard on Copilot+ PCs since Windows 11 24H2.
Capabilities
- Summarize / rewrite / translate
- Code assist (Visual Studio integration)
- Image generation (Cocreator)
- Search (Recall — semantic search over captured screens)
Recall was paused over security concerns in 2024 and re-shipped in 2025 with opt-in + E2E encryption.
Developer API
The Microsoft.Windows.AI.Generative namespace in Windows Copilot Runtime. Callable from C# / Rust / C++.
Chapter 20 · Gemini Nano — Android and Chrome
Gemini Nano is Google's smallest Gemini variant. Available on Pixel 8 Pro and up, parts of Galaxy S24+, and Chrome desktop (Canary / Beta + partial stable as of May 2026).
Chrome Built-in AI
// Origin Trial active as of May 2026
const session = await ai.languageModel.create({
systemPrompt: "You are a summarization expert.",
})
const summary = await session.prompt("Summarize this article in 3 lines: ...")
An LLM inside the browser. Zero network calls, zero cost. Web apps can finally lean on an "offline LLM."
Android AICore
val generativeModel = GenerativeModel(modelName = "gemini-nano")
val response = generativeModel.generateContent("summarize")
Chapter 21 · Korean Local AI Ecosystem
Lablup Backend.AI
Lablup's Backend.AI is an LLM training / inference platform. Manages vLLM, Triton, TensorRT on in-house GPU clusters. Many SOE / large-enterprise deployments in Korea in 2026.
Upstage Solar
Upstage's Solar comes in 10.7B / Pro / Mini variants. Solar Mini 2.4B runs locally on laptops — registered in Ollama.
ollama pull upstage/solar-pro-preview
Naver Cloud HyperCLOVA X
Naver's HyperCLOVA X SEED 3B is open-weight (released 2025). Korean-specialized. Registered on Hugging Face — convertible for llama.cpp / Ollama.
KT, SKT, LG
- KT Mi:dm, SKT A.X 4.0 — proprietary 7B models (some weights open)
- LG AI Research EXAONE 3.5 — 2.4B / 7.8B / 32B. Non-commercial license but free for research
ollama pull exaone3.5:7.8b
Chapter 22 · Japanese Local AI Ecosystem
ELYZA
ELYZA (University of Tokyo spinout). Llama-based Japanese fine-tunes. ELYZA-japanese-Llama-3-8B directly in Ollama.
Rinna
Rinna. MS Japan spinout. Japanese GPT, BERT, Llama tunes. Also voice synth / recognition.
Stockmark
Stockmark-100B. Japanese 100B model, business-domain-specialized. Partial weights public.
PFN PLaMo
Preferred Networks's PLaMo. 13B / 100B. PLaMo Lite is open-weight — laptop-local feasible.
CyberAgent CALM
CyberAgent's CALM3 22B. Japanese + dialogue-tuned. Single RTX 4090 at Q4.
Chapter 23 · Ops Know-How — N Models on One GPU
Loading two models on the same GPU often OOMs. Three remedies.
1. Hot-swap (Ollama default)
Ollama's keep_alive controls model retention in memory.
# Unload 30 seconds after last use
ollama run qwen2.5:7b --keepalive 30s
# Keep loaded indefinitely (any negative duration)
ollama run llama3.3:70b --keepalive=-1m
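The same setting is available per request through the REST API's keep_alive field: a duration string such as "30s", 0 to unload immediately, or a negative value to keep the model resident. A small sketch that only builds the request bodies for POST /api/generate; the helper name keep_alive_body is hypothetical.

```python
import json
from typing import Union


def keep_alive_body(model: str, keep_alive: Union[str, int]) -> str:
    """JSON body for POST /api/generate that loads or unloads a model.

    Leaving out "prompt" makes the call a pure load/unload request.
    """
    return json.dumps({"model": model, "keep_alive": keep_alive})


print(keep_alive_body("qwen2.5:7b", "30s"))  # unload 30 seconds after last use
print(keep_alive_body("llama3.3:70b", -1))   # keep loaded indefinitely
print(keep_alive_body("qwen2.5:7b", 0))      # unload right away
```

Sending the zero variant before loading a second large model is a simple way to avoid the two-models-at-once OOM.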
2. Model router
If different services need different models, route via LiteLLM or self-hosted OpenRouter.
# litellm config.yaml
model_list:
  - model_name: chat
    litellm_params:
      model: ollama/qwen2.5:14b
      api_base: http://localhost:11434
  - model_name: code
    litellm_params:
      model: ollama/deepseek-coder-v2:16b
      api_base: http://localhost:11434
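Conceptually the router is a lookup from a stable alias to a concrete backend model. The sketch below mirrors the config above in plain Python to show that mapping. It is not LiteLLM's code, and resolve is a hypothetical helper.

```python
from typing import Tuple

# Mirrors the model_list in the YAML config above.
MODEL_LIST = [
    {"model_name": "chat",
     "litellm_params": {"model": "ollama/qwen2.5:14b",
                        "api_base": "http://localhost:11434"}},
    {"model_name": "code",
     "litellm_params": {"model": "ollama/deepseek-coder-v2:16b",
                        "api_base": "http://localhost:11434"}},
]


def resolve(alias: str) -> Tuple[str, str, str]:
    """Map an alias to (provider, model, api_base); raise KeyError if unknown."""
    for entry in MODEL_LIST:
        if entry["model_name"] == alias:
            params = entry["litellm_params"]
            provider, _, model = params["model"].partition("/")
            return provider, model, params["api_base"]
    raise KeyError(f"no route for alias {alias!r}")


print(resolve("code"))
```

Services ask for "chat" or "code" and never see the underlying tag, so swapping the model behind an alias is a one-line config change.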
3. vLLM continuous batching
When many users hit at once, vLLM uses PagedAttention to serve N concurrent requests against one model. Ten people can chat against a single 70B.
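A toy simulation makes the benefit of continuous batching concrete. It assumes each request needs a fixed number of decode steps and that the GPU can process a fixed number of sequences per step. Static batching waits for the whole batch to finish, while continuous batching refills a slot as soon as a sequence ends. This models the scheduling idea only, not vLLM's scheduler or PagedAttention's memory management.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], slots: int) -> int:
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batching_steps(lengths: List[int], slots: int) -> int:
    """A finished request frees its slot for the next waiting one immediately."""
    waiting = deque(lengths)
    running: List[int] = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < slots:
            running.append(waiting.popleft())  # admit into a free slot
        steps += 1                             # one decode step for every running request
        running = [r - 1 for r in running if r > 1]
    return steps


requests = [100, 10, 10, 10, 100, 10, 10, 10]  # tokens to generate per request
print(static_batching_steps(requests, slots=4))      # 200
print(continuous_batching_steps(requests, slots=4))  # 110
```

With mixed short and long requests, the short ones no longer wait for the longest request in their batch, which is where most of the throughput gain comes from.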
Chapter 24 · RAG Patterns — Local Embedding
Embedding models (local)
- nomic-embed-text — 768-dim, English SOTA-ish, registered in Ollama
- mxbai-embed-large — 1024-dim, better quality, slightly slower
- bge-m3 — multilingual strong (Korean / Japanese / Chinese)
- multilingual-e5-large — multilingual / notebook-friendly
ollama pull nomic-embed-text
ollama pull mxbai-embed-large
ollama pull bge-m3
Local vector DBs
- LanceDB — embedded, on-disk, single file. AnythingLLM default.
- ChromaDB — Python lib + server mode
- Qdrant — Rust server, very fast
- Weaviate — full-stack
- Milvus — large-scale
import lancedb
db = lancedb.connect("./data")
table = db.create_table("docs", schema=...)
table.add([{"vector": embed("text"), "text": "text"}])
table.search(embed("query")).limit(5).to_pandas()
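The retrieval step behind any of these vector DBs is nearest-neighbor search over embeddings. A dependency-free sketch of that step follows. Here toy_embed is a hypothetical stand-in (word counts over a tiny fixed vocabulary) so the example runs without a model; a real pipeline would call an embedding model such as the ones listed above.

```python
import math
from typing import List, Tuple

VOCAB = ["gpu", "memory", "quantization", "privacy", "audit", "license"]


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for a real embedding model: vocabulary word counts."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def top_k(query: str, docs: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k documents most similar to the query, best first."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(d)), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


docs = [
    "quantization trades precision for gpu memory",
    "audit logs support privacy reviews",
    "check the license before shipping",
]
for score, doc in top_k("how much gpu memory after quantization", docs):
    print(f"{score:.2f}  {doc}")
```

The top-ranked chunks are then pasted into the prompt as context. A vector DB does the same ranking with an index, so it stays fast across millions of chunks.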
Chapter 25 · Security and Compliance
"Local equals safe?" — No
Local LLMs remove some cloud risks but introduce new ones.
- Prompt injection — hidden "ignore previous instructions" in documents → identical risk locally
- Data leakage — RAG may pull docs the user has no right to
- Model integrity — a Hugging Face download might be backdoored — use official channels only
- Fine-tune leakage — weights tuned on company data may leak PII
Ops guide
- Source models only from the official org (Meta, Microsoft, Google, Alibaba, DeepSeek HF orgs)
- Verify hashes after download
- Internal RAG needs access control (AnythingLLM workspace-scoped)
- Logging and audit — pipe Open WebUI admin logs into your SIEM
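The "verify hashes after download" step can be sketched with the standard library. The file is streamed in chunks so multi-gigabyte weights never have to fit in RAM. The function names are hypothetical, and the demo uses a three-byte temporary file in place of a real model.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 one chunk at a time."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """True when the file's digest matches the published checksum."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demo on a tiny temporary file; b"abc" has a well-known SHA-256 digest.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "weights.bin"
    sample.write_bytes(b"abc")
    ok = verify(sample, "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad")
print(ok)  # True
```

For a real download, compare against the checksum published on the model's official page and refuse to load the file on a mismatch.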
Compliance mapping
| Regulation | Cloud LLM | Local LLM |
|---|---|---|
| GDPR | Transfer requires DPA | No transfer, partially exempt |
| HIPAA | BAA required | Own infrastructure — full control |
| Korea PIPA | Overseas transfer consent | Domestic processing — simple |
| Japan APPI | Consent + safeguards | Same, lower external risk |
| Korea FSI | Cloud security cert mandatory | Self-controlled infra |
Chapter 26 · Conclusion — Local AI as 2026 Table Stakes
Local LLMs were a hobby in 2023, an experiment in 2024, an option in 2025. In 2026 they are a developer's basic skill.
- One laptop + Ollama + Continue.dev → the API bill drops and code never leaves the machine
- In-house GPU server + Open WebUI + AnythingLLM → self-run company ChatGPT
- iPhone + Apple Intelligence → handled by the OS
- Personal notes + Reor / Khoj → semantic search over all notes
A 5-minute workflow to try right now.
# 1. Install Ollama
brew install ollama
# 2. Pull a model
ollama pull qwen2.5:14b-instruct
# 3. Chat
ollama run qwen2.5:14b-instruct
# 4. Spin up Open WebUI (if Docker is on the box)
docker run -d -p 3000:8080 \
-v open-webui:/app/backend/data \
--add-host=host.docker.internal:host-gateway \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000 in a browser and your own ChatGPT runs on a laptop. No data leaves, cost is electricity, and the plane Wi-Fi being down doesn't stop you. That is the May-2026 landscape.
Chapter 27 · References
- Ollama official — https://ollama.com/
- Ollama model library — https://ollama.com/library
- LM Studio — https://lmstudio.ai/
- Jan — https://jan.ai/
- Msty — https://msty.app/
- GPT4All — https://gpt4all.io/
- Open WebUI — https://openwebui.com/
- LibreChat — https://www.librechat.ai/
- AnythingLLM — https://anythingllm.com/
- PrivateGPT — https://privategpt.dev/
- Khoj — https://khoj.dev/
- Reor — https://reor.app/
- Pinokio — https://pinokio.computer/
- Chatbox — https://chatboxai.app/
- llama.cpp — https://github.com/ggml-org/llama.cpp
- MLX-LM — https://github.com/ml-explore/mlx-examples
- Llamafile — https://github.com/Mozilla-Ocho/llamafile
- Continue.dev — https://www.continue.dev/
- Cline — https://cline.bot/
- aider — https://aider.chat/
- Hugging Face — https://huggingface.co/
- Apple Intelligence — https://www.apple.com/apple-intelligence/
- Microsoft Phi Silica — https://learn.microsoft.com/en-us/windows/ai/
- Chrome Built-in AI — https://developer.chrome.com/docs/ai
- Lablup Backend.AI — https://www.lablup.com/
- Upstage Solar — https://www.upstage.ai/
- LG EXAONE — https://www.lgresearch.ai/
- ELYZA — https://elyza.ai/
- Preferred Networks PLaMo — https://www.preferred.jp/
- CyberAgent CALM — https://www.cyberagent.co.jp/