Local AI & On-Device LLMs 2026 — Ollama · LM Studio · Jan · Msty · Open WebUI · GPT4All · AnythingLLM · Faraday Deep Dive

Chapter 1 · Why Local AI Matters in 2026

Three years ago, "local LLMs" meant quantizing a 7B model to 4-bit, jamming it into an RTX 3090, and getting roughly half of GPT-3.5 quality. The landscape in May 2026 is unrecognizable.

  • An M4 Max MacBook Pro 128GB runs Llama 4 Scout 109B MoE at 24 tokens/second
  • An RTX 5090 32GB handles DeepSeek R1 Distill 32B at 12 tokens/second
  • iPhone 16 Pro triggers Apple Intelligence's 3B model automatically at OS level
  • A Snapdragon X Elite notebook runs Phi Silica 3.3B on the NPU

There are four crisp reasons local AI matters.

  1. Privacy — inputs never leave the building, which removes most third-party and cross-border transfer questions under GDPR, HIPAA, Korea's PIPA, and Japan's APPI
  2. Cost — no API bill. Just electricity (negligible on a laptop)
  3. Offline — planes, subways, cafe Wi-Fi — works without internet
  4. Experimentation — try a freshly released model within 5 minutes. Fine-tuning, LoRA, and RAG experiments carry no per-token fee

This guide covers everything a developer should know to run LLMs on desktop / laptop / mobile as of May 2026. Runtimes, GUIs, backends, quantization formats, model recommendations, and ops know-how.


Chapter 2 · Hardware — The VRAM and Unified Memory Era

The first gate for local LLMs is memory. Rough guidelines:

| Model size | Precision | VRAM/RAM | Notes |
|---|---|---|---|
| 3B | INT4 | 4GB | Mobile / low-end notebooks |
| 7B | INT4 (Q4_K_M) | 8GB | RTX 3060, M1/M2 8GB |
| 7B | INT8 | 12GB | RTX 3060 12GB, M2 16GB |
| 13B | INT4 | 12-14GB | RTX 4070, M2 24GB |
| 32B | INT4 | 22-24GB | RTX 4090, M3 Max 36GB |
| 70B | INT4 | 42-48GB | Dual RTX 5090, M2 Ultra 64GB |
| 70B | INT8 | 80GB+ | A100 80GB, M3 Ultra 192GB |
| 405B | INT4 | 240GB+ | Multi-GPU node, M3 Ultra 192GB pair |
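
The figures in this table can be roughly reproduced from parameter count and bit width. A minimal sketch, assuming a flat 20% overhead for KV cache and runtime buffers; the `overhead` factor and the function names are illustrative assumptions, not measured values:

```python
def weights_gb(params_billion: float, bits: int) -> float:
    """Size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits / 8


def estimate_gb(params_billion: float, bits: int, overhead: float = 0.20) -> float:
    """Weights plus an assumed flat overhead for KV cache and buffers."""
    return weights_gb(params_billion, bits) * (1 + overhead)


def fits(params_billion: float, bits: int, memory_gb: float) -> bool:
    """True if the estimate fits in the given VRAM or unified memory."""
    return estimate_gb(params_billion, bits) <= memory_gb


for size, bits in [(7, 4), (32, 4), (70, 4), (70, 8)]:
    print(f"{size}B @ {bits}-bit: ~{estimate_gb(size, bits):.1f} GB")
```

Long contexts push the real overhead well past a flat percentage, so treat the result as a lower bound when planning for large `num_ctx` values.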

NVIDIA vs Apple Silicon

NVIDIA wins on raw memory bandwidth. Token generation is mostly memory-bandwidth-bound, so when the whole model fits in VRAM, fast GDDR gives the lowest per-token latency. An RTX 5090 with 32GB GDDR7 has the shortest token latency for 32B-class models.

Apple Silicon's weapon is unified memory. The M3 Ultra Mac Studio has 192GB UMA and can run a 70B model at 16-bit. The NVIDIA equivalent needs two H100 80GB cards (pricing isn't even comparable).

  • M4 Max 128GB — up to 109B MoE — about USD 7,000
  • M3 Ultra 192GB — 70B BF16 — about USD 9,500
  • RTX 5090 32GB — 32B Q4 — about USD 2,200 + the rest of the system

Decision: pick a Mac if you frequently run 70B+ models on a laptop; pick NVIDIA if you stay at 32B or smaller and want the best cost-performance, with gaming as a bonus.
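
A common back-of-the-envelope check when comparing machines: each generated token has to read every active weight once, so memory bandwidth divided by the size of the active weights gives an upper bound on tokens/second. A sketch under that assumption; the 500 GB/s figure in the usage line is a hypothetical placeholder, not a measured spec:

```python
def active_weight_gb(active_params_billion: float, bits: int) -> float:
    """GB of weights read per generated token."""
    return active_params_billion * bits / 8


def max_tokens_per_second(bandwidth_gb_s: float,
                          active_params_billion: float,
                          bits: int) -> float:
    """Upper bound: one full read of the active weights per token."""
    return bandwidth_gb_s / active_weight_gb(active_params_billion, bits)


# Hypothetical 500 GB/s machine running a dense 32B model at 4-bit
print(max_tokens_per_second(500, 32, 4))
```

Real throughput lands below this bound because of KV cache reads, compute, and scheduling overhead. For MoE models, pass the active parameter count rather than the total.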


Chapter 3 · Ollama — The Most-Loved Local Runtime

Ollama came out of Y Combinator W21. MIT license; CLI / REST API + a model registry on top of llama.cpp. GitHub stars as of May 2026: 145,000+.

Install and first run

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Run the daemon
ollama serve

# Pull and run a model
ollama run llama3.3:70b-instruct-q4_K_M

# 7B, snappy
ollama run qwen2.5:7b-instruct

One line, ollama run, downloads the quantized weights, starts the inference server, and opens a chat. What other runtimes do in five steps, Ollama does in one.

Modelfile — a Dockerfile for models

FROM llama3.3:70b-instruct-q4_K_M
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
SYSTEM """
You are a Korean data engineering assistant. Prefer SQL and PySpark.
"""

# Build the packaged model from the Modelfile, then run it
ollama create yj-de -f Modelfile
ollama run yj-de

You can package a system prompt + params as a "model". Great when distributing a standard prompt across a team.
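
When a team keeps several prompt variants, the Modelfile text itself can be generated from one template. A minimal sketch; `render_modelfile` and its arguments are hypothetical helpers, and only the FROM, PARAMETER, and SYSTEM instructions shown above are used:

```python
def render_modelfile(base: str, system: str, params: dict) -> str:
    """Return Modelfile text for a base model, parameters, and system prompt."""
    lines = [f"FROM {base}"]
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    return "\n".join(lines) + "\n"


print(render_modelfile(
    "llama3.3:70b-instruct-q4_K_M",
    "You are a Korean data engineering assistant. Prefer SQL and PySpark.",
    {"temperature": 0.7, "num_ctx": 8192},
))
```

Write the result to a file named Modelfile and pass it to ollama create as shown above.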

Ollama REST API

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.3:70b-instruct-q4_K_M",
  "messages": [{"role": "user", "content": "Explain Linux memory cache policy"}],
  "stream": false
}'
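
With "stream": true the same endpoint returns newline-delimited JSON: one chunk per line, each carrying a piece of the reply in message.content, and a final chunk with done set to true. A minimal sketch of assembling the reply from those lines; the sample lines are hand-written to match that shape, not captured output:

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str]) -> str:
    """Join message.content from NDJSON chunks until a chunk reports done."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


sample = [
    '{"message": {"role": "assistant", "content": "Page "}, "done": false}',
    '{"message": {"role": "assistant", "content": "cache"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(collect_stream(sample))
```

In a real client, feed the function the decoded lines of the HTTP response body instead of the sample list.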

OpenAI-compatible mode is also exposed, so LangChain, LlamaIndex, and the OpenAI SDK just need a base URL swap.

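from openai import OpenAI

# Point the OpenAI SDK at the local Ollama server; the API key is only a required placeholder
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="qwen2.5:14b-instruct",
    messages=[{"role": "user", "content": "hi"}]
)
print(resp.choices[0].message.content)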

Ollama model registry

Every model is one ollama pull away. Notable tags as of May 2026:

ollama pull llama3.3:70b-instruct-q4_K_M
ollama pull deepseek-r1:32b
ollama pull qwen3:14b
ollama pull phi4:14b
ollama pull gemma3:27b
ollama pull mistral-small:22b
ollama pull mixtral:8x7b-instruct-q4_K_M
ollama pull deepseek-coder-v2:16b-lite-instruct
ollama pull minicpm3:4b
ollama pull llava:34b

Ollama's limits

  • Lean GUI — you'll want a separate client (Open WebUI, Msty, etc.)
  • Multi-GPU distribution is limited (vLLM dominates here)
  • No fine-tuning tooling — you still need unsloth / axolotl
  • Memory management is rough — load two models at once and OOM is common

Even so, "run a local LLM in 5 minutes" has had the same answer for three years: Ollama.


Chapter 4 · LM Studio — GUI-First Desktop

LM Studio is a desktop app from Element Labs (San Francisco). Free, closed-source. macOS / Windows / Linux.

Strengths

  • Model browser — Hugging Face search inside the app. Model card, quant options, memory estimate — all on one screen
  • Chat UI — multi-session, prompt templates, stop / regenerate
  • Local server — exposes an OpenAI-compatible API in one click
  • MLX acceleration — auto-selects MLX on Apple Silicon (30-50% faster than llama.cpp)
  • Hardware profiler — GPU/CPU split as a slider

Scenario

Best fit for someone who frequently runs two models on a laptop to compare them. With Ollama's CLI you issue a new ollama run for each model; with LM Studio you toggle between them inside one graphical session.

Weaknesses

  • Closed source — corporate adoption requires extra security review
  • Model directory isn't shared with Ollama — you re-download
  • Apple Silicon only on macOS; Intel Mac builds are retired
  • Linux builds often trail by one or two releases

Chapter 5 · Jan — Truly Open-Source Desktop

Jan is a 100% open-source (AGPL-3.0) desktop LLM app from Homebrew Research. Electron + TypeScript. GitHub stars as of May 2026: 28,000+.

What stands out

  • Plugin marketplace — features toggle as modules (RAG, web search, code interpreter)
  • Multiple backends — llama.cpp, MLX, TensorRT, vLLM — all selectable in one app
  • Cloud + local — drop in OpenAI / Anthropic / Mistral / Groq keys, mix them in the same UI — "today Claude, yesterday local"
  • Data sovereignty — every chat log is local SQLite; analysis and export are free

When to pick

  • "I need a ChatGPT-style desktop UI but won't depend on OpenAI"
  • "Compare local and cloud in one screen"
  • "Enterprise — corporate policy bans closed-source desktop apps"

Jan API

Jan also exposes an OpenAI-compatible API.

# Default port
curl http://localhost:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3-70b-q4",
    "messages": [{"role": "user", "content": "hello"}]
  }'

Chapter 6 · Msty — The Closed-Source Standout

Msty is a solo-developer desktop app. Free for personal, paid team license. macOS / Windows / Linux. Closed source but very highly rated.

Differentiators

  • Branch chat — fork off any message and generate two answers in parallel. Comparison is dramatically faster
  • Knowledge Stacks — drag a folder / PDF / URL and RAG happens automatically. No separate setup like AnythingLLM
  • Workspaces — isolate chats / models / RAG per project. Lightroom's catalog metaphor
  • Simultaneous local + cloud — fan one prompt out to Claude / GPT / local Llama

Pricing

  • Personal — free
  • Pro (personal) USD 99/year — unlimited workspaces, cloud sync
  • Team — USD 159/seat/year

Where LM Studio is "model browser + chat," Msty positions as a "research / knowledge workbench."


Chapter 7 · Open WebUI — Self-Hosted ChatGPT

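Open WebUI (formerly Ollama WebUI) is a self-hostable ChatGPT clone started by Tim Jaeryang Baek. Originally MIT-licensed; since 2025 it ships under the project's own BSD-3-based Open WebUI License with a branding clause. Python (FastAPI) + Svelte. GitHub stars: 78,000+.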

  • Auto-detects Ollama — if Ollama runs on the host, models appear automatically
  • Multi-user — login, permissions, groups, per-model access
  • Built-in RAG — upload docs → vector search → context injection
  • Voice I/O — Whisper (STT) + Piper / Cartesia / ElevenLabs (TTS)
  • Function calling (Tools) — JS/Python functions invoked by the model
  • Pipelines — middleware pattern for logging, filtering, multi-model routing
  • One-line Docker install
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Browse to http://localhost:3000 — near-identical UI to ChatGPT. Drop this on the company GPU server and the whole company uses it. No data leaves the building.

Ops tips

  • Switch the backend to Postgres + Redis for multi-node scaling
  • If Ollama is on the same host: OLLAMA_BASE_URL=http://host.docker.internal:11434
  • vLLM and LM Studio are also OpenAI-compatible — same wiring
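
Because these backends all speak the OpenAI chat-completions format, one client can target any of them by switching the base URL. A minimal stdlib sketch; the Ollama, Jan, and vLLM ports follow the examples in this guide, while the LM Studio port (1234) and the helper names are assumptions:

```python
import json
import urllib.request

# Base URLs per backend; the LM Studio port is an assumed default
BACKENDS = {
    "ollama": "http://localhost:11434/v1",
    "jan": "http://localhost:1337/v1",
    "vllm": "http://localhost:8080/v1",
    "lmstudio": "http://localhost:1234/v1",
}


def build_chat_request(backend: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BACKENDS[backend]}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(backend: str, model: str, prompt: str) -> str:
    """Send the request; requires the chosen backend to be running."""
    with urllib.request.urlopen(build_chat_request(backend, model, prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Calling chat("ollama", "qwen2.5:14b-instruct", "hi") and chat("vllm", "Qwen/Qwen2.5-14B-Instruct", "hi") would then differ only in the first two arguments.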

Chapter 8 · LibreChat — Multi-Provider Chat

LibreChat leans on cloud integration more than Open WebUI. OpenAI, Anthropic, Google, Mistral, Ollama, vLLM, llama.cpp — all in one screen.

Features

  • Plugin system (DALL-E, Wolfram, Zapier)
  • Compare mode — fan a prompt to N models simultaneously
  • Assistants API compatibility
  • Full i18n (Korean / Japanese / Chinese included)

When to use

  • "The company mixes cloud and local. I don't want two UIs"
  • "Internal unified UI in place of ChatGPT Pro"
  • "Enterprise SSO / SAML needed"

Chapter 9 · GPT4All — Nomic's Local LLM

GPT4All is run by Nomic AI (known for Atlas embedding visualization). Desktop app + Python SDK. MIT.

from gpt4all import GPT4All
model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")
resp = model.generate("Why use a local LLM?", max_tokens=200)
print(resp)

Strengths

  • CPU-first design — works decently without a GPU
  • LocalDocs — folder RAG out of the box
  • Desktop + SDK integration — RAG collections built in the desktop GUI are accessible from Python

Weaknesses

  • New model support lags Ollama (no Llama 4 as of May 2026)
  • 5-10% slower than direct llama.cpp

Chapter 10 · AnythingLLM — Local RAG Powerhouse

AnythingLLM is a full-stack RAG desktop/Docker app from Mintplex Labs (Boston). MIT, Node.js + React. Desktop and Docker self-host builds both ship.

Core components

  • Workspaces — a bundle of docs, chats, embeddings, model config
  • Agents — function calls, web search, code execution
  • Multiple LLM backends — Ollama / LM Studio / OpenAI / Anthropic / Mistral / Together
  • Embedding backends — sentence-transformers, OpenAI, Cohere, Ollama nomic-embed
  • Built-in vector DB — LanceDB by default; Chroma / Pinecone / Weaviate / Qdrant optional
  • Document connectors — PDF, DOCX, MD, GitHub repos, Confluence, Notion, web crawler

Scenario — internal wiki bot

1. Run AnythingLLM Docker
2. Create workspace "engineering-wiki"
3. Connect the Confluence connector and index (auto re-index every 24h)
4. Set the model to qwen2.5:14b via Ollama
5. Call via Slack bot or Open WebUI API

One of the fastest paths to corporate full-stack RAG.


Chapter 11 · PrivateGPT, Khoj, Reor — Specialized Tools

PrivateGPT

PrivateGPT was started by Iván Martínez. Python-based. The goal: 100% local RAG with zero external API. Common in security / regulated industries. Somewhat heavy (model + embedding + vector DB in one process).

Khoj

Khoj is a "personal AI assistant" from Khoj Inc. It indexes notes (Obsidian, Notion), email, calendar — and chats over them.

  • macOS / Windows / Linux desktop
  • iOS / Android apps
  • Self-host Docker option

Reor

Reor is an "AI-native notes app." Markdown notes similar to Obsidian, but automatic embeddings connect every note semantically. All inference and embedding is local.


Chapter 12 · Faraday, Pinokio, Chatbox

Faraday (legacy)

Faraday.dev was a desktop app centered on character chat. Effectively dormant as of May 2026; users migrated to SillyTavern and AI Horde. Mentioned for historical context.

Pinokio

Pinokio is "a package manager for AI scripts." One-click install/run for ComfyUI, AUTOMATIC1111, Whisper, Bark. JSON-based recipe system.

Use cases:

  • Try image / voice / video tools quickly
  • Share a ComfyUI workflow with a friend
  • Automate demo environment setup

Chatbox

Chatbox is a multi-platform chat UI. iOS, Android, macOS, Windows, Linux, Web. OpenAI / Claude / Gemini / Ollama backends. Closed source but popular for travel because of strong mobile support.

Page Assist

Page Assist is a Chrome extension that lets you ask Ollama about the current page. Side-panel chat, context-menu summarize. Light RAG.


Chapter 13 · Backend Engines — llama.cpp / MLX / vLLM / TensorRT

llama.cpp

Georgi Gerganov's 2023 C++ inference engine. The foundation of Ollama, LM Studio, Jan, GPT4All. Supports CPU and GPU (CUDA, Metal, ROCm, Vulkan, SYCL).

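# Build from source (CMake; the old Makefile build has been retired)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Configure: pick one
cmake -B build                  # macOS (Metal is enabled by default)
cmake -B build -DGGML_CUDA=ON   # Linux NVIDIA
# Compile
cmake --build build --config Release -j 8

# Run
./build/bin/llama-cli -m models/qwen2.5-14b-instruct-q4_k_m.gguf -p "hi"
./build/bin/llama-server -m models/llama-3.3-70b-q4_k_m.gguf --port 8080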

Build-from-source is 10-20% faster than Ollama and exposes more flags. Downside: manual model download/management.

MLX-LM

Apple Silicon only. MLX is a NumPy-style tensor library from Apple's ML research team. MLX-LM is the LLM inference layer on top.

pip install mlx-lm
mlx_lm.generate --model mlx-community/Llama-3.3-70B-Instruct-4bit --prompt "hi"
mlx_lm.server --model mlx-community/Qwen2.5-14B-Instruct-4bit --port 8080

30-50% faster than llama.cpp's Metal backend on M3/M4. That's why LM Studio auto-selects MLX. Downside: Apple Silicon only, no NVIDIA / AMD.

vLLM / SGLang / TGI

Server-class. Serves one model to many concurrent requests (PagedAttention, continuous batching). Overkill for a single-user laptop, but the right answer for "internal LLM serving ten people." Covered in depth in a separate post; brief mention here.

pip install vllm
vllm serve Qwen/Qwen2.5-14B-Instruct --port 8080

TensorRT-LLM

NVIDIA only. CUDA-optimized inference. Maximum throughput on H100 / B200 / RTX 5090. Build process is complex but the throughput is unbeaten in production.

Llamafile

Mozilla's Llamafile bundles llama.cpp + a model into a single executable. Same file runs on macOS, Linux, Windows. Great for multi-OS demos and air-gapped environments.

chmod +x llava-v1.5-7b-q4.llamafile
./llava-v1.5-7b-q4.llamafile --server

Chapter 14 · Quantization Formats — GGUF / AWQ / GPTQ / EXL / MXFP4 / BitNet

Source models are typically BF16 (2 bytes/param). A 7B model is 14GB. Too heavy for a laptop. Quantization trades precision for footprint.
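
To make the trade concrete, here is a toy block-wise 4-bit quantizer. It is a generic symmetric absmax scheme, assumed here for illustration only; the formats described in this chapter use more elaborate layouts, and the function names are hypothetical:

```python
def quantize_block(weights, bits=4):
    """Symmetric absmax quantization of one block to signed integers."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_block(q, scale):
    """Recover approximate float weights from integers and the block scale."""
    return [v * scale for v in q]


block = [0.12, -0.53, 0.98, -0.07, 0.33, -0.91, 0.44, 0.01]
q, scale = quantize_block(block)
restored = dequantize_block(q, scale)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(q, round(scale, 4), round(max_err, 4))
```

Each weight now costs 4 bits plus a share of one scale per block, and the rounding error per weight is bounded by half the scale. That bound is the precision being traded away.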

GGUF (llama.cpp standard)

  • Q2_K (smallest, low quality, rarely used)
  • Q3_K_M (3-bit, 7B becomes 3GB — mobile)
  • Q4_K_M (4-bit, the sweet spot, most-used)
  • Q5_K_M (5-bit, better quality)
  • Q6_K (6-bit, near-BF16)
  • Q8_0 (8-bit, virtually no quality difference vs BF16, half the memory)
  • FP16 / BF16 (not quantized, original)

Q4_K_M shrinks a 7B to about 4.5GB and costs 2-3% perplexity. Overwhelming default.

AWQ (Activation-aware Weight Quantization)

Common with vLLM and TGI. Faster than GPTQ at similar quality. 4-bit is standard.

GPTQ

Older. Quantize with AutoGPTQ. 4-bit standard. Gradually losing ground to AWQ.

EXL2 / EXL3

ExLlamaV2/V3. NVIDIA RTX specialized. Mixes 4 / 6 / 8-bit within a model — under 1% perplexity hit. ExLlamaV3 shipped late 2025 with better quantization efficiency.

MXFP4

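MXFP4 (Microscaling FP4) was standardized by the Open Compute Project in 2023 and popularized by OpenAI's open-weight gpt-oss release in 2025. Hardware-accelerated on NVIDIA Blackwell (B200, RTX 5090). Better quality than INT4 at roughly a quarter of the BF16 footprint.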

BitNet (1.58-bit)

Microsoft research. Weights are -1, 0, +1. Almost no multiplications at inference — very fast. BitNet b1.58 3B and 7B released on Hugging Face in 2026. Experimental but huge potential for embedded / mobile.
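
The "almost no multiplications" point can be sketched in a few lines. This follows the absmean rounding described for BitNet b1.58 (scale by the mean absolute weight, round, clip to -1/0/+1); the helper names and the toy vector are illustrative, and a real kernel would operate on packed tensors:

```python
def ternarize(weights, eps=1e-8):
    """Absmean quantization: scale by mean |w|, round, clip to {-1, 0, +1}."""
    gamma = sum(abs(w) for w in weights) / len(weights)
    q = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return q, gamma


def ternary_dot(q, x, gamma):
    """Dot product using only additions and subtractions, then one rescale."""
    acc = 0.0
    for qi, xi in zip(q, x):
        if qi == 1:
            acc += xi
        elif qi == -1:
            acc -= xi
    return acc * gamma


w = [0.9, -0.05, -1.2, 0.4, 0.02, -0.7]
q, gamma = ternarize(w)
print(q, ternary_dot(q, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], gamma))
```

The inner loop never multiplies by a weight; the only multiplication is the single rescale by gamma at the end.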

Which to pick

  • Desktop / laptop, Ollama / llama.cpp → GGUF Q4_K_M
  • vLLM server, NVIDIA → AWQ
  • Single NVIDIA, max efficiency → EXL3
  • Apple Silicon → MLX 4-bit

Chapter 15 · Model Recommendations

General — Llama 4 Scout 109B MoE

Meta's Llama 4 Scout. A 16-expert MoE with 17B active params, so inference cost is 17B-class while quality is near 70B. 24 tokens/sec on M4 Max 128GB. Context up to 10M tokens.
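
The gap between total and active parameters comes from routing: a small router scores the experts for each token and only the top-scoring ones run. A generic top-k gating sketch with toy experts; this is not Meta's implementation, and the expert count, scores, and k are illustrative assumptions:

```python
import math


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k=1):
    """Run only the selected experts and mix their outputs by gate weight."""
    return sum(weight * experts[i](x)
               for i, weight in top_k_route(router_logits, k))


# Toy experts: each one just scales its input by a different factor
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
print(moe_forward(10.0, experts, router_logits=[0.1, 2.0, -1.0, 0.5], k=1))
```

All experts still have to sit in memory, which is why a 109B MoE needs 109B-class RAM even though each token only pays for the active slice.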

General (practical) — Llama 3.3 70B

Llama 3.3 70B Instruct. The 70B-class standard. GPT-4 Turbo level. 42GB at Q4_K_M. Dual RTX 5090 or M2 Ultra 64GB.

Reasoning — DeepSeek R1 Distill 32B

DeepSeek R1's Llama / Qwen distill series. 32B Q4 = single RTX 4090. o1-mini-class reasoning. Strong on math, code, logic.

ollama pull deepseek-r1:32b
ollama pull deepseek-r1:7b   # for laptops

Multilingual — Qwen 3 14B

Alibaba Qwen 3. Strong in Korean / Chinese / Japanese / English. Often outperforms Llama on Korean text. 14B Q4_K_M on a single RTX 4070 (12GB).

Small-model champion — Phi-4 14B

Microsoft Phi-4. The "data curation is the answer" result. 14B with 70B-class benchmarks. The cost-perf winner for laptops.

Very small — Gemma 3 12B / 4B

Google Gemma 3. The small end of the lineup is 12B / 4B / 1B, aimed at mobile / embedded / notebook use. The 4B is smaller than the 7B class with comparable performance.

Light + multilingual — MiniCPM 3.0 4B

OpenBMB's MiniCPM 3.0. 4B competitive with 8B models. Optimized for mobile / edge.

Code — DeepSeek Coder V2 Lite 16B

DeepSeek Coder V2. 16B MoE (2.4B active). 10GB at Q4. Popular as Continue.dev / Cline backend.

Multimodal — LLaVA 34B, Qwen2-VL 7B, Pixtral 12B

Image + text. LLaVA is the standard, Qwen2-VL is multilingual-strong, Pixtral is Mistral's vision model.

ollama pull llava:34b
ollama pull qwen2-vl:7b

Chapter 16 · Voice Mode — STT + LLM + TTS

STT (speech → text)

  • OpenAI Whisper — the standard. base / small / medium / large-v3. large-v3 needs roughly 10GB of VRAM in the reference implementation; quantized ports need far less.
  • faster-whisper — CTranslate2 backend. Fast on CPU and GPU.
  • whisper.cpp — C++ port, Apple Silicon Metal accelerated.
  • Distil-Whisper — Whisper distillation, 6x faster.

TTS (text → speech)

  • Piper — Rhasspy project. Fast on CPU, Korean voices available.
  • Coqui XTTS v2 — multilingual + voice cloning. (Coqui dissolved 2024, models remain)
  • F5-TTS — released 2025. English / Chinese naturalness near top tier. Voice cloning.
  • Kokoro — tiny (82M) English TTS. Real-time on a notebook CPU.
  • Cartesia Sonic — commercial API but extremely fast.

Open WebUI voice integration

Settings → Audio
  STT: faster-whisper (local) or Whisper API
  TTS: Piper (local), Kokoro (local), ElevenLabs (cloud)

Tap the mic, and the STT → LLM → TTS pipeline runs. You get a ChatGPT-style voice conversation, hands-free.
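
The pipeline is three functions chained together, which makes each stage easy to swap. A minimal sketch with stub stages standing in for real models; `voice_turn` and the stubs are hypothetical names, and in practice each callable would wrap one of the STT, LLM, and TTS engines listed above:

```python
from typing import Callable


def voice_turn(audio: bytes,
               stt: Callable[[bytes], str],
               llm: Callable[[str], str],
               tts: Callable[[str], bytes]) -> bytes:
    """One conversational turn: speech in, speech out."""
    text = stt(audio)
    if not text.strip():
        return b""              # nothing recognised, stay silent
    reply = llm(text)
    return tts(reply)


# Stub stages so the wiring can be exercised without any models installed
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


print(voice_turn(b"hello", fake_stt, fake_llm, fake_tts))
```

Keeping the stages as plain callables means a local engine can be replaced with a cloud one without touching the turn logic.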


Chapter 17 · Code Assistant — Continue.dev + Ollama

Continue.dev

Continue.dev is a VS Code / JetBrains extension and a Cursor / Copilot alternative. You choose the model backend freely, and local Ollama works.

// ~/.continue/config.json
{
  "models": [
    {
      "title": "Local Coder",
      "provider": "ollama",
      "model": "deepseek-coder-v2:16b-lite-instruct",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Tab",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}

Tab autocomplete uses Qwen2.5-Coder 7B (fast); chat uses DeepSeek Coder V2 16B (quality). 100% local, zero API cost, code stays on-device.

Cline + Ollama

Cline (formerly Claude Dev) is agentic: file read/write, command exec, Plan/Act mode. The Ollama backend works, but 70B+ reasoning models are recommended because agent loops are heavy.

aider

aider is a terminal pair programmer. Git-based. Ollama backend.

aider --model ollama/qwen2.5-coder:32b

Chapter 18 · Apple Intelligence — OS-Level On-Device

Apple Intelligence is GA on iOS 18, iPadOS 18, macOS 15 Sequoia, visionOS 2. Two pieces.

  1. On-device 3B model — runs on Apple Silicon NPU. Notification summaries, Mail reply suggestions, text refinement, Image Playground.
  2. Private Cloud Compute (PCC) — for bigger workloads, offloaded to Apple Silicon servers. Logs are not persisted; only attested code runs (source published to external security researchers).

Foundation Models framework

import FoundationModels
let session = LanguageModelSession()
let resp = try await session.respond(to: "Summarize my note in 3 lines")

Available on iOS 26+ / macOS 26+ (the framework was introduced at WWDC 2025). 3B-bound but free and unlimited.

Limits

  • English-first; Korean / Japanese GA in stages through 2025
  • 3B is too small for complex tasks — hence the PCC handoff
  • Requires iPhone 15 Pro or M1 or later

Chapter 19 · Phi Silica — On-Device AI for Windows 11

Microsoft ships Phi Silica, a 3.3B model, on the NPU of Snapdragon X Elite / Intel Core Ultra / AMD Ryzen AI. Standard on Copilot+ PCs since Windows 11 24H2.

Capabilities

  • Summarize / rewrite / translate
  • Code assist (Visual Studio integration)
  • Image generation (Cocreator)
  • Search (Recall — semantic search over captured screens)

Recall was paused over security concerns in 2024 and re-shipped in 2025 with opt-in + E2E encryption.

Developer API

The Microsoft.Windows.AI.Generative namespace in Windows Copilot Runtime. Callable from C# / Rust / C++.


Chapter 20 · Gemini Nano — Android and Chrome

Gemini Nano is Google's smallest Gemini variant. Available on Pixel 8 Pro and up, parts of Galaxy S24+, and Chrome desktop (Canary / Beta + partial stable as of May 2026).

Chrome Built-in AI

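// Origin Trial active as of May 2026
// Current Prompt API shape: a global LanguageModel, with the system prompt passed via initialPrompts
const session = await LanguageModel.create({
  initialPrompts: [{ role: "system", content: "You are a summarization expert." }],
})
const summary = await session.prompt("Summarize this article in 3 lines: ...")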

An LLM inside the browser. Zero network calls, zero cost. Web apps can finally lean on an "offline LLM."

Android AICore

val generativeModel = GenerativeModel(modelName = "gemini-nano")
val response = generativeModel.generateContent("summarize")

Chapter 21 · Korean Local AI Ecosystem

Lablup Backend.AI

Lablup's Backend.AI is an LLM training / inference platform. Manages vLLM, Triton, TensorRT on in-house GPU clusters. Many SOE / large-enterprise deployments in Korea in 2026.

Upstage Solar

Upstage's Solar comes in 10.7B / Pro / Mini variants. Solar Mini 2.4B runs locally on laptops — registered in Ollama.

ollama pull upstage/solar-pro-preview

Naver HyperCLOVA X

Naver's HyperCLOVA X SEED 3B is open-weight (released 2025). Korean-specialized. Registered on Hugging Face and convertible for llama.cpp / Ollama.

KT, SKT, LG

  • KT Mi:dm, SKT A.X 4.0 — proprietary 7B models (some weights open)
  • LG AI Research EXAONE 3.5 — 2.4B / 7.8B / 32B. Non-commercial license but free for research
ollama pull exaone3.5:7.8b

Chapter 22 · Japanese Local AI Ecosystem

ELYZA

ELYZA (University of Tokyo spinout). Llama-based Japanese fine-tunes. Llama-3-ELYZA-JP-8B runs directly in Ollama.

Rinna

Rinna. MS Japan spinout. Japanese GPT, BERT, Llama tunes. Also voice synth / recognition.

Stockmark

Stockmark-100B. Japanese 100B model, business-domain-specialized. Partial weights public.

PFN PLaMo

Preferred Networks's PLaMo. 13B / 100B. PLaMo Lite is open-weight — laptop-local feasible.

CyberAgent CALM

CyberAgent's CALM3 22B. Japanese + dialogue-tuned. Single RTX 4090 at Q4.


Chapter 23 · Ops Know-How — N Models on One GPU

Loading two models on the same GPU often OOMs. Three remedies.

1. Hot-swap (Ollama default)

Ollama's keep_alive controls model retention in memory.

# Unload 30 seconds after last use
ollama run qwen2.5:7b --keepalive 30s

# Keep loaded forever (any negative duration)
ollama run llama3.3:70b --keepalive -1m
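
The same setting is available per request through the REST API's keep_alive field, where 0 unloads the model right after the response and a negative value keeps it loaded. A sketch that only builds the request bodies; the helper name is hypothetical:

```python
import json


def generate_payload(model: str, prompt: str, keep_alive) -> str:
    """JSON body for POST /api/generate with an explicit keep_alive."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        # e.g. "30s", "10m", 0 (unload now), -1 (keep loaded)
        "keep_alive": keep_alive,
    })


print(generate_payload("qwen2.5:7b", "hi", "30s"))
print(generate_payload("llama3.3:70b", "hi", -1))
```

Setting keep_alive per request lets a batch job release VRAM as soon as it finishes while an interactive service keeps its model warm.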

2. Model router

If different services need different models, route them through a self-hosted gateway such as LiteLLM.

# litellm config.yaml
model_list:
  - model_name: chat
    litellm_params:
      model: ollama/qwen2.5:14b
      api_base: http://localhost:11434
  - model_name: code
    litellm_params:
      model: ollama/deepseek-coder-v2:16b
      api_base: http://localhost:11434

3. vLLM continuous batching

When many users hit at once, vLLM uses PagedAttention to serve N concurrent requests against one model. Ten people can chat against a single 70B.
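
Why continuous batching helps can be shown with a toy scheduler that only counts decode steps. This sketch models scheduling alone, not PagedAttention's memory management, and the request lengths and batch size are made-up inputs:

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue after every decode step."""
    queue, active, steps = deque(lengths), [], 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active sequence emits one token; finished ones leave the batch
        active = [n - 1 for n in active if n > 1]
    return steps


lengths = [100, 10, 10, 10, 100, 10, 10, 10]   # tokens to generate per request
print(static_batching_steps(lengths, 4), continuous_batching_steps(lengths, 4))
```

With static batching the short requests wait on the longest one in their batch; with continuous batching a finished slot is handed to the next request immediately, so total steps drop.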


Chapter 24 · RAG Patterns — Local Embedding

Embedding models (local)

  • nomic-embed-text — 768-dim, English SOTA-ish, registered in Ollama
  • mxbai-embed-large — 1024-dim, better quality, slightly slower
  • bge-m3 — multilingual strong (Korean / Japanese / Chinese)
  • multilingual-e5-large — multilingual / notebook-friendly
ollama pull nomic-embed-text
ollama pull mxbai-embed-large
ollama pull bge-m3

Local vector DBs

  • LanceDB — embedded, on-disk, single file. AnythingLLM default.
  • ChromaDB — Python lib + server mode
  • Qdrant — Rust server, very fast
  • Weaviate — full-stack
  • Milvus — large-scale
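import lancedb
import ollama  # assumption: the ollama Python client, with the Ollama daemon running

def embed(text):
    # Embed locally with the nomic-embed-text model pulled above
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

db = lancedb.connect("./data")
# Passing the first rows as data lets LanceDB infer the schema
table = db.create_table("docs", data=[{"vector": embed("text"), "text": "text"}])
print(table.search(embed("query")).limit(5).to_pandas())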
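
The retrieval step behind any of these stores is the same: embed the query, rank documents by similarity, and paste the top hits into the prompt. A dependency-free sketch of that flow; `toy_embed` is a bag-of-words stand-in for a real embedding model, and all names here are illustrative:

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list, k: int = 2) -> list:
    """Return the k documents most similar to the query."""
    q = toy_embed(query)
    return sorted(docs, key=lambda d: cosine(q, toy_embed(d)), reverse=True)[:k]


def build_prompt(query: str, docs: list) -> str:
    """Inject the retrieved documents as context ahead of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


docs = [
    "Ollama listens on port 11434 by default",
    "LanceDB stores vectors on disk",
    "Piper is a fast local TTS engine",
]
print(build_prompt("which port does Ollama use", docs))
```

Swapping `toy_embed` for a real embedding call and the sorted scan for a vector DB query keeps the rest of the flow unchanged.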

Chapter 25 · Security and Compliance

"Local equals safe?" — No

Local LLMs remove some cloud risks but introduce new ones.

  • Prompt injection — hidden "ignore previous instructions" in documents → identical risk locally
  • Data leakage — RAG may pull docs the user has no right to
  • Model integrity — a Hugging Face download might be backdoored — use official channels only
  • Fine-tune leakage — weights tuned on company data may leak PII
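
The data-leakage risk is usually handled by filtering documents against the caller's permissions before anything reaches the prompt. A minimal sketch, assuming each document carries a set of allowed groups; the `Doc` class and group names are hypothetical:

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    text: str
    allowed_groups: set = field(default_factory=set)


def authorized(docs, user_groups):
    """Keep only documents that share at least one group with the user."""
    groups = set(user_groups)
    return [d for d in docs if d.allowed_groups & groups]


corpus = [
    Doc("Q3 salary bands", {"hr"}),
    Doc("Deployment runbook", {"engineering", "sre"}),
]

# Filter BEFORE retrieved text is placed into the prompt
print([d.text for d in authorized(corpus, ["engineering"])])
```

Applying the filter inside the retrieval layer, rather than in the UI, means a prompt-injected instruction cannot talk the model into revealing text it never received.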

Ops guide

  • Source models only from the official org (Meta, Microsoft, Google, Alibaba, DeepSeek HF orgs)
  • Verify hashes after download
  • Internal RAG needs access control (AnythingLLM workspace-scoped)
  • Logging and audit — pipe Open WebUI admin logs into your SIEM
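
Hash verification takes a few lines of standard-library Python. A sketch that streams the file in chunks; the expected hash is assumed to come from the publisher's model page, and the function names are illustrative:

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file so multi-GB weights never sit in memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """Compare the file's SHA-256 with the published value."""
    return sha256_of(path) == expected_hex.strip().lower()
```

A deployment script can call verify on each downloaded weight file and refuse to load the model when it returns False.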

Compliance mapping

| Regulation | Cloud LLM | Local LLM |
|---|---|---|
| GDPR | Transfer requires DPA | No transfer, partially exempt |
| HIPAA | BAA required | Own infrastructure — full control |
| Korea PIPA | Overseas transfer consent | Domestic processing — simple |
| Japan APPI | Consent + safeguards | Same, lower external risk |
| Korea FSI | Cloud security cert mandatory | Self-controlled infra |

Chapter 26 · Conclusion — Local AI as 2026 Table Stakes

Local LLMs were a hobby in 2023, an experiment in 2024, an option in 2025. In 2026 they are a developer's basic skill.

  • One laptop + Ollama + Continue.dev → the API bill drops and code doesn't leak
  • In-house GPU server + Open WebUI + AnythingLLM → self-run company ChatGPT
  • iPhone + Apple Intelligence → handled by the OS
  • Personal notes + Reor / Khoj → semantic search over all notes

A 5-minute workflow to try right now.

# 1. Install Ollama
brew install ollama

# 2. Pull a model
ollama pull qwen2.5:14b-instruct

# 3. Chat
ollama run qwen2.5:14b-instruct

# 4. Spin up Open WebUI (if Docker is on the box)
docker run -d -p 3000:8080 \
  -v open-webui:/app/backend/data \
  --add-host=host.docker.internal:host-gateway \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000 in a browser and your own ChatGPT runs on a laptop. No data leaves, cost is electricity, and the plane Wi-Fi being down doesn't stop you. That is the May-2026 landscape.

