Local AI & On-Device LLMs 2026: Ollama · LM Studio · Jan · Msty · Open WebUI · GPT4All · AnythingLLM · Faraday Deep Dive


Chapter 1 · Why Local AI Matters in 2026

Three years ago, "local LLMs" meant quantizing a 7B model to 4-bit, jamming it into an RTX 3090, and getting roughly half of GPT-3.5 quality. The landscape in May 2026 is unrecognizable.

- An **M4 Max MacBook Pro 128GB** runs Llama 4 Scout 109B MoE at 24 tokens/second

- An **RTX 5090 32GB** handles DeepSeek R1 Distill 32B at 12 tokens/second

- **iPhone 16 Pro** triggers Apple Intelligence's 3B model automatically at OS level

- A **Snapdragon X Elite** notebook runs Phi Silica 3.3B on the NPU

There are four crisp reasons local AI matters.

1. **Privacy** — inputs never leave the building, which simplifies compliance with GDPR, HIPAA, Korea's PIPA, and Japan's APPI

2. **Cost** — no API bill. Just electricity (negligible on a laptop)

3. **Offline** — planes, subways, cafe Wi-Fi — works without internet

4. **Experimentation** — try a freshly released model within 5 minutes. Fine-tuning, LoRA, and RAG cost nothing beyond your own hardware

This guide covers **everything a developer should know to run LLMs on desktop / laptop / mobile** as of May 2026. Runtimes, GUIs, backends, quantization formats, model recommendations, and ops know-how.

Chapter 2 · Hardware — The VRAM and Unified Memory Era

The first gate for local LLMs is memory. Rough guidelines:

| Model size | Precision | VRAM/RAM | Notes |

| --- | --- | --- | --- |

| 3B | INT4 | 4GB | Mobile / low-end notebooks |

| 7B | INT4 (Q4_K_M) | 8GB | RTX 3060, M1/M2 8GB |

| 7B | INT8 | 12GB | RTX 3060 12GB, M2 16GB |

| 13B | INT4 | 12-14GB | RTX 4070, M2 24GB |

| 32B | INT4 | 22-24GB | RTX 4090, M3 Max 36GB |

| 70B | INT4 | 42-48GB | Dual RTX 5090, M2 Ultra 64GB |

| 70B | INT8 | 80GB+ | A100 80GB, M3 Ultra 192GB |

| 405B | INT4 | 240GB+ | Multi-GPU node, M3 Ultra 192GB pair |
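The table follows from simple arithmetic: weight memory is parameter count times bits per weight, plus headroom for the KV cache and runtime buffers. A rough sketch, where the 20% overhead factor is an assumption chosen for illustration rather than a measured value, and `estimate_memory_gb` is a hypothetical helper:

```python
def estimate_memory_gb(params_billion: float, bits_per_weight: float,
                       overhead: float = 1.2) -> float:
    """Rough memory estimate in GB for loading a model.

    overhead is an assumed multiplier (KV cache, activations, runtime buffers);
    real usage depends on context length and backend.
    """
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits = 1 GB
    return round(weights_gb * overhead, 1)


for size, bits in [(7, 4), (32, 4), (70, 4), (70, 8)]:
    print(f"{size}B @ {bits}-bit: ~{estimate_memory_gb(size, bits)} GB")
```

The hardware column in the table leaves extra room on top of these figures, since the operating system and longer contexts also need memory.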

NVIDIA vs Apple Silicon

NVIDIA wins on **memory bandwidth**. Single-stream token generation is mostly memory-bandwidth-bound, so fast GDDR on a wide bus translates directly into low per-token latency. RTX 5090 with 32GB GDDR7 has the shortest token latency for 32B-class models.

Apple Silicon's weapon is **unified memory**. The M3 Ultra Mac Studio has 192GB UMA and can run a 70B model at 16-bit. The NVIDIA equivalent needs two H100 80GB cards (pricing isn't even comparable).

- **M4 Max 128GB** — up to 109B MoE — about USD 7,000

- **M3 Ultra 192GB** — 70B BF16 — about USD 9,500

- **RTX 5090 32GB** — 32B Q4 — about USD 2,200 + the rest of the system

Decision: **Mac if you frequently run 70B+ on a laptop**; **NVIDIA if 32B or smaller + best cost-perf + gaming as well**.
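A common back-of-envelope bound for single-stream decoding: each generated token requires reading the active weights once, so tokens/second cannot exceed memory bandwidth divided by active weight size. The bandwidth figures below (about 546 GB/s for M4 Max, about 1,792 GB/s for RTX 5090) are assumptions taken from vendor spec sheets rather than from this guide, and real throughput lands well under the bound.

```python
def decode_upper_bound_tps(bandwidth_gb_s: float, active_params_billion: float,
                           bits_per_weight: float) -> float:
    """Upper bound on tokens/sec when decoding is memory-bandwidth-bound."""
    active_weights_gb = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / active_weights_gb


# Assumed vendor bandwidth numbers (not from this guide).
M4_MAX_GB_S = 546.0
RTX_5090_GB_S = 1792.0

# A MoE model only reads its active parameters per token (17B for Llama 4 Scout).
print(round(decode_upper_bound_tps(M4_MAX_GB_S, 17, 4), 1))
# A dense 32B model at 4-bit on an RTX 5090.
print(round(decode_upper_bound_tps(RTX_5090_GB_S, 32, 4), 1))
```

The gap between this bound and measured speeds comes from KV-cache reads, kernel overhead, and scheduling, none of which the sketch models.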

Chapter 3 · Ollama — The Most-Loved Local Runtime

[Ollama](https://ollama.com/) came out of Y Combinator (W21 batch). MIT license; CLI / REST API + a model registry on top of llama.cpp. GitHub stars as of May 2026: 145,000+.

Install and first run

macOS

brew install ollama

Linux

curl -fsSL https://ollama.com/install.sh | sh

Run the daemon

ollama serve

Pull and run a model

ollama run llama3.3:70b-instruct-q4_K_M

7B, snappy

ollama run qwen2.5:7b-instruct

One line, `ollama run`, downloads the quantized weights, starts the inference server, and opens a chat. What other runtimes do in five steps, Ollama does in one.

Modelfile — a Dockerfile for models

FROM llama3.3:70b-instruct-q4_K_M

PARAMETER temperature 0.7

PARAMETER num_ctx 8192

SYSTEM """

You are a Korean data engineering assistant. Prefer SQL and PySpark.

"""

ollama create yj-de -f Modelfile

ollama run yj-de

You can package a system prompt + params as a "model". Great when distributing a standard prompt across a team.
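If a team maintains several such variants, generating the Modelfile from code keeps them consistent. A minimal sketch: `render_modelfile` is a hypothetical helper, and it only emits the `FROM`, `PARAMETER`, and `SYSTEM` instructions shown above.

```python
def render_modelfile(base: str, system: str, params: dict) -> str:
    """Build Modelfile text from a base model tag, a system prompt, and parameters."""
    lines = [f"FROM {base}"]
    for key, value in params.items():
        lines.append(f"PARAMETER {key} {value}")
    # Triple quotes let the system prompt span multiple lines.
    lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    return "\n".join(lines) + "\n"


text = render_modelfile(
    base="llama3.3:70b-instruct-q4_K_M",
    system="You are a Korean data engineering assistant. Prefer SQL and PySpark.",
    params={"temperature": 0.7, "num_ctx": 8192},
)
print(text)
```

Write the result to a file named `Modelfile` and register it with `ollama create yj-de -f Modelfile` as before.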

Ollama REST API

curl http://localhost:11434/api/chat -d '{

"model": "llama3.3:70b-instruct-q4_K_M",

"messages": [{"role": "user", "content": "Explain Linux memory cache policy"}],

"stream": false

}'
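With `"stream": true` (the default), `/api/chat` returns newline-delimited JSON: one object per chunk, ending with an object whose `done` field is true. A stdlib-only sketch under that assumption; `chat_stream` needs a running Ollama daemon, while `parse_stream` can be exercised offline with hand-written chunks.

```python
import json
import urllib.request
from typing import Iterable, Iterator

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def parse_stream(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from an Ollama NDJSON chat stream."""
    for raw in lines:
        if not raw.strip():
            continue
        chunk = json.loads(raw)
        if chunk.get("done"):
            break
        yield chunk["message"]["content"]


def chat_stream(model: str, prompt: str) -> Iterator[str]:
    """Send one user message and stream the reply (requires a local Ollama server)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        OLLAMA_CHAT_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        # The HTTP response object iterates line by line.
        yield from parse_stream(resp)


# Offline check with hand-written chunks in the same shape as the stream.
sample = [
    b'{"message": {"role": "assistant", "content": "Hel"}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": "lo"}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}\n',
]
print("".join(parse_stream(sample)))
```

Against a live server, `for piece in chat_stream("qwen2.5:7b-instruct", "hi"): print(piece, end="")` prints tokens as they arrive.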

OpenAI-compatible mode is also exposed, so LangChain, LlamaIndex, and the OpenAI SDK only need a base URL swap.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(

model="qwen2.5:14b-instruct",

messages=[{"role": "user", "content": "hi"}]

)

Ollama model registry

One line via `ollama pull`. Important May 2026 tags.

ollama pull llama3.3:70b-instruct-q4_K_M

ollama pull deepseek-r1:32b

ollama pull qwen3:14b

ollama pull phi4:14b

ollama pull gemma3:27b

ollama pull mistral-small:22b

ollama pull mixtral:8x7b-instruct-q4_K_M

ollama pull deepseek-coder-v2:16b-lite-instruct

ollama pull minicpm3:4b

ollama pull llava:34b

Ollama's limits

- Lean GUI — you'll want a separate client (Open WebUI, Msty, etc.)

- Multi-GPU distribution is limited (vLLM dominates here)

- No fine-tuning tooling — you still need unsloth / axolotl

- Memory management is rough — load two models at once and OOM is common

Even so, **"run a local LLM in 5 minutes"** has had the same answer for three years: Ollama.

Chapter 4 · LM Studio — GUI-First Desktop

[LM Studio](https://lmstudio.ai/) is a desktop app from Element Labs (San Francisco). Free, closed-source. macOS / Windows / Linux.

Strengths

- **Model browser** — Hugging Face search inside the app. Model card, quant options, memory estimate — all on one screen

- **Chat UI** — multi-session, prompt templates, stop / regenerate

- **Local server** — exposes an OpenAI-compatible API in one click

- **MLX acceleration** — auto-selects MLX on Apple Silicon (30-50% faster than llama.cpp)

- **Hardware profiler** — GPU/CPU split as a slider

Scenario

Best fit for someone who frequently runs two models on a laptop to compare them. With Ollama (CLI) you `ollama run` every time. With LM Studio you toggle inside one graphical session.

Weaknesses

- Closed source — corporate adoption requires extra security review

- Model directory isn't shared with Ollama — you re-download

- Apple Silicon only on macOS; Intel Mac builds are retired

- Linux builds often trail by one or two releases

Chapter 5 · Jan — Truly Open-Source Desktop

[Jan](https://jan.ai/) is a 100% open-source (AGPL-3.0) desktop LLM app from Homebrew Research. Electron + TypeScript. GitHub stars as of May 2026: 28,000+.

What stands out

- **Plugin marketplace** — features toggle as modules (RAG, web search, code interpreter)

- **Multiple backends** — llama.cpp, MLX, TensorRT, vLLM — all selectable in one app

- **Cloud + local** — drop in OpenAI / Anthropic / Mistral / Groq keys and mix them in the same UI: Claude today, a local model tomorrow

- **Data sovereignty** — every chat log is local SQLite; analysis and export are free

When to pick

- "I need a ChatGPT-style desktop UI but won't depend on OpenAI"

- "Compare local and cloud in one screen"

- "Enterprise — corporate policy bans closed-source desktop apps"

Jan API

Jan also exposes an OpenAI-compatible API.

Default port

curl http://localhost:1337/v1/chat/completions \

-H "Content-Type: application/json" \

-d '{

"model": "llama3.3-70b-q4",

"messages": [{"role": "user", "content": "hello"}]

}'

Chapter 6 · Msty — The Closed-Source Standout

[Msty](https://msty.app/) is a solo-developer desktop app. Free for personal, paid team license. macOS / Windows / Linux. Closed source but very highly rated.

Differentiators

- **Branch chat** — fork off any message and generate two answers in parallel. Comparison is dramatically faster

- **Knowledge Stacks** — drag a folder / PDF / URL and RAG happens automatically. No separate setup like AnythingLLM

- **Workspaces** — isolate chats / models / RAG per project. Lightroom's catalog metaphor

- **Simultaneous local + cloud** — fan one prompt out to Claude / GPT / local Llama

Pricing

- Personal — free

- Pro (personal) USD 99/year — unlimited workspaces, cloud sync

- Team — USD 159/seat/year

Where LM Studio is "model browser + chat," Msty positions as a "research / knowledge workbench."

Chapter 7 · Open WebUI — Self-Hosted ChatGPT

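[Open WebUI](https://openwebui.com/) (formerly Ollama WebUI) is a self-hostable ChatGPT clone started by Timothy Jaeryang Baek. Originally MIT-licensed; releases since 2025 use a BSD-3-style license with a branding clause, so check the terms before rebranding. Python (FastAPI) + Svelte. GitHub stars: 78,000+.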

Why it's popular

- **Auto-detects Ollama** — if Ollama runs on the host, models appear automatically

- **Multi-user** — login, permissions, groups, per-model access

- **Built-in RAG** — upload docs → vector search → context injection

- **Voice I/O** — Whisper (STT) + Piper / Cartesia / ElevenLabs (TTS)

- **Function calling (Tools)** — JS/Python functions invoked by the model

- **Pipelines** — middleware pattern for logging, filtering, multi-model routing

- **One-line Docker install**

docker run -d -p 3000:8080 \

--add-host=host.docker.internal:host-gateway \

-v open-webui:/app/backend/data \

--name open-webui \

--restart always \

ghcr.io/open-webui/open-webui:main

Browse to `http://localhost:3000` — near-identical UI to ChatGPT. Drop this on the company GPU server and the whole company uses it. No data leaves the building.

Ops tips

- Switch the backend to Postgres + Redis for multi-node scaling

- If Ollama is on the same host: `OLLAMA_BASE_URL=http://host.docker.internal:11434`

- vLLM and LM Studio are also OpenAI-compatible — same wiring
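Because these servers speak the same OpenAI-style `/v1/chat/completions` route, switching backends can come down to a base URL. A sketch using only the standard library: the Ollama and Jan ports come from this guide, while the LM Studio port (1234) and vLLM port (8000) are assumed defaults that should be checked against your install.

```python
import json
import urllib.request

# Base URLs; the LM Studio and vLLM entries are assumed defaults.
BACKENDS = {
    "ollama": "http://localhost:11434/v1",
    "jan": "http://localhost:1337/v1",
    "lmstudio": "http://localhost:1234/v1",
    "vllm": "http://localhost:8000/v1",
}


def build_chat_request(backend: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{BACKENDS[backend]}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("ollama", "qwen2.5:14b-instruct", "hi")
print(req.full_url)
# To send it: json.load(urllib.request.urlopen(req))["choices"][0]["message"]["content"]
```

Keeping the request builder separate from the network call makes it easy to point the same client at a different backend during testing.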

Chapter 8 · LibreChat — Multi-Provider Chat

[LibreChat](https://www.librechat.ai/) leans on cloud integration more than Open WebUI. OpenAI, Anthropic, Google, Mistral, Ollama, vLLM, llama.cpp — all in one screen.

Features

- Plugin system (DALL-E, Wolfram, Zapier)

- Compare mode — fan a prompt to N models simultaneously

- Assistants API compatibility

- Full i18n (Korean / Japanese / Chinese included)

When to use

- "The company mixes cloud and local. I don't want two UIs"

- "Internal unified UI in place of ChatGPT Pro"

- "Enterprise SSO / SAML needed"

Chapter 9 · GPT4All — Nomic's Local LLM

[GPT4All](https://gpt4all.io/) is run by [Nomic AI](https://nomic.ai/) (known for Atlas embedding visualization). Desktop app + Python SDK. MIT.

from gpt4all import GPT4All

model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")

resp = model.generate("Why use a local LLM?", max_tokens=200)

print(resp)

Strengths

- **CPU-first design** — works decently without a GPU

- **LocalDocs** — folder RAG out of the box

- **Desktop + SDK integration** — RAG collections built in the desktop GUI are accessible from Python

Weaknesses

- New model support lags Ollama (no Llama 4 as of May 2026)

- 5-10% slower than direct llama.cpp

Chapter 10 · AnythingLLM — Local RAG Powerhouse

[AnythingLLM](https://anythingllm.com/) is a full-stack RAG desktop/Docker app from Mintplex Labs (Boston). MIT, Node.js + React. Desktop and Docker self-host builds both ship.

Core components

- **Workspaces** — a bundle of docs, chats, embeddings, model config

- **Agents** — function calls, web search, code execution

- **Multiple LLM backends** — Ollama / LM Studio / OpenAI / Anthropic / Mistral / Together

- **Embedding backends** — sentence-transformers, OpenAI, Cohere, Ollama nomic-embed

- **Built-in vector DB** — LanceDB by default; Chroma / Pinecone / Weaviate / Qdrant optional

- **Document connectors** — PDF, DOCX, MD, GitHub repos, Confluence, Notion, web crawler

Scenario — internal wiki bot

1. Run AnythingLLM Docker

2. Create workspace "engineering-wiki"

3. Connect the Confluence connector and index (auto re-index every 24h)

4. Set the model to qwen2.5:14b via Ollama

5. Call via Slack bot or Open WebUI API

One of the fastest paths to corporate full-stack RAG.

Chapter 11 · PrivateGPT, Khoj, Reor — Specialized Tools

PrivateGPT

[PrivateGPT](https://privategpt.dev/) was started by Iván Martínez. Python-based. The goal: 100% local RAG with zero external API. Common in security / regulated industries. Somewhat heavy (model + embedding + vector DB in one process).

Khoj

[Khoj](https://khoj.dev/) is a "personal AI assistant" from Khoj Inc. It indexes notes (Obsidian, Notion), email, calendar — and chats over them.

- macOS / Windows / Linux desktop

- iOS / Android apps

- Self-host Docker option

Reor

[Reor](https://reor.app/) is an "AI-native notes app." Markdown notes similar to Obsidian, but automatic embeddings connect every note semantically. All inference and embedding is local.

Chapter 12 · Faraday, Pinokio, Chatbox

Faraday (legacy)

[Faraday.dev](https://faraday.dev/) was a desktop app centered on character chat. Effectively dormant as of May 2026; users migrated to SillyTavern and AI Horde. Mentioned for historical context.

Pinokio

[Pinokio](https://pinokio.computer/) is "a package manager for AI scripts." One-click install/run for ComfyUI, AUTOMATIC1111, Whisper, Bark. JSON-based recipe system.

Use cases:

- Try image / voice / video tools quickly

- Share a ComfyUI workflow with a friend

- Automate demo environment setup

Chatbox

[Chatbox](https://chatboxai.app/) is a multi-platform chat UI. iOS, Android, macOS, Windows, Linux, Web. OpenAI / Claude / Gemini / Ollama backends. Closed source but popular for travel because of strong mobile support.

Page Assist

[Page Assist](https://chromewebstore.google.com/detail/page-assist-a-web-ui-for/jfgfiigpkhlkbnfnbobbkinehhfdhndo) is a Chrome extension that lets you ask Ollama about the current page. Side-panel chat, context-menu summarize. Light RAG.

Chapter 13 · Backend Engines — llama.cpp / MLX / vLLM / TensorRT

llama.cpp

[Georgi Gerganov](https://github.com/ggerganov)'s 2023 C++ inference engine. The foundation of Ollama, LM Studio, Jan, GPT4All. Supports CPU and GPU (CUDA, Metal, ROCm, Vulkan, SYCL).

Build from source

git clone https://github.com/ggml-org/llama.cpp

cd llama.cpp

cmake -B build # macOS: Metal is enabled by default

cmake -B build -DGGML_CUDA=ON # Linux NVIDIA

cmake --build build --config Release -j 8

Run

./build/bin/llama-cli -m models/qwen2.5-14b-instruct-q4_k_m.gguf -p "hi"

./build/bin/llama-server -m models/llama-3.3-70b-q4_k_m.gguf --port 8080

Build-from-source is 10-20% faster than Ollama and exposes more flags. Downside: manual model download/management.

MLX-LM

Apple Silicon only. [MLX](https://ml-explore.github.io/mlx/build/html/index.html) is a NumPy-style tensor library from Apple's ML research team. MLX-LM is the LLM inference layer on top.

pip install mlx-lm

mlx_lm.generate --model mlx-community/Llama-3.3-70B-Instruct-4bit --prompt "hi"

mlx_lm.server --model mlx-community/Qwen2.5-14B-Instruct-4bit --port 8080

30-50% faster than llama.cpp's Metal backend on M3/M4. That's why LM Studio auto-selects MLX. Downside: Apple Silicon only, no NVIDIA / AMD.

vLLM / SGLang / TGI

Server-class. Serves one model to many concurrent requests (PagedAttention, continuous batching). Overkill for a single-user laptop, but the right answer for "internal LLM serving ten people." Covered in depth in a separate post; brief mention here.

pip install vllm

vllm serve Qwen/Qwen2.5-14B-Instruct --port 8080

TensorRT-LLM

NVIDIA only. CUDA-optimized inference. Maximum throughput on H100 / B200 / RTX 5090. Build process is complex but the throughput is unbeaten in production.

Llamafile

[Mozilla's Llamafile](https://github.com/Mozilla-Ocho/llamafile) bundles llama.cpp + a model into **a single executable**. Same file runs on macOS, Linux, Windows. Great for multi-OS demos and air-gapped environments.

chmod +x llava-v1.5-7b-q4.llamafile

./llava-v1.5-7b-q4.llamafile --server

Chapter 14 · Quantization Formats — GGUF / AWQ / GPTQ / EXL / MXFP4 / BitNet

Source models are typically BF16 (2 bytes/param). A 7B model is 14GB. Too heavy for a laptop. Quantization trades precision for footprint.

GGUF (llama.cpp standard)

- Q2_K (smallest, low quality, rarely used)

- Q3_K_M (3-bit, 7B becomes 3GB — mobile)

- **Q4_K_M (4-bit, the sweet spot, most-used)**

- Q5_K_M (5-bit, better quality)

- Q6_K (6-bit, near-BF16)

- Q8_0 (8-bit, virtually no quality difference vs BF16, half the memory)

- FP16 / BF16 (not quantized, original)

`Q4_K_M` shrinks a 7B to about 4.5GB and costs 2-3% perplexity. Overwhelming default.
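To make the precision-for-footprint trade concrete, here is a deliberately simplified block quantizer in the spirit of the older `Q4_0` layout: 32 weights share one scale, and each weight is stored as a signed 4-bit integer. It is an illustration only. The K-quant formats such as `Q4_K_M` use a more elaborate scheme, and the function names are hypothetical.

```python
def quantize_block(weights):
    """Quantize one block of floats to signed 4-bit integers plus one scale."""
    amax = max(abs(w) for w in weights)
    scale = amax / 7 if amax else 1.0   # map the largest magnitude to +/-7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, q


def dequantize_block(scale, q):
    """Recover approximate floats from the 4-bit integers."""
    return [scale * v for v in q]


block = [0.1 * i - 1.55 for i in range(32)]   # 32 weights per block
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.4f} max_err={max_err:.4f}")

# Storage cost: 32 weights x 4 bits + one 16-bit scale per block.
bits_per_weight = (32 * 4 + 16) / 32
print(bits_per_weight, "bits/weight ->", round(7 * bits_per_weight / 8, 2), "GB for 7B")
```

The rounding error per weight is at most half a quantization step, which is why larger blocks or outlier weights (which inflate the scale) cost quality.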

AWQ (Activation-aware Weight Quantization)

Common with vLLM and TGI. Faster than GPTQ at similar quality. 4-bit is standard.

GPTQ

Older. Quantize with AutoGPTQ. 4-bit standard. Gradually losing ground to AWQ.

EXL2 / EXL3

[ExLlamaV2/V3](https://github.com/turboderp-org/exllamav2). NVIDIA RTX specialized. Mixes 4 / 6 / 8-bit within a model — under 1% perplexity hit. ExLlamaV3 shipped late 2025 with better quantization efficiency.

MXFP4

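MXFP4 (Microscaling FP4) comes from the Open Compute Project's Microscaling Formats specification (2023); OpenAI brought it into the mainstream in 2025 by releasing gpt-oss weights in MXFP4. Hardware-accelerated on NVIDIA Blackwell (B200, RTX 5090). Better quality than INT4 with quarter-of-BF16 footprint.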

BitNet (1.58-bit)

Microsoft research. Weights are -1, 0, +1. Almost no multiplications at inference — very fast. BitNet b1.58 3B and 7B released on Hugging Face in 2026. Experimental but huge potential for embedded / mobile.
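A small sketch of why ternary weights remove multiplications. The quantizer follows the absmean idea described for BitNet b1.58 (scale by the mean absolute weight, then round and clip to -1, 0, +1); the toy vectors and function names are illustrative assumptions, not Microsoft's implementation.

```python
def ternarize(weights, eps=1e-8):
    """Absmean quantization: scale by mean |w|, then round and clip to {-1, 0, +1}."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return scale, q


def ternary_dot(q, x):
    """Dot product with ternary weights using only additions and subtractions."""
    acc = 0.0
    for w, v in zip(q, x):
        if w == 1:
            acc += v
        elif w == -1:
            acc -= v
    return acc


weights = [0.9, -0.05, -1.2, 0.4, 0.02, -0.7]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
scale, q = ternarize(weights)
print(q)
print(scale * ternary_dot(q, x))  # one multiply per output, not per weight
```

The single multiplication by `scale` happens once per output value, so the inner loop is pure add/subtract/skip.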

Which to pick

- Desktop / laptop, Ollama / llama.cpp → **GGUF Q4_K_M**

- vLLM server, NVIDIA → **AWQ**

- Single NVIDIA, max efficiency → **EXL3**

- Apple Silicon → **MLX 4-bit**

Chapter 15 · Recommended Local Models — May 2026

General — Llama 4 Scout 109B MoE

Meta's [Llama 4 Scout](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct). 16-expert MoE with 17B active params out of 109B total, so per-token compute is 17B-class while the memory footprint is 109B-class; quality is near 70B. 24 tokens/sec on M4 Max 128GB. Context window up to 10M tokens.

General (practical) — Llama 3.3 70B

[Llama 3.3 70B Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct). The 70B-class standard. GPT-4 Turbo level. 42GB at Q4_K_M. Dual RTX 5090 or M2 Ultra 64GB.

Reasoning — DeepSeek R1 Distill 32B

[DeepSeek R1](https://www.deepseek.com/)'s Llama / Qwen distill series. **32B Q4 = single RTX 4090**. o1-mini-class reasoning. Strong on math, code, logic.

ollama pull deepseek-r1:32b

ollama pull deepseek-r1:7b # for laptops

Multilingual — Qwen 3 14B

[Alibaba Qwen 3](https://qwenlm.github.io/). Strong in Korean / Chinese / Japanese / English. Often outperforms Llama on Korean text. 14B Q4_K_M on a single RTX 4070 (12GB).

Small-model champion — Phi-4 14B

[Microsoft Phi-4](https://huggingface.co/microsoft/phi-4). The "data curation is the answer" result. 14B with 70B-class benchmarks. The cost-perf winner for laptops.

Very small — Gemma 3 12B / 4B

Google [Gemma 3](https://huggingface.co/google/gemma-3-12b-it). 27B / 12B / 4B / 1B lineup. The smaller sizes target mobile / embedded / notebook use: the 4B is smaller than the 7B class with comparable performance.

Light + multilingual — MiniCPM 3.0 4B

OpenBMB's [MiniCPM 3.0](https://huggingface.co/openbmb/MiniCPM3-4B). 4B competitive with 8B models. Optimized for mobile / edge.

Code — DeepSeek Coder V2 Lite 16B

[DeepSeek Coder V2](https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct). 16B MoE (2.4B active). 10GB at Q4. Popular as Continue.dev / Cline backend.

Multimodal — LLaVA 34B, Qwen2-VL 7B, Pixtral 12B

Image + text. LLaVA is the standard, Qwen2-VL is multilingual-strong, Pixtral is Mistral's vision model.

ollama pull llava:34b

ollama pull qwen2-vl:7b

Chapter 16 · Voice Mode — STT + LLM + TTS

STT (speech → text)

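- **OpenAI Whisper** — the standard. base / small / medium / large-v3. large-v3 needs about 10GB of VRAM with the reference implementation; optimized runtimes such as faster-whisper need roughly half that.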

- **faster-whisper** — CTranslate2 backend. Fast on CPU and GPU.

- **whisper.cpp** — C++ port, Apple Silicon Metal accelerated.

- **Distil-Whisper** — Whisper distillation, 6x faster.

TTS (text → speech)

- **Piper** — Rhasspy project. Fast on CPU, Korean voices available.

- **Coqui XTTS v2** — multilingual + voice cloning. (Coqui dissolved 2024, models remain)

- **F5-TTS** — released 2025. English / Chinese naturalness near top tier. Voice cloning.

- **Kokoro** — tiny (82M) English TTS. Real-time on a notebook CPU.

- **Cartesia Sonic** — commercial API but extremely fast.

Open WebUI voice integration

Settings → Audio

STT: faster-whisper (local) or Whisper API

TTS: Piper (local), Kokoro (local), ElevenLabs (cloud)

Tap the mic and the STT → LLM → TTS pipeline runs, giving you a ChatGPT-style voice conversation hands-free.

Chapter 17 · Code Assistant — Continue.dev + Ollama

Continue.dev

[Continue.dev](https://www.continue.dev/) is a VS Code / JetBrains extension. A Cursor / Copilot alternative. You choose the model backend freely, and local Ollama works.

// ~/.continue/config.json

{

"models": [

{

"title": "Local Coder",

"provider": "ollama",

"model": "deepseek-coder-v2:16b-lite-instruct",

"apiBase": "http://localhost:11434"

}

],

"tabAutocompleteModel": {

"title": "Tab",

"provider": "ollama",

"model": "qwen2.5-coder:7b"

}

}

Tab autocomplete uses Qwen2.5-Coder 7B (fast); chat uses DeepSeek Coder V2 16B (quality). 100% local, zero API cost, code stays on-device.

Cline + Ollama

[Cline](https://cline.bot/) (formerly Claude Dev) is agentic. File read/write, command exec, Plan/Act mode. An Ollama backend works, but a 70B+ reasoning model is recommended because agent loops are demanding.

aider

[aider](https://aider.chat/) is a terminal pair programmer. Git-based. Ollama backend.

aider --model ollama/qwen2.5-coder:32b

Chapter 18 · Apple Intelligence — OS-Level On-Device

[Apple Intelligence](https://www.apple.com/apple-intelligence/) is GA on iOS 18, iPadOS 18, macOS 15 Sequoia, visionOS 2. Two pieces.

1. **On-device 3B model** — runs on Apple Silicon NPU. Notification summaries, Mail reply suggestions, text refinement, Image Playground.

2. **Private Cloud Compute (PCC)** — for bigger workloads, offloaded to Apple Silicon servers. Logs are not persisted; only attested code runs (source published to external security researchers).

Foundation Models framework

import FoundationModels

let session = LanguageModelSession()

let resp = try await session.respond(to: "Summarize my note in 3 lines")

print(resp.content)

Available on iOS 26+ / macOS 26+ (the framework shipped with the 2025 OS releases). 3B-bound but free and unlimited.

Limits

- English-first; Korean / Japanese GA in stages through 2025

- 3B is too small for complex tasks — hence the PCC handoff

- Requires iPhone 15 Pro or M1 or later

Chapter 19 · Phi Silica — On-Device AI for Windows 11

Microsoft ships [Phi Silica](https://blogs.windows.com/windowsexperience/2024/05/20/unlocking-ai-productivity-and-creativity-with-copilot-pcs-windows-11-features/), a 3.3B model, on the NPU of Snapdragon X Elite / Intel Core Ultra / AMD Ryzen AI. Standard on Copilot+ PCs since Windows 11 24H2.

Capabilities

- Summarize / rewrite / translate

- Code assist (Visual Studio integration)

- Image generation (Cocreator)

- Search (Recall — semantic search over captured screens)

Recall was paused over security concerns in 2024 and re-shipped in 2025 with opt-in + E2E encryption.

Developer API

The [Microsoft.Windows.AI.Generative](https://learn.microsoft.com/en-us/windows/ai/) namespace in Windows Copilot Runtime. Callable from C# / Rust / C++.

Chapter 20 · Gemini Nano — Android and Chrome

[Gemini Nano](https://deepmind.google/technologies/gemini/) is Google's smallest Gemini variant. Available on Pixel 8 Pro and up, parts of Galaxy S24+, and Chrome desktop (Canary / Beta + partial stable as of May 2026).

Chrome Built-in AI

// Origin Trial active as of May 2026; availability varies by Chrome channel

const session = await LanguageModel.create({

  initialPrompts: [{ role: "system", content: "You are a summarization expert." }],

})

const summary = await session.prompt("Summarize this article in 3 lines: ...")

An LLM inside the browser. Zero network calls, zero cost. Web apps can finally lean on an "offline LLM."

Android AICore

val generativeModel = GenerativeModel(modelName = "gemini-nano")

val response = generativeModel.generateContent("summarize")

Chapter 21 · Korean Local AI Ecosystem

Lablup Backend.AI

[Lablup](https://www.lablup.com/)'s Backend.AI is an LLM training / inference platform. Manages vLLM, Triton, TensorRT on in-house GPU clusters. Many SOE / large-enterprise deployments in Korea in 2026.

Upstage Solar

[Upstage](https://www.upstage.ai/)'s Solar comes in 10.7B / Pro / Mini variants. Solar Mini 2.4B runs locally on laptops — registered in Ollama.

ollama pull upstage/solar-pro-preview

Naver Cloud HyperCLOVA X

Naver's HyperCLOVA X SEED 3B is open-weight (released 2025). Korean-specialized. Registered on Hugging Face — convertible for llama.cpp / Ollama.

KT, SKT, LG

- KT Mi:dm, SKT A.X 4.0 — proprietary 7B models (some weights open)

- LG AI Research EXAONE 3.5 — 2.4B / 7.8B / 32B. Non-commercial license but free for research

ollama pull exaone3.5:7.8b

Chapter 22 · Japanese Local AI Ecosystem

ELYZA

[ELYZA](https://elyza.ai/) (University of Tokyo spinout). Llama-based Japanese fine-tunes. ELYZA-japanese-Llama-3-8B directly in Ollama.

Rinna

[Rinna](https://rinna.co.jp/). MS Japan spinout. Japanese GPT, BERT, Llama tunes. Also voice synth / recognition.

Stockmark

[Stockmark-100B](https://stockmark.co.jp/). Japanese 100B model, business-domain-specialized. Partial weights public.

PFN PLaMo

[Preferred Networks](https://www.preferred.jp/)'s PLaMo. 13B / 100B. PLaMo Lite is open-weight — laptop-local feasible.

CyberAgent CALM

[CyberAgent](https://www.cyberagent.co.jp/)'s CALM3 22B. Japanese + dialogue-tuned. Single RTX 4090 at Q4.

Chapter 23 · Ops Know-How — N Models on One GPU

Loading two models on the same GPU often OOMs. Three remedies.

1. Hot-swap (Ollama default)

Ollama's `keep_alive` controls model retention in memory.

Unload 30 seconds after last use

ollama run qwen2.5:7b --keepalive 30s

Keep loaded forever

ollama run llama3.3:70b --keepalive -1m
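The same retention control is available per request through the `keep_alive` field of the REST API. A sketch that only builds the request bodies (posting them to `http://localhost:11434/api/generate` requires a running daemon); `generate_payload` is a hypothetical helper.

```python
import json


def generate_payload(model: str, prompt: str = "", keep_alive="5m") -> bytes:
    """Build a JSON body for POST /api/generate with an explicit keep_alive.

    keep_alive accepts a duration string ("30s", "10m"), 0 to unload right
    after the response, or a negative value to keep the model loaded.
    """
    body = {"model": model, "stream": False, "keep_alive": keep_alive}
    if prompt:
        body["prompt"] = prompt
    return json.dumps(body).encode()


short_lived = generate_payload("qwen2.5:7b", "hi", keep_alive="30s")
pinned = generate_payload("llama3.3:70b", "hi", keep_alive=-1)
# No prompt plus keep_alive=0 asks Ollama to unload the model immediately.
unload_now = generate_payload("qwen2.5:7b", keep_alive=0)
print(unload_now.decode())
```

Explicitly unloading the first model before loading a second one is a simple way to avoid the OOM case on a single GPU.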

2. Model router

If different services need different models, route through a self-hosted gateway such as LiteLLM.

litellm config.yaml

model_list:

  - model_name: chat

    litellm_params:

      model: ollama/qwen2.5:14b

      api_base: http://localhost:11434

  - model_name: code

    litellm_params:

      model: ollama/deepseek-coder-v2:16b

      api_base: http://localhost:11434
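Conceptually, the gateway resolves a public alias to a backend model and endpoint. A plain-Python sketch of that lookup, mirroring the two entries above; `resolve` is a hypothetical helper and not part of LiteLLM.

```python
MODEL_LIST = [
    {"model_name": "chat",
     "litellm_params": {"model": "ollama/qwen2.5:14b",
                        "api_base": "http://localhost:11434"}},
    {"model_name": "code",
     "litellm_params": {"model": "ollama/deepseek-coder-v2:16b",
                        "api_base": "http://localhost:11434"}},
]


def resolve(alias: str) -> tuple:
    """Return (provider, backend model, api_base) for a public alias."""
    for entry in MODEL_LIST:
        if entry["model_name"] == alias:
            params = entry["litellm_params"]
            # "ollama/qwen2.5:14b" -> provider "ollama", model "qwen2.5:14b"
            provider, _, model = params["model"].partition("/")
            return provider, model, params["api_base"]
    raise KeyError(f"no route for alias {alias!r}")


print(resolve("code"))
```

Clients then ask for `chat` or `code` and never need to know which model tag currently backs each alias.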

3. vLLM continuous batching

When many users hit at once, vLLM uses PagedAttention to serve N concurrent requests against one model. Ten people can chat against a single 70B.
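A toy simulation of why continuous batching helps, under simplifying assumptions: every decode step emits one token per active request, and prefill cost is ignored. It is not vLLM's scheduler, only an illustration of the idea, and both function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every decode step."""
    queue = deque(lengths)
    active = []   # remaining tokens for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one decode step: one token for every active request
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


output_lengths = [2, 8, 2, 8]   # tokens each request will generate
print("static:", static_batching_steps(output_lengths, batch_size=2))
print("continuous:", continuous_batching_steps(output_lengths, batch_size=2))
```

In the static case, short requests leave their slot idle while the long one finishes; the continuous scheduler hands that slot to the next request immediately.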

Chapter 24 · RAG Patterns — Local Embedding

Embedding models (local)

- **nomic-embed-text** — 768-dim, English SOTA-ish, registered in Ollama

- **mxbai-embed-large** — 1024-dim, better quality, slightly slower

- **bge-m3** — multilingual strong (Korean / Japanese / Chinese)

- **multilingual-e5-large** — multilingual / notebook-friendly

ollama pull nomic-embed-text

ollama pull mxbai-embed-large

ollama pull bge-m3

Local vector DBs

- **LanceDB** — embedded, on-disk, single file. AnythingLLM default.

- **ChromaDB** — Python lib + server mode

- **Qdrant** — Rust server, very fast

- **Weaviate** — full-stack

- **Milvus** — large-scale

import lancedb

db = lancedb.connect("./data")

table = db.create_table("docs", data=[{"vector": embed("text"), "text": "text"}])  # embed() is your embedding function

table.search(embed("query")).limit(5).to_pandas()
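Underneath any of these stores, retrieval is a nearest-neighbour search over embedding vectors. A dependency-free sketch with hand-written 3-dimensional vectors standing in for real embeddings (which would come from a model such as nomic-embed-text); `cosine` and `top_k` are hypothetical helpers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, docs, k=2):
    """Return the k (score, text) pairs most similar to the query."""
    scored = [(cosine(query_vec, vec), text) for text, vec in docs]
    return sorted(scored, reverse=True)[:k]


# Toy vectors; real embeddings have hundreds of dimensions.
docs = [
    ("Ollama keep_alive settings", [0.9, 0.1, 0.0]),
    ("LanceDB table schema", [0.1, 0.9, 0.1]),
    ("Whisper model sizes", [0.0, 0.1, 0.9]),
]
for score, text in top_k([0.8, 0.2, 0.0], docs):
    print(f"{score:.3f}  {text}")
```

A vector database does the same ranking with an index so it does not have to scan every document, and the top results are what gets injected into the prompt as context.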

Chapter 25 · Security and Compliance

"Local equals safe?" — No

Local LLMs remove some cloud risks but introduce new ones.

- **Prompt injection** — hidden "ignore previous instructions" in documents → identical risk locally

- **Data leakage** — RAG may pull docs the user has no right to

- **Model integrity** — a Hugging Face download might be backdoored — use official channels only

- **Fine-tune leakage** — weights tuned on company data may leak PII

Ops guide

- Source models only from the official org (Meta, Microsoft, Google, Alibaba, DeepSeek HF orgs)

- Verify hashes after download

- Internal RAG needs access control (AnythingLLM workspace-scoped)

- Logging and audit — pipe Open WebUI admin logs into your SIEM
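The hash check in the list above can be scripted with the standard library. A minimal sketch, assuming the publisher lists a SHA-256 checksum for the file; `sha256_of` and `verify` are hypothetical helper names.

```python
import hashlib
import hmac


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so multi-GB model files never sit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """Compare against the checksum published by the model's official source."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

Usage would look like `verify("models/qwen2.5-14b-instruct-q4_k_m.gguf", published_checksum)`, where `published_checksum` is copied from the official model page; refuse to load the file when it returns `False`.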

Compliance mapping

| Regulation | Cloud LLM | Local LLM |

| --- | --- | --- |

| GDPR | Transfer requires DPA | No third-party transfer; GDPR still applies to local processing |

| HIPAA | BAA required | Own infrastructure — full control |

| Korea PIPA | Overseas transfer consent | Domestic processing — simple |

| Japan APPI | Consent + safeguards | Same, lower external risk |

| Korea FSI | Cloud security cert mandatory | Self-controlled infra |

Chapter 26 · Conclusion — Local AI as 2026 Table Stakes

Local LLMs were a hobby in 2023, an experiment in 2024, an option in 2025. In 2026 they are **a developer's basic skill**.

- **One laptop** + Ollama + Continue.dev → the API bill drops and code never leaves the machine

- **In-house GPU server** + Open WebUI + AnythingLLM → self-run company ChatGPT

- **iPhone** + Apple Intelligence → handled by the OS

- **Personal notes** + Reor / Khoj → semantic search over all notes

A 5-minute workflow to try right now.

1. Install Ollama

brew install ollama

2. Pull a model

ollama pull qwen2.5:14b-instruct

3. Chat

ollama run qwen2.5:14b-instruct

4. Spin up Open WebUI (if Docker is on the box)

docker run -d -p 3000:8080 \

-v open-webui:/app/backend/data \

--add-host=host.docker.internal:host-gateway \

ghcr.io/open-webui/open-webui:main

Open `http://localhost:3000` in a browser and your own ChatGPT runs on a laptop. No data leaves, cost is electricity, and the plane Wi-Fi being down doesn't stop you. That is the May-2026 landscape.

Chapter 27 · References

- Ollama official — https://ollama.com/

- Ollama model library — https://ollama.com/library

- LM Studio — https://lmstudio.ai/

- Jan — https://jan.ai/

- Msty — https://msty.app/

- GPT4All — https://gpt4all.io/

- Open WebUI — https://openwebui.com/

- LibreChat — https://www.librechat.ai/

- AnythingLLM — https://anythingllm.com/

- PrivateGPT — https://privategpt.dev/

- Khoj — https://khoj.dev/

- Reor — https://reor.app/

- Pinokio — https://pinokio.computer/

- Chatbox — https://chatboxai.app/

- llama.cpp — https://github.com/ggml-org/llama.cpp

- MLX-LM — https://github.com/ml-explore/mlx-examples

- Llamafile — https://github.com/Mozilla-Ocho/llamafile

- Continue.dev — https://www.continue.dev/

- Cline — https://cline.bot/

- aider — https://aider.chat/

- Hugging Face — https://huggingface.co/

- Apple Intelligence — https://www.apple.com/apple-intelligence/

- Microsoft Phi Silica — https://learn.microsoft.com/en-us/windows/ai/

- Chrome Built-in AI — https://developer.chrome.com/docs/ai

- Lablup Backend.AI — https://www.lablup.com/

- Upstage Solar — https://www.upstage.ai/

- LG EXAONE — https://www.lgresearch.ai/

- ELYZA — https://elyza.ai/

- Preferred Networks PLaMo — https://www.preferred.jp/

- CyberAgent CALM — https://www.cyberagent.co.jp/
