MLX Deep Dive — Apple's ML Framework for Apple Silicon: Unified Memory, Lazy Graphs, and the Mac-Native Flow (2026 Hands-On)
Prologue — Why LLMs Suddenly Got Fast on Macs
Late 2023. Apple's ML team quietly pushed a framework to GitHub called mlx. The names on the commits were familiar — core contributors to PyTorch and JAX. This time, the target was not NVIDIA GPUs or TPUs. It was Apple Silicon, and only Apple Silicon.
Two years on, in 2026, the workflow for running local LLMs on a Mac has essentially converged on MLX. LM Studio, Ollama, the Hugging Face demos, and a steady wave of indie desktop apps all use MLX as their backend.
The reason fits in one sentence.
"The M-series GPU shares RAM with the CPU. So there is no copy between host and device."
That single sentence explains the order-of-magnitude gap between MLX and PyTorch's MPS backend. This post unpacks it end to end — the unified-memory thesis, lazy graphs, mlx-lm and mlx-vlm, the Python and Swift APIs, and real tokens-per-second numbers on actual workloads.
1. The Unified-Memory Thesis — Where MLX Starts
Apple Silicon's memory architecture differs from NVIDIA's. That is where every MLX conversation has to begin.
On an NVIDIA system the GPU has its own VRAM. CPU RAM and GPU VRAM are physically separate chips, and data has to be explicitly copied over PCIe between them.
Traditional GPU (NVIDIA)             Apple Silicon (M-series)
───────────────────────              ─────────────────────────
┌────┐    PCIe    ┌─────┐            ┌─────────────────────┐
│CPU │ ◀────────▶ │ GPU │            │      CPU + GPU      │
│RAM │            │VRAM │            │  same memory pool   │
└────┘            └─────┘            └─────────────────────┘
  copies are required                   no copies needed
The standard PyTorch idiom assumes that separation. x.to("cuda"), tensor.cpu() — each of those calls is a PCIe round trip. On a large model that copy traffic is a meaningful fraction of inference latency.
On Apple Silicon the CPU and GPU see the same memory pool. A tensor produced on the CPU can be read by the GPU with no copy — just a shared pointer. This is unified memory.
MLX wired that fact into the deepest layer of the framework. In MLX, arrays do not have a device — devices are chosen at operation time.
import mlx.core as mx
a = mx.array([1.0, 2.0, 3.0])
b = mx.array([4.0, 5.0, 6.0])
# Compute on the CPU
c_cpu = mx.add(a, b, stream=mx.cpu)
# Compute on the GPU — same a, b. No copies.
c_gpu = mx.add(a, b, stream=mx.gpu)
In PyTorch you would need a.to("mps"). In MLX that call does not exist. The notion of "moving to a device" is just not in the model.
This is not an API cosmetic difference. It is the entire reason LLMs are fast on Macs.
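To make the contrast concrete, here is a minimal sketch of the two device-handling models (assuming a machine where PyTorch's MPS backend is available; the variable names are just illustrative):
import torch
import mlx.core as mx

# PyTorch MPS: tensors are explicitly moved into "device memory" and back
t = torch.randn(1024, 1024)
t_mps = t.to("mps")                # host -> device
out = (t_mps @ t_mps).cpu()        # device -> host

# MLX: no moves; the same array is visible to both CPU and GPU
a = mx.random.normal((1024, 1024))
r_gpu = mx.matmul(a, a, stream=mx.gpu)   # run on the GPU
r_cpu = mx.matmul(a, a, stream=mx.cpu)   # run on the CPU — same array, no copy
mx.eval(r_gpu, r_cpu)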
2. MLX vs PyTorch MPS vs JAX-Metal vs llama.cpp
There are several ways to run an LLM on a Mac. Start by naming each option's actual thesis.
| Option | Thesis | Strengths | Weaknesses |
|---|---|---|---|
| MLX | Native framework purpose-built for Apple Silicon | Unified memory, lazy graph, full GPU utilization | Apple Silicon only, smaller ecosystem |
| PyTorch MPS | Metal backend added to PyTorch | Existing PyTorch code mostly works | CUDA-style adaptation, 4GB tensor cap, slow |
| JAX-Metal | JAX's experimental Metal backend | Reuse existing JAX code | Experimental, feature gaps, slow updates |
| llama.cpp | C++ inference engine (Metal/CPU/CUDA) | Runs anywhere, small footprint | Inference only, no fine-tuning |
Where PyTorch MPS Is "Good Enough"
PyTorch's Metal Performance Shaders backend adapts CUDA-style operations to Metal. That produces two problems.
- The memory model is not optimized for unified memory. PyTorch still treats tensors as living in "device memory," so the unified-memory advantage cannot be fully exploited.
- There is a tensor size cap. PyTorch MPS has roughly a 4GB tensor limit, and contexts longer than about 2k tokens OOM frequently.
The benchmark gap is stark — for Llama inference, reported numbers put MLX around 230 tokens/sec while PyTorch MPS gets 7–9 tokens/sec on the same chip, same model. Single digits versus triple digits.
When PyTorch MPS is "good enough": quickly bringing existing training code onto a Mac for small experiments, or prototyping before moving to a CUDA cluster. Not for production-grade local inference.
JAX-Metal — Interesting, Still Experimental
Apple's JAX-Metal plugin exists. It works. But it is experimental, it does not cover all JAX features, and updates lag. It makes sense only if you already have a JAX codebase and want to run a subset of it on a Mac.
llama.cpp — When CPU-Only Is Fine
llama.cpp is a C++ inference engine. It runs on Metal, CUDA, and CPU, has a tiny footprint, and its quantized format (gguf) has become a de facto standard.
When llama.cpp wins:
- Embedded / CLI setups where a small footprint matters and you only need inference.
- You have to support Mac and Linux from the same binary.
- Fine-tuning happens elsewhere; local is inference only.
When MLX wins:
- You want fine-tuning on the same Mac.
- You want to embed the same weights into a Swift app (iOS/macOS).
- You work in Python with tensors directly and experiment with new model architectures.
3. The Lazy Computation Graph — JAX's Heir
MLX's other defining design decision is lazy evaluation. It is an idea the same team brought over from JAX.
PyTorch is eager by default. Calling a + b runs the computation right there. MLX does not — mx.add(a, b) builds a graph node and waits, only running the actual computation when the result is materialized.
import mlx.core as mx
a = mx.array([1.0, 2.0, 3.0])
b = mx.array([4.0, 5.0, 6.0])
# Nothing has been computed yet at this point
c = a + b # graph node only
d = c * 2.0 # another node
# Computation happens when we materialize the result
mx.eval(d)
# Or any call like print(d) implicitly evals
Why this is good:
- Operator fusion — a + b and c * 2.0 can be fused into a single Metal kernel. Fewer round trips through memory, higher GPU occupancy.
- Memory allocation wins — intermediate results may never need to be materialized in memory.
- Graph-level optimization — JIT compilation similar to JAX (mx.compile in MLX); see the short sketch after this list.
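A minimal mx.compile sketch (the function name fused_step is just an illustration):
import mlx.core as mx

@mx.compile          # the traced graph is compiled once and reused on later calls
def fused_step(a, b):
    return (a + b) * 2.0

a = mx.array([1.0, 2.0, 3.0])
b = mx.array([4.0, 5.0, 6.0])
out = fused_step(a, b)
mx.eval(out)         # materialize the result of the compiled graph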
When you will miss eager: debugging. Wanting to print(x) to inspect every intermediate value is more direct in PyTorch. In MLX you either call mx.eval(x) explicitly or convert to NumPy to force materialization.
It is a trade-off, but for workloads that re-run the same graph over and over — exactly what local LLMs are — the lazy graph's benefits dominate.
4. mlx-lm — The Standard Tool for LLM Inference and Fine-Tuning
MLX itself is a low-level array library. To work with LLMs you install mlx-lm.
pip install mlx-lm
What you get:
- Download and convert Hugging Face models — Llama, Qwen, DeepSeek, Phi, Mistral, almost any transformer family.
- Quantization — 4-bit / 8-bit to fit memory.
- Local inference — one-line CLI, or in Python.
- OpenAI-compatible server mode — start mlx_lm.server and any OpenAI SDK can target it.
- LoRA / QLoRA / DoRA fine-tuning — same command line.
CLI — Run Llama 3.x
# Download a 4-bit quantized Llama 3.1 8B and generate
mlx_lm.generate \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--prompt "In one paragraph, why Apple Silicon is favorable for LLMs." \
--max-tokens 256
That one line downloads the model, applies quantization, and generates. The first run pulls the weights; afterwards it uses the cache.
Python
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")
response = generate(
model,
tokenizer,
prompt="State the unified-memory advantage of MLX in three lines.",
max_tokens=256,
temp=0.7,
)
print(response)
load() returns the model and tokenizer together. generate() runs text generation. That is essentially it — no separate .to(device) call, no .eval() mode switch.
OpenAI-Compatible Server
mlx_lm.server --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --port 8080
Once it is up, any OpenAI Python SDK can target it by setting base_url="http://localhost:8080/v1". This is the pattern most often seen when a desktop app or a LangChain/LangGraph workflow uses a local Mac as its inference backend.
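A minimal client-side sketch, assuming the openai Python package is installed (the api_key value is a placeholder the local server does not check):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="mlx-community/Meta-Llama-3.1-8B-Instruct-4bit",
    messages=[{"role": "user", "content": "Why is unified memory good for local LLMs?"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)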
5. mlx-vlm — Vision-Language Models
When text alone is not enough, there is mlx-vlm.
pip install mlx-vlm
Supported models include LLaVA, Qwen-VL, Phi-Vision, Idefics, and PaliGemma — most of the major VLM families. Usage mirrors mlx-lm.
from mlx_vlm import load, generate
model, processor = load("mlx-community/Qwen2-VL-7B-Instruct-4bit")
response = generate(
model,
processor,
image="./screenshot.png",
prompt="Summarize the error message shown on this screen in plain English.",
max_tokens=256,
)
print(response)
Why VLM on a Mac matters: for workflows around screenshots, documents, and UI captures (note apps, automation tools, accessibility assistants), staying off the cloud is itself a feature. And thanks to unified memory, there is no overhead copying large images to the GPU.
6. Python and Swift — Same Core, Two Entry Points
MLX's core is C++, and on top of it sit Python and Swift high-level APIs. (Core operations are also exposed in C and C++.)
┌────────────────────┬────────────────────┐
│     Python API     │     Swift API      │
│ (Jupyter, research,│   (iOS, macOS,     │
│  server)           │   visionOS apps)   │
├────────────────────┴────────────────────┤
│              MLX core (C++)             │
│      arrays, autodiff, lazy graph       │
├─────────────────────────────────────────┤
│              Metal backend              │
│  (M-series GPU, Neural Accelerators)    │
└─────────────────────────────────────────┘
Python API — Research and Server
Use mlx.core for arrays and ops, mlx.nn for modules, mlx.optimizers for optimizers. The API maps almost one-to-one to PyTorch, so migration cost is low.
import mlx.core as mx
import mlx.nn as nn
class SmallModel(nn.Module):
def __init__(self, dim=128):
super().__init__()
self.linear1 = nn.Linear(dim, dim)
self.linear2 = nn.Linear(dim, dim)
def __call__(self, x):
x = mx.maximum(self.linear1(x), 0.0) # ReLU
return self.linear2(x)
model = SmallModel()
x = mx.random.normal((4, 128))
y = model(x)
mx.eval(y)
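To show the mlx.optimizers side as well, here is a minimal training-step sketch built on the SmallModel above (the loss and data are placeholders):
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

model = SmallModel()
optimizer = optim.Adam(learning_rate=1e-3)

def loss_fn(model, x, y):
    return nn.losses.mse_loss(model(x), y)

# value_and_grad returns the loss and the gradients w.r.t. the model parameters
loss_and_grad = nn.value_and_grad(model, loss_fn)

x = mx.random.normal((4, 128))
y = mx.random.normal((4, 128))

loss, grads = loss_and_grad(model, x, y)
optimizer.update(model, grads)                 # apply the gradient step
mx.eval(model.parameters(), optimizer.state)   # materialize the updated parameters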
Swift API — iOS, macOS, visionOS Apps
mlx-swift exposes the same core in Swift. Why this is interesting: train in Python, then load the very same weights from Swift and embed them into an iOS app. Where PyTorch would force a detour through ONNX/CoreML conversion, MLX uses the same format and the same core.
import MLX
import MLXNN
import MLXRandom

// SmallModel is the Swift counterpart of the Python module above
let model = SmallModel()
let x = MLXRandom.normal([4, 128])
let y = model(x)
eval(y)
The same model code runs on Mac, iPhone, iPad, and Vision Pro. That is a position PyTorch or JAX cannot easily occupy.
7. The Metal Backend — TensorOps and Neural Accelerators
MLX's GPU backend is Metal — Apple's graphics and compute API, the position CUDA holds on NVIDIA.
Starting with M5, the TensorOps and Metal Performance Primitives frameworks introduced with Metal 4 expose Neural Accelerator tensor operations directly. MLX uses these for matmul, attention, convolution, and the other core ops.
What you, as a developer, need to know:
- You rarely write Metal yourself when using MLX. The hook exists if you want to add a custom op via a Metal shader.
- Quantization is handled inside MLX. 4-bit and 8-bit kernels are optimized at the Metal level.
- M2 → M3 → M4 → M5 brings more GPU cores, more memory bandwidth, more Neural Accelerators. Your MLX code automatically runs faster on a new chip — no recompile needed.
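Picking up the quantization point from the list above: the primitives are exposed directly in mlx.core. A rough sketch (exact defaults can differ across MLX versions):
import mlx.core as mx

w = mx.random.normal((512, 512))

# Quantize to 4 bits with a group size of 64 (the common LLM setting)
w_q, scales, biases = mx.quantize(w, group_size=64, bits=4)

# Dequantize back and look at the approximation error
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=4)
print(mx.max(mx.abs(w - w_hat)))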
8. A Small Fine-Tune — One LoRA Lap on M-Series
Enough theory. Run an actual fine-tune. Scenario: domain-adapt Llama 3.1 8B on roughly 5,000 question/answer pairs.
Data Prep
mlx-lm's LoRA expects a JSONL file. One example per line:
{"text": "<s>[INST] question [/INST] answer </s>"}
Or the chat-template form:
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
Drop train.jsonl and valid.jsonl into a data/ directory.
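A minimal sketch for producing that chat-template JSONL, assuming pairs is your own list of (question, answer) tuples:
import json

pairs = [("What is unified memory?", "The CPU and GPU share one memory pool, so no copies are needed.")]

with open("data/train.jsonl", "w") as f:
    for question, answer in pairs:
        record = {"messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")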
LoRA Training Command
mlx_lm.lora \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--train \
--data ./data \
--batch-size 4 \
--lora-layers 16 \
--iters 1000 \
--learning-rate 1e-4 \
--adapter-path ./adapters
Passing a 4-bit quantized model together with --train automatically runs QLoRA — the base weights stay quantized, only the LoRA adapter weights are full precision. An M2 Pro fine-tunes a 7B model on 500 examples in about 20–25 minutes. On an M3 Max with 128GB you can QLoRA-tune up to a 70B model.
Inference After Training
mlx_lm.generate \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--adapter-path ./adapters \
--prompt "domain question..." \
--max-tokens 256
The Honest Picture on Training Throughput
Compared with an NVIDIA H100 the training throughput is far smaller — roughly 1–3 tok/s vs 30–50 tok/s. That means fine-tuning on Apple Silicon does not turn hours into minutes. It is well suited to these positions instead:
- Prototyping — testing new datasets or new prompt formats.
- Experimentation and learning — getting hands dirty with the fine-tuning pipeline.
- Small-scale domain adaptation — a few thousand to tens of thousands of examples on 7B–32B models.
Heavy lifting is still more cost-effective on cloud H100/A100. MLX's fine-tuning shines when you want a closed-loop workflow on your local machine.
9. Performance — Tokens/sec Across M-Series
Benchmarks always depend on model, quantization, context length, and temperature. That said, the rough numbers cited around May 2026 (median-ish across public reports) look like this:
| Model | M2 Pro | M3 Max | M4 Max | M2/M3 Ultra |
|---|---|---|---|---|
| Llama 3.1 8B (4-bit) | 30–40 t/s | 60–85 t/s | 80–110 t/s | 100–140 t/s |
| Qwen 2.5 7B (4-bit) | 30–45 t/s | 65–90 t/s | 90–120 t/s | 110–150 t/s |
| Qwen3 0.6B (4-bit) | ~250 t/s | ~400 t/s | ~525 t/s | ~600 t/s |
| Llama 3.1 70B (4-bit) | (OOM) | 8–12 t/s | 10–15 t/s | 15–22 t/s |
Compared with PyTorch MPS the MLX numbers are often an order of magnitude faster. Reported cases include MLX ~230 t/s vs PyTorch MPS 7–9 t/s on the same chip.
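If you want your own numbers rather than medians from public reports, a rough measurement sketch looks like this (it counts prompt processing together with generation, so treat it as a lower bound; mlx_lm's generate can also print its own speed stats when run verbosely):
import time
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

prompt = "Explain unified memory in one paragraph."
start = time.perf_counter()
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
elapsed = time.perf_counter() - start

# Rough token count of the generated text (re-tokenizing the output string)
generated_tokens = len(tokenizer.encode(text))
print(f"{generated_tokens / elapsed:.1f} tokens/sec including prompt processing")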
The strongest thesis: you can fine-tune and serve on the same Mac, with the same code, in a closed loop. That is a structural advantage over a NVIDIA cloud workflow that does not go away.
10. Real Workloads — Local Llama, Qwen, DeepSeek
Scenario 1 — Coding Assistant on a Laptop
- M3 Pro / 36GB.
- Qwen 2.5 Coder 7B / 4-bit.
- Run mlx_lm.server in OpenAI-compatible mode and point your editor's LLM integration at localhost:8080.
- About 50–70 t/s. Works offline.
Scenario 2 — Sales Note Summaries, Strictly Local
- M2 Air / 16GB.
- Llama 3.2 3B / 4-bit. (8B is tight at 16GB.)
- A Python script summarizes meeting notes and extracts next actions.
- No data leaves the device.
Scenario 3 — 70B Inference Workstation
- Mac Studio M2 Ultra / 192GB.
- DeepSeek 70B / 4-bit or Llama 3.1 70B / 4-bit.
- Expose mlx_lm.server on the office network — a cost-efficient team-scale inference backend.
Scenario 4 — VLM for Document OCR and Summaries
- M3 Max / 64GB.
- mlx-vlm + Qwen2-VL 7B / 4-bit.
- Feed screenshots and PDF page images, get text extraction plus a summary.
The thread running through all four: model, data, and inference all close in a single box. The Mac form factor sits naturally where MLX sits.
11. Decision Framework — When MLX, When Something Else
Local inference needed?
        │
   ┌────┴─────────┐
   ▼              ▼
Yes (Mac)     No (server)
   │              │
   ▼              ▼
Fine-tune on   PyTorch + CUDA
the same box?
   │
 ┌─┴──────┐
 ▼        ▼
Yes       No
 │         │
 ▼         ▼
MLX       Inference-only → llama.cpp / Ollama
 │        PyTorch only   → MPS (accept the perf hit)
 ▼
iOS / Swift integration needed?
   │
 ┌─┴───┐
 ▼     ▼
Yes    No
 │      │
 ▼      ▼
mlx-swift   Python MLX
Quick Table
| Situation | Pick |
|---|---|
| Local LLM inference on a Mac (personal) | MLX or llama.cpp-based tooling (Ollama, LM Studio) |
| Local fine-tuning + inference on a Mac | MLX (mlx-lm) |
| Embed the same model into iOS / macOS / visionOS apps | MLX (mlx-swift) |
| Quickly bring existing PyTorch code onto a Mac | PyTorch MPS (mind the limits) |
| Single inference engine for both Mac and Linux | llama.cpp |
| Large distributed training | NVIDIA + CUDA + PyTorch |
| JAX codebase you need to run on a Mac | JAX-Metal (experimental) |
12. Limitations and Trade-offs
MLX is not magic. Be honest about the seams.
Apple Silicon Only
MLX does not run on Intel Macs, Linux, or Windows. That is by design — the whole framework is optimized around Apple Silicon's unified memory. If you also need the same code on a Linux server, you still need a PyTorch or JAX implementation alongside it.
Smaller Ecosystem than PyTorch
The huge PyTorch ecosystem — Hugging Face Accelerate, DeepSpeed, axolotl, and so on — does not exist in MLX. The mlx-community hub has many converted models, but brand-new models often arrive on MLX days to weeks behind PyTorch. That gap is closing fast because mlx-lm can auto-convert Hugging Face checkpoints.
Immature Distributed Training
MLX is not the place to run multi-node distributed training. The sweet spot is a single machine. Large pretraining is still NVIDIA territory.
Debugging Subtleties
Lazy evaluation is occasionally confusing. When you print(x) you are implicitly running the whole graph up to that point, so on big graphs the moment you debug is the moment you pay for the computation. Sprinkling in explicit mx.eval() calls is a good habit.
Training Throughput Does Not Match H100
Fine-tuning throughput is a fraction of an H100. Heavy training is still more economical in the cloud.
Gap with Linux/Windows Workflows
If your team's main pipeline is Linux Docker / NVIDIA / PyTorch, MLX cannot be at the center. It works well as an auxiliary tool on the Mac workstations of some team members.
13. MLX and the Mac-Native Future — A Six-Month Outlook
Trends visible in late 2026:
- M5-generation and expanding TensorOps use — MLX already exploits M5 tensor acceleration, with quarterly performance improvements.
- Ollama and LM Studio adopting MLX backends — Mac users effectively use MLX even without choosing it directly.
- mlx-vlm growing fast — multimodal models becoming first-class citizens on a Mac.
- Swift side strengthening — mlx-swift-lm and a pattern of embedding LLMs into iOS / Vision Pro apps.
- OpenAI-compatible servers becoming standard — local inference backends increasingly expose the OpenAI API.
Zooming out, MLX is becoming the default infrastructure for doing ML on a Mac. Choosing not to use it is starting to feel like the unusual option.
Epilogue — Why the Mac Became an ML Workstation
It took NVIDIA two decades to become the default substrate for ML infrastructure. It took Apple Silicon two or three years to claim a seat as an ML workstation. Two things made the difference — the hardware thesis of unified memory, and the framework that surfaces that thesis exactly (MLX).
"PyTorch MPS imitates the NVIDIA model on Metal. MLX is the Apple Silicon model itself. Same GPU, different framework, different result."
When you run the same model on a Mac, MLX is almost always the fastest and almost always the most natural. And that naturalness carries into fine-tuning, VLM, and iOS embedding without breaking.
MLX Adoption Checklist
- Do you have an Apple Silicon (M1 or later) Mac?
- Did pip install mlx mlx-lm succeed? (Python 3.9+)
- Is the memory enough? (8B 4-bit fits in 16GB, 70B 4-bit needs 64GB+, 100B+ needs a 192GB Studio.)
- Have you decided inference only or fine-tuning too?
- Do you plan to expose results via an OpenAI-compatible server?
- Will the same weights be used in an iOS or macOS app?
- Have you accepted the gap with existing PyTorch code?
- Is "data does not leave the device" important for your workload?
- Have you settled when to call mx.eval() for debugging?
- Have you confirmed model licenses (Llama, Qwen, DeepSeek, etc.)?
Ten Anti-Patterns
- Trying to install MLX on a Linux server — wrong framework. Apple Silicon only.
- PyTorch MPS for long-context inference — OOM at the 4GB tensor cap. Move to MLX.
- Evaluating every tensor at once — running a huge graph in a single eval blows up memory and time. Step-wise evals.
- Trying to load a 70B model unquantized on a 16GB Mac — will not work. 4-bit + 64GB+ is the floor.
- Mixing MLX with NumPy/PyTorch tensors carelessly — conversion cost and device confusion. Keep boundaries clean.
- Falling back to PyTorch the moment a new model is missing — mlx-lm auto-converts most Hugging Face checkpoints.
- Expecting H100-class throughput on a Mac — fine-tuning throughput is far smaller. Big training belongs in the cloud.
- Hand-rolling APIs instead of using the OpenAI-compatible server — just run mlx_lm.server.
- Ignoring mlx-swift's possibilities — the iOS / visionOS value lives there.
- Evaluating MLX as a "PyTorch replacement" — it is not. MLX is a tool specialized for Apple Silicon. Its value proposition is different.
Next Posts
Possible follow-ups:
- Local LLM serving architecture — Ollama vs LM Studio vs mlx-lm server.
- Embedding LLMs in iOS — building a note-app assistant with mlx-swift.
- Quantization deep dive — measuring the accuracy impact of 4-bit / 8-bit / QLoRA.
"The M-series GPU shares RAM with the CPU. So MLX should be fast. And it is."
— MLX deep dive, end.
References
- MLX GitHub Repository (ml-explore/mlx)
- Apple Open Source — MLX Project Page
- MLX Official Documentation
- MLX Framework Site
- Apple Machine Learning Research — Exploring LLMs with MLX and Neural Accelerators in the M5 GPU
- WWDC25 — Explore Large Language Models on Apple Silicon with MLX
- mlx-swift GitHub Repository
- Ollama Blog — Ollama Is Now Powered by MLX on Apple Silicon (Preview)
- LM Studio Blog — LM Studio 0.3.4 Ships with Apple MLX
- Apple Silicon LLM Benchmarks — llmcheck.net
- Towards Data Science — How Fast Is MLX? Benchmarks on 8 Apple Silicon Chips and 4 CUDA GPUs
- arXiv — A Comparative Study of MLX, MLC-LLM, Ollama, llama.cpp
- DZone — Vision AI on Apple Silicon: A Practical Guide to MLX-VLM
- Markaicode — Run and Fine-Tune LLMs on Mac with MLX-LM 2026