MLX Deep Dive — Apple's ML Framework for Apple Silicon: Unified Memory, Lazy Graphs, and the Mac-Native Flow (2026 Hands-On)
Prologue — Why LLMs Suddenly Got Fast on Macs
Late 2023. Apple's ML team quietly pushed a framework to GitHub called mlx. The names on the commits were familiar — core contributors to PyTorch and JAX. This time, the target was not NVIDIA GPUs or TPUs. It was Apple Silicon, and only Apple Silicon.
Two years on, in 2026, the workflow for running local LLMs on a Mac has essentially converged on MLX. LM Studio, Ollama, the Hugging Face demos, and a steady wave of indie desktop apps all use MLX as their backend.
The reason fits in one sentence.
"The M-series GPU shares RAM with the CPU. So there is no copy between host and device."
That single sentence explains the order-of-magnitude gap between MLX and PyTorch's MPS backend. This post unpacks it end to end — the unified-memory thesis, lazy graphs, mlx-lm and mlx-vlm, the Python and Swift APIs, and real tokens-per-second numbers on actual workloads.
1. The Unified-Memory Thesis — Where MLX Starts
Apple Silicon's memory architecture differs from NVIDIA's. That is where every MLX conversation has to begin.
On an NVIDIA system the GPU has its own VRAM. CPU RAM and GPU VRAM are physically separate chips, and data has to be explicitly copied over PCIe between them.
Traditional GPU (NVIDIA)             Apple Silicon (M-series)
───────────────────────              ─────────────────────────
┌────┐    PCIe    ┌─────┐            ┌─────────────────────┐
│CPU │ ◀────────▶ │ GPU │            │      CPU + GPU      │
│RAM │            │VRAM │            │  same memory pool   │
└────┘            └─────┘            └─────────────────────┘
  copies are required                   no copies needed
The standard PyTorch idiom assumes that separation. x.to("cuda"), tensor.cpu() — each of those calls is a PCIe round trip. On a large model that copy traffic is a meaningful fraction of inference latency.
On Apple Silicon the CPU and GPU see the same memory pool. A tensor produced on the CPU can be read by the GPU with no copy — just a shared pointer. This is unified memory.
MLX wired that fact into the deepest layer of the framework. In MLX, arrays do not have a device — devices are chosen at operation time.
import mlx.core as mx
a = mx.array([1.0, 2.0, 3.0])
b = mx.array([4.0, 5.0, 6.0])
# Compute on the CPU
c_cpu = mx.add(a, b, stream=mx.cpu)
# Compute on the GPU — same a, b. No copies.
c_gpu = mx.add(a, b, stream=mx.gpu)
In PyTorch you would need a.to("mps"). In MLX that call does not exist. The notion of "moving to a device" is just not in the model.
This is not an API cosmetic difference. It is the entire reason LLMs are fast on Macs.
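To make the contrast concrete, here is a minimal sketch of the two device-handling models (assuming a machine where PyTorch's MPS backend is available; the variable names are just illustrative):
import torch
import mlx.core as mx

# PyTorch MPS: tensors are explicitly moved into "device memory" and back
t = torch.randn(1024, 1024)
t_mps = t.to("mps")                # host -> device
out = (t_mps @ t_mps).cpu()        # device -> host

# MLX: no moves; the same array is visible to both CPU and GPU
a = mx.random.normal((1024, 1024))
r_gpu = mx.matmul(a, a, stream=mx.gpu)   # run on the GPU
r_cpu = mx.matmul(a, a, stream=mx.cpu)   # run on the CPU — same array, no copy
mx.eval(r_gpu, r_cpu)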
2. MLX vs PyTorch MPS vs JAX-Metal vs llama.cpp
There are several ways to run an LLM on a Mac. Start by naming each option's actual thesis.
| Option | Thesis | Strengths | Weaknesses |
|---|---|---|---|
| MLX | Native framework purpose-built for Apple Silicon | Unified memory, lazy graph, full GPU utilization | Apple Silicon only, smaller ecosystem |
| PyTorch MPS | Metal backend added to PyTorch | Existing PyTorch code mostly works | CUDA-style adaptation, 4GB tensor cap, slow |
| JAX-Metal | JAX's experimental Metal backend | Reuse existing JAX code | Experimental, feature gaps, slow updates |
| llama.cpp | C++ inference engine (Metal/CPU/CUDA) | Runs anywhere, small footprint | Inference only, no fine-tuning |
Where PyTorch MPS Is "Good Enough"
PyTorch's Metal Performance Shaders backend adapts CUDA-style operations to Metal. That produces two problems.
- The memory model is not optimized for unified memory. PyTorch still treats tensors as living in "device memory," so the unified-memory advantage cannot be fully exploited.
- There is a tensor size cap. PyTorch MPS has roughly a 4GB tensor limit, and contexts longer than about 2k tokens OOM frequently.
The benchmark gap is stark — for Llama inference, reported numbers put MLX around 230 tokens/sec while PyTorch MPS gets 7–9 tokens/sec on the same chip, same model. Single digits versus triple digits.
When PyTorch MPS is "good enough": quickly bringing existing training code onto a Mac for small experiments, or prototyping before moving to a CUDA cluster. Not for production-grade local inference.
JAX-Metal — Interesting, Still Experimental
Apple's JAX-Metal plugin exists. It works. But it is experimental, it does not cover all JAX features, and updates lag. It makes sense only if you already have a JAX codebase and want to run a subset of it on a Mac.
llama.cpp — When CPU-Only Is Fine
llama.cpp is a C++ inference engine. It runs on Metal, CUDA, and CPU, has a tiny footprint, and its quantized format (gguf) has become a de facto standard.
When llama.cpp wins:
- Embedded / CLI setups where a small footprint matters and you only need inference.
- You have to support Mac and Linux from the same binary.
- Fine-tuning happens elsewhere; local is inference only.
When MLX wins:
- You want fine-tuning on the same Mac.
- You want to embed the same weights into a Swift app (iOS/macOS).
- You work in Python with tensors directly and experiment with new model architectures.
3. The Lazy Computation Graph — JAX's Heir
MLX's other defining design decision is lazy evaluation. It is an idea the same team brought over from JAX.
PyTorch is eager by default. Calling a + b runs the computation right there. MLX does not — mx.add(a, b) builds a graph node and waits, only running the actual computation when the result is materialized.
import mlx.core as mx
a = mx.array([1.0, 2.0, 3.0])
b = mx.array([4.0, 5.0, 6.0])
# Nothing has been computed yet at this point
c = a + b # graph node only
d = c * 2.0 # another node
# Computation happens when we materialize the result
mx.eval(d)
# Or any call like print(d) implicitly evals
Why this is good:
- Operator fusion — a + b and c * 2.0 can be fused into a single Metal kernel. Fewer round trips through memory, higher GPU occupancy.
- Memory allocation wins — intermediate results may never need to be materialized in memory.
- Graph-level optimization — JIT compilation similar to JAX (mx.compile in MLX); see the short sketch after this list.
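A minimal mx.compile sketch (the function name fused_step is just an illustration):
import mlx.core as mx

@mx.compile          # the traced graph is compiled once and reused on later calls
def fused_step(a, b):
    return (a + b) * 2.0

a = mx.array([1.0, 2.0, 3.0])
b = mx.array([4.0, 5.0, 6.0])
out = fused_step(a, b)
mx.eval(out)         # materialize the result of the compiled graph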
When you will miss eager: debugging. Wanting to print(x) to inspect every intermediate value is more direct in PyTorch. In MLX you either call mx.eval(x) explicitly or convert to NumPy to force materialization.
It is a trade-off, but for workloads that re-run the same graph over and over — exactly what local LLMs are — the lazy graph's benefits dominate.
4. mlx-lm — The Standard Tool for LLM Inference and Fine-Tuning
MLX itself is a low-level array library. To work with LLMs you install mlx-lm.
pip install mlx-lm
What you get:
- Download and convert Hugging Face models — Llama, Qwen, DeepSeek, Phi, Mistral, almost any transformer family.
- Quantization — 4-bit / 8-bit to fit memory.
- Local inference — one-line CLI, or in Python.
- OpenAI-compatible server mode — start mlx_lm.server and any OpenAI SDK can target it.
- LoRA / QLoRA / DoRA fine-tuning — same command line.
CLI — Run Llama 3.x
# Download a 4-bit quantized Llama 3.1 8B and generate
mlx_lm.generate \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--prompt "In one paragraph, why Apple Silicon is favorable for LLMs." \
--max-tokens 256
That one line downloads the model, applies quantization, and generates. The first run pulls the weights; afterwards it uses the cache.
Python
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")
response = generate(
model,
tokenizer,
prompt="State the unified-memory advantage of MLX in three lines.",
max_tokens=256,
temp=0.7,
)
print(response)
load() returns the model and tokenizer together. generate() runs text generation. That is essentially it — no separate .to(device) call, no .eval() mode switch.
OpenAI-Compatible Server
mlx_lm.server --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --port 8080
Once it is up, any OpenAI Python SDK can target it by setting base_url="http://localhost:8080/v1". This is the pattern most often seen when a desktop app or a LangChain/LangGraph workflow uses a local Mac as its inference backend.
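A minimal client-side sketch, assuming the openai Python package is installed (the api_key value is a placeholder the local server does not check):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="mlx-community/Meta-Llama-3.1-8B-Instruct-4bit",
    messages=[{"role": "user", "content": "Why is unified memory good for local LLMs?"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)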
5. mlx-vlm — Vision-Language Models
When text alone is not enough, there is mlx-vlm.
pip install mlx-vlm
Supported models include LLaVA, Qwen-VL, Phi-Vision, Idefics, and PaliGemma — most of the major VLM families. Usage mirrors mlx-lm.
from mlx_vlm import load, generate
model, processor = load("mlx-community/Qwen2-VL-7B-Instruct-4bit")
response = generate(
model,
processor,
image="./screenshot.png",
prompt="Summarize the error message shown on this screen in plain English.",
max_tokens=256,
)
print(response)
Why VLM on a Mac matters: for workflows around screenshots, documents, and UI captures (note apps, automation tools, accessibility assistants), staying off the cloud is itself a feature. And thanks to unified memory, there is no overhead copying large images to the GPU.
6. Python and Swift — Same Core, Two Entry Points
MLX's core is C++, and on top of it sit Python and Swift high-level APIs. (Core operations are also exposed in C and C++.)
┌────────────────────┬────────────────────┐
│     Python API     │     Swift API      │
│ (Jupyter, research,│   (iOS, macOS,     │
│  server)           │   visionOS apps)   │
├────────────────────┴────────────────────┤
│              MLX core (C++)             │
│      arrays, autodiff, lazy graph       │
├─────────────────────────────────────────┤
│              Metal backend              │
│  (M-series GPU, Neural Accelerators)    │
└─────────────────────────────────────────┘
Python API — Research and Server
Use mlx.core for arrays and ops, mlx.nn for modules, mlx.optimizers for optimizers. The API maps almost one-to-one to PyTorch, so migration cost is low.
import mlx.core as mx
import mlx.nn as nn
class SmallModel(nn.Module):
def __init__(self, dim=128):
super().__init__()
self.linear1 = nn.Linear(dim, dim)
self.linear2 = nn.Linear(dim, dim)
def __call__(self, x):
x = mx.maximum(self.linear1(x), 0.0) # ReLU
return self.linear2(x)
model = SmallModel()
x = mx.random.normal((4, 128))
y = model(x)
mx.eval(y)
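To show the mlx.optimizers side as well, here is a minimal training-step sketch built on the SmallModel above (the loss and data are placeholders):
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

model = SmallModel()
optimizer = optim.Adam(learning_rate=1e-3)

def loss_fn(model, x, y):
    return nn.losses.mse_loss(model(x), y)

# value_and_grad returns the loss and the gradients w.r.t. the model parameters
loss_and_grad = nn.value_and_grad(model, loss_fn)

x = mx.random.normal((4, 128))
y = mx.random.normal((4, 128))

loss, grads = loss_and_grad(model, x, y)
optimizer.update(model, grads)                 # apply the gradient step
mx.eval(model.parameters(), optimizer.state)   # materialize the updated parameters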
Swift API — iOS, macOS, visionOS Apps
mlx-swift exposes the same core in Swift. Why this is interesting: train in Python, then load the very same weights from Swift and embed them into an iOS app. Where PyTorch would force a detour through ONNX/CoreML conversion, MLX uses the same format and the same core.
import MLX
import MLXNN
import MLXRandom

// SmallModel is the Swift counterpart of the Python module above
let model = SmallModel()
let x = MLXRandom.normal([4, 128])
let y = model(x)
eval(y)
The same model code runs on Mac, iPhone, iPad, and Vision Pro. That is a position PyTorch or JAX cannot easily occupy.
7. The Metal Backend — TensorOps and Neural Accelerators
MLX's GPU backend is Metal — Apple's graphics and compute API, the position CUDA holds on NVIDIA.
Starting with M5, the TensorOps and Metal Performance Primitives frameworks introduced with Metal 4 expose Neural Accelerator tensor operations directly. MLX uses these for matmul, attention, convolution, and the other core ops.
What you, as a developer, need to know:
- You rarely write Metal yourself when using MLX. The hook exists if you want to add a custom op via a Metal shader.
- Quantization is handled inside MLX. 4-bit and 8-bit kernels are optimized at the Metal level.
- M2 → M3 → M4 → M5 brings more GPU cores, more memory bandwidth, more Neural Accelerators. Your MLX code automatically runs faster on a new chip — no recompile needed.
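Picking up the quantization point from the list above: the primitives are exposed directly in mlx.core. A rough sketch (exact defaults can differ across MLX versions):
import mlx.core as mx

w = mx.random.normal((512, 512))

# Quantize to 4 bits with a group size of 64 (the common LLM setting)
w_q, scales, biases = mx.quantize(w, group_size=64, bits=4)

# Dequantize back and look at the approximation error
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=4)
print(mx.max(mx.abs(w - w_hat)))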
8. A Small Fine-Tune — One LoRA Lap on M-Series
Enough theory. Run an actual fine-tune. Scenario: domain-adapt Llama 3.1 8B on roughly 5,000 question/answer pairs.
Data Prep
mlx-lm's LoRA expects a JSONL file. One example per line:
{"text": "<s>[INST] question [/INST] answer </s>"}
Or the chat-template form:
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
Drop train.jsonl and valid.jsonl into a data/ directory.
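A minimal sketch for producing that chat-template JSONL, assuming pairs is your own list of (question, answer) tuples:
import json

pairs = [("What is unified memory?", "The CPU and GPU share one memory pool, so no copies are needed.")]

with open("data/train.jsonl", "w") as f:
    for question, answer in pairs:
        record = {"messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")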
LoRA Training Command
mlx_lm.lora \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--train \
--data ./data \
--batch-size 4 \
--lora-layers 16 \
--iters 1000 \
--learning-rate 1e-4 \
--adapter-path ./adapters
Passing a 4-bit quantized model together with --train automatically runs QLoRA — the base weights stay quantized, only the LoRA adapter weights are full precision. An M2 Pro fine-tunes a 7B model on 500 examples in about 20–25 minutes. On an M3 Max with 128GB you can QLoRA-tune up to a 70B model.
Inference After Training
mlx_lm.generate \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--adapter-path ./adapters \
--prompt "domain question..." \
--max-tokens 256
The Honest Picture on Training Throughput
Compared with an NVIDIA H100 the training throughput is far smaller — roughly 1–3 tok/s vs 30–50 tok/s. That means fine-tuning on Apple Silicon does not turn hours into minutes. It is well suited to these positions instead:
- Prototyping — testing new datasets or new prompt formats.
- Experimentation and learning — getting hands dirty with the fine-tuning pipeline.
- Small-scale domain adaptation — a few thousand to tens of thousands of examples on 7B–32B models.
Heavy lifting is still more cost-effective on cloud H100/A100. MLX's fine-tuning shines when you want a closed-loop workflow on your local machine.
9. Performance — Tokens/sec Across M-Series
Benchmarks always depend on model, quantization, context length, and temperature. That said, the rough numbers cited around May 2026 (median-ish across public reports) look like this:
| Model | M2 Pro | M3 Max | M4 Max | M2/M3 Ultra |
|---|---|---|---|---|
| Llama 3.1 8B (4-bit) | 30–40 t/s | 60–85 t/s | 80–110 t/s | 100–140 t/s |
| Qwen 2.5 7B (4-bit) | 30–45 t/s | 65–90 t/s | 90–120 t/s | 110–150 t/s |
| Qwen3 0.6B (4-bit) | ~250 t/s | ~400 t/s | ~525 t/s | ~600 t/s |
| Llama 3.1 70B (4-bit) | (OOM) | 8–12 t/s | 10–15 t/s | 15–22 t/s |
Compared with PyTorch MPS the MLX numbers are often an order of magnitude faster. Reported cases include MLX ~230 t/s vs PyTorch MPS 7–9 t/s on the same chip.
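If you want your own numbers rather than medians from public reports, a rough measurement sketch looks like this (it counts prompt processing together with generation, so treat it as a lower bound; mlx_lm's generate can also print its own speed stats when run verbosely):
import time
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

prompt = "Explain unified memory in one paragraph."
start = time.perf_counter()
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
elapsed = time.perf_counter() - start

# Rough token count of the generated text (re-tokenizing the output string)
generated_tokens = len(tokenizer.encode(text))
print(f"{generated_tokens / elapsed:.1f} tokens/sec including prompt processing")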
The strongest thesis: you can fine-tune and serve on the same Mac, with the same code, in a closed loop. That is a structural advantage over a NVIDIA cloud workflow that does not go away.
10. Real Workloads — Local Llama, Qwen, DeepSeek
Scenario 1 — Coding Assistant on a Laptop
- M3 Pro / 36GB.
- Qwen 2.5 Coder 7B / 4-bit.
- Run mlx_lm.server in OpenAI-compatible mode and point your editor's LLM integration at localhost:8080.
- About 50–70 t/s. Works offline.
Scenario 2 — Sales Note Summaries, Strictly Local
- M2 Air / 16GB.
- Llama 3.2 3B / 4-bit. (8B is tight at 16GB.)
- A Python script summarizes meeting notes and extracts next actions.
- No data leaves the device.
Scenario 3 — 70B Inference Workstation
- Mac Studio M2 Ultra / 192GB.
- DeepSeek 70B / 4-bit or Llama 3.1 70B / 4-bit.
- Expose mlx_lm.server on the office network — a cost-efficient team-scale inference backend.
Scenario 4 — VLM for Document OCR and Summaries
- M3 Max / 64GB.
- mlx-vlm + Qwen2-VL 7B / 4-bit.
- Feed screenshots and PDF page images, get text extraction plus a summary.
The thread running through all four: model, data, and inference all close in a single box. The Mac form factor sits naturally where MLX sits.
11. Decision Framework — When MLX, When Something Else
Local inference needed?
        │
   ┌────┴─────────┐
   ▼              ▼
Yes (Mac)     No (server)
   │              │
   ▼              ▼
Fine-tune on   PyTorch + CUDA
the same box?
   │
 ┌─┴──────┐
 ▼        ▼
Yes       No
 │         │
 ▼         ▼
MLX       Inference-only → llama.cpp / Ollama
 │        PyTorch only   → MPS (accept the perf hit)
 ▼
iOS / Swift integration needed?
   │
 ┌─┴───┐
 ▼     ▼
Yes    No
 │      │
 ▼      ▼
mlx-swift   Python MLX
Quick Table
| Situation | Pick |
|---|---|
| Local LLM inference on a Mac (personal) | MLX or llama.cpp-based tooling (Ollama, LM Studio) |
| Local fine-tuning + inference on a Mac | MLX (mlx-lm) |
| Embed the same model into iOS / macOS / visionOS apps | MLX (mlx-swift) |
| Quickly bring existing PyTorch code onto a Mac | PyTorch MPS (mind the limits) |
| Single inference engine for both Mac and Linux | llama.cpp |
| Large distributed training | NVIDIA + CUDA + PyTorch |
| JAX codebase you need to run on a Mac | JAX-Metal (experimental) |
12. Limitations and Trade-offs
MLX is not magic. Be honest about the seams.
Apple Silicon Only
MLX does not run on Intel Macs, Linux, or Windows. That is by design — the whole framework is optimized around Apple Silicon's unified memory. If you also need the same code on a Linux server, you still need a PyTorch or JAX implementation alongside it.
Smaller Ecosystem than PyTorch
The huge PyTorch ecosystem — Hugging Face Accelerate, DeepSpeed, axolotl, and so on — does not exist in MLX. The mlx-community hub has many converted models, but brand-new models often arrive on MLX days to weeks behind PyTorch. That gap is closing fast because mlx-lm can auto-convert Hugging Face checkpoints.
Immature Distributed Training
MLX is not the place to run multi-node distributed training. The sweet spot is a single machine. Large pretraining is still NVIDIA territory.
Debugging Subtleties
Lazy evaluation is occasionally confusing. When you print(x) you are implicitly running the whole graph up to that point, so on big graphs the moment you debug is the moment you pay for the computation. Sprinkling in explicit mx.eval() calls is a good habit.
Training Throughput Does Not Match H100
Fine-tuning throughput is a fraction of an H100. Heavy training is still more economical in the cloud.
Gap with Linux/Windows Workflows
If your team's main pipeline is Linux Docker / NVIDIA / PyTorch, MLX cannot be at the center. It works well as an auxiliary tool on the Mac workstations of some team members.
13. MLX and the Mac-Native Future — A Six-Month Outlook
Trends visible in late 2026:
- M5-generation and expanding TensorOps use — MLX already exploits M5 tensor acceleration, with quarterly performance improvements.
- Ollama and LM Studio adopting MLX backends — Mac users effectively use MLX even without choosing it directly.
- mlx-vlm growing fast — multimodal models becoming first-class citizens on a Mac.
- Swift side strengthening — mlx-swift-lm and a pattern of embedding LLMs into iOS / Vision Pro apps.
- OpenAI-compatible servers becoming standard — local inference backends increasingly expose the OpenAI API.
Zooming out, MLX is becoming the default infrastructure for doing ML on a Mac. Choosing not to use it is starting to feel like the unusual option.
Epilogue — Why the Mac Became an ML Workstation
It took NVIDIA two decades to become the default substrate for ML infrastructure. It took Apple Silicon two or three years to claim a seat as an ML workstation. Two things made the difference — the hardware thesis of unified memory, and the framework that surfaces that thesis exactly (MLX).
"PyTorch MPS imitates the NVIDIA model on Metal. MLX is the Apple Silicon model itself. Same GPU, different framework, different result."
When you run the same model on a Mac, MLX is almost always the fastest and almost always the most natural. And that naturalness carries into fine-tuning, VLM, and iOS embedding without breaking.
MLX Adoption Checklist
- Do you have an Apple Silicon (M1 or later) Mac?
- Did pip install mlx mlx-lm succeed? (Python 3.9+)
- Is the memory enough? (8B 4-bit fits in 16GB, 70B 4-bit needs 64GB+, 100B+ needs a 192GB Studio.)
- Have you decided inference only or fine-tuning too?
- Do you plan to expose results via an OpenAI-compatible server?
- Will the same weights be used in an iOS or macOS app?
- Have you accepted the gap with existing PyTorch code?
- Is "data does not leave the device" important for your workload?
- Have you settled when to call mx.eval() for debugging?
- Have you confirmed model licenses (Llama, Qwen, DeepSeek, etc.)?
Ten Anti-Patterns
- Trying to install MLX on a Linux server — wrong framework. Apple Silicon only.
- PyTorch MPS for long-context inference — OOM at the 4GB tensor cap. Move to MLX.
- Evaluating every tensor at once — running a huge graph in a single eval blows up memory and time. Step-wise evals.
- Trying to load a 70B model unquantized on a 16GB Mac — will not work. 4-bit + 64GB+ is the floor.
- Mixing MLX with NumPy/PyTorch tensors carelessly — conversion cost and device confusion. Keep boundaries clean.
- Falling back to PyTorch the moment a new model is missing — mlx-lm auto-converts most Hugging Face checkpoints.
- Expecting H100-class throughput on a Mac — fine-tuning throughput is far smaller. Big training belongs in the cloud.
- Hand-rolling APIs instead of using the OpenAI-compatible server — just run mlx_lm.server.
- Ignoring mlx-swift's possibilities — the iOS / visionOS value lives there.
- Evaluating MLX as a "PyTorch replacement" — it is not. MLX is a tool specialized for Apple Silicon. Its value proposition is different.
Next Posts
Possible follow-ups:
- Local LLM serving architecture — Ollama vs LM Studio vs mlx-lm server.
- Embedding LLMs in iOS — building a note-app assistant with mlx-swift.
- Quantization deep dive — measuring the accuracy impact of 4-bit / 8-bit / QLoRA.
"The M-series GPU shares RAM with the CPU. So MLX should be fast. And it is."
— MLX deep dive, end.
References
- MLX GitHub Repository (ml-explore/mlx)
- Apple Open Source — MLX Project Page
- MLX Official Documentation
- MLX Framework Site
- Apple Machine Learning Research — Exploring LLMs with MLX and Neural Accelerators in the M5 GPU
- WWDC25 — Explore Large Language Models on Apple Silicon with MLX
- mlx-swift GitHub Repository
- Ollama Blog — Ollama Is Now Powered by MLX on Apple Silicon (Preview)
- LM Studio Blog — LM Studio 0.3.4 Ships with Apple MLX
- Apple Silicon LLM Benchmarks — llmcheck.net
- Towards Data Science — How Fast Is MLX? Benchmarks on 8 Apple Silicon Chips and 4 CUDA GPUs
- arXiv — A Comparative Study of MLX, MLC-LLM, Ollama, llama.cpp
- DZone — Vision AI on Apple Silicon: A Practical Guide to MLX-VLM
- Markaicode — Run and Fine-Tune LLMs on Mac with MLX-LM 2026