NPU Deep Dive: How Transformer Architecture Runs Directly on Silicon

Introduction: AI Moves Into Your Pocket

When ChatGPT launched in late 2022, every request went to a rack of A100 GPUs in a data center. By 2025, Llama 3.2 3B runs in real time on an iPhone 16, generating roughly 25-30 tokens per second on the Apple Neural Engine.

The hardware that makes this possible is the NPU (Neural Processing Unit).

An NPU is not simply a small GPU. It embodies a fundamentally different design philosophy — a chip that sacrifices generality to achieve extraordinary efficiency on a narrow class of operations. This post explains what that means in precise technical terms: how every transformer operation maps to silicon, why memory bandwidth matters more than TFLOPS for inference, and what the future of AI hardware looks like.


1. CPU vs GPU vs NPU: The Design Philosophy Triangle

CPU: "I handle complex tasks, one at a time, very fast"
+---------------------------------------------------+
| Core count:  8-128 (big cores)                    |
| Clock:       3-5 GHz (high single-thread perf)    |
| Strengths:   Complex control flow, branch pred,   |
|              OS, databases, web servers           |
| Caches:      L1/L2/L3 hierarchy (MB-scale)        |
| Weakness:    Energy-inefficient for parallel math |
+---------------------------------------------------+

GPU: "I do simple things, millions at once"
+---------------------------------------------------+
| Core count:  Thousands to tens of thousands       |
| Clock:       1-3 GHz (low, massively parallel)    |
| Strengths:   Any parallel workload (render, AI)   |
| Memory:      GDDR6/HBM, high bandwidth            |
| Weakness:    300-700W power draw, flexibility tax |
+---------------------------------------------------+

NPU: "I do AI math only — with extreme efficiency"
+---------------------------------------------------+
| Core count:  Few but specialized MAC arrays       |
| Strengths:   Integer matrix multiply (INT8/INT4), |
|              quantized inference                  |
| Energy eff:  10-100x better than GPU              |
| Power:       1-10W (smartphone/laptop)            |
| Weakness:    No general compute, no FP32 training |
+---------------------------------------------------+

Why NPU is necessary:
- Smartphone AI: 5000 mAh battery, GPU = 30 min battery life
- NPU: same computation, 1/50th the power
- "Always-on AI": face detection, wake word, photo classification

Let's quantify the energy efficiency advantage:

# Energy efficiency comparison for INT8 inference
perf_data = {
    'NVIDIA H100 SXM':       {'tops': 3958, 'tdp_w': 700},   # INT8, with sparsity
    'NVIDIA A100 40GB':      {'tops': 1248, 'tdp_w': 400},   # INT8, with sparsity
    'AMD MI300X':            {'tops': 5220, 'tdp_w': 750},   # INT8, with sparsity
    'Apple M4 Neural Engine': {'tops': 38,  'tdp_w': 4},     # Neural Engine TDP only
    'Qualcomm Hexagon NPU':  {'tops': 45,   'tdp_w': 5},
    'Intel Meteor Lake NPU': {'tops': 10,   'tdp_w': 8},
}

print(f"{'Hardware':<30} {'TOPS':>8} {'TDP (W)':>8} {'TOPS/W':>8}")
print("-" * 56)
for name, data in perf_data.items():
    efficiency = data['tops'] / data['tdp_w']
    print(f"{name:<30} {data['tops']:>8} {data['tdp_w']:>8} {efficiency:>8.2f}")

# Note: absolute TOPS numbers favor data center GPUs,
# but TOPS/W shows why mobile NPUs dominate on-device AI

2. Apple Neural Engine (ANE): Full Hardware Dissection

Apple's Neural Engine is the most mature consumer NPU, having shipped in every iPhone since 2017.

Apple Neural Engine Evolution:

A11 Bionic (2017): 2-core Neural Engine
  Performance: 0.6 TOPS
  Purpose:     Face ID, Animoji
  Among the first NPUs in a consumer phone (alongside Huawei's Kirin 970)

A12 Bionic (2018): 8-core Neural Engine
  Performance: 5 TOPS
  Added:       On-device Siri, real-time image segmentation

A15 Bionic (2021): 16-core Neural Engine
  Performance: 15.8 TOPS
  Added:       On-device translation, Live Text in camera

A17 Pro (2023): 16-core, 3nm process
  Performance: 35 TOPS (INT8)
  Added:       Llama 3.2 3B local inference (~25 tok/s)

M4 (2024): 16-core Neural Engine, second-generation 3nm process (N3E)
  Performance: 38 TOPS
  Unified Memory: 16GB-32GB shared with GPU + CPU

ANE Internal Architecture (based on public patents)

Apple Neural Engine Die Area (estimated):

+-------------------------------------------------------------+
|                     Neural Engine Block                     |
+-------------------------------------------------------------+
|  Command Processor (schedules work to execution units)      |
|  +-------------------------------------------------------+  |
|  |  16 Execution Units (cores)                           |  |
|  |  [MAC Array] [MAC Array] ... [MAC Array] x 16         |  |
|  |  Each EU: matrix multiply + activation functions      |  |
|  |  (Layer Norm, Softmax accelerated in hardware)        |  |
|  +-------------------------------------------------------+  |
+-------------------------------------------------------------+
|  Dedicated L1 SRAM: ~30 MB                                  |
|  (NOT shared with GPU -- no cache thrashing)                |
+-------------------------------------------------------------+
|  DMA Engine (dedicated hardware for memory moves)           |
+-------------------------------------------------------------+
|  Memory Interface: Unified Memory (CPU/GPU/ANE share)       |
+-------------------------------------------------------------+

Critical constraints:
- Programming: ONLY via CoreML (no bare-metal API access)
- Data types: FP16 and INT8 (no FP32; not usable for training)
- Batch size: limited (large batches become inefficient)
- Op coverage: unsupported ops fall back to GPU/CPU automatically

Using CoreML to Target ANE

# Convert a PyTorch model to CoreML -> runs on Apple Neural Engine
import torch
import coremltools as ct

# Example: small transformer model
class SmallTransformer(torch.nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        self.layers = torch.nn.ModuleList([
            torch.nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        ])
        self.output = torch.nn.Linear(d_model, 32000)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return self.output(x)

model = SmallTransformer()
model.eval()

# Trace the model
example_input = torch.zeros(1, 128, 512)
traced = torch.jit.trace(model, example_input)

# Convert to CoreML with quantization
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name='input',
                          shape=(1, ct.RangeDim(1, 512), 512))],
    compute_precision=ct.precision.FLOAT16,
    compute_units=ct.ComputeUnit.ALL  # ANE + GPU + CPU, auto-routed
)

# Post-training INT8 weight quantization for ANE
# (coremltools >= 7 API; `granularity` requires coremltools >= 8)
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig, OptimizationConfig, linear_quantize_weights
)

config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(
        mode='linear_symmetric',
        dtype='int8',
        granularity='per_channel'  # per-channel scales = better accuracy
    )
)
quantized_model = linear_quantize_weights(mlmodel, config=config)
quantized_model.save('transformer_int8.mlpackage')

# Runtime inference (on-device)
import numpy as np
model = ct.models.MLModel('transformer_int8.mlpackage')
input_array = np.zeros((1, 128, 512), dtype=np.float32)  # example input
predictions = model.predict({'input': input_array})
# CoreML automatically routes to ANE if supported, else GPU/CPU

3. Mapping Every Transformer Operation to Hardware

Let's trace exactly which hardware unit handles each step of a transformer forward pass.

Transformer Forward Pass -> Hardware Mapping:
+----------------------------------------------------------------+
| Input Token IDs [batch, seq_len]                               |
|  -> Embedding Lookup (Gather op)                               |
|  Hardware: NPU SRAM cache for embedding table                  |
|  Cost: near-zero (table lookup, no compute)                    |
+----------------------------------------------------------------+
| Layer Normalization                                            |
|  -> mean -> variance -> normalize -> scale -> bias             |
|  Hardware: NPU vector ALU (fused into one kernel by compiler) |
|  Compute share: ~1-2%                                          |
+----------------------------------------------------------------+
| Q, K, V Projection (3x Linear Layers)                        |
|  [batch, seq, d_model] x [d_model, d_head] = GEMM            |
|  Hardware: Systolic Array / MAC array (~40% of power draw)    |
|  Compute share: ~38% (dominant operation!)                     |
+----------------------------------------------------------------+
| Attention Score: Q x K^T / sqrt(d_head)                       |
|  [b, heads, seq, d_h] x [b, heads, d_h, seq]                  |
|  Hardware: GPU Tensor Core / NPU MAC (O(n^2 x d) complexity)  |
|  Compute share: ~12% (scales quadratically with seq length!)  |
+----------------------------------------------------------------+
| Softmax: exp -> sum -> divide                                  |
|  Hardware: NPU vector unit (exp needs special function hw)    |
|  Compute share: ~1% (but memory-intensive, not fusable easily) |
+----------------------------------------------------------------+
| Attention x V (GEMM)                                          |
|  [b, heads, seq, seq] x [b, heads, seq, d_h]                  |
|  Hardware: MAC array                                           |
|  Compute share: ~12%                                           |
+----------------------------------------------------------------+
| Output Projection + FFN Layer 1 + FFN Layer 2 (3x GEMM)      |
|  Hardware: MAC array (largest matrix: d_model x 4*d_model)   |
|  Compute share: ~37%                                           |
+----------------------------------------------------------------+

Summary:
GEMMs dominate the FLOPs (~95%+) -> handled by MAC arrays / Systolic Arrays
Vector ops (LayerNorm, Softmax, GELU) are only a few % of FLOPs -> vector units
(though, being memory-bound, they take a larger share of wall-clock time)
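The split above can be sanity-checked with a quick FLOP count. A sketch for a single decoder layer (assumptions: batch 1, full-sequence prefill, d_ff = 4*d_model, and vector-op cost approximated as ~10 FLOPs per element):

```python
# Rough per-layer FLOP count: GEMMs vs vector ops
def layer_flops(seq_len, d_model, d_ff=None):
    d_ff = d_ff or 4 * d_model
    qkv   = 3 * 2 * seq_len * d_model * d_model   # Q, K, V projections
    score = 2 * seq_len * seq_len * d_model       # Q x K^T (all heads combined)
    attnv = 2 * seq_len * seq_len * d_model       # scores x V
    out   = 2 * seq_len * d_model * d_model       # output projection
    ffn   = 2 * 2 * seq_len * d_model * d_ff      # two FFN GEMMs
    gemm  = qkv + score + attnv + out + ffn
    # LayerNorm / softmax / GELU touch O(seq * d_model) elements -- tiny
    vector = 10 * seq_len * d_model
    return gemm, vector

gemm, vector = layer_flops(seq_len=2048, d_model=4096)
print(f"GEMM share of FLOPs: {gemm / (gemm + vector):.2%}")  # ~99.99%
```

By FLOP count the GEMMs are effectively everything; vector ops matter for wall-clock time only because they are memory-bound, not because of their arithmetic.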

Flash Attention: Solving the Memory Problem

Standard attention requires storing an n×n score matrix in memory. For long sequences, this is catastrophic:

def analyze_attention_memory(seq_len, d_model, n_heads, batch_size=1):
    d_head = d_model // n_heads
    dtype_bytes = 2  # FP16

    # Standard attention: O(n^2) memory
    # Stores full [batch, heads, seq, seq] attention score matrix
    attn_score_bytes = batch_size * n_heads * seq_len * seq_len * dtype_bytes
    attn_score_gb = attn_score_bytes / (1024**3)

    # Flash attention: O(n) memory
    # Processes in tiles, never materializes full score matrix
    # Only extra memory: tiling buffers (~d_head per head)
    flash_extra_bytes = batch_size * n_heads * seq_len * d_head * dtype_bytes
    flash_extra_gb = flash_extra_bytes / (1024**3)

    print(f"Sequence length: {seq_len}")
    print(f"Standard attention score matrix: {attn_score_gb:.2f} GB")
    print(f"Flash attention extra memory:    {flash_extra_gb:.4f} GB")
    print(f"Memory reduction:                {attn_score_gb/flash_extra_gb:.0f}x")

# GPT-4-scale (estimated): seq=8192, d=12288, heads=96
analyze_attention_memory(8192, 12288, 96)
# Standard: ~12.0 GB for ONE layer's score matrix!
# Flash Attention: ~0.19 GB
# 64x memory reduction (the ratio is seq_len / d_head = 8192 / 128)

# This is why Flash Attention is essential for NPUs:
# NPU SRAM is typically 30-100 MB
# Standard attention at seq=8192 needs ~12 GB per layer -- impossible
# Flash Attention's tiling fits in SRAM -> long sequences on NPU!
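The tiling idea can be sketched in NumPy with the online-softmax recurrence Flash Attention is built on (a single head, unscaled scores for brevity; `tile` plays the role of the SRAM-resident block):

```python
import numpy as np

def naive_attention(Q, K, V):
    S = Q @ K.T                        # [n, n] -- the matrix Flash never materializes
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V

def tiled_attention(Q, K, V, tile=32):
    n, d = Q.shape
    out = np.zeros((n, d))
    m = np.full(n, -np.inf)            # running max (numerical stability)
    l = np.zeros(n)                    # running softmax denominator
    for j in range(0, n, tile):        # stream K/V tiles through "SRAM"
        S = Q @ K[j:j+tile].T          # [n, tile] -- only a tile of scores
        m_new = np.maximum(m, S.max(axis=1))
        correction = np.exp(m - m_new) # rescale previous partial sums
        P = np.exp(S - m_new[:, None])
        l = l * correction + P.sum(axis=1)
        out = out * correction[:, None] + P @ V[j:j+tile]
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V))
```

Peak extra memory is one [n, tile] block plus a few length-n vectors, which is what lets the working set fit in NPU SRAM.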

4. Why LLM Inference is Memory-Bound: The Roofline Model

This is THE most important concept for understanding LLM inference performance. Most engineers intuitively think "more TFLOPS = faster LLM." This is wrong.

The Roofline Analysis

# LLM inference bottleneck analysis using the Roofline Model

model_config = {
    'name': 'Llama 2 7B',
    'num_params': 7_000_000_000,
    'bytes_per_param': 2,   # FP16
}

model_size_bytes = model_config['num_params'] * model_config['bytes_per_param']
model_size_gb = model_size_bytes / 1e9
print(f"Model size: {model_size_gb:.1f} GB")  # 14.0 GB

hardware = {
    'H100 SXM': {
        'memory_bw_gbs': 3350,   # GB/s
        'compute_tflops': 1979,  # FP16 TFLOPS
    },
    'A100 80GB': {
        'memory_bw_gbs': 2000,
        'compute_tflops': 312,
    },
    'RTX 4090': {
        'memory_bw_gbs': 1008,
        'compute_tflops': 82.6,
    },
    'Apple M3 Max GPU': {
        'memory_bw_gbs': 300,    # unified memory
        'compute_tflops': 14.2,
    }
}

print(f"\n{'Hardware':<20} {'Mem (ms)':>10} {'Compute (ms)':>13} {'Bottleneck':>15} {'Est tok/s':>10}")
print("-" * 72)

for hw_name, hw in hardware.items():
    # Per token: must read ALL weights from memory
    # (batch_size=1: no weight reuse across tokens)
    mem_time_ms = (model_size_bytes / (hw['memory_bw_gbs'] * 1e9)) * 1000

    # Compute time: 2 * num_params FLOPs / available FLOPS
    flops_per_token = 2 * model_config['num_params']
    compute_time_ms = (flops_per_token / (hw['compute_tflops'] * 1e12)) * 1000

    bottleneck_time_ms = max(mem_time_ms, compute_time_ms)
    tok_per_sec = 1000.0 / bottleneck_time_ms

    # Arithmetic Intensity (AI): FLOPs per byte
    ai_actual = flops_per_token / model_size_bytes
    # Hardware ridge point: compute/bandwidth ratio
    ai_ridge = (hw['compute_tflops'] * 1e12) / (hw['memory_bw_gbs'] * 1e9)

    bottleneck = "Memory-bound" if ai_actual < ai_ridge else "Compute-bound"
    print(f"{hw_name:<20} {mem_time_ms:>10.2f} {compute_time_ms:>13.4f} "
          f"{bottleneck:>15} {tok_per_sec:>10.0f}")

# Expected output (batch_size=1):
# H100 SXM:      4.18 ms memory, 0.007 ms compute -> Memory-bound -> ~239 tok/s
# A100 80GB:     7.00 ms memory, 0.045 ms compute -> Memory-bound -> ~143 tok/s
# RTX 4090:     13.89 ms memory, 0.170 ms compute -> Memory-bound ->  ~72 tok/s
# Apple M3 Max: 46.67 ms memory, 0.985 ms compute -> Memory-bound ->  ~21 tok/s
# Conclusion: ALL hardware is memory-bound at batch_size=1!

What Memory-Bound Means in Practice

Memory-bound inference has counter-intuitive implications:

1. Doubling TFLOPS -> NO speedup
   (Memory bandwidth is the bottleneck, not compute)

2. Doubling memory bandwidth -> exactly 2x speedup
   This is why H100 (3.35 TB/s) beats A100 (2.0 TB/s) by ~1.6x for inference

3. Halving model size (quantization) -> exactly 2x speedup
   FP16 -> INT8: model size halved, 2x more tokens/sec
   INT8 -> INT4: model size halved again, 2x more tokens/sec

4. Increasing batch size -> transition to compute-bound
   At batch_size=1: weight is read once, used for 1 token (wasteful)
   At batch_size=64: weight is read once, used for 64 tokens (efficient)
   Large batches => weight reuse => compute-bound => TFLOPS matters

5. Why Apple Silicon is competitive:
   M3 Ultra: 800 GB/s unified memory (vs H100's 3.35 TB/s)
   BUT: up to 192 GB capacity fits models a single 80 GB H100 cannot
   M3 Max (128 GB): runs Llama 3.1 70B at INT8 (~75 GB) locally

6. Why AMD MI300X beats H100 for inference:
   MI300X: 5.3 TB/s memory bandwidth (vs H100's 3.35 TB/s)
   At batch_size=1: MI300X is ~1.6x faster, despite fewer TFLOPS
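Point 4 can be made concrete. With FP16 weights, each byte read from memory serves 2 FLOPs per token in the batch, so arithmetic intensity is roughly equal to the batch size, and the memory-to-compute crossover lands near the hardware ridge point (a sketch using the same specs as the roofline table above):

```python
# Estimate the batch size where decoding flips from memory- to compute-bound.
# FP16 weights: FLOPs = 2*P*batch, bytes = 2*P  ->  intensity ~= batch (FLOPs/byte)
# Crossover batch ~= ridge point = peak FLOPS / memory bandwidth
hardware = {
    'H100 SXM':  {'memory_bw_gbs': 3350, 'compute_tflops': 1979},
    'A100 80GB': {'memory_bw_gbs': 2000, 'compute_tflops': 312},
    'RTX 4090':  {'memory_bw_gbs': 1008, 'compute_tflops': 82.6},
}
for name, hw in hardware.items():
    ridge = (hw['compute_tflops'] * 1e12) / (hw['memory_bw_gbs'] * 1e9)
    print(f"{name:<12} crossover batch ~ {ridge:.0f}")
# H100 ~591, A100 ~156, RTX 4090 ~82 -- why serving stacks batch aggressively
```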

5. KV Cache: The Memory Cost of Long Contexts

Without KV Cache, generating a 1000-token response would require recomputing all attention weights from scratch for every new token — O(n²) complexity per output token.

# KV Cache memory analysis

def compute_kv_cache_size(model_name, context_len, n_layers,
                          n_kv_heads, head_dim, batch_size=1,
                          dtype_bytes=2):
    """
    KV Cache stores Key and Value tensors for all previous tokens
    Shape: [context_len, n_layers, 2(K+V), n_kv_heads, head_dim]
    """
    kv_bytes = (context_len * n_layers * 2 *
                n_kv_heads * head_dim * batch_size * dtype_bytes)
    kv_gb = kv_bytes / (1024**3)

    # Per-token memory access (reading the cache for one new token)
    per_token_bytes = n_layers * 2 * n_kv_heads * head_dim * dtype_bytes
    per_token_kb = per_token_bytes / 1024

    print(f"\n=== {model_name} ===")
    print(f"  Context length:    {context_len:>8,} tokens")
    print(f"  KV Cache size:     {kv_gb:>8.2f} GB")
    print(f"  Per-token KV read: {per_token_kb:>8.1f} KB")

# Standard models (Multi-Head Attention, n_kv_heads = n_heads)
compute_kv_cache_size("Llama 2 7B",
                      context_len=4096, n_layers=32,
                      n_kv_heads=32, head_dim=128)
# KV Cache: 2.0 GB for 4K context

compute_kv_cache_size("Llama 2 70B (if it used MHA)",
                      context_len=4096, n_layers=80,
                      n_kv_heads=64, head_dim=128)
# KV Cache: 10.0 GB for 4K context
# (the real Llama 2 70B uses GQA with 8 KV heads -> 1.25 GB)

# Models with Grouped Query Attention (GQA) -- fewer KV heads
compute_kv_cache_size("Llama 3.1 8B (GQA 8 heads)",
                      context_len=128_000, n_layers=32,
                      n_kv_heads=8, head_dim=128)
# KV Cache: 16.0 GB for 128K context (vs 64 GB without GQA!)

compute_kv_cache_size("Llama 3.1 70B (GQA 8 heads)",
                      context_len=128_000, n_layers=80,
                      n_kv_heads=8, head_dim=128)
# KV Cache: 40.0 GB for 128K context

GQA and MQA: Reducing KV Cache

# Grouped Query Attention (GQA): fewer KV heads, same Q heads
# Used in Llama 3, Mistral, Gemma, Falcon

def compare_attention_variants(seq_len, n_layers, d_model,
                                n_q_heads, n_kv_heads_gqa,
                                dtype_bytes=2):
    head_dim = d_model // n_q_heads

    # MHA: n_kv = n_q (standard multi-head attention)
    mha_kv_gb = (seq_len * n_layers * 2 * n_q_heads *
                 head_dim * dtype_bytes) / (1024**3)

    # GQA: n_kv < n_q (grouped -- multiple Q heads share one KV)
    gqa_kv_gb = (seq_len * n_layers * 2 * n_kv_heads_gqa *
                 head_dim * dtype_bytes) / (1024**3)

    # MQA: n_kv = 1 (extreme -- all Q heads share single KV)
    mqa_kv_gb = (seq_len * n_layers * 2 * 1 *
                 head_dim * dtype_bytes) / (1024**3)

    reduction_gqa = (1 - gqa_kv_gb/mha_kv_gb) * 100
    reduction_mqa = (1 - mqa_kv_gb/mha_kv_gb) * 100

    print(f"Seq: {seq_len}, Q heads: {n_q_heads}, GQA KV heads: {n_kv_heads_gqa}")
    print(f"  MHA KV cache: {mha_kv_gb:.2f} GB (baseline)")
    print(f"  GQA KV cache: {gqa_kv_gb:.2f} GB ({reduction_gqa:.0f}% reduction)")
    print(f"  MQA KV cache: {mqa_kv_gb:.2f} GB ({reduction_mqa:.0f}% reduction)")

# Llama 3.1 8B: 32 Q heads, 8 KV heads (GQA)
compare_attention_variants(
    seq_len=128_000, n_layers=32, d_model=4096,
    n_q_heads=32, n_kv_heads_gqa=8
)
# GQA: 75% KV cache reduction!
# NPU SRAM fits much more context with GQA
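Mechanically, the KV cache is just a pair of tensors per layer that grow by one row per generated token. A single-head NumPy sketch of one decode step (toy dimensions, random weights):

```python
import numpy as np

def decode_step(x, Wq, Wk, Wv, k_cache, v_cache):
    """One autoregressive step: compute K/V for the NEW token only,
    append to the cache, and attend over all cached positions."""
    q = x @ Wq                                   # [d_head]
    k_cache = np.vstack([k_cache, x @ Wk])       # [t+1, d_head]
    v_cache = np.vstack([v_cache, x @ Wv])
    scores = k_cache @ q / np.sqrt(q.shape[0])   # [t+1] -- O(t), not O(t^2)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v_cache, k_cache, v_cache

d = 64
rng = np.random.default_rng(1)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
k_cache = np.empty((0, d))
v_cache = np.empty((0, d))
for t in range(5):                               # generate 5 tokens
    x = rng.standard_normal(d)
    out, k_cache, v_cache = decode_step(x, Wq, Wk, Wv, k_cache, v_cache)
print(k_cache.shape)  # (5, 64) -- the cache grows by one row per token
```

With GQA, multiple query heads would share each of these K/V caches, which is exactly where the 4-8x memory reduction comes from.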

6. Quantization: How It Supercharges NPU Performance

Quantization is the single most impactful optimization for on-device LLM inference.

Number formats and hardware support:

FP32: [1 sign][8 exp][23 mantissa] = 32 bits
      Standard for training, all hardware supports
      7B model: 28 GB

FP16: [1 sign][5 exp][10 mantissa] = 16 bits
      Inference standard, GPU/NPU support
      7B model: 14 GB

INT8: two's-complement integer      = 8 bits   <- NPU default
      All NPUs support, 4x SIMD throughput vs INT32
      7B model: 7 GB
      Accuracy loss: typically < 0.5%

INT4: 4 bits                         <- Modern NPUs (A17 Pro, Hexagon)
      2x throughput vs INT8
      7B model: 3.5 GB
      Accuracy loss: 1-3% (with GPTQ calibration)

INT8 SIMD advantage on NPU:
- One 32-bit lane holds 4x INT8 values -> 4x throughput
- In practice: INT8 GEMM is 4-8x faster than FP32 GEMM
- Bandwidth savings scale perfectly: INT8 moves 4x less data than FP32
  (2x less than FP16)
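The full INT8 path (quantize, multiply in integer arithmetic with an INT32 accumulator as the MAC array does, then one rescale at the end) can be sketched in NumPy, using symmetric per-tensor scales for simplicity:

```python
import numpy as np

def quantize_int8(x):
    scale = np.abs(x).max() / 127.0       # symmetric per-tensor scale
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256)).astype(np.float32)   # activations
W = rng.standard_normal((256, 128)).astype(np.float32)  # weights

Aq, sa = quantize_int8(A)
Wq, sw = quantize_int8(W)

# Integer GEMM: INT8 x INT8 products accumulated in INT32 (no overflow
# here: 256 * 127 * 127 < 2^31), then a single FP rescale at the end
acc = Aq.astype(np.int32) @ Wq.astype(np.int32)
C_int8 = acc * (sa * sw)

C_fp32 = A @ W
rel_err = np.abs(C_int8 - C_fp32).mean() / np.abs(C_fp32).mean()
print(f"Mean relative error: {rel_err:.4f}")  # ~1-2% for Gaussian data
```

The expensive inner loop is pure integer MACs, which is precisely the operation NPU hardware is built around; the floating-point rescale happens once per output element.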

Post-Training Quantization: Production Techniques

# Technique 1: LLM.int8() -- handles outlier activations
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

config_int8 = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,    # activations above this use FP16
    llm_int8_has_fp16_weight=False
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=config_int8,
    device_map='auto',
    torch_dtype=torch.float16
)
# FP16: 16 GB -> INT8: ~8.5 GB, accuracy loss ~0.3%

# Technique 2: GPTQ -- gradient-optimized post-training quantization
# Much better accuracy than naive INT4 rounding
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,                    # 4-bit quantization
    group_size=128,            # group size (128 = good accuracy/speed tradeoff)
    damp_percent=0.1,          # GPTQ damping factor
    desc_act=True,             # activation order (better accuracy)
)

# GPTQ requires tokenized calibration examples
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
dataset = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train')
calibration_data = [
    tokenizer(dataset[i]['text'], return_tensors='pt')
    for i in range(128)
    if len(dataset[i]['text']) > 50
]

model = AutoGPTQForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantize_config
)
model.quantize(calibration_data)
model.save_quantized("llama3-8b-gptq-4bit")
# FP16: 16 GB -> GPTQ INT4: ~4.5 GB, accuracy loss ~1.2%

# Technique 3: AWQ (Activation-aware Weight Quantization)
# Finds the most important weights and protects them
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_awq = AutoAWQForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    safetensors=True
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}
model_awq.quantize(tokenizer, quant_config=quant_config)
model_awq.save_quantized("llama3-8b-awq-4bit")
# AWQ typically achieves ~0.5% better accuracy than GPTQ at same bit-width

Why Per-Channel Quantization Matters

# Per-tensor vs per-channel quantization accuracy
import numpy as np

def simulate_quantization_error(weights, bits=8, per_channel=True):
    """
    Per-channel: each output channel has its own scale
    Per-tensor: single scale for entire weight matrix
    """
    if per_channel:
        # Scale per output channel (row)
        max_vals = np.abs(weights).max(axis=1, keepdims=True)
        scales = max_vals / (2**(bits-1) - 1)
    else:
        # Single scale for entire tensor
        max_val = np.abs(weights).max()
        scales = max_val / (2**(bits-1) - 1)

    # Quantize
    weights_q = np.round(weights / scales).clip(-(2**(bits-1)), 2**(bits-1)-1)
    # Dequantize
    weights_dq = weights_q * scales

    error = np.mean((weights - weights_dq)**2)
    return error

# Simulate with realistic weight distribution
np.random.seed(42)
# Linear layer weights: approximately Gaussian but with different scales per channel
weights = np.random.randn(256, 4096) * np.random.exponential(1.0, (256, 1))

err_per_tensor = simulate_quantization_error(weights, per_channel=False)
err_per_channel = simulate_quantization_error(weights, per_channel=True)

print(f"Per-tensor INT8 MSE:  {err_per_tensor:.6f}")
print(f"Per-channel INT8 MSE: {err_per_channel:.6f}")
print(f"Improvement: {err_per_tensor/err_per_channel:.1f}x better accuracy")
# Typical: 5-50x better accuracy with per-channel quantization
# Cost: one scale value per output channel (negligible overhead)

7. Qualcomm Hexagon NPU and Intel NPU

Qualcomm Snapdragon X Elite: Hexagon NPU

Qualcomm Hexagon NPU (Snapdragon X Elite, 2024):

Performance: 45 TOPS (INT8)
Architecture:
  HTA (Hexagon Tensor Accelerator):
    - Primary GEMM accelerator
    - Supports INT4, INT8, FP16
    - On-chip SRAM: ~4 MB
  HMNN (Hexagon Multi-Network Node):
    - Run multiple AI networks simultaneously
    - Real-time + background AI concurrently
  Vector DSP + Scalar DSP:
    - Handle activation functions, softmax, etc.

Supported LLM Inference (on-device):
  Llama 3.2 3B INT4:    ~30 tok/s
  Phi-3.5 mini 3.8B:    ~25 tok/s
  Gemma 2 2B INT4:      ~35 tok/s

Programming:
  - Qualcomm AI Engine Direct SDK (low-level)
  - ONNX Runtime with QNN backend
  - llama.cpp with Hexagon backend

Windows Copilot Plus PC AI features (all run on NPU):
  - Live Captions: real-time speech transcription
  - Cocreator: AI image generation
  - Smart Snapshots: AI scene understanding
  All of these: battery-efficient because NPU handles it

Intel Meteor Lake NPU

Intel Neural Processing Unit (Core Ultra / Meteor Lake, 2023):

Performance: 10-11 TOPS (INT8)
Architecture:
  - NN Compute Engine: MAC array
  - Slice architecture: independent compute tiles
  - Dedicated memory controller

Strengths:
  - Always-on AI (sips power in the background)
  - Windows Studio Effects offloaded to the NPU (camera pipeline)
  - Real-time noise suppression, eye contact correction

OpenVINO integration:
from openvino.runtime import Core

ie = Core()
# List available devices
print(ie.available_devices)  # e.g. ['CPU', 'GPU', 'NPU']

# Compile an ONNX / OpenVINO IR model specifically for the NPU
onnx_model_path = "model.onnx"
compiled_model = ie.compile_model(
    model=onnx_model_path,
    device_name="NPU",
    config={"PERFORMANCE_HINT": "THROUGHPUT"}
)

# Run inference (input_data: numpy array matching the model's input shape)
output = compiled_model({compiled_model.input(): input_data})

Use cases:
  - Windows Studio Effects (background blur, eye gaze correction)
  - Voice recognition pre-processing (wake word detection)
  - Real-time translation with small models
  - Image enhancement (computational photography)
  - NOT suitable: large LLMs (10 TOPS is insufficient)

8. Device-by-Device LLM Capability Matrix

As of 2025: practical on-device LLM inference capabilities

iPhone 16 Pro (8GB RAM, A18 Pro, ~35 TOPS ANE):
  Feasible:   Llama 3.2 3B INT4 (~2.0 GB, ~25 tok/s)
  Feasible:   Phi-3.5 mini 3.8B INT4 (~2.3 GB, ~20 tok/s)
  Borderline: Llama 3.1 8B INT4 (~5.0 GB, ~12 tok/s, little RAM headroom)
  Infeasible: Llama 3.1 8B FP16 (16 GB required, RAM exhausted)
  Infeasible: 70B models (even INT4 needs 35+ GB)

MacBook Air M3 16GB (100 GB/s):
  Feasible:   Llama 3.1 8B Q4 (~5 GB, ~15-20 tok/s)
  Feasible:   Mistral 7B Q4 (~4.5 GB, ~18-22 tok/s)
  Borderline: Llama 3.1 8B Q8 (~9 GB, ~10 tok/s)
  Infeasible: Llama 3.1 70B (even Q4 needs 40 GB)

MacBook Pro M3 Max 128GB (400 GB/s):
  Feasible:   Llama 3.1 70B Q4 (~40 GB, ~8-10 tok/s)
  Feasible:   Llama 3.1 70B Q8 (~75 GB, ~5 tok/s)
  Feasible:   Llama 3.1 405B Q2_K (~100 GB, ~4 tok/s, low quality)
  Infeasible: 405B Q4 (needs ~220 GB)

Mac Studio M3 Ultra 192GB (800 GB/s):
  Feasible:   Llama 3.1 70B FP16 (~140 GB, ~5-6 tok/s)
  Feasible:   Llama 3.1 70B Q4 (~40 GB, ~18-20 tok/s)
  Infeasible: Llama 3.1 405B Q4 (~230 GB exceeds 192 GB)
  This is a legitimate local inference machine for 70B-class models

Snapdragon X Elite PC (32GB):
  Feasible:   Llama 3.2 3B Q4 (~30 tok/s on NPU)
  Feasible:   Phi-3.5 mini (~25 tok/s on NPU)
  Feasible:   Llama 3.1 8B Q4 (~15 tok/s on GPU)
  Infeasible: 70B models (RAM exhausted)
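The whole matrix above follows from two numbers per device: RAM capacity and memory bandwidth. A rough feasibility estimator (a heuristic sketch; it ignores KV cache and runtime overhead beyond a flat allowance, and real throughput is typically 50-80% of the bandwidth bound):

```python
def estimate_on_device(model_params_b, bits, ram_gb, mem_bw_gbs, overhead_gb=2.0):
    """Rough feasibility check: does the model fit, and how fast at batch=1?
    Returns (weight size in GB, bandwidth-bound tok/s ceiling or None)."""
    size_gb = model_params_b * bits / 8          # weights only, no KV cache
    if size_gb + overhead_gb > ram_gb:
        return size_gb, None                     # does not fit
    return size_gb, mem_bw_gbs / size_gb         # roofline tok/s ceiling

# M3 Max 128GB @ 400 GB/s, Llama 3.1 70B at INT4
size, toks = estimate_on_device(70, 4, ram_gb=128, mem_bw_gbs=400)
print(f"70B INT4: {size:.0f} GB, <= {toks:.0f} tok/s")   # 35 GB, <= 11 tok/s

# Same machine, 70B FP16: 140 GB -> does not fit in 128 GB
_, toks = estimate_on_device(70, 16, ram_gb=128, mem_bw_gbs=400)
print("70B FP16 fits:", toks is not None)                 # False
```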

9. The LLM Chip Wars: Beyond GPU

General-purpose GPUs are facing serious competition from dedicated inference accelerators.

Groq LPU: Deterministic Dataflow

Groq Language Processing Unit (LPU):

Core innovation: deterministic dataflow architecture
- At compile time: EVERY memory access, EVERY operation is statically scheduled
- At runtime: zero scheduling overhead, zero cache misses (by design)
- Result: perfectly predictable, maximally pipelined execution

Why this wins for LLM inference:
- LLM inference has STATIC computation graph
- Same model = same sequence of operations every time
- Compiler can optimize perfectly because nothing is unknown

Real numbers:
- Llama 2 70B: 300+ tok/s per user on Groq (vs ~50-100 tok/s on GPU servers)
- Reason: weights live in on-chip SRAM (230 MB per chip, pipelined across
  hundreds of chips), so the LPU never stalls waiting for DRAM

Limitations:
- Only runs specific pre-compiled model architectures
- Recompile required for any model change
- Very limited flexibility outside inference

Cerebras WSE-3: The Wafer-Scale Monster

Cerebras Wafer Scale Engine 3 (2024):

Size:         46,225 mm^2 (an entire silicon wafer)
AI cores:     900,000 cores
On-chip SRAM: 44 GB (!) at enormous aggregate bandwidth
Process:      5nm TSMC

The key insight:
- Enough on-die SRAM to hold a mid-sized model's weights entirely
- No HBM round-trips for models that fit
- Eliminates the memory bandwidth bottleneck for those workloads

For a 7B INT8 model:
- Model weights: 7 GB -- fits entirely in on-chip SRAM
- Larger models stream weights from MemoryX external memory

Performance:
- Competitive with GPU clusters for transformer training
- For models resident in SRAM: extremely low inference latency
- Use case: organizations needing ultra-fast LLM research iteration

SambaNova Reconfigurable Dataflow Architecture

SambaNova RDA (Reconfigurable Dataflow Architecture):

Concept: FPGA-like programmability + ASIC performance
- Dataflow graph is mapped to physical silicon connections
- Reconfigure the chip's connectivity pattern per model
- Avoids von Neumann bottleneck for known dataflows

Advantages:
- No control overhead (connections ARE the computation)
- Excellent for batch inference (government, research)
- Can specialize hardware routing per customer model

Customers: US national labs, government agencies

Why general-purpose hardware will eventually lose:
- GPU scheduler, register file, cache hierarchy = overhead
- For known, static computation: these overheads are pure waste
- Specialized chips eliminate this waste entirely

10. Practical: Running LLM Inference On-Device

With llama.cpp (Cross-Platform)

# Build llama.cpp with Metal (Apple Silicon) support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with Metal GPU acceleration (enabled by default on Apple Silicon)
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(sysctl -n hw.ncpu)

# Download quantized model (GGUF format)
# Llama 3.1 8B at Q4_K_M: ~5.0 GB
pip install huggingface-hub
huggingface-cli download \
  bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models

# Run inference (Metal-accelerated on Apple Silicon)
./build/bin/llama-cli \
  -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -n 512 \
  --n-gpu-layers 35 \
  --ctx-size 4096 \
  -p "Explain the roofline model for LLM inference"

# Expected results (Apple M3 Max 16-core):
# Load time:   ~3 seconds
# Throughput:  ~45 tok/s (with GPU offload)
# Memory:      ~5.5 GB

With llama-cpp-python (Python Bindings)

# Python binding for llama.cpp (pip install llama-cpp-python)
from llama_cpp import Llama
import time

# Load model
llm = Llama(
    model_path="./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    n_gpu_layers=35,     # Layers to offload to GPU/NPU
    n_ctx=4096,          # Context window
    n_threads=8,         # CPU threads for non-GPU layers
    flash_attn=True,     # Use Flash Attention (memory efficient)
    verbose=False
)

# Benchmark inference
def benchmark(llm, prompt, n_tokens=200):
    start = time.perf_counter()
    output = llm(prompt, max_tokens=n_tokens, echo=False)
    elapsed = time.perf_counter() - start
    n_out = output['usage']['completion_tokens']
    return n_out, elapsed, n_out / elapsed

n_tok, t, speed = benchmark(
    llm,
    "Explain what makes NPUs more efficient than GPUs for LLM inference:",
    n_tokens=200
)
print(f"Generated {n_tok} tokens in {t:.2f}s = {speed:.1f} tok/s")

# Measure throughput with llama-bench (ships with llama.cpp)
import subprocess
result = subprocess.run(
    ['./build/bin/llama-bench',
     '-m', './models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf',
     '-p', '512', '-n', '128'],
    capture_output=True, text=True)
print(result.stdout)

Building a Simple Inference Server

# FastAPI server with a local LLM (minimal single-worker pattern)
from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama
import asyncio
from concurrent.futures import ThreadPoolExecutor

app = FastAPI()
executor = ThreadPoolExecutor(max_workers=1)  # LLM is not thread-safe

# Load model at startup
llm = Llama(
    model_path="./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    n_gpu_layers=35,
    n_ctx=4096,
    flash_attn=True,
    verbose=False
)

class InferenceRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7

@app.post("/generate")
async def generate(request: InferenceRequest):
    loop = asyncio.get_running_loop()

    def run_inference():
        return llm.create_chat_completion(
            messages=[{"role": "user", "content": request.prompt}],
            max_tokens=request.max_tokens,
            temperature=request.temperature
        )

    # Run in thread pool to avoid blocking the event loop
    result = await loop.run_in_executor(executor, run_inference)
    return {
        "response": result['choices'][0]['message']['content'],
        "tokens_generated": result['usage']['completion_tokens'],
    }

# Run: uvicorn server:app --host 0.0.0.0 --port 8080

Conclusion

NPUs represent a hardware revolution that's bringing AI from the cloud into every pocket and laptop.

The key lessons for any LLM infrastructure engineer:

  1. Power is the constraint: Running AI on a smartphone requires 1/50th of GPU power -> NPU is the only answer
  2. LLM inference is memory-bound, always: At batch_size=1, memory bandwidth determines performance — not TFLOPS. This single insight changes every hardware decision.
  3. Quantization is the unlock: INT4 vs FP16 = 4x less memory = 4x more tokens/sec. Good calibration is the difference between "usable" and "broken."
  4. KV Cache is the memory budget: For long-context models, KV Cache often exceeds model weights in memory. GQA reduces this by 4-8x.
  5. Specialized chips will win: Groq, Cerebras, Etched — purpose-built inference hardware is already beating GPUs on efficiency. The question is just when it becomes cost-competitive at scale.

The hardware-software co-evolution happening right now is among the most exciting frontiers in systems engineering. Understanding NPUs and dedicated inference accelerators isn't just curiosity; it's foundational knowledge for the AI infrastructure engineer's craft.


References

  • Apple Neural Engine Patent Filings (US Patent Office, 2017-2024)
  • Qualcomm AI Engine Direct SDK Documentation
  • "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning" (Dao, 2023)
  • "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers" (Frantar et al., 2022)
  • "AWQ: Activation-aware Weight Quantization for LLM Compression" (Lin et al., 2023)
  • "Roofline: An Insightful Visual Performance Model" (Williams et al., 2009)
  • llama.cpp: github.com/ggerganov/llama.cpp
  • Groq LPU Technical Whitepaper: groq.com
  • Cerebras WSE-3 Architecture: cerebras.net