NPU 완전 해부: 트랜스포머 아키텍처가 실리콘 위에서 어떻게 달리는가

들어가며: AI가 우리 주머니 속으로

ChatGPT가 처음 나왔을 때, 우리는 클라우드 서버에 있는 수십 개의 GPU에 요청을 보내야 했습니다. 하지만 2025년 현재, 여러분의 스마트폰에서 Llama 3.2 3B 모델이 실시간으로 돌아갑니다. iPhone 16의 Neural Engine이 초당 30개 이상의 토큰을 생성합니다.

이걸 가능하게 한 것이 NPU (Neural Processing Unit) 입니다.

NPU는 단순히 작은 GPU가 아닙니다. 근본적으로 다른 설계 철학을 가진 특화 가속기입니다. 이 글을 읽고 나면 NPU가 무엇인지, 트랜스포머의 각 연산이 실리콘 위에서 어떻게 돌아가는지, 그리고 왜 "메모리 대역폭이 TFLOPS보다 중요한가"를 완전히 이해하게 될 것입니다.


1. CPU vs GPU vs NPU: 설계 철학의 삼각형

CPU: 복잡한 일을 빠르게 처리하는 제너럴리스트
┌─────────────────────────────────────────────────────┐
│ 코어 수:   8-128 (빅 코어)                          │
│ 클럭:      3-5 GHz (높은 단일 스레드 성능)          │
│ 특기:      복잡한 제어 흐름, 분기 예측              │
│            운영체제, 데이터베이스, 웹 서버          │
│ 캐시:      L1/L2/L3 대형 캐시 계층 (MB 단위)        │
│ 약점:      병렬 연산 시 에너지 비효율               │
└─────────────────────────────────────────────────────┘

GPU: 단순한 일을 엄청나게 많이 동시에
┌─────────────────────────────────────────────────────┐
│ 코어 수:   수천~수만 (작은 CUDA 코어)               │
│ 클럭:      1-3 GHz (낮지만 대량 병렬)               │
│ 특기:      어떤 병렬 연산이든 (렌더링, AI, 물리)    │
│ 메모리:    GDDR6/HBM, 높은 대역폭                   │
│ 약점:      전력 소비 (300-700W), 범용성의 비용      │
└─────────────────────────────────────────────────────┘

NPU/Neural Engine: AI 연산만, 그것도 극한 효율로
┌─────────────────────────────────────────────────────┐
│ 코어 수:     소수의 특화된 MAC 배열                 │
│ 특기:        정수 행렬 곱셈 (INT8/INT4), 양자화 추론│
│ 에너지 효율: GPU 대비 10-100배                      │
│ 전력:        1-10W (스마트폰/노트북)                │
│ 약점:        범용 연산 불가, FP32 학습 지원 안 함   │
└─────────────────────────────────────────────────────┘

왜 NPU가 필요한가?
- 스마트폰 AI: 배터리 5000mAh → GPU로 LLM을 돌리면 30분도 못 버팀
- NPU: 같은 연산, 1/50의 전력
- "항상 켜져 있는 AI": 사진 분류, 음성 감지 등 상시 실행

실제 에너지 효율 비교:

# 추론 에너지 효율 계산 (가상 7B 모델, INT8, 배치=1)
perf_data = {
    'NVIDIA H100 SXM':     {'tops': 3958, 'tdp_w': 700},
    'NVIDIA A100 40GB':    {'tops': 1248, 'tdp_w': 400},
    'Apple M4 Neural Eng': {'tops': 38,   'tdp_w': 4},    # Neural Engine만
    'Qualcomm Hexagon NPU':{'tops': 45,   'tdp_w': 5},
    'Intel Meteor Lake NPU':{'tops': 10,  'tdp_w': 8},
}

for name, data in perf_data.items():
    efficiency = data['tops'] / data['tdp_w']
    print(f"{name:<28} {efficiency:>8.1f} TOPS/W")

# NVIDIA H100 SXM:              5.7 TOPS/W
# NVIDIA A100 40GB:             3.1 TOPS/W
# Apple M4 Neural Eng:          9.5 TOPS/W (NPU만, TDP 기준)
# Qualcomm Hexagon NPU:         9.0 TOPS/W
# Intel Meteor Lake NPU:        1.2 TOPS/W
# (모바일 NPU는 절대 성능은 낮지만 와트당 효율 경쟁력 있음)

2. Apple Neural Engine (ANE) 완전 해부

Apple의 Neural Engine은 가장 잘 알려진 NPU 중 하나입니다. 2017년 A11 Bionic에 처음 도입되어 꾸준히 발전했습니다.

Apple Neural Engine 세대별 발전:

A11 Bionic (2017): Neural Engine 1세대
- 코어: 2 (추론 전용)
- 성능: 0.6 TOPS
- 용도: Face ID, Animoji

A12 Bionic (2018): 8코어로 확장
- 성능: 5 TOPS
- Siri 음성 인식 온디바이스 처리

A15 Bionic (2021): 16코어
- 성능: 15.8 TOPS
- 온디바이스 번역, 사진 분류 실시간

A17 Pro (2023): 16코어, 3nm
- 성능: 35 TOPS (INT8 기준)
- Llama 3.2 3B 온디바이스 실행 가능

M4 (2024): 16코어 Neural Engine
- 성능: 38 TOPS
- Context length 8K까지 로컬 처리

ANE 하드웨어 구조

Apple Neural Engine 내부 구조 (추정, 공개 특허 기반):

┌─────────────────────────────────────────────────────────────┐
│                  Neural Engine Die Area                     │
├─────────────────────────────────────────────────────────────┤
│ Command Queue (실행 유닛에 작업 스케줄링)                   │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  16개 Execution Units (코어)                         │   │
│  │  [MAC Array][MAC Array] ... [MAC Array] × 16         │   │
│  │  각 EU: 행렬 곱셈 + 활성화 함수 + Layer Norm 지원    │   │
│  └──────────────────────────────────────────────────────┘   │
├─────────────────────────────────────────────────────────────┤
│ Dedicated SRAM (L1 캐시): ~30 MB                            │
│ (GPU와 공유 안 함 → 캐시 오염 없음)                         │
├─────────────────────────────────────────────────────────────┤
│ DMA Engine (메모리 이동 전용 하드웨어)                      │
├─────────────────────────────────────────────────────────────┤
│ Memory: Unified Memory (CPU/GPU/ANE 공유 물리 메모리)       │
└─────────────────────────────────────────────────────────────┘

중요한 제약:
- CoreML 통해서만 프로그래밍 가능 (직접 접근 API 없음)
- FP16, INT8, INT16 위주 지원 (FP32 미지원 → 학습 불가)
- 배치 크기 제한 있음 (대형 배치 비효율적)
- 지원하지 않는 연산자는 GPU/CPU로 자동 폴백

CoreML로 ANE 활용하기

# PyTorch 모델을 CoreML로 변환 → Apple Neural Engine에서 실행
import torch
import coremltools as ct

# 예시: 작은 트랜스포머 모델
model = MyTransformerModel()
model.eval()

# 더미 입력으로 tracing
example_input = torch.randint(0, 32000, (1, 512))
traced_model = torch.jit.trace(model, example_input)

# CoreML 변환 (FP16 정밀도 — ANE가 선호하는 형식)
mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(name='input_ids', shape=(1, 512))],
    compute_precision=ct.precision.FLOAT16,  # FP16으로 변환
    compute_units=ct.ComputeUnit.ALL  # ANE + GPU + CPU 자동 선택
)

# 모델 저장
mlmodel.save('my_transformer.mlpackage')

# 실행 시: CoreML이 자동으로 ANE/GPU/CPU 최적 배분
# ANE에서 실행되면: 초저전력 + 빠른 추론
# ANE 지원 안 되면: GPU 자동 폴백

3. 트랜스포머의 모든 연산을 하드웨어로 매핑하기

트랜스포머 forward pass의 각 단계가 어떤 하드웨어에서 어떻게 실행되는지 완전히 추적해봅시다.

Transformer Forward Pass 하드웨어 매핑:
┌──────────────────────────────────────────────────────────────┐
│ Input Token IDs [batch, seq_len]                             │
│  ↓ 임베딩 룩업 (Gather 연산)                                 │
│  하드웨어: NPU SRAM에 임베딩 테이블 캐시 → 거의 무료         │
├──────────────────────────────────────────────────────────────┤
│ Layer Norm [batch, seq, d_model] → [batch, seq, d_model]     │
│  ↓ Mean → Variance → Normalize → Scale → Bias                │
│  하드웨어: NPU 벡터 연산 유닛 (XLA가 단일 커널로 융합)       │
│  연산 비중: ~1-2% (매우 작음)                                │
├──────────────────────────────────────────────────────────────┤
│ Q, K, V 투영 (Linear Layer × 3)                              │
│  [batch, seq, d_model] × [d_model, d_head] = GEMM            │
│  하드웨어: Systolic Array / MAC 배열 (전력의 40%+)           │
│  연산 비중: ~38% (지배적!)                                   │
├──────────────────────────────────────────────────────────────┤
│ Attention Score: Q × K^T / sqrt(d)                           │
│  [batch, heads, seq, d_h] × [batch, heads, d_h, seq]         │
│  하드웨어: GPU Tensor Core / NPU MAC (O(n² × d) 복잡도)      │
│  연산 비중: ~12% (시퀀스 길이에 제곱 비례!)                  │
├──────────────────────────────────────────────────────────────┤
│ Softmax: exp → sum → divide                                  │
│  하드웨어: NPU 벡터 연산 (exp은 특수 함수 하드웨어)          │
│  연산 비중: ~1% (하지만 메모리 집약적)                       │
├──────────────────────────────────────────────────────────────┤
│ Attention × V (GEMM)                                         │
│  [batch, heads, seq, seq] × [batch, heads, seq, d_h]         │
│  하드웨어: MAC 배열                                          │
│  연산 비중: ~12%                                             │
├──────────────────────────────────────────────────────────────┤
│ Output Projection + FFN Layer 1 + FFN Layer 2 (GEMM × 3)     │
│  하드웨어: MAC 배열 (가장 큰 행렬: d_model × 4×d_model)      │
│  연산 비중: ~37%                                             │
└──────────────────────────────────────────────────────────────┘

결론: 연산의 ~87%는 GEMM → NPU/Systolic Array에서 처리
      나머지 ~13%는 벡터 연산 (LayerNorm, Softmax, GELU)
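위 표의 "시퀀스 길이에 제곱 비례" 항목은 간단한 FLOPs 계산으로 확인할 수 있습니다. 아래는 필자가 추가한 스케치로, batch=1을 가정하고 Q/K/V 투영(시퀀스 길이에 선형)과 attention score GEMM(제곱)의 연산량을 비교합니다:

```python
# Q/K/V 투영 vs attention score의 FLOPs 비교 스케치 (가정: batch=1, 단일 레이어)
def attention_vs_projection_flops(seq, d_model):
    proj = 3 * 2 * seq * d_model * d_model   # Q/K/V 투영 GEMM: seq에 선형
    score = 2 * seq * seq * d_model          # Q × K^T: seq에 제곱
    return proj, score

for seq in (1024, 4096, 16384):
    proj, score = attention_vs_projection_flops(seq, d_model=4096)
    print(f"seq={seq:>6}: 투영 {proj/1e9:.0f} GFLOPs, "
          f"score {score/1e9:.0f} GFLOPs, "
          f"score 비중 {score/(proj+score)*100:.0f}%")
```

시퀀스가 길어질수록 score 항이 지배적이 되는 것을 볼 수 있습니다. 이것이 다음 절의 Flash Attention이 중요한 이유입니다.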

Flash Attention: Attention의 메모리 문제 해결

# 표준 Attention vs Flash Attention 메모리 비교
def analyze_attention_memory(seq_len, d_model, n_heads, batch_size=1):
    d_head = d_model // n_heads
    dtype_bytes = 2  # FP16

    # 표준 Attention: O(n²) 메모리
    # Attention Score 행렬: [batch, heads, seq, seq]
    attn_score_gb = (batch_size * n_heads * seq_len * seq_len *
                     dtype_bytes) / (1024**3)

    # Flash Attention: O(n) 메모리
    # 타일별 처리, 전체 score 행렬 저장 안 함
    flash_extra_gb = (batch_size * n_heads * seq_len *
                      d_head * dtype_bytes) / (1024**3)

    print(f"시퀀스 길이: {seq_len}")
    print(f"Standard Attention score 행렬: {attn_score_gb:.2f} GB")
    print(f"Flash Attention 추가 메모리: {flash_extra_gb:.4f} GB")
    print(f"메모리 절약: {attn_score_gb/flash_extra_gb:.0f}배")

# GPT-4 규모 (추정): seq=8192, d=12288, heads=96
analyze_attention_memory(8192, 12288, 96)
# Standard: ~12.0 GB (단일 레이어!)
# Flash Attention: ~0.19 GB
# 64배 메모리 절약!

# Flash Attention이 NPU에서 특히 중요한 이유:
# NPU는 SRAM이 작음 (보통 30-100 MB)
# 표준 Attention은 수십 GB SRAM 필요 → 불가능
# Flash Attention의 타일 연산 = 작은 SRAM으로도 긴 시퀀스 처리 가능
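Flash Attention의 타일 연산이 표준 attention과 수학적으로 동일한 결과를 내는 비결은 "온라인 softmax" 누적입니다. 아래는 이를 numpy로 검증하는 최소 스케치입니다 (필자 추가 예시; 실제 커널은 SRAM 타일 크기와 하드웨어에 맞춰 구현됩니다):

```python
import numpy as np

def standard_attention(q, k, v):
    # 전체 [seq, seq] score 행렬을 메모리에 저장 — O(n^2) 메모리
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def flash_attention(q, k, v, tile=16):
    # K/V를 타일 단위로 순회하며 온라인 softmax로 누적 — O(n) 메모리
    n, d = q.shape
    out = np.zeros_like(q)
    m = np.full(n, -np.inf)   # 지금까지의 행별 최대값
    l = np.zeros(n)           # 지금까지의 exp 합
    for j in range(0, k.shape[0], tile):
        kj, vj = k[j:j+tile], v[j:j+tile]
        s = q @ kj.T / np.sqrt(d)          # [n, tile] 타일 score만 계산
        m_new = np.maximum(m, s.max(axis=-1))
        p = np.exp(s - m_new[:, None])
        scale = np.exp(m - m_new)          # 기존 누적값 재조정 계수
        l = l * scale + p.sum(axis=-1)
        out = out * scale[:, None] + p @ vj
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 32)) for _ in range(3))
print(np.allclose(standard_attention(q, k, v), flash_attention(q, k, v)))  # True
```

타일마다 `[n, tile]` 크기의 부분 score만 존재하므로, 작은 SRAM만으로도 전체 결과를 정확히 복원할 수 있습니다.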

4. 왜 LLM 추론은 메모리 바운드인가: 루프라인 모델

이것이 LLM 추론을 이해하는 가장 중요한 개념입니다. 많은 엔지니어들이 "더 많은 TFLOPS = 더 빠른 LLM"이라고 생각하지만, 이건 틀렸습니다.

루프라인 모델로 분석하기

# LLM 추론 병목 분석 (루프라인 모델)

# === 모델 설정 ===
model_name = "Llama 2 7B"
num_params = 7_000_000_000    # 70억 파라미터
bytes_per_param = 2           # FP16 = 2 bytes
model_size_bytes = num_params * bytes_per_param  # 14 GB

# === 하드웨어 설정 ===
hardware_configs = {
    'H100 SXM': {
        'memory_bw_tbs': 3.35,     # TB/s (3,350 GB/s)
        'compute_tflops': 1979,     # TFLOPS FP16
    },
    'A100 80GB': {
        'memory_bw_tbs': 2.0,
        'compute_tflops': 312,
    },
    'Apple M3 Max (GPU)': {
        'memory_bw_tbs': 0.3,      # 300 GB/s 통합 메모리
        'compute_tflops': 14.2,
    }
}

# === 분석 ===
for hw_name, hw in hardware_configs.items():
    # 토큰당 필요 연산
    # (각 토큰 생성 시 모든 가중치를 메모리에서 읽어야 함)
    bytes_per_token = model_size_bytes  # 14 GB/token

    # 메모리 대역폭이 지배하는 시간
    bw_gb_per_s = hw['memory_bw_tbs'] * 1000  # TB -> GB
    mem_time_sec = bytes_per_token / (bw_gb_per_s * 1e9)

    # 실제 연산 시간 (FLOPS/token ÷ 가용 FLOPS)
    flops_per_token = 2 * num_params  # rough: 2 × params
    compute_time_sec = flops_per_token / (hw['compute_tflops'] * 1e12)

    # 실제 병목은 두 값 중 큰 쪽
    bottleneck_time = max(mem_time_sec, compute_time_sec)
    tokens_per_sec = 1.0 / bottleneck_time

    # Arithmetic Intensity (연산 집약도)
    ai = flops_per_token / bytes_per_token  # FLOP/byte
    hw_ridge_point = hw['compute_tflops'] * 1e12 / (bw_gb_per_s * 1e9)

    bottleneck = "메모리 바운드" if ai < hw_ridge_point else "연산 바운드"

    print(f"\n=== {hw_name} ===")
    print(f"  메모리 시간: {mem_time_sec*1000:.1f} ms/token")
    print(f"  연산 시간:   {compute_time_sec*1000:.4f} ms/token")
    print(f"  병목:        {bottleneck}")
    print(f"  예상 처리량: ~{tokens_per_sec:.0f} tokens/sec")

# 출력 (배치 크기=1):
# H100: 메모리 4.2ms vs 연산 0.007ms → 메모리 바운드 → ~240 tok/s
# A100: 메모리 7.0ms vs 연산 0.045ms → 메모리 바운드 → ~143 tok/s
# M3 Max: 메모리 46.7ms vs 연산 0.98ms → 메모리 바운드 → ~21 tok/s
# 결론: 배치 크기 1에서 모든 하드웨어가 메모리 바운드!

메모리 바운드가 의미하는 것

메모리 바운드의 의미:

1. TFLOPS를 2배 높여도 속도는 안 변함
   (메모리 대역폭이 병목이므로)

2. 메모리 대역폭을 2배 높이면 정확히 2배 빨라짐

3. 모델 크기를 절반으로 줄이면 (양자화) 정확히 2배 빨라짐
   INT8 → INT4: 모델 크기 절반 → 처리량 2배

4. 배치 크기를 키우면 연산 바운드로 전환 가능
   (같은 가중치로 여러 입력 처리 = 가중치 재사용 증가)
   배치를 수십~수백까지 키우면 대부분의 GPU에서 연산 바운드로 전환

5. Apple Silicon의 강점:
   - 800 GB/s 통합 메모리 (M3 Ultra)
   - GPU, CPU, Neural Engine이 같은 물리 메모리 공유
   - 최대 192GB 용량: 큰 모델도 로컬 실행 가능
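4번 항목의 "연산 바운드로 전환되는" 임계 배치 크기는 루프라인 모델로 직접 추정할 수 있습니다. 아래는 필자가 추가한 단순화 스케치로, FP16 가중치(토큰당 2×파라미터 FLOPs, 2 bytes/param → 배치 B일 때 연산 집약도 ≈ B FLOP/byte)를 가정하고 KV Cache 읽기는 무시합니다:

```python
# 연산 바운드로 전환되는 임계 배치 크기 추정 (단순화 스케치)
# 가정: FP16 가중치, 배치 B의 연산 집약도 ≈ B FLOP/byte
def crossover_batch(compute_tflops, memory_bw_tbs):
    # 리지 포인트(peak FLOPS / 대역폭, FLOP/byte)와 같아지는 B가 전환점
    return compute_tflops * 1e12 / (memory_bw_tbs * 1e12)

for name, tflops, bw in [('H100 SXM', 1979, 3.35),
                         ('A100 80GB', 312, 2.0),
                         ('M3 Max GPU', 14.2, 0.3)]:
    print(f"{name}: 배치 ~{crossover_batch(tflops, bw):.0f} 이상에서 연산 바운드")
```

배치 1 서빙(온디바이스 AI의 기본 시나리오)은 어떤 하드웨어에서도 이 임계점에 한참 못 미치므로, 메모리 대역폭과 양자화가 곧 처리량입니다.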

5. KV Cache: Transformer의 핵심 메모리 최적화

KV Cache 없이는 긴 시퀀스 생성이 실용적으로 불가능합니다.

# KV Cache의 원리와 메모리 계산

def analyze_kv_cache(model_name, context_len, n_layers, n_heads,
                     head_dim, batch_size=1, dtype_bytes=2):
    """KV Cache 메모리 사용량 계산"""

    # KV Cache 크기: [seq, layers, 2(K+V), heads, head_dim]
    kv_cache_bytes = (context_len * n_layers * 2 *
                      n_heads * head_dim * batch_size * dtype_bytes)
    kv_cache_gb = kv_cache_bytes / (1024**3)

    # 토큰당 KV 읽기량 (메모리 대역폭 관점)
    bytes_per_token_kv = (n_layers * 2 * n_heads * head_dim * dtype_bytes)
    bytes_per_token_kv_gb = bytes_per_token_kv / (1024**3)

    print(f"\n=== {model_name} ===")
    print(f"  컨텍스트 길이: {context_len:,} 토큰")
    print(f"  KV Cache 크기: {kv_cache_gb:.2f} GB")
    print(f"  토큰당 KV 읽기: {bytes_per_token_kv/1024:.1f} KB")
    print(f"  현재 문맥에 추가: 현재 Key/Value 벡터만 계산 (이전 캐시 재사용)")

# Llama 2 7B 설정
analyze_kv_cache("Llama 2 7B",
                 context_len=4096, n_layers=32,
                 n_heads=32, head_dim=128)
# KV Cache: 4K 컨텍스트 = ~2.0 GB

analyze_kv_cache("Llama 3.1 70B",
                 context_len=128_000, n_layers=80,
                 n_heads=64, head_dim=128)
# KV Cache: 128K 컨텍스트 = ~312.5 GB (MHA 64헤드 가정; 실제 GQA 8 KV헤드면 ~39 GB)

# 실제 KV Cache vs 모델 가중치 메모리 분포
model_weights_gb = 14  # 7B FP16
kv_4k_gb = 2.0         # 4K context
kv_128k_gb = 62.5      # 128K context (512 KB/token × 128K 토큰)

print(f"\n메모리 분포 (Llama 2 7B, batch=1):")
print(f"  모델 가중치:    {model_weights_gb} GB (고정)")
print(f"  KV Cache 4K:   {kv_4k_gb} GB")
print(f"  KV Cache 128K: {kv_128k_gb} GB (모델의 {kv_128k_gb/model_weights_gb:.1f}배!)")

NPU에서의 KV Cache 최적화

# NPU 친화적 KV Cache 구현 전략

# 전략 1: Paged Attention (vLLM)
# KV Cache를 페이지 단위로 관리 → 동적 할당 가능
# GPU VRAM처럼 NPU SRAM을 관리

class PagedKVCache:
    def __init__(self, block_size=16, n_blocks=256):
        self.block_size = block_size
        self.n_blocks = n_blocks
        # 각 블록: [block_size, n_heads, head_dim] × 2 (K, V)
        self.blocks = {}

    def allocate_block(self, seq_id):
        """시퀀스에 새 블록 할당"""
        block_id = len(self.blocks)
        self.blocks[block_id] = {
            'seq_id': seq_id,
            'tokens': 0,
            'data': None  # 실제로는 텐서
        }
        return block_id

# 전략 2: GQA (Grouped Query Attention)
# Llama 3, Mistral 등에서 사용
# K, V 헤드 수 줄임 → KV Cache 크기 대폭 감소
def compute_gqa_savings(n_kv_heads_mha, n_kv_heads_gqa, n_layers,
                        context_len, head_dim, dtype_bytes=2):
    mha_size = n_kv_heads_mha * 2 * n_layers * context_len * head_dim * dtype_bytes
    gqa_size = n_kv_heads_gqa * 2 * n_layers * context_len * head_dim * dtype_bytes
    reduction = (1 - gqa_size/mha_size) * 100
    print(f"GQA KV Cache 절감: {reduction:.0f}%")

# Llama 3.1 8B: MHA 32헤드 → GQA 8헤드
compute_gqa_savings(32, 8, n_layers=32,
                    context_len=4096, head_dim=128)
# GQA KV Cache 절감: 75%!

6. 양자화가 NPU를 어떻게 도와주는가

양자화는 NPU에서 LLM을 실행 가능하게 만드는 핵심 기술입니다.

수 형식과 하드웨어 지원:

FP32: [1 부호][8 지수][23 가수] = 32비트
      모든 CPU/GPU 지원, 학습의 표준
      메모리: 7B 모델 = 28 GB

FP16: [1 부호][5 지수][10 가수] = 16비트
      GPU/NPU 지원, 추론 표준
      메모리: 7B 모델 = 14 GB

INT8: [1 부호][7 크기값]         = 8비트  ← NPU의 기본
      모든 NPU에서 지원, 4배 빠른 SIMD
      메모리: 7B 모델 = 7 GB
      정확도 손실: 보통 < 0.5%

INT4: 4비트                       ← 최신 NPU (A17 Pro, Hexagon)
      8비트의 2배 처리량
      메모리: 7B 모델 = 3.5 GB
      정확도 손실: 보통 1-3% (GPTQ 사용 시)

INT2/Binary: 실험적, 일부 특화 NPU
      메모리 절약 극대화, 성능 손실 큼

NPU에서 INT8 SIMD의 이점:
- 32비트 레지스터에 INT8 값 4개 저장 → 4배 처리량
- 실제: INT8 행렬 곱셈이 FP32보다 4-8배 빠름
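대칭(symmetric) INT8 양자화가 어떻게 동작하는지 보여주는 numpy 스케치입니다 (필자 추가 예시; 실제 NPU 런타임은 이를 하드웨어 명령으로 처리합니다):

```python
import numpy as np

def quantize_int8(x):
    # 대칭 양자화: scale = max|x| / 127, 0은 정확히 0으로 매핑
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # 복원: 정수 값 × 스케일
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(f"최대 복원 오차: {err:.4f} (이론적 상한 scale/2 = {scale/2:.4f})")
```

반올림 오차의 상한이 scale의 절반이므로, 가중치 분포가 넓을수록(이상치가 많을수록) 오차가 커집니다. 이것이 뒤에 나오는 캘리브레이션과 per-channel 스케일이 중요한 이유입니다.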

실제 양자화 구현

# LLM.int8()을 사용한 포스트 트레이닝 양자화
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# INT8 양자화 (LLM.int8() 알고리즘)
quantization_config_int8 = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,       # 이상치 처리 임계값
    llm_int8_has_fp16_weight=False
)

model_int8 = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quantization_config_int8,
    device_map='auto'
)
# FP16: 16 GB → INT8: 8 GB, 정확도 손실 ~0.3%

# INT4 양자화 (GPTQ 알고리즘)
quantization_config_int4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',         # NF4: 정규분포에 최적화된 4비트
    bnb_4bit_use_double_quant=True,    # 양자화 상수 자체도 양자화
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_int4 = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quantization_config_int4,
    device_map='auto'
)
# FP16: 16 GB → INT4 NF4: 4 GB, 정확도 손실 ~1.2%

# Apple CoreML용 INT8 양자화
import coremltools as ct
from coremltools.optimize.coreml import (
    OptimizationConfig, OpLinearQuantizerConfig, linear_quantize_weights
)

config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(
        mode='linear_symmetric',  # 대칭 양자화
        dtype='int8',
        granularity='per_channel'  # 채널별 스케일 (정확도 향상)
    )
)

# 이미 변환된 CoreML 모델에 적용
quantized_model = linear_quantize_weights(mlmodel, config)
# 모델 크기: 절반, ANE 처리량: 2배

양자화 캘리브레이션의 중요성

# 양자화 품질을 결정하는 캘리브레이션
from datasets import load_dataset
import torch

def calibrate_quantization(model, tokenizer, n_samples=128):
    """
    캘리브레이션: 실제 데이터로 양자화 스케일 결정
    이 과정이 정확도 유지의 핵심!
    """
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    calibration_texts = dataset['text'][:n_samples]

    model.eval()
    activation_ranges = {}

    # 각 레이어의 활성화 범위 수집
    def hook_fn(name):
        def hook(module, input, output):
            if name not in activation_ranges:
                activation_ranges[name] = {'min': float('inf'), 'max': float('-inf')}
            activation_ranges[name]['min'] = min(
                activation_ranges[name]['min'], output.min().item()
            )
            activation_ranges[name]['max'] = max(
                activation_ranges[name]['max'], output.max().item()
            )
        return hook

    # 훅 등록
    hooks = []
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            hooks.append(module.register_forward_hook(hook_fn(name)))

    # 캘리브레이션 실행
    with torch.no_grad():
        for text in calibration_texts:
            inputs = tokenizer(text, return_tensors='pt', max_length=512,
                              truncation=True)
            model(**inputs)

    # 훅 제거
    for hook in hooks:
        hook.remove()

    # 수집된 범위로 양자화 스케일 계산
    quantization_scales = {}
    for name, ranges in activation_ranges.items():
        abs_max = max(abs(ranges['min']), abs(ranges['max']))
        quantization_scales[name] = abs_max / 127.0  # INT8 최대값

    return quantization_scales

7. Qualcomm Hexagon NPU와 Intel NPU

Qualcomm Snapdragon X Elite: Hexagon NPU

Qualcomm Hexagon NPU (Snapdragon X Elite, 2024):

성능: 45 TOPS (INT8)
아키텍처:
- HTA (Hexagon Tensor Accelerator)
- HMNN (Hexagon Multi-Network Node): 여러 AI 워크로드 동시 실행
- 전용 Vector DSP + Scalar DSP

지원 데이터 형식: INT4, INT8, FP16
온칩 SRAM: ~3 MB (빠른 캐시)

지원하는 LLM 실행:
- Llama 3.2 3B: 30+ tok/s 로컬 실행
- Phi-3.5 mini 3.8B: 25+ tok/s
- Gemma 2 2B: 35+ tok/s

프로그래밍:
- Qualcomm AI Engine Direct SDK
- ONNX Runtime + QNN 백엔드
- llama.cpp의 Hexagon 백엔드

Windows Copilot Plus PC에서의 활용:
- 실시간 자동 캡션 (Live Captions)
- Cocreator (AI 이미지 생성)
- 스마트 스냅숏
모두 NPU에서 실행 → 배터리 소모 최소화

Intel Meteor Lake Neural Processing Unit

Intel NPU (Meteor Lake / Core Ultra, 2023):

성능: 10 TOPS (INT8)
아키텍처:
- NN 연산자 가속기: MAC 배열
- 슬라이스 아키텍처: 독립적인 처리 타일
- 전용 메모리 컨트롤러

특징:
- 항상 켜져 있는 AI 처리
- 백그라운드 AI 작업에 특화
- 실시간 노이즈 캔슬링, 눈 감지 등

OpenVINO로 활용:
from openvino.runtime import Core
ie = Core()
compiled_model = ie.compile_model(onnx_model, device_name="NPU")
output = compiled_model({"input": input_data})

주요 사용 사례:
- Windows Studio Effects (배경 흐리기, 눈 맞춤)
- 음성 인식 전처리
- 실시간 번역 (소형 모델)
- 이미지 향상 (사진 자동 보정)

8. 기기별 LLM 실행 가능 모델 크기

2025년 기준, 실용적인 온디바이스 LLM 실행:

iPhone 16 Pro (8GB RAM, A18 Pro, 35 TOPS ANE):
├── 가능: Llama 3.2 3B INT4 (~2GB, ~25 tok/s)
├── 가능: Phi-3 mini 3.8B INT4 (~2.3GB, ~20 tok/s)
├── 불가능: Llama 3.1 8B FP16 (16GB 필요)
└── 가능: Llama 3.1 8B INT4 (~5GB, ~12 tok/s)

MacBook Air M3 (16GB RAM, 18 TOPS ANE):
├── 가능: Llama 3.1 8B Q4 (~5GB, ~40 tok/s)
├── 가능: Mistral 7B Q4 (~4.5GB, ~45 tok/s)
├── 불가능: Llama 3.1 70B Q4 (~40GB, M3 Max 64GB 필요)
└── 불가능: 70B FP16 (140GB 필요)

M3 Ultra (192GB RAM, 800 GB/s):
├── 가능: Llama 3.1 70B FP16 (~140GB, ~35 tok/s)
├── 불가능: Llama 3.1 405B Q4 (~230GB, RAM 초과)
└── 가능: Claude Sonnet급 모델 로컬 실행

Snapdragon X Elite PC (32GB RAM):
├── 가능: Llama 3.2 3B (~30 tok/s on NPU)
├── 가능: Phi-3.5 mini (~25 tok/s on NPU)
├── 가능: Llama 3.1 8B Q4 (~15 tok/s)
└── 불가능: 70B 이상 (메모리 부족)
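위 표의 가능/불가능 판정은 대부분 단순한 메모리 산수입니다. 아래는 필자가 추가한 추정 스케치로, OS와 앱 몫으로 RAM의 25%를 제외한다는 임의의 가정을 사용합니다 (실제 여유 공간은 기기와 KV Cache 크기에 따라 다릅니다):

```python
# 기기 RAM에 모델이 들어가는지 추정 (가정: RAM의 25%는 OS/앱 몫으로 제외)
BYTES_PER_PARAM = {'FP16': 2.0, 'INT8': 1.0, 'Q4': 0.5}

def fits(params_b, dtype, ram_gb, reserve_ratio=0.25):
    """params_b: 파라미터 수(B 단위). 반환: (실행 가능 여부, 모델 크기 GB)"""
    model_gb = params_b * BYTES_PER_PARAM[dtype]
    return model_gb <= ram_gb * (1 - reserve_ratio), model_gb

for model, params, dtype, ram in [('Llama 3.2 3B', 3, 'Q4', 8),
                                  ('Llama 3.1 8B', 8, 'FP16', 8),
                                  ('Llama 3.1 8B', 8, 'Q4', 16),
                                  ('Llama 3.1 70B', 70, 'FP16', 192)]:
    ok, gb = fits(params, dtype, ram)
    print(f"{model} {dtype} ({gb:.1f} GB) on {ram}GB RAM: "
          f"{'가능' if ok else '불가능'}")
```

모델 크기 = 파라미터 수 × 타입당 바이트라는 근사만으로도 표의 판정 대부분을 재현할 수 있습니다.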

9. 미래: LLM 칩 전쟁

범용 GPU만이 아니라, 전용 추론 칩들이 맹렬히 추격하고 있습니다.

전용 LLM 추론 칩의 현재:

1. Groq LPU (Language Processing Unit)
   아키텍처: 결정론적 데이터플로우 (컴파일 타임에 모든 실행 계획)
   강점: 매우 낮은 지연시간, 예측 가능한 성능
   실측: Llama 2 70B에서 500 tok/s (H100의 4배!)
   이유: 메모리 접근 패턴이 완전히 정적 → 파이프라인 완벽 활용
   약점: 특정 모델 아키텍처만 지원, 유연성 낮음

2. Cerebras WSE-3 (Wafer Scale Engine)
   칩 크기: 웨이퍼 전체 (46,225 mm²)
   AI 코어: 900,000개
   온칩 SRAM: 44 GB (집적 밀도 극대화)
   강점: 모델 전체를 칩 위에 올림 → HBM 접근 없음!
   약점: 1대 가격 수백만 달러

3. SambaNova Reconfigurable Dataflow Architecture (RDA)
   아키텍처: FPGA처럼 재구성 가능한 데이터플로우
   강점: 다양한 모델에 최적화 가능
   고객: 정부 기관, 대형 연구소

4. Etched Sohu
   아키텍처: 트랜스포머만을 위한 하드와이어드 칩
   특징: 트랜스포머 외 연산 불가 (그래서 매우 효율적)
   예상 성능: H100 대비 20배 효율

왜 전용 칩이 범용 GPU를 이길 수 있는가:
- GPU: 범용성의 오버헤드 (스케줄러, 레지스터 파일, 복잡한 캐시)
- 전용 칩: 알려진 워크로드에 맞게 하드웨어 자체가 최적화
- 트랜스포머 추론은 결정론적 → 컴파일 타임 최적화 극대화

미래 NPU의 방향

# 2026-2030 NPU 발전 예측

future_npu_trends = {
    '2026': [
        '모든 스마트폰에 30+ TOPS NPU 기본 탑재',
        '온디바이스 10B 모델 실시간 추론',
        'Apple ANE: 100+ TOPS 목표',
        'FP8 지원으로 정밀도/속도 균형',
    ],
    '2027-2028': [
        '3D 패키징: NPU + HBM 칩 적층',
        'Analog In-Memory Computing 실험적 도입',
        '30B 모델 스마트폰에서 실용적 추론',
        'PIM (Processing In Memory): 메모리 칩 내부에서 연산',
    ],
    '2029-2030': [
        '광 연산 (Photonic Computing) 실험적 NPU',
        '100B 모델 온디바이스 가능성',
        '에너지 효율: 현재 대비 10-50배 개선',
        'Neuromorphic 요소 도입 (스파이킹 뉴런)',
    ]
}

10. 실전: llama.cpp로 온디바이스 LLM 실행

# llama.cpp 설치 및 빌드 (Apple Silicon NPU 지원)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Metal (Apple GPU) 지원으로 빌드
cmake -B build -DLLAMA_METAL=ON
cmake --build build -j $(sysctl -n hw.ncpu)  # macOS (리눅스에서는 nproc)

# 모델 다운로드 (GGUF 형식, 이미 양자화됨)
# Llama 3.1 8B Q4_K_M: ~5.0 GB
huggingface-cli download \
  bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models

# 실행 (Metal GPU 사용)
# --n-gpu-layers: GPU로 오프로드할 레이어 수
./build/bin/llama-cli \
  -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -n 512 \
  --n-gpu-layers 35 \
  -p "트랜스포머 아키텍처의 핵심 혁신을 설명해줘"

# Apple M3 Max 예상 결과:
# - 로드 시간: ~3초
# - 처리량: ~45 tok/s (GPU 오프로드)
# - 메모리: ~5.5 GB

# Python에서 llama.cpp 사용 (llama-cpp-python 패키지)
from llama_cpp import Llama

# 모델 로드 (GPU 오프로드 포함)
llm = Llama(
    model_path="./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    n_gpu_layers=35,    # GPU로 오프로드할 레이어 수
    n_ctx=4096,         # 컨텍스트 길이
    n_threads=8,        # CPU 스레드 수
    verbose=False
)

# 추론 실행
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "당신은 AI 하드웨어 전문가입니다."},
        {"role": "user", "content": "NPU와 GPU의 차이를 설명해주세요."}
    ],
    max_tokens=512,
    temperature=0.7
)

print(response['choices'][0]['message']['content'])
print(f"\n생성 토큰 수: {response['usage']['completion_tokens']}")

# 성능 측정
import time

def benchmark_inference(llm, prompt, n_tokens=100):
    start = time.time()
    output = llm(prompt, max_tokens=n_tokens, echo=False)
    elapsed = time.time() - start
    tokens = output['usage']['completion_tokens']
    return tokens / elapsed

tok_per_sec = benchmark_inference(llm, "AI 칩 역사를 설명해줘", n_tokens=200)
print(f"벤치마크: {tok_per_sec:.1f} tokens/second")

마치며

NPU는 "AI를 모두의 손에"라는 비전을 실현하는 하드웨어 혁명입니다.

핵심 교훈들:

  1. 전력이 제약이다: 스마트폰에서 AI를 실행하려면 GPU의 1/50 전력이 필요 → NPU가 답
  2. LLM 추론은 메모리 바운드: TFLOPS보다 메모리 대역폭이 중요 → 양자화와 메모리 최적화가 핵심
  3. 양자화는 마법이 아니다: 철저한 캘리브레이션이 정확도 유지의 관건
  4. KV Cache가 긴 문맥의 열쇠: 효율적인 KV Cache 없이는 1000 토큰 이상 생성 불가
  5. 전용 칩의 시대가 온다: Groq, Cerebras, Etched 등 순수 LLM 추론 칩들이 GPU를 위협

하드웨어와 소프트웨어가 공진화하는 AI 시대에, TPU와 NPU를 이해하는 것은 단순한 호기심을 넘어 경쟁력 있는 AI 엔지니어가 되기 위한 필수 지식입니다.


참고 자료

  • Apple Neural Engine 특허 문서 (US Patent Office)
  • Qualcomm AI Engine Direct SDK 문서
  • "FlashAttention: Fast and Memory-Efficient Exact Attention" (Dao et al., 2022)
  • "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers" (2023)
  • "Roofline: An Insightful Visual Performance Model" (Williams et al., 2009)
  • llama.cpp GitHub: github.com/ggerganov/llama.cpp
  • Groq LPU 기술 백서 (groq.com)

NPU Deep Dive: How Transformer Architecture Runs Directly on Silicon

Introduction: AI Moves Into Your Pocket

When ChatGPT launched, you sent a request to a rack of H100s in a data center. By 2025, Llama 3.2 3B runs in real-time on your iPhone 16 — generating 30+ tokens per second on the Apple Neural Engine.

The hardware that makes this possible is the NPU (Neural Processing Unit).

An NPU is not simply a small GPU. It embodies a fundamentally different design philosophy — a chip that sacrifices generality to achieve extraordinary efficiency on a narrow class of operations. This post explains what that means in precise technical terms: how every transformer operation maps to silicon, why memory bandwidth matters more than TFLOPS for inference, and what the future of AI hardware looks like.


1. CPU vs GPU vs NPU: The Design Philosophy Triangle

CPU: "I handle complex tasks, one at a time, very fast"
+--------------------------------------------------+
| Core count:  8-128 (big cores)                   |
| Clock:       3-5 GHz (high single-thread perf)   |
| Strengths:   Complex control flow, branch pred   |
|              OS, databases, web servers           |
| Caches:      L1/L2/L3 hierarchy (MB-scale)       |
| Weakness:    Energy-inefficient for parallel math |
+--------------------------------------------------+

GPU: "I do simple things, millions at once"
+--------------------------------------------------+
| Core count:  Thousands to tens of thousands       |
| Clock:       1-3 GHz (low, but massively parallel)|
| Strengths:   Any parallel workload (render, AI)  |
| Memory:      GDDR6/HBM, high bandwidth            |
| Weakness:    300-700W power draw, flexibility tax |
+--------------------------------------------------+

NPU: "I do AI math only — with extreme efficiency"
+--------------------------------------------------+
| Core count:  Few but specialized MAC arrays       |
| Strengths:   Integer matrix multiply (INT8/INT4)  |
|              Quantized inference                  |
| Energy eff:  10-100x better than GPU             |
| Power:       1-10W (smartphone/laptop)            |
| Weakness:    No general compute, no FP32 training |
+--------------------------------------------------+

Why NPU is necessary:
- Smartphone AI: 5000 mAh battery, GPU = 30 min battery life
- NPU: same computation, 1/50th the power
- "Always-on AI": face detection, wake word, photo classification

Let's quantify the energy efficiency advantage:

# Energy efficiency comparison for INT8 inference
perf_data = {
    'NVIDIA H100 SXM':       {'tops': 3958, 'tdp_w': 700},
    'NVIDIA A100 40GB':      {'tops': 1248, 'tdp_w': 400},
    'AMD MI300X':            {'tops': 5220, 'tdp_w': 750},
    'Apple M4 Neural Engine': {'tops': 38,  'tdp_w': 4},   # Neural Engine TDP only
    'Qualcomm Hexagon NPU':  {'tops': 45,   'tdp_w': 5},
    'Intel Meteor Lake NPU': {'tops': 10,   'tdp_w': 8},
}

print(f"{'Hardware':<30} {'TOPS':>8} {'TDP (W)':>8} {'TOPS/W':>8}")
print("-" * 56)
for name, data in perf_data.items():
    efficiency = data['tops'] / data['tdp_w']
    print(f"{name:<30} {data['tops']:>8} {data['tdp_w']:>8} {efficiency:>8.2f}")

# Note: absolute TOPS numbers favor data center GPUs,
# but TOPS/W shows why mobile NPUs dominate on-device AI

2. Apple Neural Engine (ANE): Full Hardware Dissection

Apple's Neural Engine is the most mature consumer NPU, having shipped in every iPhone since 2017.

Apple Neural Engine Evolution:

A11 Bionic (2017): 2-core Neural Engine
  Performance: 0.6 TOPS
  Purpose:     Face ID, Animoji
  First neural engine in a consumer phone

A12 Bionic (2018): 8-core Neural Engine
  Performance: 5 TOPS
  Added:       On-device Siri, real-time image segmentation

A15 Bionic (2021): 16-core Neural Engine
  Performance: 15.8 TOPS
  Added:       On-device translation, Live Text in camera

A17 Pro (2023): 16-core, 3nm process
  Performance: 35 TOPS (INT8)
  Added:       Llama 3.2 3B local inference (~25 tok/s)

M4 (2024): 16-core Neural Engine
  Performance: 38 TOPS
  Unified Memory: 16GB-32GB shared with GPU + CPU

ANE Internal Architecture (based on public patents)

Apple Neural Engine Die Area (estimated):

+-------------------------------------------------------------+
|                    Neural Engine Block                       |
+-------------------------------------------------------------+
|  Command Processor (schedules work to execution units)      |
|  +--------------------------------------------------------+  |
|  |  16 Execution Units (cores)                           |  |
|  |  [MAC Array] [MAC Array] ... [MAC Array] x 16         |  |
|  |  Each EU: INT8/INT16 matrix multiply + activation fn  |  |
|  |           Layer Norm, Softmax accelerated in hardware  |  |
|  +--------------------------------------------------------+  |
+-------------------------------------------------------------+
|  Dedicated L1 SRAM: ~30 MB                                 |
|  (NOT shared with GPU -- no cache thrashing)               |
+-------------------------------------------------------------+
|  DMA Engine (dedicated hardware for memory moves)          |
+-------------------------------------------------------------+
|  Memory Interface: Unified Memory (CPU/GPU/ANE share)      |
+-------------------------------------------------------------+

Critical constraints:
- Programming: ONLY via CoreML (no bare-metal API access)
- Data types: FP16/INT8/INT16 (no FP32, can't do training)
- Batch size: limited (large batches become inefficient)
- Op coverage: unsupported ops fall back to GPU/CPU automatically

Using CoreML to Target ANE

# Convert a PyTorch model to CoreML -> runs on Apple Neural Engine
import torch
import coremltools as ct

# Example: small transformer model
class SmallTransformer(torch.nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        self.layers = torch.nn.ModuleList([
            torch.nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        ])
        self.output = torch.nn.Linear(d_model, 32000)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return self.output(x)

model = SmallTransformer()
model.eval()

# Trace the model
example_input = torch.zeros(1, 128, 512)
traced = torch.jit.trace(model, example_input)

# Convert to CoreML with quantization
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name='input',
                          shape=(1, ct.RangeDim(1, 512), 512))],
    compute_precision=ct.precision.FLOAT16,
    compute_units=ct.ComputeUnit.ALL  # ANE + GPU + CPU, auto-routed
)

# Post-training INT8 quantization for ANE
from coremltools.optimize.coreml import (
    OptimizationConfig, OpLinearQuantizerConfig, linear_quantize_weights
)

config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(
        mode='linear_symmetric',
        dtype='int8',
        granularity='per_channel'  # per-channel scales = better accuracy
    )
)
quantized_model = linear_quantize_weights(mlmodel, config)
quantized_model.save('transformer_int8.mlpackage')

# Runtime inference (on-device)
import coremltools as ct
model = ct.models.MLModel('transformer_int8.mlpackage')
predictions = model.predict({'input': input_array})
# CoreML automatically routes to ANE if supported, else GPU/CPU

3. Mapping Every Transformer Operation to Hardware

Let's trace exactly which hardware unit handles each step of a transformer forward pass.

Transformer Forward Pass -> Hardware Mapping:
+----------------------------------------------------------------+
| Input Token IDs [batch, seq_len]                               |
|  -> Embedding Lookup (Gather op)                               |
|  Hardware: NPU SRAM cache for embedding table                  |
|  Cost: near-zero (table lookup, no compute)                    |
+----------------------------------------------------------------+
| Layer Normalization                                            |
|  -> mean -> variance -> normalize -> scale -> bias             |
|  Hardware: NPU vector ALU (fused into single kernel by XLA)   |
|  Compute share: ~1-2%                                          |
+----------------------------------------------------------------+
| Q, K, V Projection (3x Linear Layers)                        |
|  [batch, seq, d_model] x [d_model, d_head] = GEMM            |
|  Hardware: Systolic Array / MAC array (~40% of power draw)    |
|  Compute share: ~38% (dominant operation!)                     |
+----------------------------------------------------------------+
| Attention Score: Q x K^T / sqrt(d_head)                       |
|  [b, heads, seq, d_h] x [b, heads, d_h, seq]                  |
|  Hardware: GPU Tensor Core / NPU MAC (O(n^2 x d) complexity)  |
|  Compute share: ~12% (scales quadratically with seq length!)  |
+----------------------------------------------------------------+
| Softmax: exp -> sum -> divide                                  |
|  Hardware: NPU vector unit (exp needs special function hw)    |
|  Compute share: ~1% (but memory-intensive, not fusable easily) |
+----------------------------------------------------------------+
| Attention x V (GEMM)                                          |
|  [b, heads, seq, seq] x [b, heads, seq, d_h]                  |
|  Hardware: MAC array                                           |
|  Compute share: ~12%                                           |
+----------------------------------------------------------------+
| Output Projection + FFN Layer 1 + FFN Layer 2 (3x GEMM)      |
|  Hardware: MAC array (largest matrix: d_model x 4*d_model)   |
|  Compute share: ~37%                                           |
+----------------------------------------------------------------+

Summary:
By FLOP count, GEMMs (projections, attention, FFN) dominate at ~97%+ -> MAC arrays / Systolic Arrays
The remaining vector ops (LayerNorm, Softmax, GELU) are tiny in FLOPs but memory-bound, so they take an outsized share of wall-clock time -> NPU vector units
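The split above can be made concrete with a quick FLOP count for one decoder layer. A sketch assuming Llama-2-7B-like dimensions (the helper name and the ~5-FLOPs-per-element cost of vector ops are illustrative assumptions); counted by raw FLOPs, the GEMM skew is even sharper, because vector ops touch each element only a handful of times:

```python
def layer_flop_breakdown(seq=2048, d=4096, d_ff=11008, n_heads=32):
    """Multiply-add counted as 2 FLOPs; small terms (bias, residual) ignored."""
    qkv      = 3 * 2 * seq * d * d        # Q, K, V projections (GEMM)
    scores   = 2 * seq * seq * d          # Q @ K^T, summed over all heads
    attn_v   = 2 * seq * seq * d          # attention weights @ V
    out_proj = 2 * seq * d * d            # output projection (GEMM)
    ffn      = 2 * 2 * seq * d * d_ff     # two FFN GEMMs
    gemm = qkv + scores + attn_v + out_proj + ffn

    # Vector ops touch each element only a few times (assumed ~5 FLOPs/element)
    vector = seq * (2 * 5 * d + n_heads * 5 * seq + 5 * d_ff)
    total = gemm + vector
    return gemm / total, vector / total

gemm_share, vec_share = layer_flop_breakdown()
print(f"GEMM share of layer FLOPs:   {gemm_share:.1%}")
print(f"Vector share of layer FLOPs: {vec_share:.1%}")
```

The FLOP share understates how much time vector ops cost on real hardware: being memory-bound, they run far below peak throughput.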

Flash Attention: Solving the Memory Problem

Standard attention requires storing an n×n score matrix in memory. For long sequences, this is catastrophic:

def analyze_attention_memory(seq_len, d_model, n_heads, batch_size=1):
    d_head = d_model // n_heads
    dtype_bytes = 2  # FP16

    # Standard attention: O(n^2) memory
    # Stores full [batch, heads, seq, seq] attention score matrix
    attn_score_bytes = batch_size * n_heads * seq_len * seq_len * dtype_bytes
    attn_score_gb = attn_score_bytes / (1024**3)

    # Flash attention: O(n) memory
    # Processes in tiles, never materializes full score matrix
    # Extra memory is the output/softmax accumulators: O(seq_len x d_head) per head
    flash_extra_bytes = batch_size * n_heads * seq_len * d_head * dtype_bytes
    flash_extra_gb = flash_extra_bytes / (1024**3)

    print(f"Sequence length: {seq_len}")
    print(f"Standard attention score matrix: {attn_score_gb:.2f} GB")
    print(f"Flash attention extra memory:    {flash_extra_gb:.4f} GB")
    print(f"Memory reduction:                {attn_score_gb/flash_extra_gb:.0f}x")

# GPT-4-scale (estimated): seq=8192, d=12288, heads=96
analyze_attention_memory(8192, 12288, 96)
# Standard: ~12.0 GB for ONE layer's score matrix!
# Flash Attention extra: ~0.19 GB
# ~64x memory reduction

# This is why Flash Attention is essential for NPUs:
# NPU SRAM is typically 30-100 MB
# Standard attention at seq=8192 needs ~12 GB for scores -- impossible on-chip
# Flash Attention's tiles fit in SRAM -> long sequences on NPU!
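The tiling trick fits in about twenty lines of NumPy: process K/V block by block, keep a running max and running denominator per query row (the "online softmax"), and rescale the partial output whenever the running max changes. A minimal single-head sketch, not a real kernel:

```python
import numpy as np

def flash_attention_1head(q, k, v, block=64):
    """Tiled attention: never materializes the full [n, n] score matrix."""
    n, d = q.shape
    out = np.zeros_like(q)
    row_max = np.full(n, -np.inf)   # running max per query row
    row_sum = np.zeros(n)           # running softmax denominator
    scale = 1.0 / np.sqrt(d)
    for s in range(0, n, block):
        kb, vb = k[s:s+block], v[s:s+block]
        scores = (q @ kb.T) * scale              # only an [n, block] tile
        new_max = np.maximum(row_max, scores.max(axis=1))
        corr = np.exp(row_max - new_max)         # rescale old accumulators
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * corr + p.sum(axis=1)
        out = out * corr[:, None] + p @ vb
        row_max = new_max
    return out / row_sum[:, None]

# Verify against standard (full-matrix) attention
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
s = (q @ k.T) / np.sqrt(64)
p = np.exp(s - s.max(axis=1, keepdims=True))
ref = (p / p.sum(axis=1, keepdims=True)) @ v
print(np.allclose(flash_attention_1head(q, k, v), ref))  # True
```

On an NPU the `[n, block]` tile lives in SRAM; only the final output ever goes back to DRAM.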

4. Why LLM Inference is Memory-Bound: The Roofline Model

This is THE most important concept for understanding LLM inference performance. Most engineers intuitively think "more TFLOPS = faster LLM." This is wrong.

The Roofline Analysis

# LLM inference bottleneck analysis using the Roofline Model

model_config = {
    'name': 'Llama 2 7B',
    'num_params': 7_000_000_000,
    'bytes_per_param': 2,   # FP16
}

model_size_bytes = model_config['num_params'] * model_config['bytes_per_param']
model_size_gb = model_size_bytes / (1024**3)
print(f"Model size: {model_size_gb:.1f} GB")  # 13.0 GB (7e9 params x 2 bytes, in GiB)

hardware = {
    'H100 SXM': {
        'memory_bw_gbs': 3350,   # GB/s
        'compute_tflops': 1979,  # FP16 TFLOPS
    },
    'A100 80GB': {
        'memory_bw_gbs': 2000,
        'compute_tflops': 312,
    },
    'RTX 4090': {
        'memory_bw_gbs': 1008,
        'compute_tflops': 82.6,
    },
    'Apple M3 Max GPU': {
        'memory_bw_gbs': 300,    # unified memory
        'compute_tflops': 14.2,
    }
}

print(f"\n{'Hardware':<20} {'Mem (ms)':>10} {'Compute (ms)':>13} {'Bottleneck':>15} {'Est tok/s':>10}")
print("-" * 72)

for hw_name, hw in hardware.items():
    # Per token: must read ALL weights from memory
    # (batch_size=1: no weight reuse across tokens)
    mem_time_ms = (model_size_bytes / (hw['memory_bw_gbs'] * 1e9)) * 1000

    # Compute time: 2 * num_params FLOPs / available FLOPS
    flops_per_token = 2 * model_config['num_params']
    compute_time_ms = (flops_per_token / (hw['compute_tflops'] * 1e12)) * 1000

    bottleneck_time_ms = max(mem_time_ms, compute_time_ms)
    tok_per_sec = 1000.0 / bottleneck_time_ms

    # Arithmetic Intensity (AI): FLOPs per byte
    ai_actual = flops_per_token / model_size_bytes
    # Hardware ridge point: compute/bandwidth ratio
    ai_ridge = (hw['compute_tflops'] * 1e12) / (hw['memory_bw_gbs'] * 1e9)

    bottleneck = "Memory-bound" if ai_actual < ai_ridge else "Compute-bound"
    print(f"{hw_name:<20} {mem_time_ms:>10.2f} {compute_time_ms:>13.4f} "
          f"{bottleneck:>15} {tok_per_sec:>10.0f}")

# Expected output (batch_size=1):
# H100 SXM:     4.18 ms memory,  0.007 ms compute -> Memory-bound -> ~239 tok/s
# A100 80GB:    7.00 ms memory,  0.045 ms compute -> Memory-bound -> ~143 tok/s
# RTX 4090:    13.89 ms memory,  0.170 ms compute -> Memory-bound ->  ~72 tok/s
# Apple M3 Max:46.67 ms memory,  0.985 ms compute -> Memory-bound ->  ~21 tok/s
# Conclusion: ALL hardware is memory-bound at batch_size=1!

What Memory-Bound Means in Practice

Memory-bound inference has counter-intuitive implications:

1. Doubling TFLOPS -> NO speedup
   (Memory bandwidth is the bottleneck, not compute)

2. Doubling memory bandwidth -> exactly 2x speedup
   This is why H100 (3.35 TB/s) beats A100 (2.0 TB/s) by ~1.6x for inference

3. Halving model size (quantization) -> exactly 2x speedup
   FP16 -> INT8: model size halved, 2x more tokens/sec
   INT8 -> INT4: model size halved again, 2x more tokens/sec

4. Increasing batch size -> transition to compute-bound
   At batch_size=1: weight is read once, used for 1 token (wasteful)
   At batch_size=64: weight is read once, used for 64 tokens (efficient)
   Large batches => weight reuse => compute-bound => TFLOPS matters

5. Why Apple Silicon is competitive:
   M3 Ultra: 800 GB/s unified memory (vs H100's 3.35 TB/s)
   BUT: 192 GB capacity fits large models without quantization
   M3 Max (128 GB): runs Llama 3.1 70B Q8 (~75 GB) locally; FP16 (~140 GB) needs 2x H100 80GB

6. Why AMD MI300X beats H100 for inference:
   MI300X: 5.3 TB/s memory bandwidth (vs H100's 3.35 TB/s)
   At batch_size=1: MI300X is ~1.6x faster, despite fewer TFLOPS
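Points 2 and 3 above follow directly from the roofline arithmetic: in the memory-bound regime, decode throughput is just bandwidth divided by bytes moved per token. A sketch (the 100 GB/s NPU memory path is an assumed figure):

```python
# Memory-bound decode: throughput = bandwidth / bytes-per-token
def decode_tok_per_sec(n_params, bytes_per_param, bw_gbs):
    bytes_per_token = n_params * bytes_per_param   # all weights read once per token
    return bw_gbs * 1e9 / bytes_per_token

# Hypothetical 7B model on an assumed 100 GB/s memory path
for fmt, b in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    print(f"{fmt}: {decode_tok_per_sec(7e9, b, 100):.1f} tok/s")
# FP16: 7.1, INT8: 14.3, INT4: 28.6 -> each halving of precision doubles throughput
```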

5. KV Cache: The Memory Cost of Long Contexts

Without KV Cache, generating a 1000-token response would require recomputing all attention weights from scratch for every new token — O(n²) complexity per output token.

# KV Cache memory analysis

def compute_kv_cache_size(model_name, context_len, n_layers,
                          n_kv_heads, head_dim, batch_size=1,
                          dtype_bytes=2):
    """
    KV Cache stores Key and Value tensors for all previous tokens
    Shape: [batch, n_layers, 2 (K+V), context_len, n_kv_heads, head_dim]
    """
    kv_bytes = (context_len * n_layers * 2 *
                n_kv_heads * head_dim * batch_size * dtype_bytes)
    kv_gb = kv_bytes / (1024**3)

    # Per-token memory cost (writing one new token's K and V into the cache)
    per_token_bytes = n_layers * 2 * n_kv_heads * head_dim * dtype_bytes
    per_token_kb = per_token_bytes / 1024

    print(f"\n=== {model_name} ===")
    print(f"  Context length:     {context_len:>8,} tokens")
    print(f"  KV Cache size:      {kv_gb:>8.2f} GB")
    print(f"  Per-token KV write: {per_token_kb:>8.1f} KB")

# Standard models (Multi-Head Attention, n_kv_heads = n_heads)
compute_kv_cache_size("Llama 2 7B",
                      context_len=4096, n_layers=32,
                      n_kv_heads=32, head_dim=128)
# KV Cache: 2.0 GB for 4K context

compute_kv_cache_size("Llama 2 70B",
                      context_len=4096, n_layers=80,
                      n_kv_heads=64, head_dim=128)
# KV Cache: 10.0 GB for 4K context (as if MHA -- the released Llama 2 70B actually uses GQA with 8 KV heads)

# Models with Grouped Query Attention (GQA) -- fewer KV heads
compute_kv_cache_size("Llama 3.1 8B (GQA 8 heads)",
                      context_len=128_000, n_layers=32,
                      n_kv_heads=8, head_dim=128)
# KV Cache: 16.0 GB for 128K context (vs 64 GB without GQA!)

compute_kv_cache_size("Llama 3.1 70B (GQA 8 heads)",
                      context_len=128_000, n_layers=80,
                      n_kv_heads=8, head_dim=128)
# KV Cache: 40.0 GB for 128K context
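What the cache actually buys during decode can be shown in a toy single-head loop: each step projects only the newest token's K and V and appends one row, then attends over everything cached, instead of recomputing K/V for the entire prefix. A sketch with random weights (no real model):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

k_cache = np.empty((0, d))
v_cache = np.empty((0, d))

def decode_step(x):
    """x: [d] embedding of the newest token only."""
    global k_cache, v_cache
    q = x @ Wq
    k_cache = np.vstack([k_cache, x @ Wk])   # append ONE new K row
    v_cache = np.vstack([v_cache, x @ Wv])   # append ONE new V row
    scores = (k_cache @ q) / np.sqrt(d)      # attend over all cached keys
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v_cache

for t in range(5):
    out = decode_step(rng.standard_normal(d))
print(k_cache.shape)  # (5, 64): the cache grows by one row per generated token
```

The trade is memory for compute: O(n) cached rows per layer instead of O(n^2) recomputation per token.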

GQA and MQA: Reducing KV Cache

# Grouped Query Attention (GQA): fewer KV heads, same Q heads
# Used in Llama 3, Mistral, Gemma, Falcon

def compare_attention_variants(seq_len, n_layers, d_model,
                                n_q_heads, n_kv_heads_gqa,
                                dtype_bytes=2):
    head_dim = d_model // n_q_heads

    # MHA: n_kv = n_q (standard multi-head attention)
    mha_kv_gb = (seq_len * n_layers * 2 * n_q_heads *
                 head_dim * dtype_bytes) / (1024**3)

    # GQA: n_kv < n_q (grouped -- multiple Q heads share one KV)
    gqa_kv_gb = (seq_len * n_layers * 2 * n_kv_heads_gqa *
                 head_dim * dtype_bytes) / (1024**3)

    # MQA: n_kv = 1 (extreme -- all Q heads share single KV)
    mqa_kv_gb = (seq_len * n_layers * 2 * 1 *
                 head_dim * dtype_bytes) / (1024**3)

    reduction_gqa = (1 - gqa_kv_gb/mha_kv_gb) * 100
    reduction_mqa = (1 - mqa_kv_gb/mha_kv_gb) * 100

    print(f"Seq: {seq_len}, Q heads: {n_q_heads}, GQA KV heads: {n_kv_heads_gqa}")
    print(f"  MHA KV cache: {mha_kv_gb:.2f} GB (baseline)")
    print(f"  GQA KV cache: {gqa_kv_gb:.2f} GB ({reduction_gqa:.0f}% reduction)")
    print(f"  MQA KV cache: {mqa_kv_gb:.2f} GB ({reduction_mqa:.0f}% reduction)")

# Llama 3.1 8B: 32 Q heads, 8 KV heads (GQA)
compare_attention_variants(
    seq_len=128_000, n_layers=32, d_model=4096,
    n_q_heads=32, n_kv_heads_gqa=8
)
# GQA: 75% KV cache reduction!
# NPU SRAM fits much more context with GQA
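Mechanically, GQA stores only the small KV tensor and broadcasts each KV head across its group of query heads at compute time. A sketch of that `repeat_kv`-style expansion (shapes and the float64 toy tensors are illustrative):

```python
import numpy as np

n_q_heads, n_kv_heads, d_head, seq = 32, 8, 128, 16
group = n_q_heads // n_kv_heads               # 4 query heads share each KV head

k = np.random.randn(n_kv_heads, seq, d_head)  # cache stores only 8 KV heads
k_expanded = np.repeat(k, group, axis=0)      # logical 32-head view for attention
print(k.nbytes // 1024, "KB stored vs", k_expanded.nbytes // 1024, "KB logical")
# The cache holds 1/4 the bytes; the expansion is a cheap broadcast at compute time
```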

6. Quantization: How It Supercharges NPU Performance

Quantization is the single most impactful optimization for on-device LLM inference.

Number formats and hardware support:

FP32: [1 sign][8 exp][23 mantissa] = 32 bits
      Standard for training, all hardware supports
      7B model: 28 GB

FP16: [1 sign][5 exp][10 mantissa] = 16 bits
      Inference standard, GPU/NPU support
      7B model: 14 GB

INT8: [1 sign][7 magnitude]         = 8 bits   <- NPU default
      All NPUs support, 4x SIMD throughput vs INT32
      7B model: 7 GB
      Accuracy loss: typically < 0.5%

INT4: 4 bits                         <- Modern NPUs (A17 Pro, Hexagon)
      2x throughput vs INT8
      7B model: 3.5 GB
      Accuracy loss: 1-3% (with GPTQ calibration)

INT8 SIMD advantage on NPU:
- 32-bit register holds 4x INT8 values -> 4x throughput
- In practice: INT8 GEMM is 4-8x faster than FP32 GEMM
- Memory bandwidth savings scale perfectly: INT8 reads 4x less
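The INT8 pipeline above can be sketched end to end: quantize both operands symmetrically, multiply as INT8, accumulate in INT32 (a single product of two INT8 values already overflows INT8), and dequantize with the combined scale. An illustrative round-trip, not a hardware kernel:

```python
import numpy as np

def quantize_sym(x):
    """Symmetric per-tensor INT8 quantization: scale maps max|x| to 127."""
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).astype(np.int8), scale

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 128)).astype(np.float32)
b = rng.standard_normal((128, 32)).astype(np.float32)

a_q, sa = quantize_sym(a)
b_q, sb = quantize_sym(b)

# An INT8 x INT8 product reaches +/-16129 and is summed over 128 terms,
# so accumulation must happen in INT32 -- exactly what NPU MAC arrays provide
c_int32 = a_q.astype(np.int32) @ b_q.astype(np.int32)
c = c_int32 * (sa * sb)   # dequantize with the combined scale

rel_err = np.abs(c - a @ b).mean() / np.abs(a @ b).mean()
print(f"mean relative error vs FP32: {rel_err:.3%}")
```

Per-channel scales (next section) shrink this error further by giving each output row its own scale.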

Post-Training Quantization: Production Techniques

# Technique 1: LLM.int8() -- handles outlier activations
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

config_int8 = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,    # activations above this use FP16
    llm_int8_has_fp16_weight=False
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=config_int8,
    device_map='auto',
    torch_dtype=torch.float16
)
# FP16: 16 GB -> INT8: ~8.5 GB, accuracy loss ~0.3%

# Technique 2: GPTQ -- second-order (Hessian-based) post-training quantization
# Much better accuracy than naive INT4 rounding
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,                    # 4-bit quantization
    group_size=128,            # group size (128 = good accuracy/speed tradeoff)
    damp_percent=0.1,          # GPTQ damping factor
    desc_act=True,             # activation order (better accuracy)
)

# GPTQ requires tokenized calibration examples
from transformers import AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
dataset = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train')
calibration_data = [
    tokenizer(dataset[i]['text'], return_tensors='pt')
    for i in range(128)
    if len(dataset[i]['text']) > 50
]

model = AutoGPTQForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantize_config
)
model.quantize(calibration_data)
model.save_quantized("llama3-8b-gptq-4bit")
# FP16: 16 GB -> GPTQ INT4: ~4.5 GB, accuracy loss ~1.2%

# Technique 3: AWQ (Activation-aware Weight Quantization)
# Finds the most important weights and protects them
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model_awq = AutoAWQForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    safetensors=True
)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}
model_awq.quantize(tokenizer, quant_config=quant_config)
model_awq.save_quantized("llama3-8b-awq-4bit")
# AWQ typically achieves ~0.5% better accuracy than GPTQ at same bit-width

Why Per-Channel Quantization Matters

# Per-tensor vs per-channel quantization accuracy
import numpy as np

def simulate_quantization_error(weights, bits=8, per_channel=True):
    """
    Per-channel: each output channel has its own scale
    Per-tensor: single scale for entire weight matrix
    """
    if per_channel:
        # Scale per output channel (row)
        max_vals = np.abs(weights).max(axis=1, keepdims=True)
        scales = max_vals / (2**(bits-1) - 1)
    else:
        # Single scale for entire tensor
        max_val = np.abs(weights).max()
        scales = max_val / (2**(bits-1) - 1)

    # Quantize
    weights_q = np.round(weights / scales).clip(-(2**(bits-1)), 2**(bits-1)-1)
    # Dequantize
    weights_dq = weights_q * scales

    error = np.mean((weights - weights_dq)**2)
    return error

# Simulate with realistic weight distribution
np.random.seed(42)
# Linear layer weights: approximately Gaussian but with different scales per channel
weights = np.random.randn(256, 4096) * np.random.exponential(1.0, (256, 1))

err_per_tensor = simulate_quantization_error(weights, per_channel=False)
err_per_channel = simulate_quantization_error(weights, per_channel=True)

print(f"Per-tensor INT8 MSE:  {err_per_tensor:.6f}")
print(f"Per-channel INT8 MSE: {err_per_channel:.6f}")
print(f"Improvement: {err_per_tensor/err_per_channel:.1f}x better accuracy")
# Typical: 5-50x better accuracy with per-channel quantization
# Cost: one scale value per output channel (negligible overhead)

7. Qualcomm Hexagon NPU and Intel NPU

Qualcomm Snapdragon X Elite: Hexagon NPU

Qualcomm Hexagon NPU (Snapdragon X Elite, 2024):

Performance: 45 TOPS (INT8)
Architecture:
  HTA (Hexagon Tensor Accelerator):
    - Primary GEMM accelerator
    - Supports INT4, INT8, FP16
    - On-chip SRAM: ~4 MB
  HMNN (Hexagon Multi-Network Node):
    - Run multiple AI networks simultaneously
    - Real-time + background AI concurrently
  Vector DSP + Scalar DSP:
    - Handle activation functions, softmax, etc.

Supported LLM Inference (on-device):
  Llama 3.2 3B INT4:    ~30 tok/s
  Phi-3.5 mini 3.8B:    ~25 tok/s
  Gemma 2 2B INT4:      ~35 tok/s

Programming:
  - Qualcomm AI Engine Direct SDK (low-level)
  - ONNX Runtime with QNN backend
  - llama.cpp with Hexagon backend

Windows Copilot Plus PC AI features (all run on NPU):
  - Live Captions: real-time speech transcription
  - Cocreator: AI image generation
  - Smart Snapshots: AI scene understanding
  All of these: battery-efficient because NPU handles it

Intel Meteor Lake NPU

Intel Neural Processing Unit (Core Ultra / Meteor Lake, 2023):

Performance: 10-11 TOPS (INT8)
Architecture:
  - NN Compute Engine: MAC array
  - Slice architecture: independent compute tiles
  - Dedicated memory controller

Strengths:
  - Always-on AI (sips power in background)
  - Windows Studio Effects (camera pipeline runs on the NPU)
  - Real-time noise suppression, eye contact correction

OpenVINO integration:
from openvino.runtime import Core

ie = Core()
# List available devices
print(ie.available_devices)  # e.g. ['CPU', 'GPU', 'NPU']

# Compile an ONNX/IR model specifically for the NPU
onnx_model_path = "model.onnx"  # path to your exported model
compiled_model = ie.compile_model(
    model=onnx_model_path,
    device_name="NPU",
    config={"PERFORMANCE_HINT": "THROUGHPUT"}
)

# Run inference
output = compiled_model({compiled_model.input(): input_data})

Use cases:
  - Windows Studio Effects (background blur, eye gaze correction)
  - Voice recognition pre-processing (wake word detection)
  - Real-time translation with small models
  - Image enhancement (computational photography)
  - NOT suitable: large LLMs (10 TOPS is insufficient)

8. Device-by-Device LLM Capability Matrix

As of 2025: practical on-device LLM inference capabilities

iPhone 16 Pro (8GB RAM, A18 Pro, ~35 TOPS ANE):
  Feasible:   Llama 3.2 3B INT4 (~2.0 GB, ~25 tok/s)
  Feasible:   Phi-3.5 mini 3.8B INT4 (~2.3 GB, ~20 tok/s)
  Feasible:   Llama 3.1 8B INT4 (~5.0 GB, ~12 tok/s)
  Infeasible: Llama 3.1 8B FP16 (16 GB required, RAM exhausted)
  Infeasible: 70B models (even INT4 needs 35+ GB)

MacBook Air M3 16GB (~100 GB/s):
  Feasible:   Llama 3.1 8B Q4 (~5 GB, ~12 tok/s)
  Feasible:   Mistral 7B Q4 (~4.5 GB, ~14 tok/s)
  Borderline: Llama 3.1 8B Q8 (~9 GB, ~7 tok/s)
  Infeasible: Llama 3.1 70B (even Q4 needs 40 GB)

MacBook Pro M3 Max 128GB (400 GB/s):
  Feasible:   Llama 3.1 70B Q4 (~40 GB, ~9 tok/s)
  Feasible:   Llama 3.1 70B Q8 (~75 GB, ~5 tok/s)
  Feasible:   Llama 3.1 405B Q2_K (~100 GB, ~3 tok/s, low quality)
  Infeasible: 405B Q4 needs ~220 GB

Mac Studio M3 Ultra 192GB (800 GB/s):
  Feasible:   Llama 3.1 70B Q4 (~40 GB, ~18 tok/s)
  Feasible:   Llama 3.1 70B FP16 (~140 GB, ~5 tok/s)
  Infeasible: Llama 3.1 405B Q4 (~230 GB exceeds 192 GB)
  This is a legitimate machine for local 70B-class inference

Snapdragon X Elite PC (32GB):
  Feasible:   Llama 3.2 3B Q4 (~30 tok/s on NPU)
  Feasible:   Phi-3.5 mini (~25 tok/s on NPU)
  Feasible:   Llama 3.1 8B Q4 (~15 tok/s on GPU)
  Infeasible: 70B models (RAM exhausted)
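The feasibility rows above reduce to simple capacity arithmetic: model bytes plus KV cache plus runtime overhead must fit in RAM. A back-of-envelope helper (the 2 GB overhead figure is an assumption; throughput is a separate bandwidth question):

```python
def fits_in_ram(model_gb, ram_gb, kv_cache_gb=0.0, overhead_gb=2.0):
    """Capacity check only: weights + KV cache + OS/runtime overhead vs RAM."""
    return model_gb + kv_cache_gb + overhead_gb <= ram_gb

print(fits_in_ram(5.0, 8))     # Llama 3.1 8B Q4 on an 8 GB phone: True (barely)
print(fits_in_ram(16.0, 8))    # 8B FP16 on 8 GB: False
print(fits_in_ram(40.0, 128))  # 70B Q4 on a 128 GB Mac: True
```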

9. The LLM Chip Wars: Beyond GPU

General-purpose GPUs are facing serious competition from dedicated inference accelerators.

Groq LPU: Deterministic Dataflow

Groq Language Processing Unit (LPU):

Core innovation: deterministic dataflow architecture
- At compile time: EVERY memory access, EVERY operation is statically scheduled
- At runtime: zero scheduling overhead, zero cache misses (by design)
- Result: perfectly predictable, maximally pipelined execution

Why this wins for LLM inference:
- LLM inference has STATIC computation graph
- Same model = same sequence of operations every time
- Compiler can optimize perfectly because nothing is unknown

Real numbers:
- Llama 2 70B: ~300 tok/s per user on Groq (vs a few tens of tok/s per user on H100)
- Reason: LPU never stalls waiting for memory
- 14 TOPS/W efficiency vs H100's 2.8 TOPS/W

Limitations:
- Only runs specific pre-compiled model architectures
- Recompile required for any model change
- Very limited flexibility outside inference

Cerebras WSE-3: The Wafer-Scale Monster

Cerebras Wafer Scale Engine 3 (2024):

Size:         46,225 mm^2 (an entire silicon wafer)
AI cores:     900,000 cores
On-chip SRAM: 44 GB (!) at petabytes-per-second aggregate bandwidth
Process:      TSMC 5nm

The key insight:
- Enough on-chip SRAM to hold entire models' weights
- NO HBM round-trips for models that fit on-die
- Eliminates the memory bandwidth bottleneck completely

For a 7B INT8 model:
- Model weights: 7 GB -> fits entirely in the 44 GB of on-chip SRAM
- Larger models stream weights from external MemoryX units

Performance:
- 1.5x - 3x faster than H100 for transformer training
- For inference of models fitting on SRAM: latency near-zero
- Use case: organizations needing ultra-fast LLM research iterations

SambaNova Reconfigurable Dataflow Architecture

SambaNova RDA (Reconfigurable Dataflow Architecture):

Concept: FPGA-like programmability + ASIC performance
- Dataflow graph is mapped to physical silicon connections
- Reconfigure the chip's connectivity pattern per model
- Avoids von Neumann bottleneck for known dataflows

Advantages:
- No control overhead (connections ARE the computation)
- Excellent for batch inference (government, research)
- Can specialize hardware routing per customer model

Customers: US national labs, government agencies

Why general-purpose hardware will eventually lose:
- GPU scheduler, register file, cache hierarchy = overhead
- For known, static computation: these overheads are pure waste
- Specialized chips eliminate this waste entirely

10. Practical: Running LLM Inference On-Device

With llama.cpp (Cross-Platform)

# Build llama.cpp with Metal (Apple Silicon) support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build (Metal acceleration is enabled by default on Apple Silicon;
# the explicit flag is -DGGML_METAL=ON in recent versions)
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j "$(sysctl -n hw.ncpu)"  # nproc does not exist on macOS

# Download quantized model (GGUF format)
# Llama 3.1 8B at Q4_K_M: ~5.0 GB
pip install huggingface-hub
huggingface-cli download \
  bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models

# Run inference (Metal-accelerated on Apple Silicon)
./build/bin/llama-cli \
  -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -n 512 \
  --n-gpu-layers 35 \
  --ctx-size 4096 \
  -p "Explain the roofline model for LLM inference"

# Expected results (Apple M3 Max 16-core):
# Load time:   ~3 seconds
# Throughput:  ~45 tok/s (with GPU offload)
# Memory:      ~5.5 GB

# Python binding for llama.cpp (pip install llama-cpp-python)
from llama_cpp import Llama
import time

# Load model
llm = Llama(
    model_path="./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    n_gpu_layers=35,     # Layers to offload to GPU/NPU
    n_ctx=4096,          # Context window
    n_threads=8,         # CPU threads for non-GPU layers
    flash_attn=True,     # Use Flash Attention (memory efficient)
    verbose=False
)

# Benchmark inference
def benchmark(llm, prompt, n_tokens=200):
    start = time.perf_counter()
    output = llm(prompt, max_tokens=n_tokens, echo=False)
    elapsed = time.perf_counter() - start
    n_out = output['usage']['completion_tokens']
    return n_out, elapsed, n_out / elapsed

n_tok, t, speed = benchmark(
    llm,
    "Explain what makes NPUs more efficient than GPUs for LLM inference:",
    n_tokens=200
)
print(f"Generated {n_tok} tokens in {t:.2f}s = {speed:.1f} tok/s")

# Run llama.cpp's built-in throughput benchmark (llama-bench binary)
import subprocess
result = subprocess.run(
    ['./build/bin/llama-bench',
     '-m', './models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf',
     '-p', '512', '-n', '128'],
    capture_output=True, text=True)
print(result.stdout)

Building a Simple Inference Server

# FastAPI server with local LLM (production-ready pattern)
from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama
import asyncio
from concurrent.futures import ThreadPoolExecutor

app = FastAPI()
executor = ThreadPoolExecutor(max_workers=1)  # LLM is not thread-safe

# Load model at startup
llm = Llama(
    model_path="./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    n_gpu_layers=35,
    n_ctx=4096,
    flash_attn=True,
    verbose=False
)

class InferenceRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7

@app.post("/generate")
async def generate(request: InferenceRequest):
    loop = asyncio.get_running_loop()

    def run_inference():
        return llm.create_chat_completion(
            messages=[{"role": "user", "content": request.prompt}],
            max_tokens=request.max_tokens,
            temperature=request.temperature
        )

    # Run in thread pool to avoid blocking the event loop
    result = await loop.run_in_executor(executor, run_inference)
    return {
        "response": result['choices'][0]['message']['content'],
        "tokens_generated": result['usage']['completion_tokens'],
    }

# Run: uvicorn server:app --host 0.0.0.0 --port 8080

Conclusion

NPUs represent a hardware revolution that's bringing AI from the cloud into every pocket and laptop.

The key lessons for any LLM infrastructure engineer:

  1. Power is the constraint: Running AI on a smartphone requires 1/50th of GPU power -> NPU is the only answer
  2. LLM inference is memory-bound, always: At batch_size=1, memory bandwidth determines performance — not TFLOPS. This single insight changes every hardware decision.
  3. Quantization is the unlock: INT4 vs FP16 = 4x less memory = 4x more tokens/sec. Good calibration is the difference between "usable" and "broken."
  4. KV Cache is the memory budget: For long-context models, KV Cache often exceeds model weights in memory. GQA reduces this by 4-8x.
  5. Specialized chips will win: Groq, Cerebras, Etched — purpose-built inference hardware is already beating GPUs on efficiency. The question is just when it becomes cost-competitive at scale.

The hardware and software co-evolution happening right now is the most exciting frontier in systems engineering. Understanding TPUs and NPUs isn't just curiosity — it's the foundational knowledge of the AI infrastructure engineer's craft.


References

  • Apple Neural Engine Patent Filings (US Patent Office, 2017-2024)
  • Qualcomm AI Engine Direct SDK Documentation
  • "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning" (Dao, 2023)
  • "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers" (Frantar et al., 2022)
  • "AWQ: Activation-aware Weight Quantization for LLM Compression" (Lin et al., 2023)
  • "Roofline: An Insightful Visual Performance Model" (Williams et al., 2009)
  • llama.cpp: github.com/ggerganov/llama.cpp
  • Groq LPU Technical Whitepaper: groq.com
  • Cerebras WSE-3 Architecture: cerebras.net