AI Hardware Accelerators Complete Guide: H100, TPU, Cerebras, and Edge AI Chips Compared

Introduction

As AI workloads diversify, the hardware accelerator market has exploded in variety. While NVIDIA GPUs remain dominant, purpose-built accelerators — Google TPU, Cerebras WSE-3, AWS Inferentia, Apple Neural Engine, and many others — are rapidly claiming their niches.

This guide systematically compares the architecture, performance characteristics, and use cases of major AI hardware accelerators. From selecting training GPUs to deploying models on edge chips, it covers the specifications and trade-offs that matter most for a hardware decision.

1. NVIDIA Hopper Architecture: H100 & H200

Hopper SM Structure

The NVIDIA H100 is built on the Hopper microarchitecture. Each Streaming Multiprocessor (SM) contains the following components.

4 warp schedulers: Schedule 4 warps (32 threads each) simultaneously
4th-generation Tensor Cores: Support FP8, FP16, BF16, TF32, and FP64
Shared memory: Up to 228KB per SM, configured out of a 256KB combined L1 cache/shared memory block
Register file: 65,536 32-bit registers per SM

Full H100 SXM5 specifications are as follows.

Specification	H100 SXM5	H200 SXM5
SM count	132	132
CUDA cores	16,896	16,896
Tensor Cores (4th gen)	528	528
FP8 TFLOPS (with sparsity)	3,958	3,958
BF16 TFLOPS (with sparsity)	1,979	1,979
Memory type	HBM3	HBM3e
Memory capacity	80GB	141GB
Memory bandwidth	3.35TB/s	4.8TB/s
TDP	700W	700W
NVLink bandwidth	900GB/s	900GB/s

4th-Gen Tensor Cores and Transformer Engine

The key innovation in H100 is the Transformer Engine. This engine supports FP8 computation while minimizing precision loss.

The operating principle: for each tensor in a transformer layer, the engine tracks a history of the absolute maximum value (amax) and derives a scaling factor from it. Values are multiplied by that factor before being cast to FP8, so they fill the narrow FP8 range without overflowing, which keeps the computation numerically stable.
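The scaling idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the Transformer Engine implementation: the `DelayedScaler` class, its `history_len` and `margin` parameters, and the sample values are hypothetical, and the sketch only scales and clips without rounding to the real FP8 grid.

```python
from collections import deque

import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


class DelayedScaler:
    """Hypothetical helper: keeps a short amax history for one tensor."""

    def __init__(self, history_len: int = 16, margin: int = 0):
        self.amax_history = deque(maxlen=history_len)
        self.margin = margin
        self.scale = 1.0

    def update(self, tensor: np.ndarray) -> None:
        # Record this step's absolute maximum, then derive the scale that
        # later steps will use (hence "delayed" scaling).
        self.amax_history.append(float(np.abs(tensor).max()))
        amax = max(self.amax_history)
        if amax > 0.0:
            self.scale = FP8_E4M3_MAX / amax / (2.0 ** self.margin)

    def quantize(self, tensor: np.ndarray) -> np.ndarray:
        # Scale into the FP8 range and clip. Real hardware would also round
        # to the FP8 grid, which this sketch does not model.
        return np.clip(tensor * self.scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)

    def dequantize(self, scaled: np.ndarray) -> np.ndarray:
        return scaled / self.scale


scaler = DelayedScaler()
activations = np.array([0.5, -2.0, 7.0], dtype=np.float32)
scaler.update(activations)
scaled = scaler.quantize(activations)
print(f"scale={scaler.scale}, largest scaled value={scaled.max()}")
```

With an amax of 7.0 the scale becomes 448 / 7 = 64, so the largest activation lands exactly on the top of the FP8 range.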

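# CUDA device properties query
import torch

def query_gpu_properties():
    if not torch.cuda.is_available():
        print("CUDA is not available.")
        return

    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}")
        print(f"  Compute Capability: {props.major}.{props.minor}")
        print(f"  Total Memory: {props.total_memory / 1024**3:.1f} GB")
        print(f"  Multiprocessors: {props.multi_processor_count}")

        # Only newer PyTorch releases expose these fields; read them
        # defensively so older releases do not raise AttributeError.
        max_threads = getattr(props, "max_threads_per_multi_processor", None)
        if max_threads is not None:
            print(f"  Max Threads/SM: {max_threads}")
        l2_bytes = getattr(props, "L2_cache_size", getattr(props, "l2_cache_size", None))
        if l2_bytes is not None:
            print(f"  L2 Cache Size: {l2_bytes / 1024**2:.1f} MB")

        # Map compute capability to architecture (9.0 = Hopper)
        capability = (props.major, props.minor)
        if capability == (9, 0):
            print("  Architecture: Hopper (H100/H200)")
        elif capability == (8, 9):
            print("  Architecture: Ada Lovelace")
        elif props.major == 8:
            print("  Architecture: Ampere")

query_gpu_properties()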

NVLink 4.0 and NVSwitch

High-speed communication between multiple GPUs is essential for large-scale model training. H100's NVLink 4.0 delivers 900GB/s bidirectional bandwidth per GPU.

NVLink 3.0 (A100): 600GB/s per GPU
NVLink 4.0 (H100): 900GB/s per GPU
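NVSwitch 3rd gen: 64 NVLink 4.0 ports per switch chip

In a DGX H100 system (8 GPUs), four NVSwitch chips connect every GPU to every other GPU at full NVLink speed, for 7.2TB/s of aggregate bidirectional bandwidth (8 x 900GB/s). Any-to-any GPU communication therefore has about 7x the bandwidth of a PCIe Gen5 x16 link (128GB/s bidirectional).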
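To see what the bandwidth gap means for training, here is a rough, bandwidth-only estimate of a ring all-reduce over 8 GPUs. The 7B-parameter BF16 gradient size and the PCIe Gen5 x16 figure of about 64GB/s per direction are assumptions for illustration; latency, protocol overhead, and overlap with compute are ignored.

```python
def ring_allreduce_seconds(num_gpus: int, payload_bytes: float, link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate: each GPU transfers 2*(N-1)/N of the payload."""
    return 2 * (num_gpus - 1) / num_gpus * payload_bytes / link_bytes_per_s


GRADIENT_BYTES = 7e9 * 2      # assumed: 7B parameters stored as BF16 gradients
NVLINK4_ONE_WAY = 450e9       # 900GB/s bidirectional -> 450GB/s per direction
PCIE5_X16_ONE_WAY = 64e9      # assumed: PCIe Gen5 x16, about 64GB/s per direction

nvlink_s = ring_allreduce_seconds(8, GRADIENT_BYTES, NVLINK4_ONE_WAY)
pcie_s = ring_allreduce_seconds(8, GRADIENT_BYTES, PCIE5_X16_ONE_WAY)
print(f"NVLink: {nvlink_s * 1e3:.1f} ms, PCIe: {pcie_s * 1e3:.1f} ms, "
      f"ratio: {pcie_s / nvlink_s:.1f}x")
```

Under these assumptions the ratio is simply 450 / 64, or about 7x, independent of the payload size.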

2. Google TPU: Systolic Array Architecture

The Heart of TPU: Systolic Array

A TPU (Tensor Processing Unit) is an ASIC specialized for matrix multiplication. The core compute unit, the systolic array, is a structure where data flows through in waves (systolic) while computation occurs.

The MXU (Matrix Multiply Unit) in TPU v4 uses a 128x128 systolic array. Each cell receives inputs from previous cells, performs a MAC (Multiply-Accumulate) operation, and passes results to the next cell.

The advantages of this structure are as follows.

Minimizes memory accesses: data is reused as it passes through the array
High arithmetic intensity: more operations per data element
Deterministic execution: predictable latency
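The data flow can be made concrete with a small cycle-by-cycle simulation. This is a textbook output-stationary model written for illustration, with a hypothetical `systolic_matmul` helper; production MXUs may use a different dataflow (for example, keeping weights stationary), so treat it as a sketch of the principle rather than of the TPU itself.

```python
import numpy as np


def systolic_matmul(a: np.ndarray, b: np.ndarray):
    """Simulate an m x n grid of MAC cells computing a @ b cycle by cycle."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    acc = np.zeros((m, n), dtype=np.float64)  # one accumulator per cell
    cycles = m + n + k - 2                    # time for the last wavefront to drain
    for t in range(cycles):
        for i in range(m):
            for j in range(n):
                # Inputs are skewed: row i of A is delayed by i cycles and
                # column j of B by j cycles, so cell (i, j) sees element
                # `step` of both streams at cycle t.
                step = t - i - j
                if 0 <= step < k:
                    acc[i, j] += a[i, step] * b[step, j]
    return acc, cycles


rng = np.random.default_rng(0)
a = rng.integers(-3, 4, size=(4, 4)).astype(np.float64)
b = rng.integers(-3, 4, size=(4, 4)).astype(np.float64)
result, cycles = systolic_matmul(a, b)
print(f"matches a @ b: {np.allclose(result, a @ b)}, cycles: {cycles}")
```

Each element of `a` and `b` enters the grid once and is consumed by every cell along its row or column, which is where the data reuse comes from.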

TPU v4 vs v5e Comparison

Specification	TPU v4	TPU v5e
BF16 TFLOPS	275	197
INT8 TOPS	275	394
HBM capacity	32GB	16GB
HBM bandwidth	1,200GB/s	819GB/s
ICI bandwidth	2,400Gbps/chip	1,600Gbps/chip
Power consumption	~170W	~90W
Cost efficiency	Training-optimized	Inference-optimized

TPU v5e is optimized for power efficiency and is particularly economical for inference workloads.

TPU Pod and ICI

A TPU Pod is a cluster of thousands of TPU chips connected via high-speed ICI (Inter-Chip Interconnect). ICI uses direct chip-to-chip connections instead of data center networks, dramatically reducing latency.

TPU v4 Pod: 4,096 chips, over 1 exaFLOPS (BF16)
ICI topology: 3D torus mesh
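The pod figures can be checked with simple arithmetic, and the torus wiring is easy to model. In the sketch below the 16 x 16 x 16 layout is an assumption chosen so that the dimensions multiply to 4,096; the helper names are hypothetical.

```python
CHIPS = 4096
BF16_TFLOPS_PER_CHIP = 275

# Aggregate peak: TFLOPS -> exaFLOPS is a factor of 1e6
pod_exaflops = CHIPS * BF16_TFLOPS_PER_CHIP / 1e6
print(f"Pod peak: {pod_exaflops:.4f} exaFLOPS (BF16)")

DIMS = (16, 16, 16)  # assumed torus shape; 16 * 16 * 16 = 4096


def torus_neighbors(coord, dims=DIMS):
    """Each chip links to 6 neighbors: +/-1 along every axis, with wraparound."""
    neighbors = []
    for axis in range(3):
        for delta in (-1, 1):
            nxt = list(coord)
            nxt[axis] = (nxt[axis] + delta) % dims[axis]
            neighbors.append(tuple(nxt))
    return neighbors


def torus_hops(src, dst, dims=DIMS):
    """Shortest hop count: per axis, go whichever way around the ring is shorter."""
    return sum(min(abs(s - d), size - abs(s - d)) for s, d, size in zip(src, dst, dims))


print(torus_neighbors((0, 0, 0)))
print(torus_hops((0, 0, 0), (15, 15, 15)), torus_hops((0, 0, 0), (8, 8, 8)))
```

The wraparound links are what distinguish a torus from a plain mesh: the opposite corner of the cube is only 3 hops away, and the worst case in this layout is 24 hops.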

Using TPU with JAX/XLA

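# JAX on TPU basic example
import jax
import jax.numpy as jnp
import numpy as np
from jax import random
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Check available devices
devices = np.array(jax.devices())
print(f"Available devices: {devices}")

# Build a 2D ('batch', 'model') mesh; use a 4-way model axis when the
# device count allows it (e.g. 2 x 4 on an 8-chip slice)
model_axis = 4 if devices.size % 4 == 0 else 1
mesh = Mesh(devices.reshape(-1, model_axis), ('batch', 'model'))

def matrix_multiply_tpu(a, b):
    # XLA lowers the dot product onto the TPU's matrix units
    return jnp.dot(a, b)

# Apply XLA optimization with jit compilation
compiled_matmul = jax.jit(matrix_multiply_tpu)

# Use independent keys so a and b are different random matrices
key_a, key_b = random.split(random.PRNGKey(0))
a = random.normal(key_a, (4096, 4096), dtype=jnp.bfloat16)
b = random.normal(key_b, (4096, 4096), dtype=jnp.bfloat16)

# Shard rows of a over 'batch' and columns of b over 'model'
# (assumes the axis sizes divide 4096, as they do for power-of-two slices)
a = jax.device_put(a, NamedSharding(mesh, PartitionSpec('batch', None)))
b = jax.device_put(b, NamedSharding(mesh, PartitionSpec(None, 'model')))

result = compiled_matmul(a, b)
print(f"Result shape: {result.shape}, dtype: {result.dtype}")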

3. AI ASICs: Purpose-Built Accelerators

Cerebras WSE-3: Wafer Scale Engine

The Cerebras WSE-3 (Wafer Scale Engine 3) is a groundbreaking design that uses an entire silicon wafer as a single chip.

Specification	WSE-3
Die size	46,225 mm² (full wafer)
AI cores	900,000
On-chip SRAM	44GB
Memory bandwidth	21PB/s (on-chip)
FP16 performance	125 PFLOPS
Fabric bandwidth	214Pb/s

The key advantage is that communication between cores never leaves the wafer. In conventional GPU clusters, hundreds of GPUs are connected via NVLink or a network, and every hop adds communication overhead. In WSE-3, all cores are connected by an on-chip fabric on a single wafer, with latency in the nanosecond range. Scaling beyond one wafer still requires an external interconnect between CS-3 systems.

Cerebras claims that a single CS-3 system can replace up to 24 server racks of GPU clusters for large model training.

Graphcore IPU

Graphcore's IPU (Intelligence Processing Unit) uses the Bulk Synchronous Parallel (BSP) execution model.

MK2 GC200: 1,472 IPU tiles, each running 6 hardware threads (8,832 threads in total)
On-chip memory: 900MB (SRAM)
Bandwidth: 47.5TB/s (on-chip memory)
Strengths: Optimized for sparse operations, excellent for graph neural networks

Graphcore positions the IPU for irregular, fine-grained computations such as graph neural networks (GNNs) and reinforcement learning, where GPUs tend to be underutilized.

Groq LPU

The Groq LPU (Language Processing Unit) is an ASIC specialized for LLM inference, characterized by a deterministic execution architecture.

Software-defined memory: No dynamic memory management at runtime
SIMD streaming: All memory access patterns determined at compile time
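Cycle-level scheduling: every operation runs at a known clock cycle, so latency is predictable

As a result, Groq achieves over 240 tokens per second for LLaMA-3 70B inference, which can be up to about 10x faster than a GPU serving a single request, depending on the GPU baseline, batch size, and serving stack.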

SambaNova DataScale

SambaNova's RDU (Reconfigurable Dataflow Unit) adopts a dataflow architecture.

Keeps intermediate results in on-chip SRAM by mapping the model's dataflow graph spatially onto the chip
Reduces off-chip DRAM access, easing memory bottlenecks
Positioned for inference on very large (GPT-4-class) models

4. Inference-Only Chips

AWS Inferentia 2

AWS's proprietary inference chip, designed in-house. Together with Trainium, it forms the core of AWS's AI hardware strategy.

Specification	Inferentia 1	Inferentia 2
NeuronCore count	4	2 (enhanced design)
Peak compute	128 TOPS (INT8), 64 TFLOPS (FP16)	380 TOPS (INT8), 190 TFLOPS (FP16)
Memory	8GB	32GB HBM
Memory bandwidth	50GB/s	820GB/s
NeuronLink bandwidth	—	384GB/s
Price (per hour)	inf1.xlarge ~$0.228	inf2.xlarge ~$0.758

Inferentia 2 runs PyTorch, TensorFlow, and JAX models through the AWS Neuron SDK, which compiles them ahead of time for the NeuronCores.

Intel Gaudi 3

Intel Gaudi 3, designed by Habana Labs (acquired by Intel), directly competes with the H100.

Specification	Gaudi 3	H100 SXM5
BF16 TFLOPS	1,835 (dense)	1,979 (with sparsity; 989 dense)
FP8 TFLOPS	1,835 (dense)	3,958 (with sparsity; 1,979 dense)
HBM capacity	128GB HBM2e	80GB HBM3
HBM bandwidth	3.7TB/s	3.35TB/s
Networking	24x 200GbE RoCE	NVLink 4.0
TDP	900W	700W

In terms of cost efficiency, Gaudi 3 cloud instances are reported to be approximately 30% cheaper than comparable H100 instances, though pricing varies by provider.

Qualcomm Cloud AI 100

Qualcomm's data center inference chip, with power efficiency as its strength.

AI 100 Ultra: 870 TOPS (INT8), 150W
On-chip memory: 576MB SRAM (four SoCs with 144MB each)
Memory bandwidth: 548GB/s (128GB LPDDR4x)
Up to 8 cards per server supported

5. Edge AI Chips

Apple Neural Engine (ANE)

Apple Silicon's Neural Engine is a dedicated AI accelerator built into iPhone, iPad, and Mac devices.

Chip	ANE Performance	Release Year
A15 Bionic	15.8 TOPS	2021
A16 Bionic	17 TOPS	2022
A17 Pro	35 TOPS	2023
M4	38 TOPS	2024

The ANE is accessible through the Core ML framework and can deliver up to 10x better power efficiency than the CPU for model inference, depending on the model and how many of its operators the ANE supports.

# Deploy edge AI with Apple Core ML
import coremltools as ct
import torch
import torchvision

# Load pretrained weights (the `pretrained=True` flag is deprecated in torchvision)
weights = torchvision.models.MobileNet_V3_Small_Weights.DEFAULT
model = torchvision.models.mobilenet_v3_small(weights=weights)
model.eval()

# Trace with example input
example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)

# Core ML conversion
mlmodel = ct.convert(
    traced_model,
    inputs=[ct.ImageType(
        name="input",
        shape=example_input.shape,
        color_layout=ct.colorlayout.RGB
    )],
    compute_units=ct.ComputeUnit.ALL,  # Core ML picks among ANE, GPU, and CPU at runtime
    minimum_deployment_target=ct.target.iOS17,
)

mlmodel.save("mobilenet_v3_small.mlpackage")
print("Core ML model saved; Core ML decides at runtime which layers run on the Neural Engine")

Qualcomm Hexagon DSP

The Hexagon DSP embedded in Qualcomm Snapdragon is the heart of smartphone AI processing.

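Hexagon NPU (Snapdragon 8 Gen 3): Qualcomm quotes up to 98% faster AI performance than the previous generation rather than an official TOPS figure
HVX (Hexagon Vector eXtensions): SIMD vector operations
HTA (Hexagon Tensor Accelerator): Dedicated matrix and tensor acceleration for neural networks

TensorFlow/PyTorch models can be deployed to Hexagon via the Qualcomm Neural Processing SDK (SNPE) after conversion to its DLC format.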

Raspberry Pi AI HAT+

The Raspberry Pi AI HAT+ is an edge AI accelerator for the Raspberry Pi 5 featuring the Hailo-8L chip.

Hailo-8L: 13 TOPS
Connects to RPi 5 via its PCIe interface (the earlier AI Kit used an M.2 HAT+ carrier)
Price: ~$70
Use cases: real-time video analysis, object detection

6. Memory Technology: HBM3e vs GDDR7

HBM (High Bandwidth Memory) Architecture

HBM is a memory technology that stacks DRAM dies vertically (3D stacking) and connects them to the GPU through a silicon interposer.

Memory	Bandwidth (example GPU)	Capacity	Host GPU TDP	Interface width	Example GPU
HBM2e	2.0TB/s	up to 80GB	~400W	1,024-bit per stack	A100
HBM3	3.35TB/s	up to 80GB	~700W	1,024-bit per stack	H100
HBM3e	4.8TB/s	up to 141GB	~700W	1,024-bit per stack	H200
GDDR6X	1.0TB/s	up to 24GB	~450W	384-bit total	RTX 4090
GDDR7	1.8TB/s	up to 32GB	~575W	512-bit total	RTX 5090

There are three primary reasons HBM is advantageous for AI training.

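Bandwidth: HBM3e on the H200 delivers 4.8TB/s, roughly 2.7x the 1.8TB/s of the RTX 5090's GDDR7, which eases memory bottlenecks during large-batch training.
Capacity: 80–141GB on a single GPU lets a 70B-parameter model fit on one device for inference with 8-bit weights (about 70GB); at FP16 (140GB) it still needs two 80GB GPUs.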
Energy efficiency: Lower power consumption per byte than GDDR improves TCO.
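Why bandwidth matters so much for inference can be shown with a back-of-the-envelope bound. At batch size 1, generating each token requires reading every weight once, so memory bandwidth divided by model size gives a ceiling on tokens per second. The sketch below is a simplification under that assumption; it ignores KV cache traffic, compute time, and multi-GPU effects, and the function name is hypothetical.

```python
def decode_tokens_per_second(params_billion: float, bytes_per_param: float,
                             bandwidth_gb_s: float) -> float:
    """Upper bound for batch-1 decoding: one full pass over the weights per token."""
    weight_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / weight_gb


# 70B model with 8-bit weights (70GB), using the bandwidth figures from the table
h100 = decode_tokens_per_second(70, 1, 3350)  # H100, HBM3 at 3.35TB/s
h200 = decode_tokens_per_second(70, 1, 4800)  # H200, HBM3e at 4.8TB/s
print(f"H100: {h100:.1f} tok/s, H200: {h200:.1f} tok/s, gain: {h200 / h100:.2f}x")
```

The H200 has the same peak compute as the H100, yet this bound improves by the full bandwidth ratio (about 1.43x), which is why memory-bound decoding benefits from HBM3e.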

Near-Memory Computing

Near-memory computing and Processing-in-Memory (PIM) place compute units next to or inside the memory itself. Samsung HBM-PIM and SK Hynix AiM (Accelerator in Memory) are representative examples.

Minimizes data movement between memory and compute units
Relieves memory bandwidth bottlenecks at the source
Especially effective for memory-bound operations during inference

CXL (Compute Express Link)

CXL is a next-generation interconnect standard that connects CPUs, accelerators, and memory expansion devices over a PCIe physical layer.

CXL 1.1: Type 1 (accelerator), Type 2 (accelerator + memory), Type 3 (memory expansion)
CXL 2.0: Multi-host sharing with switching support
CXL 3.0: P2P communication, fabric support

AI servers are increasingly using CXL Type 3 memory expansion to add host-attached capacity when GPU VRAM runs short.

7. Hardware Selection Guide

Training vs Inference

Optimal hardware differs by workload type.

Large-scale training (Pre-training)

Best: H100 SXM5 (NVLink required), TPU v4 Pod
Reason: High MFU (Model FLOP Utilization), fast collective communication via NVLink/ICI
Batch size: As large as possible (global batch of millions of tokens)
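MFU can be estimated from training throughput. The sketch below uses the common approximation of about 6 FLOPs per parameter per token for a forward and backward pass. The model size, throughput, and GPU count are made-up inputs, and the peak value assumes the H100's dense BF16 rate of 989 TFLOPS rather than the sparsity-enabled 1,979 figure.

```python
def model_flops_utilization(params: float, tokens_per_second: float,
                            num_gpus: int, peak_flops_per_gpu: float) -> float:
    """MFU = achieved training FLOP/s divided by aggregate peak FLOP/s."""
    achieved = 6 * params * tokens_per_second  # ~6 FLOPs per parameter per token
    return achieved / (num_gpus * peak_flops_per_gpu)


H100_DENSE_BF16 = 989e12  # assumed peak: dense BF16, without sparsity

# Hypothetical run: 7B-parameter model, 600k tokens/s across 64 GPUs
mfu = model_flops_utilization(7e9, 600_000, 64, H100_DENSE_BF16)
print(f"MFU: {mfu:.1%}")
```

Using the sparsity-enabled peak in the denominator would halve the reported MFU, so it is worth stating which peak a quoted MFU number is based on.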

Fine-tuning

Best: H100/A100, AMD MI300X, Gaudi 3
Reason: Mid-scale GPU clusters, cost efficiency
Batch size: Medium (512–4096 tokens)

Large-scale inference (Serving, high throughput)

Best: H100, Inferentia 2, Gaudi 3
Reason: Large KV cache capacity, high throughput
Batch size: Dynamic (continuous batching)
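KV cache size is what usually limits batch size in serving, and it can be estimated directly. The sketch below assumes a Llama-3-70B-like configuration (80 layers, 8 KV heads via grouped-query attention, head dimension 128) with a 16-bit cache; these shapes are assumptions for illustration, and the function name is hypothetical.

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 seq_len: int, batch: int, bytes_per_value: int = 2) -> float:
    """Size of the key/value cache in GiB."""
    # Factor 2: one key tensor plus one value tensor per layer
    total_bytes = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value
    return total_bytes / 1024**3


per_sequence = kv_cache_gib(80, 8, 128, seq_len=8192, batch=1)
batch_of_16 = kv_cache_gib(80, 8, 128, seq_len=8192, batch=16)
print(f"per 8k sequence: {per_sequence} GiB, batch of 16: {batch_of_16} GiB")
```

At 2.5 GiB per 8k-token sequence, sixteen concurrent requests need 40 GiB for the cache alone, on top of the model weights.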

Low-latency inference (Latency-critical)

Best: Groq LPU, Cerebras CS-3
Reason: Deterministic execution, weights served from on-chip SRAM
Batch size: Small (1–8)

VRAM Requirements by Model Size (Inference)

Model Size	Parameters	FP16 VRAM	Minimum GPU (BF16)
Small	7B	14GB	1x A10G (24GB)
Medium	13B	26GB	1x A100 (40GB)
Large	34B	68GB	2x A100 (80GB)
XL	70B	140GB	2x H100 (80GB)
XXL	405B	810GB	11x H100 (80GB); 16x (two 8-GPU nodes) in practice
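The FP16 column is just parameters times 2 bytes, and the GPU count follows by dividing by per-GPU memory and rounding up. The helper names below are hypothetical, and the count covers weights only unless an `overhead` fraction is supplied for KV cache and activations.

```python
import math


def weight_memory_gb(params_billion: float, bytes_per_param: float = 2) -> float:
    """Weights-only footprint in GB (FP16/BF16 = 2 bytes per parameter)."""
    return params_billion * bytes_per_param


def min_gpus(params_billion: float, gpu_memory_gb: float,
             bytes_per_param: float = 2, overhead: float = 0.0) -> int:
    """Smallest GPU count whose combined memory holds the weights plus overhead."""
    needed = weight_memory_gb(params_billion, bytes_per_param) * (1 + overhead)
    return math.ceil(needed / gpu_memory_gb)


for params, gpu_gb in [(7, 24), (13, 40), (34, 80), (70, 80), (405, 80)]:
    print(f"{params}B: {weight_memory_gb(params):.0f}GB weights -> "
          f"{min_gpus(params, gpu_gb)} x {gpu_gb}GB GPU(s)")
```

Note that 405B at FP16 needs 810GB, which is more than ten 80GB GPUs provide, and that adding 20% overhead pushes the 70B model from 2 to 3 GPUs.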

PyTorch Device Selection and Benchmarking

# PyTorch device selection and benchmarking
import time

import torch

def synchronize(device: torch.device) -> None:
    """Wait for queued kernels so wall-clock timing is meaningful"""
    if device.type == 'cuda':
        torch.cuda.synchronize()
    elif device.type == 'mps':
        torch.mps.synchronize()

def benchmark_matmul(device_name: str, size: int = 4096, dtype=torch.float16, iters: int = 100):
    """Matrix multiplication benchmark"""
    device = torch.device(device_name)

    a = torch.randn(size, size, dtype=dtype, device=device)
    b = torch.randn(size, size, dtype=dtype, device=device)

    # Warm-up
    for _ in range(5):
        torch.matmul(a, b)
    synchronize(device)

    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    synchronize(device)
    elapsed = time.perf_counter() - start

    flops = 2 * size ** 3 * iters  # one multiply and one add per inner-product term
    tflops = flops / elapsed / 1e12
    print(f"{device_name} ({dtype}): {tflops:.2f} TFLOPS ({elapsed * 1000 / iters:.2f} ms/iter)")

# Benchmark every available backend
if torch.cuda.is_available():
    benchmark_matmul("cuda:0", dtype=torch.float16)
    benchmark_matmul("cuda:0", dtype=torch.bfloat16)

if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    benchmark_matmul("mps", dtype=torch.float16)

# CPU matmul is far slower, so run fewer iterations
benchmark_matmul("cpu", dtype=torch.float32, iters=10)

torch.compile for Hardware Optimization

# Hardware optimization with torch.compile
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=1024, nhead=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ff(x))
        return x

model = TransformerBlock().cuda().to(torch.bfloat16).eval()

# torch.compile: TorchInductor generates fused Triton kernels for the block;
# "max-autotune" also benchmarks alternative kernel configurations
compiled_model = torch.compile(model, mode="max-autotune")

x = torch.randn(8, 512, 1024, dtype=torch.bfloat16, device="cuda")

# The first call triggers compilation and autotuning, which can take minutes.
# Model and input are already BF16, so no autocast context is needed.
with torch.no_grad():
    out = compiled_model(x)

print(f"Output shape: {out.shape}")

Cost Efficiency Analysis (Cloud hourly prices, 2025)

Instance	GPU	Hourly Price	Peak TFLOPS (BF16)	$ per PFLOPS-hour
p4d.24xlarge	8x A100 40GB	$32.77	8 x 312 = 2,496	$13.1
p4de.24xlarge	8x A100 80GB	$40.96	8 x 312 = 2,496	$16.4
p5.48xlarge	8x H100 80GB	$98.32	8 x 1,979 = 15,832	$6.2
trn1.32xlarge	16x Trainium	$21.50	16 x 420 = 6,720	$3.2
inf2.48xlarge	12x Inferentia2	$12.98	12 x 384 = 4,608	$2.8
g6.48xlarge	8x L40S 48GB	$16.29	8 x 733 = 5,864	$2.8

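For inference workloads, Inferentia 2 and Trainium rank among the most cost-efficient options by this measure, alongside the L40S instance. Treat the column as a rough guide: the per-chip values are vendor peak figures that mix dense and sparsity-enabled numbers (the H100 value of 1,979 TFLOPS assumes structured sparsity), and real inference cost also depends on utilization and memory bandwidth.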
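The last column of the table can be reproduced by dividing the hourly price by the aggregate peak TFLOPS and scaling by 1,000, which makes it dollars per PFLOPS-hour of peak compute. The snippet below recomputes it from the table's own prices and peak figures, taken as given.

```python
# (hourly price in USD, aggregate peak TFLOPS) as listed in the table
instances = {
    "p4d.24xlarge": (32.77, 8 * 312),
    "p4de.24xlarge": (40.96, 8 * 312),
    "p5.48xlarge": (98.32, 8 * 1979),
    "trn1.32xlarge": (21.50, 16 * 420),
    "inf2.48xlarge": (12.98, 12 * 384),
    "g6.48xlarge": (16.29, 8 * 733),
}

# Dollars per PFLOPS-hour = price / TFLOPS * 1000
cost_per_pflops_hour = {
    name: round(price / tflops * 1000, 1)
    for name, (price, tflops) in instances.items()
}

for name, cost in sorted(cost_per_pflops_hour.items(), key=lambda item: item[1]):
    print(f"{name:15s} ${cost:.1f} per PFLOPS-hour")
```

Because the metric uses peak rather than achieved throughput, multiplying by a measured utilization for each platform would give a more realistic comparison.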

8. Comprehensive Hardware Comparison

Accelerator	Type	Peak compute (BF16 TFLOPS unless noted)	Memory	Bandwidth	TDP	Primary Use
H100 SXM5	GPU	1,979 (with sparsity)	80GB HBM3	3.35TB/s	700W	Training/Inference
H200 SXM5	GPU	1,979 (with sparsity)	141GB HBM3e	4.8TB/s	700W	Large model inference
A100 SXM4	GPU	312	80GB HBM2e	2.0TB/s	400W	General purpose
AMD MI300X	GPU	1,307	192GB HBM3	5.3TB/s	750W	Large models
TPU v5e	ASIC	197 (INT8: 394 TOPS)	16GB HBM	819GB/s	90W	Large-scale inference
Cerebras WSE-3	ASIC	125,000 (FP16)	44GB SRAM	21PB/s	23kW/system	Ultra-large training
Groq LPU	ASIC	750 TOPS (INT8)	230MB SRAM	80TB/s	300W	Low-latency inference
Gaudi 3	ASIC	1,835	128GB HBM2e	3.7TB/s	900W	Cost-efficient training
Inferentia 2	ASIC	190	32GB HBM	820GB/s	75W	Cloud inference
Apple M4 ANE	Edge	38 TOPS	Shared	Shared	~10W	On-device
Hailo-8L	Edge	13 TOPS	n/a	n/a	1W	Embedded

Quiz

Q1. How does NVIDIA H100's Transformer Engine maintain precision during FP8 training?

Answer: Per-tensor scaling derived from amax history (delayed scaling), combined with higher-precision accumulation

Explanation: The Transformer Engine tracks the absolute maximum (amax) of activations, weights, and gradients per layer. From a short history of these values it computes the scale factor used for FP8 quantization in the next iteration, which is why the recipe is called delayed scaling. The matrix multiplications in both the forward and backward pass run in FP8, while accumulation inside the Tensor Cores and the master weights stay in higher precision (FP32/BF16). Because the scale follows the observed numerical range, values are kept clear of overflow and underflow. This gives most of FP8's speed benefit while keeping training stability close to BF16.

Q2. How does Google TPU's systolic array parallelize matrix multiplication?

Answer: Pipeline-style MAC operation array with data reuse

Explanation: A systolic array consists of NxN MAC (Multiply-Accumulate) units arranged in a grid. Row data from matrix A flows left to right, and column data from matrix B flows top to bottom. Each cell multiplies the two values passing through it and adds the product to a running partial sum. Because data pulses through the array in lockstep, like a heartbeat (systole), each data element is fetched once and then reused by every cell it passes. TPU v4's 128x128 MXU performs 128x128 = 16,384 MAC operations per clock cycle, with operands handed directly from cell to cell on-chip instead of being re-read from memory.
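The data flow described above can be simulated cycle by cycle. The sketch below models an output-stationary array, where each cell keeps its own running sum; this is an assumption made to match the description, and real MXUs differ in details such as which operand stays resident in the cells. The function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(A, B):
    """Multiply A (n x k) by B (k x m) on a simulated n x m systolic array.

    Returns the product and the number of clock cycles used.
    """
    n, k_dim, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]    # running sum held in each cell
    a_reg = [[0] * m for _ in range(n)]  # A value currently inside each cell
    b_reg = [[0] * m for _ in range(n)]  # B value currently inside each cell
    cycles = n + m + k_dim - 2           # time for the last operands to reach the far corner

    for t in range(cycles):
        # A values move one cell to the right; row i is fed with a delay of i cycles.
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < k_dim else 0
        # B values move one cell down; column j is fed with a delay of j cycles.
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < k_dim else 0
        # Every cell performs one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


if __name__ == "__main__":
    product, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(product, cycles)
```

The skewed feeding (row i delayed by i cycles, column j by j cycles) is what makes matching elements of A and B meet in the right cell at the right time.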

Q3. Why is HBM better than GDDR for AI training (bandwidth vs capacity)?

Answer: HBM holds advantages in both bandwidth and capacity

Explanation: On the bandwidth side, the H200's HBM3e delivers 4.8TB/s while the RTX 5090's GDDR7 delivers about 1.8TB/s, roughly a 2.7x difference. Many operations in AI training are bound by memory bandwidth rather than compute, so this difference translates directly into performance. On the capacity side, the H200's 141GB of HBM3e is more than 4x the RTX 5090's 32GB of GDDR7, enough to hold the weights of a 70B-parameter model on a single GPU (about 70GB at one byte per parameter). Structurally, HBM stacks DRAM dies vertically and connects each stack to the GPU through a silicon interposer over a very wide interface (1,024 data bits per stack), achieving high bandwidth and good energy efficiency at the same time.
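A quick back-of-the-envelope check makes both effects concrete. The sketch below uses the H200 figures from the explanation (4.8TB/s, 141GB) and the 32GB capacity of the RTX 5090; the function names are hypothetical, and the throughput figure is only an upper bound for a pass that reads every weight once (for example, generating one token at batch size 1), ignoring activations, KV cache, and compute time.

```python
def weight_bytes(params_billion, bytes_per_param):
    """Size of the model weights in bytes."""
    return params_billion * 1e9 * bytes_per_param


def fits_in_memory(params_billion, bytes_per_param, capacity_gb):
    """True if the weights alone fit in the given memory capacity."""
    return weight_bytes(params_billion, bytes_per_param) <= capacity_gb * 1e9


def max_weight_passes_per_second(params_billion, bytes_per_param, bandwidth_tb_s):
    """Upper bound on full passes over the weights per second (memory-bound case)."""
    return bandwidth_tb_s * 1e12 / weight_bytes(params_billion, bytes_per_param)


if __name__ == "__main__":
    # 70B parameters stored in FP8 (1 byte per parameter)
    print("fits in 141GB:", fits_in_memory(70, 1, 141))
    print("fits in 32GB: ", fits_in_memory(70, 1, 32))
    print("passes/s at 4.8TB/s:", round(max_weight_passes_per_second(70, 1, 4.8), 1))
```

Halving the bandwidth halves this bound regardless of how many TFLOPS the chip offers, which is why bandwidth matters so much for memory-bound workloads.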

Q4. How does Cerebras WSE-3's wafer-scale integration eliminate inter-chip communication bottlenecks?

Answer: All cores connected through an on-chip fabric on a single wafer

Explanation: In conventional GPU clusters, hundreds of chips are connected via NVLink, InfiniBand, and similar networks. This inter-chip communication has latency in the microsecond range and limited bandwidth. WSE-3's 900,000 AI cores all reside on a single wafer, so all inter-core communication flows through the on-chip fabric, whose latency is in the nanosecond range and whose aggregate bandwidth reaches 214Pb/s (petabits per second). Additionally, 44GB of SRAM is distributed next to the cores, minimizing memory access latency. This enables near-linear scaling for large model training with almost no communication overhead.

Q5. What architectural decisions allow Groq LPU to achieve lower latency than GPUs for LLM inference?

Answer: Deterministic memory scheduling at compile time

Explanation: The primary causes of high LLM inference latency on GPUs are irregular memory access patterns and runtime dynamic scheduling. The Groq LPU statically determines all tensor memory locations and movement paths at compile time. There is no memory allocation/deallocation or scheduler overhead during execution. The SRAM-based memory architecture also eliminates the irregular access latency of DRAM. All operations execute at predetermined clock cycles, making latency fully predictable. Thanks to this deterministic execution, Groq achieves over 240 tokens per second throughput and very low time-to-first-token (TTFT) latency for LLaMA-3 70B.
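The idea of fixing every start cycle before execution can be illustrated with a toy static scheduler. This is a sketch of the general technique only, not Groq's compiler: the `Op` fields, the unit names, and the cycle counts are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    unit: str         # hypothetical functional unit, e.g. "matmul" or "vector"
    cycles: int       # fixed duration, known at compile time
    deps: tuple = ()  # names of ops that must finish first


def compile_schedule(ops):
    """Assign each op a fixed start cycle; ops must be listed in dependency order."""
    finish = {}     # op name -> cycle at which its result is ready
    unit_free = {}  # unit name -> first cycle at which the unit is idle
    schedule = {}
    for op in ops:
        ready = max((finish[d] for d in op.deps), default=0)
        start = max(ready, unit_free.get(op.unit, 0))
        schedule[op.name] = start
        finish[op.name] = start + op.cycles
        unit_free[op.unit] = finish[op.name]
    total_cycles = max(finish.values(), default=0)
    return schedule, total_cycles


PROGRAM = [
    Op("load_w", "mem", 4),
    Op("load_v", "mem", 4),
    Op("matmul_qk", "matmul", 8, deps=("load_w",)),
    Op("softmax", "vector", 3, deps=("matmul_qk",)),
    Op("matmul_v", "matmul", 8, deps=("softmax", "load_v")),
]

if __name__ == "__main__":
    schedule, total = compile_schedule(PROGRAM)
    for name, start in schedule.items():
        print(f"cycle {start:3d}: {name}")
    print("total latency in cycles:", total)
```

Because every duration is fixed and nothing is decided at run time, the total latency is known before the program runs and is identical on every execution.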

Conclusion

The AI hardware accelerator market is diversifying rapidly between 2024 and 2026. While NVIDIA H100/H200 remain the gold standard for training workloads, purpose-optimized accelerators demonstrate advantages in specific use cases.

The core selection principles are as follows.

Training: Memory bandwidth and a fast interconnect such as NVLink are critical (H100 SXM5, TPU v4 Pod)
High-throughput inference: Cost efficiency matters (Inferentia 2, Gaudi 3, TPU v5e)
Low-latency inference: Deterministic execution (Groq LPU)
Edge deployment: Power efficiency (Apple ANE, Qualcomm Hexagon)
Ultra-large training: No inter-chip bottleneck (Cerebras WSE-3)

Hardware selection ultimately balances workload characteristics, budget, and ecosystem maturity. The maturity of the NVIDIA ecosystem remains a powerful advantage, but purpose-built ASICs can be far more economical for specific workloads.