AI Hardware Accelerators Complete Guide: H100, TPU, Cerebras, and Edge AI Chips Compared

Author: Youngju Kim (@fjvbn20031)
Introduction
As AI workloads diversify, the hardware accelerator market has exploded in variety. While NVIDIA GPUs remain dominant, purpose-built accelerators — Google TPU, Cerebras WSE-3, AWS Inferentia, Apple Neural Engine, and many others — are rapidly claiming their niches.
This guide systematically compares the architecture, performance characteristics, and use cases of major AI hardware accelerators. From selecting training GPUs to deploying models on edge chips, everything you need to make the right hardware decision is covered here.
1. NVIDIA Hopper Architecture: H100 & H200
Hopper SM Structure
The NVIDIA H100 is built on the Hopper microarchitecture. Each Streaming Multiprocessor (SM) contains the following components.
- 4 warp schedulers: Schedule 4 warps (32 threads each) simultaneously
- 4th-generation Tensor Cores: Support FP8, FP16, BF16, TF32, and FP64
- Shared memory: Up to 228KB per SM (including L1 cache)
- Register file: 65,536 32-bit registers per SM
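These per-SM limits determine how many warps an SM can keep resident at once. A simplified occupancy estimate, considering register pressure only (real occupancy also depends on shared-memory usage and block size; the 64-warp cap is Hopper's per-SM hardware limit):

```python
def max_resident_warps(regs_per_thread: int,
                       regs_per_sm: int = 65536,
                       threads_per_warp: int = 32,
                       hw_warp_limit: int = 64) -> int:
    """Estimate resident warps per SM from register pressure alone.
    Ignores shared-memory and block-size constraints (simplification)."""
    warps_by_registers = regs_per_sm // (regs_per_thread * threads_per_warp)
    return min(warps_by_registers, hw_warp_limit)

# A kernel using 32 registers/thread can fill the SM with warps;
# a register-heavy kernel at 128 registers/thread cannot
print(max_resident_warps(32))
print(max_resident_warps(128))
```

This is why compilers sometimes spill registers deliberately: trading a few extra memory accesses for more resident warps can raise overall throughput.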
Full H100 SXM5 specifications are as follows.
| Specification | H100 SXM5 | H200 SXM5 |
|---|---|---|
| SM count | 132 | 132 |
| CUDA cores | 16,896 | 16,896 |
| Tensor Cores (4th gen) | 528 | 528 |
| FP8 TFLOPS | 3,958 | 3,958 |
| BF16 TFLOPS | 1,979 | 1,979 |
| Memory type | HBM3 | HBM3e |
| Memory capacity | 80GB | 141GB |
| Memory bandwidth | 3.35TB/s | 4.8TB/s |
| TDP | 700W | 700W |
| NVLink bandwidth | 900GB/s | 900GB/s |
4th-Gen Tensor Cores and Transformer Engine
The key innovation in H100 is the Transformer Engine. This engine supports FP8 computation while minimizing precision loss.
The operating principle: per transformer layer, the engine tracks the maximum absolute value (amax) of activations and weights, and computes a dynamic scaling factor from a short history of these statistics. FP8 arithmetic is used while the scaling keeps values inside FP8's narrow representable range, maintaining numerical stability.
```python
# CUDA device properties query
import torch

def query_gpu_properties():
    if not torch.cuda.is_available():
        print("CUDA is not available.")
        return
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}")
        print(f"  Compute Capability: {props.major}.{props.minor}")
        print(f"  Total Memory: {props.total_memory / 1024**3:.1f} GB")
        print(f"  Multiprocessors: {props.multi_processor_count}")
        print(f"  Max Threads/SM: {props.max_threads_per_multi_processor}")
        print(f"  L2 Cache Size: {props.l2_cache_size / 1024**2:.1f} MB")
        # Identify architecture by compute capability
        if props.major == 9:
            print("  Architecture: Hopper (H100/H200)")
        elif props.major == 8:
            # 8.0 = Ampere (A100), 8.6 = Ampere consumer, 8.9 = Ada Lovelace
            print("  Architecture: Ampere/Ada")

query_gpu_properties()
```
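The delayed-scaling recipe can be sketched in a few lines of plain Python. This is an illustrative simulation, not the actual Transformer Engine implementation (which lives in NVIDIA's `transformer_engine` library); the history window and `margin` parameter here are arbitrary assumptions, and real FP8 casting also rounds the mantissa, which this sketch omits.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def delayed_scale(amax_history, margin=0):
    """Pick a scale so the recent running max maps near the FP8 range limit."""
    return FP8_E4M3_MAX / (max(amax_history) * 2 ** margin)

def fake_fp8_quant(x, scale):
    """Quantize-dequantize sketch: scale toward FP8 range, clamp, scale back.
    (Real FP8 casting also rounds mantissa bits; omitted here.)"""
    scaled = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x * scale))
    return scaled / scale

history = [2.5, 3.1, 2.8, 3.0]  # amax of activations over recent steps
scale = delayed_scale(history)
print(scale, fake_fp8_quant(2.9, scale), fake_fp8_quant(10.0, scale))
```

Values within the tracked range pass through nearly unchanged, while outliers beyond the recent amax get clamped, which is exactly the overflow behavior the engine's rescaling monitors for.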
NVLink 4.0 and NVSwitch
High-speed communication between multiple GPUs is essential for large-scale model training. H100's NVLink 4.0 delivers 900GB/s bidirectional bandwidth per GPU.
- NVLink 3.0 (A100): 600GB/s per GPU
- NVLink 4.0 (H100): 900GB/s per GPU
- NVSwitch 3rd gen: 7.2TB/s total bandwidth per switch
In a DGX H100 system (8 GPUs), four NVSwitch units connect all GPUs in a full-mesh topology. This makes any-to-any GPU communication more than 7x faster than PCIe 5.0 x16 (128GB/s).
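A back-of-envelope model shows why this bandwidth matters for data-parallel training: a ring all-reduce moves roughly 2(N-1)/N of the gradient bytes through each GPU's links. The sketch below is a bandwidth-only estimate that ignores latency and protocol overhead; the 70B/BF16 gradient size is just an illustrative choice.

```python
def allreduce_ms(grad_bytes: float, link_gb_per_s: float, n_gpus: int) -> float:
    """Ideal ring all-reduce time in milliseconds (bandwidth-only model)."""
    traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes  # bytes through each link
    return traffic / (link_gb_per_s * 1e9) * 1e3

grad_bytes = 70e9 * 2  # 70B parameters' gradients in BF16
print(f"NVLink 4.0 (900 GB/s): {allreduce_ms(grad_bytes, 900, 8):.0f} ms")
print(f"NVLink 3.0 (600 GB/s): {allreduce_ms(grad_bytes, 600, 8):.0f} ms")
```

Even under ideal assumptions, the gradient exchange for a step takes hundreds of milliseconds, which is why interconnect bandwidth scales training throughput almost directly.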
2. Google TPU: Systolic Array Architecture
The Heart of TPU: Systolic Array
A TPU (Tensor Processing Unit) is an ASIC specialized for matrix multiplication. The core compute unit, the systolic array, is a structure where data flows through in waves (systolic) while computation occurs.
The MXU (Matrix Multiply Unit) in TPU v4 uses a 128x128 systolic array. Each cell receives inputs from previous cells, performs a MAC (Multiply-Accumulate) operation, and passes results to the next cell.
The advantages of this structure are as follows.
- Minimizes memory accesses: data is reused as it passes through the array
- High arithmetic intensity: more operations per data element
- Deterministic execution: predictable latency
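The wavefront behavior can be simulated in plain NumPy. In the sketch below, cell (i, j) consumes A[i, k] and B[k, j] on cycle i + j + k, mirroring how skewed inputs flow through an output-stationary array; it reproduces an ordinary matrix multiplication.

```python
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Cycle-by-cycle sketch of an output-stationary systolic array.
    Cell (i, j) accumulates C[i, j]; skewed inputs arrive so that
    a[i, k] and b[k, j] meet at cell (i, j) on cycle i + j + k."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for cycle in range(3 * n - 2):      # pipeline fully drains after 3n-2 cycles
        for i in range(n):
            for j in range(n):
                k = cycle - i - j       # which operand pair reaches this cell now
                if 0 <= k < n:
                    C[i, j] += A[i, k] * B[k, j]
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
print(np.allclose(systolic_matmul(A, B), A @ B))  # True
```

Note the data reuse: each A[i, k] is consumed by an entire row of cells and each B[k, j] by an entire column, which is why the array needs so few memory fetches per FLOP.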
TPU v4 vs v5e Comparison
| Specification | TPU v4 | TPU v5e |
|---|---|---|
| BF16 TFLOPS | 275 | 197 |
| INT8 TOPS | 275 | 394 |
| HBM capacity | 32GB | 16GB |
| HBM bandwidth | 1,228GB/s | 819GB/s |
| ICI bandwidth | 1,200GB/s/chip | 1,600GB/s/chip |
| Power consumption | ~170W | ~90W |
| Cost efficiency | Training-optimized | Inference-optimized |
TPU v5e is optimized for power efficiency and is particularly economical for inference workloads.
TPU Pod and ICI
A TPU Pod is a cluster of thousands of TPU chips connected via high-speed ICI (Inter-Chip Interconnect). ICI uses direct chip-to-chip connections instead of data center networks, dramatically reducing latency.
- TPU v4 Pod: 4,096 chips, over 1 exaFLOPS (BF16)
- ICI topology: 3D torus mesh
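In a 3D torus, every chip has exactly six neighbors, with wraparound links at the edges so there are no boundary chips. A coordinate sketch (the 16x16x16 shape is a hypothetical pod layout chosen to give 4,096 chips; actual TPU slice shapes vary):

```python
def torus_neighbors(coord, dims):
    """Six neighbors of a chip in a 3D torus of shape dims (with wraparound)."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            c = list(coord)
            c[axis] = (c[axis] + step) % dims[axis]  # wrap at the torus edge
            neighbors.append(tuple(c))
    return neighbors

# A corner chip in a 16x16x16 torus still has 6 neighbors thanks to wraparound
print(torus_neighbors((0, 0, 0), (16, 16, 16)))
```

The uniform degree and wraparound keep worst-case hop counts low and make collective-communication schedules symmetric across the pod.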
Using TPU with JAX/XLA
```python
# JAX on TPU basic example
import jax
import jax.numpy as jnp
import numpy as np
from jax import random
from jax.sharding import Mesh, PartitionSpec, NamedSharding

# Check available devices
devices = jax.devices()
print(f"Available devices: {devices}")

# Set up 8-way (2x4) batch/model parallelism when a full slice is available;
# arrays sharded over this mesh utilize the whole TPU slice
if len(devices) >= 8:
    mesh = Mesh(np.array(devices[:8]).reshape(2, 4), ('batch', 'model'))

def matrix_multiply_tpu(a, b):
    # XLA automatically maps jnp.dot onto the TPU systolic array (MXU)
    return jnp.dot(a, b)

# Apply XLA optimization with jit compilation
compiled_matmul = jax.jit(matrix_multiply_tpu)

key = random.PRNGKey(0)
a = random.normal(key, (4096, 4096), dtype=jnp.bfloat16)
b = random.normal(key, (4096, 4096), dtype=jnp.bfloat16)
result = compiled_matmul(a, b)
print(f"Result shape: {result.shape}, dtype: {result.dtype}")
```
3. AI ASICs: Purpose-Built Accelerators
Cerebras WSE-3: Wafer Scale Engine
The Cerebras WSE-3 (Wafer Scale Engine 3) is a groundbreaking design that uses an entire silicon wafer as a single chip.
| Specification | WSE-3 |
|---|---|
| Die size | 46,225 mm² (full wafer) |
| AI cores | 900,000 |
| On-chip SRAM | 44GB |
| Memory bandwidth | 21PB/s (on-chip) |
| FP16 performance | 125 PFLOPS |
| Fabric bandwidth | 220Pb/s |
The key advantage is the complete elimination of inter-chip communication bottlenecks. In conventional GPU clusters, hundreds of GPUs are connected via networks or NVLink, incurring communication overhead. In WSE-3, all cores are connected via an on-chip fabric on a single wafer, with latency in the nanosecond range.
Cerebras claims that a single CS-3 system can replace up to 24 server racks of GPU clusters for large model training.
Graphcore IPU
Graphcore's IPU (Intelligence Processing Unit) uses the Bulk Synchronous Parallel (BSP) execution model.
- MK2 GC200: 1,472 IPU tiles running 8,832 independent hardware threads in total (6 per tile)
- On-chip memory: 900MB (SRAM)
- Bandwidth: 45TB/s
- Strengths: Optimized for sparse operations, excellent for graph neural networks
The IPU outperforms GPUs for irregular graph structure computations and excels at reinforcement learning and GNN workloads.
Groq LPU
The Groq LPU (Language Processing Unit) is an ASIC specialized for LLM inference, characterized by a deterministic execution architecture.
- Software-defined memory: no dynamic memory management at runtime
- Statically scheduled streaming: all memory access patterns determined at compile time
- Cycle-deterministic execution: fully predictable latency
As a result, Groq achieves over 240 tokens per second on LLaMA-3 70B inference, roughly an order of magnitude faster than typical GPU-based serving.
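This figure is plausible from first principles: single-stream autoregressive decoding is memory-bandwidth bound, since every generated token must read all model weights once. The helper below is a bandwidth-only ceiling that ignores KV-cache traffic and compute; note Groq spreads a 70B model across many chips (each chip holds only 230MB of SRAM), so the aggregate SRAM bandwidth is what matters in practice.

```python
def max_decode_tokens_per_s(n_params: float, bytes_per_param: float,
                            mem_bw_bytes_per_s: float) -> float:
    """Bandwidth-bound ceiling on single-stream decode throughput."""
    return mem_bw_bytes_per_s / (n_params * bytes_per_param)

# 70B model in BF16 (2 bytes per parameter)
print(f"H100 HBM3 (3.35 TB/s): {max_decode_tokens_per_s(70e9, 2, 3.35e12):.1f} tok/s")
print(f"SRAM tier (80 TB/s):   {max_decode_tokens_per_s(70e9, 2, 80e12):.1f} tok/s")
```

A single HBM-based GPU tops out in the low tens of tokens per second for one stream, while an SRAM-fed design has orders of magnitude more headroom, which is the architectural bet Groq makes.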
SambaNova DataScale
SambaNova's RDU (Reconfigurable Dataflow Unit) adopts a dataflow architecture.
- Loads model weights entirely into on-chip SRAM
- Minimizes DRAM access, eliminating memory bottlenecks
- Supports GPT-4-class model inference
4. Inference-Only Chips
AWS Inferentia 2
AWS's proprietary inference chip, designed in-house. Together with Trainium, it forms the core of AWS's AI hardware strategy.
| Specification | Inferentia 1 | Inferentia 2 |
|---|---|---|
| NeuronCore count | 4 | 2 (enhanced design) |
| FP16 TFLOPS | 128 | 384 |
| Memory | 8GB | 32GB HBM |
| Memory bandwidth | 50GB/s | 820GB/s |
| NeuronLink bandwidth | — | 384GB/s |
| Price (per hour) | inf1.xlarge ~$0.228 | inf2.xlarge ~$0.758 |
Inferentia 2 transparently supports PyTorch, TensorFlow, and JAX models through the NeuronSDK.
Intel Gaudi 3
Intel Gaudi 3, designed by Habana Labs (acquired by Intel), directly competes with the H100.
| Specification | Gaudi 3 | H100 SXM5 |
|---|---|---|
| BF16 TFLOPS | 1,835 | 1,979 |
| FP8 TOPS | 1,835 | 3,958 |
| HBM capacity | 128GB HBM2e | 80GB HBM3 |
| HBM bandwidth | 3.7TB/s | 3.35TB/s |
| Networking | 24x 200GbE RoCE | NVLink 4.0 |
| TDP | 900W | 700W |
In terms of cost efficiency, Gaudi 3 offers cloud instances approximately 30% cheaper than H100.
Qualcomm Cloud AI 100
Qualcomm's data center inference chip, with power efficiency as its strength.
- AI 100 Ultra: 870 TOPS (INT8), 150W
- On-chip memory: 144MB SRAM
- Memory bandwidth: 3.6TB/s
- Up to 8 cards per server supported
5. Edge AI Chips
Apple Neural Engine (ANE)
Apple Silicon's Neural Engine is a dedicated AI accelerator built into iPhone, iPad, and Mac devices.
| Chip | ANE Performance | Release Year |
|---|---|---|
| A15 Bionic | 15.8 TOPS | 2021 |
| A16 Bionic | 17 TOPS | 2022 |
| A17 Pro | 35 TOPS | 2023 |
| M4 | 38 TOPS | 2024 |
The ANE is accessible through the CoreML framework and delivers up to 10x better power efficiency than the CPU for model inference.
```python
# Deploy edge AI with Apple CoreML
import coremltools as ct
import torch
import torchvision

# Convert PyTorch model to CoreML
weights = torchvision.models.MobileNet_V3_Small_Weights.DEFAULT
model = torchvision.models.mobilenet_v3_small(weights=weights)
model.eval()

# Trace with example input
example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)

# CoreML conversion (targeting Neural Engine)
mlmodel = ct.convert(
    traced_model,
    inputs=[ct.ImageType(
        name="input",
        shape=example_input.shape,
        color_layout=ct.colorlayout.RGB,
    )],
    compute_units=ct.ComputeUnit.ALL,  # Auto-select ANE + GPU + CPU
    minimum_deployment_target=ct.target.iOS17,
)
mlmodel.save("mobilenet_v3_small.mlpackage")
print("CoreML model saved - Neural Engine optimization applied")
```
Qualcomm Hexagon DSP
The Hexagon DSP embedded in Qualcomm Snapdragon is the heart of smartphone AI processing.
- Hexagon NPU (Snapdragon 8 Gen 3): 98 TOPS
- HVX (Hexagon Vector eXtensions): SIMD vector operations
- HTA (Hexagon Tensor Accelerator): dedicated matrix/tensor acceleration
TensorFlow/PyTorch models can be deployed to Hexagon via the Qualcomm Neural Processing SDK (SNPE).
Raspberry Pi 5 AI HAT
The Raspberry Pi AI HAT+ is an edge AI accelerator featuring the Hailo-8L chip.
- Hailo-8L: 13 TOPS
- Connects to RPi 5 via M.2 interface
- Price: ~$70
- Use cases: real-time video analysis, object detection
6. Memory Technology: HBM3e vs GDDR7
HBM (High Bandwidth Memory) Architecture
HBM is a memory technology that stacks DRAM dies vertically (3D stacking) and connects them to the GPU through a silicon interposer.
| Memory | Bandwidth | Capacity | Power | Pin count | Primary Use |
|---|---|---|---|---|---|
| HBM2e | 3.2TB/s | up to 80GB | ~460W | 1,024 | A100 |
| HBM3 | 3.35TB/s | up to 80GB | ~700W | 1,024 | H100 |
| HBM3e | 4.8TB/s | up to 141GB | ~700W | 1,024 | H200, MI300X |
| GDDR6X | 576GB/s | up to 24GB | Low | 384 | RTX 4090 |
| GDDR7 | 1,792GB/s | up to 32GB | Low | 512 | RTX 5090 |
There are three primary reasons HBM is advantageous for AI training.
- Bandwidth: Several times the memory bandwidth of even the fastest GDDR configurations, which directly relieves memory bottlenecks during large-batch training.
- Capacity: 80-141GB on a single package lets a 70B-parameter model run on one GPU (in FP16 on H200, or with quantization on 80GB parts).
- Energy efficiency: Lower power consumption per byte than GDDR improves TCO.
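These trade-offs are captured by the roofline model: a kernel's attainable throughput is the lesser of peak compute and arithmetic intensity times memory bandwidth. A sketch using the H100's published BF16 peak and HBM3 bandwidth (the matrix shapes are illustrative):

```python
def matmul_intensity(m: int, n: int, k: int, bytes_per_el: int = 2) -> float:
    """FLOPs per byte moved for an (m x k) @ (k x n) matmul (read A, B; write C)."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)
    return flops / bytes_moved

def attainable_tflops(intensity: float, peak_tflops: float, bw_tb_s: float) -> float:
    """Roofline: min(compute roof, bandwidth roof)."""
    return min(peak_tflops, intensity * bw_tb_s)

i_big = matmul_intensity(4096, 4096, 4096)   # large square matmul
i_thin = matmul_intensity(1, 4096, 4096)     # decode-style GEMV
print(attainable_tflops(i_big, 1979, 3.35))   # hits the BF16 compute roof
print(attainable_tflops(i_thin, 1979, 3.35))  # stuck at the bandwidth roof
```

A large matmul is compute-bound, but the batch-1 GEMV typical of decoding has an arithmetic intensity near 1 FLOP/byte, so its throughput is set almost entirely by memory bandwidth, which is why HBM matters so much for inference.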
Near-Memory Computing
Near-memory computing (also called Processing-in-Memory, PIM) places compute units inside the memory itself. Samsung HBM-PIM and SK Hynix AiM (Accelerator in Memory) are representative examples.
- Minimizes data movement between memory and compute units
- Fundamentally resolves memory bandwidth bottlenecks
- Especially effective for memory-bound operations during inference
CXL (Compute Express Link)
CXL is a next-generation interconnect standard that connects CPUs, accelerators, and memory expansion devices over a PCIe physical layer.
- CXL 1.1: Type 1 (accelerator), Type 2 (accelerator + memory), Type 3 (memory expansion)
- CXL 2.0: Multi-host sharing with switching support
- CXL 3.0: P2P communication, fabric support
Attempts to solve GPU VRAM shortages using CXL Type 3 memory expansion in AI servers are increasing.
7. Hardware Selection Guide
Training vs Inference
Optimal hardware differs by workload type.
Large-scale training (Pre-training)
- Best: H100 SXM5 (NVLink required), TPU v4 Pod
- Reason: High MFU (Model FLOP Utilization), fast collective communication via NVLink/ICI
- Batch size: As large as possible (global batch of millions of tokens)
Fine-tuning
- Best: H100/A100, AMD MI300X, Gaudi 3
- Reason: Mid-scale GPU clusters, cost efficiency
- Batch size: Medium (512–4096 tokens)
Large-scale inference (Serving, high throughput)
- Best: H100, Inferentia 2, Gaudi 3
- Reason: Large KV cache capacity, high throughput
- Batch size: Dynamic (continuous batching)
Low-latency inference (Latency-critical)
- Best: Groq LPU, Cerebras CS-3
- Reason: Deterministic execution, no memory bottlenecks
- Batch size: Small (1–8)
VRAM Requirements by Model Size (Inference)
| Model Size | Parameters | FP16 VRAM | Minimum GPU (BF16) |
|---|---|---|---|
| Small | 7B | 14GB | 1x A10G (24GB) |
| Medium | 13B | 26GB | 1x A100 (40GB) |
| Large | 34B | 68GB | 2x A100 (80GB) |
| XL | 70B | 140GB | 2x H100 (80GB) |
| XXL | 405B | 810GB | 16x H100 (2 nodes) |
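The FP16 column is just parameters x 2 bytes; a practical estimate should also reserve headroom for the KV cache and activations. A rule-of-thumb helper (the 20% overhead factor is an assumption for illustration; real KV-cache size grows with batch size and context length):

```python
def inference_vram_gb(params_billions: float,
                      bytes_per_param: int = 2,
                      overhead: float = 1.2) -> float:
    """Rough VRAM need for serving: weights x dtype size x headroom factor.
    The 1.2x overhead is an assumed margin for KV cache and activations."""
    return params_billions * bytes_per_param * overhead

for size in (7, 13, 34, 70, 405):
    print(f"{size}B -> {inference_vram_gb(size):.0f} GB")
```

Dropping `bytes_per_param` to 1 (INT8) or 0.5 (4-bit) shows why quantization often halves or quarters the GPU count required.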
PyTorch Device Selection and Benchmarking
```python
# PyTorch device selection and benchmarking
import torch
import time

def benchmark_matmul(device_name: str, size: int = 4096, dtype=torch.float16):
    """Matrix multiplication benchmark"""
    device = torch.device(device_name)
    a = torch.randn(size, size, dtype=dtype, device=device)
    b = torch.randn(size, size, dtype=dtype, device=device)

    def sync():
        # Wait for async kernels so wall-clock timing is accurate
        if device.type == 'cuda':
            torch.cuda.synchronize()
        elif device.type == 'mps':
            torch.mps.synchronize()

    # Warm-up
    for _ in range(5):
        _ = torch.matmul(a, b)
    sync()

    start = time.perf_counter()
    for _ in range(100):
        c = torch.matmul(a, b)
    sync()
    elapsed = time.perf_counter() - start

    ops = 2 * size ** 3 * 100  # FLOPs for 100 matmuls
    tflops = ops / elapsed / 1e12
    print(f"{device_name} ({dtype}): {tflops:.2f} TFLOPS ({elapsed*1000/100:.2f} ms/iter)")

# Auto-select available devices
if torch.cuda.is_available():
    benchmark_matmul("cuda:0", dtype=torch.float16)
    benchmark_matmul("cuda:0", dtype=torch.bfloat16)
if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    benchmark_matmul("mps", dtype=torch.float16)
benchmark_matmul("cpu", dtype=torch.float32)
```
torch.compile for Hardware Optimization
```python
# Hardware optimization with torch.compile
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=1024, nhead=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ff(x))
        return x

model = TransformerBlock().cuda().to(torch.bfloat16)

# torch.compile: automatic optimization with Triton kernels
# Leverages Hopper-specific FlashAttention on H100
compiled_model = torch.compile(model, mode="max-autotune")

x = torch.randn(8, 512, 1024, dtype=torch.bfloat16, device="cuda")

# First run triggers compilation (takes a few seconds)
with torch.autocast("cuda", dtype=torch.bfloat16):
    out = compiled_model(x)
print(f"Output shape: {out.shape}")
```
Cost Efficiency Analysis (Cloud hourly prices, 2025)
| Instance | GPU | Hourly Price | TFLOPS (BF16) | $/PFLOP-hr |
|---|---|---|---|---|
| p4d.24xlarge | 8x A100 40GB | $32.77 | 8 x 312 = 2,496 | $13.1 |
| p4de.24xlarge | 8x A100 80GB | $40.96 | 8 x 312 = 2,496 | $16.4 |
| p5.48xlarge | 8x H100 80GB | $98.32 | 8 x 1,979 = 15,832 | $6.2 |
| trn1.32xlarge | 16x Trainium | $21.50 | 16 x 420 = 6,720 | $3.2 |
| inf2.48xlarge | 12x Inferentia2 | $12.98 | 12 x 384 = 4,608 | $2.8 |
| g6.48xlarge | 8x L40S 48GB | $16.29 | 8 x 733 = 5,864 | $2.8 |
For inference workloads, Inferentia 2 and Trainium offer the best cost efficiency.
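The cost-efficiency column is simply the hourly price divided by the instance's aggregate compute in PFLOPS, which can be checked directly:

```python
def dollars_per_pflop_hour(hourly_usd: float, n_chips: int,
                           tflops_per_chip: float) -> float:
    """Hourly price per petaFLOP of aggregate dense BF16 compute."""
    total_pflops = n_chips * tflops_per_chip / 1000
    return hourly_usd / total_pflops

print(f"p5.48xlarge:   ${dollars_per_pflop_hour(98.32, 8, 1979):.1f}/PFLOP-hr")
print(f"inf2.48xlarge: ${dollars_per_pflop_hour(12.98, 12, 384):.1f}/PFLOP-hr")
```

Keep in mind this metric assumes you can saturate the chips; real-world MFU differences between platforms can easily outweigh a 2x gap in paper cost efficiency.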
8. Comprehensive Hardware Comparison
| Accelerator | Type | BF16 TFLOPS | Memory | Bandwidth | TDP | Primary Use |
|---|---|---|---|---|---|---|
| H100 SXM5 | GPU | 1,979 | 80GB HBM3 | 3.35TB/s | 700W | Training/Inference |
| H200 SXM5 | GPU | 1,979 | 141GB HBM3e | 4.8TB/s | 700W | Large model inference |
| A100 SXM4 | GPU | 312 | 80GB HBM2e | 2.0TB/s | 400W | General purpose |
| AMD MI300X | GPU | 1,307 | 192GB HBM3 | 5.3TB/s | 750W | Large models |
| TPU v5e | ASIC | 197 (INT8: 394) | 16GB HBM | 819GB/s | 90W | Large-scale inference |
| Cerebras WSE-3 | ASIC | 125,000 | 44GB SRAM | 21PB/s | 23kW/system | Ultra-large training |
| Groq LPU | ASIC | 188 (INT8: 750) | 230MB SRAM | 80TB/s | 300W | Low-latency inference |
| Gaudi 3 | ASIC | 1,835 | 128GB HBM2e | 3.7TB/s | 900W | Cost-efficient training |
| Inferentia 2 | ASIC | 384 | 32GB HBM | 820GB/s | 75W | Cloud inference |
| Apple M4 ANE | Edge | 38 TOPS | Shared | Shared | ~10W | On-device |
| Hailo-8L | Edge | 13 TOPS | — | — | 1W | Embedded |
Quiz
Q1. How does NVIDIA H100's Transformer Engine maintain precision during FP8 training?
Answer: Dynamic Scaling combined with mixed-precision accumulation
Explanation: The Transformer Engine tracks statistics (maximum value) of activations and weights per layer. From these, it computes an optimal scale factor for FP8 quantization. The forward pass is executed in FP8, but gradient accumulation is maintained in BF16/FP32. The engine also monitors the numerical range per layer and automatically rescales when overflow or underflow is detected. Thanks to this Delayed Scaling mechanism, FP8 speed benefits are achieved while maintaining training stability close to BF16.
Q2. How does Google TPU's systolic array parallelize matrix multiplication?
Answer: Pipeline-style MAC operation array with data reuse
Explanation: A systolic array consists of NxN MAC (Multiply-Accumulate) units arranged in a grid. Row data from matrix A flows left to right, and column data from matrix B flows top to bottom. Each cell multiplies the two values passing through it and adds the result to the accumulated value from the previous cell. Because data flows like waves (systoles), each data element is reused by all relevant cells in the array. TPU v4's 128x128 MXU performs 128x128 = 16,384 MAC operations per clock cycle, all processed on-chip without memory accesses.
Q3. Why is HBM better than GDDR for AI training (bandwidth vs capacity)?
Answer: HBM holds advantages in both bandwidth and capacity
Explanation: On the bandwidth side, HBM3e (H200) delivers 4.8TB/s while GDDR7 (RTX 5090) delivers about 1.8TB/s, roughly a 2.7x gap, and the gap over earlier GDDR6X parts exceeds 5x. AI training has many bandwidth-bound operations, so this difference translates directly into performance. On the capacity side, the H200's 141GB HBM3e is more than 4x the RTX 5090's 32GB GDDR7, allowing 70B parameter models to be served from a single GPU. Structurally, HBM vertically stacks DRAM dies and connects them to the GPU through a very wide interposer bus (1,024 pins per stack), achieving both high bandwidth and energy efficiency simultaneously.
Q4. How does Cerebras WSE-3's wafer-scale integration eliminate inter-chip communication bottlenecks?
Answer: All cores connected through an on-chip fabric on a single wafer
Explanation: In conventional GPU clusters, hundreds of chips are connected via NVLink, InfiniBand, and similar networks. This inter-chip communication has latency in the microsecond range and limited bandwidth. WSE-3's 900,000 AI cores all reside on a single wafer, so all inter-core communication flows through on-chip fabric. On-chip fabric latency is in the nanosecond range and bandwidth reaches 220Pb/s. Additionally, 44GB of SRAM is distributed near the cores, minimizing memory access latency. This enables near-linear scaling for large model training with almost no communication overhead.
Q5. What architectural decisions allow Groq LPU to achieve lower latency than GPUs for LLM inference?
Answer: Deterministic memory scheduling at compile time
Explanation: The primary causes of high LLM inference latency on GPUs are irregular memory access patterns and runtime dynamic scheduling. The Groq LPU statically determines all tensor memory locations and movement paths at compile time. There is no memory allocation/deallocation or scheduler overhead during execution. The SRAM-based memory architecture also eliminates the irregular access latency of DRAM. All operations execute at predetermined clock cycles, making latency fully predictable. Thanks to this deterministic execution, Groq achieves over 240 tokens per second throughput and very low time-to-first-token (TTFT) latency for LLaMA-3 70B.
Conclusion
The AI hardware accelerator market is diversifying rapidly between 2024 and 2026. While NVIDIA H100/H200 remain the gold standard for training workloads, purpose-optimized accelerators demonstrate advantages in specific use cases.
The core selection principles are as follows.
- Training: Bandwidth and NVLink are critical — H100 SXM5, TPU v4 Pod
- High-throughput inference: Cost efficiency matters — Inferentia 2, Gaudi 3, TPU v5e
- Low-latency inference: Deterministic execution — Groq LPU
- Edge deployment: Power efficiency — Apple ANE, Qualcomm Hexagon
- Ultra-large training: No inter-chip bottleneck — Cerebras WSE-3
Hardware selection ultimately balances workload characteristics, budget, and ecosystem maturity. The maturity of the NVIDIA ecosystem remains a powerful advantage, but purpose-built ASICs can be far more economical for specific workloads.