AI Hardware Accelerators Complete Guide: H100, TPU, Cerebras, and Edge AI Chips Compared

Author: Youngju Kim (@fjvbn20031)
Introduction
As AI workloads diversify, the hardware accelerator market has exploded in variety. While NVIDIA GPUs remain dominant, purpose-built accelerators — Google TPU, Cerebras WSE-3, AWS Inferentia, Apple Neural Engine, and many others — are rapidly claiming their niches.
This guide systematically compares the architecture, performance characteristics, and use cases of major AI hardware accelerators. From selecting training GPUs to deploying models on edge chips, everything you need to make the right hardware decision is covered here.
1. NVIDIA Hopper Architecture: H100 & H200
Hopper SM Structure
The NVIDIA H100 is built on the Hopper microarchitecture. Each Streaming Multiprocessor (SM) contains the following components.
- 4 warp schedulers: Schedule 4 warps (32 threads each) simultaneously
- 4th-generation Tensor Cores: Support FP8, FP16, BF16, TF32, and FP64
- Shared memory: Up to 228KB per SM (including L1 cache)
- Register file: 65,536 32-bit registers per SM
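These per-SM limits determine how many warps an SM can keep resident at once. A simplified occupancy estimate, considering register pressure only (real occupancy also depends on shared-memory usage and block size; the 64-warp cap is Hopper's per-SM hardware limit):

```python
def max_resident_warps(regs_per_thread: int,
                       regs_per_sm: int = 65536,
                       threads_per_warp: int = 32,
                       hw_warp_limit: int = 64) -> int:
    """Estimate resident warps per SM from register pressure alone.
    Ignores shared-memory and block-size constraints (simplification)."""
    warps_by_registers = regs_per_sm // (regs_per_thread * threads_per_warp)
    return min(warps_by_registers, hw_warp_limit)

# A kernel using 32 registers/thread can fill the SM with warps;
# a register-heavy kernel at 128 registers/thread cannot
print(max_resident_warps(32))
print(max_resident_warps(128))
```

This is why compilers sometimes spill registers deliberately: trading a few extra memory accesses for more resident warps can raise overall throughput.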
Full H100 SXM5 specifications are as follows.
| Specification | H100 SXM5 | H200 SXM5 |
|---|---|---|
| SM count | 132 | 132 |
| CUDA cores | 16,896 | 16,896 |
| Tensor Cores (4th gen) | 528 | 528 |
| FP8 TFLOPS | 3,958 | 3,958 |
| BF16 TFLOPS | 1,979 | 1,979 |
| Memory type | HBM3 | HBM3e |
| Memory capacity | 80GB | 141GB |
| Memory bandwidth | 3.35TB/s | 4.8TB/s |
| TDP | 700W | 700W |
| NVLink bandwidth | 900GB/s | 900GB/s |
4th-Gen Tensor Cores and Transformer Engine
The key innovation in H100 is the Transformer Engine. This engine supports FP8 computation while minimizing precision loss.
The operating principle: per transformer layer, the engine tracks the maximum absolute value (amax) of activations and weights, and computes a dynamic scaling factor from a short history of these statistics. FP8 arithmetic is used while the scaling keeps values inside FP8's narrow representable range, maintaining numerical stability.
```python
# CUDA device properties query
import torch

def query_gpu_properties():
    if not torch.cuda.is_available():
        print("CUDA is not available.")
        return
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}")
        print(f"  Compute Capability: {props.major}.{props.minor}")
        print(f"  Total Memory: {props.total_memory / 1024**3:.1f} GB")
        print(f"  Multiprocessors: {props.multi_processor_count}")
        print(f"  Max Threads/SM: {props.max_threads_per_multi_processor}")
        print(f"  L2 Cache Size: {props.l2_cache_size / 1024**2:.1f} MB")
        # Identify architecture by compute capability
        if props.major == 9:
            print("  Architecture: Hopper (H100/H200)")
        elif props.major == 8:
            # 8.0 = Ampere (A100), 8.6 = Ampere consumer, 8.9 = Ada Lovelace
            print("  Architecture: Ampere/Ada")

query_gpu_properties()
```
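The delayed-scaling recipe can be sketched in a few lines of plain Python. This is an illustrative simulation, not the actual Transformer Engine implementation (which lives in NVIDIA's `transformer_engine` library); the history window and `margin` parameter here are arbitrary assumptions, and real FP8 casting also rounds the mantissa, which this sketch omits.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def delayed_scale(amax_history, margin=0):
    """Pick a scale so the recent running max maps near the FP8 range limit."""
    return FP8_E4M3_MAX / (max(amax_history) * 2 ** margin)

def fake_fp8_quant(x, scale):
    """Quantize-dequantize sketch: scale toward FP8 range, clamp, scale back.
    (Real FP8 casting also rounds mantissa bits; omitted here.)"""
    scaled = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x * scale))
    return scaled / scale

history = [2.5, 3.1, 2.8, 3.0]  # amax of activations over recent steps
scale = delayed_scale(history)
print(scale, fake_fp8_quant(2.9, scale), fake_fp8_quant(10.0, scale))
```

Values within the tracked range pass through nearly unchanged, while outliers beyond the recent amax get clamped, which is exactly the overflow behavior the engine's rescaling monitors for.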
NVLink 4.0 and NVSwitch
High-speed communication between multiple GPUs is essential for large-scale model training. H100's NVLink 4.0 delivers 900GB/s bidirectional bandwidth per GPU.
- NVLink 3.0 (A100): 600GB/s per GPU
- NVLink 4.0 (H100): 900GB/s per GPU
- NVSwitch 3rd gen: 7.2TB/s total bandwidth per switch
In a DGX H100 system (8 GPUs), four NVSwitch units connect all GPUs in a full-mesh topology. This makes any-to-any GPU communication more than 7x faster than PCIe 5.0 x16 (128GB/s).
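A back-of-envelope model shows why this bandwidth matters for data-parallel training: a ring all-reduce moves roughly 2(N-1)/N of the gradient bytes through each GPU's links. The sketch below is a bandwidth-only estimate that ignores latency and protocol overhead; the 70B/BF16 gradient size is just an illustrative choice.

```python
def allreduce_ms(grad_bytes: float, link_gb_per_s: float, n_gpus: int) -> float:
    """Ideal ring all-reduce time in milliseconds (bandwidth-only model)."""
    traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes  # bytes through each link
    return traffic / (link_gb_per_s * 1e9) * 1e3

grad_bytes = 70e9 * 2  # 70B parameters' gradients in BF16
print(f"NVLink 4.0 (900 GB/s): {allreduce_ms(grad_bytes, 900, 8):.0f} ms")
print(f"NVLink 3.0 (600 GB/s): {allreduce_ms(grad_bytes, 600, 8):.0f} ms")
```

Even under ideal assumptions, the gradient exchange for a step takes hundreds of milliseconds, which is why interconnect bandwidth scales training throughput almost directly.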
2. Google TPU: Systolic Array Architecture
The Heart of TPU: Systolic Array
A TPU (Tensor Processing Unit) is an ASIC specialized for matrix multiplication. The core compute unit, the systolic array, is a structure where data flows through in waves (systolic) while computation occurs.
The MXU (Matrix Multiply Unit) in TPU v4 uses a 128x128 systolic array. Each cell receives inputs from previous cells, performs a MAC (Multiply-Accumulate) operation, and passes results to the next cell.
The advantages of this structure are as follows.
- Minimizes memory accesses: data is reused as it passes through the array
- High arithmetic intensity: more operations per data element
- Deterministic execution: predictable latency
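The wavefront behavior can be simulated in plain NumPy. In the sketch below, cell (i, j) consumes A[i, k] and B[k, j] on cycle i + j + k, mirroring how skewed inputs flow through an output-stationary array; it reproduces an ordinary matrix multiplication.

```python
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Cycle-by-cycle sketch of an output-stationary systolic array.
    Cell (i, j) accumulates C[i, j]; skewed inputs arrive so that
    a[i, k] and b[k, j] meet at cell (i, j) on cycle i + j + k."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for cycle in range(3 * n - 2):      # pipeline fully drains after 3n-2 cycles
        for i in range(n):
            for j in range(n):
                k = cycle - i - j       # which operand pair reaches this cell now
                if 0 <= k < n:
                    C[i, j] += A[i, k] * B[k, j]
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
print(np.allclose(systolic_matmul(A, B), A @ B))  # True
```

Note the data reuse: each A[i, k] is consumed by an entire row of cells and each B[k, j] by an entire column, which is why the array needs so few memory fetches per FLOP.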
TPU v4 vs v5e Comparison
| Specification | TPU v4 | TPU v5e |
|---|---|---|
| BF16 TFLOPS | 275 | 197 |
| INT8 TOPS | 275 | 394 |
| HBM capacity | 32GB | 16GB |
| HBM bandwidth | 1,228GB/s | 819GB/s |
| ICI bandwidth | 1,200GB/s/chip | 1,600GB/s/chip |
| Power consumption | ~170W | ~90W |
| Cost efficiency | Training-optimized | Inference-optimized |
TPU v5e is optimized for power efficiency and is particularly economical for inference workloads.
TPU Pod and ICI
A TPU Pod is a cluster of thousands of TPU chips connected via high-speed ICI (Inter-Chip Interconnect). ICI uses direct chip-to-chip connections instead of data center networks, dramatically reducing latency.
- TPU v4 Pod: 4,096 chips, over 1 exaFLOPS (BF16)
- ICI topology: 3D torus mesh
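In a 3D torus, every chip has exactly six neighbors, with wraparound links at the edges so there are no boundary chips. A coordinate sketch (the 16x16x16 shape is a hypothetical pod layout chosen to give 4,096 chips; actual TPU slice shapes vary):

```python
def torus_neighbors(coord, dims):
    """Six neighbors of a chip in a 3D torus of shape dims (with wraparound)."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            c = list(coord)
            c[axis] = (c[axis] + step) % dims[axis]  # wrap at the torus edge
            neighbors.append(tuple(c))
    return neighbors

# A corner chip in a 16x16x16 torus still has 6 neighbors thanks to wraparound
print(torus_neighbors((0, 0, 0), (16, 16, 16)))
```

The uniform degree and wraparound keep worst-case hop counts low and make collective-communication schedules symmetric across the pod.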
Using TPU with JAX/XLA
```python
# JAX on TPU basic example
import jax
import jax.numpy as jnp
import numpy as np
from jax import random
from jax.sharding import Mesh, PartitionSpec, NamedSharding

# Check available devices
devices = jax.devices()
print(f"Available devices: {devices}")

# Set up 8-way (2x4) batch/model parallelism when a full slice is available;
# arrays sharded over this mesh utilize the whole TPU slice
if len(devices) >= 8:
    mesh = Mesh(np.array(devices[:8]).reshape(2, 4), ('batch', 'model'))

def matrix_multiply_tpu(a, b):
    # XLA automatically maps jnp.dot onto the TPU systolic array (MXU)
    return jnp.dot(a, b)

# Apply XLA optimization with jit compilation
compiled_matmul = jax.jit(matrix_multiply_tpu)

key = random.PRNGKey(0)
a = random.normal(key, (4096, 4096), dtype=jnp.bfloat16)
b = random.normal(key, (4096, 4096), dtype=jnp.bfloat16)
result = compiled_matmul(a, b)
print(f"Result shape: {result.shape}, dtype: {result.dtype}")
```
3. AI ASICs: Purpose-Built Accelerators
Cerebras WSE-3: Wafer Scale Engine
The Cerebras WSE-3 (Wafer Scale Engine 3) is a groundbreaking design that uses an entire silicon wafer as a single chip.
| Specification | WSE-3 |
|---|---|
| Die size | 46,225 mm² (full wafer) |
| AI cores | 900,000 |
| On-chip SRAM | 44GB |
| Memory bandwidth | 21PB/s (on-chip) |
| FP16 performance | 125 PFLOPS |
| Fabric bandwidth | 220Pb/s |
The key advantage is the complete elimination of inter-chip communication bottlenecks. In conventional GPU clusters, hundreds of GPUs are connected via networks or NVLink, incurring communication overhead. In WSE-3, all cores are connected via an on-chip fabric on a single wafer, with latency in the nanosecond range.
Cerebras claims that a single CS-3 system can replace up to 24 server racks of GPU clusters for large model training.
Graphcore IPU
Graphcore's IPU (Intelligence Processing Unit) uses the Bulk Synchronous Parallel (BSP) execution model.
- MK2 GC200: 1,472 IPU tiles running 8,832 independent hardware threads in total (6 per tile)
- On-chip memory: 900MB (SRAM)
- Bandwidth: 45TB/s
- Strengths: Optimized for sparse operations, excellent for graph neural networks
The IPU outperforms GPUs for irregular graph structure computations and excels at reinforcement learning and GNN workloads.
Groq LPU
The Groq LPU (Language Processing Unit) is an ASIC specialized for LLM inference, characterized by a deterministic execution architecture.
- Software-defined memory: no dynamic memory management at runtime
- Statically scheduled streaming: all memory access patterns determined at compile time
- Cycle-deterministic execution: fully predictable latency
As a result, Groq achieves over 240 tokens per second on LLaMA-3 70B inference, roughly an order of magnitude faster than typical GPU-based serving.
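This figure is plausible from first principles: single-stream autoregressive decoding is memory-bandwidth bound, since every generated token must read all model weights once. The helper below is a bandwidth-only ceiling that ignores KV-cache traffic and compute; note Groq spreads a 70B model across many chips (each chip holds only 230MB of SRAM), so the aggregate SRAM bandwidth is what matters in practice.

```python
def max_decode_tokens_per_s(n_params: float, bytes_per_param: float,
                            mem_bw_bytes_per_s: float) -> float:
    """Bandwidth-bound ceiling on single-stream decode throughput."""
    return mem_bw_bytes_per_s / (n_params * bytes_per_param)

# 70B model in BF16 (2 bytes per parameter)
print(f"H100 HBM3 (3.35 TB/s): {max_decode_tokens_per_s(70e9, 2, 3.35e12):.1f} tok/s")
print(f"SRAM tier (80 TB/s):   {max_decode_tokens_per_s(70e9, 2, 80e12):.1f} tok/s")
```

A single HBM-based GPU tops out in the low tens of tokens per second for one stream, while an SRAM-fed design has orders of magnitude more headroom, which is the architectural bet Groq makes.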
SambaNova DataScale
SambaNova's RDU (Reconfigurable Dataflow Unit) adopts a dataflow architecture.
- Loads model weights entirely into on-chip SRAM
- Minimizes DRAM access, eliminating memory bottlenecks
- Supports GPT-4-class model inference
4. Inference-Only Chips
AWS Inferentia 2
AWS's proprietary inference chip, designed in-house. Together with Trainium, it forms the core of AWS's AI hardware strategy.
| Specification | Inferentia 1 | Inferentia 2 |
|---|---|---|
| NeuronCore count | 4 | 2 (enhanced design) |
| FP16 TFLOPS | 128 | 384 |
| Memory | 8GB | 32GB HBM |
| Memory bandwidth | 50GB/s | 820GB/s |
| NeuronLink bandwidth | — | 384GB/s |
| Price (per hour) | inf1.xlarge ~$0.228 | inf2.xlarge ~$0.758 |
Inferentia 2 transparently supports PyTorch, TensorFlow, and JAX models through the NeuronSDK.
Intel Gaudi 3
Intel Gaudi 3, designed by Habana Labs (acquired by Intel), directly competes with the H100.
| Specification | Gaudi 3 | H100 SXM5 |
|---|---|---|
| BF16 TFLOPS | 1,835 | 1,979 |
| FP8 TOPS | 1,835 | 3,958 |
| HBM capacity | 128GB HBM2e | 80GB HBM3 |
| HBM bandwidth | 3.7TB/s | 3.35TB/s |
| Networking | 24x 200GbE RoCE | NVLink 4.0 |
| TDP | 900W | 700W |
In terms of cost efficiency, Gaudi 3 offers cloud instances approximately 30% cheaper than H100.
Qualcomm Cloud AI 100
Qualcomm's data center inference chip, with power efficiency as its strength.
- AI 100 Ultra: 870 TOPS (INT8), 150W
- On-chip memory: 144MB SRAM
- Memory bandwidth: 3.6TB/s
- Up to 8 cards per server supported
5. Edge AI Chips
Apple Neural Engine (ANE)
Apple Silicon's Neural Engine is a dedicated AI accelerator built into iPhone, iPad, and Mac devices.
| Chip | ANE Performance | Release Year |
|---|---|---|
| A15 Bionic | 15.8 TOPS | 2021 |
| A16 Bionic | 17 TOPS | 2022 |
| A17 Pro | 35 TOPS | 2023 |
| M4 | 38 TOPS | 2024 |
The ANE is accessible through the CoreML framework and delivers up to 10x better power efficiency than the CPU for model inference.
```python
# Deploy edge AI with Apple CoreML
import coremltools as ct
import torch
import torchvision

# Convert PyTorch model to CoreML
weights = torchvision.models.MobileNet_V3_Small_Weights.DEFAULT
model = torchvision.models.mobilenet_v3_small(weights=weights)
model.eval()

# Trace with example input
example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)

# CoreML conversion (targeting Neural Engine)
mlmodel = ct.convert(
    traced_model,
    inputs=[ct.ImageType(
        name="input",
        shape=example_input.shape,
        color_layout=ct.colorlayout.RGB,
    )],
    compute_units=ct.ComputeUnit.ALL,  # Auto-select ANE + GPU + CPU
    minimum_deployment_target=ct.target.iOS17,
)
mlmodel.save("mobilenet_v3_small.mlpackage")
print("CoreML model saved - Neural Engine optimization applied")
```
Qualcomm Hexagon DSP
The Hexagon DSP embedded in Qualcomm Snapdragon is the heart of smartphone AI processing.
- Hexagon NPU (Snapdragon 8 Gen 3): 98 TOPS
- HVX (Hexagon Vector eXtensions): SIMD vector operations
- HTA (Hexagon Tensor Accelerator): dedicated matrix/tensor acceleration
TensorFlow/PyTorch models can be deployed to Hexagon via the Qualcomm Neural Processing SDK (SNPE).
Raspberry Pi 5 AI HAT
The Raspberry Pi AI HAT+ is an edge AI accelerator featuring the Hailo-8L chip.
- Hailo-8L: 13 TOPS
- Connects to RPi 5 via M.2 interface
- Price: ~$70
- Use cases: real-time video analysis, object detection
6. Memory Technology: HBM3e vs GDDR7
HBM (High Bandwidth Memory) Architecture
HBM is a memory technology that stacks DRAM dies vertically (3D stacking) and connects them to the GPU through a silicon interposer.
| Memory | Bandwidth | Capacity | Power | Pin count | Primary Use |
|---|---|---|---|---|---|
| HBM2e | 3.2TB/s | up to 80GB | ~460W | 1,024 | A100 |
| HBM3 | 3.35TB/s | up to 80GB | ~700W | 1,024 | H100 |
| HBM3e | 4.8TB/s | up to 141GB | ~700W | 1,024 | H200, MI300X |
| GDDR6X | 576GB/s | up to 24GB | Low | 384 | RTX 4090 |
| GDDR7 | 1,792GB/s | up to 32GB | Low | 512 | RTX 5090 |
There are three primary reasons HBM is advantageous for AI training.
- Bandwidth: Several times the memory bandwidth of even the fastest GDDR configurations, which directly relieves memory bottlenecks during large-batch training.
- Capacity: 80-141GB on a single package lets a 70B-parameter model run on one GPU (in FP16 on H200, or with quantization on 80GB parts).
- Energy efficiency: Lower power consumption per byte than GDDR improves TCO.
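These trade-offs are captured by the roofline model: a kernel's attainable throughput is the lesser of peak compute and arithmetic intensity times memory bandwidth. A sketch using the H100's published BF16 peak and HBM3 bandwidth (the matrix shapes are illustrative):

```python
def matmul_intensity(m: int, n: int, k: int, bytes_per_el: int = 2) -> float:
    """FLOPs per byte moved for an (m x k) @ (k x n) matmul (read A, B; write C)."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)
    return flops / bytes_moved

def attainable_tflops(intensity: float, peak_tflops: float, bw_tb_s: float) -> float:
    """Roofline: min(compute roof, bandwidth roof)."""
    return min(peak_tflops, intensity * bw_tb_s)

i_big = matmul_intensity(4096, 4096, 4096)   # large square matmul
i_thin = matmul_intensity(1, 4096, 4096)     # decode-style GEMV
print(attainable_tflops(i_big, 1979, 3.35))   # hits the BF16 compute roof
print(attainable_tflops(i_thin, 1979, 3.35))  # stuck at the bandwidth roof
```

A large matmul is compute-bound, but the batch-1 GEMV typical of decoding has an arithmetic intensity near 1 FLOP/byte, so its throughput is set almost entirely by memory bandwidth, which is why HBM matters so much for inference.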
Near-Memory Computing
Near-memory computing (also called Processing-in-Memory, PIM) places compute units inside the memory itself. Samsung HBM-PIM and SK Hynix AiM (Accelerator in Memory) are representative examples.
- Minimizes data movement between memory and compute units
- Fundamentally resolves memory bandwidth bottlenecks
- Especially effective for memory-bound operations during inference
CXL (Compute Express Link)
CXL is a next-generation interconnect standard that connects CPUs, accelerators, and memory expansion devices over a PCIe physical layer.
- CXL 1.1: Type 1 (accelerator), Type 2 (accelerator + memory), Type 3 (memory expansion)
- CXL 2.0: Multi-host sharing with switching support
- CXL 3.0: P2P communication, fabric support
Attempts to solve GPU VRAM shortages using CXL Type 3 memory expansion in AI servers are increasing.
7. Hardware Selection Guide
Training vs Inference
Optimal hardware differs by workload type.
Large-scale training (Pre-training)
- Best: H100 SXM5 (NVLink required), TPU v4 Pod
- Reason: High MFU (Model FLOP Utilization), fast collective communication via NVLink/ICI
- Batch size: As large as possible (global batch of millions of tokens)
Fine-tuning
- Best: H100/A100, AMD MI300X, Gaudi 3
- Reason: Mid-scale GPU clusters, cost efficiency
- Batch size: Medium (512–4096 tokens)
Large-scale inference (Serving, high throughput)
- Best: H100, Inferentia 2, Gaudi 3
- Reason: Large KV cache capacity, high throughput
- Batch size: Dynamic (continuous batching)
Low-latency inference (Latency-critical)
- Best: Groq LPU, Cerebras CS-3
- Reason: Deterministic execution, no memory bottlenecks
- Batch size: Small (1–8)
VRAM Requirements by Model Size (Inference)
| Model Size | Parameters | FP16 VRAM | Minimum GPU (BF16) |
|---|---|---|---|
| Small | 7B | 14GB | 1x A10G (24GB) |
| Medium | 13B | 26GB | 1x A100 (40GB) |
| Large | 34B | 68GB | 2x A100 (80GB) |
| XL | 70B | 140GB | 2x H100 (80GB) |
| XXL | 405B | 810GB | 16x H100 (2 nodes) |
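The FP16 column is just parameters x 2 bytes; a practical estimate should also reserve headroom for the KV cache and activations. A rule-of-thumb helper (the 20% overhead factor is an assumption for illustration; real KV-cache size grows with batch size and context length):

```python
def inference_vram_gb(params_billions: float,
                      bytes_per_param: int = 2,
                      overhead: float = 1.2) -> float:
    """Rough VRAM need for serving: weights x dtype size x headroom factor.
    The 1.2x overhead is an assumed margin for KV cache and activations."""
    return params_billions * bytes_per_param * overhead

for size in (7, 13, 34, 70, 405):
    print(f"{size}B -> {inference_vram_gb(size):.0f} GB")
```

Dropping `bytes_per_param` to 1 (INT8) or 0.5 (4-bit) shows why quantization often halves or quarters the GPU count required.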
PyTorch Device Selection and Benchmarking
```python
# PyTorch device selection and benchmarking
import torch
import time

def benchmark_matmul(device_name: str, size: int = 4096, dtype=torch.float16):
    """Matrix multiplication benchmark"""
    device = torch.device(device_name)
    a = torch.randn(size, size, dtype=dtype, device=device)
    b = torch.randn(size, size, dtype=dtype, device=device)

    def sync():
        # Wait for async kernels so wall-clock timing is accurate
        if device.type == 'cuda':
            torch.cuda.synchronize()
        elif device.type == 'mps':
            torch.mps.synchronize()

    # Warm-up
    for _ in range(5):
        _ = torch.matmul(a, b)
    sync()

    start = time.perf_counter()
    for _ in range(100):
        c = torch.matmul(a, b)
    sync()
    elapsed = time.perf_counter() - start

    ops = 2 * size ** 3 * 100  # FLOPs for 100 matmuls
    tflops = ops / elapsed / 1e12
    print(f"{device_name} ({dtype}): {tflops:.2f} TFLOPS ({elapsed*1000/100:.2f} ms/iter)")

# Auto-select available devices
if torch.cuda.is_available():
    benchmark_matmul("cuda:0", dtype=torch.float16)
    benchmark_matmul("cuda:0", dtype=torch.bfloat16)
if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    benchmark_matmul("mps", dtype=torch.float16)
benchmark_matmul("cpu", dtype=torch.float32)
```
torch.compile for Hardware Optimization
```python
# Hardware optimization with torch.compile
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=1024, nhead=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ff(x))
        return x

model = TransformerBlock().cuda().to(torch.bfloat16)

# torch.compile: automatic optimization with Triton kernels
# Leverages Hopper-specific FlashAttention on H100
compiled_model = torch.compile(model, mode="max-autotune")

x = torch.randn(8, 512, 1024, dtype=torch.bfloat16, device="cuda")

# First run triggers compilation (takes a few seconds)
with torch.autocast("cuda", dtype=torch.bfloat16):
    out = compiled_model(x)
print(f"Output shape: {out.shape}")
```
Cost Efficiency Analysis (Cloud hourly prices, 2025)
| Instance | GPU | Hourly Price | TFLOPS (BF16) | $/PFLOP-hr |
|---|---|---|---|---|
| p4d.24xlarge | 8x A100 40GB | $32.77 | 8 x 312 = 2,496 | $13.1 |
| p4de.24xlarge | 8x A100 80GB | $40.96 | 8 x 312 = 2,496 | $16.4 |
| p5.48xlarge | 8x H100 80GB | $98.32 | 8 x 1,979 = 15,832 | $6.2 |
| trn1.32xlarge | 16x Trainium | $21.50 | 16 x 420 = 6,720 | $3.2 |
| inf2.48xlarge | 12x Inferentia2 | $12.98 | 12 x 384 = 4,608 | $2.8 |
| g6.48xlarge | 8x L40S 48GB | $16.29 | 8 x 733 = 5,864 | $2.8 |
For inference workloads, Inferentia 2 and Trainium offer the best cost efficiency.
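The cost-efficiency column is simply the hourly price divided by the instance's aggregate compute in PFLOPS, which can be checked directly:

```python
def dollars_per_pflop_hour(hourly_usd: float, n_chips: int,
                           tflops_per_chip: float) -> float:
    """Hourly price per petaFLOP of aggregate dense BF16 compute."""
    total_pflops = n_chips * tflops_per_chip / 1000
    return hourly_usd / total_pflops

print(f"p5.48xlarge:   ${dollars_per_pflop_hour(98.32, 8, 1979):.1f}/PFLOP-hr")
print(f"inf2.48xlarge: ${dollars_per_pflop_hour(12.98, 12, 384):.1f}/PFLOP-hr")
```

Keep in mind this metric assumes you can saturate the chips; real-world MFU differences between platforms can easily outweigh a 2x gap in paper cost efficiency.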
8. Comprehensive Hardware Comparison
| Accelerator | Type | BF16 TFLOPS | Memory | Bandwidth | TDP | Primary Use |
|---|---|---|---|---|---|---|
| H100 SXM5 | GPU | 1,979 | 80GB HBM3 | 3.35TB/s | 700W | Training/Inference |
| H200 SXM5 | GPU | 1,979 | 141GB HBM3e | 4.8TB/s | 700W | Large model inference |
| A100 SXM4 | GPU | 312 | 80GB HBM2e | 2.0TB/s | 400W | General purpose |
| AMD MI300X | GPU | 1,307 | 192GB HBM3 | 5.3TB/s | 750W | Large models |
| TPU v5e | ASIC | 197 (INT8: 394) | 16GB HBM | 819GB/s | 90W | Large-scale inference |
| Cerebras WSE-3 | ASIC | 125,000 | 44GB SRAM | 21PB/s | 23kW/system | Ultra-large training |
| Groq LPU | ASIC | 188 (INT8: 750) | 230MB SRAM | 80TB/s | 300W | Low-latency inference |
| Gaudi 3 | ASIC | 1,835 | 128GB HBM2e | 3.7TB/s | 900W | Cost-efficient training |
| Inferentia 2 | ASIC | 384 | 32GB HBM | 820GB/s | 75W | Cloud inference |
| Apple M4 ANE | Edge | 38 TOPS | Shared | Shared | ~10W | On-device |
| Hailo-8L | Edge | 13 TOPS | — | — | 1W | Embedded |
Quiz
Q1. How does NVIDIA H100's Transformer Engine maintain precision during FP8 training?
Answer: Dynamic Scaling combined with mixed-precision accumulation
Explanation: The Transformer Engine tracks statistics (maximum value) of activations and weights per layer. From these, it computes an optimal scale factor for FP8 quantization. The forward pass is executed in FP8, but gradient accumulation is maintained in BF16/FP32. The engine also monitors the numerical range per layer and automatically rescales when overflow or underflow is detected. Thanks to this Delayed Scaling mechanism, FP8 speed benefits are achieved while maintaining training stability close to BF16.
Q2. How does Google TPU's systolic array parallelize matrix multiplication?
Answer: Pipeline-style MAC operation array with data reuse
Explanation: A systolic array consists of NxN MAC (Multiply-Accumulate) units arranged in a grid. Row data from matrix A flows left to right, and column data from matrix B flows top to bottom. Each cell multiplies the two values passing through it and adds the result to the accumulated value from the previous cell. Because data flows like waves (systoles), each data element is reused by all relevant cells in the array. TPU v4's 128x128 MXU performs 128x128 = 16,384 MAC operations per clock cycle, all processed on-chip without memory accesses.
Q3. Why is HBM better than GDDR for AI training (bandwidth vs capacity)?
Answer: HBM holds advantages in both bandwidth and capacity
Explanation: On the bandwidth side, HBM3e (H200) delivers 4.8TB/s while GDDR7 (RTX 5090) delivers about 1.8TB/s, roughly a 2.7x gap, and the gap over earlier GDDR6X parts exceeds 5x. AI training has many bandwidth-bound operations, so this difference translates directly into performance. On the capacity side, the H200's 141GB HBM3e is more than 4x the RTX 5090's 32GB GDDR7, allowing 70B parameter models to be served from a single GPU. Structurally, HBM vertically stacks DRAM dies and connects them to the GPU through a very wide interposer bus (1,024 pins per stack), achieving both high bandwidth and energy efficiency simultaneously.
Q4. How does Cerebras WSE-3's wafer-scale integration eliminate inter-chip communication bottlenecks?
Answer: All cores connected through an on-chip fabric on a single wafer
Explanation: In conventional GPU clusters, hundreds of chips are connected via NVLink, InfiniBand, and similar networks. This inter-chip communication has latency in the microsecond range and limited bandwidth. WSE-3's 900,000 AI cores all reside on a single wafer, so all inter-core communication flows through on-chip fabric. On-chip fabric latency is in the nanosecond range and bandwidth reaches 220Pb/s. Additionally, 44GB of SRAM is distributed near the cores, minimizing memory access latency. This enables near-linear scaling for large model training with almost no communication overhead.
Q5. What architectural decisions allow Groq LPU to achieve lower latency than GPUs for LLM inference?
Answer: Deterministic memory scheduling at compile time
Explanation: The primary causes of high LLM inference latency on GPUs are irregular memory access patterns and runtime dynamic scheduling. The Groq LPU statically determines all tensor memory locations and movement paths at compile time. There is no memory allocation/deallocation or scheduler overhead during execution. The SRAM-based memory architecture also eliminates the irregular access latency of DRAM. All operations execute at predetermined clock cycles, making latency fully predictable. Thanks to this deterministic execution, Groq achieves over 240 tokens per second throughput and very low time-to-first-token (TTFT) latency for LLaMA-3 70B.
Conclusion
The AI hardware accelerator market is diversifying rapidly between 2024 and 2026. While NVIDIA H100/H200 remain the gold standard for training workloads, purpose-optimized accelerators demonstrate advantages in specific use cases.
The core selection principles are as follows.
- Training: Bandwidth and NVLink are critical — H100 SXM5, TPU v4 Pod
- High-throughput inference: Cost efficiency matters — Inferentia 2, Gaudi 3, TPU v5e
- Low-latency inference: Deterministic execution — Groq LPU
- Edge deployment: Power efficiency — Apple ANE, Qualcomm Hexagon
- Ultra-large training: No inter-chip bottleneck — Cerebras WSE-3
Hardware selection ultimately balances workload characteristics, budget, and ecosystem maturity. The maturity of the NVIDIA ecosystem remains a powerful advantage, but purpose-built ASICs can be far more economical for specific workloads.