GPU Hardware Complete Guide for AI: From Architecture to Selection Criteria


Introduction

The GPU is the undisputed workhorse behind the explosive growth of AI and deep learning. Training large language models like GPT-4, Llama 3, or Gemini requires thousands of GPUs running for weeks to months. So why are GPUs so critical for AI? Which GPU should you choose?

This guide covers everything about GPU hardware for AI engineers, researchers, and ML practitioners — from architectural fundamentals to the latest Blackwell GPUs, cloud service comparisons, and practical selection guidance.


1. GPU vs CPU: Why AI Training Needs GPUs

Parallel Computing: The Essence of AI

The core operation in deep learning is matrix multiplication. Both forward and backward passes through neural networks consist of billions of multiplications and additions. These operations are independent of each other and thus perfectly suited for parallelization.

CPUs are optimized for high-performance serial processing. A typical server CPU has 64 to 128 cores, each equipped with complex control logic, large caches, and branch prediction. This excels at sequential tasks, complex conditional branching, and operating system management.

GPUs, in contrast, pack thousands to tens of thousands of small cores that simultaneously perform the same operation in SIMD (Single Instruction, Multiple Data) fashion. The NVIDIA H100 has an astounding 16,896 CUDA cores. For workloads that repeat the same operation — like matrix multiplication — a GPU can deliver hundreds of times more throughput than a CPU.

FLOPS: The Measure of Compute Performance

The most common unit in deep learning performance discussions is FLOPS (Floating Point Operations Per Second).

  • TFLOPS (teraflops): 1 trillion floating-point operations per second
  • PFLOPS (petaflops): 1 quadrillion floating-point operations per second

Modern AI workloads primarily use these precisions:

  • FP32 (single-precision float): Master weight storage during training
  • FP16 (half-precision float): Mixed-precision training
  • BF16 (Brain Float 16): More stable training than FP16
  • TF32 (TensorFloat-32): Supported since NVIDIA A100
  • FP8: Supported in Hopper and Blackwell; used for both inference and training
  • FP4: New in Blackwell; ultra-high-density inference

The NVIDIA H100's FP16 Tensor Core performance reaches a staggering 989 TFLOPS — roughly 1 PFLOPS.

Memory Bandwidth: The True Bottleneck

Many AI workloads are limited not by compute, but by memory bandwidth — a condition called memory-bound.

Take Large Language Model (LLM) inference as an example: every time you generate a token, the full model weights must be read from memory. A Llama 3 70B model in FP16 takes roughly 140GB of memory. Generating dozens of tokens per second requires reading terabytes per second from memory.
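This back-of-the-envelope bound is easy to compute: if every weight is read once per generated token, throughput can never exceed bandwidth divided by model size. A minimal sketch using the numbers from this section (batch size 1, KV-cache traffic ignored):

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Memory-bandwidth ceiling on decode speed: each generated token
    streams the full weights from HBM once (batch size 1)."""
    return bandwidth_gb_s / model_size_gb

# Llama 3 70B in FP16 (~140GB) against H100 SXM bandwidth (3,350 GB/s).
# (140GB exceeds one H100's 80GB, so in practice the model is sharded
#  across GPUs, but the aggregate-bandwidth argument is the same.)
print(round(max_tokens_per_sec(3350, 140), 1))  # 23.9 -- tokens/s ceiling
```

Batching amortizes the weight reads across many requests, which is why serving systems push batch sizes up until compute becomes the limit.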

Latest GPU memory bandwidths:

  • NVIDIA A100 SXM: 2,000 GB/s
  • NVIDIA H100 SXM: 3,350 GB/s
  • NVIDIA H200 SXM: 4,800 GB/s
  • NVIDIA B200: 8,000 GB/s

This is why HBM (High Bandwidth Memory) has been adopted in datacenter GPUs. HBM provides far higher bandwidth than conventional GDDR memory.


2. History of NVIDIA GPU Architecture

Pascal Architecture (2016): The Start of the AI Renaissance

Named after the mathematician and philosopher Blaise Pascal, the Pascal architecture launched in 2016. The GTX 1080 (consumer) and P100 (datacenter) used this architecture.

P100 key specs:

  • CUDA cores: 3,584
  • FP32 performance: 9.3 TFLOPS
  • FP16 performance: 18.7 TFLOPS
  • Memory: 16GB HBM2, 720 GB/s

The P100 was the first datacenter GPU to adopt HBM2 memory. NVLink 1.0 was introduced in this era as well. AlphaGo defeated Lee Sedol around this time, and the AI boom began in earnest.

Volta Architecture (2017): The Arrival of Tensor Cores

The Volta architecture marks a turning point in GPU history. The V100, released in 2017, introduced Tensor Cores to the world for the first time. Tensor Cores are dedicated hardware units that accelerate matrix multiplication.

V100 key specs:

  • CUDA cores: 5,120
  • 1st-gen Tensor Cores: 640
  • FP32 performance: 14 TFLOPS
  • FP16 Tensor Core performance: 112 TFLOPS (8x improvement!)
  • Memory: 32GB HBM2, 900 GB/s
  • NVLink 2.0: 300 GB/s

A single Tensor Core performs a 4x4 matrix multiply-accumulate (D = A*B + C) in one cycle. This boosted FP16 performance 8x over FP32, revolutionizing deep learning training speeds.

Turing Architecture (2018): RT Cores and DLSS

The Turing architecture is famous for the consumer RTX series. RT Cores (dedicated ray-tracing units) debuted here, along with DLSS — AI-based image upscaling.

RTX 2080 Ti key specs:

  • CUDA cores: 4,352
  • Tensor Cores: 544 (2nd gen)
  • FP32 performance: 13.4 TFLOPS
  • FP16 Tensor Core: 107 TFLOPS
  • Memory: 11GB GDDR6, 616 GB/s

From an AI standpoint, Turing's significance was INT8 quantized inference support. Quantizing models to INT8 for inference servers delivered 2x faster inference compared to FP16.
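The idea behind INT8 inference can be shown with a minimal sketch of symmetric per-tensor quantization (production stacks such as TensorRT use per-channel scales and calibration data, so treat this as illustrative only):

```python
def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ≈ scale * q, q in [-127, 127]."""
    scale = max(abs(x) for x in w) / 127
    return [round(x / scale) for x in w], scale

def dequantize(q, scale):
    """Recover approximate FP values from the INT8 codes."""
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.02]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)   # close to the original weights
```

The multiply-accumulate work then runs on 8-bit integers, which the INT8 Tensor Core path executes at twice the FP16 rate.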

Ampere Architecture (2020): A100 and 3rd Gen Tensor Cores

The Ampere architecture shifted the paradigm again. The A100 remains a workhorse GPU in many datacenters today.

A100 SXM4 80GB key specs:

  • CUDA cores: 6,912
  • 3rd-gen Tensor Cores: 432
  • FP32 performance: 19.5 TFLOPS
  • FP16 Tensor Core: 312 TFLOPS
  • TF32 Tensor Core: 156 TFLOPS
  • BF16 Tensor Core: 312 TFLOPS
  • INT8 Tensor Core: 624 TOPS
  • Memory: 80GB HBM2e, 2,000 GB/s
  • NVLink 3.0: 600 GB/s

Ampere's key innovations:

TF32 (TensorFloat-32): A hybrid between FP32 and FP16. Exponent bits match FP32 (8 bits); mantissa matches FP16 (10 bits). Existing FP32 code can leverage Tensor Core speeds without modification — balancing numerical stability and speed.
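The precision cost of TF32 can be emulated in pure Python by cutting an FP32 mantissa down to 10 bits (the hardware rounds rather than truncates, so this sketch slightly overstates the error):

```python
import struct

def to_tf32(x: float) -> float:
    """Emulate TF32 by truncating an FP32 mantissa from 23 bits to 10."""
    (bits,) = struct.unpack('<I', struct.pack('<f', x))
    bits &= ~((1 << 13) - 1)          # zero the 13 low mantissa bits
    return struct.unpack('<f', struct.pack('<I', bits))[0]

print(1 / 3)           # 0.3333333333333333 (FP64)
print(to_tf32(1 / 3))  # 0.333251953125    (TF32-level precision)
```

About 3 decimal digits survive, which is generally sufficient for matmul inputs because accumulation still happens in FP32.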

Sparsity support: A100 supports 2:4 structured sparsity in hardware. Setting 50% of model parameters to zero (pruning) lets Tensor Cores exploit this for a 2x additional performance gain. Up to 1,248 TOPS in INT8 theoretically.

Multi-Instance GPU (MIG): A100 can be partitioned into up to 7 independent GPU instances. Useful for running multiple small models in isolated environments on inference servers.

Hopper Architecture (2022): Transformer Engine

The Hopper architecture brought innovations specifically tailored to transformer models. The H100 is currently the most widely deployed top-tier AI training GPU.

H100 SXM5 80GB key specs:

  • CUDA cores: 16,896
  • 4th-gen Tensor Cores: 528
  • FP32 performance: 67 TFLOPS
  • FP16/BF16 Tensor Core: 989 TFLOPS (~1 PFLOPS!)
  • FP8 Tensor Core: 1,979 TFLOPS (~2 PFLOPS)
  • Memory: 80GB HBM3, 3,350 GB/s
  • NVLink 4.0: 900 GB/s
  • TDP: 700W

Hopper's key innovations:

Transformer Engine: Hardware-level optimization for attention and MLP layers in transformer models. Automatically switches between FP8 and FP16 per layer. Up to 9x AI performance over A100 at launch.

FP8 support: Supports E4M3 and E5M2 FP8 formats. 2x Tensor Core performance vs FP16. Half the memory usage.
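The two FP8 formats trade range for precision, and their maximum values fall straight out of the bit layouts (E4M3 here is the common "FN" variant, whose top exponent code is reserved for NaN only):

```python
# E4M3 (FN variant): 4 exponent bits (bias 7), 3 mantissa bits.
# Max normal: mantissa 110 at the top exponent code -> (1 + 6/8) * 2**8
e4m3_max = (1 + 6 / 8) * 2**8      # 448.0

# E5M2: 5 exponent bits (bias 15), 2 mantissa bits, IEEE-style inf/NaN.
# Max normal: (1 + 3/4) * 2**15
e5m2_max = (1 + 3 / 4) * 2**15     # 57344.0

print(e4m3_max, e5m2_max)
```

In practice E4M3 (more mantissa) tends to be used for weights and activations, while E5M2 (more range) suits gradients.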

Thread Block Clusters: Thread blocks within a cluster can access each other's shared memory across SMs (Streaming Multiprocessors), enabling distributed shared memory.

NVLink 4.0: 900 GB/s, 1.5x improvement over the previous generation. Full-mesh connectivity for up to 8 GPUs.

H200 SXM 141GB key specs:

  • Compute: Same as H100
  • Memory: 141GB HBM3e (76% more than H100)
  • Bandwidth: 4,800 GB/s (43% more than H100)
  • Up to 2x faster throughput for LLM inference
  • Fits large models (70B+ LLMs) on a single GPU

Blackwell Architecture (2024): Next-Generation AI Acceleration

The Blackwell architecture, announced in 2024, is NVIDIA's latest.

B200 SXM key specs:

  • FP4 performance: 20 PFLOPS
  • FP8 performance: 9 PFLOPS
  • Memory: 192GB HBM3e, 8,000 GB/s
  • NVLink 5.0: 1,800 GB/s

Blackwell's key innovations:

FP4 support: 4-bit floating-point support enables ultra-high-density inference — more than 2x throughput vs FP8.

2nd Generation Transformer Engine: Automatically manages new precision formats including FP4 and FP6.

NVLink 5.0: 1,800 GB/s, 2x improvement over the previous generation.

GB200 NVL72: 36 Grace CPUs and 72 B200 GPUs integrated into a single rack-scale system. All GPUs connected via NVLink, functioning as one giant GPU. Achieves 1.4 ExaFLOPS (FP4).


3. Tensor Core Deep Dive

CUDA Core vs Tensor Core

A CUDA Core is a general-purpose floating-point execution unit. It processes one FP32 FMA (Fused Multiply-Add) operation per clock cycle.

A Tensor Core is a dedicated matrix multiplication unit. A 1st-gen Tensor Core performs a 4x4 FP16 matrix multiply-accumulate (D = A * B + C) in a single clock cycle. This is equivalent to 64 FP16 multiplications and 64 FP16 additions — 128 FP16 operations per cycle.
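A plain-Python model of that operation makes the arithmetic count concrete (a minimal sketch of the math, not of how the hardware is actually programmed):

```python
def mma_4x4(A, B, C):
    """One Tensor Core-style op: D = A * B + C on 4x4 matrices."""
    n, ops = 4, 0
    D = [row[:] for row in C]          # start from the accumulator C
    for i in range(n):
        for j in range(n):
            for k in range(n):
                D[i][j] += A[i][k] * B[k][j]
                ops += 2               # one multiply + one add

    return D, ops

I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
Z4 = [[0.0] * 4 for _ in range(4)]
D, ops = mma_4x4(I4, I4, Z4)
print(ops)  # 128 operations -- folded into a single Tensor Core cycle
```

What takes 128 scalar operations here is a single-cycle unit on the hardware, which is where the order-of-magnitude speedup comes from.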

Tensor Core evolution by generation:

Gen   Architecture   Supported Precisions                  Matrix Size   Notes
1st   Volta          FP16                                  4x4           First Tensor Core
2nd   Turing         FP16, INT8, INT4                      -             INT support added
3rd   Ampere         FP16, BF16, TF32, INT8, INT4          -             TF32, Sparsity
4th   Hopper         FP16, BF16, TF32, FP8, INT8           -             FP8, Transformer Engine
5th   Blackwell      FP16, BF16, TF32, FP8, FP4            -             FP4 support

WMMA (Warp Matrix Multiply-Accumulate)

To use Tensor Cores directly in CUDA programming, you use the WMMA API:

#include <mma.h>
using namespace nvcuda::wmma;

// One warp computes a 16x16x16 FP16 matrix multiply on Tensor Cores.
// Launch with at least one full warp (32 threads).
__global__ void wmma_gemm_16x16(const half *a_ptr, const half *b_ptr, float *c_ptr) {
    fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;
    fragment<matrix_b, 16, 16, 16, half, col_major> b_frag;
    fragment<accumulator, 16, 16, 16, float> c_frag;

    fill_fragment(c_frag, 0.0f);

    // Load matrices from global memory (leading dimension 16)
    load_matrix_sync(a_frag, a_ptr, 16);
    load_matrix_sync(b_frag, b_ptr, 16);

    // Execute the Tensor Core multiply-accumulate: C = A * B + C
    mma_sync(c_frag, a_frag, b_frag, c_frag);

    // Store the FP32 result
    store_matrix_sync(c_ptr, c_frag, 16, mem_row_major);
}

In practice, cuBLAS and PyTorch handle this automatically.

Mixed Precision Training

Mixed precision training maintains FP32 master weights while performing forward/backward passes in FP16 or BF16.

# PyTorch AMP (Automatic Mixed Precision)
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # loss scaling guards FP16 gradients against underflow

for inputs, targets in dataloader:
    optimizer.zero_grad()
    with autocast(dtype=torch.float16):  # with bfloat16, GradScaler is unnecessary
        output = model(inputs)
        loss = criterion(output, targets)

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

BF16 is more stable than FP16 for training. FP16's 5-bit exponent gives a narrow range, causing potential overflow, while BF16's 8-bit exponent (same as FP32) represents a much wider range.
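The range gap falls straight out of the exponent widths, since the maximum normal value of a float format is (2 − 2^−m) × 2^bias:

```python
# FP16: 5 exponent bits (bias 15), 10 mantissa bits
fp16_max = (2 - 2**-10) * 2**15    # 65504.0
# BF16: 8 exponent bits (bias 127, same as FP32), 7 mantissa bits
bf16_max = (2 - 2**-7) * 2**127    # ~3.39e38, the same range as FP32

print(fp16_max)   # 65504.0 -- any gradient above this overflows to inf in FP16
```

BF16 pays for that range with only ~2-3 decimal digits of mantissa precision, which mixed-precision training tolerates because master weights stay in FP32.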

Sparsity Support

The 2:4 structured sparsity supported from A100 onward groups parameters into sets of 4 and sets exactly 2 of them to zero.

# 2:4 semi-structured sparsity (PyTorch 2.1+, Ampere or newer)
import torch
from torch.sparse import to_sparse_semi_structured

# The weight must already follow the 2:4 pattern (2 zeros in every group of 4);
# NVIDIA's ASP tool (apex.contrib.sparsity) can prune a model into this form.
layer = model.layer
layer.weight = torch.nn.Parameter(to_sparse_semi_structured(layer.weight))

With a 2:4-sparse weight, the same layer can run up to 2x faster at inference (theoretically).


4. GPU Memory Hierarchy

GDDR vs HBM: A Game Changer

GDDR6 (Graphics DDR6): Used in consumer GPUs. Separate chips mounted outside the package. RTX 4090: 24GB GDDR6X, 1,008 GB/s.

HBM2e (High Bandwidth Memory 2e): Used in datacenter GPUs. Stacked next to the GPU die in 2.5D packaging, connected via a silicon interposer. A100: 80GB HBM2e, 2,000 GB/s.

HBM3: Deployed in H100. 80GB, 3,350 GB/s.

HBM3e: Deployed in H200. 141GB, 4,800 GB/s. Also in B200: 192GB, 8,000 GB/s.

Why is HBM faster? HBM stacks multiple layers of DRAM dies vertically, connected by thousands of fine Through-Silicon Vias (TSVs). The GPU die and HBM stack sit side by side on a silicon interposer, providing ultra-wide bandwidth over extremely short distances.

Memory Hierarchy Structure

Registers (Register File)
    └── Fastest; tens to hundreds per thread
L1 Cache / Shared Memory
    └── Shared within an SM (Streaming Multiprocessor)
    └── H100: 228KB per SM
L2 Cache
    └── Shared across all SMs
    └── H100: 50MB
HBM (Main Memory)
    └── Accessible by all SMs
    └── H100: 80GB

The key to kernel optimization is keeping data in shared memory as long as possible to minimize slow HBM accesses. Flash Attention is a prime example of applying this principle to attention computation.
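The tiling idea behind such kernels can be sketched in plain Python: compute the product block by block, so each loaded tile is reused across a whole block of outputs (in a real CUDA kernel the tiles would be staged in shared memory):

```python
def tiled_matmul(A, B, tile=2):
    """Blocked matmul: each (tile x tile) block of A and B is reused for a
    whole block of C, cutting trips to slow main memory."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):          # one tile pair per step
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        for k in range(k0, k0 + tile):
                            C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
C = tiled_matmul(A, A)
```

With tiling, each element of A and B is fetched from main memory roughly n/tile times instead of n times, which is exactly the arithmetic-intensity win shared memory provides.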

ECC Memory

ECC (Error-Correcting Code) memory detects and corrects bit errors. Datacenter GPUs (A100, H100, etc.) support ECC. Consumer GPUs (RTX 4090) do not.

Memory errors during long training runs can cause training to diverge or produce NaN values. ECC-capable GPUs are recommended for critical training jobs. On GDDR-based GPUs, enabling ECC reduces usable capacity by roughly 6.25%; HBM-based datacenter GPUs carry dedicated ECC storage, so no capacity is lost.


5. GPU Interconnects: NVLink, NVSwitch, and InfiniBand

The PCIe Bottleneck

A standard PCIe 4.0 x16 slot offers up to 32 GB/s bandwidth (64 GB/s bidirectional). In multi-GPU training, gradient synchronization creates a bottleneck through this interface.

Consider All-Reduce across 4 GPUs: each GPU must exchange gradients with the other 3. A 10B parameter model carries ~40GB of FP32 gradients. Exchanging this over PCIe could take tens of seconds.
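In the bandwidth-optimal ring all-reduce, each GPU sends and receives 2(N−1)/N of the gradient buffer, so even the ideal case is gated by link bandwidth. A quick sketch of that lower bound (real PCIe topologies that route through the host CPU fare considerably worse, and this cost recurs every optimizer step):

```python
def ring_allreduce_seconds(size_gb, n_gpus, link_gb_s):
    """Lower-bound time for one ring all-reduce over the slowest link."""
    per_gpu_traffic = 2 * (n_gpus - 1) / n_gpus * size_gb   # GB sent per GPU
    return per_gpu_traffic / link_gb_s

# 40GB of FP32 gradients, 4 GPUs over PCIe 4.0 x16 (32 GB/s each direction)
print(ring_allreduce_seconds(40, 4, 32))  # 1.875 s -- every training step
```

Swapping the 32 GB/s PCIe link for NVLink 4.0's 450 GB/s per direction shrinks the same bound by more than an order of magnitude.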

NVLink bandwidth by generation:

NVLink   Architecture   Per-Link (bidirectional)   Per-GPU Total (bidirectional)
1.0      Pascal         40 GB/s                    160 GB/s (4 links)
2.0      Volta          50 GB/s                    300 GB/s (6 links)
3.0      Ampere         50 GB/s                    600 GB/s (12 links)
4.0      Hopper         50 GB/s                    900 GB/s (18 links)
5.0      Blackwell      100 GB/s                   1,800 GB/s (18 links)

NVLink 4.0 provides up to 900 GB/s bidirectional bandwidth between GPU pairs — more than 14x faster than PCIe 4.0 x16.

NVSwitch: All-to-All Connectivity

NVLink connects pairs of GPUs directly, but connecting 8 or more GPUs requires NVSwitch — a dedicated GPU interconnect switch chip that lets all connected GPUs communicate directly at full NVLink bandwidth.

DGX H100 system configuration:

  • 8× H100 SXM5 GPUs
  • 4× 3rd-gen NVSwitch chips
  • All GPU pairs directly connected at 900 GB/s
  • NVLink All-to-All total bandwidth: 7.2 TB/s

DGX A100 vs DGX H100

DGX A100:

  • GPUs: 8× A100 80GB
  • NVLink total bandwidth: 4.8 TB/s
  • GPU memory: 640GB
  • AI performance: 5 PFLOPS (FP16)

DGX H100:

  • GPUs: 8× H100 80GB
  • NVLink total bandwidth: 7.2 TB/s
  • GPU memory: 640GB
  • AI performance: 32 PFLOPS (FP8)
  • Approximately 6.4x performance over DGX A100

InfiniBand: Inter-Node Connectivity

NVLink handles intra-node connectivity; InfiniBand (IB) networking connects multiple server nodes. NVIDIA ConnectX-7 NICs with InfiniBand NDR (400 Gb/s) minimize inter-server communication latency.

Large-scale LLM training requires connecting thousands of GPUs. Meta's Llama 3 training used 16,000 H100s, all interconnected by a massive InfiniBand fabric.


6. Detailed AI GPU Comparison

NVIDIA A100 (80GB HBM2e)

Released in 2020, the A100 remains the standard for many AI workloads. Offers FP16 312 TFLOPS, BF16 312 TFLOPS, and TF32 156 TFLOPS.

The SXM4 form factor supports NVLink 3.0 for up to 8 GPUs; a PCIe 4.0 version also exists. MIG (Multi-Instance GPU) partitions the A100 into up to 7 independent instances.

Approximate cloud hourly rate: ~$32.77/h for AWS p4d.24xlarge (8× A100).

NVIDIA H100 (80GB HBM3)

Released 2022. Currently the most widely deployed high-end AI training GPU.

SXM5 version:

  • FP16/BF16 Tensor Core: 989 TFLOPS
  • FP8 Tensor Core: 1,979 TFLOPS
  • Memory: 80GB HBM3, 3,350 GB/s
  • TDP: 700W
  • NVLink 4.0: 900 GB/s

PCIe version:

  • FP16/BF16 Tensor Core: 756 TFLOPS
  • Memory: 80GB HBM3, 2,000 GB/s
  • TDP: 350W

H100 vs A100:

  • Tensor Core performance: 3.2x (FP16)
  • FP8 (H100) vs INT8 (A100): ~3.2x
  • Memory bandwidth: 1.7x (HBM3)
  • NVLink bandwidth: 1.5x

Approximate cloud hourly rate: ~$98.32/h for AWS p5.48xlarge (8× H100).

NVIDIA H200 (141GB HBM3e)

A memory-upgraded H100. Compute performance is identical to H100, but memory capacity and bandwidth are dramatically improved.

  • Memory: 141GB HBM3e (76% more than H100)
  • Bandwidth: 4,800 GB/s (43% more than H100)
  • Up to 2x faster throughput vs H100 for LLM inference
  • Fits large models (70B+ LLMs) on a single GPU

NVIDIA B100 / B200 (Blackwell)

Announced 2024. Still in early deployment.

B200 SXM:

  • FP4 Tensor Core: 20 PFLOPS
  • FP8 Tensor Core: 9 PFLOPS
  • FP16/BF16 Tensor Core: 4.5 PFLOPS
  • Memory: 192GB HBM3e, 8,000 GB/s
  • TDP: 1,000W

B200 vs H100:

  • FP8 performance: 4.5x
  • Memory: 2.4x
  • Bandwidth: 2.4x

NVIDIA GB200 NVL72 (Rack-Scale AI)

The GB200 superchip combines one Grace CPU (Arm-based) with two B200 GPUs in a single package. GB200 NVL72 integrates 36 Grace CPUs and 72 B200 GPUs into a single rack system.

GB200 NVL72 specs:

  • GPUs: 72× B200
  • CPUs: 36× Grace CPU
  • GPU memory: 13.8TB HBM3e
  • NVLink 5.0 All-to-All connectivity
  • AI performance: 1.4 ExaFLOPS (FP4), 720 PFLOPS (FP8)
  • Total power: 120kW

This effectively operates as one enormous GPU. A single rack can handle large models like Llama 3 405B at high throughput.

GeForce RTX 4090 (Consumer)

The best consumer-grade GPU for AI startups and individual researchers.

  • CUDA cores: 16,384
  • FP32 performance: 82.6 TFLOPS
  • FP16 Tensor Core: ~330 TFLOPS (approximate)
  • Memory: 24GB GDDR6X, 1,008 GB/s
  • TDP: 450W
  • Price: ~$1,599 (MSRP)

vs H100 SXM:

  • Tensor Core: ~1/3 the performance
  • Memory: 24GB vs 80GB
  • Bandwidth: 1,008 vs 3,350 GB/s
  • ECC: Not supported
  • NVLink: Not supported (PCIe only)
  • Price: ~1/20th (H100 is $30,000+)

AMD MI300X

AMD's datacenter AI GPU.

  • Compute Units (CU): 304
  • FP16 performance: 1,307 TFLOPS
  • BF16 performance: 1,307 TFLOPS
  • FP8 performance: 2,614 TFLOPS
  • Memory: 192GB HBM3, 5,300 GB/s
  • TDP: 750W

The MI300X has a significant memory capacity (192GB vs 80GB) and bandwidth (5,300 vs 3,350 GB/s) advantage over the H100. It particularly excels in LLM inference.

GPU Performance Comparison Table

GPU          FP16 TFLOPS   Memory         Bandwidth    TDP      Year
A100 SXM     312           80GB HBM2e     2,000 GB/s   400W     2020
RTX 4090     ~330          24GB GDDR6X    1,008 GB/s   450W     2022
H100 SXM     989           80GB HBM3      3,350 GB/s   700W     2022
MI300X       1,307         192GB HBM3     5,300 GB/s   750W     2023
H200 SXM     989           141GB HBM3e    4,800 GB/s   700W     2024
B200 SXM     4,500         192GB HBM3e    8,000 GB/s   1,000W   2024

7. AMD GPU for AI

The ROCm Ecosystem

AMD's AI software stack is ROCm (Radeon Open Compute), an open-source platform that fills the role CUDA fills for NVIDIA GPUs. Compatibility with major frameworks such as PyTorch and TensorFlow has improved significantly in recent years.

ROCm's CUDA-equivalent components:

  • HIP (Heterogeneous-compute Interface for Portability): CUDA C++ equivalent
  • rocBLAS: cuBLAS equivalent (matrix operations)
  • MIOpen: cuDNN equivalent (deep learning primitives)
  • RCCL: NCCL equivalent (GPU communication)

Installing PyTorch with ROCm support:

# Install PyTorch with ROCm support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0

MI300X Key Features

The MI300X is AMD's current flagship AI GPU. It uses the CDNA 3 architecture with advanced packaging (MCM: Multi-Chip Module) that integrates GPU dies and HBM stacks in 3D.

The MI300X's 192GB HBM3 and 5,300 GB/s bandwidth can outperform the H100 in some LLM inference scenarios — especially memory-bound workloads (large batches, long sequences).

Microsoft Azure, Oracle Cloud, and others have begun offering MI300X instances. Major tech companies including Meta and Microsoft are actively adopting AMD GPUs.

AMD vs NVIDIA Software Ecosystem

Honestly, the NVIDIA CUDA software ecosystem still dominates AI today.

  • NVIDIA-exclusive libraries: cuDNN, cuBLAS, TensorRT, NCCL, NVTX, etc.
  • FlashAttention was originally CUDA-only (ROCm port came later)
  • Much research code assumes CUDA
  • ROCm is rapidly closing the gap but isn't fully compatible yet

Choosing AMD GPUs for production may require additional engineering time to address software compatibility issues.


8. Cloud GPU Service Comparison

AWS GPU Instances

p3 series (V100):

  • p3.2xlarge: 1× V100, $3.06/h
  • p3.16xlarge: 8× V100, $24.48/h

p4d series (A100):

  • p4d.24xlarge: 8× A100, 320GB HBM2, $32.77/h

p5 series (H100):

  • p5.48xlarge: 8× H100, 640GB HBM3, $98.32/h

AWS Trainium (Trn1):

  • AWS-proprietary AI training chip (first-generation Trainium)
  • Trn1.32xlarge: 16× Trainium, $21.50/h
  • Better price-performance for LLM training vs H100

AWS Inferentia (Inf2):

  • Inference-only chip
  • Inf2.48xlarge: 12× Inferentia2, $12.98/h
  • Optimized for Llama 2 70B inference

Spot instances can save 60-90% vs on-demand. Checkpointing is essential since instances can be interrupted.
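Since a spot instance can vanish mid-run, checkpoints should be written atomically and training should resume from the last one. A framework-agnostic sketch (the file name is hypothetical; with PyTorch you would torch.save the model and optimizer state_dicts instead of pickling a plain dict):

```python
import os
import pickle

CKPT = "checkpoint.pkl"  # hypothetical path

def save_checkpoint(step, state):
    """Write-then-rename so an interruption never leaves a torn file."""
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, CKPT)   # atomic on POSIX filesystems

def load_checkpoint():
    """Resume from the last checkpoint, or start fresh."""
    if not os.path.exists(CKPT):
        return 0, None
    with open(CKPT, "rb") as f:
        d = pickle.load(f)
    return d["step"], d["state"]

save_checkpoint(1000, {"lr": 3e-4})
step, state = load_checkpoint()
```

Checkpoint frequency is a trade-off: more often means less lost work on interruption, but more time spent writing to storage.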

Google Cloud GPU Instances

A100 instances:

  • a2-highgpu-1g: 1× A100 (40GB), $3.67/h
  • a2-megagpu-16g: 16× A100, $55.74/h

H100 instances:

  • a3-highgpu-8g: 8× H100, ~$19-25/h (varies by region)

Google TPU v4/v5:

  • TPU v4: AI training-optimized ASIC, 275 TFLOPS/chip
  • TPU v5e: Optimized for large-scale inference
  • TPU v5p: Latest training-focused, 459 TFLOPS/chip
  • Best compatibility with Google's JAX framework

Azure GPU Instances

ND H100 v5:

  • Standard_ND96isr_H100_v5: 8× H100
  • Inter-node connectivity via InfiniBand NDR

NCas_T4_v3:

  • T4-based inference instances
  • Standard_NC64as_T4_v3: 4× T4, $4.35/h

Lambda Labs, CoreWeave, Vast.ai

Cloud startups offer GPUs more cheaply than AWS/GCP/Azure.

Lambda Labs:

  • H100 SXM5 8× instance: $26.80/h (73% cheaper than AWS p5)
  • Lambda Cloud is tailored for AI researchers

CoreWeave:

  • Professional GPU cloud
  • H100 single: $2.89/h
  • Large cluster configurations available

Vast.ai:

  • GPU marketplace (individuals/companies renting out GPUs)
  • H100 at ~$2-3/h (market price)
  • Suitable for experimental training where security sensitivity is low

Cloud GPU Cost Estimation

LLM training cost example (Llama 3 8B-class model, 8× A100, 100B tokens):

Training time estimate:
- Compute budget: ~6 x 8B params x 100B tokens ≈ 4.8e21 FLOPs
- 8× A100 at 312 TFLOPS BF16 and ~40% utilization ≈ 1 PFLOPS sustained
- ~4.8 million seconds ≈ 55 days (~1,340 hours of 8-GPU time)
- p4d.24xlarge: $32.77/h x ~1,340h = ~$43,900

Cost-saving strategies:
1. Spot instances: ~70% savings -> ~$13,200
2. Lambda Labs: $11.60/h x ~1,340h = ~$15,500
3. CoreWeave: even cheaper possible
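Any such estimate can be cross-checked with the standard ~6·N·D FLOPs rule of thumb; note how sensitive the result is to the assumed MFU (model FLOPs utilization), which is an assumption here, not a measured number:

```python
def training_estimate(n_params, n_tokens, peak_flops_per_gpu, n_gpus,
                      mfu, usd_per_hour):
    """Estimate wall-clock time and cost for one training run.

    Uses the ~6 * N * D FLOPs rule of thumb; mfu (model FLOPs
    utilization) is an assumption -- 30-50% is typical in practice.
    """
    total_flops = 6 * n_params * n_tokens
    sustained_flops = peak_flops_per_gpu * n_gpus * mfu
    hours = total_flops / sustained_flops / 3600
    return hours, hours * usd_per_hour

# 8B params, 100B tokens, 8x A100 (312 TFLOPS BF16), 40% MFU, p4d on-demand
hours, cost = training_estimate(8e9, 100e9, 312e12, 8, 0.40, 32.77)
print(f"{hours:,.0f} h (~{hours / 24:.0f} days), ${cost:,.0f}")
```

Doubling MFU halves both time and cost, which is why throughput engineering (fused kernels, better parallelism) pays for itself quickly at this scale.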

9. GPU Selection Guide

Training vs Inference

Important for training:

  • High-performance Tensor Cores (BF16/FP8)
  • Sufficient memory (batch size, gradients, optimizer state)
  • NVLink bandwidth (multi-GPU gradient synchronization)
  • ECC memory (stability)

Important for inference:

  • Memory bandwidth (KV cache read speed)
  • Memory capacity (model + KV cache)
  • INT8/FP8/FP4 support
  • MIG (isolated execution of multiple small models)

GPU Requirements by Model Size

Memory required at FP16:

Model Size   Parameter Memory   Training Memory   GPUs Needed
7B           14GB               ~56GB             1× H100 (80GB)
13B          26GB               ~104GB            2× H100
70B          140GB              ~560GB            8× H100
405B         810GB              ~3.2TB            40+ H100
1T           2TB                ~8TB              100+ H100

Training memory = parameters x 4 (FP16 params, gradients, Adam 2 moments) + activations
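The table's rule of thumb as a tiny helper (a deliberate simplification: it assumes all four copies are stored at 2 bytes per value and ignores activations, which vary with batch size and sequence length):

```python
def training_memory_gb(n_params):
    """Weights + gradients + Adam's two moments ≈ 4 copies at FP16 (2 bytes)."""
    param_gb = n_params * 2 / 1e9
    return param_gb, 4 * param_gb

print(training_memory_gb(7e9))   # (14.0, 56.0) -- matches the 7B row above
```

Real mixed-precision recipes also hold FP32 master weights, so treat the ×4 factor as a floor rather than a guarantee.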

Memory-saving techniques:

  • Gradient Checkpointing: dramatically reduces activation memory (20-30% speed trade-off)
  • FSDP/ZeRO: distributes parameters, gradients, optimizer state across GPUs
  • Flash Attention: attention computation O(N²) → O(N) memory
  • FP8 training: halves memory usage

Recommendations by Budget

Individual researchers (~$2,000):

  • RTX 4090 (24GB, ~$1,600): fine-tuning small models, LoRA training
  • 7B model QLoRA fine-tuning is feasible
  • FlashAttention supported (Ada Lovelace architecture)

Startup team (~$10,000-50,000):

  • 4-8× RTX 4090: small LLM experiments
  • Or used A100 40GB/80GB × 1-4
  • Hybrid cloud strategy recommended

Mid-size AI team (~$1M+):

  • 8× H100 (DGX H100 level): $320,000+
  • Or Lambda Labs/CoreWeave cloud
  • Can train 100B+ parameter models

Large research institutions/enterprises:

  • Hundreds to thousands of H100/H200 GPUs
  • GB200 NVL72 rack systems
  • Dedicated InfiniBand network

Consumer vs Datacenter

Feature     RTX 4090                 H100 SXM
Memory      24GB GDDR6X              80GB HBM3
Bandwidth   1,008 GB/s               3,350 GB/s
FP16 Perf   ~330 TFLOPS              989 TFLOPS
ECC         Not supported            Supported
NVLink      Not supported            Supported
MIG         Not supported            Supported
TDP         450W                     700W
Price       ~$1,600                  ~$30,000+
Support     3-yr consumer warranty   Enterprise support

10. GPU Monitoring and Optimization

Using nvidia-smi

nvidia-smi is the primary CLI tool for monitoring NVIDIA GPUs.

# Basic GPU status
nvidia-smi

# Real-time monitoring (1 second refresh)
watch -n 1 nvidia-smi

# CSV output for logging
nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,\
pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,\
utilization.memory,memory.total,memory.free,memory.used \
--format=csv -l 1 > gpu_log.csv

# Per-process GPU memory usage
nvidia-smi pmon -s m

# GPU topology (NVLink connections)
nvidia-smi topo -m

PyTorch GPU Utilization Optimization

import torch
from torch.utils.data import DataLoader

# Check GPU memory usage
print(f"Allocated memory: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved memory: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# Detailed memory analysis
print(torch.cuda.memory_summary())

# DataLoader optimization
dataloader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,           # Tune to CPU core count
    pin_memory=True,         # Pin CPU RAM for faster GPU transfer
    prefetch_factor=2,       # Number of batches to prefetch
    persistent_workers=True  # Reuse worker processes
)

# CUDA Streams for overlapping operations
stream1 = torch.cuda.Stream()
stream2 = torch.cuda.Stream()

with torch.cuda.stream(stream1):
    result1 = model_part1(data1)

with torch.cuda.stream(stream2):
    result2 = model_part2(data2)  # May overlap with stream1 if resources allow

torch.cuda.synchronize()

GPU Utilization Optimization Checklist

Common causes of low GPU utilization (below 70%) and fixes:

  1. Data loading bottleneck: Increase num_workers, set pin_memory=True
  2. Batch size too small: Use Gradient Accumulation to increase effective batch size
  3. Kernel launch overhead: Use CUDA Graphs to minimize CPU-side launch cost
  4. Memory fragmentation: Call torch.cuda.empty_cache() periodically

# CUDA Graphs to minimize CPU launch overhead
# (requires static shapes; the optimizer must be capture-safe,
#  e.g. torch.optim.Adam(..., capturable=True))
static_input = torch.randn(batch_size, input_size, device='cuda')
static_target = torch.randn(batch_size, output_size, device='cuda')

# Warmup on a side stream so capture sees a quiet default stream
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        optimizer.zero_grad(set_to_none=True)
        output = model(static_input)
        loss = criterion(output, static_target)
        loss.backward()
        optimizer.step()
torch.cuda.current_stream().wait_stream(s)

# Capture one full training step into a graph
g = torch.cuda.CUDAGraph()
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    output = model(static_input)
    loss = criterion(output, static_target)
    loss.backward()
    optimizer.step()

# Run with real data: copy into the static buffers, then replay
for real_input, real_target in dataloader:
    static_input.copy_(real_input)
    static_target.copy_(real_target)
    g.replay()  # Replay the captured step with near-zero CPU overhead

Thermal Management

Datacenter GPUs generate hundreds of watts of heat. Thermal management directly impacts performance and longevity.

  • H100 SXM TDP: 700W, max temperature: 83°C
  • Thermal throttling automatically limits performance when exceeded
  • DGX systems support direct liquid cooling

# Real-time GPU power and temperature monitoring
nvidia-smi dmon -s p

# Fan speed control (consumer GPUs)
nvidia-settings -a "[gpu:0]/GPUFanControlState=1" \
                -a "[fan:0]/GPUTargetFanSpeed=80"

# Power limit (prevent overheating at some performance cost)
sudo nvidia-smi -pl 300  # Limit power to 300W

Multi-GPU Setup

For 4+ GPU configurations, correct settings matter.

# Check GPU connection topology
nvidia-smi topo -m
# Shows NVLink-connected GPU pairs and PCIe connections

# Enable NCCL debug logging
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

# Verify NVLink P2P
nvidia-smi nvlink --status

# NUMA optimization (multi-socket servers)
numactl --cpunodebind=0 --membind=0 python train.py  # Same NUMA node as GPUs 0-3

# PyTorch distributed training basics
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    dist.init_process_group(
        backend='nccl',   # NCCL on NVIDIA; ROCm builds route 'nccl' to RCCL
        init_method='env://',
        world_size=world_size,
        rank=rank
    )
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

# Launch with torchrun:
# torchrun --nproc_per_node=8 --nnodes=2 --rdzv_id=100 \
#          --rdzv_backend=c10d --rdzv_endpoint=host:29400 train.py

Conclusion: Practical Principles for GPU Selection

The most important thing in GPU selection is accurately understanding your own workload.

  1. Memory capacity first: If your model and batch don't fit in GPU memory, nothing else matters.
  2. Bandwidth vs compute: Training tends to be compute-bound; large model inference tends to be memory-bound.
  3. Cloud-first strategy: If unsure, start with cloud to understand requirements before investing in on-premises hardware.
  4. Ecosystem matters: NVIDIA's CUDA ecosystem remains overwhelmingly mature. AMD ROCm is catching up fast.
  5. Power consumption: Over a multi-year lifetime, electricity and cooling for an on-premises GPU cluster can rival the hardware purchase price.

AI infrastructure is evolving rapidly. Blackwell makes FP4 quantization a reality; rack-scale systems like GB200 NVL72 are redefining the training and inference of large models. Keep a close eye on hardware developments and make choices optimized for your requirements.

