Author: Youngju Kim (@fjvbn20031)

Introduction
The GPU is the undisputed workhorse behind the explosive growth of AI and deep learning. Training large language models like GPT-4, Llama 3, or Gemini requires thousands of GPUs running for weeks to months. So why are GPUs so critical for AI? Which GPU should you choose?
This guide covers everything about GPU hardware for AI engineers, researchers, and ML practitioners — from architectural fundamentals to the latest Blackwell GPUs, cloud service comparisons, and practical selection guidance.
1. GPU vs CPU: Why AI Training Needs GPUs
Parallel Computing: The Essence of AI
The core operation in deep learning is matrix multiplication. Both forward and backward passes through neural networks consist of billions of multiplications and additions. These operations are independent of each other and thus perfectly suited for parallelization.
CPUs are optimized for high-performance serial processing. A typical server CPU has 64 to 128 cores, each equipped with complex control logic, large caches, and branch prediction. This excels at sequential tasks, complex conditional branching, and operating system management.
GPUs, in contrast, pack thousands to tens of thousands of small cores that simultaneously perform the same operation in SIMD (Single Instruction, Multiple Data) fashion. The NVIDIA H100 has an astounding 16,896 CUDA cores. For workloads that repeat the same operation — like matrix multiplication — a GPU can deliver hundreds of times more throughput than a CPU.
FLOPS: The Measure of Compute Performance
The most common unit in deep learning performance discussions is FLOPS (Floating Point Operations Per Second).
- TFLOPS (teraflops): 1 trillion floating-point operations per second
- PFLOPS (petaflops): 1 quadrillion floating-point operations per second
Modern AI workloads primarily use these precisions:
- FP32 (single-precision float): Master weight storage during training
- FP16 (half-precision float): Mixed-precision training
- BF16 (Brain Float 16): More stable training than FP16
- TF32 (TensorFloat-32): Supported since NVIDIA A100
- FP8: Supported in Hopper and Blackwell; used for both inference and training
- FP4: New in Blackwell; ultra-high-density inference
The NVIDIA H100's FP16 Tensor Core performance reaches a staggering 989 TFLOPS — roughly 1 PFLOPS.
Memory Bandwidth: The True Bottleneck
Many AI workloads are limited not by compute, but by memory bandwidth — a condition called memory-bound.
Take Large Language Model (LLM) inference as an example: every time you generate a token, the full model weights must be read from memory. A Llama 3 70B model in FP16 takes roughly 140GB of memory. Generating dozens of tokens per second requires reading terabytes per second from memory.
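This arithmetic can be sketched in a few lines. The 140GB figure comes straight from the parameter count; the 30 tokens/s target and the function names are illustrative, and the sketch ignores the KV cache and batching for simplicity:

```python
# Back-of-envelope check: why LLM token generation is memory-bound.
# Every generated token must stream the full weights from HBM once
# (batch size 1, KV cache ignored for simplicity).

def model_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    """Model size in bytes at a given precision (FP16 = 2 bytes/param)."""
    return n_params * bytes_per_param

def required_bandwidth_gbps(n_params: float, tokens_per_sec: float) -> float:
    """HBM bandwidth (GB/s) needed to sustain a target decode rate."""
    return model_bytes(n_params) * tokens_per_sec / 1e9

llama3_70b = 70e9
size_gb = model_bytes(llama3_70b) / 1e9                  # 140.0 GB in FP16
bw_for_30_tps = required_bandwidth_gbps(llama3_70b, 30)  # 4,200 GB/s

print(f"Llama 3 70B @ FP16: {size_gb:.0f} GB")
print(f"Bandwidth for 30 tok/s: {bw_for_30_tps:,.0f} GB/s")
```

Even a single-digit tokens-per-second target already demands hundreds of GB/s, which is why the bandwidth numbers below matter so much.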
Latest GPU memory bandwidths:
- NVIDIA A100 SXM: 2,000 GB/s
- NVIDIA H100 SXM: 3,350 GB/s
- NVIDIA H200 SXM: 4,800 GB/s
- NVIDIA B200: 8,000 GB/s (projected)
This is why HBM (High Bandwidth Memory) has been adopted in datacenter GPUs. HBM provides far higher bandwidth than conventional GDDR memory.
2. History of NVIDIA GPU Architecture
Pascal Architecture (2016): The Start of the AI Renaissance
Named after the mathematician and physicist Blaise Pascal, the Pascal architecture launched in 2016. The GTX 1080 (consumer) and P100 (datacenter) used this architecture.
P100 key specs:
- CUDA cores: 3,584
- FP32 performance: 9.3 TFLOPS
- FP16 performance: 18.7 TFLOPS
- Memory: 16GB HBM2, 720 GB/s
The P100 was the first datacenter GPU to adopt HBM2 memory. NVLink 1.0 was introduced in this era as well. AlphaGo defeated Lee Sedol around this time, and the AI boom began in earnest.
Volta Architecture (2017): The Arrival of Tensor Cores
The Volta architecture marks a turning point in GPU history. The V100, released in 2017, introduced Tensor Cores to the world for the first time. Tensor Cores are dedicated hardware units that accelerate matrix multiplication.
V100 key specs:
- CUDA cores: 5,120
- 1st-gen Tensor Cores: 640
- FP32 performance: 14 TFLOPS
- FP16 Tensor Core performance: 112 TFLOPS (8x improvement!)
- Memory: 32GB HBM2, 900 GB/s
- NVLink 2.0: 300 GB/s
A single Tensor Core performs a 4x4 matrix multiply-accumulate (D = A*B + C) in one cycle. This boosted FP16 performance 8x over FP32, revolutionizing deep learning training speeds.
Turing Architecture (2018): RT Cores and DLSS
The Turing architecture is famous for the consumer RTX series. RT Cores (dedicated ray-tracing units) debuted here, along with DLSS — AI-based image upscaling.
RTX 2080 Ti key specs:
- CUDA cores: 4,352
- Tensor Cores: 544 (2nd gen)
- FP32 performance: 13.4 TFLOPS
- FP16 Tensor Core: 107 TFLOPS
- Memory: 11GB GDDR6, 616 GB/s
From an AI standpoint, Turing's significance was INT8 quantized inference support. Quantizing models to INT8 for inference servers delivered 2x faster inference compared to FP16.
Ampere Architecture (2020): A100 and 3rd Gen Tensor Cores
The Ampere architecture shifted the paradigm again. The A100 remains a workhorse GPU in many datacenters today.
A100 SXM4 80GB key specs:
- CUDA cores: 6,912
- 3rd-gen Tensor Cores: 432
- FP32 performance: 19.5 TFLOPS
- FP16 Tensor Core: 312 TFLOPS
- TF32 Tensor Core: 156 TFLOPS
- BF16 Tensor Core: 312 TFLOPS
- INT8 Tensor Core: 624 TOPS
- Memory: 80GB HBM2e, 2,000 GB/s
- NVLink 3.0: 600 GB/s
Ampere's key innovations:
TF32 (TensorFloat-32): A hybrid between FP32 and FP16. Exponent bits match FP32 (8 bits); mantissa matches FP16 (10 bits). Existing FP32 code can leverage Tensor Core speeds without modification — balancing numerical stability and speed.
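In PyTorch, TF32 execution is controlled by per-process flags; assuming a reasonably recent PyTorch build, they look like this:

```python
# TF32 is opt-in/out per process in PyTorch. These flags control whether
# FP32 matmuls and cuDNN convolutions may run on Tensor Cores as TF32.
import torch

torch.backends.cuda.matmul.allow_tf32 = True   # FP32 matmuls may use TF32
torch.backends.cudnn.allow_tf32 = True         # cuDNN convolutions may use TF32

# Equivalent high-level knob (PyTorch 1.12+):
torch.set_float32_matmul_precision('high')     # 'highest' forces strict FP32
```

Existing FP32 training scripts need no other changes to benefit.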
Sparsity support: A100 supports 2:4 structured sparsity in hardware. Setting 50% of model parameters to zero (pruning) lets Tensor Cores exploit this for a 2x additional performance gain. Up to 1,248 TOPS in INT8 theoretically.
Multi-Instance GPU (MIG): A100 can be partitioned into up to 7 independent GPU instances. Useful for running multiple small models in isolated environments on inference servers.
Hopper Architecture (2022): Transformer Engine
The Hopper architecture brought innovations specifically tailored to transformer models. The H100 is currently the most widely deployed top-tier AI training GPU.
H100 SXM5 80GB key specs:
- CUDA cores: 16,896
- 4th-gen Tensor Cores: 528
- FP32 performance: 60 TFLOPS
- FP16/BF16 Tensor Core: 989 TFLOPS (~1 PFLOPS!)
- FP8 Tensor Core: 1,979 TFLOPS (~2 PFLOPS)
- Memory: 80GB HBM3, 3,350 GB/s
- NVLink 4.0: 900 GB/s
- TDP: 700W
Hopper's key innovations:
Transformer Engine: Hardware-level optimization for attention and MLP layers in transformer models. Automatically switches between FP8 and FP16 per layer. Up to 9x AI performance over A100 at launch.
FP8 support: Supports E4M3 and E5M2 FP8 formats. 2x Tensor Core performance vs FP16. Half the memory usage.
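The two formats trade range for precision, and their largest finite values follow directly from the bit layouts (note that NVIDIA's E4M3 variant, "E4M3FN", gives up infinity encodings to extend range). A small derivation in pure Python:

```python
# Largest finite values of the two FP8 formats, derived from bit layouts.
# E4M3 (the "E4M3FN" variant) reserves only mantissa=all-ones at the top
# exponent for NaN, so its max is 1.75 * 2^8 = 448.
# E5M2 follows IEEE conventions: max = 1.75 * 2^15 = 57344.

def fp8_max(exp_bits: int, man_bits: int, ieee_inf: bool) -> float:
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_inf:
        max_exp = (2 ** exp_bits - 2) - bias   # top exponent reserved for inf/NaN
        mantissa = 2 - 2 ** -man_bits          # all mantissa bits usable
    else:
        max_exp = (2 ** exp_bits - 1) - bias   # top exponent still usable
        mantissa = 2 - 2 * 2 ** -man_bits      # all-ones mantissa reserved for NaN
    return mantissa * 2 ** max_exp

e4m3_max = fp8_max(4, 3, ieee_inf=False)   # 448.0
e5m2_max = fp8_max(5, 2, ieee_inf=True)    # 57344.0
```

In practice E4M3 is typically used for weights and activations (more precision) and E5M2 for gradients (more range).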
Thread Block Clusters: thread blocks in a cluster, scheduled on different SMs (Streaming Multiprocessors), can directly access one another's shared memory. This distributed shared memory enables cooperation beyond the boundaries of a single SM.
NVLink 4.0: 900 GB/s, 1.5x improvement over the previous generation. Full-mesh connectivity for up to 8 GPUs.
H200 SXM 141GB key specs:
- Compute: Same as H100
- Memory: 141GB HBM3e (76% more than H100)
- Bandwidth: 4,800 GB/s (43% more than H100)
- Up to 2x faster throughput for LLM inference
- Fits large models (70B+ LLMs) on a single GPU
Blackwell Architecture (2024): Next-Generation AI Acceleration
The Blackwell architecture, announced in 2024, is NVIDIA's latest.
B200 SXM key specs:
- Compute: 20 PFLOPS (FP4)
- FP8 performance: 9 PFLOPS
- Memory: 192GB HBM3e, 8,000 GB/s
- NVLink 5.0: 1,800 GB/s
Blackwell's key innovations:
FP4 support: 4-bit floating-point support enables ultra-high-density inference — more than 2x throughput vs FP8.
2nd Generation Transformer Engine: Automatically manages new precision formats including FP4 and FP6.
NVLink 5.0: 1,800 GB/s, 2x improvement over the previous generation.
GB200 NVL72: 36 Grace CPUs and 72 B200 GPUs integrated into a single rack-scale system. All GPUs connected via NVLink, functioning as one giant GPU. Achieves 1.4 ExaFLOPS (FP4).
3. Tensor Core Deep Dive
CUDA Core vs Tensor Core
A CUDA Core is a general-purpose floating-point execution unit. It processes one FP32 FMA (Fused Multiply-Add) operation per clock cycle.
A Tensor Core is a dedicated matrix multiplication unit. A 1st-gen Tensor Core performs a 4x4 FP16 matrix multiply-accumulate (D = A * B + C) in a single clock cycle. This is equivalent to 64 FP16 multiplications and 64 FP16 additions — 128 FP16 operations per cycle.
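That per-cycle contract can be emulated numerically; NumPy stands in for the hardware here, and the FP32 accumulation mirrors the Tensor Core datapath:

```python
# What one 1st-gen Tensor Core does per cycle, emulated in NumPy:
# D = A @ B + C on 4x4 matrices, FP16 operands with FP32 accumulation.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)).astype(np.float16)
B = rng.standard_normal((4, 4)).astype(np.float16)
C = np.zeros((4, 4), dtype=np.float32)

# FP16 operands, FP32 accumulate: the Tensor Core contract
D = A.astype(np.float32) @ B.astype(np.float32) + C

# Operation count: 4*4*4 = 64 FMAs = 128 floating-point ops per cycle
flops_per_cycle = 2 * 4 * 4 * 4
```

Multiplying 128 ops/cycle by hundreds of Tensor Cores and GHz-range clocks is how the hundreds-of-TFLOPS figures in this article arise.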
Tensor Core evolution by generation:
| Gen | Architecture | Supported Precisions | Matrix Size | Notes |
|---|---|---|---|---|
| 1st | Volta | FP16 | 4x4 | First Tensor Core |
| 2nd | Turing | FP16, INT8, INT4 | - | INT support added |
| 3rd | Ampere | FP16, BF16, TF32, INT8, INT4 | - | TF32, Sparsity |
| 4th | Hopper | FP16, BF16, TF32, FP8, INT8 | - | FP8, Transformer Engine |
| 5th | Blackwell | FP16, BF16, TF32, FP8, FP4 | - | FP4 support |
WMMA (Warp Matrix Multiply-Accumulate)
To use Tensor Cores directly in CUDA programming, you use the WMMA API:
```cpp
#include <mma.h>
using namespace nvcuda::wmma;

// WMMA must be called from device code by a full warp;
// one warp computes a 16x16x16 FP16 multiply with FP32 accumulation.
__global__ void wmma_gemm_16(const half *a_ptr, const half *b_ptr, float *c_ptr) {
    fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;
    fragment<matrix_b, 16, 16, 16, half, col_major> b_frag;
    fragment<accumulator, 16, 16, 16, float> c_frag;

    fill_fragment(c_frag, 0.0f);

    // Load matrices from global memory (leading dimension 16)
    load_matrix_sync(a_frag, a_ptr, 16);
    load_matrix_sync(b_frag, b_ptr, 16);

    // Execute the Tensor Core multiply-accumulate
    mma_sync(c_frag, a_frag, b_frag, c_frag);

    // Store the result
    store_matrix_sync(c_ptr, c_frag, 16, mem_row_major);
}
```
In practice, cuBLAS and PyTorch handle this automatically.
Mixed Precision Training
Mixed precision training maintains FP32 master weights while performing forward/backward passes in FP16 or BF16.
```python
# PyTorch AMP (Automatic Mixed Precision)
import torch
from torch.cuda.amp import autocast, GradScaler

# GradScaler (loss scaling) matters for FP16; with BF16 it is
# effectively a no-op but harmless to keep.
scaler = GradScaler()

for inputs, target in dataloader:
    optimizer.zero_grad()
    with autocast(dtype=torch.bfloat16):
        output = model(inputs)
        loss = criterion(output, target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```
BF16 is more stable than FP16 for training. FP16's 5-bit exponent gives a narrow range, causing potential overflow, while BF16's 8-bit exponent (same as FP32) represents a much wider range.
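The ranges behind this trade-off can be read straight out of `torch.finfo`:

```python
# FP16 vs BF16 vs FP32: range and precision, per torch.finfo
import torch

fp16 = torch.finfo(torch.float16)
bf16 = torch.finfo(torch.bfloat16)
fp32 = torch.finfo(torch.float32)

print(f"FP16 max: {fp16.max:.5e}")   # 6.55040e+04, overflows easily
print(f"BF16 max: {bf16.max:.5e}")   # ~3.4e+38, same exponent range as FP32
print(f"FP32 max: {fp32.max:.5e}")

# The price of BF16's range: fewer mantissa bits, i.e. coarser precision
print(f"FP16 eps: {fp16.eps}")       # ~0.000977 (10 mantissa bits)
print(f"BF16 eps: {bf16.eps}")       # ~0.0078   (7 mantissa bits)
```

This is why BF16 rarely needs loss scaling while FP16 usually does.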
Sparsity Support
The 2:4 structured sparsity supported from A100 onward groups parameters into sets of 4 and sets exactly 2 of them to zero.
```python
# Illustrative 50% structured pruning with torch.nn.utils.prune.
# Note: ln_structured prunes whole rows/channels by L2 norm, NOT the
# hardware 2:4 pattern; actual sparse Tensor Core execution requires
# 2:4-aware tooling (e.g. NVIDIA's ASP library or PyTorch's
# semi-structured sparse tensors).
from torch.nn.utils import prune

prune.ln_structured(model.layer, name='weight', amount=0.5, n=2, dim=0)
```
After pruning 50% of weights in the 2:4 pattern, the same model can theoretically run inference up to 2x faster on sparse Tensor Cores.
4. GPU Memory Hierarchy
GDDR vs HBM: A Game Changer
GDDR6 (Graphics DDR6): Used in consumer GPUs. Separate chips mounted outside the package. RTX 4090: 24GB GDDR6X, 1,008 GB/s.
HBM2e (High Bandwidth Memory 2e): Used in datacenter GPUs. Stacked next to the GPU die in 2.5D packaging, connected via a silicon interposer. A100: 80GB HBM2e, 2,000 GB/s.
HBM3: Deployed in H100. 80GB, 3,350 GB/s.
HBM3e: Deployed in H200. 141GB, 4,800 GB/s. Also in B200: 192GB, 8,000 GB/s.
Why is HBM faster? HBM stacks multiple layers of DRAM dies vertically, connected by thousands of fine Through-Silicon Vias (TSVs). The GPU die and HBM stack sit side by side on a silicon interposer, providing ultra-wide bandwidth over extremely short distances.
Memory Hierarchy Structure
```
Registers (register file)
└── Fastest; tens to hundreds per thread
L1 cache / shared memory
└── Shared within an SM (Streaming Multiprocessor)
    └── H100: 228KB per SM
L2 cache
└── Shared across all SMs
    └── H100: 50MB
HBM (main memory)
└── Accessible by all SMs
    └── H100: 80GB
```
The key to kernel optimization is keeping data in shared memory as long as possible to minimize slow HBM accesses. Flash Attention is a prime example of applying this principle to attention computation.
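One way to quantify this principle is arithmetic intensity: FLOPs performed per byte moved from HBM. A rough calculator, assuming FP16 operands and ideal on-chip reuse (the function name is illustrative):

```python
# Arithmetic intensity of a matmul: FLOPs per byte of HBM traffic.
# If intensity < (peak FLOPs / memory bandwidth), the kernel is memory-bound.

def matmul_intensity(m: int, n: int, k: int, bytes_per_el: int = 2) -> float:
    flops = 2 * m * n * k                              # multiply + add
    traffic = bytes_per_el * (m * k + k * n + m * n)   # read A, B; write C once
    return flops / traffic

# H100 SXM roofline ridge point: ~989e12 FLOP/s / 3.35e12 B/s ~ 295 FLOPs/byte
small = matmul_intensity(128, 128, 128)     # ~42.7: memory-bound
large = matmul_intensity(8192, 8192, 8192)  # ~2731: compute-bound
```

Kernels whose intensity falls below the ridge point (small matmuls, attention without tiling) only speed up by reducing HBM traffic, which is exactly what Flash Attention does.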
ECC Memory
ECC (Error-Correcting Code) memory detects and corrects bit errors. Datacenter GPUs (A100, H100, etc.) support ECC. Consumer GPUs (RTX 4090) do not.
Memory errors during long training runs can cause training to diverge or produce NaN values, so ECC-capable GPUs are recommended for critical training jobs. Note: on GDDR-based boards that implement ECC in-band, enabling it reduces usable capacity by roughly 6.25%; HBM parts include dedicated ECC storage, so capacity is unaffected.
5. Multi-GPU Connectivity: NVLink & NVSwitch
The PCIe Bottleneck
A standard PCIe 4.0 x16 slot offers up to 32 GB/s bandwidth (64 GB/s bidirectional). In multi-GPU training, gradient synchronization creates a bottleneck through this interface.
Consider All-Reduce across 4 GPUs: each GPU must exchange gradients with the other 3. A 10B parameter model carries ~40GB of FP32 gradients. Exchanging this over PCIe could take tens of seconds.
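An idealized version of that estimate, using the standard ring all-reduce traffic formula (latency and compute/communication overlap are ignored; the link speeds are per-direction figures):

```python
# Rough transfer-time comparison for gradient all-reduce.
# Ring all-reduce moves about 2*(N-1)/N * data per GPU.

def allreduce_seconds(grad_bytes: float, n_gpus: int, link_gbps: float) -> float:
    traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    return traffic / (link_gbps * 1e9)

grads = 10e9 * 4   # 10B params, FP32 gradients = 40 GB

pcie = allreduce_seconds(grads, 4, 32)     # PCIe 4.0 x16: ~1.9 s per step
nvlink = allreduce_seconds(grads, 4, 450)  # NVLink 4.0 per direction: ~0.13 s
```

Paying seconds of communication per optimizer step is why high-bandwidth interconnects are mandatory at scale.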
NVLink Evolution
| Version | Architecture | Per-Link BW (bidirectional) | Total per GPU |
|---|---|---|---|
| 1.0 | Pascal | 40 GB/s | 160 GB/s (4 links) |
| 2.0 | Volta | 50 GB/s | 300 GB/s (6 links) |
| 3.0 | Ampere | 50 GB/s | 600 GB/s (12 links) |
| 4.0 | Hopper | 50 GB/s | 900 GB/s (18 links) |
| 5.0 | Blackwell | 100 GB/s | 1,800 GB/s (18 links) |
NVLink 4.0 provides up to 900 GB/s bidirectional bandwidth between GPU pairs — more than 14x faster than PCIe 4.0 x16.
NVSwitch: All-to-All Connectivity
NVLink connects pairs of GPUs directly, but connecting 8 or more GPUs requires NVSwitch — a dedicated GPU interconnect switch chip that lets all connected GPUs communicate directly at full NVLink bandwidth.
DGX H100 system configuration:
- 8× H100 SXM5 GPUs
- 4× NVSwitch 4.0
- All GPU pairs directly connected at 900 GB/s
- NVLink All-to-All total bandwidth: 7.2 TB/s
DGX A100 vs DGX H100
DGX A100:
- GPUs: 8× A100 80GB
- NVLink total bandwidth: 4.8 TB/s
- GPU memory: 640GB
- AI performance: 5 PFLOPS (FP16)
DGX H100:
- GPUs: 8× H100 80GB
- NVLink total bandwidth: 7.2 TB/s
- GPU memory: 640GB
- AI performance: 32 PFLOPS (FP8)
- Approximately 6.4x performance over DGX A100
InfiniBand: Inter-Node Connectivity
NVLink handles intra-node connectivity; InfiniBand (IB) networking connects multiple server nodes. NVIDIA ConnectX-7 NICs with InfiniBand NDR (400 Gb/s) minimize inter-server communication latency.
Large-scale LLM training requires connecting thousands of GPUs. Meta's Llama 3 training used 16,000 H100s, all interconnected by a massive InfiniBand fabric.
6. Detailed AI GPU Comparison
NVIDIA A100 (80GB HBM2e)
Released in 2020, the A100 remains the standard for many AI workloads. Offers FP16 312 TFLOPS, BF16 312 TFLOPS, and TF32 156 TFLOPS.
The SXM4 form factor supports NVLink 3.0 for up to 8 GPUs; a PCIe 4.0 version also exists. MIG (Multi-Instance GPU) partitions the A100 into up to 7 independent instances.
Approximate cloud hourly rate: ~$32.77/h for AWS p4d.24xlarge (8× A100).
NVIDIA H100 (80GB HBM3)
Released 2022. Currently the most widely deployed high-end AI training GPU.
SXM5 version:
- FP16/BF16 Tensor Core: 989 TFLOPS
- FP8 Tensor Core: 1,979 TFLOPS
- Memory: 80GB HBM3, 3,350 GB/s
- TDP: 700W
- NVLink 4.0: 900 GB/s
PCIe version:
- FP16/BF16 Tensor Core: 756 TFLOPS
- Memory: 80GB HBM3, 2,000 GB/s
- TDP: 350W
H100 vs A100:
- Tensor Core performance: 3.2x (FP16)
- FP8 (H100) vs INT8 (A100) raw throughput: ~3.2x (NVIDIA quotes up to 6x overall with the Transformer Engine)
- Memory bandwidth: 1.7x (HBM3)
- NVLink bandwidth: 1.5x
Approximate cloud hourly rate: ~$98.32/h for AWS p5.48xlarge (8× H100).
NVIDIA H200 (141GB HBM3e)
A memory-upgraded H100. Compute performance is identical to H100, but memory capacity and bandwidth are dramatically improved.
- Memory: 141GB HBM3e (76% more than H100)
- Bandwidth: 4,800 GB/s (43% more than H100)
- Up to 2x faster throughput vs H100 for LLM inference
- Fits large models (70B+ LLMs) on a single GPU
NVIDIA B100 / B200 (Blackwell)
Announced 2024. Still in early deployment.
B200 SXM:
- FP4 Tensor Core: 20 PFLOPS
- FP8 Tensor Core: 9 PFLOPS
- FP16/BF16 Tensor Core: 4.5 PFLOPS
- Memory: 192GB HBM3e, 8,000 GB/s
- TDP: 1,000W
B200 vs H100:
- FP8 performance: 4.5x
- Memory: 2.4x
- Bandwidth: 2.4x
NVIDIA GB200 NVL72 (Rack-Scale AI)
The GB200 superchip integrates one Grace CPU (Arm-based) and two B200 GPUs into a single package. GB200 NVL72 combines 36 such superchips, i.e. 36 Grace CPUs and 72 B200 GPUs, into a single rack system.
GB200 NVL72 specs:
- GPUs: 72× B200
- CPUs: 36× Grace (paired with the GPUs as GB200 Grace Blackwell Superchips)
- GPU memory: 13.8TB HBM3e
- NVLink 5.0 All-to-All connectivity
- AI performance: 1.4 ExaFLOPS (FP4), 720 PFLOPS (FP8)
- Total power: 120kW
This effectively operates as one enormous GPU. A single rack can handle large models like Llama 3 405B at high throughput.
GeForce RTX 4090 (Consumer)
The best consumer-grade GPU for AI startups and individual researchers.
- CUDA cores: 16,384
- FP32 performance: 82.6 TFLOPS
- FP16 Tensor Core: ~330 TFLOPS (approximate)
- Memory: 24GB GDDR6X, 1,008 GB/s
- TDP: 450W
- Price: ~$1,599 (MSRP)
vs H100 SXM:
- Tensor Core: ~1/3 the performance
- Memory: 24GB vs 80GB
- Bandwidth: 1,008 vs 3,350 GB/s
- ECC: Not supported
- NVLink: Not supported (PCIe only)
- Price: ~1/20th (H100 is $30,000+)
AMD MI300X
AMD's datacenter AI GPU.
- Compute Units (CU): 304
- FP16 performance: 1,307 TFLOPS
- BF16 performance: 1,307 TFLOPS
- FP8 performance: 2,614 TOPS
- Memory: 192GB HBM3, 5,300 GB/s
- TDP: 750W
The MI300X has a significant memory capacity (192GB vs 80GB) and bandwidth (5,300 vs 3,350 GB/s) advantage over the H100. It particularly excels in LLM inference.
GPU Performance Comparison Table
| GPU | FP16 TFLOPS | Memory | Bandwidth | TDP | Year |
|---|---|---|---|---|---|
| A100 SXM | 312 | 80GB HBM2e | 2,000 GB/s | 400W | 2020 |
| RTX 4090 | ~330 | 24GB GDDR6X | 1,008 GB/s | 450W | 2022 |
| H100 SXM | 989 | 80GB HBM3 | 3,350 GB/s | 700W | 2022 |
| MI300X | 1,307 | 192GB HBM3 | 5,300 GB/s | 750W | 2023 |
| H200 SXM | 989 | 141GB HBM3e | 4,800 GB/s | 700W | 2024 |
| B200 SXM | 4,500 | 192GB HBM3e | 8,000 GB/s | 1,000W | 2024 |
7. AMD GPU for AI
The ROCm Ecosystem
AMD's AI software stack is ROCm (Radeon Open Compute), an open-source platform that parallels the CUDA stack. Compatibility with major frameworks like PyTorch and TensorFlow has improved significantly in recent years.
ROCm's CUDA-equivalent components:
- HIP (Heterogeneous-compute Interface for Portability): CUDA C++ equivalent
- rocBLAS: cuBLAS equivalent (matrix operations)
- MIOpen: cuDNN equivalent (deep learning primitives)
- RCCL: NCCL equivalent (collective GPU communication)
Installing PyTorch with ROCm support:
```shell
# Install PyTorch with ROCm support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0
```
MI300X Key Features
The MI300X is AMD's current flagship AI GPU. It uses the CDNA 3 architecture with advanced packaging (MCM: Multi-Chip Module) that integrates GPU dies and HBM stacks in 3D.
The MI300X's 192GB HBM3 and 5,300 GB/s bandwidth can outperform the H100 in some LLM inference scenarios — especially memory-bound workloads (large batches, long sequences).
Microsoft Azure, Oracle Cloud, and others have begun offering MI300X instances. Major tech companies including Meta and Microsoft are actively adopting AMD GPUs.
AMD vs NVIDIA Software Ecosystem
Honestly, the NVIDIA CUDA software ecosystem still dominates AI today.
- NVIDIA-exclusive libraries: cuDNN, cuBLAS, TensorRT, NCCL, NVTX, etc.
- FlashAttention was originally CUDA-only (ROCm port came later)
- Much research code assumes CUDA
- ROCm is rapidly closing the gap but isn't fully compatible yet
Choosing AMD GPUs for production may require additional engineering time to address software compatibility issues.
8. Cloud GPU Service Comparison
AWS GPU Instances
p3 series (V100):
- p3.2xlarge: 1× V100, $3.06/h
- p3.16xlarge: 8× V100, $24.48/h
p4d series (A100):
- p4d.24xlarge: 8× A100, 320GB HBM2, $32.77/h
p5 series (H100):
- p5.48xlarge: 8× H100, 640GB HBM3, $98.32/h
AWS Trainium (Trn1):
- AWS-proprietary AI training chip (first-generation Trainium; newer Trn2 instances use Trainium2)
- Trn1.32xlarge: 16× Trainium, $21.50/h
- AWS advertises better price-performance for LLM training than comparable GPU instances
AWS Inferentia (Inf2):
- Inference-only chip
- Inf2.48xlarge: 12× Inferentia2, $12.98/h
- Optimized for Llama 2 70B inference
Spot instances can save 60-90% vs on-demand. Checkpointing is essential since instances can be interrupted.
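A minimal checkpoint save/restore pattern so a spot interruption only loses progress since the last save (the model, filename, and step counter here are illustrative):

```python
# Save everything needed to resume: step counter, model, optimizer state.
import torch
import torch.nn as nn

model = nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
step = 1000

torch.save({
    'step': step,
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
}, 'ckpt.pt')

# ...after the spot instance is replaced:
ckpt = torch.load('ckpt.pt')
model.load_state_dict(ckpt['model'])
optimizer.load_state_dict(ckpt['optimizer'])
resume_step = ckpt['step']
```

In practice, checkpoints are written to durable object storage (e.g. S3) on a fixed cadence so the worst-case loss is one interval of compute.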
Google Cloud GPU Instances
A100 instances:
- a2-highgpu-1g: 1× A100 (40GB), $3.67/h
- a2-megagpu-16g: 16× A100, $55.74/h
H100 instances:
- a3-highgpu-8g: 8× H100, ~$19-25/h (varies by region)
Google TPU v4/v5:
- TPU v4: AI-training-optimized ASIC, 275 TFLOPS (BF16) per chip
- TPU v5e: Optimized for large-scale inference
- TPU v5p: Latest training-focused chip, 459 TFLOPS (BF16) per chip
- Best compatibility with Google's JAX framework
Azure GPU Instances
ND H100 v5:
- Standard_ND96isr_H100_v5: 8× H100
- Inter-node connectivity via InfiniBand NDR
NCas_T4_v3:
- T4-based inference instances
- Standard_NC64as_T4_v3: 4× T4, $4.35/h
Lambda Labs, CoreWeave, Vast.ai
Cloud startups offer GPUs more cheaply than AWS/GCP/Azure.
Lambda Labs:
- H100 SXM5 8× instance: $26.80/h (73% cheaper than AWS p5)
- Lambda Cloud is tailored for AI researchers
CoreWeave:
- Professional GPU cloud
- H100 single: $2.89/h
- Large cluster configurations available
Vast.ai:
- GPU marketplace (individuals/companies renting out GPUs)
- H100 at ~$2-3/h (market price)
- Suitable for experimental training where security sensitivity is low
Cloud GPU Cost Estimation
LLM training cost example (Llama 3 8B, 8× A100, 100B tokens):
Training time estimate (compute ≈ 6 × parameters × tokens):
- 6 × 8B × 100B ≈ 4.8 × 10²¹ FLOPs (for reference, Chinchilla-optimal would be ~20 tokens per parameter, i.e. ~160B tokens for an 8B model)
- 8× A100 ≈ 2.5 PFLOPS peak (BF16 dense); at ~40% utilization ≈ 1 PFLOP/s
- ≈ 4.8 × 10⁶ seconds ≈ 55 days
- p4d.24xlarge: $32.77/h × 24h × 55 days ≈ $43,000
Cost-saving strategies:
1. Spot instances: ~70% savings → ~$13,000
2. Lambda Labs: $11.60/h × 24h × 55 days ≈ $15,300
3. CoreWeave: even cheaper possible
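The estimate above can be packaged as a small calculator. The 6 × params × tokens rule is standard; the MFU (model FLOPs utilization) and hourly rate are illustrative assumptions, and real wall-clock time is very sensitive to the MFU you actually achieve:

```python
# Back-of-envelope training-cost estimator using compute ~ 6 * params * tokens.

def train_days(params: float, tokens: float, peak_flops: float, mfu: float) -> float:
    total_flops = 6 * params * tokens
    return total_flops / (peak_flops * mfu) / 86400

def cost_usd(days: float, hourly_rate: float) -> float:
    return days * 24 * hourly_rate

a100x8_peak = 8 * 312e12   # 8x A100, BF16 dense TFLOPS

days = train_days(8e9, 100e9, a100x8_peak, mfu=0.40)
print(f"~{days:.0f} days")
print(f"on-demand: ${cost_usd(days, 32.77):,.0f}")
```

Re-running with different MFU or price inputs makes it easy to compare providers before committing to a long run.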
9. GPU Selection Guide
Training vs Inference
Important for training:
- High-performance Tensor Cores (BF16/FP8)
- Sufficient memory (batch size, gradients, optimizer state)
- NVLink bandwidth (multi-GPU gradient synchronization)
- ECC memory (stability)
Important for inference:
- Memory bandwidth (KV cache read speed)
- Memory capacity (model + KV cache)
- INT8/FP8/FP4 support
- MIG (isolated execution of multiple small models)
GPU Requirements by Model Size
Memory required at FP16:
| Model Size | Parameter Memory | Training Memory | GPUs Needed |
|---|---|---|---|
| 7B | 14GB | ~56GB | 1× H100 (80GB) |
| 13B | 26GB | ~104GB | 2× H100 |
| 70B | 140GB | ~560GB | 8× H100 |
| 405B | 810GB | ~3.2TB | 40+ H100 |
| 1T | 2TB | ~8TB | 100+ H100 |
Training memory ≈ parameter memory × 4 (FP16 params, FP16 gradients, two Adam moments) + activations. This is a rough lower bound: full mixed precision additionally keeps FP32 master weights and FP32 moments, which costs more.
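The table's rule of thumb as a tiny helper (the function name is illustrative; activations come on top of these numbers):

```python
# Training-state memory: parameter memory x 4
# (FP16 params, gradients, two Adam moments), before activations.

def training_memory_gb(n_params: float, bytes_per_param: int = 2,
                       state_multiplier: int = 4) -> float:
    return n_params * bytes_per_param / 1e9 * state_multiplier

print(training_memory_gb(7e9))     # 56.0 GB -> fits one 80GB GPU
print(training_memory_gb(70e9))    # 560.0 GB -> needs a multi-GPU node
print(training_memory_gb(405e9))   # 3240.0 GB ~ 3.2 TB
```

Dividing the result by per-GPU memory (with headroom for activations) gives the GPU counts in the table.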
Memory-saving techniques:
- Gradient Checkpointing: dramatically reduces activation memory (20-30% speed trade-off)
- FSDP/ZeRO: distributes parameters, gradients, optimizer state across GPUs
- Flash Attention: attention computation O(N²) → O(N) memory
- FP8 training: halves memory usage
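Gradient checkpointing, the first technique above, is a one-line change in PyTorch; this sketch uses a toy module and sizes purely for illustration:

```python
# Gradient checkpointing: recompute activations during backward
# instead of storing them through the forward pass.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
x = torch.randn(8, 64, requires_grad=True)

# Activations inside `block` are not kept; they are recomputed in backward
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```

Memory for intermediate activations drops sharply at the cost of one extra forward pass through the checkpointed region.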
Recommendations by Budget
Individual researchers (~$2,000):
- RTX 4090 (24GB, ~$1,600): fine-tuning small models, LoRA training
- 7B model QLoRA fine-tuning is feasible
- FlashAttention supported (Ada Lovelace architecture)
Startup team (~$10,000-50,000):
- 4-8× RTX 4090: small LLM experiments
- Or used A100 40GB/80GB × 1-4
- Hybrid cloud strategy recommended
Mid-size AI team (~$1M+):
- 8× H100 (DGX H100 level): $320,000+
- Or Lambda Labs/CoreWeave cloud
- Can train 100B+ parameter models
Large research institutions/enterprises:
- Hundreds to thousands of H100/H200 GPUs
- GB200 NVL72 rack systems
- Dedicated InfiniBand network
Consumer vs Datacenter
| Feature | RTX 4090 | H100 SXM |
|---|---|---|
| Memory | 24GB GDDR6X | 80GB HBM3 |
| Bandwidth | 1,008 GB/s | 3,350 GB/s |
| FP16 Perf | ~330 TFLOPS | 989 TFLOPS |
| ECC | Not supported | Supported |
| NVLink | Not supported | Supported |
| MIG | Not supported | Supported |
| TDP | 450W | 700W |
| Price | ~$1,600 | ~$30,000+ |
| Warranty | 3yr consumer | Enterprise |
10. GPU Monitoring and Optimization
Using nvidia-smi
nvidia-smi is the primary CLI tool for monitoring NVIDIA GPUs.
```shell
# Basic GPU status
nvidia-smi

# Real-time monitoring (1-second refresh)
watch -n 1 nvidia-smi

# CSV output for logging
nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,\
pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,\
utilization.memory,memory.total,memory.free,memory.used \
  --format=csv -l 1 > gpu_log.csv

# Per-process GPU memory usage
nvidia-smi pmon -s m

# GPU topology (NVLink connections)
nvidia-smi topo -m
```
PyTorch GPU Utilization Optimization
```python
import torch
from torch.utils.data import DataLoader

# Check GPU memory usage
print(f"Allocated memory: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved memory: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# Detailed memory analysis
print(torch.cuda.memory_summary())

# DataLoader optimization
dataloader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,            # Tune to CPU core count
    pin_memory=True,          # Pin CPU RAM for faster GPU transfer
    prefetch_factor=2,        # Batches prefetched per worker
    persistent_workers=True,  # Reuse worker processes across epochs
)

# CUDA streams for overlapping independent operations
stream1 = torch.cuda.Stream()
stream2 = torch.cuda.Stream()

with torch.cuda.stream(stream1):
    result1 = model_part1(data1)
with torch.cuda.stream(stream2):
    result2 = model_part2(data2)  # Runs concurrently with stream1

torch.cuda.synchronize()
```
GPU Utilization Optimization Checklist
Common causes of low GPU utilization (below 70%) and fixes:
- Data loading bottleneck: increase `num_workers`, set `pin_memory=True`
- Batch size too small: use gradient accumulation to increase the effective batch size
- CPU launch overhead (including the Python GIL): use CUDA Graphs to minimize per-step overhead
- Memory fragmentation: call `torch.cuda.empty_cache()` periodically
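Gradient accumulation, the second fix above, is a small loop change; the model, batch sizes, and micro-batch count here are illustrative:

```python
# Gradient accumulation: simulate a larger batch by summing gradients
# over several micro-batches before one optimizer step.
import torch
import torch.nn as nn

model = nn.Linear(32, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
accum_steps = 4
steps_taken = 0

optimizer.zero_grad()
for i in range(8):  # 8 micro-batches -> 2 optimizer steps
    x, y = torch.randn(16, 32), torch.randn(16, 1)
    loss = criterion(model(x), y) / accum_steps  # average over micro-batches
    loss.backward()                              # gradients accumulate in .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        steps_taken += 1
```

The effective batch size becomes micro-batch × accum_steps with no extra memory for activations beyond one micro-batch.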
```python
# CUDA Graphs to minimize CPU launch overhead.
# Note: capturing optimizer.step() requires a capture-safe optimizer
# (e.g. torch.optim.Adam(..., capturable=True)).
static_input = torch.randn(batch_size, input_size, device='cuda')
static_target = torch.randn(batch_size, output_size, device='cuda')

# Warmup on a side stream before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        optimizer.zero_grad(set_to_none=True)
        output = model(static_input)
        loss = criterion(output, static_target)
        loss.backward()
        optimizer.step()
torch.cuda.current_stream().wait_stream(s)

# Capture one full training step into a CUDA Graph
g = torch.cuda.CUDAGraph()
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    output = model(static_input)
    loss = criterion(output, static_target)
    loss.backward()
    optimizer.step()

# Run with real data: copy into the static buffers, then replay
for real_input, real_target in dataloader:
    static_input.copy_(real_input)
    static_target.copy_(real_target)
    g.replay()  # Replays the captured step with near-zero CPU overhead
```
Thermal Management
Datacenter GPUs generate hundreds of watts of heat. Thermal management directly impacts performance and longevity.
- H100 SXM TDP: 700W, max temperature: 83°C
- Thermal throttling automatically limits performance when exceeded
- DGX systems support direct liquid cooling
```shell
# Real-time GPU temperature monitoring
nvidia-smi dmon -s t

# Fan speed control (consumer GPUs)
nvidia-settings -a "[gpu:0]/GPUFanControlState=1" \
                -a "[fan:0]/GPUTargetFanSpeed=80"

# Power limit (prevents overheating at some performance cost)
sudo nvidia-smi -pl 300   # Limit board power to 300W
```
Multi-GPU Setup
For 4+ GPU configurations, correct settings matter.
```shell
# Check GPU connection topology (shows NVLink-connected pairs vs PCIe)
nvidia-smi topo -m

# Enable NCCL debug logging
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

# Verify NVLink link status
nvidia-smi nvlink --status

# NUMA optimization (multi-socket servers): keep the process on the
# NUMA node local to its GPUs
numactl --cpunodebind=0 --membind=0 python train.py
```

```python
# PyTorch distributed training basics
import torch
import torch.distributed as dist

def setup(rank, world_size):
    dist.init_process_group(
        backend='nccl',   # NVIDIA GPUs; on AMD ROCm builds 'nccl' maps to RCCL
        init_method='env://',
        world_size=world_size,
        rank=rank,
    )
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

# Launch with torchrun:
# torchrun --nproc_per_node=8 --nnodes=2 --rdzv_id=100 \
#   --rdzv_backend=c10d --rdzv_endpoint=host:29400 train.py
```
Conclusion: Practical Principles for GPU Selection
The most important thing in GPU selection is accurately understanding your own workload.
- Memory capacity first: If your model and batch don't fit in GPU memory, nothing else matters.
- Bandwidth vs compute: Training tends to be compute-bound; large model inference tends to be memory-bound.
- Cloud-first strategy: If unsure, start with cloud to understand requirements before investing in on-premises hardware.
- Ecosystem matters: NVIDIA's CUDA ecosystem remains overwhelmingly mature. AMD ROCm is catching up fast.
- Power consumption: over a multi-year lifetime, electricity and cooling for an on-premises GPU cluster can rival the hardware purchase price.
AI infrastructure is evolving rapidly. Blackwell makes FP4 quantization a reality; rack-scale systems like GB200 NVL72 are redefining the training and inference of large models. Keep a close eye on hardware developments and make choices optimized for your requirements.