AI를 위한 GPU 하드웨어 완전 가이드: 아키텍처부터 선택 기준까지

시작하며

AI와 딥러닝의 폭발적인 성장을 이끈 핵심 하드웨어는 단연 GPU입니다. GPT-4, Llama 3, Gemini 같은 대형 언어 모델을 학습하려면 수천 개의 GPU가 수주에서 수개월에 걸쳐 동작해야 합니다. 그렇다면 왜 GPU가 AI에 이토록 중요할까요? 어떤 GPU를 선택해야 할까요?

이 가이드는 AI 엔지니어, 연구자, ML 프랙티셔너를 위해 GPU 하드웨어의 모든 것을 다룹니다. 아키텍처의 기초부터 최신 Blackwell GPU, 클라우드 서비스 비교, 실전 선택 가이드까지 한 번에 정리했습니다.

1. GPU vs CPU: AI 학습에 GPU가 필요한 이유

병렬 연산: AI의 본질

딥러닝의 핵심 연산은 행렬 곱셈(matrix multiplication)입니다. 뉴럴 네트워크의 순전파(forward pass)와 역전파(backward pass) 모두 수십억 개의 곱셈과 덧셈으로 이루어져 있습니다. 이 연산들은 서로 독립적으로 수행될 수 있기 때문에 병렬화에 완벽하게 적합합니다.

CPU는 고성능 직렬 처리에 최적화되어 있습니다. 일반적인 서버 CPU는 64개에서 128개 정도의 코어를 가지며, 각 코어는 복잡한 제어 로직, 대용량 캐시, 분기 예측 등의 기능을 갖추고 있습니다. 이는 순차적인 작업, 복잡한 조건 분기, 운영체제 관리 같은 작업에 탁월합니다.

반면 GPU는 수천 개에서 수만 개의 소형 코어를 탑재하여 동시에 같은 연산을 수행하는 SIMD(Single Instruction, Multiple Data) 방식으로 동작합니다. NVIDIA H100은 무려 16,896개의 CUDA 코어를 가지고 있습니다. 행렬 곱셈처럼 같은 연산을 반복적으로 수행하는 작업에서 GPU는 CPU 대비 수백 배의 처리량을 보여줍니다.

FLOPS: 연산 성능의 척도

딥러닝 성능을 논할 때 가장 자주 등장하는 단위는 FLOPS(Floating Point Operations Per Second)입니다.

TFLOPS(테라플롭스): 초당 1조 번의 부동소수점 연산
PFLOPS(페타플롭스): 초당 1,000조 번의 부동소수점 연산

현대 AI 워크로드에서는 주로 다음 정밀도를 사용합니다:

FP32 (단정밀도 부동소수점): 학습의 마스터 가중치 저장
FP16 (반정밀도 부동소수점): 혼합 정밀도 학습
BF16 (Brain Float 16): FP16보다 안정적인 학습
TF32 (TensorFloat-32): NVIDIA A100 이후 지원
FP8: Hopper, Blackwell에서 지원, 추론/학습 모두 활용
FP4: Blackwell에서 새롭게 지원, 초고밀도 추론

NVIDIA H100의 경우 FP16 Tensor Core 성능이 무려 989 TFLOPS(약 1 PFLOPS)에 달합니다.

메모리 대역폭: 병목을 결정하는 요소

많은 AI 워크로드는 연산보다 메모리 대역폭에 의해 성능이 제한됩니다. 이를 메모리 바운드(memory-bound) 연산이라 합니다.

Large Language Model(LLM) 추론을 예로 들면, 토큰을 하나씩 생성할 때마다 전체 모델 가중치를 메모리에서 읽어야 합니다. Llama 3 70B 모델은 FP16으로 약 140GB의 메모리를 차지합니다. 초당 수십 개의 토큰을 생성하려면 초당 수 TB의 메모리 읽기가 필요합니다.

최신 GPU들의 메모리 대역폭:

NVIDIA A100 SXM: 2,000 GB/s
NVIDIA H100 SXM: 3,350 GB/s
NVIDIA H200 SXM: 4,800 GB/s
NVIDIA B200: 8,000 GB/s (예상)

이것이 HBM(High Bandwidth Memory)이 데이터센터 GPU에 채택된 이유입니다. HBM은 기존 GDDR 메모리 대비 훨씬 높은 대역폭을 제공합니다.

2. NVIDIA GPU 아키텍처 발전사

Pascal 아키텍처 (2016): AI 르네상스의 시작

The Pascal architecture, named after the 17th-century French mathematician and physicist Blaise Pascal, arrived in 2016. The GTX 1080 (consumer) and the P100 (datacenter) are built on it.

P100의 주요 사양:

CUDA 코어: 3,584개
FP32 성능: 9.3 TFLOPS
FP16 성능: 18.7 TFLOPS
메모리: 16GB HBM2, 720 GB/s

P100은 최초로 HBM2 메모리를 채택한 데이터센터 GPU였습니다. NVLink 1.0도 이 시기에 처음 도입되었습니다. 이 시기에 AlphaGo가 이세돌을 이겼고, AI 붐이 본격화되었습니다.

Volta 아키텍처 (2017): Tensor Core의 등장

볼타(Volta) 아키텍처는 GPU 역사의 전환점입니다. 2017년 등장한 V100은 Tensor Core를 세계 최초로 도입했습니다. Tensor Core는 행렬 곱셈을 하드웨어 수준에서 가속하는 전용 유닛입니다.

V100의 주요 사양:

CUDA 코어: 5,120개
1세대 Tensor Core: 640개
FP32 성능: 14 TFLOPS
FP16 Tensor Core 성능: 112 TFLOPS (8배 향상!)
메모리: 32GB HBM2, 900 GB/s
NVLink 2.0: 300 GB/s

Tensor Core 하나는 한 사이클에 4x4 행렬 곱셈(D = A*B + C)을 수행합니다. 이로 인해 FP16 성능이 FP32 대비 8배 향상되었고, 딥러닝 학습 속도가 혁명적으로 빨라졌습니다.

Turing 아키텍처 (2018): RT Core와 DLSS

튜링(Turing) 아키텍처는 게임용 RTX 시리즈로 유명합니다. RT Core(Ray Tracing 전용 유닛)가 처음 등장했고, AI 기반 이미지 업스케일링인 DLSS도 이 시기에 등장했습니다.

RTX 2080 Ti의 주요 사양:

CUDA 코어: 4,352개
Tensor Core: 544개 (2세대)
FP32 성능: 13.4 TFLOPS
FP16 Tensor Core: 107 TFLOPS
메모리: 11GB GDDR6, 616 GB/s

AI 관점에서 Turing의 의의는 INT8 양자화 추론 지원입니다. 추론 서버에서 모델을 INT8로 양자화하면 FP16 대비 2배 빠른 추론이 가능해졌습니다.

Ampere 아키텍처 (2020): A100과 3세대 Tensor Core

앰페어(Ampere) 아키텍처는 다시 한번 패러다임을 바꿨습니다. A100은 현재도 많은 데이터센터에서 사용 중인 워크호스(workhorse) GPU입니다.

A100 SXM4 80GB의 주요 사양:

CUDA 코어: 6,912개
3세대 Tensor Core: 432개
FP32 성능: 19.5 TFLOPS
FP16 Tensor Core: 312 TFLOPS
TF32 Tensor Core: 156 TFLOPS
BF16 Tensor Core: 312 TFLOPS
INT8 Tensor Core: 624 TOPS
메모리: 80GB HBM2e, 2,000 GB/s
NVLink 3.0: 600 GB/s

Ampere의 핵심 혁신들:

TF32 (TensorFloat-32): FP32와 FP16의 중간 형태. 지수부는 FP32(8비트), 가수부는 FP16(10비트). 기존 FP32 코드를 그대로 실행하면서 Tensor Core의 속도를 활용 가능. 수치 안정성과 속도의 균형점.

Sparsity 지원: A100은 2:4 구조적 희소성(structured sparsity)을 하드웨어에서 지원합니다. 모델 파라미터의 50%를 0으로 만들면(프루닝), Tensor Core가 이를 활용해 2배 추가 성능을 제공합니다. INT8 기준 이론적으로 1,248 TOPS까지 가능.

Multi-Instance GPU (MIG): A100은 최대 7개의 독립적인 GPU 인스턴스로 분할 가능. 추론 서버에서 여러 소형 모델을 격리된 환경에서 실행할 때 유용.

Hopper 아키텍처 (2022): Transformer Engine

호퍼(Hopper) 아키텍처는 트랜스포머 모델에 최적화된 혁신을 가져왔습니다. H100은 현재 가장 널리 사용되는 최고급 AI 학습 GPU입니다.

H100 SXM5 80GB의 주요 사양:

CUDA 코어: 16,896개
4세대 Tensor Core: 528개
FP32 성능: 60 TFLOPS
FP16/BF16 Tensor Core: 989 TFLOPS (약 1 PFLOPS!)
FP8 Tensor Core: 1,979 TFLOPS (약 2 PFLOPS)
메모리: 80GB HBM3, 3,350 GB/s
NVLink 4.0: 900 GB/s
TDP: 700W

H100의 핵심 혁신들:

Transformer Engine: 트랜스포머 모델의 어텐션 레이어와 MLP 레이어를 하드웨어 수준에서 최적화. FP8과 FP16을 레이어별로 자동 전환. H100 첫 출시 기준 A100 대비 최대 9배 AI 성능.

FP8 지원: E4M3과 E5M2 두 가지 FP8 포맷 지원. FP16 대비 2배 Tensor Core 성능. 메모리 사용량 절반.

Thread Block Clusters: SM(Streaming Multiprocessor)들이 공유 메모리처럼 서로 통신. 분산 공유 메모리(distributed shared memory) 가능.

NVLink 4.0: 이전 세대 대비 1.5배 향상된 900 GB/s. 최대 8개 GPU를 full-mesh로 연결.

H200 SXM 141GB의 주요 사양:

컴퓨트: H100과 동일
메모리: 141GB HBM3e, 4,800 GB/s
메모리 대역폭 43% 향상
LLM 추론에서 최대 2배 성능 향상 (메모리 바운드 워크로드)

Blackwell 아키텍처 (2024): 차세대 AI 가속

블랙웰(Blackwell) 아키텍처는 2024년에 발표된 NVIDIA의 최신 아키텍처입니다.

B200 SXM의 주요 사양:

컴퓨트: 20 PFLOPS (FP4 기준)
FP8 성능: 9 PFLOPS
메모리: 192GB HBM3e, 8,000 GB/s
NVLink 5.0: 1,800 GB/s

Blackwell의 핵심 혁신들:

FP4 지원: 4비트 부동소수점 지원으로 초고밀도 추론 가능. FP8 대비 2배 이상의 처리량.

2nd Generation Transformer Engine: FP4, FP6 등 새로운 정밀도 포맷 자동 관리.

NVLink 5.0: 이전 세대 대비 2배 향상된 1,800 GB/s.

GB200 NVL72: 36개의 Grace CPU와 72개의 B200 GPU를 단일 랙 스케일 시스템으로 연결. NVLink로 모든 GPU가 연결되어 사실상 하나의 거대한 GPU처럼 동작. 1.4 exaFLOPS (FP4) 달성.

3. Tensor Core 상세 분석

CUDA Core vs Tensor Core

CUDA Core는 범용 부동소수점 연산 유닛입니다. 한 클럭 사이클에 하나의 FP32 곱셈-덧셈(FMA: Fused Multiply-Add) 연산을 처리합니다.

Tensor Core는 행렬 곱셈 전용 유닛입니다. 1세대 Tensor Core는 한 클럭 사이클에 4x4 FP16 행렬 곱셈(D = A * B + C)을 처리합니다. 이는 64개의 FP16 곱셈과 64개의 FP16 덧셈, 총 128 FP16 연산을 한 사이클에 수행하는 것과 같습니다.

세대별 Tensor Core 진화:

세대	아키텍처	지원 정밀도	행렬 크기	특이사항
1세대	Volta	FP16	4x4	최초 Tensor Core
2세대	Turing	FP16, INT8, INT4	-	INT 지원 추가
3세대	Ampere	FP16, BF16, TF32, INT8, INT4	-	TF32, Sparsity
4세대	Hopper	FP16, BF16, TF32, FP8, INT8	-	FP8, Transformer Engine
5세대	Blackwell	FP16, BF16, TF32, FP8, FP4	-	FP4 지원

WMMA (Warp Matrix Multiply-Accumulate)

CUDA 프로그래밍에서 Tensor Core를 직접 활용하려면 WMMA API를 사용합니다.

#include <mma.h>
using namespace nvcuda::wmma;

// 16x16x16 FP16 행렬 곱셈
fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;
fragment<matrix_b, 16, 16, 16, half, col_major> b_frag;
fragment<accumulator, 16, 16, 16, float> c_frag;

fill_fragment(c_frag, 0.0f);

// 행렬 로드
load_matrix_sync(a_frag, a_ptr, 16);
load_matrix_sync(b_frag, b_ptr, 16);

// Tensor Core 곱셈 실행
mma_sync(c_frag, a_frag, b_frag, c_frag);

// 결과 저장
store_matrix_sync(c_ptr, c_frag, 16, mem_row_major);

실제로는 cuBLAS나 PyTorch가 이를 자동으로 처리해줍니다.

혼합 정밀도 학습 (Mixed Precision Training)

혼합 정밀도 학습은 FP32 마스터 가중치를 유지하면서 순전파/역전파를 FP16 또는 BF16으로 수행하는 기법입니다.

# PyTorch AMP (Automatic Mixed Precision) with BF16
import torch

for inputs, target in dataloader:
    optimizer.zero_grad(set_to_none=True)

    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        output = model(inputs)
        loss = criterion(output, target)

    # BF16 shares FP32's exponent range, so no loss scaling is needed.
    # With dtype=torch.float16, wrap backward()/step() in a GradScaler instead.
    loss.backward()
    optimizer.step()

BF16은 FP16보다 학습 안정성이 높습니다. FP16은 지수부가 5비트라 표현 범위가 좁아 오버플로우가 발생할 수 있지만, BF16은 지수부가 8비트(FP32와 동일)라 훨씬 넓은 범위를 표현합니다.

Sparsity 지원

A100부터 지원되는 2:4 구조적 희소성은 파라미터를 4개 단위로 묶어 그 중 2개를 0으로 만드는 방식입니다.

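# Build a 2:4 mask by hand: in every group of 4 weights, keep the 2 largest magnitudes
import torch

w = model.layer.weight.data                  # assumes in_features is divisible by 4
groups = w.abs().reshape(-1, 4)
keep = groups.topk(2, dim=1).indices
mask = torch.zeros_like(groups).scatter_(1, keep, 1.0).reshape(w.shape)
model.layer.weight.data = w * mask

Note that prune.ln_structured removes whole rows or channels, which is not the 2:4 pattern. Masking alone also gives no speedup: the up-to-2x inference gain is a theoretical ceiling that requires kernels which actually use Sparse Tensor Cores, and accuracy usually needs fine-tuning after pruning.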

4. GPU 메모리 계층

GDDR vs HBM: 게임체인저

GDDR6 (Graphics DDR6): 소비자용 GPU에 사용. 패키지 외부에 별도의 칩으로 탑재. RTX 4090의 경우 24GB GDDR6X, 1,008 GB/s.

HBM2e (High Bandwidth Memory 2e): 데이터센터 GPU에 사용. GPU 다이 옆에 2.5D 방식으로 적층. 실리콘 인터포저로 연결. A100의 경우 80GB HBM2e, 2,000 GB/s.

HBM3: H100에 탑재. 80GB, 3,350 GB/s.

HBM3e: H200에 탑재. 141GB, 4,800 GB/s. B200에도 탑재 예정, 192GB, 8,000 GB/s.

왜 HBM이 더 빠를까요? HBM은 여러 겹의 DRAM 다이를 수직으로 적층하고 수천 개의 미세한 연결(Through-Silicon Vias, TSV)로 연결합니다. GPU 다이와 HBM 스택이 실리콘 인터포저 위에 나란히 놓여 극히 짧은 거리에서 초광대역 연결을 제공합니다.

메모리 계층 구조

GPU의 메모리 계층은 다음과 같습니다:

레지스터 (Register File)
    └── 가장 빠름, 쓰레드당 수십~수백 개
L1 캐시 / 공유 메모리 (Shared Memory)
    └── SM(Streaming Multiprocessor) 내 공유
    └── H100: SM당 228KB
L2 캐시
    └── 모든 SM이 공유
    └── H100: 50MB
HBM (주 메모리)
    └── 전체 GPU가 접근
    └── H100: 80GB

커널 최적화의 핵심은 데이터를 공유 메모리에 최대한 오래 유지하여 느린 HBM 접근을 최소화하는 것입니다. Flash Attention이 이 원리를 어텐션 연산에 적용한 대표적 사례입니다.

ECC 메모리

ECC(Error-Correcting Code) 메모리는 비트 오류를 감지하고 수정하는 기능입니다. 데이터센터 GPU(A100, H100 등)는 ECC를 지원합니다. 소비자용 GPU(RTX 4090)는 지원하지 않습니다.

장시간 학습 시 메모리 오류가 발생하면 학습이 잘못된 방향으로 진행되거나 NaN이 발생할 수 있습니다. 중요한 학습 작업에는 ECC 지원 GPU가 권장됩니다. ECC 활성화 시 유효 메모리 용량이 약 6.25% 감소합니다.

5. 멀티 GPU 연결: NVLink & NVSwitch

PCIe의 한계

일반적인 PCIe 4.0 x16 슬롯은 최대 32 GB/s 대역폭을 제공합니다(양방향 64 GB/s). 멀티 GPU 학습에서 그래디언트를 동기화할 때 이 대역폭이 병목이 됩니다.

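Consider an All-Reduce across 4 GPUs. A 10B-parameter model carries about 40GB of FP32 gradients, and a ring All-Reduce moves roughly 2(N-1)/N times that amount, about 60GB, through each GPU. At 32 GB/s that is around 2 seconds of pure transfer time per synchronization, on every training step, before any contention on shared PCIe switches.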

NVLink 발전사

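Version	Architecture	Per-link BW (bidirectional)	Links per GPU	Total BW per GPU (bidirectional)
1.0	Pascal (P100)	40 GB/s	4	160 GB/s
2.0	Volta (V100)	50 GB/s	6	300 GB/s
3.0	Ampere (A100)	50 GB/s	12	600 GB/s
4.0	Hopper (H100)	50 GB/s	18	900 GB/s
5.0	Blackwell (B200)	100 GB/s	18	1,800 GB/s

NVLink 4.0 gives each H100 up to 900 GB/s of aggregate bidirectional bandwidth, about 14x the 64 GB/s of PCIe 4.0 x16.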

NVSwitch: All-to-All 연결

NVLink는 두 GPU를 직접 연결하지만, 8개 이상의 GPU를 연결하려면 NVSwitch가 필요합니다. NVSwitch는 GPU 연결 전용 스위치 칩으로, 연결된 모든 GPU가 전체 NVLink 대역폭으로 서로 직접 통신할 수 있게 합니다.

DGX H100 시스템 구성:

8개의 H100 SXM5 GPU
NVSwitch 4.0 × 4개
모든 GPU 쌍이 900 GB/s로 직접 연결
NVLink All-to-All 총 대역폭: 7.2 TB/s

DGX A100 vs DGX H100

DGX A100:

GPU: 8× A100 80GB
NVLink 총 대역폭: 4.8 TB/s
GPU 메모리: 640GB
AI 성능: 5 PFLOPS (FP16)

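DGX H100:

GPU: 8× H100 80GB
NVLink total bandwidth: 7.2 TB/s
GPU memory: 640GB
AI performance: 32 PFLOPS (FP8, headline figure)
About 6.4x the DGX A100 headline figure, although that compares FP8 on the H100 with FP16 on the A100; at the same FP16 precision the gap is about 3.2x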

InfiniBand: 노드 간 연결

단일 노드 내에서는 NVLink를 사용하지만, 여러 서버 노드를 연결할 때는 InfiniBand(IB) 네트워크를 사용합니다. NVIDIA ConnectX-7 NIC와 InfiniBand NDR(400 Gb/s)을 사용하면 서버 간 통신 지연을 최소화할 수 있습니다.

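Large-scale LLM training requires connecting thousands of GPUs. Meta reported training Llama 3 on up to 16,000 H100 GPUs, which required a purpose-built cluster network; Meta has described building such clusters on both InfiniBand and RoCE (RDMA over Converged Ethernet) fabrics.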

6. 주요 AI GPU 상세 비교

NVIDIA A100 (80GB HBM2e)

2020년 출시된 A100은 현재도 많은 AI 워크로드의 표준입니다. FP16 312 TFLOPS, BF16 312 TFLOPS, TF32 156 TFLOPS를 제공합니다.

SXM4 폼 팩터는 NVLink 3.0으로 최대 8개 GPU를 연결 가능하며, PCIe 4.0 버전도 있습니다. MIG(Multi-Instance GPU) 기능으로 최대 7개 인스턴스로 분할 가능합니다.

Cloud hourly cost: about $32.77/hour for AWS p4d.24xlarge (8× A100). Note that p4d uses the 40GB A100; the 80GB variant is offered as p4de.

NVIDIA H100 (80GB HBM3)

2022년 출시. 현재 최고 성능의 널리 보급된 AI GPU입니다.

SXM5 버전:

FP16/BF16 Tensor Core: 989 TFLOPS
FP8 Tensor Core: 1,979 TFLOPS
메모리: 80GB HBM3
대역폭: 3,350 GB/s
TDP: 700W
NVLink 4.0: 900 GB/s

PCIe version:

FP16/BF16 Tensor Core: 756 TFLOPS
Memory: 80GB HBM2e
Bandwidth: 2,000 GB/s
TDP: 350W

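H100 vs A100:

Tensor Core performance: 3.2x (FP16, 989 vs 312 TFLOPS)
FP8 performance: about 6.3x the A100's FP16 rate, or about 3.2x its INT8 rate (the A100 has no FP8 support)
Memory bandwidth: 1.7x (HBM3)
NVLink bandwidth: 1.5x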

클라우드 시간당 비용: AWS p5.48xlarge(H100 8개) 기준 약 $98.32/시간.

NVIDIA H200 (141GB HBM3e)

H100의 메모리 업그레이드 버전. 컴퓨트 성능은 H100과 동일하지만 메모리 용량과 대역폭이 크게 향상되었습니다.

메모리: 141GB HBM3e (H100 대비 76% 증가)
대역폭: 4,800 GB/s (H100 대비 43% 증가)
LLM 추론에서 H100 대비 최대 2배 빠른 처리량
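Fits larger models on a single GPU (a 70B LLM needs about 140GB for FP16 weights alone, so FP8 or lower precision is needed to leave room for the KV cache)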

NVIDIA B100 / B200 (Blackwell)

2024년 발표. 아직 본격 보급 단계.

B200 SXM:

FP4 Tensor Core: 20 PFLOPS
FP8 Tensor Core: 9 PFLOPS
FP16/BF16 Tensor Core: 4.5 PFLOPS
메모리: 192GB HBM3e
대역폭: 8,000 GB/s
TDP: 1,000W

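B200 vs H100:

FP8 performance: 4.5x using the 9 PFLOPS headline figure; that figure appears to assume structured sparsity, so the like-for-like dense ratio is likely closer to 2.3x
Memory: 2.4x
Bandwidth: 2.4x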

NVIDIA GB200 NVL72 (Rack-Scale AI)

The GB200 is a superchip that combines one Grace CPU (Arm-based) with two B200 GPUs. The GB200 NVL72 integrates 36 Grace CPUs and 72 B200 GPUs, that is 36 GB200 superchips, into a single rack.

GB200 NVL72 specs:

GPU: 72× B200
CPU: 36× Grace
GPU 메모리: 13.8TB HBM3e
NVLink 5.0 All-to-All 연결
AI 성능: 1.4 ExaFLOPS (FP4), 720 PFLOPS (FP8)
총 전력: 120kW

이는 사실상 하나의 거대한 GPU처럼 동작합니다. 단일 랙에서 Llama 3 405B 같은 대형 모델을 고속으로 처리할 수 있습니다.

GeForce RTX 4090 (소비자용)

AI 스타트업이나 개인 연구자를 위한 최선의 소비자급 GPU입니다.

CUDA 코어: 16,384개
FP32 성능: 82.6 TFLOPS
FP16 Tensor Core: 330 TFLOPS (대략적)
메모리: 24GB GDDR6X
대역폭: 1,008 GB/s
TDP: 450W
가격: 약 $1,599 (MSRP)

H100 SXM 대비:

Tensor Core 성능: 약 1/3
메모리: 24GB vs 80GB
대역폭: 1,008 vs 3,350 GB/s
ECC: 미지원
NVLink: 미지원 (PCIe only)
가격: 약 1/20 (H100은 $30,000+)

AMD MI300X

AMD의 데이터센터 AI GPU.

계산 유닛(CU): 304개
FP16 performance: 1,307 TFLOPS
BF16 performance: 1,307 TFLOPS
FP8 performance: 2,614 TFLOPS
메모리: 192GB HBM3
대역폭: 5,300 GB/s
TDP: 750W

MI300X는 H100 대비 메모리 용량(192GB vs 80GB)과 대역폭(5,300 vs 3,350 GB/s)에서 우위를 점합니다. LLM 추론에서 특히 유리합니다.

GPU 성능 비교 표

GPU	FP16 TFLOPS	메모리	대역폭	TDP	출시연도
A100 SXM	312	80GB HBM2e	2,000 GB/s	400W	2020
RTX 4090	~330	24GB GDDR6X	1,008 GB/s	450W	2022
H100 SXM	989	80GB HBM3	3,350 GB/s	700W	2022
MI300X	1,307	192GB HBM3	5,300 GB/s	750W	2023
H200 SXM	989	141GB HBM3e	4,800 GB/s	700W	2024
B200 SXM	4,500	192GB HBM3e	8,000 GB/s	1,000W	2024

7. AMD GPU for AI

ROCm 에코시스템

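AMD's AI software stack is ROCm (Radeon Open Compute). It is an open-source platform that mirrors the CUDA stack component by component; it does not run CUDA binaries, but its HIP layer provides source-level portability for CUDA C++ code. Compatibility with major frameworks such as PyTorch and TensorFlow has improved considerably in recent releases.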

ROCm의 CUDA 대응 구성:

HIP(Heterogeneous-compute Interface for Portability): CUDA C++ 대응
rocBLAS: cuBLAS 대응 (행렬 연산)
MIOpen: cuDNN 대응 (딥러닝 기본 연산)
rccl: NCCL 대응 (GPU 간 통신)

PyTorch ROCm 설치:

# ROCm 지원 PyTorch 설치
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0

MI300X 주요 특징

MI300X는 현재 AMD의 플래그십 AI GPU입니다. CDNA 3 아키텍처를 사용하며, GPU 다이와 HBM 스택을 3D 방식으로 통합하는 advanced packaging 기술(MCM: Multi-Chip Module)을 사용합니다.

MI300X는 192GB HBM3 메모리와 5,300 GB/s 대역폭으로 LLM 추론에서 H100을 능가하는 경우가 있습니다. 특히 메모리 바운드 워크로드(대형 배치, 긴 시퀀스)에서 유리합니다.

Microsoft Azure, Oracle Cloud 등에서 MI300X 인스턴스를 제공하기 시작했습니다. Meta, Microsoft 등 대형 테크 기업들이 AMD GPU를 적극 도입하고 있습니다.

AMD vs NVIDIA 소프트웨어 생태계

솔직히 말하면, 현재 AI 소프트웨어 생태계는 여전히 NVIDIA CUDA가 압도적입니다.

NVIDIA 전용 라이브러리: cuDNN, cuBLAS, TensorRT, NCCL, NVTX 등
FlashAttention은 원래 CUDA 전용으로 개발 (ROCm 포팅은 후발)
많은 연구 코드가 CUDA 가정하에 작성됨
ROCm은 지속적으로 격차를 좁히고 있지만 완전 호환은 아직

프로덕션 환경에서 AMD GPU를 선택할 경우 소프트웨어 호환성 문제에 추가 엔지니어링 시간이 필요할 수 있습니다.

8. 클라우드 GPU 서비스 비교

AWS GPU 인스턴스

p3 시리즈 (V100):

p3.2xlarge: V100 1개, $3.06/h
p3.16xlarge: V100 8개, $24.48/h

p4d 시리즈 (A100):

p4d.24xlarge: A100 8개, 320GB HBM2, $32.77/h

p5 시리즈 (H100):

p5.48xlarge: H100 8개, 640GB HBM3, $98.32/h

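AWS Trainium (Trn1):

AWS's in-house AI training chip (Trn1 uses first-generation Trainium; Trainium2 ships in the newer Trn2 instances)
trn1.32xlarge: 16× Trainium, $21.50/h
AWS positions it as offering better price-performance than GPU instances for LLM training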

AWS Inferentia (Inf2):

추론 전용 칩
Inf2.48xlarge: 12× Inferentia2, $12.98/h
Llama 2 70B 추론에 최적화

스팟 인스턴스를 활용하면 온디맨드 대비 60-90% 절약 가능. 단, 중단될 수 있어 체크포인팅이 필수.

Google Cloud GPU 인스턴스

A100 인스턴스:

a2-highgpu-1g: A100 1개 (40GB), $3.67/h
a2-megagpu-16g: A100 16개, $55.74/h

H100 인스턴스:

a3-highgpu-8g: H100 8개, ~$19-25/h (리전에 따라 다름)

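Google TPU v4/v5:

TPU v4: ASIC specialized for AI training, 275 TFLOPS/chip (BF16)
TPU v5e: optimized for cost-efficient large-scale inference
TPU v5p: latest training-focused generation, 459 TFLOPS/chip (BF16)
Best supported through JAX, the framework developed at Google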

Azure GPU 인스턴스

ND H100 v5:

Standard_ND96isr_H100_v5: H100 8개
InfiniBand NDR로 노드 간 연결

NCas_T4_v3:

T4 기반 추론 인스턴스
Standard_NC64as_T4_v3: T4 4개, $4.35/h

Lambda Labs, CoreWeave, Vast.ai

클라우드 스타트업들이 AWS/GCP/Azure보다 저렴한 GPU를 제공합니다.

Lambda Labs:

H100 SXM5 8× 인스턴스: $26.80/h (AWS p5 대비 73% 저렴)
Lambda Cloud는 AI 연구자에게 특화된 서비스

CoreWeave:

전문 GPU 클라우드
H100 단일: $2.89/h
대규모 클러스터 구성 가능

Vast.ai:

GPU 마켓플레이스 (개인/기업이 GPU를 렌탈)
H100 시간당 $2-3대 (시장 가격)
보안 민감도가 낮은 실험적 학습에 적합

클라우드 GPU 비용 계산

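LLM training cost example (Llama 3 8B-class model, 8× A100, 100B tokens):

Estimated training time:
- Compute approximation: FLOPs ≈ 6 × parameters × tokens = 6 × 8B × 100B ≈ 4.8 × 10^21
- 8× A100 at 312 TFLOPS peak (BF16) and an assumed 40% utilization sustain about 1 PFLOPS, so the run takes roughly 55 days (about 1,335 hours)
- p4d.24xlarge: $32.77/h × 1,335h = about $43,800

Ways to save:
1. Spot instances: 70% savings → about $13,100
2. Lambda Labs: $11.60/h × 1,335h = about $15,500
3. CoreWeave: potentially cheaper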
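The cost arithmetic above can be reproduced with a few lines of Python. This is a rough sketch, not a pricing tool: it uses the common 6 × parameters × tokens approximation for training FLOPs, the 312 TFLOPS A100 peak quoted earlier, and an assumed 40% hardware utilization (the `mfu` argument is my assumption, not a figure from this guide). Under these assumptions an 8B model on 100B tokens takes on the order of eight weeks on eight A100s, so the result is very sensitive to the utilization you actually reach.

```python
def training_hours(params: float, tokens: float, num_gpus: int,
                   peak_flops: float, mfu: float) -> float:
    """Wall-clock hours from the 6*N*D compute approximation."""
    total_flops = 6 * params * tokens
    sustained = num_gpus * peak_flops * mfu   # FLOP/s the cluster actually delivers
    return total_flops / sustained / 3600


def training_cost(hours: float, price_per_hour: float, discount: float = 0.0) -> float:
    """Instance cost in dollars, with an optional fractional discount (e.g. spot)."""
    return hours * price_per_hour * (1.0 - discount)


hours = training_hours(params=8e9, tokens=100e9, num_gpus=8,
                       peak_flops=312e12, mfu=0.40)   # mfu is an assumed value
print(f"{hours:.0f} h ({hours / 24:.1f} days)")
print(f"on-demand at $32.77/h: ${training_cost(hours, 32.77):,.0f}")
print(f"spot, 70% off:         ${training_cost(hours, 32.77, 0.70):,.0f}")
```

Halving the assumed utilization doubles both the time and the bill, which is why measuring real throughput on a short run is worth doing before committing to a budget.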

9. GPU 선택 가이드

학습 vs 추론

**학습 (Training)**에 중요한 요소:

고성능 Tensor Core (BF16/FP8)
충분한 메모리 (배치 크기, 그래디언트, 옵티마이저 상태)
NVLink 대역폭 (멀티 GPU 그래디언트 동기화)
ECC 메모리 (안정성)

**추론 (Inference)**에 중요한 요소:

메모리 대역폭 (KV 캐시 읽기 속도)
메모리 용량 (모델 + KV 캐시)
INT8/FP8/FP4 지원
MIG (여러 작은 모델 격리 실행)

모델 크기별 GPU 요구사항

FP16 기준 모델 크기와 필요 메모리:

모델 크기	파라미터 메모리	학습 메모리	필요 GPU
7B	14GB	~56GB	H100 1개 (80GB)
13B	26GB	~104GB	H100 2개
70B	140GB	~560GB	H100 8개
405B	810GB	~3.2TB	H100 40개+
1T	2TB	~8TB	H100 100개+

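Training memory ≈ parameter memory × 4 (FP16 parameters, gradients, and the two Adam moments, all held in 16-bit, about 8 bytes per parameter) + activations. This is a lower bound: the common mixed-precision setup with an FP32 master copy and FP32 Adam moments needs about 16 bytes per parameter, roughly double the figures in the table.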
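The table can be recomputed with a small helper. The sketch below is illustrative only: the two byte-per-parameter schemes are my own labels (`ALL_16BIT` reproduces the table's ×4 rule, `FP32_STATE` is the usual mixed-precision accounting with FP32 master weights and Adam moments), and activations are left out entirely.

```python
import math

# Bytes per parameter under two accounting schemes (activations excluded).
ALL_16BIT = 2 + 2 + 2 + 2          # weights, grads, Adam m, Adam v, all 16-bit
FP32_STATE = 2 + 2 + 4 + 4 + 4     # 16-bit weights and grads + FP32 master copy, m, v


def state_memory_gb(params_billion: float, bytes_per_param: int) -> float:
    """Memory for weights, gradients and optimizer state, in GB."""
    return params_billion * bytes_per_param


def min_gpus(memory_gb: float, gpu_gb: float = 80) -> int:
    """Smallest GPU count whose combined memory holds the state."""
    return math.ceil(memory_gb / gpu_gb)


for size in (7, 13, 70, 405):
    low = state_memory_gb(size, ALL_16BIT)
    high = state_memory_gb(size, FP32_STATE)
    print(f"{size}B: {low:.0f}-{high:.0f} GB of state, "
          f"at least {min_gpus(low)}-{min_gpus(high)} x 80GB GPUs")
```

The GPU counts are floors on capacity alone; practical deployments round up to whole nodes and need extra headroom for activations and communication buffers.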

메모리 절약 기법:

Gradient Checkpointing: 활성화 메모리 대폭 절약 (속도 20-30% 희생)
FSDP/ZeRO: 파라미터, 그래디언트, 옵티마이저 상태를 GPU 간 분산
Flash Attention: 어텐션 계산 시 메모리 O(N²) → O(N)
FP8 학습: 메모리 사용량 절반

예산별 추천

개인 연구자 (예산: 200-300만원):

RTX 4090 (24GB, ~$1,600): 소형 모델 파인튜닝, LoRA 학습
7B 모델 QLoRA 파인튜닝 가능
FlashAttention 지원 (Ada Lovelace 아키텍처)

스타트업 팀 (예산: 1,000-5,000만원):

RTX 4090 × 4-8개: 소규모 LLM 실험
또는 A100 40GB/80GB × 1-4개 중고 구매
클라우드 혼용 전략 추천

중견 AI 팀 (예산: 1억-10억원):

H100 × 8개 (DGX H100 수준): $320,000+
또는 Lambda Labs/CoreWeave 클라우드 활용
100B+ 파라미터 모델 학습 가능

대형 연구기관/기업:

H100/H200 클러스터 수백-수천 개
GB200 NVL72 랙 시스템
전용 InfiniBand 네트워크

소비자용 vs 데이터센터용

특성	RTX 4090	H100 SXM
메모리	24GB GDDR6X	80GB HBM3
대역폭	1,008 GB/s	3,350 GB/s
FP16 성능	~330 TFLOPS	989 TFLOPS
ECC	미지원	지원
NVLink	미지원	지원
MIG	미지원	지원
TDP	450W	700W
가격	~$1,600	~$30,000+
보증/지원	3년 소비자	엔터프라이즈

10. GPU 모니터링과 최적화

nvidia-smi 활용

nvidia-smi는 NVIDIA GPU를 모니터링하는 기본 CLI 도구입니다.

# 기본 GPU 상태 확인
nvidia-smi

# 실시간 모니터링 (1초 갱신)
watch -n 1 nvidia-smi

# CSV 형태로 출력 (로깅용)
nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,\
pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,\
utilization.memory,memory.total,memory.free,memory.used \
--format=csv -l 1 > gpu_log.csv

# 프로세스별 GPU 메모리 사용량
nvidia-smi pmon -s m

# GPU 토폴로지 확인 (NVLink 연결)
nvidia-smi topo -m

PyTorch GPU 활용률 최적화

import torch
from torch.utils.data import DataLoader

# GPU 메모리 사용량 확인
print(f"할당된 메모리: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"예약된 메모리: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# 메모리 사용량 상세 분석
print(torch.cuda.memory_summary())

# 데이터로더 최적화
dataloader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,          # CPU 코어 수에 맞게 조정
    pin_memory=True,        # CPU RAM을 pinned memory로 고정 (GPU 전송 속도 향상)
    prefetch_factor=2,      # 미리 로드할 배치 수
    persistent_workers=True  # 워커 프로세스 재사용
)

# CUDA Stream을 활용한 연산 중첩
stream1 = torch.cuda.Stream()
stream2 = torch.cuda.Stream()

with torch.cuda.stream(stream1):
    # 연산 1
    result1 = model_part1(data1)

with torch.cuda.stream(stream2):
    # 연산 2 (stream1과 병렬 실행)
    result2 = model_part2(data2)

torch.cuda.synchronize()

GPU 활용률 최적화 체크리스트

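Common causes of low GPU utilization (below 70%) and their fixes:

Data loading bottleneck: increase num_workers and set pin_memory=True
Batch size too small: raise the per-step batch size if memory allows; Gradient Accumulation enlarges the effective batch but does not by itself make each kernel bigger
CPU-side launch overhead (Python interpreter, many small kernels): capture the training step in a CUDA Graph
Memory fragmentation: torch.cuda.empty_cache() returns cached blocks to the driver but does not speed up training, so call it sparingly (for example after a one-off large allocation), not on every step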

# CUDA Graphs로 CPU 오버헤드 최소화
# 반복적인 모델 실행을 CUDA Graph로 캡처
static_input = torch.randn(batch_size, input_size, device='cuda')
static_target = torch.randn(batch_size, output_size, device='cuda')

# 워밍업
for _ in range(3):
    output = model(static_input)
    loss = criterion(output, static_target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# CUDA Graph 캡처
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    output = model(static_input)
    loss = criterion(output, static_target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# 이후 실제 데이터로 실행
for real_input, real_target in dataloader:
    static_input.copy_(real_input)
    static_target.copy_(real_target)
    g.replay()  # CUDA Graph 재실행 (CPU 오버헤드 거의 없음)

열 관리

데이터센터 GPU는 수백 와트의 열을 발생시킵니다. 열 관리는 성능과 수명에 직결됩니다.

H100 SXM의 TDP: 700W, 최대 온도: 83°C
열 스로틀링이 발생하면 성능이 자동으로 제한됨
DGX 시스템은 액체 냉각(direct liquid cooling) 지원

# GPU 온도 실시간 모니터링
nvidia-smi dmon -s t

# 팬 속도 설정 (소비자 GPU)
nvidia-settings -a "[gpu:0]/GPUFanControlState=1" \
                -a "[fan:0]/GPUTargetFanSpeed=80"

# 전력 제한 설정 (과열 방지, 일부 성능 희생)
sudo nvidia-smi -pl 300  # 전력을 300W로 제한

멀티 GPU 설정

4개 이상의 GPU를 사용하는 경우 올바른 설정이 중요합니다.

# GPU 연결 토폴로지 확인
nvidia-smi topo -m
# NVLink로 연결된 GPU 쌍, PCIe 연결 여부 표시

# NCCL 디버그 로그 활성화
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

# NVLink P2P 활성화 확인
nvidia-smi nvlink --status

# NUMA 최적화 (멀티소켓 서버)
numactl --cpunodebind=0 --membind=0 python train.py  # GPU 0-3과 같은 NUMA 노드 사용

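# Basic PyTorch distributed training setup
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    dist.init_process_group(
        backend='nccl',   # 'nccl' on NVIDIA; ROCm builds of PyTorch map 'nccl' to RCCL
        init_method='env://',
        world_size=world_size,
        rank=rank
    )
    # torchrun sets LOCAL_RANK; otherwise fall back to rank modulo the GPUs on this node
    local_rank = int(os.environ.get('LOCAL_RANK', rank % torch.cuda.device_count()))
    torch.cuda.set_device(local_rank)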

def cleanup():
    dist.destroy_process_group()

# torchrun으로 실행
# torchrun --nproc_per_node=8 --nnodes=2 --rdzv_id=100 \
#          --rdzv_backend=c10d --rdzv_endpoint=host:29400 train.py

마무리: GPU 선택의 실용적 원칙

GPU 선택에 있어 가장 중요한 것은 자신의 워크로드를 정확히 이해하는 것입니다.

메모리 용량이 먼저: 모델과 배치가 GPU 메모리에 들어가지 않으면 다른 것은 의미 없습니다.
대역폭 vs 컴퓨트: 학습은 컴퓨트 바운드, 대형 모델 추론은 메모리 바운드 경향.
클라우드 우선 전략: 확신이 없다면 클라우드로 시작해서 요구사항을 파악 후 온프레미스 투자.
생태계 고려: CUDA 생태계의 성숙도는 여전히 압도적. AMD ROCm은 빠르게 따라잡는 중.
소비전력 고려: 온프레미스 GPU 클러스터의 전기세는 하드웨어 비용 못지않게 중요.
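The compute-bound versus memory-bound distinction in the list above can be made concrete with a roofline-style check. The sketch below is a simplification under stated assumptions: it uses the H100 SXM figures quoted in this guide (989 TFLOPS FP16, 3,350 GB/s), and it approximates the arithmetic intensity of FP16 LLM decoding as one FLOP per byte per sequence in the batch (two FLOPs for each two-byte weight read). The function names are illustrative.

```python
def ridge_point(peak_tflops: float, bandwidth_gb_s: float) -> float:
    """FLOPs per byte a kernel must perform to saturate compute."""
    return peak_tflops * 1e12 / (bandwidth_gb_s * 1e9)


def bound(intensity: float, peak_tflops: float, bandwidth_gb_s: float) -> str:
    """Classify a workload by its arithmetic intensity (FLOPs per byte moved)."""
    if intensity >= ridge_point(peak_tflops, bandwidth_gb_s):
        return "compute-bound"
    return "memory-bound"


H100 = (989, 3350)  # FP16 TFLOPS, GB/s, as quoted in this guide
print(f"H100 ridge point: {ridge_point(*H100):.0f} FLOPs/byte")

# Assumed intensity for FP16 decoding: about batch_size FLOPs per byte of weights.
for batch_size in (1, 32, 512):
    print(batch_size, bound(batch_size, *H100))
```

Under these assumptions a single-stream decode sits far below the ridge point, which is why inference servers batch requests aggressively and why bandwidth matters more than peak TFLOPS for small-batch serving.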

AI 인프라는 빠르게 발전하고 있습니다. Blackwell의 등장으로 FP4 양자화가 현실화되고, GB200 NVL72 같은 랙 스케일 시스템이 대형 모델의 학습과 추론을 재정의하고 있습니다. 하드웨어의 발전을 주시하면서 자신의 요구사항에 맞는 최적의 선택을 해나가시기 바랍니다.

참고 자료

GPU Hardware Complete Guide for AI: From Architecture to Selection Criteria

Introduction

The GPU is the undisputed workhorse behind the explosive growth of AI and deep learning. Training large language models like GPT-4, Llama 3, or Gemini requires thousands of GPUs running for weeks to months. So why are GPUs so critical for AI? Which GPU should you choose?

This guide covers everything about GPU hardware for AI engineers, researchers, and ML practitioners — from architectural fundamentals to the latest Blackwell GPUs, cloud service comparisons, and practical selection guidance.

1. GPU vs CPU: Why AI Training Needs GPUs

Parallel Computing: The Essence of AI

The core operation in deep learning is matrix multiplication. Both forward and backward passes through neural networks consist of billions of multiplications and additions. These operations are independent of each other and thus perfectly suited for parallelization.

CPUs are optimized for high-performance serial processing. A typical server CPU has 64 to 128 cores, each equipped with complex control logic, large caches, and branch prediction. This excels at sequential tasks, complex conditional branching, and operating system management.

GPUs, in contrast, pack thousands to tens of thousands of small cores that apply the same instruction to many data elements at once. NVIDIA calls this execution model SIMT (Single Instruction, Multiple Threads), a close relative of SIMD (Single Instruction, Multiple Data). The NVIDIA H100 SXM has 16,896 CUDA cores. For workloads that repeat the same operation, such as matrix multiplication, a GPU can deliver throughput one to two orders of magnitude higher than a CPU.

FLOPS: The Measure of Compute Performance

The most common unit in deep learning performance discussions is FLOPS (Floating Point Operations Per Second).

TFLOPS (teraflops): 1 trillion floating-point operations per second
PFLOPS (petaflops): 1 quadrillion floating-point operations per second

Modern AI workloads primarily use these precisions:

FP32 (single-precision float): Master weight storage during training
FP16 (half-precision float): Mixed-precision training
BF16 (Brain Float 16): More stable training than FP16
TF32 (TensorFloat-32): Supported since NVIDIA A100
FP8: Supported in Hopper and Blackwell; used for both inference and training
FP4: New in Blackwell; ultra-high-density inference

The NVIDIA H100's FP16 Tensor Core performance reaches a staggering 989 TFLOPS — roughly 1 PFLOPS.

Memory Bandwidth: The True Bottleneck

Many AI workloads are limited not by compute, but by memory bandwidth — a condition called memory-bound.

Take Large Language Model (LLM) inference as an example: every time you generate a token, the full model weights must be read from memory. A Llama 3 70B model in FP16 takes roughly 140GB of memory. Generating dozens of tokens per second requires reading terabytes per second from memory.

Latest GPU memory bandwidths:

NVIDIA A100 SXM: 2,000 GB/s
NVIDIA H100 SXM: 3,350 GB/s
NVIDIA H200 SXM: 4,800 GB/s
NVIDIA B200: 8,000 GB/s (projected)

This is why HBM (High Bandwidth Memory) has been adopted in datacenter GPUs. HBM provides far higher bandwidth than conventional GDDR memory.
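A back-of-the-envelope calculation shows how directly bandwidth caps decode speed. The sketch below assumes the simplest possible model: every generated token streams all weights from memory exactly once, with no batching, no KV-cache traffic, and no overlap. It also ignores capacity, so a 140GB model is treated as if it were spread across enough devices of the given bandwidth. The function name is illustrative.

```python
def max_tokens_per_second(params_billion: float, bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Bandwidth-bound ceiling: tokens/s = bandwidth / bytes read per token."""
    model_gb = params_billion * bytes_per_param   # 1e9 params * bytes/param = GB
    return bandwidth_gb_s / model_gb


# Bandwidth figures from the list above; 70B parameters in FP16 = 2 bytes each.
for name, bandwidth in [("A100 SXM", 2000), ("H100 SXM", 3350),
                        ("H200 SXM", 4800), ("B200", 8000)]:
    ceiling = max_tokens_per_second(70, 2, bandwidth)
    print(f"{name}: at most {ceiling:.1f} tokens/s per stream")
```

Real systems beat this per-stream ceiling in aggregate by batching many requests, since one pass over the weights then serves every sequence in the batch.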

2. History of NVIDIA GPU Architecture

Pascal Architecture (2016): The Start of the AI Renaissance

Named after the 17th-century French mathematician and physicist Blaise Pascal, the Pascal architecture launched in 2016. The GTX 1080 (consumer) and P100 (datacenter) use this architecture.

P100 key specs:

CUDA cores: 3,584
FP32 performance: 9.3 TFLOPS
FP16 performance: 18.7 TFLOPS
Memory: 16GB HBM2, 720 GB/s

The P100 was the first datacenter GPU to adopt HBM2 memory. NVLink 1.0 was introduced in this era as well. AlphaGo defeated Lee Sedol around this time, and the AI boom began in earnest.

Volta Architecture (2017): The Arrival of Tensor Cores

The Volta architecture marks a turning point in GPU history. The V100, released in 2017, introduced Tensor Cores to the world for the first time. Tensor Cores are dedicated hardware units that accelerate matrix multiplication.

V100 key specs:

CUDA cores: 5,120
1st-gen Tensor Cores: 640
FP32 performance: 14 TFLOPS
FP16 Tensor Core performance: 112 TFLOPS (8x improvement!)
Memory: 32GB HBM2, 900 GB/s
NVLink 2.0: 300 GB/s

A single Tensor Core performs a 4x4 matrix multiply-accumulate (D = A*B + C) in one cycle. This boosted FP16 performance 8x over FP32, revolutionizing deep learning training speeds.

Turing Architecture (2018): RT Cores and DLSS

The Turing architecture is famous for the consumer RTX series. RT Cores (dedicated ray-tracing units) debuted here, along with DLSS — AI-based image upscaling.

RTX 2080 Ti key specs:

CUDA cores: 4,352
Tensor Cores: 544 (2nd gen)
FP32 performance: 13.4 TFLOPS
FP16 Tensor Core: 107 TFLOPS
Memory: 11GB GDDR6, 616 GB/s

From an AI standpoint, Turing's significance was INT8 quantized inference support. Quantizing models to INT8 for inference servers delivered 2x faster inference compared to FP16.

Ampere Architecture (2020): A100 and 3rd Gen Tensor Cores

The Ampere architecture shifted the paradigm again. The A100 remains a workhorse GPU in many datacenters today.

A100 SXM4 80GB key specs:

CUDA cores: 6,912
3rd-gen Tensor Cores: 432
FP32 performance: 19.5 TFLOPS
FP16 Tensor Core: 312 TFLOPS
TF32 Tensor Core: 156 TFLOPS
BF16 Tensor Core: 312 TFLOPS
INT8 Tensor Core: 624 TOPS
Memory: 80GB HBM2e, 2,000 GB/s
NVLink 3.0: 600 GB/s

Ampere's key innovations:

TF32 (TensorFloat-32): A hybrid between FP32 and FP16. Exponent bits match FP32 (8 bits); mantissa matches FP16 (10 bits). Existing FP32 code can leverage Tensor Core speeds without modification — balancing numerical stability and speed.

Sparsity support: A100 supports 2:4 structured sparsity in hardware. Setting 50% of model parameters to zero (pruning) lets Tensor Cores exploit this for a 2x additional performance gain. Up to 1,248 TOPS in INT8 theoretically.

Multi-Instance GPU (MIG): A100 can be partitioned into up to 7 independent GPU instances. Useful for running multiple small models in isolated environments on inference servers.

Hopper Architecture (2022): Transformer Engine

The Hopper architecture brought innovations specifically tailored to transformer models. The H100 is currently the most widely deployed top-tier AI training GPU.

H100 SXM5 80GB key specs:

CUDA cores: 16,896
4th-gen Tensor Cores: 528
FP32 performance: 67 TFLOPS
FP16/BF16 Tensor Core: 989 TFLOPS (~1 PFLOPS!)
FP8 Tensor Core: 1,979 TFLOPS (~2 PFLOPS)
Memory: 80GB HBM3, 3,350 GB/s
NVLink 4.0: 900 GB/s
TDP: 700W

Hopper's key innovations:

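Transformer Engine: A combination of 4th-generation Tensor Core hardware and software that targets the attention and MLP layers of transformer models. It chooses between FP8 and 16-bit precision per layer. At launch NVIDIA claimed up to 9x the AI training performance of the A100; treat that as a vendor best case, not a typical speedup.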

FP8 support: Supports E4M3 and E5M2 FP8 formats. 2x Tensor Core performance vs FP16. Half the memory usage.

Thread Block Clusters: Thread blocks in the same cluster run on different SMs (Streaming Multiprocessors) yet can read and write one another's shared memory directly. NVIDIA calls this distributed shared memory.

NVLink 4.0: 900 GB/s, 1.5x improvement over the previous generation. Full-mesh connectivity for up to 8 GPUs.

H200 SXM 141GB key specs:

Compute: Same as H100
Memory: 141GB HBM3e (76% more than H100)
Bandwidth: 4,800 GB/s (43% more than H100)
Up to 2x faster throughput for LLM inference
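Fits larger models on a single GPU (a 70B LLM needs about 140GB for FP16 weights alone, so FP8 or lower precision is needed to leave room for the KV cache)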

Blackwell Architecture (2024): Next-Generation AI Acceleration

The Blackwell architecture, announced in 2024, is NVIDIA's latest.

B200 SXM key specs:

Compute: 20 PFLOPS (FP4)
FP8 performance: 9 PFLOPS
Memory: 192GB HBM3e, 8,000 GB/s
NVLink 5.0: 1,800 GB/s

Blackwell's key innovations:

FP4 support: 4-bit floating-point support enables ultra-high-density inference — more than 2x throughput vs FP8.

2nd Generation Transformer Engine: Automatically manages new precision formats including FP4 and FP6.

NVLink 5.0: 1,800 GB/s, 2x improvement over the previous generation.

GB200 NVL72: 36 Grace CPUs and 72 B200 GPUs integrated into a single rack-scale system. All GPUs connected via NVLink, functioning as one giant GPU. Achieves 1.4 ExaFLOPS (FP4).

3. Tensor Core Deep Dive

CUDA Core vs Tensor Core

A CUDA Core is a general-purpose floating-point execution unit. It processes one FP32 FMA (Fused Multiply-Add) operation per clock cycle.

A Tensor Core is a dedicated matrix multiplication unit. A 1st-gen Tensor Core performs a 4x4 FP16 matrix multiply-accumulate (D = A * B + C) in a single clock cycle. This is equivalent to 64 FP16 multiplications and 64 FP16 additions — 128 FP16 operations per cycle.
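The per-cycle figure translates into a peak throughput number once the core count and clock are known. The sketch below is illustrative: the 640 Tensor Cores come from the V100 specs above, while the 1.37 GHz clock is an assumed value chosen for the example, not a figure from this guide.

```python
def tensor_core_peak_tflops(num_tensor_cores: int, clock_ghz: float,
                            fma_per_cycle: int = 64) -> float:
    """Peak TFLOPS = cores * FMAs per cycle * 2 FLOPs per FMA * clock."""
    flops_per_cycle = num_tensor_cores * fma_per_cycle * 2
    return flops_per_cycle * clock_ghz * 1e9 / 1e12


# 4x4x4 matrix multiply-accumulate = 64 FMAs = 128 FLOPs per core per cycle.
peak = tensor_core_peak_tflops(num_tensor_cores=640, clock_ghz=1.37)  # clock is assumed
print(f"V100-class peak: {peak:.1f} TFLOPS")
```

With the assumed clock the result lands close to the 112 TFLOPS listed for the V100, which suggests the headline number is simply this product evaluated at boost clock.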

Tensor Core evolution by generation:

Gen	Architecture	Supported Precisions	Matrix Size	Notes
1st	Volta	FP16	4x4	First Tensor Core
2nd	Turing	FP16, INT8, INT4	-	INT support added
3rd	Ampere	FP16, BF16, TF32, INT8, INT4	-	TF32, Sparsity
4th	Hopper	FP16, BF16, TF32, FP8, INT8	-	FP8, Transformer Engine
5th	Blackwell	FP16, BF16, TF32, FP8, FP4	-	FP4 support

WMMA (Warp Matrix Multiply-Accumulate)

To use Tensor Cores directly in CUDA programming, you use the WMMA API:

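#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda::wmma;

// Computes one 16x16 tile C = A * B on Tensor Cores.
// Launch with a single warp, e.g. wmma_tile<<<1, 32>>>(a, b, c);
// requires compute capability 7.0 or newer (compile with -arch=sm_70 or higher).
__global__ void wmma_tile(const half* a_ptr, const half* b_ptr, float* c_ptr) {
    // 16x16x16 FP16 inputs with an FP32 accumulator
    fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;
    fragment<matrix_b, 16, 16, 16, half, col_major> b_frag;
    fragment<accumulator, 16, 16, 16, float> c_frag;

    fill_fragment(c_frag, 0.0f);

    // Load matrices (leading dimension = 16)
    load_matrix_sync(a_frag, a_ptr, 16);
    load_matrix_sync(b_frag, b_ptr, 16);

    // Execute the Tensor Core multiply-accumulate
    mma_sync(c_frag, a_frag, b_frag, c_frag);

    // Store result
    store_matrix_sync(c_ptr, c_frag, 16, mem_row_major);
}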

In practice, cuBLAS and PyTorch handle this automatically.

Mixed Precision Training

Mixed precision training maintains FP32 master weights while performing forward/backward passes in FP16 or BF16.

# PyTorch AMP (Automatic Mixed Precision) with BF16
import torch

for inputs, target in dataloader:
    optimizer.zero_grad(set_to_none=True)

    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        output = model(inputs)
        loss = criterion(output, target)

    # BF16 shares FP32's exponent range, so no loss scaling is needed.
    # With dtype=torch.float16, wrap backward()/step() in a GradScaler instead.
    loss.backward()
    optimizer.step()

BF16 is more stable than FP16 for training. FP16's 5-bit exponent gives a narrow range, causing potential overflow, while BF16's 8-bit exponent (same as FP32) represents a much wider range.
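The range difference follows directly from the bit layouts. As a minimal sketch, the helper below computes the largest finite value of an IEEE-style binary format from its exponent and mantissa widths; the function name and the format table are illustrative, and FP8 is left out because its E4M3 variant does not follow the plain IEEE rule.

```python
def max_finite(exp_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-style binary floating-point format."""
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias


# (exponent bits, mantissa bits) for each format discussed in this guide
formats = {"FP16": (5, 10), "BF16": (8, 7), "TF32": (8, 10), "FP32": (8, 23)}

for name, (exp_bits, mantissa_bits) in formats.items():
    print(f"{name}: max = {max_finite(exp_bits, mantissa_bits):.4e}, "
          f"relative precision = 2^-{mantissa_bits}")
```

FP16 tops out at 65,504, so a large loss or gradient overflows to infinity unless it is scaled, while BF16 reaches about 3.4e38 like FP32 and pays for that range with three fewer mantissa bits.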

Sparsity Support

The 2:4 structured sparsity supported from A100 onward groups parameters into sets of 4 and sets exactly 2 of them to zero.

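# Build a 2:4 mask by hand: in every group of 4 weights, keep the 2 largest magnitudes
import torch

w = model.layer.weight.data                  # assumes in_features is divisible by 4
groups = w.abs().reshape(-1, 4)
keep = groups.topk(2, dim=1).indices
mask = torch.zeros_like(groups).scatter_(1, keep, 1.0).reshape(w.shape)
model.layer.weight.data = w * mask

Note that prune.ln_structured removes whole rows or channels, which is not the 2:4 pattern. Masking alone also gives no speedup: the up-to-2x inference gain is a theoretical ceiling that requires kernels which actually use Sparse Tensor Cores, and accuracy usually needs fine-tuning after pruning.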

4. GPU Memory Hierarchy

GDDR vs HBM: A Game Changer

GDDR6 (Graphics DDR6): Used in consumer GPUs. Separate chips mounted outside the package. RTX 4090: 24GB GDDR6X, 1,008 GB/s.

HBM2e (High Bandwidth Memory 2e): Used in datacenter GPUs. Stacked next to the GPU die in 2.5D packaging, connected via a silicon interposer. A100: 80GB HBM2e, 2,000 GB/s.

HBM3: Deployed in H100. 80GB, 3,350 GB/s.

HBM3e: Deployed in H200. 141GB, 4,800 GB/s. Also in B200: 192GB, 8,000 GB/s.

Why is HBM faster? HBM stacks multiple layers of DRAM dies vertically, connected by thousands of fine Through-Silicon Vias (TSVs). The GPU die and HBM stack sit side by side on a silicon interposer, providing ultra-wide bandwidth over extremely short distances.
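The bandwidth gap comes mostly from interface width. Peak bandwidth is bus width times per-pin data rate, and the sketch below applies that formula to two configurations. The bus widths and pin rates are assumptions for illustration (a 384-bit GDDR6X interface at 21 Gbit/s per pin, and five 1,024-bit HBM3 stacks at roughly 5.23 Gbit/s per pin); they were chosen because they reproduce the bandwidth figures quoted in this section.

```python
def memory_bandwidth_gb_s(bus_width_bits: int, gbit_per_pin: float) -> float:
    """Peak bandwidth in GB/s = bus width (bits) * per-pin rate (Gbit/s) / 8."""
    return bus_width_bits * gbit_per_pin / 8


# Assumed configurations, see the note above.
gddr6x = memory_bandwidth_gb_s(bus_width_bits=384, gbit_per_pin=21.0)
hbm3 = memory_bandwidth_gb_s(bus_width_bits=5 * 1024, gbit_per_pin=5.23)

print(f"GDDR6X-class (384-bit):  {gddr6x:.0f} GB/s")
print(f"HBM3-class (5,120-bit):  {hbm3:.0f} GB/s")
```

Each HBM pin runs about four times slower than a GDDR6X pin in this example, yet the interface is more than thirteen times wider, which is only practical because the stacks sit millimetres from the GPU die on the interposer.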

Memory Hierarchy Structure

Registers (Register File)
    └── Fastest; tens to hundreds per thread
L1 Cache / Shared Memory
    └── Shared within an SM (Streaming Multiprocessor)
    └── H100: 228KB per SM
L2 Cache
    └── Shared across all SMs
    └── H100: 50MB
HBM (Main Memory)
    └── Accessible by all SMs
    └── H100: 80GB

The key to kernel optimization is keeping data in shared memory as long as possible to minimize slow HBM accesses. Flash Attention is a prime example of applying this principle to attention computation.
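One way to see why HBM traffic matters so much is a roofline-style check: compare a kernel's arithmetic intensity (FLOPs per byte moved from HBM) against the GPU's ratio of peak compute to bandwidth. The sketch below uses the H100 SXM figures quoted in this guide. The two matrix shapes are illustrative assumptions, and the model assumes each matrix crosses HBM exactly once:

```python
H100_PEAK_FLOPS = 989e12        # FP16 Tensor Core peak quoted in this guide
H100_HBM_BYTES_PER_S = 3.35e12  # HBM3 bandwidth quoted in this guide


def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte of HBM traffic for C[m,n] = A[m,k] @ B[k,n]."""
    flops = 2 * m * n * k
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic


# Intensity a kernel needs before compute, not memory, becomes the limit
ridge = H100_PEAK_FLOPS / H100_HBM_BYTES_PER_S

shapes = {"training GEMM": (4096, 4096, 4096), "decode GEMV": (1, 8192, 8192)}
for name, shape in shapes.items():
    ai = matmul_intensity(*shape)
    bound = "compute-bound" if ai > ridge else "memory-bound"
    print(f"{name}: {ai:.1f} FLOPs/byte vs ridge {ridge:.0f} -> {bound}")
```

A large square matrix multiply reuses each loaded value thousands of times and clears the ridge easily. A batch-1 decode step performs about one FLOP per byte and is limited by bandwidth. Kernels such as Flash Attention raise the effective intensity by keeping intermediate tiles in shared memory instead of writing them back to HBM.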

ECC Memory

ECC (Error-Correcting Code) memory detects and corrects bit errors. Datacenter GPUs (A100, H100, etc.) support ECC. Consumer GPUs (RTX 4090) do not.

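Memory errors during long training runs can cause training to diverge or produce NaN values. ECC-capable GPUs are recommended for critical training jobs. On GDDR-based GPUs, enabling ECC reduces usable memory capacity by approximately 6.25%; HBM-based GPUs such as the A100 and H100 store the ECC bits separately, so usable capacity is unchanged.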

5. Multi-GPU Connectivity: NVLink & NVSwitch

The PCIe Bottleneck

A standard PCIe 4.0 x16 slot offers up to 32 GB/s bandwidth (64 GB/s bidirectional). In multi-GPU training, gradient synchronization creates a bottleneck through this interface.

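Consider All-Reduce across 4 GPUs: each GPU must exchange gradients with the other 3. A 10B parameter model carries ~40GB of FP32 gradients. Even at the full 32 GB/s, a single one-way copy of that payload takes about 1.25 seconds, so every synchronization costs seconds over PCIe, and more when several GPUs contend for the same PCIe switch.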
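The cost of this synchronization can be bounded with a simple bandwidth model. The sketch below assumes a ring all-reduce, in which each GPU sends and receives 2 × (N − 1) / N of the payload, and it ignores latency, contention, and overlap with compute, so the results are lower bounds rather than measurements. The 450 GB/s figure is the per-direction share of NVLink 4.0's 900 GB/s:

```python
def ring_allreduce_seconds(payload_bytes: float, num_gpus: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only time for a ring all-reduce (latency and overlap ignored)."""
    # Each GPU sends and receives 2 * (N - 1) / N of the payload.
    traffic_per_gpu = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic_per_gpu / link_bytes_per_s


grad_bytes = 10e9 * 4  # 10B parameters, FP32 gradients
pcie4 = ring_allreduce_seconds(grad_bytes, num_gpus=4, link_bytes_per_s=32e9)
nvlink4 = ring_allreduce_seconds(grad_bytes, num_gpus=4, link_bytes_per_s=450e9)
print(f"PCIe 4.0 x16: {pcie4:.2f} s, NVLink 4.0: {nvlink4:.3f} s")
```

Real PCIe transfers are typically slower than this bound because GPUs share switches and host bridges. That is part of why the gap to NVLink tends to feel larger in practice than the raw ratio suggests.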

NVLink Evolution

Version	Architecture	Per-link BW (bidirectional)	Links per GPU	Total BW per GPU (bidirectional)
1.0	Pascal	40 GB/s	4	160 GB/s
2.0	Volta	50 GB/s	6	300 GB/s
3.0	Ampere	50 GB/s	12	600 GB/s
4.0	Hopper	50 GB/s	18	900 GB/s
5.0	Blackwell	100 GB/s	18	1,800 GB/s

NVLink 4.0 provides up to 900 GB/s of total bidirectional bandwidth per GPU, about 14x the 64 GB/s of PCIe 4.0 x16.

NVSwitch: All-to-All Connectivity

NVLink connects pairs of GPUs directly, but connecting 8 or more GPUs requires NVSwitch — a dedicated GPU interconnect switch chip that lets all connected GPUs communicate directly at full NVLink bandwidth.

DGX H100 system configuration:

8× H100 SXM5 GPUs
4× third-generation NVSwitch chips (NVLink 4.0)
All GPU pairs directly connected at 900 GB/s
NVLink All-to-All total bandwidth: 7.2 TB/s

DGX A100 vs DGX H100

DGX A100:

GPUs: 8× A100 80GB
NVLink total bandwidth: 4.8 TB/s
GPU memory: 640GB
AI performance: 5 PFLOPS (FP16)

DGX H100:

GPUs: 8× H100 80GB
NVLink total bandwidth: 7.2 TB/s
GPU memory: 640GB
AI performance: 32 PFLOPS (FP8)
Approximately 6.4x the headline performance of DGX A100 (32 vs 5 PFLOPS; note that this compares FP8 against FP16)

InfiniBand: Inter-Node Connectivity

NVLink handles intra-node connectivity; InfiniBand (IB) networking connects multiple server nodes. NVIDIA ConnectX-7 NICs with InfiniBand NDR (400 Gb/s) minimize inter-server communication latency.

Large-scale LLM training requires connecting thousands of GPUs. Meta reports training Llama 3 on up to 16,000 H100s; clusters of that size are built on InfiniBand or RoCE (RDMA over Converged Ethernet) fabrics.

6. Detailed AI GPU Comparison

NVIDIA A100 (80GB HBM2e)

Released in 2020, the A100 remains the standard for many AI workloads. It offers 312 TFLOPS in FP16 and BF16 and 156 TFLOPS in TF32 (dense Tensor Core throughput).

The SXM4 form factor supports NVLink 3.0 for up to 8 GPUs; a PCIe 4.0 version also exists. MIG (Multi-Instance GPU) partitions the A100 into up to 7 independent instances.

Approximate cloud hourly rate: ~$32.77/h for AWS p4d.24xlarge (8× A100 40GB).

NVIDIA H100 (80GB HBM3)

Released 2022. Currently the most widely deployed high-end AI training GPU.

SXM5 version:

FP16/BF16 Tensor Core: 989 TFLOPS
FP8 Tensor Core: 1,979 TFLOPS
Memory: 80GB HBM3, 3,350 GB/s
TDP: 700W
NVLink 4.0: 900 GB/s

PCIe version:

FP16/BF16 Tensor Core: 756 TFLOPS
Memory: 80GB HBM2e, 2,000 GB/s
TDP: 350W

H100 vs A100:

Tensor Core performance: 3.2x (FP16)
FP8 (H100) vs FP16 (A100): ~6x (1,979 vs 312 TFLOPS)
Memory bandwidth: 1.7x (HBM3)
NVLink bandwidth: 1.5x

Approximate cloud hourly rate: ~$98.32/h for AWS p5.48xlarge (8× H100).

NVIDIA H200 (141GB HBM3e)

A memory-upgraded H100. Compute performance is identical to H100, but memory capacity and bandwidth are dramatically improved.

Memory: 141GB HBM3e (76% more than H100)
Bandwidth: 4,800 GB/s (43% more than H100)
Up to 2x faster throughput vs H100 for LLM inference
Fits 70B-class LLMs on a single GPU (comfortably at FP8; FP16 weights alone take about 140GB, leaving almost no room for the KV cache)
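The link between bandwidth and inference speed can be sketched with a simple upper bound: at batch size 1, generating each token requires reading every weight once, so tokens per second cannot exceed bandwidth divided by model size. The example below assumes a 70B-parameter model with FP8 weights (1 byte per parameter) and uses the bandwidth figures quoted in this guide. It ignores KV cache reads and kernel overheads:

```python
def max_decode_tokens_per_s(param_count: float, bytes_per_param: float,
                            bandwidth_bytes_per_s: float) -> float:
    """Upper bound for batch-size-1 decoding: every token reads all weights once."""
    return bandwidth_bytes_per_s / (param_count * bytes_per_param)


PARAMS_70B = 70e9
for gpu, bw in {"H100 SXM": 3.35e12, "H200 SXM": 4.8e12}.items():
    rate = max_decode_tokens_per_s(PARAMS_70B, bytes_per_param=1,
                                   bandwidth_bytes_per_s=bw)
    print(f"{gpu}: <= {rate:.1f} tokens/s for a 70B model with FP8 weights")
```

Under this model the bandwidth increase alone accounts for roughly a 1.43x gain. Larger speedups would have to come from the extra capacity, for example by fitting bigger batches or longer contexts on one GPU.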

NVIDIA B100 / B200 (Blackwell)

Announced 2024. Still in early deployment.

B200 SXM:

FP4 Tensor Core: 20 PFLOPS
FP8 Tensor Core: 9 PFLOPS
FP16/BF16 Tensor Core: 4.5 PFLOPS
Memory: 192GB HBM3e, 8,000 GB/s
TDP: 1,000W

B200 vs H100:

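FP8 performance: 4.5x (9 vs 1.979 PFLOPS; Blackwell headline figures are typically quoted with sparsity, so the dense-to-dense gap is likely smaller)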
Memory: 2.4x
Bandwidth: 2.4x

NVIDIA GB200 NVL72 (Rack-Scale AI)

The GB200 integrates one Grace CPU (ARM-based) and two B200 GPUs into a single superchip package. GB200 NVL72 integrates 36 Grace CPUs and 72 B200 GPUs into a single rack system.

GB200 NVL72 specs:

GPUs: 72× B200
CPUs: 36× Grace CPU
GPU memory: 13.8TB HBM3e
NVLink 5.0 All-to-All connectivity
AI performance: 1.4 ExaFLOPS (FP4), 720 PFLOPS (FP8)
Total power: 120kW

This effectively operates as one enormous GPU. A single rack can handle large models like Llama 3 405B at high throughput.

GeForce RTX 4090 (Consumer)

The best consumer-grade GPU for AI startups and individual researchers.

CUDA cores: 16,384
FP32 performance: 82.6 TFLOPS
FP16 Tensor Core: ~330 TFLOPS (approximate)
Memory: 24GB GDDR6X, 1,008 GB/s
TDP: 450W
Price: ~$1,599 (MSRP)

vs H100 SXM:

Tensor Core: ~1/3 the performance
Memory: 24GB vs 80GB
Bandwidth: 1,008 vs 3,350 GB/s
ECC: Not supported
NVLink: Not supported (PCIe only)
Price: ~1/20th (H100 is $30,000+)

AMD MI300X

AMD's datacenter AI GPU.

Compute Units (CU): 304
FP16 performance: 1,307 TFLOPS
BF16 performance: 1,307 TFLOPS
FP8 performance: 2,614 TFLOPS
Memory: 192GB HBM3, 5,300 GB/s
TDP: 750W

The MI300X has a significant memory capacity (192GB vs 80GB) and bandwidth (5,300 vs 3,350 GB/s) advantage over the H100. It particularly excels in LLM inference.

GPU Performance Comparison Table

GPU	FP16 TFLOPS	Memory	Bandwidth	TDP	Year
A100 SXM	312	80GB HBM2e	2,000 GB/s	400W	2020
RTX 4090	~330	24GB GDDR6X	1,008 GB/s	450W	2022
H100 SXM	989	80GB HBM3	3,350 GB/s	700W	2022
MI300X	1,307	192GB HBM3	5,300 GB/s	750W	2023
H200 SXM	989	141GB HBM3e	4,800 GB/s	700W	2024
B200 SXM	4,500	192GB HBM3e	8,000 GB/s	1,000W	2024

7. AMD GPU for AI

The ROCm Ecosystem

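AMD's AI software stack is ROCm (Radeon Open Compute), an open-source platform. It is not binary-compatible with CUDA, but its HIP layer mirrors the CUDA API closely enough that most CUDA code can be ported. Compatibility with major frameworks like PyTorch and TensorFlow has improved significantly in recent years.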

ROCm's CUDA-equivalent components:

HIP (Heterogeneous-compute Interface for Portability): CUDA C++ equivalent
rocBLAS: cuBLAS equivalent (matrix operations)
MIOpen: cuDNN equivalent (deep learning primitives)
RCCL: NCCL equivalent (GPU communication)

Installing PyTorch with ROCm support:

# Install PyTorch with ROCm support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0

MI300X Key Features

The MI300X is AMD's current flagship AI GPU. It uses the CDNA 3 architecture with advanced packaging (MCM: Multi-Chip Module) that integrates GPU dies and HBM stacks in 3D.

The MI300X's 192GB HBM3 and 5,300 GB/s bandwidth can outperform the H100 in some LLM inference scenarios, especially workloads limited by memory capacity or bandwidth (long sequences, large KV caches).

Microsoft Azure, Oracle Cloud, and others have begun offering MI300X instances. Major tech companies including Meta and Microsoft are actively adopting AMD GPUs.

AMD vs NVIDIA Software Ecosystem

Honestly, the NVIDIA CUDA software ecosystem still dominates AI today.

NVIDIA-exclusive libraries: cuDNN, cuBLAS, TensorRT, NCCL, NVTX, etc.
FlashAttention was originally CUDA-only (ROCm port came later)
Much research code assumes CUDA
ROCm is rapidly closing the gap but isn't fully compatible yet

Choosing AMD GPUs for production may require additional engineering time to address software compatibility issues.

8. Cloud GPU Service Comparison

AWS GPU Instances

p3 series (V100):

p3.2xlarge: 1× V100, $3.06/h
p3.16xlarge: 8× V100, $24.48/h

p4d series (A100):

p4d.24xlarge: 8× A100, 320GB HBM2, $32.77/h

p5 series (H100):

p5.48xlarge: 8× H100, 640GB HBM3, $98.32/h

AWS Trainium (Trn1):

AWS-proprietary AI training chip (Trn1 uses first-generation Trainium; Trainium2 ships in Trn2 instances)
Trn1.32xlarge: 16× Trainium, $21.50/h
AWS claims better price-performance than comparable GPU instances for LLM training

AWS Inferentia (Inf2):

Inference-only chip
Inf2.48xlarge: 12× Inferentia2, $12.98/h
Optimized for Llama 2 70B inference

Spot instances can save 60-90% vs on-demand. Checkpointing is essential since instances can be interrupted.

Google Cloud GPU Instances

A100 instances:

a2-highgpu-1g: 1× A100 (40GB), $3.67/h
a2-megagpu-16g: 16× A100, $55.74/h

H100 instances:

a3-highgpu-8g: 8× H100, ~$19-25/h (varies by region)

Google TPU v4/v5:

TPU v4: AI training-optimized ASIC, 275 TFLOPS/chip (BF16)
TPU v5e: Optimized for large-scale inference
TPU v5p: Latest training-focused, 459 TFLOPS/chip (BF16)
Best compatibility with Google's JAX framework

Azure GPU Instances

ND H100 v5:

Standard_ND96isr_H100_v5: 8× H100
Inter-node connectivity via InfiniBand NDR

NCas_T4_v3:

T4-based inference instances
Standard_NC64as_T4_v3: 4× T4, $4.35/h

Lambda Labs, CoreWeave, Vast.ai

Cloud startups offer GPUs more cheaply than AWS/GCP/Azure.

Lambda Labs:

H100 SXM5 8× instance: $26.80/h (73% cheaper than AWS p5)
Lambda Cloud is tailored for AI researchers

CoreWeave:

Professional GPU cloud
H100 single: $2.89/h
Large cluster configurations available

Vast.ai:

GPU marketplace (individuals/companies renting out GPUs)
H100 at ~$2-3/h (market price)
Suitable for experimental training where security sensitivity is low

Cloud GPU Cost Estimation

LLM training cost example (Llama 3 8B, 8× A100, 100B tokens):

Training time estimate:
- Compute: ~6 x 8B params x 100B tokens = 4.8e21 FLOPs
- ~56 days on an 8× A100 cluster (312 TFLOPS peak per GPU, assuming 40% utilization)
- p4d.24xlarge: $32.77/h x 24h x 56 days = ~$44,000

Cost-saving strategies:
1. Spot instances: 70% savings -> ~$13,200
2. Lambda Labs: $11.60/h x 24h x 56 days = ~$15,600
3. CoreWeave: even cheaper possible
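An estimate like this can be reproduced with the common approximation that training a dense transformer costs about 6 × parameters × tokens FLOPs. The sketch below is a back-of-envelope calculator, not a benchmark. The 40% utilization is an assumption and the result is very sensitive to it, so it is worth rerunning with numbers measured on your own workload. The hourly rates are the ones quoted in this section:

```python
def training_days(params: float, tokens: float, num_gpus: int,
                  peak_flops_per_gpu: float, utilization: float) -> float:
    """Wall-clock days from the ~6 * params * tokens FLOPs approximation."""
    total_flops = 6 * params * tokens
    sustained = num_gpus * peak_flops_per_gpu * utilization
    return total_flops / sustained / 86_400


days = training_days(params=8e9, tokens=100e9, num_gpus=8,
                     peak_flops_per_gpu=312e12,  # A100 BF16 peak
                     utilization=0.40)           # assumed, not measured

for name, hourly in {"AWS p4d.24xlarge": 32.77, "Lambda Labs 8x A100": 11.60}.items():
    print(f"{name}: {days:.1f} days, ~${hourly * 24 * days:,.0f}")
```

Halving the utilization doubles both the duration and the bill, so measuring real throughput on a short run before committing to a long job is usually worth the effort.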

9. GPU Selection Guide

Training vs Inference

Important for training:

High-performance Tensor Cores (BF16/FP8)
Sufficient memory (batch size, gradients, optimizer state)
NVLink bandwidth (multi-GPU gradient synchronization)
ECC memory (stability)

Important for inference:

Memory bandwidth (KV cache read speed)
Memory capacity (model + KV cache)
INT8/FP8/FP4 support
MIG (isolated execution of multiple small models)
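KV cache size is easy to underestimate, so a quick calculation helps when sizing inference GPUs. The sketch below assumes a Llama-3-70B-like shape (80 layers, 8 key/value heads with grouped-query attention, head dimension 128) and FP16 cache entries. These shape values are assumptions for illustration and should be replaced with the real configuration of the model being served:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for every layer and token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * seq_len * batch_size


# Assumed Llama-3-70B-like shape: 80 layers, 8 KV heads (GQA), head_dim 128
cache = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                       seq_len=8192, batch_size=16)
print(f"KV cache: {cache / 2**30:.1f} GiB")  # 40.0 GiB
```

At this batch size and context length the cache alone takes half of an 80GB GPU, on top of the weights. This is why memory capacity and KV cache precision matter as much as raw compute for serving.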

GPU Requirements by Model Size

Memory required at FP16:

Model Size	Parameter Memory	Training Memory	GPUs Needed
7B	14GB	~56GB	1× H100 (80GB)
13B	26GB	~104GB	2× H100
70B	140GB	~560GB	8× H100
405B	810GB	~3.2TB	40+ H100
1T	2TB	~8TB	100+ H100

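Training memory ≈ parameter memory x 4 (FP16 parameters, gradients, and two Adam moments at 2 bytes each) + activations. Standard mixed-precision Adam keeps an FP32 master copy and FP32 moments, which is closer to 16 bytes per parameter, so treat the table as a lower bound.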
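The table's rule of thumb can be written as a small calculator. The sketch below compares two byte budgets per parameter: the all-FP16 accounting used in the table, and the more conservative mixed-precision Adam accounting with an FP32 master copy and FP32 moments. Activations are excluded in both cases, and the constant names are introduced here for illustration:

```python
def training_memory_gb(params_billion: float, bytes_per_param: int) -> float:
    """Weights + gradients + optimizer state in GB, excluding activations."""
    return params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes per GB


ALL_FP16 = 2 + 2 + 2 + 2                  # weights, grads, two Adam moments
MIXED_PRECISION_ADAM = 2 + 2 + 4 + 4 + 4  # FP16 weights/grads + FP32 master and moments

for size in (7, 13, 70, 405):
    low = training_memory_gb(size, ALL_FP16)
    high = training_memory_gb(size, MIXED_PRECISION_ADAM)
    print(f"{size}B: {low:,.0f} GB (all FP16) to {high:,.0f} GB (FP32 optimizer state)")
```

The gap between the two budgets is why sharding optimizer state with FSDP or ZeRO is often the first memory optimization to reach for.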

Memory-saving techniques:

Gradient Checkpointing: dramatically reduces activation memory (20-30% speed trade-off)
FSDP/ZeRO: distributes parameters, gradients, optimizer state across GPUs
Flash Attention: attention computation O(N²) → O(N) memory
FP8 training: roughly halves memory for the tensors stored in FP8, relative to FP16

Recommendations by Budget

Individual researchers (~$2,000):

RTX 4090 (24GB, ~$1,600): fine-tuning small models, LoRA training
7B model QLoRA fine-tuning is feasible
FlashAttention supported (Ada Lovelace architecture)

Startup team (~$10,000-50,000):

4-8× RTX 4090: small LLM experiments
Or used A100 40GB/80GB × 1-4
Hybrid cloud strategy recommended

Mid-size AI team (~$1M+):

8× H100 (DGX H100 level): $320,000+
Or Lambda Labs/CoreWeave cloud
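Can train or fine-tune models up to roughly 70B parameters on one node (see the memory table above); 100B+ models need multiple nodes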

Large research institutions/enterprises:

Hundreds to thousands of H100/H200 GPUs
GB200 NVL72 rack systems
Dedicated InfiniBand network

Consumer vs Datacenter

Feature	RTX 4090	H100 SXM
Memory	24GB GDDR6X	80GB HBM3
Bandwidth	1,008 GB/s	3,350 GB/s
FP16 Perf	~330 TFLOPS	989 TFLOPS
ECC	Not supported	Supported
NVLink	Not supported	Supported
MIG	Not supported	Supported
TDP	450W	700W
Price	~$1,600	~$30,000+
Warranty	3yr consumer	Enterprise

10. GPU Monitoring and Optimization

Using nvidia-smi

nvidia-smi is the primary CLI tool for monitoring NVIDIA GPUs.

# Basic GPU status
nvidia-smi

# Real-time monitoring (1 second refresh)
watch -n 1 nvidia-smi

# CSV output for logging
nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,\
pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,\
utilization.memory,memory.total,memory.free,memory.used \
--format=csv -l 1 > gpu_log.csv

# Per-process GPU memory usage
nvidia-smi pmon -s m

# GPU topology (NVLink connections)
nvidia-smi topo -m
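Once a CSV log exists, a few lines of Python can summarize it. The sketch below assumes the header and value format that `nvidia-smi --format=csv` normally emits (column names such as `utilization.gpu [%]` and values such as `87 %`). It is worth checking the first lines of your own log, since column names depend on the query fields. The sample rows are hypothetical:

```python
import csv
import io
from statistics import mean


def average_utilization(csv_text: str) -> dict[str, float]:
    """Mean GPU utilization (%) per GPU name from nvidia-smi CSV output."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    samples: dict[str, list[float]] = {}
    for row in reader:
        # Values look like "87 %"; keep the numeric part only.
        value = float(row["utilization.gpu [%]"].split()[0])
        samples.setdefault(row["name"], []).append(value)
    return {name: mean(values) for name, values in samples.items()}


# Hypothetical two-sample excerpt in the assumed format
sample = """timestamp, name, utilization.gpu [%], memory.used [MiB]
2024/01/01 12:00:00.000, NVIDIA H100 80GB HBM3, 90 %, 40000 MiB
2024/01/01 12:00:01.000, NVIDIA H100 80GB HBM3, 70 %, 41000 MiB
"""
print(average_utilization(sample))
```

For a real log, pass the file contents instead, for example `average_utilization(open("gpu_log.csv").read())`. On multi-GPU machines, grouping by `pci.bus_id` rather than `name` keeps identical GPUs separate.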

PyTorch GPU Utilization Optimization

import torch
from torch.utils.data import DataLoader

# Check GPU memory usage
print(f"Allocated memory: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved memory: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# Detailed memory analysis
print(torch.cuda.memory_summary())

# DataLoader optimization
dataloader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,           # Tune to CPU core count
    pin_memory=True,         # Pin CPU RAM for faster GPU transfer
    prefetch_factor=2,       # Number of batches to prefetch
    persistent_workers=True  # Reuse worker processes
)

# CUDA Streams for overlapping operations
stream1 = torch.cuda.Stream()
stream2 = torch.cuda.Stream()

with torch.cuda.stream(stream1):
    result1 = model_part1(data1)

with torch.cuda.stream(stream2):
    result2 = model_part2(data2)  # Runs in parallel with stream1

torch.cuda.synchronize()

GPU Utilization Optimization Checklist

Common causes of low GPU utilization (below 70%) and fixes:

Data loading bottleneck: Increase num_workers, set pin_memory=True
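Batch size too small: Increase the per-GPU batch size until memory is nearly full; Gradient Accumulation raises the effective batch size but does not by itself raise utilization
CPU launch overhead (many small kernels, Python interpreter cost): Use CUDA Graphs or torch.compile to minimize CPU overhead
Memory fragmentation: On recent PyTorch releases, set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True; torch.cuda.empty_cache() frees cached blocks but slows training if called every step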

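# CUDA Graphs to minimize CPU overhead
static_input = torch.randn(batch_size, input_size, device='cuda')
static_target = torch.randn(batch_size, output_size, device='cuda')

# Warmup on a side stream (required before capture)
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        optimizer.zero_grad(set_to_none=True)
        output = model(static_input)
        loss = criterion(output, static_target)
        loss.backward()
        optimizer.step()
torch.cuda.current_stream().wait_stream(s)

# Capture CUDA Graph
g = torch.cuda.CUDAGraph()
# Clear grads before capture so backward() allocates them from the graph's memory pool
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_output = model(static_input)
    static_loss = criterion(static_output, static_target)
    static_loss.backward()
    optimizer.step()

# Run with real data (every batch must match the static tensors' shape, e.g. drop_last=True)
for real_input, real_target in dataloader:
    static_input.copy_(real_input)
    static_target.copy_(real_target)
    g.replay()  # Replays forward, backward, and optimizer step with minimal CPU overhead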

Thermal Management

Datacenter GPUs generate hundreds of watts of heat. Thermal management directly impacts performance and longevity.

H100 SXM TDP: 700W, max temperature: 83°C
Thermal throttling automatically limits performance when exceeded
DGX systems support direct liquid cooling

# Real-time GPU temperature monitoring
nvidia-smi dmon -s t

# Fan speed control (consumer GPUs)
nvidia-settings -a "[gpu:0]/GPUFanControlState=1" \
                -a "[fan:0]/GPUTargetFanSpeed=80"

# Power limit (prevent overheating at some performance cost)
sudo nvidia-smi -pl 300  # Limit power to 300W

Multi-GPU Setup

For 4+ GPU configurations, correct settings matter.

# Check GPU connection topology
nvidia-smi topo -m
# Shows NVLink-connected GPU pairs and PCIe connections

# Enable NCCL debug logging
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

# Verify NVLink P2P
nvidia-smi nvlink --status

# NUMA optimization (multi-socket servers)
numactl --cpunodebind=0 --membind=0 python train.py  # Same NUMA node as GPU 0-3

# PyTorch distributed training basics
import os
import torch
import torch.distributed as dist

def setup():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for every process
    dist.init_process_group(
        backend='nccl',   # Also used on ROCm builds, where 'nccl' maps to RCCL
        init_method='env://'
    )
    # Use the node-local rank: the global rank exceeds the GPU count on multi-node runs
    torch.cuda.set_device(int(os.environ['LOCAL_RANK']))

def cleanup():
    dist.destroy_process_group()

# Launch with torchrun:
# torchrun --nproc_per_node=8 --nnodes=2 --rdzv_id=100 \
#          --rdzv_backend=c10d --rdzv_endpoint=host:29400 train.py

Conclusion: Practical Principles for GPU Selection

The most important thing in GPU selection is accurately understanding your own workload.

Memory capacity first: If your model and batch don't fit in GPU memory, nothing else matters.
Bandwidth vs compute: Training tends to be compute-bound; large model inference tends to be memory-bound.
Cloud-first strategy: If unsure, start with cloud to understand requirements before investing in on-premises hardware.
Ecosystem matters: NVIDIA's CUDA ecosystem remains overwhelmingly mature. AMD ROCm is catching up fast.
Power consumption: Electricity and cooling are a major recurring cost for on-premises GPU clusters; budget for them alongside the hardware.

AI infrastructure is evolving rapidly. Blackwell makes FP4 quantization a reality; rack-scale systems like GB200 NVL72 are redefining the training and inference of large models. Keep a close eye on hardware developments and make choices optimized for your requirements.