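AMD GPU & ROCm Complete Analysis: Is a CUDA Alternative Possible?

AMD Takes Another Shot
1. AMD GPU Architecture: RDNA vs CDNA
2. ROCm: AMD's Counterpart to the CUDA Software Stack
- Software Stack Comparison
3. HIP: Running CUDA Code on AMD
4. AMD Compute Unit vs NVIDIA SM: Internal Comparison
- Architecture Detail Comparison
- Wave (Wavefront) vs Warp
5. PyTorch on AMD (ROCm)
- Installation and Basic Usage
- BF16 and Flash Attention Support
6. LLM Serving on AMD: vLLM and llama.cpp
7. AMD's Strengths and Practical Weaknesses
8. Practical Guide to AMD System Setup
- Docker Setup (Recommended)
- Bare-Metal Installation
9. AMD vs NVIDIA: A Realistic 2024-2025 Comparison
Closing Thoughts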

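AMD Takes Another Shot

Until the early 2020s, running ML workloads on AMD GPUs meant signing up for pain. ROCm was unstable, few libraries supported it, and driver problems were common. The prevailing view was that anything other than CUDA was not viable.

That changed sharply through 2023-2024. The AMD MI300X carries 192GB of HBM3, enough to run a 70B-parameter model in FP16 on a single GPU, and from ROCm 6.x onward PyTorch and vLLM became considerably more stable. With Microsoft, Meta, and Hugging Face formalizing AMD GPU support, the ecosystem is growing quickly as well.

This post dissects the internals of AMD's GPU architecture, explains how the ROCm software stack works, and gives a candid comparison with NVIDIA in real LLM serving scenarios.

1. AMD GPU Architecture: RDNA vs CDNA

AMD's GPU lineup splits into two fundamentally different architectures depending on the use case.

RDNA: Gaming-Optimized Architecture

RDNA family (consumer GPUs):
- RX 7900 XTX (RDNA 3): 24GB GDDR6, 960 GB/s bandwidth
- RX 7900 XT (RDNA 3): 20GB GDDR6, 800 GB/s bandwidth
- Optimized for graphics rendering (rasterization, ray tracing)
- Cache hierarchy tuned for gaming performance
- ML workloads: possible, but official ROCm support is limited

CDNA: Compute-Optimized Architecture (AI/HPC)

CDNA stands for "Compute DNA", a separate architecture AMD designed for AI and HPC workloads. It competes directly with NVIDIA's datacenter GPUs (A100, H100).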

CDNA family (datacenter GPUs):
┌─────────────────────────────────────────────────────────────────┐
│ AMD MI300X (CDNA 3, released 2023)                              │
│                                                                 │
│  • 192GB HBM3 memory (largest of any GPU at launch)             │
│    → 2.4x the H100 SXM's 80GB                                   │
│  • 5.3 TB/s memory bandwidth                                    │
│    → 1.58x the H100 SXM's 3.35 TB/s                             │
│  • 304 Compute Units                                            │
│  • 1,307 TFLOPS FP16 (dense matrix peak)                        │
│  • 163.4 TFLOPS FP32 (653.7 TFLOPS TF32 matrix)                 │
│  • Chiplet package: 8 XCDs on 4 I/O dies + 8 HBM3 stacks        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

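The MI300X's key design choice is its chiplet-based MCM (Multi-Chip Module) package: eight accelerator dies (XCDs) sit on four I/O dies and share eight HBM3 stacks, so the whole package appears as one GPU with a single 192GB memory pool. The variant that also places CPU chiplets in the same package is the MI300A APU, not the MI300X.

MI300X vs H100: Core Spec Comparison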

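                    AMD MI300X          NVIDIA H100 SXM
Memory:             192GB HBM3          80GB HBM3
Memory bandwidth:   5.3 TB/s            3.35 TB/s
FP16 (dense):       1,307 TFLOPS        ~989 TFLOPS (1,979 with sparsity)
FP8 (dense):        2,614 TFLOPS        ~1,979 TFLOPS (3,958 with sparsity)
Matrix engine:      Matrix Core (MFMA)  4th-gen Tensor Core
TDP:                750W                700W
Estimated price:    ~$15,000-20,000     ~$30,000-40,000
Memory capacity:    ✅ 2.4x larger      -
Peak dense compute: ✅ ~1.3x on paper   -
Bandwidth:          ✅ 1.58x higher     -

For LLM inference, memory bandwidth and capacity often matter more than peak compute, so the MI300X is particularly strong at serving large models. The TFLOPS values are datasheet peaks; delivered throughput depends heavily on kernel and library maturity.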

2. ROCm: AMD's Counterpart to the CUDA Software Stack

If CUDA is one of NVIDIA's strongest competitive advantages, ROCm is AMD's strategic investment to catch up.

Software Stack Comparison

NVIDIA stack:                   AMD stack:
┌──────────────────────┐        ┌──────────────────────┐
│   PyTorch / JAX      │        │   PyTorch / JAX      │
│   TensorFlow         │        │   TensorFlow         │
└──────────┬───────────┘        └──────────┬───────────┘
           ↓                               ↓
┌──────────────────────┐        ┌──────────────────────┐
│    CUDA Runtime      │        │   ROCm Runtime (HIP) │
└──────────┬───────────┘        └──────────┬───────────┘
           ↓                               ↓
┌──────────────────────┐        ┌──────────────────────┐
│   cuDNN / cuBLAS     │        │  MIOpen / rocBLAS    │
│   cuSPARSE           │        │  rocSPARSE           │
│   cuFFT              │        │  rocFFT              │
└──────────┬───────────┘        └──────────┬───────────┘
           ↓                               ↓
┌──────────────────────┐        ┌──────────────────────┐
│   NVCC compiler      │        │   hipcc compiler     │
│   PTX (IR)           │        │   GCN ISA / AMDGCN   │
└──────────┬───────────┘        └──────────┬───────────┘
           ↓                               ↓
┌──────────────────────┐        ┌──────────────────────┐
│    NVIDIA GPU        │        │     AMD GPU          │
└──────────────────────┘        └──────────────────────┘

ROCm's design principle is maximum compatibility with CUDA. The goal is for PyTorch code that calls torch.cuda APIs to run unchanged on AMD GPUs.

3. HIP: Running CUDA Code on AMD

HIP (Heterogeneous-compute Interface for Portability) is a C++ programming interface developed by AMD. Its syntax is nearly identical to CUDA's.

CUDA vs HIP Code Comparison

// CUDA code (NVIDIA):
#include <cuda_runtime.h>
#include <vector>

__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    int n = 1024 * 1024;
    size_t size = n * sizeof(float);

    // Host buffers
    std::vector<float> h_a(n, 1.0f), h_b(n, 2.0f), h_c(n);

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);

    // Copy both inputs to the device, launch the kernel, copy the result back
    cudaMemcpy(d_a, h_a.data(), size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), size, cudaMemcpyHostToDevice);
    vector_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c.data(), d_c, size, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}

// HIP code (AMD): nearly identical to CUDA
#include <hip/hip_runtime.h>
#include <vector>

__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    // HIP uses the same built-in index variables as CUDA
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    int n = 1024 * 1024;
    size_t size = n * sizeof(float);

    // Host buffers
    std::vector<float> h_a(n, 1.0f), h_b(n, 2.0f), h_c(n);

    float *d_a, *d_b, *d_c;
    hipMalloc(&d_a, size);           // cudaMalloc → hipMalloc
    hipMalloc(&d_b, size);
    hipMalloc(&d_c, size);

    hipMemcpy(d_a, h_a.data(), size, hipMemcpyHostToDevice);  // only the prefix changes
    hipMemcpy(d_b, h_b.data(), size, hipMemcpyHostToDevice);
    vector_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);   // same <<<>>> launch syntax
    hipMemcpy(h_c.data(), d_c, size, hipMemcpyDeviceToHost);

    hipFree(d_a);
    hipFree(d_b);
    hipFree(d_c);
    return 0;
}

HIPIFY: Automatic Code Conversion

AMD provides HIPIFY tools that convert CUDA code to HIP automatically:

# Convert CUDA to HIP with the Perl-based HIPIFY
hipify-perl cuda_kernel.cu > hip_kernel.hip

# Or use the clang-based HIPIFY
hipify-clang cuda_kernel.cu -- -I/usr/local/cuda/include

# Conversion rate: simple CUDA code is typically 90%+ automatic
# Custom CUDA intrinsics need manual porting

# Compile the HIP code
hipcc hip_kernel.hip -o hip_kernel

HIP's Design Philosophy: Write Once, Run Anywhere

Code written in HIP runs on both AMD and NVIDIA hardware:

// The same HIP source builds for both platforms:
// AMD:    hipcc --offload-arch=gfx942 kernel.hip   (MI300X)
// NVIDIA: HIP_PLATFORM=nvidia hipcc kernel.hip     (compiled through nvcc)

// Platform detection
#ifdef __HIP_PLATFORM_AMD__
    // AMD-specific optimization
    __builtin_amdgcn_s_sleep(1);
#elif defined(__HIP_PLATFORM_NVIDIA__)
    // NVIDIA-specific code path
    __nanosleep(1000);
#endif

4. AMD Compute Unit vs NVIDIA SM: Internal Comparison

Architecture Detail Comparison

NVIDIA H100 SM (Streaming Multiprocessor):
┌───────────────────────────────────────────────┐
│  128 CUDA Cores (FP32 compute units)          │
│  64 FP64 Cores                                │
│  4 Tensor Cores (4th gen, FP8/FP16/BF16/INT8) │
│  32 LD/ST units (8 per SM sub-partition)      │
│  256KB L1 / Shared Memory (up to 228KB shared)│
│  65,536 x 32-bit registers                    │
│                                               │
│  FP16 Tensor peak per SM: ~7.5 TFLOPS (dense) │
└───────────────────────────────────────────────┘

AMD MI300X CU (Compute Unit):
┌───────────────────────────────────────────────┐
│  64 Stream Processors (SIMD vector units)     │
│  64 FP64 units                                │
│  4 Matrix Cores (MFMA instructions)           │
│  16 LD/ST units                               │
│  32KB L1 vector cache                         │
│  64KB LDS (Local Data Share = Shared Memory)  │
│  512KB vector register file (131,072 x 32-bit)│
│                                               │
│  MFMA = Matrix Fused Multiply-Add, AMD's      │
│  Tensor Core counterpart, e.g.:               │
│  v_mfma_f32_16x16x16f16 (FP16 matrix multiply)│
└───────────────────────────────────────────────┘

Wave (Wavefront) vs Warp

NVIDIA calls its group of 32 threads a warp; AMD calls its group of 64 threads a wavefront (Wave64):

NVIDIA warp:
- 32 threads execute together (SIMT: Single Instruction, Multiple Threads)
- All threads in the warp execute the same instruction

AMD wavefront (Wave64):
- 64 threads execute together
- CDNA GPUs such as MI300X run Wave64 only; RDNA GPUs run Wave32 natively and can also run Wave64
- Larger wavefront = more data parallelism per instruction, but more waste on branch divergence

// Check the wavefront size in a HIP kernel
__global__ void check_wavefront() {
    // AMD: warpSize = 64 on CDNA (MI300X), 32 on RDNA by default
    // NVIDIA: warpSize = 32
    int lane = threadIdx.x % warpSize;
    printf("warpSize: %d, my lane: %d\n", warpSize, lane);
}

5. PyTorch on AMD (ROCm)

Installation and Basic Usage

# Install PyTorch with ROCm support
# ROCm 6.0 (the latest stable release as of 2024)
pip install torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/rocm6.0

# Verify the ROCm environment
python -c "
import torch
print('PyTorch version:', torch.__version__)
print('ROCm available:', torch.cuda.is_available())  # True on AMD!
print('ROCm version:', torch.version.hip)
print('GPU count:', torch.cuda.device_count())
print('GPU name:', torch.cuda.get_device_name(0))
"
# Example output:
# PyTorch version: 2.3.0+rocm6.0
# ROCm available: True  ← cuda.is_available() is True on AMD too
# ROCm version: 6.0.0
# GPU count: 1
# GPU name: AMD Instinct MI300X

# Tensor operations on an AMD GPU
import torch
device = torch.device("cuda")  # "cuda" also selects the AMD GPU under ROCm

x = torch.randn(1000, 1000, device=device)
y = torch.randn(1000, 1000, device=device)
z = torch.matmul(x, y)
print(z.shape)  # torch.Size([1000, 1000])

Why AMD keeps the "cuda" namespace: millions of PyTorch codebases call torch.cuda. Renaming it to torch.hip or torch.rocm would break compatibility, so the ROCm build deliberately exposes the same API and existing code runs on AMD GPUs without modification.

BF16 and Flash Attention Support

import torch
import torch.nn.functional as F

# BF16 (BFloat16) support check
device = torch.device("cuda")
a = torch.randn(512, 512, dtype=torch.bfloat16, device=device)
b = torch.randn(512, 512, dtype=torch.bfloat16, device=device)
c = torch.matmul(a, b)  # uses BF16 MFMA on MI300X

# Flash Attention on AMD (ROCm)
# Install the flash-attn package (a ROCm build is required)
# pip install flash-attn --no-build-isolation

from flash_attn import flash_attn_func

q = torch.randn(2, 512, 8, 64, dtype=torch.float16, device=device)
k = torch.randn(2, 512, 8, 64, dtype=torch.float16, device=device)
v = torch.randn(2, 512, 8, 64, dtype=torch.float16, device=device)

# Flash Attention is supported on ROCm as well
out = flash_attn_func(q, k, v, dropout_p=0.0, causal=True)

6. LLM Serving on AMD: vLLM and llama.cpp

vLLM on AMD MI300X

vLLM has officially supported AMD ROCm since 2024. PagedAttention and continuous batching both work on AMD GPUs.

# Install vLLM for ROCm
# Note: the default PyPI wheel is built for CUDA, so on ROCm the usual routes
# are a ROCm Docker image or a source build inside a ROCm environment
git clone https://github.com/vllm-project/vllm
cd vllm
pip install -e . --no-build-isolation  # the build picks up ROCm from the installed PyTorch

# Serve Llama 3.1 70B on MI300X
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 1 \
  --dtype float16 \
  --device cuda  # "cuda" is used on AMD GPUs too

# Larger model: 405B in FP8 (~405GB of weights) on 4x MI300X (768GB total)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 4 \
  --quantization fp8 \
  --device cuda

# vLLM Python API
from vllm import LLM, SamplingParams

# Load the 70B model in FP16 on one MI300X (possible thanks to 192GB of VRAM)
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    dtype="float16",
    tensor_parallel_size=1,  # a single MI300X is enough
    gpu_memory_utilization=0.85,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
)

prompts = ["Explain the attention mechanism in transformers."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

llama.cpp on AMD

# Build llama.cpp with ROCm support
# (newer llama.cpp checkouts renamed this option, e.g. -DGGML_HIP=ON; check the build docs for your version)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_HIPBLAS=ON \
  -DCMAKE_HIP_ARCHITECTURES="gfx942"  # MI300X architecture
cmake --build build --config Release -j

# Run Llama 3.1 70B Q4_K_M on MI300X
./build/bin/llama-cli \
  -m models/llama-3.1-70b-q4_k_m.gguf \
  -p "Explain ROCm architecture" \
  -n 200 \
  --n-gpu-layers 999
# Rough expectation: ~15-20 tok/s (Q4_K_M, single MI300X)
# Loaded as FP16: ~8-10 tok/s (larger file, better quality)

# On a consumer AMD RX 7900 XTX:
./build/bin/llama-cli -m models/llama-3.1-8b-q4_k_m.gguf ...
# Rough expectation: ~50-60 tok/s (RX 7900 XTX 24GB)

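Practical Performance Comparison

GPU	Memory	Llama 3.1 70B FP16	Llama 3.1 70B INT8	Notes
NVIDIA H100 SXM	80GB HBM3	~2,800 tok/s	~3,200 tok/s	Batch throughput
AMD MI300X	192GB HBM3	~2,200 tok/s	~2,800 tok/s	FP16 fits on a single GPU
NVIDIA A100 80GB	80GB HBM2e	~1,400 tok/s	~1,600 tok/s	-
AMD RX 7900 XTX	24GB GDDR6	OOM	~600 tok/s	Q4 required
NVIDIA RTX 4090	24GB GDDR6X	OOM	~700 tok/s	Q4 required

These are rough batch-throughput estimates, not benchmark results; single-request latency needs to be measured separately. Because 140GB of FP16 weights cannot fit on one 80GB card, the H100 and A100 FP16 figures necessarily imply more than one GPU, and a 70B model exceeds 24GB even at 4-bit, so the consumer-card figures cannot come from a fully GPU-resident 70B model.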

7. AMD's Strengths and Practical Weaknesses

Strength 1: Overwhelming Memory Capacity

MI300X memory usage scenarios:
┌─────────────────────────────────────────────────────────────────┐
│ What fits on a single 192GB MI300X:                             │
│                                                                 │
│ • Llama 3.1 70B FP16 weights: ~140GB → fits! (52GB headroom)    │
│   → Not possible on a single NVIDIA H100 SXM (80GB)             │
│                                                                 │
│ • Llama 3.1 70B + long-context KV cache:                        │
│   Model ~140GB + two 32K-token KV caches ~20GB = 160GB → fits!  │
│                                                                 │
│ • Mixtral 8x7B MoE FP16: ~93GB → fits on a single card!         │
│                                                                 │
│ • Experimental 100B+ research models: run without quantization  │
└─────────────────────────────────────────────────────────────────┘
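
The KV cache line in the box can be checked with a short calculation. This is a minimal sketch that assumes the commonly published Llama 3.1 70B shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128) and an FP16 cache; the function name is illustrative. Under those assumptions one 32K-token sequence needs roughly 10.7 GB, so a ~20GB budget corresponds to about two such sequences.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes of K and V cached for `tokens` tokens of one sequence."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Assumed Llama 3.1 70B shape: 80 layers, 8 KV heads (GQA), head_dim 128
per_sequence = kv_cache_bytes(tokens=32_768, layers=80, kv_heads=8, head_dim=128)
weights_gb = 140    # FP16 weights, from the text
capacity_gb = 192   # MI300X memory

kv_gb = per_sequence / 1e9
sequences_that_fit = int((capacity_gb - weights_gb) // kv_gb)
print(f"KV cache per 32K sequence: {kv_gb:.1f} GB")
print(f"32K sequences that fit next to the weights: {sequences_that_fit}")
```

The estimate ignores activation memory and allocator overhead, so the usable number of concurrent long sequences will be somewhat lower in practice.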

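Strength 2: Excellent Memory Bandwidth

As noted in the spec comparison, the main bottleneck in LLM inference is often memory bandwidth. The MI300X's 5.3 TB/s is about 58% higher than the H100's 3.35 TB/s: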

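# Estimated theoretical maximum token generation speed (memory-bandwidth-limited model)
# Batch size 1, generating one token at a time

# 70B FP16 model = 140GB
# Every weight must be read once for each generated token

def estimate_max_toks_per_sec(memory_bw_gbps, model_size_gb):
    """Theoretical maximum speed under the memory bandwidth limit (bandwidth in GB/s)"""
    return memory_bw_gbps / model_size_gb

# MI300X: 5,300 GB/s / 140 GB = ~38 tok/s (theoretical ceiling)
mi300x_est = estimate_max_toks_per_sec(5300, 140)
print(f"MI300X theoretical ceiling: {mi300x_est:.1f} tok/s")  # ~37.9

# H100 SXM: 3,350 GB/s / 140 GB = ~24 tok/s (theoretical ceiling)
# Hypothetical for a single card: 140GB does not fit in 80GB, so real H100
# deployments shard the model and each GPU reads only its share of the weights
h100_est = estimate_max_toks_per_sec(3350, 140)
print(f"H100 theoretical ceiling: {h100_est:.1f} tok/s")   # ~23.9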

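Weakness 1: The CUDA Ecosystem Gap

This is AMD's biggest problem. CUDA's maturity reflects 15 years of accumulated optimization:

CUDA ecosystem (2024):                  ROCm ecosystem (2024):
- PyTorch: full support ✅              - PyTorch: supported ✅ (stability improving)
- JAX: full support ✅                  - JAX: experimental support ⚠️
- TensorFlow: full support ✅           - TensorFlow: officially supported ✅
- FlashAttention: optimized ✅          - FlashAttention: supported but slower ⚠️
- cuDNN kernels: 15 years of tuning ✅  - MIOpen: optimization ongoing ⚠️
- Triton: full support ✅               - Triton on ROCm: supported (performance gap) ⚠️
- BitsAndBytes: full support ✅         - BitsAndBytes on ROCm: supported ✅
- DeepSpeed: full support ✅            - DeepSpeed on ROCm: supported ✅
- vLLM: full support ✅                 - vLLM on ROCm: officially supported ✅ (2024~)

Weakness 2: Limited ROCm Support for Consumer GPUs

Official ROCm support is concentrated on Linux and the MI-series datacenter GPUs:

# GPUs officially supported by ROCm 6.0 (as of 2024):
# ✅ AMD Instinct MI300X
# ✅ AMD Instinct MI250X
# ✅ AMD Instinct MI210
# ✅ AMD Instinct MI100
# ✅ RX 7900 XTX: on the official Linux support list, but less widely tested than Instinct
# ⚠️ RX 7800 XT and below: not on the official list; many libraries run only with an override

# Forcing ROCm onto an unlisted RDNA 3 consumer GPU by reporting it as gfx1100
# (the RX 7900 XTX is gfx1100 natively and does not need the override)
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export ROCR_VISIBLE_DEVICES=0
rocminfo  # print information about the detected GPUs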
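
The override value is easier to reason about once the naming pattern is visible. The sketch below assumes that a `gfx` target name encodes major, minor, and stepping, with the last two characters being single hex digits, and that HSA_OVERRIDE_GFX_VERSION takes the same three numbers separated by dots; `gfx_to_override` is a hypothetical helper, not part of ROCm.

```python
def gfx_to_override(gfx_target: str) -> str:
    """Convert e.g. 'gfx1100' to '11.0.0' (major.minor.stepping)."""
    if not gfx_target.startswith("gfx"):
        raise ValueError(f"not a gfx target: {gfx_target}")
    digits = gfx_target[3:]
    # Assumption: the last two characters are minor and stepping (hex digits);
    # whatever precedes them is the decimal major version.
    major, minor, stepping = digits[:-2], digits[-2], digits[-1]
    return f"{int(major)}.{int(minor, 16)}.{int(stepping, 16)}"

for target in ("gfx1100", "gfx1030", "gfx942"):
    print(target, "->", gfx_to_override(target))
```

Overriding only changes what the runtime believes about the GPU; kernels compiled for the claimed target still have to be compatible with the real hardware.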

Weakness 3: Driver Stability

ROCm support for AMD GPUs on Windows is still limited as of 2024. For most ML workloads, the recommended combination is Ubuntu 22.04 LTS with ROCm.

8. Practical Guide to AMD System Setup

Docker Setup (Recommended)

# Use a ROCm Docker image (the most stable route)
docker pull rocm/pytorch:rocm6.0_ubuntu22.04_py3.10_pytorch_2.1.1

# Run the container with access to the GPU devices
docker run -it \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --security-opt seccomp=unconfined \
  --cap-add=SYS_PTRACE \
  -v $(pwd):/workspace \
  rocm/pytorch:rocm6.0_ubuntu22.04_py3.10_pytorch_2.1.1 \
  /bin/bash

# Verify inside the container
python -c "import torch; print(torch.cuda.is_available(), torch.version.hip)"

Bare-Metal Installation

# Install ROCm 6.0 on Ubuntu 22.04 LTS
# 1. Add the AMD ROCm package repository
wget https://repo.radeon.com/amdgpu-install/6.0/ubuntu/jammy/amdgpu-install_6.0.60000-1_all.deb
sudo dpkg -i amdgpu-install_6.0.60000-1_all.deb

# 2. Install ROCm (run `amdgpu-install --list-usecase` to see the use cases your installer offers)
sudo amdgpu-install --usecase=rocm,hiplibsdk

# 3. Add the user to the render and video groups
sudo usermod -aG render,video $USER

# 4. Log in again, then verify
rocminfo | grep "Name:"
# Agent 2: gfx942  (MI300X)

# 5. Install PyTorch for ROCm
pip install torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/rocm6.0

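9. AMD vs NVIDIA: A Realistic 2024-2025 Comparison

Recommendations by Workload

Workload	AMD MI300X	NVIDIA H100	Recommendation
Single-GPU serving of a 70B model	✅ Loads directly in FP16	❌ Limited by 80GB	AMD
Serving a 405B model (4-bit weights, ~203GB)	Possible with 2 cards	At least 3 cards	AMD
High-volume batch throughput	Good	Very good	NVIDIA
Single-request latency	Good	Very good	NVIDIA
Model fine-tuning	Possible	More mature	NVIDIA
Price efficiency	30-40% cheaper	Expensive	AMD
Software stability	Improving quickly	Very mature	NVIDIA
Windows support	Limited	Full support	NVIDIA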
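
The card counts in the 405B row depend on the weight precision. As a rough check, the sketch below counts how many GPUs are needed to hold the weights alone, assuming 90% of each GPU's memory is usable for weights; that fraction and the helper name are assumptions for illustration. Under these assumptions the "2 cards versus at least 3" comparison corresponds to 4-bit weights.

```python
import math

def gpus_needed(params_billion: float, bytes_per_param: float,
                gpu_memory_gb: float, usable_fraction: float = 0.9) -> int:
    """Minimum GPU count whose usable memory covers the weights alone."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params x bytes = GB
    return math.ceil(weights_gb / (gpu_memory_gb * usable_fraction))

for label, bytes_per_param in (("FP16", 2), ("FP8", 1), ("4-bit", 0.5)):
    mi300x = gpus_needed(405, bytes_per_param, 192)
    h100 = gpus_needed(405, bytes_per_param, 80)
    print(f"405B {label}: MI300X x{mi300x}, H100 x{h100}")
```

KV cache and activations come on top of the weights, and tensor parallelism usually wants a GPU count that divides the attention heads evenly, so real deployments round these numbers up.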

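Real-World Adoption (2024)

Microsoft Azure: adopted MI300X for AI services (launched AMD GPU instances on Azure)
Meta: evaluating MI300X for some AI workloads
Hugging Face: improving ROCm support across its model hub and libraries
Oracle Cloud: offers OCI Compute instances with MI300X

Conclusion: When to Choose AMD

AMD MI300X is the better choice when:

You want to run the largest possible model on a single GPU
You work in an experimental setting where memory capacity matters more than throughput
You need to cut the budget by 30-40% relative to NVIDIA
You already have an AMD hardware contract

Stay with NVIDIA H100 when:

Production stability is the top priority
You need every new ML library to work right away
You need a Windows-based development environment
Your whole team's expertise is in CUDA

Closing Thoughts

AMD has moved past the "just use CUDA" verdict of the early 2020s and become a serious CUDA alternative. The MI300X's 192GB of HBM3 is not just a spec-sheet boast; it opens up real usage scenarios. Serving a 70B model in FP16 on a single card without quantization is something no single NVIDIA card can currently do.

NVIDIA is still ahead on the software ecosystem, but ROCm 6.x and vLLM's official AMD support are closing the gap quickly. As of 2026, AMD has moved from "usable but unstable" to "practical and competitive."

ML infrastructure teams may want to consider a hybrid strategy: keep NVIDIA H100 as the default and bring in MI300X for specific large-model serving workloads. More competition means more choices for engineers, which is a good thing.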

AMD GPU & ROCm Deep Dive: Can It Challenge CUDA for LLM Inference?

AMD Is Back in the Game
1. AMD GPU Architecture: RDNA vs CDNA
2. ROCm: AMD's Answer to the CUDA Software Stack
- Software Stack Comparison
3. HIP: Running CUDA Code on AMD
4. AMD Compute Unit vs NVIDIA SM: Internal Comparison
- Architecture Detail Comparison
- Wavefront vs Warp
5. PyTorch on AMD (ROCm)
- Installation and Basic Usage
- BF16 and Flash Attention Support
6. LLM Serving on AMD: vLLM and llama.cpp
7. AMD's Strengths and Honest Weaknesses
8. Practical Setup: AMD ROCm Environment
- Docker Setup (Recommended)
- Bare Metal Installation
9. AMD vs NVIDIA: A 2024-2025 Realistic Assessment
Conclusion

AMD Is Back in the Game

Until the early 2020s, using AMD GPUs for ML workloads meant voluntarily embracing pain. ROCm was unstable, library support was thin, and driver issues were frequent. The prevailing consensus was "if it's not CUDA, it's not viable."

That changed dramatically through 2023-2024. The AMD MI300X ships with 192GB of HBM3 memory — enabling a single GPU to run a 70B parameter model in FP16. ROCm 6.x brought meaningful stability improvements for PyTorch and vLLM. Microsoft, Meta, and Hugging Face have all formalized AMD GPU support, and the ecosystem is growing fast.

This post dissects AMD GPU architecture internals, explains how the ROCm software stack works, and honestly compares AMD against NVIDIA in real LLM serving scenarios.

1. AMD GPU Architecture: RDNA vs CDNA

AMD's GPU lineup splits into two fundamentally different architectures depending on the intended use case.

RDNA: Gaming-Optimized Architecture

RDNA Family (Consumer GPUs):
- RX 7900 XTX (RDNA 3): 24GB GDDR6, 960 GB/s bandwidth
- RX 7900 XT (RDNA 3): 20GB GDDR6, 800 GB/s bandwidth
- Optimized for graphics rendering (rasterization, ray tracing)
- Cache hierarchy tuned for maximum gaming performance
- ML workload support: possible, but official ROCm support is limited

CDNA: Compute-Optimized Architecture (AI/HPC)

CDNA stands for "Compute DNA" — AMD's separate architecture designed specifically for AI and HPC workloads. It directly competes with NVIDIA's datacenter GPU line (A100, H100).

CDNA Family (Datacenter GPUs):
┌─────────────────────────────────────────────────────────────────┐
│ AMD MI300X (CDNA 3, released 2023)                              │
│                                                                 │
│  • 192GB HBM3 memory (largest of any GPU at launch)             │
│    → 2.4x the H100 SXM's 80GB                                   │
│  • 5.3 TB/s memory bandwidth                                    │
│    → 1.58x the H100 SXM's 3.35 TB/s                             │
│  • 304 Compute Units                                            │
│  • 1,307 TFLOPS FP16 (dense matrix peak)                        │
│  • 163.4 TFLOPS FP32 (653.7 TFLOPS TF32 matrix)                 │
│  • Chiplet package: 8 XCDs on 4 I/O dies + 8 HBM3 stacks        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

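MI300X's key design choice is its chiplet-based MCM (Multi-Chip Module) package: eight accelerator dies (XCDs) sit on four I/O dies and share eight HBM3 stacks, so the whole package appears as one GPU with a single 192GB memory pool. The variant that also places CPU chiplets in the same package is the MI300A APU, not the MI300X.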

MI300X vs H100: Core Spec Comparison

                    AMD MI300X          NVIDIA H100 SXM
Memory:             192GB HBM3          80GB HBM3
Memory bandwidth:   5.3 TB/s            3.35 TB/s
FP16 (dense):       1,307 TFLOPS        ~989 TFLOPS (1,979 with sparsity)
FP8 (dense):        2,614 TFLOPS        ~1,979 TFLOPS (3,958 with sparsity)
Matrix engine:      Matrix Core (MFMA)  4th-gen Tensor Core
TDP:                750W                700W
Estimated price:    ~$15,000-20,000     ~$30,000-40,000
Memory capacity:    ✅ 2.4x larger      -
Peak dense compute: ✅ ~1.3x on paper   -
Bandwidth:          ✅ 1.58x higher     -

For LLM inference, memory bandwidth and capacity often matter more than raw compute, giving MI300X a meaningful advantage particularly for large model serving.
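
A roofline-style estimate makes the "bandwidth matters more than raw compute" point concrete. The sketch below assumes dense FP16 peaks (about 1,307 TFLOPS for MI300X and about 989 TFLOPS for H100 SXM without sparsity) and approximates decode at batch size B as B FLOPs per byte of FP16 weights read; the function names are illustrative, and real kernels land well below these peaks.

```python
# Roofline-style sketch: below which batch size is FP16 decode memory-bound?
# Peak numbers are datasheet-style assumptions, not measurements.

def ridge_point(peak_tflops: float, bandwidth_tbps: float) -> float:
    """FLOPs per byte at which the compute and memory limits meet."""
    return (peak_tflops * 1e12) / (bandwidth_tbps * 1e12)

def decode_intensity(batch_size: int, bytes_per_weight: int = 2) -> float:
    """Approximate FLOPs per weight byte during decode.

    Each weight is read once per step and used for one multiply-add
    (2 FLOPs) per sequence in the batch.
    """
    return 2 * batch_size / bytes_per_weight

gpus = {
    "MI300X": ridge_point(1307, 5.3),
    "H100 SXM": ridge_point(989, 3.35),
}

for name, ridge in gpus.items():
    for batch in (1, 32, 512):
        bound = "memory" if decode_intensity(batch) < ridge else "compute"
        print(f"{name}: batch {batch:>3} -> {bound}-bound (ridge ~{ridge:.0f} FLOP/byte)")
```

With these assumptions both GPUs stay memory-bound until the batch reaches a few hundred sequences, which is why bandwidth and capacity dominate small-batch serving.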

2. ROCm: AMD's Answer to the CUDA Software Stack

If CUDA is NVIDIA's most powerful competitive moat, ROCm is AMD's strategic investment to close the gap.

Software Stack Comparison

NVIDIA Stack:                   AMD Stack:
┌──────────────────────┐        ┌──────────────────────┐
│   PyTorch / JAX      │        │   PyTorch / JAX      │
│   TensorFlow         │        │   TensorFlow         │
└──────────┬───────────┘        └──────────┬───────────┘
           ↓                               ↓
┌──────────────────────┐        ┌──────────────────────┐
│    CUDA Runtime      │        │   ROCm Runtime (HIP) │
└──────────┬───────────┘        └──────────┬───────────┘
           ↓                               ↓
┌──────────────────────┐        ┌──────────────────────┐
│   cuDNN / cuBLAS     │        │  MIOpen / rocBLAS    │
│   cuSPARSE           │        │  rocSPARSE           │
│   cuFFT              │        │  rocFFT              │
└──────────┬───────────┘        └──────────┬───────────┘
           ↓                               ↓
┌──────────────────────┐        ┌──────────────────────┐
│   NVCC compiler      │        │   hipcc compiler     │
│   PTX (IR)           │        │   GCN ISA / AMDGCN   │
└──────────┬───────────┘        └──────────┬───────────┘
           ↓                               ↓
┌──────────────────────┐        ┌──────────────────────┐
│    NVIDIA GPU        │        │     AMD GPU          │
└──────────────────────┘        └──────────────────────┘

ROCm's design principle is maximum CUDA compatibility. The goal is for PyTorch code using torch.cuda APIs to work unchanged on AMD GPUs.

3. HIP: Running CUDA Code on AMD

HIP (Heterogeneous-compute Interface for Portability) is AMD's C++ programming interface. It uses nearly identical syntax to CUDA.

CUDA vs HIP Code Comparison

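// CUDA code (NVIDIA):
#include <cuda_runtime.h>
#include <vector>

__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    int n = 1024 * 1024;
    size_t size = n * sizeof(float);

    // Host buffers
    std::vector<float> h_a(n, 1.0f), h_b(n, 2.0f), h_c(n);

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);

    // Copy both inputs to the device, launch the kernel, copy the result back
    cudaMemcpy(d_a, h_a.data(), size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), size, cudaMemcpyHostToDevice);
    vector_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c.data(), d_c, size, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}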

// HIP code (AMD): nearly identical to CUDA
#include <hip/hip_runtime.h>
#include <vector>

__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    // HIP uses the same built-in index variables as CUDA
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    int n = 1024 * 1024;
    size_t size = n * sizeof(float);

    // Host buffers
    std::vector<float> h_a(n, 1.0f), h_b(n, 2.0f), h_c(n);

    float *d_a, *d_b, *d_c;
    hipMalloc(&d_a, size);           // cudaMalloc → hipMalloc
    hipMalloc(&d_b, size);
    hipMalloc(&d_c, size);

    hipMemcpy(d_a, h_a.data(), size, hipMemcpyHostToDevice);  // prefix change only
    hipMemcpy(d_b, h_b.data(), size, hipMemcpyHostToDevice);
    vector_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);   // same <<<>>> syntax
    hipMemcpy(h_c.data(), d_c, size, hipMemcpyDeviceToHost);

    hipFree(d_a);
    hipFree(d_b);
    hipFree(d_c);
    return 0;
}

HIPIFY: Automatic Code Conversion

AMD provides the HIPIFY tool for automatic CUDA → HIP conversion:

# Convert CUDA to HIP using HIPIFY
hipify-perl cuda_kernel.cu > hip_kernel.hip

# Or using clang-based HIPIFY
hipify-clang cuda_kernel.cu -- -I/usr/local/cuda/include

# Conversion rate: simple CUDA code is typically 90%+ automatic
# Custom CUDA intrinsics need manual porting

# Compile HIP code
hipcc hip_kernel.hip -o hip_kernel
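
To give a feel for why simple runtime-API code converts so readily, here is a toy renamer. It is only an illustration of prefix-style substitution and is not how the real HIPIFY tools are implemented; the one-entry header table and the `toy_hipify` name are hypothetical.

```python
import re

# Toy illustration of prefix-style renaming; NOT the real HIPIFY implementation.
HEADER_MAP = {"<cuda_runtime.h>": "<hip/hip_runtime.h>"}  # hypothetical minimal table

def toy_hipify(source: str) -> str:
    """Rename cudaXxx identifiers to hipXxx and swap the runtime header."""
    for cuda_header, hip_header in HEADER_MAP.items():
        source = source.replace(cuda_header, hip_header)
    # cudaMalloc -> hipMalloc, cudaMemcpyHostToDevice -> hipMemcpyHostToDevice, ...
    return re.sub(r"\bcuda([A-Z]\w*)", r"hip\1", source)

cuda_snippet = """#include <cuda_runtime.h>
cudaMalloc(&d_a, size);
cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
vector_add<<<n/256, 256>>>(d_a, d_b, d_c, n);
cudaFree(d_a);
"""

hip_snippet = toy_hipify(cuda_snippet)
print(hip_snippet)
```

Kernel bodies and the launch syntax pass through untouched, which matches the side-by-side listing above; anything that depends on NVIDIA-specific intrinsics or library semantics is where manual work starts.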

HIP's Design Philosophy: Write Once, Run Anywhere

Code written in HIP runs on both AMD and NVIDIA hardware:

// The same HIP source builds for both platforms:
// AMD:    hipcc --offload-arch=gfx942 kernel.hip   (MI300X)
// NVIDIA: HIP_PLATFORM=nvidia hipcc kernel.hip     (compiled through nvcc)

// Platform detection
#ifdef __HIP_PLATFORM_AMD__
    // AMD-specific optimization
    __builtin_amdgcn_s_sleep(1);
#elif defined(__HIP_PLATFORM_NVIDIA__)
    // NVIDIA-specific code path
    __nanosleep(1000);
#endif

4. AMD Compute Unit vs NVIDIA SM: Internal Comparison

Architecture Detail Comparison

NVIDIA H100 SM (Streaming Multiprocessor):
┌───────────────────────────────────────────────┐
│  128 CUDA Cores (FP32 compute units)          │
│  64 FP64 Cores                                │
│  4 Tensor Cores (4th gen, FP8/FP16/BF16/INT8) │
│  32 LD/ST units (8 per SM sub-partition)      │
│  256KB L1 / Shared Memory (up to 228KB shared)│
│  65,536 x 32-bit registers                    │
│                                               │
│  FP16 Tensor peak per SM: ~7.5 TFLOPS (dense) │
└───────────────────────────────────────────────┘

AMD MI300X CU (Compute Unit):
┌───────────────────────────────────────────────┐
│  64 Stream Processors (SIMD vector units)     │
│  64 FP64 units                                │
│  4 Matrix Cores (MFMA instructions)           │
│  16 LD/ST units                               │
│  32KB L1 vector cache                         │
│  64KB LDS (Local Data Share = Shared Memory)  │
│  512KB vector register file (131,072 x 32-bit)│
│                                               │
│  MFMA = Matrix Fused Multiply-Add, AMD's      │
│  Tensor Core counterpart, e.g.:               │
│  v_mfma_f32_16x16x16f16 (FP16 matrix multiply)│
└───────────────────────────────────────────────┘

Wavefront vs Warp

NVIDIA calls its group of 32 threads a warp; AMD calls its group of 64 threads a wavefront (Wave64):

NVIDIA warp:
- 32 threads execute together (SIMT: Single Instruction, Multiple Threads)
- All threads in the warp execute the same instruction

AMD wavefront (Wave64):
- 64 threads execute together
- CDNA GPUs such as MI300X run Wave64 only; RDNA GPUs run Wave32 natively and can also run Wave64
- Larger wavefront = more data parallelism per instruction, but more waste on branch divergence

// Check the wavefront size in a HIP kernel
__global__ void check_wavefront() {
    // AMD: warpSize = 64 on CDNA (MI300X), 32 on RDNA by default
    // NVIDIA: warpSize = 32
    int lane = threadIdx.x % warpSize;
    printf("warpSize: %d, my lane: %d\n", warpSize, lane);
}
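
The practical effect of the group size shows up in how a thread block is carved up. The following sketch does the same lane arithmetic as the kernel above on the host side, for a one-dimensional block; the helper names are illustrative.

```python
import math

def groups_per_block(block_threads: int, group_size: int) -> int:
    """Number of warps/wavefronts needed to cover one thread block."""
    return math.ceil(block_threads / group_size)

def lane_and_group(thread_idx: int, group_size: int) -> tuple[int, int]:
    """(lane within the group, group index within the block) for a 1D block."""
    return thread_idx % group_size, thread_idx // group_size

def idle_lane_fraction(block_threads: int, group_size: int) -> float:
    """Share of lanes with no thread when the block size is not a multiple of the group size."""
    total_lanes = groups_per_block(block_threads, group_size) * group_size
    return (total_lanes - block_threads) / total_lanes

for label, size in (("warp (32)", 32), ("wavefront (64)", 64)):
    print(label,
          groups_per_block(256, size),    # groups in a 256-thread block
          lane_and_group(70, size),       # where thread 70 lands
          idle_lane_fraction(96, size))   # waste for a 96-thread block
```

A 96-thread block fills three warps exactly but leaves a quarter of the lanes idle across two 64-wide wavefronts, which is one reason block sizes tuned for NVIDIA are worth revisiting when porting to CDNA.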

5. PyTorch on AMD (ROCm)

Installation and Basic Usage

# Install PyTorch with ROCm support
# ROCm 6.0 (stable as of 2024)
pip install torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/rocm6.0

# Verify ROCm environment
python -c "
import torch
print('PyTorch version:', torch.__version__)
print('ROCm available:', torch.cuda.is_available())  # True on AMD!
print('ROCm version:', torch.version.hip)
print('GPU count:', torch.cuda.device_count())
print('GPU name:', torch.cuda.get_device_name(0))
"
# Example output:
# PyTorch version: 2.3.0+rocm6.0
# ROCm available: True  ← True even on AMD!
# ROCm version: 6.0.0
# GPU count: 1
# GPU name: AMD Instinct MI300X

# Tensor operations on AMD GPU
import torch
device = torch.device("cuda")  # "cuda" works on AMD GPU via ROCm!

x = torch.randn(1000, 1000, device=device)
y = torch.randn(1000, 1000, device=device)
z = torch.matmul(x, y)
print(z.shape)  # torch.Size([1000, 1000])

Why AMD kept the "cuda" namespace: Millions of PyTorch codebases use torch.cuda. Renaming it to torch.hip or torch.rocm would break compatibility. AMD deliberately mirrored the same API so existing code runs on AMD GPUs without modification.
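
Because `torch.cuda.is_available()` is true on both vendors, code that needs to know which backend it is on can look at `torch.version.hip`, which is set on ROCm builds and `None` on CUDA builds. The sketch below wraps that check in a small helper (the name `gpu_backend` is hypothetical) and exercises it with stand-in objects so it runs even where PyTorch is not installed.

```python
from types import SimpleNamespace

def gpu_backend(version) -> str:
    """Classify a PyTorch build from its torch.version-style attributes."""
    if getattr(version, "hip", None):
        return "rocm"
    if getattr(version, "cuda", None):
        return "cuda"
    return "cpu"

# Stand-ins for torch.version on different builds (values are illustrative).
rocm_build = SimpleNamespace(hip="6.0.0", cuda=None)
cuda_build = SimpleNamespace(hip=None, cuda="12.1")

print(gpu_backend(rocm_build))
print(gpu_backend(cuda_build))

try:
    import torch  # optional: only runs where PyTorch is installed
    print("this install:", gpu_backend(torch.version))
except ImportError:
    pass
```

This is mainly useful for choosing vendor-specific kernels or logging; ordinary tensor code can keep using the "cuda" device string on both.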

BF16 and Flash Attention Support

import torch
import torch.nn.functional as F

# BF16 (BFloat16) support check
device = torch.device("cuda")
a = torch.randn(512, 512, dtype=torch.bfloat16, device=device)
b = torch.randn(512, 512, dtype=torch.bfloat16, device=device)
c = torch.matmul(a, b)  # Uses BF16 MFMA on MI300X

# Flash Attention on AMD (ROCm)
# Install flash-attn with ROCm build
# pip install flash-attn --no-build-isolation

from flash_attn import flash_attn_func

q = torch.randn(2, 512, 8, 64, dtype=torch.float16, device=device)
k = torch.randn(2, 512, 8, 64, dtype=torch.float16, device=device)
v = torch.randn(2, 512, 8, 64, dtype=torch.float16, device=device)

# Flash Attention supported on ROCm
out = flash_attn_func(q, k, v, dropout_p=0.0, causal=True)

6. LLM Serving on AMD: vLLM and llama.cpp

vLLM on AMD MI300X

vLLM has officially supported AMD ROCm since 2024. PagedAttention and continuous batching both work on AMD GPUs.

# Install vLLM for ROCm
# Note: the default PyPI wheel is built for CUDA, so on ROCm the usual routes
# are a ROCm Docker image or a source build inside a ROCm environment
git clone https://github.com/vllm-project/vllm
cd vllm
pip install -e . --no-build-isolation  # the build picks up ROCm from the installed PyTorch

# Serve Llama 3.1 70B on MI300X
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 1 \
  --dtype float16 \
  --device cuda  # "cuda" works on AMD via ROCm!

# Larger model: 405B in FP8 (~405GB of weights) on 4x MI300X (768GB total)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 4 \
  --quantization fp8 \
  --device cuda

# vLLM Python API on AMD
from vllm import LLM, SamplingParams

# Load 70B in FP16 on single MI300X (possible because of 192GB VRAM!)
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    dtype="float16",
    tensor_parallel_size=1,  # Single MI300X is enough
    gpu_memory_utilization=0.85,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
)

prompts = ["Explain the attention mechanism in transformers."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
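
Once the API server from the shell example is running, any OpenAI-compatible client can talk to it regardless of the GPU vendor underneath. Below is a minimal standard-library sketch that builds such a request; the host and port (`localhost:8000`) are assumptions about the deployment, and `build_completion_request` is an illustrative helper rather than part of vLLM.

```python
import json
import urllib.request

def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 128) -> urllib.request.Request:
    """Build (but do not send) a POST to an OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt,
               "max_tokens": max_tokens, "temperature": 0.7}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_completion_request(
    "http://localhost:8000",  # assumed host/port of the running server
    "meta-llama/Llama-3.1-70B-Instruct",
    "Explain the attention mechanism in transformers.",
)
print(request.full_url)

# To actually send it, with the server running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["text"])
```

Nothing in the request mentions ROCm or CUDA, so client code written against an NVIDIA-backed endpoint should carry over unchanged.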

llama.cpp on AMD

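# Build llama.cpp with ROCm support
# (newer llama.cpp checkouts renamed this option, e.g. -DGGML_HIP=ON; check the build docs for your version)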
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_HIPBLAS=ON \
  -DCMAKE_HIP_ARCHITECTURES="gfx942"  # MI300X architecture
cmake --build build --config Release -j

# Run Llama 3.1 70B Q4_K_M on MI300X
./build/bin/llama-cli \
  -m models/llama-3.1-70b-q4_k_m.gguf \
  -p "Explain ROCm architecture" \
  -n 200 \
  --n-gpu-layers 999
# Expected: ~15-20 tok/s (Q4_K_M, single MI300X)
# With FP16 load: ~8-10 tok/s (better quality, larger model)

# On consumer AMD RX 7900 XTX:
./build/bin/llama-cli -m models/llama-3.1-8b-q4_k_m.gguf ...
# Expected: ~50-60 tok/s (RX 7900 XTX 24GB)

Real Performance Comparison Table

GPU	Memory	Llama 3.1 70B FP16	Llama 3.1 70B INT8	Notes
NVIDIA H100 SXM	80GB HBM3	~2,800 tok/s	~3,200 tok/s	Batch throughput
AMD MI300X	192GB HBM3	~2,200 tok/s	~2,800 tok/s	Single GPU, FP16 possible!
NVIDIA A100 80GB	80GB HBM2e	~1,400 tok/s	~1,600 tok/s	-
AMD RX 7900 XTX	24GB GDDR6	OOM	~600 tok/s	Q4 required
NVIDIA RTX 4090	24GB GDDR6X	OOM	~700 tok/s	Q4 required

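Figures are approximate batch-processing throughput in tokens per second; single-request latency is measured separately. Because 70B FP16 weights (~140GB) exceed a single 80GB card, the H100 and A100 FP16 figures necessarily assume more than one GPU.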

7. AMD's Strengths and Honest Weaknesses

Strength 1: Dominant Memory Capacity

What a single MI300X 192GB enables:
┌─────────────────────────────────────────────────────────────────┐
│ Llama 3.1 70B FP16 load: ~140GB → fits! (52GB headroom)        │
│   → Single H100 SXM cannot hold this (80GB limit)              │
│                                                                 │
│ Llama 3.1 70B + long context KV cache:                         │
│   Model ~140GB + KV cache 32K context ~20GB = 160GB → fits!    │
│                                                                 │
│ Mixtral 8x7B MoE FP16: ~93GB → single card possible!          │
│                                                                 │
│ Experimental 100B+ research models: test without quantization  │
└─────────────────────────────────────────────────────────────────┘
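The arithmetic in the box can be reproduced with a back-of-the-envelope calculator. This is a rough sketch under stated assumptions: the layer count (80), KV head count (8, grouped-query attention), and head dimension (128) are taken from the published Llama 3.1 70B configuration, the KV cache is stored in FP16, and activation memory and framework overhead are ignored. The function names are illustrative.

```python
def weights_gb(n_params_billion, bytes_per_param):
    """Weight memory in GB (decimal): billions of params times bytes per param."""
    return n_params_billion * bytes_per_param


def kv_cache_gb(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """KV cache size in GB for n_tokens cached tokens.

    The leading factor 2 accounts for storing both keys and values.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value
    return total_bytes / 1e9


def fits(total_gb, gpu_gb=192):
    """True if the requested memory fits within the card's capacity."""
    return total_gb <= gpu_gb


# Assumed Llama 3.1 70B shape: 80 layers, 8 KV heads, head_dim 128.
model_gb = weights_gb(70, 2)                   # FP16 weights
kv_gb = kv_cache_gb(80, 8, 128, 32 * 1024)     # one 32K-token sequence

print(f"weights: {model_gb:.0f} GB, KV cache per 32K sequence: {kv_gb:.1f} GB")
print("fits on MI300X with two 32K sequences:", fits(model_gb + 2 * kv_gb))
print("weights alone fit on an 80GB card:", fits(model_gb, gpu_gb=80))
```

Under these assumptions a single 32K-token sequence needs roughly 10.7GB of KV cache, so the ~20GB figure in the box corresponds to about two such sequences held in memory at once.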

Strength 2: Superior Memory Bandwidth

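Token-by-token LLM decoding at small batch sizes is memory-bandwidth bound (as established earlier), so the ceiling on tokens per second scales with bandwidth. The MI300X's 5.3 TB/s is ~58% higher than the H100 SXM's 3.35 TB/s: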

# Theoretical maximum token generation speed (memory bandwidth limited)
# Batch size 1, single token generation

# 70B FP16 model = 140GB
# Each token generation must read all weights once

def estimate_max_toks_per_sec(memory_bw_gb_per_s, model_size_gb):
    """Theoretical tokens/s ceiling when every token reads all weights once.

    memory_bw_gb_per_s is in gigabytes per second (GB/s), not gigabits.
    """
    return memory_bw_gb_per_s / model_size_gb

# MI300X: 5,300 GB/s / 140 GB = ~38 tok/s (theoretical ceiling)
mi300x_est = estimate_max_toks_per_sec(5300, 140)
print(f"MI300X theoretical ceiling: {mi300x_est:.1f} tok/s")  # ~37.9

# H100 SXM: 3,350 GB/s / 140 GB = ~24 tok/s (theoretical ceiling)
# In practice the H100 can close or reverse this gap thanks to more mature, better-optimized kernels
h100_est = estimate_max_toks_per_sec(3350, 140)
print(f"H100 theoretical ceiling: {h100_est:.1f} tok/s")   # ~23.9
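The batch-size-1 ceiling explains single-stream speed, but not the much larger batch throughput figures quoted earlier. A simple roofline-style sketch shows how the ceiling moves with batch size. It assumes each decode step reads the weights once for the whole batch and costs about 2 FLOPs per parameter per token, and it ignores KV-cache traffic, attention FLOPs, and kernel efficiency. The 1,307 TFLOPS FP16 peak is the MI300X figure from the spec list earlier in this article; the function name is illustrative.

```python
def decode_ceiling_tok_per_s(batch_size, bw_gb_per_s, peak_tflops,
                             n_params_billion, bytes_per_param=2):
    """Upper bound on decode throughput: min(bandwidth bound, compute bound)."""
    model_gb = n_params_billion * bytes_per_param
    # One pass over the weights yields one token for every sequence in the batch.
    bandwidth_bound = batch_size * bw_gb_per_s / model_gb
    # Roughly 2 FLOPs (multiply + add) per parameter per generated token.
    flops_per_token = 2 * n_params_billion * 1e9
    compute_bound = peak_tflops * 1e12 / flops_per_token
    return min(bandwidth_bound, compute_bound)


# MI300X, 70B FP16: 5,300 GB/s and 1,307 TFLOPS FP16 peak
for batch in (1, 8, 64, 256, 1024):
    ceiling = decode_ceiling_tok_per_s(batch, 5300, 1307, 70)
    print(f"batch {batch:>4}: ceiling ~{ceiling:,.0f} tok/s")
```

In this simplified model the ceiling grows linearly with batch size until the compute bound takes over, which for these numbers happens at a batch size in the low hundreds. Real systems land well below both bounds.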

Weakness 1: CUDA Ecosystem Gap

This is AMD's biggest challenge. CUDA's maturity represents more than 15 years of continuous optimization:

CUDA Ecosystem (2024):                  ROCm Ecosystem (2024):
- PyTorch: full support ✅              - PyTorch: supported ✅ (stability improving)
- JAX: full support ✅                  - JAX: experimental support ⚠️
- TensorFlow: full support ✅           - TensorFlow: official support ✅
- FlashAttention: highly optimized ✅  - FlashAttention: supported but slower ⚠️
- cuDNN kernels: 15 years optimized ✅ - MIOpen: optimization ongoing ⚠️
- Triton: full support ✅               - Triton on ROCm: supported (perf gap) ⚠️
- BitsAndBytes: full support ✅        - BitsAndBytes on ROCm: supported ✅
- DeepSpeed: full support ✅           - DeepSpeed on ROCm: supported ✅
- vLLM: full support ✅                 - vLLM on ROCm: official support ✅ (2024~)

Weakness 2: Limited Consumer GPU ROCm Support

AMD ROCm's official platform is Linux + MI-series datacenter GPUs:

# ROCm 6.0 officially supported GPUs (as of 2024):
# ✅ AMD Instinct MI300X
# ✅ AMD Instinct MI250X
# ✅ AMD Instinct MI210
# ✅ AMD Instinct MI100
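# ⚠️ RX 7900 XTX (gfx1100): listed for Linux since ROCm 5.7, but far less widely tested than Instinct
# ❌ RX 7800 XT and below: not on the official list; support is unstable

# Force ROCm onto an unlisted RDNA 3 card by reporting it as gfx1100
export HSA_OVERRIDE_GFX_VERSION=11.0.0  # Not needed on RX 7900 XTX, which is already gfx1100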
export ROCR_VISIBLE_DEVICES=0
rocminfo  # Print detected GPU info

Weakness 3: Driver Stability

Windows support for ROCm remained limited as of 2024, so Ubuntu 22.04 LTS + ROCm is the recommended combination for ML workloads.

8. Practical Setup: AMD ROCm Environment

Docker Setup (Recommended)

# Use ROCm Docker image (most stable approach)
docker pull rocm/pytorch:rocm6.0_ubuntu22.04_py3.10_pytorch_2.1.1

# Run container with GPU access
docker run -it \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --security-opt seccomp=unconfined \
  --cap-add=SYS_PTRACE \
  -v "$(pwd)":/workspace \
  rocm/pytorch:rocm6.0_ubuntu22.04_py3.10_pytorch_2.1.1 \
  /bin/bash

# Inside container, verify
python -c "import torch; print(torch.cuda.is_available(), torch.version.hip)"
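The one-liner above only confirms that PyTorch sees a device. A slightly fuller check also runs a small FP16 matmul on the GPU, which tends to surface driver or library problems early. This is a minimal sketch; `rocm_sanity_check` is an illustrative helper rather than a PyTorch or ROCm API, and it degrades gracefully when PyTorch or a GPU is missing.

```python
def rocm_sanity_check():
    """Return a small report dict describing the PyTorch/ROCm state."""
    report = {
        "torch_installed": False,
        "hip_version": None,      # None on CUDA or CPU-only builds
        "gpu_available": False,
        "device_name": None,
        "matmul_ok": False,
    }
    try:
        import torch
    except ImportError:
        return report

    report["torch_installed"] = True
    report["hip_version"] = getattr(torch.version, "hip", None)
    report["gpu_available"] = bool(torch.cuda.is_available())

    if report["gpu_available"]:
        # ROCm builds of PyTorch expose AMD GPUs through the "cuda" device type.
        report["device_name"] = torch.cuda.get_device_name(0)
        a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
        b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
        c = a @ b
        torch.cuda.synchronize()  # wait for the kernel so errors surface here
        report["matmul_ok"] = bool(torch.isfinite(c).all().item())
    return report


for key, value in rocm_sanity_check().items():
    print(f"{key}: {value}")
```

On a working ROCm container, `hip_version` should be a version string and `matmul_ok` should be `True`; a `None` HIP version means the installed PyTorch wheel is not a ROCm build.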

Bare Metal Installation

# Install ROCm 6.0 on Ubuntu 22.04 LTS
# Step 1: Add AMD ROCm package repository
wget https://repo.radeon.com/amdgpu-install/6.0/ubuntu/jammy/amdgpu-install_6.0.60000-1_all.deb
sudo dpkg -i amdgpu-install_6.0.60000-1_all.deb
sudo apt update

# Step 2: Install ROCm (run `amdgpu-install --list-usecase` to see the valid use-case names)
sudo amdgpu-install --usecase=rocm,hiplibsdk

# Step 3: Add user to render and video groups
sudo usermod -aG render,video $USER

# Step 4: Re-login and verify
rocminfo | grep "Name:"
# Agent 2: gfx942  (MI300X)

# Step 5: Install PyTorch ROCm
pip install torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/rocm6.0
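To script the verification in Step 4, the `rocminfo` output can be parsed for gfx targets. The sketch below assumes the agent name appears on a line of the form `Name: gfx942`, as in the grep output above. The `SAMPLE` text is hypothetical and only mimics that layout, and the helper names are illustrative.

```python
import re
import shutil
import subprocess

# Matches lines such as "  Name:                    gfx942"
GFX_PATTERN = re.compile(r"^[ \t]*Name:[ \t]+(gfx[0-9a-f]+)[ \t]*$", re.MULTILINE)


def parse_gfx_targets(rocminfo_text):
    """Return the distinct gfx targets in order of first appearance."""
    targets = []
    for match in GFX_PATTERN.finditer(rocminfo_text):
        if match.group(1) not in targets:
            targets.append(match.group(1))
    return targets


def detect_gfx_targets():
    """Run rocminfo if it is on PATH; return [] when it is not installed."""
    if shutil.which("rocminfo") is None:
        return []
    result = subprocess.run(["rocminfo"], capture_output=True, text=True, check=False)
    return parse_gfx_targets(result.stdout)


# Hypothetical excerpt, used only to exercise the parser.
SAMPLE = """\
  Name:                    AMD EPYC 9654 96-Core Processor
  Name:                    gfx942
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
"""

print("sample targets:", parse_gfx_targets(SAMPLE))
print("detected targets:", detect_gfx_targets())
```

An empty list from `detect_gfx_targets()` after installation usually means either that `rocminfo` is not on the PATH or that the current user has not yet picked up the `render` and `video` group membership from Step 3.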

9. AMD vs NVIDIA: A 2024-2025 Realistic Assessment

Workload-Specific Recommendations

Workload	AMD MI300X	NVIDIA H100	Recommendation
70B model single-GPU serving	✅ FP16 fits natively	❌ 80GB limit	AMD
405B model serving (FP8, ~405GB weights)	4 cards (768GB)	8 cards (640GB)	AMD
Large-batch throughput	Good	Excellent	NVIDIA
Single-request latency	Good	Excellent	NVIDIA
Model fine-tuning	Viable	More mature	NVIDIA
Price efficiency	30-40% cheaper	Expensive	AMD
Software stability	Improving rapidly	Very mature	NVIDIA
Windows support	Limited	Full support	NVIDIA
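Card counts for a given model can be sanity-checked with a rough sizing helper. This sketch assumes that about 90% of each card's memory is usable for weights and that the tensor-parallel size is rounded up to a power of two, a common convention because the GPU count has to divide the number of attention heads. Both are assumptions rather than vendor guidance, the KV cache is not counted, and `min_gpus` is an illustrative name.

```python
import math


def min_gpus(weights_gb, gpu_memory_gb, usable_fraction=0.9, power_of_two=True):
    """Smallest GPU count whose combined usable memory holds the weights."""
    usable_gb = gpu_memory_gb * usable_fraction
    count = max(1, math.ceil(weights_gb / usable_gb))
    if power_of_two:
        # Round up to the next power of two (1, 2, 4, 8, ...).
        count = 1 << (count - 1).bit_length()
    return count


scenarios = {
    "70B FP16 (~140GB)": 140,
    "405B FP8 (~405GB)": 405,
}
for name, size_gb in scenarios.items():
    print(f"{name}: MI300X x{min_gpus(size_gb, 192)}, H100 x{min_gpus(size_gb, 80)}")
```

The helper only bounds weight storage. Long contexts or large batches add KV cache on top, which pushes the practical card count up on either vendor.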

Real-World Adoption (2024)

Microsoft Azure: Deployed MI300X for AI services (AMD GPU instances launched)
Meta: Evaluating MI300X for certain AI workloads
Hugging Face: Improving ROCm support across model hub and libraries
Oracle Cloud: OCI Compute MI300X instances available

Decision Guide: When to Choose AMD

AMD MI300X is the better choice when:

You want to run the largest possible model on a single GPU
Memory capacity matters more than throughput in experimental settings
You need 30-40% budget savings vs NVIDIA
You already have AMD hardware contracts in place

Stick with NVIDIA H100 when:

Production stability is the top priority
You need immediate access to every latest ML library
Windows-based development environment is required
Your team has deep CUDA expertise

Conclusion

AMD has moved from "just use CUDA" territory to a serious contender. The MI300X's 192GB HBM3 is not just a spec flex: it enables concrete use cases. Running a 70B model in FP16 on a single card without quantization is something an 80GB H100 cannot do.

The software ecosystem still trails NVIDIA, but ROCm 6.x and vLLM's official AMD support are rapidly narrowing the gap. As of 2026, AMD has graduated from "available but unstable" to "practical and competitive."

For ML infrastructure teams, a hybrid strategy is worth considering: maintain NVIDIA H100 as the default while deploying MI300X for specific large-model serving workloads where memory capacity is the binding constraint. Competition benefits engineers: the more real alternatives exist, the better.