1. GPU Memory Composition
2. FP32 vs FP16 vs BF16 vs FP8: Comparing Numeric Formats
3. How Mixed Precision Training Works
- 3.1 Three Core Techniques
- 3.2 Advantages of BF16 Training
4. PyTorch AMP: Features from the Official Documentation and Code Examples
5. Gradient Checkpointing (Activation Checkpointing)
6. Gradient Accumulation
7. Memory Profiling with torch.cuda.memory_summary()
8. NVIDIA Transformer Engine and FP8 Training
9. Case Study: Training Large Models on Limited GPUs
10. Wrap-up
References

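1. GPU Memory Composition

When training a deep learning model, GPU memory holds far more than the model parameters. Four components account for most of the memory used during training, and an effective optimization strategy starts with understanding each of them.

1.1 Model Parameters

Model parameters are the weights and biases of the neural network. They stay resident in GPU memory for the whole training run, and their footprint is determined by the parameter count and the data type.

FP32: 4 bytes per parameter
FP16/BF16: 2 bytes per parameter

For example, a 7B (7 billion) parameter model needs about 28 GB in FP32 and about 14 GB in FP16.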

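1.2 Gradients

A gradient is the partial derivative of the loss function with respect to a parameter. Gradients are computed during backpropagation, one per parameter, so they occupy the same amount of memory as the parameters.

FP32 training: number of parameters x 4 bytes
Mixed precision training: number of parameters x 2 bytes (stored in FP16)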

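1.3 Optimizer States

Optimizer states are often the largest consumer of memory. SGD with momentum keeps a single momentum buffer per parameter, whereas Adam/AdamW keeps two additional states per parameter: the first moment (mean) and the second moment (variance).

FP32 state kept for Adam when the FP32 master weights are counted together with the optimizer, as in mixed precision training:

Let Φ be the number of parameters:
- Master Weights (FP32): 4Φ bytes
- Momentum (FP32): 4Φ bytes
- Variance (FP32): 4Φ bytes
- Total: 12Φ bytes

For a 7B model trained with Adam, this FP32 state alone takes about 84 GB (12 bytes x 7B). This is why optimizing optimizer states is central to large-model training.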

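1.4 Activations

Activations are the outputs of each layer during the forward pass. They must be kept so that gradients can be computed during backpropagation. Activation memory grows with the following factors:

Batch Size: larger batches need more memory
Sequence Length: especially important for Transformer models (O(n^2) for attention)
Hidden Dimension: grows with model width
Number of Layers: grows with model depth

For large models, activation memory frequently exceeds parameter memory.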

1.5 Overall Memory Summary

With mixed precision training and the Adam optimizer, the total memory required for Φ parameters is:

Component	Data Type	Memory
Model Parameters (FP16 copy)	FP16	2Φ bytes
Gradients (FP16)	FP16	2Φ bytes
Master Weights	FP32	4Φ bytes
Optimizer Momentum	FP32	4Φ bytes
Optimizer Variance	FP32	4Φ bytes
Total		16Φ bytes

Activation memory comes on top of this and depends on batch size and sequence length. Temporary buffers and memory fragmentation add further overhead, roughly 20-30% in practice.
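
The 16Φ accounting in the table can be turned into a quick estimator. The sketch below is only a back-of-the-envelope helper: the function name `estimate_training_memory_gb` is hypothetical, 1 GB is taken as 10^9 bytes to match the figures used in this article, and activations and allocator overhead are deliberately left out.

```python
# Per-parameter byte costs for mixed precision training with Adam,
# following the table above (hypothetical helper, not a library API).
BYTES_PER_PARAM = {
    "params_fp16": 2,
    "grads_fp16": 2,
    "master_weights_fp32": 4,
    "adam_momentum_fp32": 4,
    "adam_variance_fp32": 4,
}


def estimate_training_memory_gb(num_params):
    """Return the estimated memory per component in GB (1 GB = 1e9 bytes)."""
    breakdown = {
        name: num_params * nbytes / 1e9
        for name, nbytes in BYTES_PER_PARAM.items()
    }
    # Sum the five components before adding the "total" key itself.
    breakdown["total"] = sum(breakdown.values())
    return breakdown


if __name__ == "__main__":
    for name, gb in estimate_training_memory_gb(7e9).items():
        print(f"{name:>22}: {gb:6.1f} GB")
```

For a 7B model this reproduces the 14 GB parameter copy and a 112 GB total before activations.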

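2. FP32 vs FP16 vs BF16 vs FP8: Comparing Numeric Formats

The floating-point formats supported by NVIDIA GPUs each have distinct characteristics. To understand the trade-off between training stability and performance, this section compares their bit layouts and representable ranges.

2.1 Bit Layout

All of these formats use the IEEE 754 style layout of three fields: sign, exponent, and mantissa. FP32 and FP16 are IEEE 754 formats; BF16 and FP8 reuse the same layout but are not defined by that standard.

Format	Sign	Exponent	Mantissa	Total Bits	Memory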
FP32	1	8	23	32	4 bytes
FP16	1	5	10	16	2 bytes
BF16	1	8	7	16	2 bytes
FP8 (E4M3)	1	4	3	8	1 byte
FP8 (E5M2)	1	5	2	8	1 byte

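2.2 Dynamic Range and Precision

The representable range and precision of each format are as follows:

FP32 (Single Precision)

Dynamic Range: ~1.2 x 10^-38 to 3.4 x 10^38
Precision: about 7 significant decimal digits
The default data type for deep learning and the most precise of the formats compared here.

FP16 (Half Precision)

Dynamic Range: ~6.1 x 10^-5 to 6.55 x 10^4 (normal values)
Precision: about 3 significant decimal digits
The narrow dynamic range makes gradient underflow and overflow likely, so FP16 training should be combined with loss scaling.

BF16 (Brain Floating Point 16)

Dynamic Range: ~1.2 x 10^-38 to 3.4 x 10^38 (same as FP32)
Precision: about 2 significant decimal digits
Because BF16 has the same 8-bit exponent as FP32, its dynamic range is the same. Precision is lower, but training is stable without loss scaling, which has made BF16 the de facto standard for recent large-model training. Supported on NVIDIA Ampere (A100) and later.

FP8 E4M3

Dynamic Range: ~±448
Precision: lowest range of the formats here, but more mantissa bits than E5M2
Mainly used in the forward pass. Per-tensor scaling is essential to compensate for the narrow range.

FP8 E5M2

Dynamic Range: ~±57,344
Precision: lower than E4M3, but with a wider range
Mainly used in the backward pass (gradient computation), to accommodate the wide dynamic range of gradients.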
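
The ranges quoted above follow directly from the bit layouts in 2.1. The sketch below derives them; the helper name `format_limits` is hypothetical. One assumption is worth stating: E4M3 is modeled the way it is commonly specified for FP8 training, where the all-ones exponent still encodes ordinary values and only the all-ones mantissa pattern is NaN, which is why its maximum is 448 rather than 240.

```python
def format_limits(exp_bits, man_bits, ieee_like=True):
    """Derive max value, smallest normal, and smallest subnormal from a bit layout.

    ieee_like=True reserves the all-ones exponent for inf/NaN
    (FP32, FP16, BF16, FP8 E5M2).
    ieee_like=False models FP8 E4M3 (assumption: only the all-ones
    exponent + all-ones mantissa pattern is reserved, for NaN).
    """
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        max_exp = (2 ** exp_bits - 2) - bias
        max_significand = 2 - 2 ** (-man_bits)
    else:
        max_exp = (2 ** exp_bits - 1) - bias
        max_significand = 2 - 2 ** (1 - man_bits)
    return {
        "max": max_significand * 2.0 ** max_exp,
        "min_normal": 2.0 ** (1 - bias),
        "min_subnormal": 2.0 ** (1 - bias - man_bits),
    }


# (exponent bits, mantissa bits, ieee_like) per format, from the table in 2.1
FORMATS = {
    "FP32": (8, 23, True),
    "FP16": (5, 10, True),
    "BF16": (8, 7, True),
    "FP8 E4M3": (4, 3, False),
    "FP8 E5M2": (5, 2, True),
}

if __name__ == "__main__":
    for name, (e, m, ieee) in FORMATS.items():
        lim = format_limits(e, m, ieee)
        print(f"{name:>9}: max={lim['max']:.4g}  "
              f"min_normal={lim['min_normal']:.4g}  "
              f"min_subnormal={lim['min_subnormal']:.4g}")
```

The FP16 row shows both the smallest normal value (about 6.1 x 10^-5) and the smallest subnormal (about 6 x 10^-8), which is the floor that matters for gradient underflow.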

2.3 Suitable Use Cases for Each Format

FP32  → Optimizer states, master weights (precise accumulation required)
BF16  → Forward/backward computation (wide range, no loss scaling needed)
FP16  → Forward/backward computation (uses Tensor Cores, loss scaling required)
FP8   → Maximum performance on Hopper/Blackwell GPUs (via Transformer Engine)

3. How Mixed Precision Training Works

Mixed precision training was proposed in the 2017 paper "Mixed Precision Training" (Micikevicius et al.) by researchers at NVIDIA and Baidu. It mixes FP16 (or BF16) with FP32 during training. The goal is to gain speed from Tensor Cores and reduce memory while keeping FP32-level training accuracy.

3.1 Three Core Techniques

According to NVIDIA's documentation, mixed precision training consists of three core techniques:

(1) Master Weights (keeping an FP32 copy)

Even when computation runs in FP16, a separate FP32 master copy of the model parameters is kept. At each weight update, the gradients are applied to the FP32 master weights, which are then cast back to FP16 for the next forward pass. The reason is that with FP16's low precision, a small update can be lost entirely when added to a much larger weight (the swamping problem).

Forward Pass: FP16 Weights → FP16 Activations → FP16 Loss
Backward Pass: compute FP16 Gradients
Weight Update: FP32 Master Weights -= Learning Rate × FP32 Gradients
FP16 Weights = cast(FP32 Master Weights)
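
The swamping problem is easy to reproduce without a GPU. The following is a minimal sketch, assuming Python's `struct` module as a stand-in for FP16 and FP32 rounding; the helper names `to_fp16`, `to_fp32`, and `simulate_updates` are hypothetical, and the weight, learning rate, and gradient values are chosen purely for illustration.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def simulate_updates(steps=1000, lr=1e-4, grad=1.0):
    """Apply the same small update repeatedly with and without a master copy."""
    w_fp16_only = to_fp16(1.0)  # weight stored and updated purely in FP16
    master = to_fp32(1.0)       # FP32 master copy of the same weight
    for _ in range(steps):
        # FP16-only: the 1e-4 step is smaller than half the FP16 spacing
        # just below 1.0, so the result rounds back to 1.0 every time.
        w_fp16_only = to_fp16(w_fp16_only - lr * grad)
        # Master weights: accumulate the update in FP32.
        master = to_fp32(master - to_fp32(lr * grad))
    # The FP16 working copy is re-derived from the master for the forward pass.
    return w_fp16_only, master, to_fp16(master)


if __name__ == "__main__":
    fp16_only, master, fp16_from_master = simulate_updates()
    print("FP16-only weight:       ", fp16_only)
    print("FP32 master weight:     ", master)
    print("FP16 cast of the master:", fp16_from_master)
```

The FP16-only weight never moves, while the master copy accumulates all 1000 updates and ends near 0.9.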

(2) Loss Scaling

The range of FP16 normal values (~6.1 x 10^-5 to 6.55 x 10^4) can be too narrow for gradients. In real training, gradient values smaller than 10^-10 are common, yet the smallest positive value FP16 can represent, even as a subnormal, is about 6 x 10^-8. Smaller values underflow, so small gradients are flushed to zero.

Loss scaling addresses this problem:

Multiply the loss computed in the forward pass by a scale factor.
By the chain rule, every gradient computed during backpropagation is multiplied by the same scale factor.
Before the weight update, divide the gradients by the scale factor to restore their original magnitude.
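
The effect of these three steps can be checked numerically. This is a minimal sketch, again assuming Python's `struct` module as a stand-in for FP16 rounding; `to_fp16` and `demo_loss_scaling` are hypothetical names, and the gradient value 1e-8 and the scale 2^16 are illustrative choices. Scaling the loss is modeled directly as scaling the gradient, which is what the chain rule implies.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def demo_loss_scaling(true_grad=1e-8, scale=2.0 ** 16):
    """Compare storing a tiny gradient in FP16 with and without scaling."""
    # Without scaling: the value is below the smallest FP16 subnormal
    # and is flushed to zero.
    unscaled = to_fp16(true_grad)
    # With scaling: the scaled gradient lands in the FP16 normal range.
    scaled = to_fp16(true_grad * scale)
    # Unscaling happens in higher precision before the weight update.
    recovered = scaled / scale
    return unscaled, scaled, recovered


if __name__ == "__main__":
    unscaled, scaled, recovered = demo_loss_scaling()
    print("FP16 without scaling:", unscaled)
    print("FP16 with scaling:   ", scaled)
    print("after unscaling:     ", recovered)
```

Without scaling the gradient becomes exactly zero; with scaling it is recovered to within FP16 rounding error.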

(3) Dynamic Loss Scaling

Dynamic loss scaling adjusts the scale factor during training instead of fixing it. NVIDIA's approach works as follows:

Start with a large scale factor (for example, 2^24 = 16,777,216).
At every iteration, check the gradients for inf/NaN.
If no overflow occurs: keep the current scale factor, and double it after N consecutive successful iterations (default N=2000).
If overflow occurs: skip the weight update for that iteration and halve the scale factor.

This mechanism adapts automatically as the gradient distribution changes during training. When gradients shrink late in training, the scale factor grows and helps prevent underflow.
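
The update rule described above fits in a few lines. The class below is a toy sketch, not the NVIDIA or PyTorch implementation; the class name `DynamicLossScaler` and its parameter names are hypothetical, with defaults taken from the values in the text.

```python
class DynamicLossScaler:
    """Toy dynamic loss scaler following the steps described above."""

    def __init__(self, init_scale=2.0 ** 24, growth_interval=2000,
                 growth_factor=2.0, backoff_factor=0.5):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self._clean_steps = 0  # consecutive iterations without inf/NaN

    def update(self, found_inf_or_nan):
        """Adjust the scale; return True if the optimizer step should be applied."""
        if found_inf_or_nan:
            # Overflow: shrink the scale and skip this weight update.
            self.scale *= self.backoff_factor
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps == self.growth_interval:
            # Enough clean iterations in a row: try a larger scale.
            self.scale *= self.growth_factor
            self._clean_steps = 0
        return True


if __name__ == "__main__":
    scaler = DynamicLossScaler(growth_interval=3)
    for overflow in [True, False, False, False, True]:
        applied = scaler.update(overflow)
        print(f"overflow={overflow!s:5}  step applied={applied!s:5}  "
              f"scale={scaler.scale:.0f}")
```

In a real training loop, the `found_inf_or_nan` flag would come from inspecting the gradients after the backward pass.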

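3.2 Advantages of BF16 Training

BF16 has the same dynamic range as FP32, so loss scaling is generally unnecessary. This simplifies the implementation considerably and largely avoids the overflow/underflow problems seen with FP16. BF16 was first adopted on Google's TPUs and has been supported in hardware on NVIDIA GPUs since Ampere (A100).

However, BF16 has only 7 mantissa bits, fewer than FP16's 10, so its precision is lower. Some models can run into convergence problems from this reduced precision, so the master weights should still be kept in FP32.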

4. PyTorch AMP: Features from the Official Documentation and Code Examples

PyTorch officially supports Automatic Mixed Precision (AMP) through the torch.amp module. AMP has two core components: torch.amp.autocast and torch.amp.GradScaler.

Note: In recent PyTorch 2.x releases, torch.cuda.amp.autocast and torch.cuda.amp.GradScaler are deprecated. Use torch.amp.autocast("cuda", ...) and torch.amp.GradScaler("cuda", ...) instead.

4.1 torch.amp.autocast

autocast is used as a context manager or decorator and automatically runs the operations inside its region in an appropriate precision. It does not cast every operation to FP16; it picks a data type per operation type.

Run in FP16: Conv, Linear, MatMul, and similar (operations that can use Tensor Cores)
Kept in FP32: Softmax, LayerNorm, loss computation, and similar (operations where numerical stability matters)

4.2 torch.amp.GradScaler

GradScaler manages loss scaling automatically. It wraps the whole dynamic loss scaling procedure (applying the scale, checking for overflow, adjusting the scale factor, skipping weight updates) so that users do not have to manage it themselves.

4.3 Basic Usage Pattern

import torch
from torch.amp import autocast, GradScaler

# MyModel, loss_fn, dataloader, and num_epochs are assumed to be defined elsewhere.
model = MyModel().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = GradScaler("cuda")

for epoch in range(num_epochs):
    for batch in dataloader:
        inputs, targets = batch
        inputs = inputs.cuda()
        targets = targets.cuda()

        optimizer.zero_grad()

        # autocast region: run the forward pass in mixed precision
        with autocast("cuda"):
            outputs = model(inputs)
            loss = loss_fn(outputs, targets)

        # GradScaler: scale the loss, then run the backward pass
        scaler.scale(loss).backward()

        # GradScaler: unscale gradients → check for inf/NaN → optimizer step
        scaler.step(optimizer)

        # Update the scale factor
        scaler.update()

4.4 Using AMP with Gradient Clipping

To apply gradient clipping under mixed precision, call scaler.unscale_() explicitly and then clip:

scaler.scale(loss).backward()

# Unscale the gradients first
scaler.unscale_(optimizer)

# Clip the gradients now that they are back at their original magnitude
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Optimizer step (skipped internally if inf/NaN is found)
scaler.step(optimizer)
scaler.update()

4.5 Using AMP on Multiple GPUs (DistributedDataParallel)

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

model = DDP(model, device_ids=[local_rank])
scaler = GradScaler("cuda")

with autocast("cuda"):
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

When DDP and AMP are used together, GradScaler runs independently on each GPU. DDP's AllReduce operates on the scaled gradients, so unscaling must happen after AllReduce. PyTorch handles this ordering automatically.

4.6 Using BF16

With BF16, loss scaling is not needed, so autocast can be used on its own without GradScaler:

with autocast("cuda", dtype=torch.bfloat16):
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)

loss.backward()
optimizer.step()
optimizer.zero_grad()

This approach is much more concise, and because dynamic loss scaling never skips a step, training can be more stable.

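5. Gradient Checkpointing (Activation Checkpointing)

Gradient checkpointing trades compute time for memory. Instead of storing every intermediate activation during the forward pass, it stores only some of them and recomputes the rest during the backward pass.

5.1 How It Works

In ordinary training, the output (activation) of every layer in the forward pass is kept in memory because backpropagation needs it to compute gradients. Gradient checkpointing changes this:

Forward Pass: only the input of each checkpointed segment is stored; the intermediate activations inside it are not.
Backward Pass: the forward pass of the segment is run again from the stored input to recompute its activations, and then the gradients are computed.

This can reduce activation memory from O(n) to O(sqrt(n)), where n is the number of layers. The cost is one extra forward pass, which raises total training compute by roughly 33% if the backward pass is assumed to cost about twice as much as the forward pass.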
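
Both figures can be reproduced with a simplified counting model. The sketch below is illustrative only: it assumes uniform segments, counts one unit of memory per layer activation, and takes the backward pass to cost twice the forward pass. The function names are hypothetical.

```python
import math


def stored_activations(n_layers, segments):
    """Peak number of layer activations held at once with uniform segments.

    One input is kept per segment, plus the activations of the single
    segment being recomputed during the backward pass.
    """
    return segments + math.ceil(n_layers / segments)


def best_segments(n_layers):
    """Segment count that minimizes the peak under this simplified model."""
    return min(range(1, n_layers + 1),
               key=lambda k: stored_activations(n_layers, k))


def relative_compute(forward_cost=1.0, backward_cost=2.0):
    """Total cost with one extra forward pass, relative to no checkpointing."""
    baseline = forward_cost + backward_cost
    return (2 * forward_cost + backward_cost) / baseline


if __name__ == "__main__":
    for n in (24, 100):
        k = best_segments(n)
        print(f"{n} layers: best segments={k}, "
              f"peak activations={stored_activations(n, k)} (vs {n} without)")
    print(f"relative compute: {relative_compute():.3f}")
```

For 100 layers the minimum falls at 10 segments, matching the sqrt(n) rule, and the relative compute comes out at 4/3.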

5.2 PyTorch Implementation: torch.utils.checkpoint

PyTorch provides two APIs in the torch.utils.checkpoint module:

The checkpoint function

Applies checkpointing to an individual function or layer:

import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TransformerBlock(nn.Module):
    def __init__(self, ...):
        super().__init__()
        self.attention = MultiHeadAttention(...)
        self.ffn = FeedForward(...)
        self.norm1 = nn.LayerNorm(...)
        self.norm2 = nn.LayerNorm(...)

    def forward(self, x):
        # Do not store this block's intermediate activations;
        # they are recomputed during the backward pass.
        return checkpoint(self._forward, x, use_reentrant=False)

    def _forward(self, x):
        x = x + self.attention(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x

The checkpoint_sequential function

Groups the layers of a sequential model into segments and applies checkpointing to each segment:

from torch.utils.checkpoint import checkpoint_sequential

class DeepModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            *[TransformerBlock(...) for _ in range(24)]
        )

    def forward(self, x):
        # Split the 24 layers into 4 segments (6 layers each) for checkpointing
        segments = 4
        return checkpoint_sequential(self.layers, segments, x,
                                     use_reentrant=False)

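5.3 Reentrant vs Non-reentrant Checkpointing

PyTorch provides two checkpointing modes:

Reentrant (use_reentrant=True): the original implementation. It reruns the whole function during the backward pass and has some restrictions (it may not be compatible with certain autograd features).
Non-reentrant (use_reentrant=False): the improved implementation. It recomputes only the intermediate activations that are needed. It is the mode recommended by the PyTorch documentation and is planned to become the default.

5.4 Practical Strategies

Checkpointing every layer carries a large compute overhead. Effective strategies include:

Apply only to attention layers: self-attention has the largest activation footprint, O(n^2) in sequence length.
Apply to every N-th layer: for example, checkpointing every second layer balances memory savings against compute overhead.
Hugging Face Transformers: a single call to model.gradient_checkpointing_enable() turns it on.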

6. Gradient Accumulation

Gradient accumulation works around the limit on physical batch size. When GPU memory is too small for a large batch, several small micro-batches are processed in sequence, their gradients are accumulated, and a single weight update is applied.

6.1 How It Works

Effective Batch Size = Micro-batch Size x Accumulation Steps

For example, with a micro-batch size of 8 and 4 accumulation steps, the effective batch size is 32.

The key point is that calling loss.backward() in PyTorch accumulates (sums) gradients into the .grad attribute. Gradients are not reset until optimizer.zero_grad() is called, so the results of several backward passes add up naturally.

6.2 Implementation

# model, optimizer, scaler, loss_fn, and dataloader are assumed to be set up as in section 4.3.
accumulation_steps = 4
optimizer.zero_grad()

for i, (inputs, targets) in enumerate(dataloader):
    inputs, targets = inputs.cuda(), targets.cuda()

    with autocast("cuda"):
        outputs = model(inputs)
        # Divide the loss by the accumulation steps so the summed gradients form an average
        loss = loss_fn(outputs, targets) / accumulation_steps

    scaler.scale(loss).backward()

    # Update the weights once every accumulation_steps micro-batches
    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()

Important: do not forget to divide the loss by accumulation_steps. Because gradients are summed, skipping the division has the same effect as multiplying the effective learning rate by accumulation_steps.
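
The role of the division can be verified on a toy problem. The sketch below uses a one-parameter linear model with a mean squared error loss in plain Python; the function names and the data are made up for illustration. It compares the full-batch gradient with the gradient accumulated over equal-sized micro-batches, with and without the division.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size, divide_loss=True):
    """Sum micro-batch gradients the way repeated backward() calls would."""
    accumulation_steps = len(xs) // micro_batch_size
    grad = 0.0
    for step in range(accumulation_steps):
        lo = step * micro_batch_size
        hi = lo + micro_batch_size
        micro_grad = mse_grad(w, xs[lo:hi], ys[lo:hi])
        if divide_loss:
            # Same effect as dividing the loss by accumulation_steps.
            micro_grad /= accumulation_steps
        grad += micro_grad
    return grad


if __name__ == "__main__":
    xs = [float(i) for i in range(1, 33)]   # 32 samples
    ys = [3.0 * x + 1.0 for x in xs]
    w = 0.5
    print("full batch:          ", mse_grad(w, xs, ys))
    print("accumulated, divided:", accumulated_grad(w, xs, ys, 8))
    print("accumulated, no div.:", accumulated_grad(w, xs, ys, 8, divide_loss=False))
```

With the division, four micro-batches of 8 give the same gradient as one batch of 32; without it, the gradient is four times larger.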

6.3 Caveats

Batch Normalization: BatchNorm computes its statistics per micro-batch, so they can differ from the statistics of the full effective batch. In that case GroupNorm or LayerNorm is preferable.
Learning Rate Scheduling: with a step-based scheduler, adjust how often the scheduler is called to account for the accumulation steps.
Combining with DDP: when using gradient accumulation under DDP, AllReduce can be disabled during accumulation to cut communication overhead. Use the model.no_sync() context manager.

from contextlib import nullcontext

for i, (inputs, targets) in enumerate(dataloader):
    # Skip AllReduce unless this is the last accumulation step
    is_sync_step = (i + 1) % accumulation_steps == 0
    context = nullcontext() if is_sync_step else model.no_sync()
    with context:
        with autocast("cuda"):
            outputs = model(inputs)
            loss = loss_fn(outputs, targets) / accumulation_steps
        scaler.scale(loss).backward()

    if is_sync_step:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
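
For the scheduler caveat above, the quantity that matters is the number of optimizer steps, not the number of micro-batches. The helper below is a small sketch with hypothetical names. It assumes loops like the ones in this section, where a step happens only when (i + 1) % accumulation_steps == 0, so trailing micro-batches at the end of an epoch do not trigger a step of their own.

```python
import math


def optimizer_steps_per_epoch(num_micro_batches, accumulation_steps,
                              drop_remainder=True):
    """Number of weight updates in one epoch.

    drop_remainder=True matches a loop that only steps on full groups of
    accumulation_steps micro-batches; False assumes an extra final step
    for the leftover micro-batches.
    """
    if drop_remainder:
        return num_micro_batches // accumulation_steps
    return math.ceil(num_micro_batches / accumulation_steps)


def total_scheduler_steps(num_micro_batches, accumulation_steps, epochs,
                          drop_remainder=True):
    """Total step count to configure a step-based LR scheduler with."""
    per_epoch = optimizer_steps_per_epoch(
        num_micro_batches, accumulation_steps, drop_remainder)
    return per_epoch * epochs


if __name__ == "__main__":
    print(total_scheduler_steps(num_micro_batches=1001,
                                accumulation_steps=4, epochs=3))
```

The scheduler should then be stepped once per optimizer step, in the same branch that calls the optimizer.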

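7. Memory Profiling with torch.cuda.memory_summary()

The first step in memory optimization is an accurate picture of current memory usage. PyTorch provides several CUDA memory profiling tools.

7.1 torch.cuda.memory_summary()

The most basic tool, and a useful one. It reports GPU memory allocation in detail:

import torch
from torch.amp import autocast

# MyLargeModel and loss_fn are assumed to be defined elsewhere.
model = MyLargeModel().cuda()
inputs = torch.randn(32, 3, 224, 224).cuda()
# Placeholder labels (assumes a 1000-class classifier); replace with real targets.
targets = torch.randint(0, 1000, (32,)).cuda()

# Check the memory state after the forward pass
with autocast("cuda"):
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)

print(torch.cuda.memory_summary(device=0, abbreviated=False))

Example output (excerpt):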

|===========================================================================|
|                  PyTorch CUDA memory summary                              |
|===========================================================================|
|            CUDA OOMs: 0                                                   |
|        cudaMallocs:   234                                                 |
|---------------------------------------------------------------------------+
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------+
| Allocated memory      |  4096 MiB  |  6234 MiB  | 12345 MiB  |  8249 MiB  |
| Active memory         |  3800 MiB  |  5900 MiB  | 11000 MiB  |  7200 MiB  |
| Requested memory      |  3750 MiB  |  5850 MiB  | 10800 MiB  |  7050 MiB  |
| GPU reserved memory   |  6400 MiB  |  6400 MiB  |  6400 MiB  |     0 MiB  |
|---------------------------------------------------------------------------+

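How to read the main metrics:

Allocated memory: memory currently occupied by live tensors
Active memory: memory in blocks that are still in use, which can include blocks whose tensors were freed but that cannot be reused yet
Requested memory: memory the program asked for, before the allocator rounds sizes up
GPU reserved memory: total memory PyTorch has reserved from CUDA (caching allocator)
Peak Usage: the maximum reached during the run, the key figure when judging how close training is to OOM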
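
The gap between reserved and allocated memory is what the caching allocator holds without any live tensor using it. The helper below is a hypothetical sketch that works on plain byte counts; in a real run the two inputs would come from torch.cuda.memory_allocated() and torch.cuda.memory_reserved(). The example values are the "Cur Usage" figures from the sample output above.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def allocator_report(allocated_bytes, reserved_bytes):
    """Summarize how much reserved memory is cached but not used by tensors."""
    cached_unused = reserved_bytes - allocated_bytes
    return {
        "allocated_gib": allocated_bytes / GIB,
        "reserved_gib": reserved_bytes / GIB,
        "cached_unused_gib": cached_unused / GIB,
        # Share of the reserved pool that no live tensor occupies.
        "cached_fraction": cached_unused / reserved_bytes if reserved_bytes else 0.0,
    }


if __name__ == "__main__":
    report = allocator_report(4096 * MIB, 6400 * MIB)
    for key, value in report.items():
        print(f"{key:>18}: {value:.2f}")
```

A large cached fraction alongside an OOM error usually points to fragmentation rather than to tensors that are genuinely too large.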

7.2 Individual Memory Query Functions

These are useful for tracking memory usage at specific points in the code:

# Memory currently allocated (bytes)
allocated = torch.cuda.memory_allocated(device=0)

# Memory currently reserved (bytes)
reserved = torch.cuda.memory_reserved(device=0)

# Peak allocated memory since the start of the program (or the last reset)
max_allocated = torch.cuda.max_memory_allocated(device=0)

print(f"Allocated: {allocated / 1024**3:.2f} GB")
print(f"Reserved:  {reserved / 1024**3:.2f} GB")
print(f"Peak:      {max_allocated / 1024**3:.2f} GB")

# Reset peak statistics (for measuring individual phases)
torch.cuda.reset_peak_memory_stats(device=0)

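7.3 Detailed Analysis with Memory Snapshot

Memory Snapshot, available in PyTorch 2.x, records memory allocation and free events in time order so they can be visualized. Note that the functions below are underscore-prefixed, so their interface may change between releases:

# Start recording memory history
torch.cuda.memory._record_memory_history(max_entries=100000)

# Run the training code (train_one_epoch is assumed to be defined elsewhere)
train_one_epoch(model, dataloader, optimizer)

# Save the snapshot
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")

# Stop recording
torch.cuda.memory._record_memory_history(enabled=None)

The saved snapshot can be uploaded to the official PyTorch Memory Visualizer (https://pytorch.org/memory_viz) for visual analysis.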

7.4 OOM Debugging Strategy

When an out-of-memory error occurs, track down the cause in this order:

Get the overall picture with torch.cuda.memory_summary()
Check peak memory with max_memory_allocated()
Use Memory Snapshot to find which operation causes memory to spike
Work out which factor dominates: batch size, sequence length, or hidden dimension
Apply the appropriate optimization (mixed precision, gradient checkpointing, gradient accumulation)

8. NVIDIA Transformer Engine and FP8 Training

NVIDIA Transformer Engine (TE) is a library that accelerates Transformer training and inference with FP8 precision on Hopper (H100) and later GPUs. FP8 training can reach up to 2x the throughput of FP16/BF16.

8.1 The Two FP8 Formats

The H100 GPU supports two FP8 formats in hardware:

E4M3 (4-bit exponent, 3-bit mantissa): range ~±448. Relatively higher precision. Used for activations and weights in the forward pass.
E5M2 (5-bit exponent, 2-bit mantissa): range ~±57,344. Wider dynamic range. Used for gradients in the backward pass.

Splitting the formats this way matches each phase to its numerical characteristics and keeps accuracy loss small.

8.2 Per-tensor Scaling

To compensate for FP8's narrow dynamic range, Transformer Engine applies per-tensor scaling. It keeps a separate scale factor for each tensor so that the tensor's value distribution fits within the representable range of FP8.

Transformer Engine offers several scaling strategies:

Delayed Scaling: derives the scale factor from statistics of previous iterations (default)
Current Scaling (Just-in-time): computes the scale factor immediately from the values of the current tensor
Block Scaling: splits the tensor into blocks and applies a separate scale factor to each block
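
The difference between delayed and current scaling comes down to where the absolute maximum (amax) is taken from. The sketch below is a simplified plain-Python illustration, not the Transformer Engine implementation: the function names are hypothetical, the scale is taken as fp8_max / amax / 2^margin, and FP8 rounding itself is not modeled, only the clamp to the representable range.

```python
E4M3_MAX = 448.0  # largest magnitude representable in FP8 E4M3


def scale_from_amax(amax, fp8_max=E4M3_MAX, margin=0):
    """Scale factor that maps amax onto the top of the FP8 range."""
    if amax == 0.0:
        return 1.0  # nothing to scale; avoid division by zero
    return fp8_max / amax / (2 ** margin)


def delayed_scale(amax_history, fp8_max=E4M3_MAX, margin=0):
    """Delayed scaling: use the max amax seen in previous iterations."""
    return scale_from_amax(max(amax_history), fp8_max, margin)


def current_scale(tensor, fp8_max=E4M3_MAX, margin=0):
    """Current scaling: use the amax of the tensor being quantized."""
    return scale_from_amax(max(abs(v) for v in tensor), fp8_max, margin)


def scale_and_clamp(tensor, scale, fp8_max=E4M3_MAX):
    """Apply the scale and saturate values that fall outside the FP8 range."""
    return [max(-fp8_max, min(fp8_max, v * scale)) for v in tensor]


if __name__ == "__main__":
    history = [0.5, 2.0, 1.0]     # amax values from earlier iterations
    tensor = [0.1, -3.0, 1.5]     # current values exceed the history amax
    s_delayed = delayed_scale(history)
    s_current = current_scale(tensor)
    print("delayed scale:", s_delayed, "->", scale_and_clamp(tensor, s_delayed))
    print("current scale:", s_current, "->", scale_and_clamp(tensor, s_current))
```

In this example the stale history underestimates the current amax, so delayed scaling saturates the largest value, while current scaling fits the tensor exactly at the cost of an extra pass over the data.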

8.3 Transformer Engine Usage Example

import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Configure the FP8 training recipe
fp8_recipe = DelayedScaling(
    margin=0,
    fp8_format=Format.HYBRID,  # Forward: E4M3, Backward: E5M2
    amax_history_len=1024,
    amax_compute_algo="max",
)

# Use a Transformer Engine layer
model = te.TransformerLayer(
    hidden_size=4096,
    ffn_hidden_size=16384,
    num_attention_heads=32,
    layer_number=1,
)

# FP8 training step (inputs, targets, loss_fn, and optimizer are assumed to be defined elsewhere)
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)

loss.backward()
optimizer.step()

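8.4 Evolution in the Blackwell Generation

On NVIDIA Blackwell (B200) GPUs, the following formats are supported in addition to FP8:

MXFP8: Microscaling FP8. Supports finer-grained block-level scaling
NVFP4: 4-bit floating point. Mainly used for inference, where it maximizes memory efficiency

Since Transformer Engine 2.x, these new formats are supported in a unified way through the recipe module.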

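9. Case Study: Training Large Models on Limited GPUs

This section combines the techniques covered so far into a practical strategy for training large models on limited GPU hardware.

9.1 Scenario: Fine-tuning a 7B Model on a 24GB GPU

Assume a 7B parameter model is to be fine-tuned on a single A10G (24GB VRAM).

Memory requirements (FP32):

Component	Memory
Model Parameters (FP32)	28 GB
Gradients (FP32)	28 GB
Adam Optimizer States	56 GB
Activations (batch=1)	~2 GB
Total	~114 GB

This is out of the question in FP32. The optimizations are applied step by step below.

9.2 Applying the Optimizations Step by Step

Step 1: Mixed Precision Training (BF16)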

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                              torch_dtype=torch.bfloat16)

Component	Memory
Model Parameters (BF16)	14 GB
Gradients (BF16)	14 GB
Adam Master Weights (FP32)	28 GB
Adam Momentum (FP32)	28 GB
Adam Variance (FP32)	28 GB
Total	~112 GB

The total barely changes because the FP32 optimizer states still dominate, so this is still far too much.

Step 2: Apply LoRA (reduce the trainable parameters)

LoRA (Low-Rank Adaptation) makes only about 0.1-1% of the parameters trainable. Applying rank-16 LoRA to a 7B model leaves roughly 20M trainable parameters.

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)

Component	Memory
Full model (BF16, frozen)	14 GB
LoRA Parameters (BF16)	~0.04 GB
LoRA Gradients (BF16)	~0.04 GB
Adam States (FP32, LoRA only)	~0.24 GB
Total	~14.3 GB

This fits comfortably on a 24GB GPU before activations are counted, which leaves room to increase the batch size.
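
The figure of roughly 20M trainable parameters can be checked by hand. The sketch below assumes a model with 32 layers and a hidden size of 4096, with all four target projections being 4096 x 4096; these architecture values are assumptions for illustration and are not stated in the text above, and the function names are hypothetical. Each adapted weight of shape d_out x d_in gets two low-rank matrices, adding r x (d_in + d_out) parameters.

```python
def lora_trainable_params(rank, d_in, d_out, modules_per_layer, num_layers):
    """Parameters added by LoRA: an (r x d_in) and a (d_out x r) matrix per module."""
    return rank * (d_in + d_out) * modules_per_layer * num_layers


def lora_memory_gb(trainable_params, frozen_params=7e9):
    """Memory in GB (1 GB = 1e9 bytes), following the table above."""
    frozen = frozen_params * 2 / 1e9          # frozen base model in BF16
    lora_weights = trainable_params * 2 / 1e9  # LoRA parameters in BF16
    lora_grads = trainable_params * 2 / 1e9    # LoRA gradients in BF16
    adam_fp32 = trainable_params * 12 / 1e9    # master weights + two moments
    return {
        "frozen_model": frozen,
        "lora_params": lora_weights,
        "lora_grads": lora_grads,
        "adam_states": adam_fp32,
        "total": frozen + lora_weights + lora_grads + adam_fp32,
    }


if __name__ == "__main__":
    # Assumed architecture: hidden size 4096, 32 layers, q/k/v/o projections.
    n = lora_trainable_params(rank=16, d_in=4096, d_out=4096,
                              modules_per_layer=4, num_layers=32)
    print(f"trainable parameters: {n:,}")
    for name, gb in lora_memory_gb(n).items():
        print(f"{name:>13}: {gb:7.3f} GB")
```

Under these assumptions the count is about 16.8M, in line with the rounded 20M used in the table, and the total stays just under 14.3 GB.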

Step 3: Add Gradient Checkpointing

With about 10GB of memory left, save as much activation memory as possible so the batch size can grow:

model.gradient_checkpointing_enable()

Step 4: Reach the Target Effective Batch Size with Gradient Accumulation

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # Effective Batch Size = 32
    bf16=True,
    gradient_checkpointing=True,
    learning_rate=2e-4,
    num_train_epochs=3,
    logging_steps=10,
)

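9.3 Guide to Combining Optimization Techniques

GPU VRAM	Model Size	Recommended Combination
8 GB	~1B	BF16 + LoRA(r=8) + GC + GA
16 GB	~3B	BF16 + LoRA(r=16) + GC + GA
24 GB	~7B	BF16 + LoRA(r=16) + GC + GA
40 GB	~13B	BF16 + LoRA(r=32) + GC + GA
80 GB	~7B Full FT	BF16 + GC + GA
8x80 GB	~70B	BF16 + FSDP/DeepSpeed ZeRO-3 + GC

GC = Gradient Checkpointing, GA = Gradient Accumulation, Full FT = Full Fine-tuning

Note that by the 16Φ accounting in section 1.5, full fine-tuning of a 7B model with FP32 Adam states needs about 112 GB before activations, so the 80 GB row only works if the optimizer state is reduced as well (for example with lower-precision or offloaded optimizer states).

9.4 Integrated Memory Profiling Script

The following script can be used to check memory usage before training starts: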

import torch
from torch.amp import autocast

def profile_memory(model, dummy_input, dtype=torch.bfloat16):
    """Profile GPU memory usage for one forward and backward pass."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.empty_cache()

    model = model.cuda()
    dummy_input = dummy_input.cuda()

    # Forward Pass
    with autocast("cuda", dtype=dtype):
        outputs = model(dummy_input)
        if isinstance(outputs, dict):
            loss = outputs["loss"]
        else:
            loss = outputs.sum()

    print("=== After Forward Pass ===")
    print(f"Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    print(f"Reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GB")

    # Backward Pass
    loss.backward()

    print("\n=== After Backward Pass ===")
    print(f"Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    print(f"Reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
    print(f"Peak:      {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")

    print("\n=== Full Memory Summary ===")
    print(torch.cuda.memory_summary(abbreviated=True))

# Usage example
# profile_memory(model, dummy_input)

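10. Wrap-up

GPU memory optimization is not a single technique but a combination of several. The key points are:

Understand the memory composition first: you need to know how much memory parameters, gradients, optimizer states, and activations each take before deciding what to optimize.
Mixed precision is the baseline: BF16 trains stably without loss scaling, roughly halves the memory for working weights and activations, and speeds up computation. Use BF16 by default on Ampere and later GPUs.
Gradient checkpointing targets activation memory: it cuts activation memory substantially for roughly 33% extra compute. For large-model training it is close to mandatory.
Gradient accumulation solves the batch size problem: it increases the effective batch size without adding memory pressure.
FP8 is emerging as the next standard: on Hopper/Blackwell GPUs, Transformer Engine enables FP8 training, which offers further performance gains over FP16/BF16.
Make profiling a habit: checking memory usage regularly with torch.cuda.memory_summary() and Memory Snapshot helps prevent OOM problems before they occur.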