Distributed Training & GPU Infrastructure 2026 Deep Dive: DeepSpeed, FSDP2, Megatron-LM, Ray Train, JAX, TorchTitan, Blackwell GB200, MI325X, TPU v5p in One Place

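As of May 2026, LLM training infrastructure has finally taken a shape that deserves to be called a "stack". NVIDIA Blackwell GB200 NVL72 racks (a 72-GPU NVLink domain) are shipping in volume, and the AMD Instinct MI325X, with 256GB of HBM3E, brings single-node full-parameter SFT of much larger models within reach. On the software side, PyTorch FSDP2 has reached stable status and composes reliably with torch.compile for the first time, and NVIDIA Megatron-Core has become a first-class building block for external training frameworks such as NeMo. This article covers distributed training frameworks, parallelization strategies, hardware selection, and failure modes in one pass.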

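Why distributed training needs a fresh look in 2026

As recently as 2024, the H100 80GB was the de facto standard and per-token training cost climbed steeply with model size. As of May 2026, the B200 (192GB HBM3E) and GB200 NVL72 have cut the wall-clock time for training the same model by roughly 30-50%, and the MI325X and Gaudi 3 have started to break the single-vendor dependence on NVIDIA on price/performance. At the same time, FSDP2 + torch.compile, DeepSpeed ZeRO stage 3 with ZeRO++, and Megatron-Core's context parallelism (CP) have all become production-ready. The question is no longer simply which framework is fastest, but which combination is reasonable once fault tolerance is included.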

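Data parallel, tensor parallel, pipeline parallel, sequence parallel

The four main axes of distributed training are data parallelism (DP/DDP), tensor parallelism (TP), pipeline parallelism (PP), and sequence parallelism (SP). DDP keeps an identical model replica on every GPU and all-reduces the gradients; it is the simplest scheme and the most efficient one when the model fits in a single GPU's memory. TP splits matrix multiplications along a dimension so that the attention QKV projections and the MLP are spread over several GPUs, PP cuts the model's layers into stages assigned to groups of nodes, and SP splits activations along the sequence dimension to reduce memory in long-context (128K and beyond) training. In practice, training a 70B+ LLM typically uses 3D parallelism of the form TP=8, PP=8, DP=16, with SP layered on top.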
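To make the tensor-parallel idea concrete, here is a minimal pure-Python sketch (not taken from any framework) of a column-split linear layer. The weight matrix is cut along its output dimension into two shards, each simulated "GPU" multiplies the same input by its own shard, and concatenating the partial outputs reproduces the unsplit result. The helper names matmul, split_columns, and concat_columns are illustrative assumptions.

```python
def matmul(x, w):
    """Multiply an (m x k) matrix by a (k x n) matrix, both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, parts):
    """Cut w into `parts` shards along its column (output) dimension."""
    n = len(w[0])
    assert n % parts == 0, "output dimension must divide evenly across shards"
    width = n // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]

def concat_columns(blocks):
    """Concatenate per-shard outputs back along the column dimension."""
    return [sum((blk[r] for blk in blocks), []) for r in range(len(blocks[0]))]

x = [[1.0, 2.0], [3.0, 4.0]]                      # activations: 2 tokens, hidden size 2
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]  # weight: hidden 2 -> output 4

shards = split_columns(w, parts=2)                # one shard per simulated GPU
partial = [matmul(x, shard) for shard in shards]  # each shard computes independently
y_tp = concat_columns(partial)                    # stands in for an all-gather
y_ref = matmul(x, w)                              # unsplit reference result
```

In a real TP implementation the concatenation is a collective across GPUs, and a following row-split layer would use an all-reduce instead; this sketch only shows why the split is mathematically safe.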

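Expert parallelism (EP) and why MoE training is tricky

MoE models such as Mixtral 8x22B and DeepSeek-V3 have a router that sends each token to K experts, and when those experts are spread across GPUs this triggers All-to-All communication. The All-to-All is implemented by pairing NCCL's ncclSend/ncclRecv calls, and when token imbalance occurs (some experts are flooded with tokens while others sit idle) GPU utilization can drop to around 30%. A common recipe in 2026 is a Switch Transformer-style capacity factor of 1.25 plus an auxiliary load-balancing loss with a coefficient of about 0.01; both Megatron-Core and DeepSpeed MoE support expert parallelism.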
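The capacity factor and the auxiliary loss mentioned above can be sketched in a few lines of plain Python. This is an illustrative version of the Switch Transformer-style formulation (auxiliary loss = alpha × N × Σ f_i × P_i, where f_i is the fraction of tokens routed to expert i and P_i is the mean router probability for expert i), assuming top-1 routing in the loss. The function names and the toy inputs are assumptions, not code from Megatron-Core or DeepSpeed.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor=1.25, top_k=1):
    """Max tokens one expert accepts per batch; overflow is dropped or rerouted."""
    return math.ceil(num_tokens * top_k / num_experts * capacity_factor)

def load_balancing_loss(assignments, router_probs, num_experts, alpha=0.01):
    """Switch-style auxiliary loss: alpha * N * sum_i(f_i * P_i).

    assignments:  chosen expert index per token (top-1 routing)
    router_probs: per-token list of router probabilities over all experts
    """
    num_tokens = len(assignments)
    total = 0.0
    for i in range(num_experts):
        f_i = sum(1 for a in assignments if a == i) / num_tokens  # fraction routed to expert i
        p_i = sum(p[i] for p in router_probs) / num_tokens        # mean router probability
        total += f_i * p_i
    return alpha * num_experts * total

# Perfectly balanced routing vs. everything collapsing onto expert 0.
balanced = load_balancing_loss([0, 1, 2, 3], [[0.25] * 4] * 4, num_experts=4)
skewed = load_balancing_loss([0, 0, 0, 0], [[0.7, 0.1, 0.1, 0.1]] * 4, num_experts=4)
capacity = expert_capacity(num_tokens=4096, num_experts=8)
```

The loss is smallest when tokens and router probability are spread uniformly, which is what pushes the router away from the collapsed case.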

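The equivalence of ZeRO 1/2/3 and FSDP

The DeepSpeed ZeRO stages shard, cumulatively, the optimizer state (stage 1), then the gradients (stage 2), then the parameters (stage 3). PyTorch FSDP1 and FSDP2 are essentially PyTorch-native implementations of ZeRO stage 3, and with the same model, cluster, and optimizer settings the convergence curves are nearly identical. What sets FSDP2 apart is per-parameter sharding (instead of FSDP1's flat-parameter design), stable composition with torch.compile, and a cleaner DTensor-based representation of how the model is distributed. Many new training jobs in 2026 have moved from DeepSpeed to FSDP2.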
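The effect of each stage on per-GPU memory can be estimated with the standard ZeRO accounting for mixed-precision Adam: 2 bytes per parameter for 16-bit weights, 2 for 16-bit gradients, and 12 for fp32 optimizer state. The sketch below is a back-of-the-envelope calculator under those assumptions; it ignores activations, buffers, and fragmentation, and the function name is hypothetical.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Approximate model-state bytes per GPU for mixed-precision Adam.

    Assumes 2 bytes/param for 16-bit weights, 2 for 16-bit gradients, and
    12 for fp32 optimizer state (master weights, momentum, variance).
    Activations and temporary buffers are not included.
    """
    params, grads, optim = 2 * num_params, 2 * num_params, 12 * num_params
    if stage >= 1:
        optim /= num_gpus   # stage 1 shards optimizer state
    if stage >= 2:
        grads /= num_gpus   # stage 2 also shards gradients
    if stage >= 3:
        params /= num_gpus  # stage 3 (and FSDP full sharding) also shards parameters
    return params + grads + optim

for stage in range(4):
    gb = zero_bytes_per_gpu(70e9, num_gpus=64, stage=stage) / 1e9
    print(f"stage {stage}: {gb:,.1f} GB of model state per GPU")
```

Stage 0 here means plain data parallelism with full replication, which is why a 70B model cannot be trained with DDP alone on any current GPU.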

FSDP2 + torch.compile in practice

import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group(backend="nccl")
mesh = init_device_mesh("cuda", (8,))  # 8-way data parallel

model = build_llama3_70b()  # nn.Module
mp_policy = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)

for layer in model.layers:
    fully_shard(layer, mesh=mesh, mp_policy=mp_policy)
fully_shard(model, mesh=mesh, mp_policy=mp_policy)

model = torch.compile(model, mode="reduce-overhead", fullgraph=False)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)
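# dataloader is a placeholder assumed to yield batches with input_ids and labels.
for batch in dataloader:
    # Assumes an HF-style model that returns a loss only when labels are passed.
    out = model(batch.input_ids, labels=batch.labels)
    loss = out.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)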

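The key point in the code above is that calling fully_shard per layer enables per-block all-gather and reduce-scatter, so full parameters are gathered one block at a time rather than for the whole model at once. Instead of FSDP1's monolithic FlatParameter, FSDP2 converts each nn.Parameter into a DTensor.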

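DeepSpeed ZeRO-3 + ZeRO-Infinity configuration

DeepSpeed still has an edge in ZeRO++ communication optimizations and NVMe offload (ZeRO-Infinity). The following is a typical configuration that was used to train a 405B-class model on 32 H100 GPUs (four 8-GPU nodes rather than a single node).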

zero_optimization:
  stage: 3
  offload_optimizer:
    device: cpu
    pin_memory: true
  offload_param:
    device: nvme
    nvme_path: /mnt/local_nvme
    buffer_count: 5
    buffer_size: 1e9
  stage3_max_live_parameters: 1e9
  stage3_max_reuse_distance: 1e9
  stage3_prefetch_bucket_size: 5e8
  stage3_param_persistence_threshold: 1e6
  contiguous_gradients: true
  reduce_bucket_size: 5e8
  allgather_bucket_size: 5e8
bf16:
  enabled: true
gradient_clipping: 1.0
train_micro_batch_size_per_gpu: 1
gradient_accumulation_steps: 16

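NVMe offload only pays off with fast local NVMe storage that sustains roughly 7GB/s or more, which in practice means top-end PCIe Gen4 or Gen5 drives, often striped. With slower NVMe or SATA SSDs, throughput is too low and the GPUs starve.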
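DeepSpeed is usually handed its configuration as a JSON file or a Python dict, so the listing above (written YAML-style for readability) needs to be converted before use. The sketch below expresses the same settings as a dict and writes them out as JSON. The variable name ds_config, the helper write_config, and the file name ds_config.json are illustrative assumptions; the keys and values are copied from the listing.

```python
import json

# Same settings as the listing above, expressed as a Python dict.
# Sizes written as 1e9 / 5e8 in the listing are converted to plain integers.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/mnt/local_nvme",
            "buffer_count": 5,
            "buffer_size": int(1e9),
        },
        "stage3_max_live_parameters": int(1e9),
        "stage3_max_reuse_distance": int(1e9),
        "stage3_prefetch_bucket_size": int(5e8),
        "stage3_param_persistence_threshold": int(1e6),
        "contiguous_gradients": True,
        "reduce_bucket_size": int(5e8),
        "allgather_bucket_size": int(5e8),
    },
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
}

def write_config(path="ds_config.json"):
    """Serialize the config to a JSON file (hypothetical file name)."""
    with open(path, "w") as f:
        json.dump(ds_config, f, indent=2)
    return path
```

Keeping the config as a dict also makes it easy to override a single field, such as nvme_path, per cluster before serializing.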

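NVIDIA Megatron-LM and Megatron-Core

Megatron-LM is the reference LLM training codebase that NVIDIA has maintained since 2019. Since 2024 its core distributed primitives have been factored out into the Megatron-Core library, which downstream frameworks such as NeMo build on. It treats TP, PP, SP, CP, and EP all as first-class citizens and was among the first open-source stacks in which fp8 training became stable. The following is an example launcher for training a 70B model.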

torchrun --nproc_per_node=8 --nnodes=64 \
  --rdzv_id=meg70b --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 \
  pretrain_gpt.py \
  --num-layers 80 --hidden-size 8192 --num-attention-heads 64 \
  --seq-length 8192 --max-position-embeddings 8192 \
  --micro-batch-size 1 --global-batch-size 1024 \
  --tensor-model-parallel-size 8 \
  --pipeline-model-parallel-size 8 \
  --sequence-parallel \
  --context-parallel-size 2 \
  --fp8-format hybrid --fp8-margin 0 \
  --use-distributed-optimizer \
  --recompute-activations --recompute-granularity selective \
  --tokenizer-type Llama3Tokenizer \
  --data-path /data/tokens \
  --save /checkpoints/meg70b --save-interval 500 \
  --train-iters 100000
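Before submitting a launcher like the one above, it helps to check that the parallel sizes actually divide the cluster and the batch. The sketch below derives the data-parallel size as world size / (TP × PP × CP) and the number of micro-batches accumulated per step as global batch / (micro-batch × DP). The function name parallel_layout is hypothetical; the numbers are taken from the command above.

```python
def parallel_layout(world_size, tp, pp, cp=1, micro_batch=1, global_batch=1):
    """Derive data-parallel size and gradient-accumulation steps from a launch config."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel:
        raise ValueError(
            f"world size {world_size} is not divisible by TP*PP*CP={model_parallel}"
        )
    dp = world_size // model_parallel
    per_step = micro_batch * dp
    if global_batch % per_step:
        raise ValueError(
            f"global batch {global_batch} is not divisible by micro_batch*DP={per_step}"
        )
    return {"dp": dp, "grad_accum_steps": global_batch // per_step}

# 64 nodes x 8 GPUs, TP=8, PP=8, CP=2, micro-batch 1, global batch 1024.
layout = parallel_layout(world_size=8 * 64, tp=8, pp=8, cp=2,
                         micro_batch=1, global_batch=1024)
print(layout)  # {'dp': 4, 'grad_accum_steps': 256}
```

With these settings each model replica spans 128 GPUs, leaving only 4 data-parallel replicas, so each pipeline processes 256 micro-batches per optimizer step.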

TorchTitan, Composer, Lightning Fabric

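TorchTitan, released by Meta, makes the case that "reference training code in the style of Megatron is also possible in native PyTorch". It implements FSDP2 + Tensor Parallel + Pipeline Parallel + Selective Activation Checkpointing purely with the PyTorch DTensor API, and ships Llama 70B/405B training recipes. MosaicML Composer, which came to Databricks through its acquisition of MosaicML, is a trainer that abstracts everything from the BPE tokenizer to the LR scheduler and was used as-is to train MPT and DBRX. Lightning Fabric suits cases where you want to drop the Trainer abstraction and introduce distributed training piecemeal into arbitrary PyTorch code.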

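Where Ray Train and Ray Tune fit

Anyscale's Ray Train orchestrates PyTorch, JAX, and Hugging Face Accelerate workers on top of Ray actors. It does not implement distributed algorithms of its own; it concentrates on launching and tearing down trainers reliably. Once a cluster grows to the 1000-GPU range, Slurm or Kubernetes alone tends to be thin on elastic restart and fault recovery, and Ray fills that gap. The following example launches FSDP2 training with Ray Train.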

import ray
from ray.train.torch import TorchTrainer, TorchConfig
from ray.train import ScalingConfig, RunConfig, FailureConfig

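def train_func(config):
    from torch.distributed.fsdp import fully_shard
    # TorchTrainer sets up the NCCL process group (see TorchConfig below),
    # so train_func must not call dist.init_process_group a second time.
    model = build_model()  # placeholder, defined elsewhere
    fully_shard(model)
    train_loop(model, config)  # placeholder training loop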

trainer = TorchTrainer(
    train_func,
    train_loop_config={"lr": 3e-4, "batch": 1},
    scaling_config=ScalingConfig(num_workers=512, use_gpu=True),
    torch_config=TorchConfig(backend="nccl", timeout_s=1800),
    run_config=RunConfig(
        storage_path="s3://ray-train/llama70b",
        failure_config=FailureConfig(max_failures=5),
    ),
)
result = trainer.fit()

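Hugging Face Accelerate and trl, axolotl, unsloth, torchtune, LLaMA-Factory

Accelerate is a thin wrapper that enables single-GPU, multi-GPU, and multi-node training with almost no changes to user code, and internally it can dispatch to DeepSpeed, FSDP, or Megatron backends. trl (Transformer Reinforcement Learning) targets alignment training such as SFT, DPO, GRPO, and RLOO, while axolotl is a community-standard tool that runs LoRA, QLoRA, or full fine-tuning from a single YAML file. unsloth is a free tool that reports roughly 2x faster LoRA training of 7B models thanks to hand-written Triton kernels, and the official PyTorch library torchtune aims for "training recipes without a framework". Alibaba's ms-swift and LLaMA-Factory are the standard choices in the Chinese ecosystem.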

JAX, Flax, Equinox, MaxText, Pax, Levanter

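In the Google camp, distributed training expresses nearly every form of parallelism through JAX's pjit/jit plus the Sharding API. TP, PP, and DP are not separate libraries but are unified under a single sharding specification. MaxText is the reference LLM training code released by Google (JAX, supporting both TPU and GPU), Pax is part of the framework Google used internally to train PaLM and Gemini, and Levanter is a training library from Stanford CRFM that puts legibility and reproducibility first. The following is a JAX example that uses jit with NamedSharding.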

import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# jax.devices() returns a list, so wrap it in an array before reshaping.
devices = np.array(jax.devices())  # e.g. 256 TPU chips
mesh = Mesh(devices.reshape(8, 32), axis_names=("dp", "mp"))

def init_params(key):
    return jax.random.normal(key, (16384, 16384))

# Shard the weight's output dimension over "mp"; replicate it over "dp".
w_sharding = NamedSharding(mesh, P(None, "mp"))
w = jax.device_put(init_params(jax.random.PRNGKey(0)), w_sharding)

def loss_fn(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

@jax.jit
def train_step(w, x, y):
    # value_and_grad returns the loss and its gradient in a single pass.
    loss, grads = jax.value_and_grad(loss_fn)(w, x, y)
    return w - 0.01 * grads, loss

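Mixed precision: fp16, bf16, fp8, NF4, MXFP4

By the Hopper (H100) generation, bf16 had become the de facto default for training. fp8 (E4M3 and E5M2) training first became practical on Hopper, and with Blackwell B200/B100 combined with Megatron-Core and Transformer Engine, stable convergence is reported even at 70B+. NF4 (4-bit NormalFloat) is used for inference and QLoRA quantization, and MXFP4 (microscaling FP4) is a block-scaled 4-bit format that gained hardware support on Blackwell and is used mainly for forward-pass quantization. Existing codebases such as EleutherAI's GPT-NeoX (the DeepSpeed + Megatron training framework behind the Pythia series) are also migrating in 2026 to Megatron-Core or TorchTitan in order to turn on fp8. Enabling Transformer Engine looks like this.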

import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Format, DelayedScaling

fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,  # E4M3 forward, E5M2 backward
    amax_history_len=1024,
    amax_compute_algo="max",
)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    output = te_linear_layer(input_tensor)
    loss = compute_loss(output, target)
loss.backward()

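In long-context training (128K to 1M tokens), activation memory dwarfs parameter memory. Activation checkpointing trades compute for memory by discarding intermediate forward activations and recomputing them during the backward pass: memory drops to half or less, but wall-clock time rises by 25-33%. Megatron-Core's selective recompute recomputes only memory-heavy blocks such as attention, which cuts the time penalty to 5-10%.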
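To see why selective recompute targets attention, the sketch below uses a commonly cited per-layer estimate from NVIDIA's analysis of activation recomputation in large transformers: about s·b·h·(34 + 5·a·s/h) bytes per layer with 16-bit activations, where the 5·a·s/h term comes from attention scores and dropout masks. This is a rough model under stated assumptions: it presumes a non-fused attention implementation (fused kernels such as FlashAttention avoid materializing the s × s scores in the first place) and tensor plus sequence parallelism dividing everything by the TP size. The function name is hypothetical.

```python
def activation_bytes_per_layer(seq_len, micro_batch, hidden, heads, tp=1, selective=False):
    """Rough activation memory per transformer layer, in bytes.

    Uses the s*b*h*(34 + 5*a*s/h) estimate for 16-bit activations.
    selective=True drops the attention-score term, modelling selective recompute.
    tp divides the total, assuming tensor and sequence parallelism are combined.
    """
    sbh = seq_len * micro_batch * hidden
    attention_term = 0 if selective else 5 * heads * seq_len / hidden
    return sbh * (34 + attention_term) / tp

# Shape taken from the 70B launcher earlier in the article: seq 8192, hidden 8192, 64 heads.
full = activation_bytes_per_layer(8192, 1, 8192, 64, tp=8)
sel = activation_bytes_per_layer(8192, 1, 8192, 64, tp=8, selective=True)
print(f"full: {full / 1e9:.2f} GB/layer, selective: {sel / 1e9:.2f} GB/layer")
```

Because the attention term grows with the square of the sequence length while the rest grows linearly, the gap between the two figures widens quickly as context length increases.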

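CUDA Graphs, NCCL tuning, cluster network

CUDA Graphs capture the kernel launches of the forward and backward passes and replay them as a single graph, removing per-kernel launch overhead. With small batches such as micro-batch=1, this can raise throughput by 1.3-1.5x. The key to NCCL tuning is setting NCCL_ALGO (Ring/Tree/CollNet), NCCL_PROTO (LL/LL128/Simple), NCCL_BUFFSIZE, and NCCL_NSOCKS_PERTHREAD to match the cluster topology. There are four common network fabrics: NVIDIA Quantum-X800 InfiniBand (SHARP in-network reduction), Ethernet-based RoCE v2 (requires PFC/ECN tuning), HPE Slingshot 11 (proven on Frontier and El Capitan), and AWS EFA (with its own SRD protocol). The following environment variables are commonly used on GB200 NVL72 clusters.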

export NCCL_DEBUG=WARN
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7
export NCCL_IB_GID_INDEX=3
export NCCL_IB_TIMEOUT=22
export NCCL_IB_RETRY_CNT=7
export NCCL_SOCKET_IFNAME=ib0
export NCCL_NET_GDR_LEVEL=PHB
export NCCL_P2P_LEVEL=NVL
export NCCL_NVLS_ENABLE=1
export NCCL_BUFFSIZE=8388608
export NCCL_ALGO=NVLSTree
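When comparing algorithm and fabric choices, a rough cost model is useful for judging whether a measured all-reduce time is plausible. The sketch below is the textbook bandwidth-only estimate for a ring all-reduce, in which each rank sends and receives 2·(N−1)/N of the buffer; it ignores latency, protocol overhead, and in-network reduction such as SHARP or NVLS. The function name and the example link speed are illustrative assumptions, not measured values.

```python
def ring_allreduce_seconds(num_bytes, num_ranks, link_gb_per_s):
    """Bandwidth-only ring all-reduce estimate.

    Each rank moves 2*(N-1)/N of the buffer over its link
    (reduce-scatter followed by all-gather).
    """
    traffic = 2 * (num_ranks - 1) / num_ranks * num_bytes
    return traffic / (link_gb_per_s * 1e9)

# Hypothetical example: 8 GB of gradients across 8 ranks on 100 GB/s links.
t = ring_allreduce_seconds(8e9, num_ranks=8, link_gb_per_s=100)
print(f"estimated all-reduce time: {t * 1e3:.0f} ms")
```

If a measured all-reduce is several times slower than this bound, the usual suspects are a wrong interface selection, a fallback to sockets, or a congested fabric rather than the algorithm itself.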

NVIDIA Blackwell B100/B200/GB200 NVL72

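The Blackwell B200 provides 192GB of HBM3E, 8TB/s of memory bandwidth, and 9 PFLOPS of FP8 and 18 PFLOPS of FP4 (vendor peak figures, with sparsity). The GB200 Grace Blackwell Superchip connects its B200 GPUs directly to a Grace CPU over NVLink-C2C, and a GB200 NVL72 rack ties 72 B200s into a single NVLink domain delivering 1.4 EFLOPS of FP4. This lowers per-token training cost by roughly 30-40% relative to the H100 DGX lineup, and combined with FSDP2 it compresses a 1-trillion-token run of a 70B model to one to two weeks depending on cluster size (compared against a 2025-era 32-node H100 80GB DGX cluster).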

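AMD Instinct MI300X/MI325X and ROCm

The MI300X carries 192GB of HBM3 and the MI325X 256GB of HBM3E; having among the largest per-GPU memory capacities available directly eases memory pressure during training. On the software side, PyTorch runs on ROCm with a nearly identical API, and Megatron-LM and DeepSpeed both officially support ROCm. The downsides are that support for some of the latest FlashAttention and Triton kernels lags NVIDIA by 6-12 months, and fp8 training stability is not yet at NVIDIA's level.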

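Intel Gaudi 3 and SynapseAI

Intel Habana Gaudi 3 offers 128GB of HBM2E and 1.835 PFLOPS of BF16, and its main attraction is a price 30-40% below the H100. Rather than integrating directly as a PyTorch backend, SynapseAI absorbs PyTorch modules into its graph compiler, so familiar code often does not run unmodified. Even so, paired with Intel Tiber AI Cloud it offers attractive price/performance on standard workloads such as BERT and Llama-7B.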

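AWS Trainium 2 and Google TPU v5p, v6e Trillium

AWS Trainium 2 (Trn2) packs 16 Trainium2 chips and 1.5TB of HBM into one instance, with 64-chip Trn2 UltraServers and EFA-connected UltraClusters as the flagship configurations. It has been made public that Anthropic's Claude training runs on a hybrid of Trainium 2 and GPUs. Google's TPU v5p scales to 8,960-chip pods, and v6e Trillium delivers 4.7x the peak compute per chip of the previous-generation TPU v5e; TPUs are the workhorse for Gemini training. Combined with JAX, a single sharding specification scales as-is to around 10,000 chips.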

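Distributed checkpointing and failure recovery

You should assume not that training jobs never fail, but that they fail frequently. A commonly reported figure is an average of about 1.5 GPU or network failures during one week of training on a 1000-GPU cluster. PyTorch has standardized asynchronous distributed checkpointing with DCP (distributed checkpoint, torch.distributed.checkpoint) and TorchSnapshot, and Megatron-LM's own distributed checkpoint format (zarr-based) lets the same checkpoint be loaded even after the PP/TP sizes change. Training code follows this pattern: (1) save asynchronously every N steps, (2) abort gracefully when a collective exceeds the NCCL process-group timeout, and (3) let torchrun or Ray perform an elastic restart and resume from the last checkpoint.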

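import os
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

def save_async(model, optimizer, step, ckpt_dir):
    # get_state_dict returns sharded state dicts keyed by fully qualified names,
    # which keeps the checkpoint loadable after the sharding layout changes.
    model_sd, optim_sd = get_state_dict(model, optimizer)
    state = {"model": model_sd, "optim": optim_sd}
    writer = dcp.FileSystemWriter(f"{ckpt_dir}/step_{step}", thread_count=8)
    return dcp.async_save(state, storage_writer=writer)  # returns a Future

def load_latest(model, optimizer, ckpt_dir):
    # The step is recovered from the directory name (step_<N>), newest first.
    steps = [int(d.split("_")[1]) for d in os.listdir(ckpt_dir) if d.startswith("step_")]
    if not steps:
        return 0
    step = max(steps)
    model_sd, optim_sd = get_state_dict(model, optimizer)
    state = {"model": model_sd, "optim": optim_sd}
    dcp.load(state, storage_reader=dcp.FileSystemReader(f"{ckpt_dir}/step_{step}"))
    set_state_dict(model, optimizer, model_state_dict=model_sd, optim_state_dict=optim_sd)
    return step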
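How large should the "every N steps" interval be? One classical rule of thumb is Young's approximation, which sets the interval to sqrt(2 × checkpoint cost × mean time between failures). The sketch below applies it to the failure rate quoted above (about 1.5 failures per week); the 300-second checkpoint cost is a purely hypothetical assumption, and with asynchronous saving only the blocking portion of the save should be counted.

```python
import math

def young_interval_seconds(checkpoint_cost_s, mtbf_s):
    """Young's approximation for the checkpoint interval that minimizes lost work."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

week_s = 7 * 24 * 3600
mtbf_s = week_s / 1.5                              # ~1.5 failures per week, from the text
interval_s = young_interval_seconds(300, mtbf_s)   # 300 s save cost is an assumption
print(f"checkpoint roughly every {interval_s / 3600:.1f} hours")
```

Dividing the interval by the measured seconds per step gives N. Faster saves or a less reliable cluster both push toward more frequent checkpoints.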

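Failure modes during training: CUDA OOM, NCCL hang, divergent loss

The three most common failures are CUDA OOM, NCCL collective hangs, and diverging loss. OOM usually comes from activation memory blowing up with sequence length (quadratically when attention scores are materialized), so adjust recompute granularity and micro-batch size first. For an NCCL hang, use NCCL_DEBUG=INFO together with the timeout_s setting to find which rank is stuck, then check network interfaces, firewalls, and cross-node ("diagonal") communication failures. Diverging loss is most often caused by too high a learning rate, missing gradient clipping, or fp8 amax statistics failing to update.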

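By 2026 GPU clouds have split into tier 1 and tier 2. Tier-1 providers (CoreWeave, Lambda Labs, Crusoe, Together AI, Nebius) sell superpods of thousands of GPUs connected by InfiniBand or RoCE as hourly reservations. Tier-2 providers (RunPod, Vast.ai, LeptonAI, Fal) are closer to services that rent out anything from a single node to a small cluster by the minute. Modal boots GPUs on demand from a code function and is popular for fine-tuning and serving. Prices run about 1.99-3.5 USD/h for an H100 80GB, 4.5-6.5 USD/h for a B200, and 1.7-2.8 USD/h for an MI300X (asking prices as of May 2026).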

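Cost, power, carbon: breaking down per-token training cost

The cost of training a 70B model on 1 trillion tokens is roughly (total GPU-hours) × (price per GPU-hour). On a 32-node H100 80GB DGX cluster (256 GPUs), about 25 days of training comes to 256 × 25 × 24 = 153,600 GPU-hours, or about 460,000 USD at 3 USD per GPU-hour. If the same run takes about 14 days on one B200 NVL72 rack (72 GPUs), that is 72 × 14 × 24 = 24,192 GPU-hours, which at 6 USD per GPU-hour falls to about 145,000 USD. This gap is why new training jobs in 2026 are moving overwhelmingly to Blackwell. Power draw is around 120kW per rack, so it ties directly into the data center's PUE and cooling design.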
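The arithmetic above is easy to rerun for other cluster sizes or price points. The sketch below reproduces the two scenarios from the text; the function name is hypothetical, and the day counts and hourly prices are the article's illustrative figures rather than measurements.

```python
def training_cost_usd(num_gpus, days, usd_per_gpu_hour):
    """Return (GPU-hours, total cost in USD) for a run of the given length."""
    gpu_hours = num_gpus * days * 24
    return gpu_hours, gpu_hours * usd_per_gpu_hour

h100_hours, h100_cost = training_cost_usd(num_gpus=256, days=25, usd_per_gpu_hour=3.0)
b200_hours, b200_cost = training_cost_usd(num_gpus=72, days=14, usd_per_gpu_hour=6.0)

print(f"H100: {h100_hours:,} GPU-hours, {h100_cost:,.0f} USD")  # 153,600 GPU-hours, 460,800 USD
print(f"B200: {b200_hours:,} GPU-hours, {b200_cost:,.0f} USD")  # 24,192 GPU-hours, 145,152 USD
```

Note that the result is dominated by the assumed run length, so the day counts deserve more scrutiny than the hourly price when adapting this to a real quote.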

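Korean cases: LG AI Research EXAONE and Naver HyperCLOVA X

LG AI Research is understood to use its own H100 cluster (an estimated 1000+ GPUs) and a Megatron-LM-based stack to train EXAONE 3.5/4.0. The EXAONE 3.5 32B model went through a Korean tokenizer settled during 13B-token mixed Korean/English training, followed by additional instruction tuning. Naver HyperCLOVA X is being trained in Korea's own IDCs through a Samsung Heavy Industries/Naver Cloud collaboration, with a training pipeline that minimizes offshore data to comply with Korean data regulations (PIPA). KT has publicly used a Megatron + DeepSpeed combination for its own 1-trillion-parameter-class model (Mi:dm).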

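Japanese cases: Sakana AI, Preferred Networks PFCC, ABEJA

Sakana AI, based in Tokyo, sidesteps training cost with a "model evolution (merge)" approach. Instead of training from scratch, it combines and evolves existing models through evolutionary merging, which keeps the training-infrastructure burden low. Preferred Networks trains its PLaMo series on a cluster (PFCC) that mixes its in-house MN-Core/MN-Core 2 accelerators with H100s. ABEJA adopted AWS Trainium 2 for training its Insight series and has publicly shared a reference architecture. In Japan, the Ministry of Economy, Trade and Industry (METI) "AI supercomputer subsidy" is a major driver of training-cluster adoption.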

Which stack to choose: a decision tree

SFT/LoRA below 70B: torchtune + unsloth, or axolotl.
Full-parameter pretraining at 70B: TorchTitan (FSDP2 + TP + PP) or Megatron-Core.
MoE at 405B+: Megatron-Core (strong EP/CP) or DeepSpeed MoE.
JAX/TPU environments: MaxText or Levanter.
Cluster operations: Ray Train (elastic) + Slurm or Kubernetes.
Alignment (SFT/DPO/GRPO): trl.
Fast SOTA tracking: keep the backend abstracted with HF Accelerate.

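Outlook for 2027: what comes next

Blackwell B300, AMD MI355X, Trainium 3, and TPU v6p are lined up to arrive between late 2026 and early 2027. On the software side there are three possibilities: (1) torch.compile + FSDP2 becomes the default while DeepSpeed narrows to niches such as ZeRO++ communication and NVMe offload, (2) Megatron-Core is embedded in more external trainers and becomes the de facto reference compute layer, and (3) JAX regains share on GPUs as well. Distributed training infrastructure is shifting from a question of "how do we get it to run" to "how do we operate it".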

References

DeepSpeed: github.com/microsoft/DeepSpeed
PyTorch FSDP: pytorch.org/docs/stable/fsdp.html
Megatron-LM: github.com/NVIDIA/Megatron-LM
Megatron-Core: docs.nvidia.com/megatron-core
Ray Train: docs.ray.io/en/latest/train
Hugging Face Accelerate: huggingface.co/docs/accelerate
Lightning Fabric: lightning.ai/docs/fabric
TorchTitan: github.com/pytorch/torchtitan
MosaicML Composer: github.com/mosaicml/composer
JAX: jax.readthedocs.io
MaxText: github.com/google/maxtext
GPT-NeoX: github.com/EleutherAI/gpt-neox
axolotl: github.com/OpenAccess-AI-Collective/axolotl
unsloth: github.com/unslothai/unsloth
torchtune: github.com/pytorch/torchtune
trl: github.com/huggingface/trl
LLaMA-Factory: github.com/hiyouga/LLaMA-Factory
NVIDIA Blackwell: nvidia.com/en-us/data-center/blackwell-architecture
AMD Instinct MI300X: amd.com/en/products/accelerators/instinct/mi300/mi300x.html
Google TPU: cloud.google.com/tpu
AWS Trainium: aws.amazon.com/machine-learning/trainium
Lambda Labs: lambdalabs.com
CoreWeave: coreweave.com
Modal: modal.com
Together AI: together.ai
Crusoe: crusoe.ai
RunPod: runpod.io