The Complete Guide to LLM Fine-Tuning with Unsloth (2025): QLoRA, 4-bit Quantization, and 2x Faster Training

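Introduction: Why Unsloth?
1. LoRA/QLoRA Theory
2. Environment Setup
3. Step-by-Step Unsloth Fine-Tuning Guide
- 3.1 Loading the Model
- 3.2 Configuring the LoRA Adapter
4. Data Preparation
5. Training Configuration
6. VRAM Optimization Techniques
7. Exporting and Converting the Model
8. Evaluation and Testing
9. Advanced Techniques
10. Common Problems and Fixes
11. Quiz
12. References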

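Introduction: Why Unsloth?

The biggest barrier to entry for LLM fine-tuning is **GPU memory (VRAM)**. Full fine-tuning of Llama 3.1 8B needs roughly 60GB of VRAM, which is tight even on a single A100 80GB. QLoRA solved the memory problem, but training remained slow.

Unsloth addresses both problems at once:

Comparison	HuggingFace PEFT	Axolotl	Unsloth
Training speed	1x (baseline)	1.1x	2x
Memory usage	100%	95%	40%
Setup difficulty	Medium	High	Low
Supported models	All	All	Major models
Flash Attention	Separate install	Built in	Built in
Custom kernels	None	None	Triton kernels

The key to Unsloth is its custom Triton kernels. It replaces core operations such as attention, MLP, and cross-entropy loss with GPU-optimized custom kernels, which is how it reaches about 2x faster training and about 60% memory savings.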

Supported models (as of 2025):

Llama 3 / 3.1 (8B, 70B) / 3.2 (1B, 3B)
Mistral / Mixtral
Phi-3 / Phi-3.5
Qwen 2 / 2.5
Gemma 2
Yi
DeepSeek V2

1. LoRA/QLoRA Theory

1.1 Full Fine-tuning vs LoRA vs QLoRA

Full Fine-tuning (all parameters updated)
┌──────────────────────┐
│   W (d x d)          │  <- the entire weight matrix is updated
│   e.g. 4096 x 4096   │     = 16.8M parameters
│   = 32MB (FP16)      │
└──────────────────────┘

LoRA (Low-Rank Adaptation)
┌──────────────────────┐
│   W0 (frozen) + B * A│
│   W0: 4096 x 4096    │  <- frozen (not updated)
│   B: 4096 x 16       │  <- trained (65.5K parameters)
│   A: 16 x 4096       │  <- trained (65.5K parameters)
│   = 0.25MB (FP16)    │     131K parameters in total
└──────────────────────┘

QLoRA (Quantized LoRA)
┌──────────────────────┐
│   W0 (4bit) + B * A  │
│   W0: 4096 x 4096    │  <- 4-bit quantized (8MB)
│   B: 4096 x 16       │  <- trained in FP16
│   A: 16 x 4096       │  <- trained in FP16
│   = 8.25MB total     │
└──────────────────────┘
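
A quick way to sanity-check per-matrix sizes like these is to multiply the parameter count by the bits per parameter. The sketch below is only that arithmetic; `matrix_bytes` is a hypothetical helper, and it ignores quantization constants and any framework overhead.

```python
def matrix_bytes(rows: int, cols: int, bits_per_param: float) -> float:
    """Storage for a rows x cols matrix at the given precision, in bytes."""
    return rows * cols * bits_per_param / 8

MiB = 1024 ** 2
d, r = 4096, 16

full_fp16 = matrix_bytes(d, d, 16)                            # W in FP16
lora_fp16 = matrix_bytes(d, r, 16) + matrix_bytes(r, d, 16)   # B and A in FP16
base_4bit = matrix_bytes(d, d, 4)                             # W0 in 4-bit

print(f"Full FT : {full_fp16 / MiB:.2f} MiB")
print(f"LoRA    : {lora_fp16 / MiB:.2f} MiB")
print(f"QLoRA   : {(base_4bit + lora_fp16) / MiB:.2f} MiB")
```

The same function scales to whole models by summing over every weight matrix, although real VRAM use also includes activations, gradients, and optimizer state.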

1.2 How Low-Rank Decomposition Works

LoRA's core idea rests on the observation that **the weight update matrix is effectively low-rank**.

The original weight update:

W_new = W_old + delta_W

LoRA decomposes delta_W into the product of two small matrices:

delta_W = B * A
where:
  B is a d x r matrix (d = model dimension, r = LoRA rank)
  A is an r x d matrix
  r << d (e.g. r=16, d=4096)
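
To make the decomposition concrete, here is a minimal pure-Python sketch of a LoRA forward pass on a toy layer. It assumes the conventions of the LoRA paper (B initialized to zero, the update scaled by `alpha / r`); the helper names `matvec` and `lora_forward` are illustrative and do not come from any library.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W0, A, B, x, alpha, r):
    """y = W0 x + (alpha / r) * B (A x); W0 stays frozen, only A and B train."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))   # passes through an r-dimensional bottleneck
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d, r, alpha = 8, 2, 2
W0 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # small random init
B = [[0.0] * r for _ in range(d)]                                   # zero init
x = [random.gauss(0, 1) for _ in range(d)]

y_base = matvec(W0, x)
y_lora = lora_forward(W0, A, B, x, alpha, r)
print(y_lora == y_base)  # True: with B at zero the adapter starts as a no-op
```

Because B starts at zero, fine-tuning begins from exactly the pretrained model's behavior and moves away from it only as B and A are trained.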

Parameter savings:

# Parameter count for full fine-tuning
d = 4096
full_params = d * d  # = 16,777,216 (16.8M)

# Parameter count for LoRA
r = 16
lora_params = d * r + r * d  # = 131,072 (131K)

# Savings
savings = 1 - (lora_params / full_params)
print(f"Parameter savings: {savings:.2%}")  # 99.22%

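1.3 4-bit NormalFloat Quantization (NF4)

The NF4 quantization used by QLoRA differs from ordinary 4-bit quantization:

Ordinary 4-bit INT quantization:

Splits the range uniformly into 16 levels
Ignores how the values are distributed

NF4 (NormalFloat4):

Exploits the fact that weights are approximately normally distributed
Places its 16 values at quantiles of the normal distribution
Close to information-theoretically optimal for normally distributed data

# Example NF4 quantization values (based on normal-distribution quantiles)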
nf4_values = [
    -1.0, -0.6962, -0.5251, -0.3949,
    -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379,
    0.4407, 0.5626, 0.7230, 1.0,
]
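
The following sketch shows how such a lookup table could be used for blockwise quantization: each block is scaled by its absolute maximum, and every weight is mapped to the nearest of the 16 levels. It is a simplified illustration, not the bitsandbytes implementation; the function names and the toy block of 8 weights are assumptions made for the example.

```python
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]

def quantize_block(weights):
    """Return (absmax scale, list of 4-bit indices) for one block of weights."""
    absmax = max(abs(w) for w in weights) or 1.0   # avoid division by zero
    indices = [
        min(range(16), key=lambda i: abs(NF4_LEVELS[i] - w / absmax))
        for w in weights
    ]
    return absmax, indices

def dequantize_block(absmax, indices):
    """Map 4-bit indices back to approximate weights."""
    return [NF4_LEVELS[i] * absmax for i in indices]

block = [0.021, -0.013, 0.004, -0.034, 0.009, 0.0, 0.017, -0.002]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes)
print(f"max abs error: {max_error:.4f}")
```

Only the 4-bit indices and one scale per block need to be stored, which is why the per-block scale (the quantization constant) becomes the next thing worth compressing.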

1.4 Double Quantization

Another innovation in QLoRA is **double quantization**:

Quantize the weights to 4-bit (NF4)
Quantize the quantization constants (scaling factors) again, to 8-bit
Extra memory savings: each block's constant shrinks from 32 bits to 8 bits
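
The savings can be worked out as bits of overhead per weight. The sketch below assumes the setup described in the QLoRA paper (one constant per block of 64 weights, and a second-level 32-bit constant per 256 first-level constants); the function names are illustrative.

```python
def constant_overhead_bits(block_size=64, const_bits=32):
    """Bits per weight spent on one quantization constant per block."""
    return const_bits / block_size

def double_quant_overhead_bits(block_size=64, const_bits=8,
                               second_block_size=256, second_const_bits=32):
    """8-bit first-level constants plus one 32-bit constant per group of them."""
    first_level = const_bits / block_size
    second_level = second_const_bits / (block_size * second_block_size)
    return first_level + second_level

plain = constant_overhead_bits()
double = double_quant_overhead_bits()
print(f"without double quantization: {plain:.3f} bits/param")
print(f"with double quantization:    {double:.3f} bits/param")
print(f"saved:                       {plain - double:.3f} bits/param")
```

Under these assumptions the saving is roughly 0.37 bits per parameter, which for an 8B-parameter model is on the order of a few hundred megabytes.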

1.5 Memory Comparison

Model	Full FT (FP16)	LoRA (FP16)	QLoRA (4-bit)
Llama 3 8B	~60GB	~18GB	~6GB
Llama 3 70B	~500GB	~160GB	~40GB
Mistral 7B	~52GB	~16GB	~5GB
Phi-3 3.8B	~28GB	~9GB	~3GB
Qwen 2 7B	~52GB	~16GB	~5GB

2. Environment Setup

2.1 GPU Requirements

GPU	VRAM	Trainable models (QLoRA)
T4 (Colab Free)	16GB	7B~8B (seq_len 1024)
A10G	24GB	7B~13B
RTX 4090	24GB	7B~13B
A100 40GB	40GB	7B~70B
A100 80GB	80GB	70B+
Apple M2 Ultra	192GB	CPU training (slow)

2.2 Google Colab Setup

# Install Unsloth on Colab (assuming a T4 GPU)
# Runtime -> Change runtime type -> select T4 GPU

# 1. Install Unsloth
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

# 2. Check the GPU
import torch
print(f"GPU: {torch.cuda.get_device_name(0)}")
# The device property is named total_memory (in bytes)
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

2.3 Local Setup

# Create a Conda environment
conda create -n unsloth python=3.11
conda activate unsloth

# Install PyTorch (CUDA 12.1)
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

# Install Unsloth
pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes

# Verify the installation
python -c "from unsloth import FastLanguageModel; print('Unsloth OK')"

2.4 Docker Environment

FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

RUN apt-get update && apt-get install -y python3.11 python3-pip git

RUN pip install torch --index-url https://download.pytorch.org/whl/cu121
RUN pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
RUN pip install --no-deps trl peft accelerate bitsandbytes

WORKDIR /workspace
CMD ["python3"]

3. Step-by-Step Unsloth Fine-Tuning Guide

3.1 Loading the Model

from unsloth import FastLanguageModel
import torch

# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",  # pre-quantized 4-bit model
    max_seq_length=2048,    # maximum sequence length
    dtype=None,             # auto-detect (bfloat16 on A100-class GPUs, float16 otherwise)
    load_in_4bit=True,      # load with 4-bit quantization
)

# Check GPU memory
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)  # attribute is total_memory
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

Recommended pre-quantized models:

Use case	Model	Size
General Korean	`unsloth/Meta-Llama-3.1-8B-bnb-4bit`	~5GB
Korean-specialized	`beomi/Llama-3-Open-Ko-8B-bnb-4bit`	~5GB
Coding	`unsloth/Mistral-7B-v0.3-bnb-4bit`	~4.5GB
Lightweight	`unsloth/Phi-3.5-mini-instruct-bnb-4bit`	~2.5GB
Multilingual	`unsloth/Qwen2.5-7B-bnb-4bit`	~4.5GB
3.2 Configuring the LoRA Adapter

# Add the LoRA adapter
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                          # LoRA rank (8, 16, 32, 64)
    target_modules=[               # modules that receive LoRA adapters
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj",       # MLP
    ],
    lora_alpha=16,                 # LoRA alpha (usually set equal to r)
    lora_dropout=0,                # 0 is the optimized setting in Unsloth
    bias="none",                   # do not train biases
    use_gradient_checkpointing="unsloth",  # Unsloth-optimized checkpointing
    random_state=3407,
    use_rslora=False,              # Rank-Stabilized LoRA (experimental)
    loftq_config=None,             # LoftQ configuration
)

# Report the number of trainable parameters
def print_trainable_parameters(model):
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / Total: {total:,} = {trainable/total:.2%}")

print_trainable_parameters(model)
# Example output: Trainable: 41,943,040 / Total: 8,030,261,248 = 0.52%

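LoRA rank selection guide:

LoRA r	Parameters	Extra VRAM	Recommended use
8	~21M	~80MB	Simple tasks, limited VRAM
16	~42M	~160MB	General-purpose default
32	~84M	~320MB	Complex tasks
64	~168M	~640MB	Large datasets, high expressiveness needed
128	~336M	~1.3GB	Experimental, approaches full fine-tuning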
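
The parameter counts in this table can be reproduced from the layer shapes. The sketch below assumes the Llama 3.1 8B dimensions (hidden size 4096, key/value projection width 1024, MLP intermediate size 14336, 32 layers) and LoRA applied to all seven projection modules; `lora_params` is a hypothetical helper.

```python
def lora_params(r, hidden=4096, kv_dim=1024, intermediate=14336, layers=32):
    """Trainable LoRA parameters when all seven projection modules are adapted."""
    shapes = {  # (in_features, out_features)
        "q_proj": (hidden, hidden),
        "k_proj": (hidden, kv_dim),
        "v_proj": (hidden, kv_dim),
        "o_proj": (hidden, hidden),
        "gate_proj": (hidden, intermediate),
        "up_proj": (hidden, intermediate),
        "down_proj": (intermediate, hidden),
    }
    # Each adapted module adds A (r x in_features) and B (out_features x r)
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * layers

for rank in (8, 16, 32, 64, 128):
    print(f"r={rank:<3} -> {lora_params(rank):,} trainable parameters")
```

The count grows linearly with r, so doubling the rank doubles the adapter size and its optimizer state.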

4. Data Preparation

4.1 Chat Template Formatting

# Alpaca prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Dataset formatting function (EOS is appended so the model learns where to stop)
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

4.2 Loading and Converting Datasets

from datasets import load_dataset

# Load the KoAlpaca dataset
dataset = load_dataset("beomi/KoAlpaca-v1.1a", split="train")

# Convert to the prompt format (KoAlpaca has no separate input field)
def format_koalpaca(examples):
    texts = []
    for instruction, output in zip(examples["instruction"], examples["output"]):
        text = alpaca_prompt.format(instruction, "", output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

dataset = dataset.map(format_koalpaca, batched=True)

# Load ShareGPT-format data (multi-turn)
sharegpt_dataset = load_dataset("philschmid/sharegpt-raw", split="train")

def format_sharegpt(examples):
    texts = []
    for conversations in examples["conversations"]:
        text = ""
        for turn in conversations:
            if turn["from"] == "human":
                text += f"### Human:\n{turn['value']}\n\n"
            elif turn["from"] == "gpt":
                text += f"### Assistant:\n{turn['value']}\n\n"
        text += EOS_TOKEN
        texts.append(text)
    return {"text": texts}

# OpenAI messages format (uses the Llama 3 chat template)
def format_openai_messages(examples):
    texts = []
    for messages in examples["messages"]:
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False,
        )
        texts.append(text)
    return {"text": texts}

4.3 Choosing Max Sequence Length

# Analyze the distribution of sequence lengths
def analyze_sequence_lengths(dataset, tokenizer):
    lengths = []
    for item in dataset:
        tokens = tokenizer.encode(item["text"])
        lengths.append(len(tokens))

    import numpy as np
    print(f"Mean length: {np.mean(lengths):.0f}")
    print(f"Median: {np.median(lengths):.0f}")
    print(f"95th percentile: {np.percentile(lengths, 95):.0f}")
    print(f"99th percentile: {np.percentile(lengths, 99):.0f}")
    print(f"Max length: {max(lengths)}")

    # Recommended max_seq_length = 95th percentile (about 5% of examples get truncated)
    recommended = int(np.percentile(lengths, 95))
    print(f"\nRecommended max_seq_length: {recommended}")
    return lengths

analyze_sequence_lengths(dataset, tokenizer)

5. Training Configuration

5.1 SFTTrainer Configuration

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,          # number of processes for data preprocessing
    packing=False,               # pack short sequences (True: fewer padding tokens)
    args=TrainingArguments(
        # === Basics ===
        output_dir="./outputs",
        num_train_epochs=3,

        # === Batch & memory ===
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch = 2 * 4 = 8

        # === Learning rate ===
        learning_rate=2e-4,             # recommended learning rate for QLoRA
        lr_scheduler_type="cosine",     # cosine scheduler
        warmup_steps=5,                 # warmup steps

        # === Precision ===
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),

        # === Logging ===
        logging_steps=1,
        logging_dir="./logs",
        report_to="wandb",             # Weights & Biases integration

        # === Checkpointing ===
        save_strategy="steps",
        save_steps=100,
        save_total_limit=3,

        # === Optimization ===
        optim="adamw_8bit",            # 8-bit AdamW (saves memory)
        weight_decay=0.01,
        max_grad_norm=0.3,
        seed=3407,
    ),
)

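5.2 Training Parameters in Detail

Learning rate guide:

Scenario	Recommended learning rate	Rationale
QLoRA default	2e-4	Value recommended in the QLoRA paper
Large dataset (100K+)	1e-4	Guards against overfitting
Small dataset (1K or fewer)	5e-5 ~ 1e-4	Finer-grained updates
Domain adaptation	2e-5 ~ 5e-5	Preserves existing knowledge
Continued Pre-training	1e-5 ~ 5e-5	Stable training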
Batch size vs gradient accumulation:

# Two ways to reach the same effective batch size of 8

# Option 1: large batch (needs more VRAM)
per_device_train_batch_size = 8
gradient_accumulation_steps = 1
# effective batch = 8 * 1 = 8, VRAM: ~12GB

# Option 2: small batch + gradient accumulation (needs less VRAM)
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
# effective batch = 2 * 4 = 8, VRAM: ~6GB
# Note: training is slightly slower
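
The two options produce the same update because averaging the gradients of equally sized micro-batches equals the gradient over the full batch. The toy example below checks this for a one-parameter linear model with a mean-squared-error loss; the data and the `grad_mse` helper are made up for illustration.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the model y = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 14.0), (8.0, 15.9)]
w = 0.5

# Option 1: one batch of 8
full_grad = grad_mse(w, data)

# Option 2: 4 micro-batches of 2, averaged before the optimizer step
micro_batches = [data[i:i + 2] for i in range(0, len(data), 2)]
accumulated = sum(grad_mse(w, mb) for mb in micro_batches) / len(micro_batches)

print(f"batch of 8:     {full_grad:.6f}")
print(f"4 x batch of 2: {accumulated:.6f}")
```

Only one micro-batch of activations is in memory at a time, which is where the VRAM saving comes from.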

5.3 Wandb Integration

import wandb

# Log in to Wandb and configure the project
wandb.login(key="your-wandb-api-key")
wandb.init(
    project="korean-llm-finetuning",
    name="llama3-8b-koalpaca-qlora",
    config={
        "model": "Meta-Llama-3.1-8B",
        "dataset": "KoAlpaca-v1.1a",
        "lora_r": 16,
        "learning_rate": 2e-4,
        "epochs": 3,
    },
)

5.4 Running Training

# Start training
trainer_stats = trainer.train()

# Print training results
print(f"Training time: {trainer_stats.metrics['train_runtime']:.2f} s")
print(f"Final loss: {trainer_stats.metrics['train_loss']:.4f}")

# Check GPU memory usage
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
print(f"Peak VRAM reserved: {used_memory} GB")

6. VRAM Optimization Techniques

6.1 Gradient Checkpointing

# Unsloth-optimized gradient checkpointing
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # the key setting: about 30% VRAM savings
)

# Standard gradient checkpointing vs Unsloth
# "unsloth": Unsloth-optimized version (faster and more memory efficient)
# True: standard PyTorch gradient checkpointing
# False: disabled (fastest, but uses the most memory)
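
The idea behind gradient checkpointing can be shown on a toy chain of scalar layers: keep only a few activations during the forward pass and recompute the rest segment by segment during the backward pass. This is a conceptual sketch in plain Python with made-up layer functions; it does not reflect how PyTorch or Unsloth implement it internally.

```python
import math

def forward_full(x, weights):
    """Run all layers, keeping every activation (no checkpointing)."""
    acts = [x]
    for w in weights:
        acts.append(math.tanh(w * acts[-1]))
    return acts

def grad_full(x, weights):
    """d(output)/d(input) using activations stored for every layer."""
    acts = forward_full(x, weights)
    grad = 1.0
    for w, a_out in zip(reversed(weights), reversed(acts[1:])):
        grad *= w * (1.0 - a_out ** 2)      # derivative of tanh(w * a_in)
    return grad, len(acts)

def grad_checkpointed(x, weights, segment):
    """Same gradient, but only segment-boundary activations are kept."""
    checkpoints = [x]
    a = x
    for i, w in enumerate(weights, start=1):
        a = math.tanh(w * a)
        if i % segment == 0 and i < len(weights):
            checkpoints.append(a)
    grad, peak = 1.0, 0
    for seg_idx in reversed(range(len(checkpoints))):
        start = seg_idx * segment
        seg_w = weights[start:start + segment]
        acts = forward_full(checkpoints[seg_idx], seg_w)   # recompute this segment
        peak = max(peak, len(checkpoints) + len(acts) - 1)
        for w, a_out in zip(reversed(seg_w), reversed(acts[1:])):
            grad *= w * (1.0 - a_out ** 2)
    return grad, peak

weights = [0.9 + 0.01 * i for i in range(16)]
g_full, stored_full = grad_full(0.3, weights)
g_ckpt, stored_ckpt = grad_checkpointed(0.3, weights, segment=4)
print(f"gradients match: {math.isclose(g_full, g_ckpt)}")
print(f"activations held: {stored_full} (full) vs {stored_ckpt} (checkpointed)")
```

The gradient is identical, fewer activations are held at any one time, and the price is one extra forward computation per segment.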

6.2 Flash Attention 2

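# Unsloth uses Flash Attention 2 automatically
# No extra configuration needed

# To check manually, print the attention implementation the model config reports
# (hasattr alone would only confirm that the attribute exists):
print(f"Attention implementation: {getattr(model.config, '_attn_implementation', 'unknown')}")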

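6.3 Sequence Packing

# Pack short sequences together to raise GPU utilization
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    packing=True,           # enable sequence packing
    max_seq_length=2048,    # total length of each packed sequence
)

# Effect of packing:
# Packing OFF: [tok tok PAD PAD PAD PAD] [tok PAD PAD PAD PAD PAD]
# Packing ON:  [tok tok tok SEP tok tok tok] -> higher GPU utilization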
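
As a rough illustration of what packing does, the sketch below concatenates tokenized examples with an EOS separator and cuts the stream into fixed-length rows. It is a simplified stand-in for the trainer's packing logic, with toy token ids and hypothetical helper names.

```python
def pack_sequences(sequences, max_seq_length, eos_id):
    """Concatenate tokenized examples (EOS-separated) and cut fixed-size rows."""
    stream = []
    for seq in sequences:
        stream.extend(seq)
        stream.append(eos_id)
    full_rows = len(stream) // max_seq_length     # the trailing remainder is dropped
    return [stream[i * max_seq_length:(i + 1) * max_seq_length]
            for i in range(full_rows)]

def padding_fraction(sequences, max_seq_length):
    """Share of positions wasted on padding when each example gets its own row."""
    used = sum(min(len(s), max_seq_length) for s in sequences)
    return 1 - used / (len(sequences) * max_seq_length)

EOS = 0
examples = [[1] * 5, [2] * 3, [3] * 7, [4] * 2, [5] * 6]   # toy token ids
packed = pack_sequences(examples, max_seq_length=8, eos_id=EOS)

print(f"padding without packing: {padding_fraction(examples, 8):.1%}")
print(f"packed rows: {len(packed)}, each {len(packed[0])} tokens long, no padding")
```

Five padded rows shrink to three fully used rows here; the trade-off is that one row can contain pieces of several examples.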

6.4 VRAM Usage Table (Unsloth QLoRA)

Model	Batch=1	Batch=2	Batch=4	Batch=8
Llama 3 8B	4.2GB	5.8GB	8.5GB	14.2GB
Mistral 7B	3.8GB	5.2GB	7.8GB	13.0GB
Phi-3 3.8B	2.4GB	3.2GB	4.8GB	7.6GB
Qwen 2 7B	3.8GB	5.2GB	7.8GB	13.0GB
Llama 3 70B	36GB	42GB	56GB	OOM

* Measured with max_seq_length=2048 and gradient_checkpointing="unsloth"

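7. Exporting and Converting the Model

7.1 Saving the LoRA Adapter

# Save only the LoRA adapter (small)
model.save_pretrained("lora_adapter")
tokenizer.save_pretrained("lora_adapter")

# List the saved files
import os
for f in os.listdir("lora_adapter"):
    size = os.path.getsize(f"lora_adapter/{f}") / 1024 / 1024
    print(f"  {f}: {size:.1f} MB")

# adapter_config.json: 0.0 MB
# adapter_model.safetensors: 160.0 MB  <- LoRA weights
# tokenizer.json: 17.1 MB

7.2 Merging the Adapter

# Merge the LoRA adapter into the base model
merged_model = model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("merged_model")
tokenizer.save_pretrained("merged_model")

# Alternative when the base model was loaded in 4-bit: Unsloth's own export helper
# writes a 16-bit merged checkpoint in one call
# model.save_pretrained_merged("merged_model", tokenizer, save_method="merged_16bit")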

7.3 GGUF Conversion (for llama.cpp)

# Unsloth's built-in GGUF conversion
# Supports several quantization levels

# Q4_K_M: the most common choice (balance of quality and size)
model.save_pretrained_gguf(
    "model_gguf",
    tokenizer,
    quantization_method="q4_k_m",
)

# Q5_K_M: higher quality
model.save_pretrained_gguf(
    "model_q5",
    tokenizer,
    quantization_method="q5_k_m",
)

# Q8_0: highest quality among the quantized options (larger file)
model.save_pretrained_gguf(
    "model_q8",
    tokenizer,
    quantization_method="q8_0",
)

# F16: no quantization (largest)
model.save_pretrained_gguf(
    "model_f16",
    tokenizer,
    quantization_method="f16",
)

GGUF quantization comparison:

Quantization	File size (8B)	Quality	Inference speed	Recommended for
Q4_K_M	~4.5GB	Good	Fast	General use
Q5_K_M	~5.5GB	Very good	Medium	Quality-focused use
Q8_0	~8.0GB	Excellent	Slow	Highest quality
F16	~16GB	Original	Slowest	Reference

7.4 GPTQ Conversion

# GPTQ quantization (for GPU inference)
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
)

# Prepare calibration data
# Assumption: reuse the formatted training texts; any representative text sample works
calibration_texts = dataset["text"]
calibration_data = [
    tokenizer(text, return_tensors="pt")
    for text in calibration_texts[:128]
]

# Run GPTQ quantization on the merged model from section 7.2
gptq_model = AutoGPTQForCausalLM.from_pretrained(
    "merged_model",
    quantize_config=quantize_config,
)
gptq_model.quantize(calibration_data)
gptq_model.save_quantized("model_gptq")

7.5 Uploading to the Hugging Face Hub

# Upload the model to the Hugging Face Hub

# Upload only the LoRA adapter
model.push_to_hub(
    "my-org/llama3-8b-korean-lora",
    token="hf_xxxxx",
    private=True,
)
tokenizer.push_to_hub(
    "my-org/llama3-8b-korean-lora",
    token="hf_xxxxx",
    private=True,
)

# Upload the GGUF file
model.push_to_hub_gguf(
    "my-org/llama3-8b-korean-gguf",
    tokenizer,
    quantization_method="q4_k_m",
    token="hf_xxxxx",
)

8. Evaluation and Testing

8.1 Inference with the Fine-Tuned Model

# Switch to inference mode
FastLanguageModel.for_inference(model)

# Single-prompt inference
def generate_response(instruction, input_text=""):
    prompt = alpaca_prompt.format(instruction, input_text, "")
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.15,
        do_sample=True,
    )

    response = tokenizer.batch_decode(outputs)[0]
    # Keep only the part after the Response header
    response = response.split("### Response:\n")[-1]
    response = response.replace(tokenizer.eos_token, "").strip()
    return response

# Test prompts (kept in Korean because the model was tuned on Korean data)
test_questions = [
    "한국의 전통 명절에 대해 설명해주세요.",  # "Please explain Korea's traditional holidays."
    "파이썬에서 데코레이터의 동작 원리를 설명해주세요.",  # "Explain how decorators work in Python."
    "건강한 식습관을 위한 팁을 알려주세요.",  # "Give me tips for healthy eating habits."
]

for q in test_questions:
    print(f"Q: {q}")
    print(f"A: {generate_response(q)}")
    print("-" * 80)

8.2 Tuning Generation Parameters

# Effect of different generation parameters
generation_configs = {
    "factual": {      # precise answers
        "temperature": 0.1,
        "top_p": 0.9,
        "repetition_penalty": 1.0,
    },
    "creative": {     # creative answers
        "temperature": 0.8,
        "top_p": 0.95,
        "repetition_penalty": 1.15,
    },
    "balanced": {     # balanced answers
        "temperature": 0.5,
        "top_p": 0.9,
        "repetition_penalty": 1.1,
    },
}
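
To see why these settings behave differently, it helps to look at what temperature and top-p do to a next-token distribution. The sketch below applies both to a toy list of logits in plain Python; it illustrates the general technique and is not the `generate` implementation.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Lower temperature sharpens the distribution; higher flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                               # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}     # renormalized over kept tokens

logits = [4.0, 3.0, 2.0, 1.0, 0.0]
for t in (0.1, 0.5, 0.8):
    probs = softmax_with_temperature(logits, t)
    kept = top_p_filter(probs, top_p=0.9)
    print(f"temperature={t}: top token p={probs[0]:.3f}, tokens kept={len(kept)}")
```

At temperature 0.1 nearly all probability sits on the top token, so sampling is close to greedy decoding; at 0.8 more candidates survive the top-p cut and outputs vary more.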

8.3 Benchmarking with lm-eval-harness

# Benchmark evaluation with lm-eval-harness (shell)
pip install lm-eval

# Evaluate on Korean benchmarks (shell)
lm_eval --model hf \
    --model_args pretrained=./merged_model \
    --tasks kobest_boolq,kobest_copa,kobest_hellaswag,kobest_sentineg,kobest_wic \
    --batch_size 4 \
    --output_path ./eval_results

# Run from Python
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=./merged_model",
    tasks=["kobest_boolq", "kobest_copa", "kobest_hellaswag"],
    batch_size=4,
)

for task, metrics in results["results"].items():
    # Newer lm-eval releases report the metric under "acc,none"; older ones use "acc"
    acc = metrics.get("acc,none", metrics.get("acc", "N/A"))
    print(f"{task}: acc={acc}")

9. Advanced Techniques

9.1 Multi-GPU Training (DeepSpeed ZeRO)

# deepspeed_config.json
"""
{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "reduce_scatter": true
    },
    "bf16": {
        "enabled": true
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
"""

# Launch
# deepspeed --num_gpus 4 train.py --deepspeed deepspeed_config.json

9.2 DPO Training

from trl import DPOTrainer, DPOConfig
from unsloth import FastLanguageModel, PatchDPOTrainer

# Apply the DPO patch
PatchDPOTrainer()

# Prepare the DPO dataset
# DPOTrainer expects "prompt", "chosen", and "rejected" columns;
# rename or map the dataset's columns first if they are named differently
dpo_dataset = load_dataset("argilla/ultrafeedback-binarized-preferences", split="train")

# Configure the DPO trainer
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,           # None in Unsloth (handled automatically)
    args=DPOConfig(
        output_dir="./dpo_output",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=5e-7,       # DPO uses a low learning rate
        num_train_epochs=1,
        beta=0.1,                 # DPO beta (weight on the KL-divergence term)
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
    ),
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)

dpo_trainer.train()
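
For intuition about what the trainer optimizes and what `beta` controls, here is the DPO loss for a single preference pair written out in plain Python. It follows the formula from the DPO paper; the log-probability values are made up, and `dpo_loss` is an illustrative helper rather than part of TRL.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return math.log1p(math.exp(-margin))     # equals -log(sigmoid(margin))

# Policy identical to the reference: margin is 0, loss is log(2)
same_as_ref = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy prefers the chosen answer more strongly than the reference does: loss drops
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(f"{same_as_ref:.4f}")
print(f"{improved:.4f}")
```

A larger `beta` makes the loss react more sharply to the same change in log-probabilities, which keeps the policy closer to the reference model.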

9.3 Continued Pre-training (Domain Adaptation)

# Continued pre-training on domain-specific text
from trl import SFTTrainer

# Domain text data (medical, legal, financial, etc.)
domain_dataset = load_dataset("my-org/medical-korean-corpus", split="train")

# Continued pre-training uses a low learning rate
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=domain_dataset,
    dataset_text_field="text",
    max_seq_length=4096,     # long documents
    packing=True,            # use packing for efficiency
    args=TrainingArguments(
        output_dir="./cpt_output",
        learning_rate=2e-5,  # very low learning rate
        num_train_epochs=1,  # one epoch is enough
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        optim="adamw_8bit",
        warmup_ratio=0.1,
    ),
)

trainer.train()

10. Common Problems and Fixes

10.1 OOM (Out of Memory) Errors

# Symptom: CUDA out of memory
# RuntimeError: CUDA out of memory.
# Tried to allocate 256.00 MiB

# Fixes, in the order to try them:
# 1. Reduce the batch size
per_device_train_batch_size = 1  # minimum

# 2. Increase gradient_accumulation_steps to keep the effective batch size
gradient_accumulation_steps = 8

# 3. Reduce max_seq_length
max_seq_length = 1024  # 2048 -> 1024

# 4. Reduce the LoRA rank
r = 8  # 16 -> 8

# 5. Make sure gradient checkpointing is enabled
use_gradient_checkpointing = "unsloth"

# 6. Clear caches
torch.cuda.empty_cache()
import gc
gc.collect()

10.2 NaN Loss

# Symptom: the loss diverges to NaN
# Cause: learning rate too high, or a problem in the data

# Fixes:
# 1. Lower the learning rate
learning_rate = 1e-5  # 2e-4 -> 1e-5

# 2. Set max_grad_norm
max_grad_norm = 0.3  # gradient clipping

# 3. Validate the data
def check_data_issues(dataset, tokenizer):
    """Scan the dataset for common data problems."""
    issues = []
    for i, item in enumerate(dataset):
        text = item["text"]
        # Empty text
        if not text.strip():
            issues.append(f"[{i}] empty text")
        # Text that is too long
        tokens = tokenizer.encode(text)
        if len(tokens) > 4096:
            issues.append(f"[{i}] text too long: {len(tokens)} tokens")
        # Only special characters
        if not any(c.isalnum() for c in text):
            issues.append(f"[{i}] no valid text")
    return issues
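
The `max_grad_norm` setting above refers to clipping by global norm: if the combined L2 norm of all gradients exceeds the threshold, every gradient is scaled down by the same factor. A minimal sketch on a flat list of numbers follows; the helper name is illustrative, and the small epsilon mirrors the usual guard against division by zero.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

grads = [3.0, -4.0, 12.0]                 # global norm = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=0.3)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(f"norm before: {norm_before:.2f}, after: {norm_after:.2f}")
```

Clipping changes only the length of the update, not its direction, so a single exploding batch cannot throw the weights far off course.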

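10.3 Catastrophic Forgetting

# Symptom: existing knowledge disappears after fine-tuning
# Fixes:

# 1. Use a low learning rate
learning_rate = 5e-5

# 2. Train for few epochs (1-3)
num_train_epochs = 1

# 3. Mix general-knowledge data into the training set
# 80% task data + 20% general-knowledge data

# 4. Lower the LoRA rank (limits how far the weights can move)
r = 8

# 5. Strengthen regularization
weight_decay = 0.1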

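10.4 Detecting Overfitting

# Signs of overfitting
# 1. Train loss keeps falling while eval loss rises
# 2. Outputs that nearly reproduce the training data
# 3. Worse performance on new prompts

# Fixes:
# 1. Add more data
# 2. Regularization (dropout, weight_decay)
# 3. Early stopping
from transformers import EarlyStoppingCallback

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,   # hypothetical names: early stopping needs a held-out split,
    eval_dataset=eval_dataset,     # e.g. from dataset.train_test_split(test_size=0.05)
    dataset_text_field="text",
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    args=TrainingArguments(
        output_dir="./outputs",
        evaluation_strategy="steps",   # renamed to eval_strategy in newer transformers releases
        eval_steps=50,
        save_strategy="steps",         # must match the evaluation strategy
        save_steps=50,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
    ),
)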
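
The patience logic behind early stopping is simple enough to sketch directly: count consecutive evaluations that fail to improve on the best loss so far, and stop once the count reaches the patience value. The function and the loss values below are illustrative and are not the `EarlyStoppingCallback` source.

```python
def early_stop_step(eval_losses, patience=3, eval_steps=50):
    """Return the training step at which early stopping triggers, or None."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses, start=1):
        if loss < best:
            best, bad_evals = loss, 0      # improvement resets the counter
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i * eval_steps
    return None   # never triggered

losses = [1.92, 1.61, 1.48, 1.45, 1.47, 1.50, 1.56, 1.63]
print(early_stop_step(losses))   # 350
```

Here the best eval loss occurs at step 200, and training stops at step 350 after three evaluations without improvement; loading the best model at the end then restores the step-200 checkpoint.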

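11. Quiz

Q1. With r=16 in LoRA, what percentage of the original weights' parameters is trained?

Answer: about 0.78% of a single 4096 x 4096 weight matrix, and about 0.5% of the whole model

For d=4096:

Full: 4096 x 4096 = 16,777,216
LoRA r=16: (4096 x 16) + (16 x 4096) = 131,072
Ratio: 131,072 / 16,777,216 = 0.78%

In practice LoRA is applied to several modules (q, k, v, o, gate, up, down) while the embeddings and other weights stay frozen, so the trainable share comes to about 0.5% of the model's total parameters.

Q2. Why is QLoRA's NF4 quantization better than plain INT4?

Answer: it is a near-optimal quantization that exploits the normal distribution of the weights

NF4 relies on the fact that neural network weights are approximately normally distributed. Because its 16 quantization values are placed at quantiles of the normal distribution, it loses less information than INT4's uniform split. In theory it achieves close to optimal quantization for normally distributed data.

Q3. What is the main reason Unsloth is 2x faster than plain HuggingFace PEFT?

Answer: custom Triton kernels

Unsloth replaces core operations such as attention, MLP, and cross-entropy loss with custom GPU kernels written in Triton. These kernels optimize memory access patterns and avoid creating unnecessary intermediate tensors, yielding about 2x faster training and about 60% memory savings.

Q4. How does gradient checkpointing work, and what is the trade-off?

Answer:

How it works: intermediate activations are not kept in memory during the forward pass; they are recomputed when the backward pass needs them.

Trade-off:

Benefit: VRAM usage drops by about 30~50%
Cost: recomputation increases training time by about 20~30%

Unsloth's custom gradient checkpointing is more efficient than the standard PyTorch implementation, so the time penalty is smaller.

Q5. How do GGUF Q4_K_M and Q8_0 differ, and when should each be used?

Answer:

Q4_K_M (4-bit Mixed):

File size: about 28% of the original (about 4.5GB for an 8B model)
Quality: slight degradation relative to the original
Speed: fast
Recommended for: everyday use, mobile/edge deployment, environments with limited VRAM/RAM

Q8_0 (8-bit):

File size: about 50% of the original (about 8GB for an 8B model)
Quality: very close to the original
Speed: slower than Q4
Recommended for: quality-first use, environments with ample memory, services that need accurate inference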

12. References

LoRA: Low-Rank Adaptation of Large Language Models - Hu et al., 2021
QLoRA: Efficient Finetuning of Quantized LLMs - Dettmers et al., 2023
Unsloth Documentation - github.com/unslothai/unsloth
PEFT: Parameter-Efficient Fine-Tuning - HuggingFace
TRL: Transformer Reinforcement Learning - HuggingFace
Flash Attention 2 - Dao et al., 2023
LLM.int8(): 8-bit Matrix Multiplication - Dettmers et al., 2022
llama.cpp - github.com/ggerganov/llama.cpp
GPTQ: Accurate Post-Training Quantization - Frantar et al., 2022
DeepSpeed ZeRO - Rajbhandari et al., 2020
Direct Preference Optimization - Rafailov et al., 2023
Scaling Data-Constrained Language Models - Muennighoff et al., 2023
Training Compute-Optimal Large Language Models (Chinchilla) - Hoffmann et al., 2022
The Llama 3 Herd of Models - Meta AI, 2024