Complete Guide to LLM Fine-tuning with Unsloth 2025: QLoRA, 4-bit Quantization, 2x Faster Training

Introduction: Why Unsloth?

The biggest barrier to LLM fine-tuning is GPU memory (VRAM). Full fine-tuning of Llama 3.1 8B requires about 60GB VRAM, which is tight even on a single A100 80GB. QLoRA solved this problem, but training speed remained slow.

Unsloth solves both problems simultaneously:

Comparison        | HuggingFace PEFT | Axolotl  | Unsloth
Training Speed    | 1x (baseline)    | 1.1x     | 2x
Memory Usage      | 100%             | 95%      | 40%
Setup Difficulty  | Medium           | High     | Low
Model Support     | All              | All      | Major models
Flash Attention   | Separate install | Built-in | Built-in
Custom Kernels    | None             | None     | Triton kernels

The secret behind Unsloth is custom Triton kernels. Core operations like Attention, MLP, and Cross-Entropy Loss are replaced with GPU-optimized custom kernels, achieving 2x faster training and 60% memory savings.
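
To make "custom Triton kernel" concrete, here is a toy example (not Unsloth's actual code) of a fused elementwise kernel written in Triton; Unsloth's real kernels apply the same idea to attention, RoPE, MLP, and cross-entropy so that fewer intermediate tensors ever touch GPU memory. It assumes a CUDA GPU and the triton package.

# Toy Triton kernel: fused load-add-store with no intermediate tensors
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # each program handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard the tail of the tensor
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
assert torch.allclose(out, x + y)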

Supported Models (as of 2025):

  • Llama 3 / 3.1 / 3.2 (8B, 70B)
  • Mistral / Mixtral
  • Phi-3 / Phi-3.5
  • Qwen 2 / 2.5
  • Gemma 2
  • Yi
  • DeepSeek V2

1. LoRA/QLoRA Theory

1.1 Full Fine-tuning vs LoRA vs QLoRA

Full Fine-tuning (update all parameters)
+------------------------+
|   W (d x d)            |  <- Update entire weights
|   e.g.: 4096 x 4096   |     = 16M parameters
|   = 32MB (FP16)        |
+------------------------+

LoRA (Low-Rank Adaptation)
+------------------------+
|   W0 (frozen) + B * A  |
|   W0: 4096 x 4096     |  <- Frozen (no updates)
|   B: 4096 x 16         |  <- Trainable (65K params)
|   A: 16 x 4096         |  <- Trainable (65K params)
|   = 0.25MB (FP16)      |     Total 130K params
+------------------------+

QLoRA (Quantized LoRA)
+------------------------+
|   W0 (4bit) + B * A   |
|   W0: 4096 x 4096     |  <- 4-bit quantized (8MB)
|   B: 4096 x 16         |  <- FP16 trainable
|   A: 16 x 4096         |  <- FP16 trainable
|   = 8.25MB total        |
+------------------------+

1.2 Low-Rank Decomposition Principle

The core idea of LoRA is based on the observation that weight update matrices are actually low-rank.

The original weight update:

W_new = W_old + delta_W

LoRA decomposes delta_W into a product of two small matrices:

delta_W = B * A
where:
  B is a d x r matrix (d=model dimension, r=LoRA rank)
  A is a r x d matrix
  r << d (e.g., r=16, d=4096)
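
The same decomposition in code: a minimal, self-contained LoRA layer sketch in PyTorch (an illustration, not the PEFT implementation), showing where the frozen W0 and the trainable B and A sit.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = x @ W0^T + (x @ A^T) @ B^T * (alpha / r), with W0 frozen."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)                           # W0 stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)   # r x d
        self.B = nn.Parameter(torch.zeros(base.out_features, r))         # d x r, zero-init
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T) @ self.B.T * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 131072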

Parameter savings:

# Full Fine-tuning parameters
d = 4096
full_params = d * d  # = 16,777,216 (16.7M)

# LoRA parameters
r = 16
lora_params = d * r + r * d  # = 131,072 (131K)

# Savings ratio
savings = 1 - (lora_params / full_params)
print(f"Parameter savings: {savings:.2%}")  # 99.22%

1.3 4-bit NormalFloat Quantization (NF4)

NF4 quantization used in QLoRA differs from standard 4-bit:

Standard 4-bit INT quantization:

  • Uniformly divides into 16 intervals
  • Does not consider value distribution

NF4 (NormalFloat4):

  • Leverages the fact that weights follow a normal distribution
  • Sets 16 values aligned with normal distribution quantiles
  • Near-optimal quantization from an information theory perspective
# NF4 quantization values (based on normal distribution quantiles)
nf4_values = [
    -1.0, -0.6962, -0.5251, -0.3949,
    -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379,
    0.4407, 0.5626, 0.7230, 1.0,
]
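
A toy sketch of how a weight block gets mapped onto these levels. Real NF4 quantization (in bitsandbytes) works on blocks of 64 values with packed 4-bit storage; this only illustrates the absmax-scale-then-snap idea and reuses the nf4_values list above.

import torch

def nf4_quantize_block(w, levels=nf4_values):
    """Absmax-scale a block into [-1, 1], then snap each value to the nearest NF4 level."""
    lv = torch.tensor(levels)
    scale = w.abs().max()                                            # per-block scaling factor
    idx = (w.div(scale).unsqueeze(-1) - lv).abs().argmin(dim=-1)     # 4-bit index per weight
    return idx.to(torch.uint8), scale

def nf4_dequantize_block(idx, scale, levels=nf4_values):
    return torch.tensor(levels)[idx.long()] * scale

block = torch.randn(64)                                              # QLoRA uses block size 64
idx, scale = nf4_quantize_block(block)
err = (block - nf4_dequantize_block(idx, scale)).abs().mean()
print(f"mean abs error after NF4 round-trip: {err:.4f}")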

1.4 Double Quantization

Another innovation of QLoRA is Double Quantization:

  1. Quantize weights to 4-bit (NF4)
  2. Quantize the quantization constants (scaling factors) to 8-bit
  3. Additional memory savings: the per-block quantization constant shrinks from 32-bit to 8-bit (see the rough math below)
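
A rough back-of-the-envelope sketch of that saving, assuming the QLoRA defaults (block size 64 for the weights, a second-level block size of 256 for the constants):

params = 8_030_261_248                   # Llama 3.1 8B parameter count

weights_4bit   = params * 0.5            # bytes for the 4-bit weights themselves
scales_plain   = (params / 64) * 4       # one FP32 constant per 64-weight block (~0.50 bits/param)
scales_doubleq = (params / 64) * 1 + (params / 64 / 256) * 4   # FP8 constants + 2nd-level FP32

print(f"4-bit weights:       {weights_4bit / 1e9:.2f} GB")
print(f"constants, plain:    {scales_plain / 1e9:.3f} GB")
print(f"constants, double-q: {scales_doubleq / 1e9:.3f} GB")     # ~0.13 bits/param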

1.5 Memory Comparison Table

Model       | Full FT (FP16) | LoRA (FP16) | QLoRA (4bit)
Llama 3 8B  | ~60GB          | ~18GB       | ~6GB
Llama 3 70B | ~500GB         | ~160GB      | ~40GB
Mistral 7B  | ~52GB          | ~16GB       | ~5GB
Phi-3 3.8B  | ~28GB          | ~9GB        | ~3GB
Qwen 2 7B   | ~52GB          | ~16GB       | ~5GB

2. Environment Setup

2.1 GPU Requirements

GPU             | VRAM  | Trainable Models (QLoRA)
T4 (Colab Free) | 16GB  | 7B-8B (seq_len 1024)
A10G            | 24GB  | 7B-13B
RTX 4090        | 24GB  | 7B-13B
A100 40GB       | 40GB  | 7B-70B
A100 80GB       | 80GB  | 70B+
Apple M2 Ultra  | 192GB | CPU training (slow)

2.2 Google Colab Setup

# Install Unsloth on Colab (T4 GPU)
# Runtime -> Change runtime type -> Select T4 GPU

# 1. Install Unsloth
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

# 2. Verify GPU
import torch
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_mem / 1024**3:.1f} GB")

2.3 Local Environment Setup

# Create Conda environment
conda create -n unsloth python=3.11
conda activate unsloth

# Install PyTorch (CUDA 12.1)
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

# Install Unsloth
pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes

# Verify installation
python -c "from unsloth import FastLanguageModel; print('Unsloth OK')"

2.4 Docker Environment

FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

RUN apt-get update && apt-get install -y python3 python3-pip git

RUN pip3 install torch --index-url https://download.pytorch.org/whl/cu121
RUN pip3 install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
RUN pip3 install --no-deps trl peft accelerate bitsandbytes

WORKDIR /workspace
CMD ["python3"]

3. Unsloth Fine-tuning Step by Step

3.1 Model Loading

from unsloth import FastLanguageModel
import torch

# Load model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",  # Pre-quantized 4bit model
    max_seq_length=2048,    # Maximum sequence length
    dtype=None,             # Auto-detect (A100: bfloat16, others: float16)
    load_in_4bit=True,      # Load with 4bit quantization
)

# Check GPU memory
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

Recommended Pre-quantized Models:

Use Case        | Model                                   | Size
General Korean  | unsloth/Meta-Llama-3.1-8B-bnb-4bit      | ~5GB
Korean-specific | beomi/Llama-3-Open-Ko-8B-bnb-4bit       | ~5GB
Coding          | unsloth/Mistral-7B-v0.3-bnb-4bit        | ~4.5GB
Lightweight     | unsloth/Phi-3.5-mini-instruct-bnb-4bit  | ~2.5GB
Multilingual    | unsloth/Qwen2.5-7B-bnb-4bit             | ~4.5GB

3.2 LoRA Adapter Configuration

# Add LoRA adapter
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                          # LoRA rank (8, 16, 32, 64)
    target_modules=[               # Modules to apply LoRA
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj",       # MLP
    ],
    lora_alpha=16,                 # LoRA alpha (usually same as r)
    lora_dropout=0,                # 0 is optimal for Unsloth
    bias="none",                   # No bias training
    use_gradient_checkpointing="unsloth",  # Unsloth-optimized checkpointing
    random_state=3407,
    use_rslora=False,              # Rank-Stabilized LoRA (experimental)
    loftq_config=None,             # LoftQ configuration
)

# Check trainable parameters
def print_trainable_parameters(model):
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / Total: {total:,} = {trainable/total:.2%}")

print_trainable_parameters(model)
# Trainable: 41,943,040 / Total: 8,030,261,248 = 0.52%

LoRA Rank Selection Guide:

LoRA r | Parameters | VRAM Overhead | Recommended Use
8      | ~21M       | ~80MB         | Simple tasks, VRAM-limited
16     | ~42M       | ~160MB        | Generally recommended
32     | ~84M       | ~320MB        | Complex tasks
64     | ~168M      | ~640MB        | Large datasets, high expressiveness
128    | ~336M      | ~1.3GB        | Experimental, close to Full FT

4. Data Preparation

4.1 Chat Template Formatting

# Alpaca prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Dataset formatting function
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

4.2 Dataset Loading and Conversion

from datasets import load_dataset

# Load KoAlpaca dataset
dataset = load_dataset("beomi/KoAlpaca-v1.1a", split="train")

# Format conversion
def format_koalpaca(examples):
    texts = []
    for instruction, output in zip(examples["instruction"], examples["output"]):
        text = alpaca_prompt.format(instruction, "", output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

dataset = dataset.map(format_koalpaca, batched=True)
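
# Load ShareGPT-format data (multi-turn conversations)
sharegpt_dataset = load_dataset("philschmid/sharegpt-raw", split="train")

def format_sharegpt(examples):
    texts = []
    for conversations in examples["conversations"]:
        text = ""
        for turn in conversations:
            if turn["from"] == "human":
                text += f"### Human:\n{turn['value']}\n\n"
            elif turn["from"] == "gpt":
                text += f"### Assistant:\n{turn['value']}\n\n"
        text += EOS_TOKEN
        texts.append(text)
    return {"text": texts}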

# OpenAI Messages format (using Llama 3 chat template)
def format_openai_messages(examples):
    texts = []
    for messages in examples["messages"]:
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False,
        )
        texts.append(text)
    return {"text": texts}

4.3 Max Sequence Length Considerations

# Analyze sequence length distribution
def analyze_sequence_lengths(dataset, tokenizer):
    lengths = []
    for item in dataset:
        tokens = tokenizer.encode(item["text"])
        lengths.append(len(tokens))

    import numpy as np
    print(f"Mean length: {np.mean(lengths):.0f}")
    print(f"Median: {np.median(lengths):.0f}")
    print(f"95th percentile: {np.percentile(lengths, 95):.0f}")
    print(f"99th percentile: {np.percentile(lengths, 99):.0f}")
    print(f"Max length: {max(lengths)}")

    recommended = int(np.percentile(lengths, 95))
    print(f"\nRecommended max_seq_length: {recommended}")
    return lengths

analyze_sequence_lengths(dataset, tokenizer)

5. Training Configuration

5.1 SFTTrainer Setup

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        # === Basic ===
        output_dir="./outputs",
        num_train_epochs=3,

        # === Batch & Memory ===
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # Effective batch = 2 * 4 = 8

        # === Learning Rate ===
        learning_rate=2e-4,             # QLoRA recommended LR
        lr_scheduler_type="cosine",
        warmup_steps=5,

        # === Precision ===
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),

        # === Logging ===
        logging_steps=1,
        logging_dir="./logs",
        report_to="wandb",

        # === Saving ===
        save_strategy="steps",
        save_steps=100,
        save_total_limit=3,

        # === Optimization ===
        optim="adamw_8bit",
        weight_decay=0.01,
        max_grad_norm=0.3,
        seed=3407,
    ),
)

5.2 Training Parameter Details

Learning Rate Guide:

Scenario                 | Recommended LR | Reason
QLoRA default            | 2e-4           | QLoRA paper recommendation
Large dataset (100K+)    | 1e-4           | Prevent overfitting
Small dataset (under 1K) | 5e-5 to 1e-4   | Fine-grained learning
Domain adaptation        | 2e-5 to 5e-5   | Preserve existing knowledge
Continued Pre-training   | 1e-5 to 5e-5   | Stable training
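
For intuition on how the cosine schedule with a short warmup (lr_scheduler_type="cosine", warmup_steps=5 in section 5.1) shapes the learning rate over a run, here is a small standalone sketch using transformers' scheduler helper; the optimizer and the 100-step horizon are placeholders.

import torch
from transformers import get_cosine_schedule_with_warmup

opt = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=2e-4)
sched = get_cosine_schedule_with_warmup(opt, num_warmup_steps=5, num_training_steps=100)

for step in range(100):
    opt.step()
    sched.step()
    if step in (0, 4, 50, 99):
        print(f"step {step:3d}: lr = {sched.get_last_lr()[0]:.2e}")
# The LR ramps up over the first 5 steps, then decays along a cosine curve toward 0.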

Batch Size vs Gradient Accumulation:

# Two ways to achieve effective batch size of 8

# Method 1: Large batch (needs more VRAM)
per_device_train_batch_size = 8
gradient_accumulation_steps = 1
# Effective batch = 8 * 1 = 8, VRAM: ~12GB

# Method 2: Small batch + Gradient Accumulation (less VRAM)
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
# Effective batch = 2 * 4 = 8, VRAM: ~6GB
# Note: Training slightly slower
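
5.3 Wandb Integration

import wandb

# Log in to Wandb and set up the project
wandb.login(key="your-wandb-api-key")
wandb.init(
    project="korean-llm-finetuning",
    name="llama3-8b-koalpaca-qlora",
    config={
        "model": "Meta-Llama-3.1-8B",
        "dataset": "KoAlpaca-v1.1a",
        "lora_r": 16,
        "learning_rate": 2e-4,
        "epochs": 3,
    },
)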

5.4 Training Execution

# Start training
trainer_stats = trainer.train()

# Print results
print(f"Training time: {trainer_stats.metrics['train_runtime']:.2f}s")
print(f"Final Loss: {trainer_stats.metrics['train_loss']:.4f}")

# Check GPU memory usage
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
print(f"Peak VRAM usage: {used_memory} GB")

6. VRAM Optimization Techniques

6.1 Gradient Checkpointing

# Unsloth-optimized Gradient Checkpointing
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # Key! 30% VRAM savings
)

# Gradient checkpointing options:
# "unsloth": Unsloth-optimized version (faster and more memory efficient)
# True: Standard PyTorch gradient checkpointing
# False: Disabled (fastest but uses most memory)
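
6.2 Flash Attention 2

# Unsloth uses Flash Attention 2 automatically
# No extra configuration needed!

# To check manually:
print(f"Flash Attention in use: {hasattr(model.config, '_attn_implementation')}")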

6.3 Sequence Packing

# Pack short sequences together for better GPU utilization
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    packing=True,           # Enable sequence packing
    max_seq_length=2048,    # Total packed length
)

# Packing effect:
# Packing OFF: [token token PAD PAD PAD PAD] [token PAD PAD PAD PAD PAD]
# Packing ON:  [token token token SEP token token token] -> Better GPU utilization

6.4 VRAM Usage Table (Unsloth QLoRA)

Model       | Batch=1 | Batch=2 | Batch=4 | Batch=8
Llama 3 8B  | 4.2GB   | 5.8GB   | 8.5GB   | 14.2GB
Mistral 7B  | 3.8GB   | 5.2GB   | 7.8GB   | 13.0GB
Phi-3 3.8B  | 2.4GB   | 3.2GB   | 4.8GB   | 7.6GB
Qwen 2 7B   | 3.8GB   | 5.2GB   | 7.8GB   | 13.0GB
Llama 3 70B | 36GB    | 42GB    | 56GB    | OOM

* Based on max_seq_length=2048, gradient_checkpointing="unsloth"


7. Model Export and Conversion

7.1 Save LoRA Adapter

# Save LoRA adapter only (small size)
model.save_pretrained("lora_adapter")
tokenizer.save_pretrained("lora_adapter")

# Check saved files
import os
for f in os.listdir("lora_adapter"):
    size = os.path.getsize(f"lora_adapter/{f}") / 1024 / 1024
    print(f"  {f}: {size:.1f} MB")

# adapter_config.json: 0.0 MB
# adapter_model.safetensors: 160.0 MB  <- LoRA weights
# tokenizer.json: 17.1 MB
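
The saved adapter can be loaded back later for inference; a sketch assuming the lora_adapter directory saved above (Unsloth resolves the base model from adapter_config.json):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="lora_adapter",   # directory containing the saved adapter
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)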

7.2 Merge Adapter

# Merge LoRA adapter with base model
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("merged_model")
tokenizer.save_pretrained("merged_model")
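
Unsloth also exposes a one-step merge-and-save helper; a sketch (check the Unsloth docs for the save_method values supported by your version):

# Alternative: merge LoRA weights into the base model and save FP16 weights in one call
model.save_pretrained_merged(
    "merged_model",
    tokenizer,
    save_method="merged_16bit",   # "lora" would save only the adapters
)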

7.3 GGUF Conversion (for llama.cpp)

# Unsloth's built-in GGUF conversion
# Supports various quantization levels

# Q4_K_M: Most common (quality/size balance)
model.save_pretrained_gguf(
    "model_gguf",
    tokenizer,
    quantization_method="q4_k_m",
)

# Q5_K_M: Higher quality
model.save_pretrained_gguf(
    "model_q5",
    tokenizer,
    quantization_method="q5_k_m",
)

# Q8_0: Highest quality (larger size)
model.save_pretrained_gguf(
    "model_q8",
    tokenizer,
    quantization_method="q8_0",
)

# F16: No quantization (largest)
model.save_pretrained_gguf(
    "model_f16",
    tokenizer,
    quantization_method="f16",
)

GGUF Quantization Comparison:

Quantization | File Size (8B) | Quality   | Inference Speed | Recommended
Q4_K_M       | ~4.5GB         | Good      | Fast            | General use
Q5_K_M       | ~5.5GB         | Very Good | Medium          | Quality-focused
Q8_0         | ~8.0GB         | Excellent | Slow            | Highest quality
F16          | ~16GB          | Original  | Slowest         | Reference
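
A quick way to sanity-check the exported file is the llama-cpp-python binding; a sketch (the exact .gguf filename inside model_gguf/ depends on the export, so adjust the path):

# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(model_path="model_gguf/model-Q4_K_M.gguf", n_ctx=2048)  # adjust filename
out = llm(
    "### Instruction:\nExplain LoRA in one sentence.\n\n### Response:\n",
    max_tokens=128,
)
print(out["choices"][0]["text"])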

7.4 GPTQ Conversion

# GPTQ quantization (for GPU inference)
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
)

# Prepare calibration data (e.g. reuse 128 samples from the training set)
calibration_texts = [item["text"] for item in dataset.select(range(128))]
calibration_data = [
    tokenizer(text, return_tensors="pt")
    for text in calibration_texts
]

# Run GPTQ quantization
gptq_model = AutoGPTQForCausalLM.from_pretrained(
    "merged_model",
    quantize_config=quantize_config,
)
gptq_model.quantize(calibration_data)
gptq_model.save_quantized("model_gptq")

7.5 Upload to Hugging Face Hub

# Upload model to Hugging Face Hub

# Upload LoRA adapter only
model.push_to_hub(
    "my-org/llama3-8b-korean-lora",
    token="hf_xxxxx",
    private=True,
)
tokenizer.push_to_hub(
    "my-org/llama3-8b-korean-lora",
    token="hf_xxxxx",
    private=True,
)

# Upload GGUF file
model.push_to_hub_gguf(
    "my-org/llama3-8b-korean-gguf",
    tokenizer,
    quantization_method="q4_k_m",
    token="hf_xxxxx",
)

8. Evaluation and Testing

8.1 Inference with Fine-tuned Model

# Switch to inference mode
FastLanguageModel.for_inference(model)

# Single prompt inference
def generate_response(instruction, input_text=""):
    prompt = alpaca_prompt.format(instruction, input_text, "")
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.15,
        do_sample=True,
    )

    response = tokenizer.batch_decode(outputs)[0]
    response = response.split("### Response:\n")[-1]
    response = response.replace(tokenizer.eos_token, "").strip()
    return response

# Test
test_questions = [
    "Explain the traditional holidays of Korea.",
    "Explain how decorators work in Python.",
    "Give me tips for healthy eating habits.",
]

for q in test_questions:
    print(f"Q: {q}")
    print(f"A: {generate_response(q)}")
    print("-" * 80)

8.2 Generation Parameter Tuning

# Generation parameter effects
generation_configs = {
    "Factual answers": {
        "temperature": 0.1,
        "top_p": 0.9,
        "repetition_penalty": 1.0,
    },
    "Creative answers": {
        "temperature": 0.8,
        "top_p": 0.95,
        "repetition_penalty": 1.15,
    },
    "Balanced answers": {
        "temperature": 0.5,
        "top_p": 0.9,
        "repetition_penalty": 1.1,
    },
}
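
A small usage sketch that plugs one of these presets into the generation flow from 8.1 (it reuses model, tokenizer, alpaca_prompt, and generation_configs defined above):

def generate_with_preset(instruction, preset="Balanced answers"):
    prompt = alpaca_prompt.format(instruction, "", "")
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        **generation_configs[preset],   # temperature / top_p / repetition_penalty
    )
    text = tokenizer.batch_decode(outputs)[0]
    return text.split("### Response:\n")[-1].replace(tokenizer.eos_token, "").strip()

print(generate_with_preset("Summarize what QLoRA does.", preset="Factual answers"))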

8.3 lm-eval-harness Benchmark

# Benchmark evaluation with lm-eval-harness
pip install lm-eval

# Korean benchmark evaluation
lm_eval --model hf \
    --model_args pretrained=./merged_model \
    --tasks kobest_boolq,kobest_copa,kobest_hellaswag,kobest_sentineg,kobest_wic \
    --batch_size 4 \
    --output_path ./eval_results

# Run from Python
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=./merged_model",
    tasks=["kobest_boolq", "kobest_copa", "kobest_hellaswag"],
    batch_size=4,
)

for task, metrics in results["results"].items():
    print(f"{task}: acc={metrics.get('acc', 'N/A')}")

9. Advanced Techniques

9.1 Multi-GPU Training (DeepSpeed ZeRO)

# deepspeed_config.json
"""
{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "reduce_scatter": true
    },
    "bf16": {
        "enabled": true
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
"""

# Execute
# deepspeed --num_gpus 4 train.py --deepspeed deepspeed_config.json

9.2 DPO Training

from trl import DPOTrainer, DPOConfig
from unsloth import FastLanguageModel, PatchDPOTrainer

# Apply DPO patch
PatchDPOTrainer()

# Prepare DPO dataset
dpo_dataset = load_dataset("argilla/ultrafeedback-binarized-preferences", split="train")

# Configure DPO Trainer
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,           # None in Unsloth (auto-handled)
    args=DPOConfig(
        output_dir="./dpo_output",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=5e-7,       # DPO uses lower LR
        num_train_epochs=1,
        beta=0.1,                 # DPO beta (KL divergence weight)
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
    ),
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)

dpo_trainer.train()

9.3 Continued Pre-training (Domain Adaptation)

# Continued Pre-training with domain-specific text
from trl import SFTTrainer

# Domain text data (medical, legal, financial, etc.)
domain_dataset = load_dataset("my-org/medical-korean-corpus", split="train")

# Use lower learning rate for Continued Pre-training
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=domain_dataset,
    dataset_text_field="text",
    max_seq_length=4096,     # Long documents
    packing=True,            # Use packing for efficiency
    args=TrainingArguments(
        output_dir="./cpt_output",
        learning_rate=2e-5,  # Very low learning rate
        num_train_epochs=1,  # 1 epoch is sufficient
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        optim="adamw_8bit",
        warmup_ratio=0.1,
    ),
)

trainer.train()

10. Common Issues and Solutions

10.1 OOM (Out of Memory) Errors

# Symptom: CUDA out of memory
# RuntimeError: CUDA out of memory.

# Solutions in order:
# 1. Reduce batch_size
per_device_train_batch_size = 1  # Minimum

# 2. Increase gradient_accumulation_steps
gradient_accumulation_steps = 8

# 3. Reduce max_seq_length
max_seq_length = 1024  # 2048 -> 1024

# 4. Reduce LoRA rank
r = 8  # 16 -> 8

# 5. Verify gradient checkpointing
use_gradient_checkpointing = "unsloth"

# 6. Clear cache
torch.cuda.empty_cache()
import gc
gc.collect()

10.2 NaN Loss

# Symptom: loss diverges to NaN
# Cause: Learning rate too high or data issues

# Solutions:
# 1. Lower learning rate
learning_rate = 1e-5  # 2e-4 -> 1e-5

# 2. Set max_grad_norm
max_grad_norm = 0.3  # Gradient clipping

# 3. Validate data
def check_data_issues(dataset, tokenizer):
    """Check for data problems"""
    issues = []
    for i, item in enumerate(dataset):
        text = item["text"]
        if not text.strip():
            issues.append(f"[{i}] Empty text")
        tokens = tokenizer.encode(text)
        if len(tokens) > 4096:
            issues.append(f"[{i}] Text too long: {len(tokens)} tokens")
        if not any(c.isalnum() for c in text):
            issues.append(f"[{i}] No valid text content")
    return issues

10.3 Catastrophic Forgetting

# Symptom: Existing knowledge disappears after fine-tuning
# Solutions:

# 1. Use lower learning rate
learning_rate = 5e-5

# 2. Fewer epochs (1-3)
num_train_epochs = 1

# 3. Mix general knowledge into training data (see the sketch below)
# Original data 80% + general knowledge data 20%

# 4. Lower LoRA rank (limits change magnitude)
r = 8

# 5. Stronger regularization
weight_decay = 0.1
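
For point 3, the mixing can be sketched with Hugging Face datasets' interleave_datasets, assuming both datasets have already been formatted to the same {"text": ...} schema; general_dataset here is a placeholder name for whatever general instruction data you mix in.

from datasets import interleave_datasets

# ~80% domain data, ~20% general instruction data
mixed_dataset = interleave_datasets(
    [domain_dataset, general_dataset],
    probabilities=[0.8, 0.2],
    seed=3407,
)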

10.4 Overfitting Detection

# Overfitting indicators:
# 1. Train loss decreases but eval loss increases
# 2. Model output nearly memorizes training data
# 3. Performance degrades on new prompts

# Solutions:
# 1. Increase data volume
# 2. Regularization (dropout, weight_decay)
# 3. Early stopping
from transformers import EarlyStoppingCallback

trainer = SFTTrainer(
    # ... model, tokenizer, train_dataset, eval_dataset as in section 5.1 ...
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    args=TrainingArguments(
        output_dir="./outputs",
        evaluation_strategy="steps",        # requires an eval_dataset to be passed
        eval_steps=50,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",  # checkpoint selection criterion
    ),
)

11. Quiz

Q1. With LoRA r=16, what percentage of parameters are trained compared to the original weights?

Answer: About 0.5% (99.5% savings)

For d=4096:

  • Full: 4096 x 4096 = 16,777,216
  • LoRA r=16: (4096 x 16) + (16 x 4096) = 131,072
  • Ratio: 131,072 / 16,777,216 = 0.78%

In practice, applying to multiple modules (q, k, v, o, gate, up, down) results in about 0.5% of total parameters.

Q2. Why is QLoRA's NF4 quantization better than standard INT4?

Answer: Optimal quantization leveraging the normal distribution characteristics of weights

NF4 exploits the fact that neural network weights generally follow a normal distribution. By placing 16 quantization values at the quantiles of the normal distribution, it achieves less information loss than uniformly-spaced INT4. Theoretically, it achieves near-optimal quantization for normally distributed data.

Q3. What is the core reason Unsloth is 2x faster than HuggingFace PEFT?

Answer: Custom Triton kernels

Unsloth replaces core operations like Attention, MLP, and Cross-Entropy Loss with custom GPU kernels written in Triton. These kernels optimize memory access patterns and reduce unnecessary intermediate tensor creation, achieving 2x faster training and 60% memory savings.

Q4. What is the principle and tradeoff of Gradient Checkpointing?

Answer:

Principle: Instead of storing intermediate activations in memory during the forward pass, they are recomputed on-demand during the backward pass.

Tradeoff:

  • Benefit: VRAM usage reduced by approximately 30-50%
  • Cost: Training time increases by approximately 20-30% due to recomputation

Unsloth's custom gradient checkpointing is more efficient than the standard PyTorch implementation, resulting in less time overhead.

Q5. What are the differences between GGUF Q4_K_M and Q8_0, and when should each be used?

Answer:

Q4_K_M (4-bit Mixed):

  • File size: About 28% of original (approximately 4.5GB for 8B models)
  • Quality: Slight performance degradation from original
  • Speed: Fast
  • Recommended for: Daily use, mobile/edge deployment, limited VRAM/RAM environments

Q8_0 (8-bit):

  • File size: About 50% of original (approximately 8GB for 8B models)
  • Quality: Very close to original
  • Speed: Slower than Q4
  • Recommended for: Quality-first use cases, environments with sufficient memory, services requiring accurate inference

12. References

  1. LoRA: Low-Rank Adaptation of Large Language Models - Hu et al., 2021
  2. QLoRA: Efficient Finetuning of Quantized LLMs - Dettmers et al., 2023
  3. Unsloth Documentation - github.com/unslothai/unsloth
  4. PEFT: Parameter-Efficient Fine-Tuning - HuggingFace
  5. TRL: Transformer Reinforcement Learning - HuggingFace
  6. Flash Attention 2 - Dao et al., 2023
  7. LLM.int8(): 8-bit Matrix Multiplication - Dettmers et al., 2022
  8. llama.cpp - github.com/ggerganov/llama.cpp
  9. GPTQ: Accurate Post-Training Quantization - Frantar et al., 2022
  10. DeepSpeed ZeRO - Rajbhandari et al., 2020
  11. Direct Preference Optimization - Rafailov et al., 2023
  12. Scaling Data-Constrained Language Models - Muennighoff et al., 2023
  13. Training Compute-Optimal Large Language Models (Chinchilla) - Hoffmann et al., 2022
  14. The Llama 3 Herd of Models - Meta AI, 2024