Fine-tuning in Practice: Building Your Own Model with LoRA and QLoRA

Why LoRA Instead of Full Fine-tuning?
How LoRA Works: Intuitive Explanation
QLoRA: Even Less Memory
Production Code: Fine-tuning Llama 3.1 8B End-to-End
Building Your Dataset: The Most Important Part
Unsloth: 2x Faster LoRA Training
Hyperparameter Tuning Guide
Common Mistakes and Fixes
Fine-tuning vs Prompt Engineering: When Do You Need Fine-tuning?
Conclusion

Why LoRA Instead of Full Fine-tuning?

The first wall you hit when exploring fine-tuning is VRAM requirements.

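Full Fine-tuning Llama 3.1 70B:
- VRAM required: ~560GB for FP32 weights and gradients alone; Adam optimizer states roughly double that
  -> At least 7x H100 80GB GPUs, before optimizer states and activations
- Training time: days
- Cloud cost: thousands of dollars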

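LoRA Fine-tuning Llama 3.1 70B:
- VRAM required: ~140GB or more with a 16-bit base model; with 4-bit QLoRA the weights alone take ~35GB
  -> QLoRA fits on a single 48GB-class GPU; a 24GB RTX 3090 or A10 is enough for 8B-class models
- Training time: hours on 1 GPU
- Cloud cost: $20-100 (A100 rental)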
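
The arithmetic behind these figures can be sketched in a few lines. This is a back-of-envelope estimate, assuming plain FP32 training with Adam (4 bytes each for the weight, its gradient, and the two optimizer moments) and ignoring activations; the function name `full_finetune_state_gb` is illustrative and not from any library.

```python
def full_finetune_state_gb(num_params: float, bytes_per_value: int = 4) -> dict:
    """Rough memory for full fine-tuning state, in GB (1 GB = 1e9 bytes).

    Assumes Adam, which keeps two moment estimates per parameter.
    Activations and framework overhead are NOT included.
    """
    gb = num_params * bytes_per_value / 1e9
    return {
        "weights": gb,
        "gradients": gb,
        "adam_moments": 2 * gb,
        "total": 4 * gb,
    }


estimate = full_finetune_state_gb(70e9)
for name, value in estimate.items():
    print(f"{name:>13}: {value:,.0f} GB")
```

With LoRA the base weights are frozen, so the gradient and optimizer terms apply only to the small adapter matrices.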

How is this gap possible? Understanding LoRA's core idea makes it clear.

How LoRA Works: Intuitive Explanation

Full fine-tuning updates every weight in the model. For Llama 3.1 70B, that's 70 billion parameters all changing. Storing and optimizing that requires enormous memory.

LoRA (Low-Rank Adaptation) takes a different approach:

Full fine-tuning:
W_new = W_original + delta_W
(delta_W has the same shape as each original matrix, so across the whole model that is 70B parameter updates)

LoRA: decompose delta_W into two small matrices
delta_W = A x B
  A: (d x r) matrix, B: (r x d) matrix
  r = rank (typically 4-64; smaller = more memory-efficient)

Example: d=4096, r=16
- Full delta_W: 4096 x 4096 = 16.7M parameters
- LoRA delta_W: A(4096x16) + B(16x4096) = 131K parameters
- 128x fewer trainable parameters for this matrix, often with comparable task quality

During training: freeze W_original, only train A and B
During inference: W_new = W_original + A x B (merge or keep separate)
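
To make the parameter count concrete, here is a small pure-Python sketch. It assumes the convention used above (A is d x r, B is r x d); `lora_param_count` and `matmul` are illustrative helpers, not library functions.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters in one LoRA pair: A is (d_out x r), B is (r x d_in)."""
    return d_out * r + r * d_in


def matmul(a, b):
    """Naive matrix product for small nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


d, r = 4096, 16
full = d * d
lora = lora_param_count(d, d, r)
print(full, lora, full // lora)

# Tiny demo with d=3, r=1: the update A x B is a full 3x3 matrix built from 6 numbers.
A = [[1.0], [2.0], [3.0]]      # 3 x 1
B = [[0.5, 0.0, -1.0]]         # 1 x 3
delta_W = matmul(A, B)         # 3 x 3, rank 1
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen weights
W_new = [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W, delta_W)]
print(W_new)
```

Only A and B would receive gradients during training; W stays untouched until the optional merge.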

The hypothesis behind LoRA, supported by experiments, is that the weight updates learned during adaptation have a low-rank structure, so you don't need to update every parameter to achieve the desired behavior change.

QLoRA: Even Less Memory

QLoRA = LoRA + 4-bit quantized base model

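Standard LoRA:
- Base model: FP16
- LoRA adapters: FP16
- VRAM for 70B model: ~140GB for the weights alone

QLoRA:
- Base model: 4-bit (NF4 quantization)
- LoRA adapters: BF16/FP16 (16-bit)
- VRAM for 70B model: ~35GB for the weights; plan for a 48GB-class GPU once adapters, optimizer state, and activations are included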

The quality loss from 4-bit quantization is surprisingly small, especially after fine-tuning compensates for it. QLoRA was introduced in a 2023 paper by Tim Dettmers et al. and effectively democratized LLM fine-tuning.
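
A quick way to sanity-check weight footprints is to multiply the parameter count by the bits per weight. The sketch below assumes exactly 4 bits per weight for NF4 and ignores quantization constants, adapters, optimizer state, and activations, so real usage is higher; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


for label, bits in [("FP32", 32), ("FP16/BF16", 16), ("NF4", 4)]:
    print(f"70B @ {label:<9}: {weight_memory_gb(70e9, bits):6.1f} GB")
    print(f" 8B @ {label:<9}: {weight_memory_gb(8e9, bits):6.1f} GB")
```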

Production Code: Fine-tuning Llama 3.1 8B End-to-End

Working code using the Hugging Face ecosystem.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset, Dataset

# ============================================================
# Step 1: Load model with 4-bit quantization (QLoRA)
# ============================================================
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 beats FP4 in quality
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16
    bnb_4bit_use_double_quant=True          # extra memory savings
)

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # trust_remote_code is not needed: Llama is built into transformers
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # common shortcut when no pad token is defined
tokenizer.padding_side = "right"  # right-padding is the usual choice for training

# ============================================================
# Step 2: Configure LoRA
# ============================================================
lora_config = LoraConfig(
    r=16,                   # rank: higher = more expressive, more VRAM
    lora_alpha=32,          # scaling factor (typically 2x rank)
    target_modules=[        # which layers to apply LoRA to
        "q_proj", "v_proj", "k_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"  # include MLP layers too
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

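# Added step: prepare the quantized base for training so gradient checkpointing works
# (enables input gradients and upcasts the remaining non-quantized layers)
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output (approximate): trainable params: 41,943,040 || all params: ~8.07B || trainable%: ~0.52
# Only about 0.5% of parameters are trained; the rest stay frozen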

# ============================================================
# Step 3: Prepare dataset
# ============================================================
raw_data = [
    {
        "instruction": "Classify the sentiment of the following customer review.",
        "input": "The shipping took 3 extra days with no communication. Very disappointing.",
        "output": "Negative (frustrated, disappointed)"
    },
    # ... more examples
]

def format_instruction(example):
    """Convert to Llama 3.1 instruction format"""
    # Append the optional input to the instruction, separated by a blank line
    user = example["instruction"]
    if example.get("input"):
        user += "\n\n" + example["input"]
    # <|begin_of_text|> is omitted on purpose: the tokenizer adds it during tokenization.
    # Llama 3.1 puts a blank line between each role header and its content.
    text = (
        "<|start_header_id|>system<|end_header_id|>\n\n"
        "You are a helpful AI assistant.<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{example['output']}<|eot_id|>"
    )
    return {"text": text}

dataset = Dataset.from_list(raw_data)
dataset = dataset.map(format_instruction)
train_test = dataset.train_test_split(test_size=0.1)  # needs enough examples for a non-empty eval split

# ============================================================
# Step 4: Train
# ============================================================
training_args = SFTConfig(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,     # effective batch = 4 x 4 = 16
    gradient_checkpointing=True,       # save VRAM (~20% speed tradeoff)
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,                         # matches bnb_4bit_compute_dtype above; needs a GPU with BF16 support
    logging_steps=10,
    eval_steps=50,
    save_steps=100,
    eval_strategy="steps",
    load_best_model_at_end=True,
    max_seq_length=2048,               # recent TRL releases rename this to max_length
    dataset_text_field="text",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_test["train"],
    eval_dataset=train_test["test"],
    tokenizer=tokenizer,               # recent TRL releases expect processing_class=tokenizer
)

trainer.train()

# ============================================================
# Step 5: Save and optionally merge
# ============================================================
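# Save only the LoRA adapter (small file)
model.save_pretrained("./lora-adapter")
tokenizer.save_pretrained("./lora-adapter")

# Optional: merge into base model for deployment.
# Merging into the 4-bit base is lossy, so reload the base in 16-bit
# (on CPU by default), attach the saved adapter, and merge that instead.
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
merged_model = PeftModel.from_pretrained(base_model, "./lora-adapter").merge_and_unload()
merged_model.save_pretrained("./merged-model")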

Building Your Dataset: The Most Important Part

Data matters more than code. As a rough rule of thumb, about 80% of fine-tuning failures trace back to data problems.

Quality vs Quantity

A lesson learned in practice: 1,000 high-quality examples often beat 100,000 noisy ones.

# Criteria for good training data
good_data_checklist = {
    "consistency": "Same question always answered in the same style",
    "diversity": "Covers all use cases you care about",
    "accuracy": "No incorrect information",
    "format": "Matches target model response style",
    "length": "Not too short, not unnecessarily long",
}

# Simple data quality check
def check_data_quality(examples):
    issues = []
    for i, ex in enumerate(examples):
        if len(ex["output"]) < 10:
            issues.append(f"Example {i}: response too short")
        if len(ex["output"]) > 2000:
            issues.append(f"Example {i}: response too long")
        if i > 0 and ex["output"] == examples[i-1]["output"]:
            issues.append(f"Example {i}: possible duplicate response")
    return issues
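
Beyond flagging issues, exact duplicates are usually worth dropping before training. A minimal sketch, assuming the same instruction/input/output dictionaries as above; `deduplicate` is an illustrative helper, and the whitespace and case normalization is one possible choice rather than a standard.

```python
def deduplicate(examples):
    """Drop examples whose normalized (instruction, input, output) was already seen."""
    seen = set()
    unique = []
    for ex in examples:
        # Lowercase and collapse whitespace so trivial variations count as duplicates
        key = tuple(
            " ".join(ex.get(field, "").lower().split())
            for field in ("instruction", "input", "output")
        )
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique


samples = [
    {"instruction": "Classify the sentiment.", "input": "Great service!", "output": "Positive"},
    {"instruction": "classify the  sentiment.", "input": "Great service!", "output": "Positive"},
    {"instruction": "Classify the sentiment.", "input": "Never again.", "output": "Negative"},
]
cleaned = deduplicate(samples)
print(len(samples), "->", len(cleaned))
```

Keying on the whole example rather than the output alone matters for classification data, where many different inputs legitimately share the same label.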

Data sources:

Manual authoring: most expensive, highest quality
GPT-4 generation + human review: balanced approach (check license terms!)
Production logs: reflects real usage, but needs cleaning
Public datasets + custom combination: efficient

Unsloth: 2x Faster LoRA Training

Unsloth optimizes LoRA training at the kernel level. The project reports roughly 2x faster training than standard PEFT and up to 70% less VRAM on the same hardware; actual gains depend on the model and sequence length.

from unsloth import FastLanguageModel
import torch

# Much faster than standard transformers + PEFT
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# LoRA config with Unsloth optimizations
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,     # Unsloth recommends 0
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's optimized checkpointing
    random_state=3407,
)

The speedup is especially noticeable on constrained GPUs. Check the official Unsloth GitHub for per-model optimal configurations.

Hyperparameter Tuning Guide

When results aren't what you expected, check these first:

Learning Rate:
- Too high: catastrophic forgetting (model loses existing capabilities)
- Too low: doesn't learn target behavior
- Recommended range: 1e-4 to 3e-4 (LoRA can use higher LR than full FT)
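
The training config earlier pairs this learning rate with a cosine schedule and a short warmup. The sketch below shows the shape of such a schedule in plain Python; it illustrates linear warmup followed by cosine decay to zero and is not the exact code the Trainer runs. `cosine_lr` is a hypothetical name.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 2e-4, warmup_ratio: float = 0.03) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000
for s in (0, 15, 30, 515, 1000):
    print(f"step {s:>4}: lr = {cosine_lr(s, total):.2e}")
```

The peak value is only held briefly, so the average learning rate over a run is well below the configured number.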

LoRA Rank (r):
- r=4: fast and lightweight, simple tasks
- r=8: good starting point for most use cases
- r=16: complex tasks, style learning
- r=64: the highest capacity of these options; more trainable parameters and VRAM, often with diminishing returns
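
How rank translates into trainable parameters can be worked out from the layer shapes. The sketch below assumes the published Llama 3.1 8B dimensions (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 layers) and LoRA on all seven projection matrices, as in the config earlier; `lora_trainable_params` is an illustrative helper.

```python
# (input width, output width) of each adapted projection in one decoder layer.
# Assumed shapes for Llama 3.1 8B; adjust for other models.
LLAMA_8B_PROJECTIONS = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32


def lora_trainable_params(r: int) -> int:
    """Total LoRA parameters: r * (d_in + d_out) per projection, summed over layers."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in LLAMA_8B_PROJECTIONS.values())
    return per_layer * NUM_LAYERS


for rank in (4, 8, 16, 64):
    print(f"r={rank:<2} -> {lora_trainable_params(rank):>11,} trainable parameters")
```

Trainable parameters grow linearly with r, so r=64 trains four times as many as r=16.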

Number of epochs:
- Small dataset (< 1,000 examples): 3-5 epochs
- Medium dataset (1,000-10,000 examples): 1-3 epochs
- Large dataset (> 10,000 examples): 1 epoch often sufficient

Effective batch size = per_device_batch x gradient_accumulation:
- Too small: unstable training
- Too large: fewer optimizer updates per epoch, which hurts on small datasets, and more memory per step
- Typically 16-32 recommended
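
Effective batch size, dataset size, and epochs together determine how many optimizer updates the model receives. A small sketch of that arithmetic, assuming a single GPU and that the last partial batch still counts as a step; `training_steps` is a hypothetical helper.

```python
import math


def training_steps(num_examples: int, per_device_batch: int, grad_accum: int, epochs: int) -> dict:
    """Optimizer steps for a single-GPU run (partial final batch counted as a step)."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }


plan = training_steps(num_examples=1000, per_device_batch=4, grad_accum=4, epochs=3)
print(plan)
```

On a dataset this small the whole run is under 200 updates, so settings such as eval_steps=50 yield only a few evaluations.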

Common Mistakes and Fixes

Mistake 1: Catastrophic Forgetting

After fine-tuning, the model loses general capabilities.

Fix: Mix 10-20% general-purpose examples into your training data (rehearsal mixing).
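
One way to build such a mix is sketched below. It assumes both datasets are plain Python lists and samples general-purpose examples so they make up a chosen fraction of the final set; `mix_with_general` and the `source` field are made up for this illustration.

```python
import random


def mix_with_general(task_examples, general_examples, general_fraction=0.15, seed=0):
    """Return a shuffled list where about `general_fraction` of items are general-purpose."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    # Solve n_general / (n_task + n_general) = general_fraction for n_general.
    n_general = round(len(task_examples) * general_fraction / (1 - general_fraction))
    n_general = min(n_general, len(general_examples))
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    mixed = list(task_examples) + rng.sample(list(general_examples), n_general)
    rng.shuffle(mixed)
    return mixed


task = [{"source": "task", "id": i} for i in range(850)]
general = [{"source": "general", "id": i} for i in range(5000)]
mixed = mix_with_general(task, general, general_fraction=0.15)
share = sum(ex["source"] == "general" for ex in mixed) / len(mixed)
print(len(mixed), f"{share:.2%}")
```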

Mistake 2: Overfitting

Training loss keeps decreasing but real-world performance degrades.

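# Track eval loss and reload the best checkpoint; pass transformers' EarlyStoppingCallback to the trainer to actually stop early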
training_args = SFTConfig(
    ...
    eval_strategy="steps",
    eval_steps=50,
    load_best_model_at_end=True,   # auto-select best checkpoint
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
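
Note that `load_best_model_at_end` only restores the best checkpoint; it does not halt training. The patience-based rule that early stopping adds can be sketched in plain Python. This illustrates the idea only and is not the Trainer's implementation; `should_stop` and the loss values are made up for the example.

```python
def should_stop(eval_losses, patience=3, min_delta=0.0):
    """True if the last `patience` evaluations all failed to beat the earlier best loss."""
    if len(eval_losses) <= patience:
        return False
    best_before = min(eval_losses[:-patience])
    return all(loss >= best_before - min_delta for loss in eval_losses[-patience:])


history = []
for step, loss in enumerate([1.20, 0.95, 0.81, 0.78, 0.80, 0.83, 0.85, 0.90]):
    history.append(loss)
    if should_stop(history, patience=3):
        print(f"stop after evaluation {step}; best loss {min(history):.2f}")
        break
```

A larger patience tolerates noisy evaluation curves at the cost of a few extra wasted steps.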

Mistake 3: Chat Template Mismatch

Each model expects a different chat format. Llama 3.1, Mistral, and Qwen all differ.

# Use tokenizer.apply_chat_template instead of manual formatting
messages = [
    {"role": "system", "content": "You are a professional translator."},
    {"role": "user", "content": "Translate this to French: Hello"},
    {"role": "assistant", "content": "Bonjour"}
]
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
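
To see why a mismatch matters, compare how two common formats lay out the same conversation. The renderers below are simplified, hand-written approximations of the Llama 3 header style and the ChatML style used by Qwen, written only for illustration (they omit the begin-of-text token and any default system text); for real training data, rely on the tokenizer's own template.

```python
def render_llama3_style(messages):
    """Simplified Llama 3 layout: role header, blank line, content, end-of-turn token."""
    return "".join(
        f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        for m in messages
    )


def render_chatml_style(messages):
    """Simplified ChatML layout: start marker with role, content, end marker."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    )


messages = [
    {"role": "user", "content": "Translate this to French: Hello"},
    {"role": "assistant", "content": "Bonjour"},
]
llama_text = render_llama3_style(messages)
chatml_text = render_chatml_style(messages)
print(llama_text)
print(chatml_text)
```

The two strings share no control tokens, so a model trained on one layout sees the other as unfamiliar text rather than as a conversation.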

Fine-tuning vs Prompt Engineering: When Do You Need Fine-tuning?

Fine-tuning isn't always the answer. Consider it when:

Fine-tuning is justified:

Injecting proprietary domain knowledge (medical, legal, internal docs)
Consistent response format/style (always JSON, specific tone)
Behavior changes that prompting can't achieve reliably
Cost reduction (a small fine-tuned model can match or beat a much larger general model such as GPT-4 on a narrow task, at lower cost)

Prompt engineering is sufficient:

Standard tasks (summarization, translation, Q&A)
Rapid prototyping
When you don't have enough training data

General advice: push prompt engineering as far as it'll go first. Only reach for fine-tuning when you've hit its limits.

Conclusion

LoRA and QLoRA have transformed LLM fine-tuning from a large-company privilege into a tool any developer can use.

Key takeaways:

QLoRA: 70B models fine-tunable on a single 48GB-class GPU; 8B models fit in 24GB
Data quality: matters more than code. 1,000 good examples is enough to start
Unsloth: 2x faster training on the same hardware
Start small: validate with an 8B model, then scale to 70B

There's something satisfying about watching a model you fine-tuned outperform GPT-4 on your specific task. Worth experiencing firsthand.