LLM Fine-tuning Practical Guide: Efficient Model Adaptation with LoRA, QLoRA, and PEFT

Introduction
The Shifting Fine-tuning Paradigm
- Limitations of Full Fine-tuning
- Classification of PEFT Methods
LoRA: Mathematical Principles and Implementation
QLoRA: 4-bit Quantization
Working with the PEFT Library
- Hugging Face PEFT Library Overview
- Saving and Loading Adapters
Dataset Preparation Strategies
- Instruction Tuning Data Format
- Data Quality Checklist
Hyperparameter Tuning
- Key Hyperparameter Guide
- Training Monitoring
Troubleshooting
Production Checklist
References

Introduction

Fine-tuning pre-trained Large Language Models (LLMs) for specific domains and tasks is a core technique in applying LLMs. However, fully fine-tuning a model with billions of parameters requires enormous GPU memory and compute. For GPT-3 175B, full fine-tuning with the Adam optimizer requires approximately 1.2TB of GPU memory, which is impractical for most organizations.

Parameter-Efficient Fine-Tuning (PEFT) techniques emerged to solve this problem. In particular, LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) reduce the number of trainable parameters to well under 1% of the original model (below 0.1% at low ranks) while achieving performance close to full fine-tuning. This guide covers these efficient fine-tuning methods from their theoretical foundations through production-level implementation.

The Shifting Fine-tuning Paradigm

Limitations of Full Fine-tuning

Traditional fine-tuning updates all parameters of a pre-trained model. This approach carries fundamental challenges:

Memory cost: Model weights + gradients + optimizer states must all reside in GPU memory
Storage cost: A complete model copy must be saved per task. Using a 70B model across 10 tasks requires roughly 1.4TB of storage
Catastrophic forgetting: Overfitting on small datasets causes the model to lose general knowledge acquired during pre-training
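
The memory and storage costs above can be sanity-checked with simple arithmetic. The sketch below assumes a common mixed-precision Adam accounting of 16 bytes per parameter (this breakdown is an assumption, not a figure from this guide) and ignores activations, so real usage is higher. The function names are hypothetical.

```python
# Rough accounting, not a profiler. Assumption: mixed-precision training with Adam.
BYTES_PER_PARAM_FULL_FT = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}


def full_ft_state_gb(n_params: int) -> float:
    """Weights + gradients + optimizer state in GB (activations excluded)."""
    return n_params * sum(BYTES_PER_PARAM_FULL_FT.values()) / 1e9


def checkpoint_storage_tb(n_params: int, n_tasks: int, bytes_per_param: int = 2) -> float:
    """Storage for one full FP16 model copy per task, in TB."""
    return n_params * bytes_per_param * n_tasks / 1e12


print(full_ft_state_gb(7_000_000_000))            # 112.0
print(checkpoint_storage_tb(70_000_000_000, 10))  # 1.4
```

Under these assumptions a 7B model needs about 112GB before activations are counted, and ten FP16 copies of a 70B model come to 1.4TB, matching the storage figure above.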

Classification of PEFT Methods

Parameter-Efficient Fine-Tuning methods fall into three main categories:

Method	Representative	Principle	Trainable Param %	GPU Memory (7B)	Perf vs Full FT
Full Fine-tuning	-	Update all params	100%	~120GB	Baseline
Additive (Adapter)	Adapter, Prefix Tuning	Insert small modules	0.5-3%	~30GB	95-98%
Reparameterization	LoRA, QLoRA	Low-rank matrix decomposition	0.01-0.5%	~16-28GB	97-100%
Selective	BitFit, Diff Pruning	Train only selected params	0.05-1%	~25GB	90-95%

LoRA: Mathematical Principles and Implementation

Core Idea of Low-Rank Decomposition

LoRA (Low-Rank Adaptation), proposed by Hu et al. (2021), is based on the key insight that weight updates during fine-tuning can be approximated as the product of low-rank matrices.

For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, the update $\Delta W$ is decomposed into two low-rank matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, so that $\Delta W = BA$, where the rank $r$ is much smaller than either $d$ or $k$.

During the forward pass, the output is computed as $h = W_0 x + BAx$. During training, $W_0$ is frozen and only $B$ and $A$ are learned. The number of trainable parameters drops from $d \cdot k$ to $r \cdot (d + k)$.
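
To make the decomposition concrete, here is a minimal pure-Python sketch of the forward pass and the parameter count. It is an illustration, not the PEFT implementation: the helper names are hypothetical, the alpha/r scaling is left out to match the formula above, and B is initialized to zero (as in the LoRA paper) so that the adapted layer starts out identical to the frozen one.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W0, B, A, x):
    """h = W0 x + B (A x); W0 is frozen, only B and A would be trained."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]


def param_counts(d, k, r):
    """Return (full update size, LoRA update size)."""
    return d * k, r * (d + k)


rng = random.Random(0)
d, k, r = 6, 5, 2
W0 = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]
A = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # small random init
B = [[0.0] * r for _ in range(d)]                               # zero init
x = [rng.gauss(0, 1) for _ in range(k)]

# With B = 0 the adapter contributes nothing, so the output equals W0 x.
assert lora_forward(W0, B, A, x) == matvec(W0, x)

full, lora = param_counts(4096, 4096, 16)
print(full, lora)  # 16777216 131072
```

For a single 4096 x 4096 projection at r=16, the trainable update shrinks from about 16.8M values to about 131K, which is under 1% of the original matrix.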

LoRA Implementation

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load base model
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                          # Rank: typically between 8-64
    lora_alpha=32,                 # Scaling factor: usually 2x rank
    lora_dropout=0.05,             # Dropout: prevents overfitting
    target_modules=[               # Modules to apply LoRA to
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",                   # Whether to train bias
)

# Create PEFT model
peft_model = get_peft_model(model, lora_config)

# Check trainable parameters
peft_model.print_trainable_parameters()
# Approximate output for this configuration (r=16 on all seven projections):
# trainable params: 39,976,960 || all params: 6,778,392,576 || trainable%: 0.5898
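
The count reported by print_trainable_parameters() can be cross-checked by hand using r * (d + k) per adapted layer. The sketch below assumes the Llama-2-7B shapes (hidden size 4096, MLP intermediate size 11008, 32 layers); these constants and the helper name are assumptions for illustration, not values read from the model.

```python
# Assumed Llama-2-7B shapes (constants for this sketch only)
HIDDEN, INTERMEDIATE, LAYERS = 4096, 11008, 32

# (in_features, out_features) of each linear layer that LoRA can target
PROJ_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(r, target_modules):
    """Sum r * (in + out) over the targeted layers, times the number of blocks."""
    per_layer = sum(r * sum(PROJ_SHAPES[name]) for name in target_modules)
    return per_layer * LAYERS


print(lora_trainable_params(16, list(PROJ_SHAPES)))     # 39976960
print(lora_trainable_params(64, ["q_proj", "v_proj"]))  # 33554432
```

The second line shows how strongly the count depends on the choice of target modules: targeting only the query and value projections at r=64 yields fewer trainable parameters than targeting all seven projections at r=16.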

Rank (r) Selection Guide

The rank r is the most critical LoRA hyperparameter:

r=4-8: Suitable for simple classification tasks, sentiment analysis. Use when minimizing memory is the priority
r=16-32: Recommended range for general instruction tuning and conversational models
r=64-128: For complex domain adaptation (medical, legal) or large-scale datasets

The alpha value is typically set to 2x the rank. Since the effective scaling factor is alpha/r, alpha=32 with r=16 yields a scaling of 2.
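
A short sketch of why alpha is usually tied to the rank: if alpha is held fixed while r grows, the effective scaling alpha/r shrinks, whereas alpha = 2r keeps it constant. The function name is hypothetical.

```python
def lora_scaling(alpha: float, r: int) -> float:
    """Effective multiplier applied to the low-rank update."""
    return alpha / r


for r in (8, 16, 32, 64):
    fixed_alpha = lora_scaling(32, r)     # alpha stays at 32 while r changes
    tied_alpha = lora_scaling(2 * r, r)   # alpha = 2 * r
    print(f"r={r:<3} fixed alpha=32 -> {fixed_alpha}   alpha=2r -> {tied_alpha}")
```

With alpha fixed at 32, the scaling falls from 4.0 at r=8 to 0.5 at r=64, so changing the rank alone also changes the update magnitude unless alpha or the learning rate is adjusted with it.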

QLoRA: 4-bit Quantization

The QLoRA Innovation

QLoRA (Dettmers et al., 2023) combines LoRA with 4-bit quantization to dramatically reduce memory usage. It makes it possible to fine-tune a 65B-parameter model on a single 48GB GPU and introduces three key techniques:

4-bit NormalFloat (NF4): An information-theoretically optimal data type for normally distributed weights
Double Quantization: Re-quantizes the quantization constants, saving an additional 0.37 bits per parameter on average
Paged Optimizers: Automatically pages optimizer states to CPU RAM during GPU memory spikes
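
The 0.37-bit figure for Double Quantization follows from the storage cost of the quantization constants. The sketch below assumes the block sizes described in the QLoRA paper (one FP32 constant per 64 weights, then 8-bit constants with one FP32 second-level constant per 256 first-level constants); the function names are hypothetical.

```python
def constant_overhead_bits(block_size=64, constant_bits=32):
    """Bits per parameter spent on one quantization constant per block."""
    return constant_bits / block_size


def double_quant_overhead_bits(block_size=64, first_level_bits=8,
                               second_block_size=256, second_level_bits=32):
    """Overhead after the constants themselves are quantized."""
    first = first_level_bits / block_size
    second = second_level_bits / (block_size * second_block_size)
    return first + second


plain = constant_overhead_bits()        # 0.5 bits per parameter
double = double_quant_overhead_bits()   # about 0.127 bits per parameter
saved = plain - double

print(round(saved, 3))                    # 0.373
print(round(saved * 65e9 / 8 / 1e9, 2))   # 3.03 (GB saved on a 65B model)
```

The saving looks small per parameter, but on a 65B model it amounts to roughly 3GB, which matters when the target is a single 48GB GPU.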

QLoRA Training Script

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16, # Use bfloat16 for computation
    bnb_4bit_use_double_quant=True,       # Enable Double Quantization
)

# Load 4-bit quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare model for k-bit training (gradient checkpointing, etc.)
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)

# Training arguments
training_args = TrainingArguments(
    output_dir="./qlora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    fp16=False,
    bf16=True,
    optim="paged_adamw_8bit",            # Use Paged Optimizer
    gradient_checkpointing=True,
    max_grad_norm=0.3,
    report_to="wandb",
)

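# Train with SFTTrainer.
# `train_dataset` is not defined in this script. It is assumed to be a
# datasets.Dataset with a "text" column, such as `formatted` from the
# Dataset Preparation section.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

# Note: these keyword arguments match older TRL releases. Newer releases move
# the sequence-length setting into SFTConfig and accept the tokenizer as
# `processing_class`, so check the TRL version you have installed.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_seq_length=2048,
)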

trainer.train()

Memory Usage Comparison

QLoRA cuts memory use dramatically. The figures below are rough estimates that vary with sequence length, batch size, and LoRA rank:

Model Size	Full FT (FP16)	LoRA (FP16)	QLoRA (NF4)
7B	~120GB	~28GB	~6GB
13B	~220GB	~52GB	~10GB
70B	~1.2TB	~280GB	~48GB
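
Most of the gap between the LoRA and QLoRA columns comes from how the frozen base weights are stored. The sketch below estimates the weights-only footprint, assuming about 0.127 extra bits per parameter for quantization constants under Double Quantization; adapters, optimizer state, and activations are not included, which is why the table values are higher. The names are hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed to hold the base weights alone, in GB."""
    return n_params * bits_per_param / 8 / 1e9


FP16_BITS = 16
NF4_BITS = 4 + 0.127  # assumption: 4-bit weights plus quantization-constant overhead

for name, n_params in (("7B", 7e9), ("13B", 13e9), ("70B", 70e9)):
    fp16 = round(weight_memory_gb(n_params, FP16_BITS), 1)
    nf4 = round(weight_memory_gb(n_params, NF4_BITS), 1)
    print(f"{name}: FP16 weights {fp16} GB, NF4 weights {nf4} GB")
# 7B: FP16 weights 14.0 GB, NF4 weights 3.6 GB
# 13B: FP16 weights 26.0 GB, NF4 weights 6.7 GB
# 70B: FP16 weights 140.0 GB, NF4 weights 36.1 GB
```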

Working with the PEFT Library

Hugging Face PEFT Library Overview

The Hugging Face PEFT library provides a unified interface for various parameter-efficient fine-tuning methods. It integrates tightly with Transformers, Accelerate, and TRL, allowing minimal code changes to existing workflows.

# Install PEFT
# pip install peft transformers accelerate bitsandbytes trl

# Use different PEFT methods through the same interface
from peft import (
    LoraConfig,
    PrefixTuningConfig,
    PromptTuningConfig,
    IA3Config,
    get_peft_model,
)

# LoRA
lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

# Prefix Tuning
prefix_config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,
)

# Prompt Tuning
prompt_config = PromptTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,
    prompt_tuning_init="TEXT",
    prompt_tuning_init_text="Classify the following text:",
    tokenizer_name_or_path="meta-llama/Llama-2-7b-hf",
)

# IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations)
ia3_config = IA3Config(
    task_type="CAUSAL_LM",
    target_modules=["k_proj", "v_proj", "down_proj"],
    feedforward_modules=["down_proj"],
)

Saving and Loading Adapters

A major advantage of PEFT is that adapters can be saved and loaded separately from the base model. A LoRA adapter for a 7B model is only about 30-100MB, depending on the rank and the target modules.

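import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Assumes `peft_model` and `tokenizer` from the LoRA example above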

# Save adapter (~30-100MB)
peft_model.save_pretrained("./my-lora-adapter")

# Load adapter: combine base model + adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")

# Inference optimization: merge LoRA weights into base model
model = model.merge_and_unload()

# Save merged model (no overhead during inference)
model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")
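
Conceptually, merging folds the scaled low-rank update into the base weight so that inference needs only one matrix multiply. The toy sketch below illustrates that idea with plain lists, assuming the merged weight is W0 + (alpha / r) * B A; it is not the PEFT implementation, and the helper names are hypothetical.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def merge_lora(W0, B, A, alpha, r):
    """Return W0 + (alpha / r) * B @ A as a single dense matrix."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(W0, BA)]


W0 = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]  # d=2, k=3
B = [[1.0], [2.0]]                       # d x r with r=1
A = [[0.5, 0.0, 1.0]]                    # r x k
alpha, r = 2, 1
x = [[1.0], [2.0], [3.0]]                # input as a column vector

merged = merge_lora(W0, B, A, alpha, r)

# Unmerged path: base output plus the scaled adapter output
adapter_out = matmul(B, matmul(A, x))
unmerged = [[w[0] + (alpha / r) * u[0]] for w, u in zip(matmul(W0, x), adapter_out)]

assert matmul(merged, x) == unmerged
print(merged)  # [[2.0, 2.0, 2.0], [2.0, 1.0, 7.0]]
```

Because the merged matrix has the same shape as W0, the merged model has the same inference cost as the original base model.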

Dataset Preparation Strategies

Instruction Tuning Data Format

In instruction tuning, data quality is one of the most important factors determining model performance. The following format is commonly used:

from datasets import Dataset

# Alpaca-format dataset construction
def format_instruction(sample):
    """Alpaca-style prompt template"""
    if sample.get("input"):
        return f"""### Instruction:
{sample['instruction']}

### Input:
{sample['input']}

### Response:
{sample['output']}"""
    else:
        return f"""### Instruction:
{sample['instruction']}

### Response:
{sample['output']}"""

# Dataset example
raw_data = [
    {
        "instruction": "Analyze the sentiment of the following text.",
        "input": "This product is amazing! Fast shipping and top quality!",
        "output": "Positive sentiment. The text expresses satisfaction with product quality and shipping speed.",
    },
    {
        "instruction": "Optimize the given SQL query.",
        "input": "SELECT * FROM users WHERE created_at > '2024-01-01' ORDER BY name",
        "output": "SELECT id, name, email FROM users WHERE created_at > '2024-01-01' ORDER BY name LIMIT 100;\n\nOptimization points:\n1. Changed SELECT * to select only needed columns\n2. Added LIMIT to restrict result set\n3. Recommend creating a composite index on created_at and name",
    },
]

dataset = Dataset.from_list(raw_data)
formatted = dataset.map(lambda x: {"text": format_instruction(x)})

Data Quality Checklist

Key principles for building high-quality fine-tuning datasets:

Ensure diversity: Balance task types, difficulty levels, and domains to avoid pattern bias
Quality verification: Have at least two reviewers cross-check each sample, and supplement this with LLM-based automated quality assessment
Appropriate scale: 1,000-10,000 high-quality samples are typically more effective than 100,000 low-quality ones
Format consistency: Maintain consistent instruction, input, output structure across the entire dataset
Remove harmful content: Pre-filter samples containing bias, toxic language, or personal information
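
Some of these checks are easy to automate before any human review. The sketch below covers only format consistency and exact duplicates, assuming Alpaca-style dictionaries with instruction, input, and output keys; the function names are hypothetical, and diversity or harmful-content checks would need separate tooling.

```python
REQUIRED_KEYS = ("instruction", "output")


def _text(sample, key):
    """Return a stripped string for the key, treating missing or None as empty."""
    return (sample.get(key) or "").strip()


def validate_samples(samples):
    """Split samples into kept ones and (index, reason) pairs for rejected ones."""
    seen, kept, rejected = set(), [], []
    for i, sample in enumerate(samples):
        missing = [k for k in REQUIRED_KEYS if not _text(sample, k)]
        key = (_text(sample, "instruction"), _text(sample, "input"))
        if missing:
            rejected.append((i, f"empty field(s): {missing}"))
        elif key in seen:
            rejected.append((i, "duplicate instruction/input pair"))
        else:
            seen.add(key)
            kept.append(sample)
    return kept, rejected


samples = [
    {"instruction": "Summarize the text.", "input": "LoRA freezes W0.", "output": "LoRA trains small adapters."},
    {"instruction": "Summarize the text.", "input": "LoRA freezes W0.", "output": "A second answer."},
    {"instruction": "Translate to French.", "input": "Hello", "output": "  "},
]
kept, rejected = validate_samples(samples)
print(len(kept), rejected)
```

Here the second sample is rejected as a duplicate prompt and the third for its blank output, leaving one sample for review.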

Hyperparameter Tuning

Key Hyperparameter Guide

Fine-tuning performance is sensitive to hyperparameter settings. The ranges below are commonly used starting points rather than guaranteed optima:

Parameter	Recommended Range	Description
Learning Rate	1e-4 to 3e-4	2e-4 is the typical starting point for QLoRA
Batch Size (effective)	32-128	Adjust via gradient accumulation
Epochs	1-5	Scale with data size: 3-5 for small, 1-2 for large
Warmup Ratio	0.03-0.1	3-10% of total steps
Weight Decay	0.01-0.1	Regularization to curb overfitting (decoupled weight decay under AdamW)
Max Grad Norm	0.3-1.0	Gradient clipping threshold
LR Scheduler	cosine	Cosine decay with warmup is a common, stable default
LoRA r	8-64	Increase proportionally to task complexity
LoRA alpha	2 * r	Scaling factor
LoRA dropout	0.05-0.1	Prevents overfitting
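
The effective batch size and warmup length in the table are derived quantities, so it helps to compute them from the training arguments. The sketch below is an approximation with hypothetical function names; the Trainer's exact rounding of steps may differ slightly.

```python
import math


def effective_batch_size(per_device, grad_accum, n_devices=1):
    """Samples consumed per optimizer step."""
    return per_device * grad_accum * n_devices


def schedule_summary(n_samples, per_device, grad_accum, epochs, warmup_ratio, n_devices=1):
    """Approximate step counts for a run."""
    eff = effective_batch_size(per_device, grad_accum, n_devices)
    total_steps = math.ceil(n_samples / eff) * epochs
    return {
        "effective_batch_size": eff,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
    }


# Values from the QLoRA script in this guide, with an assumed 10,000-sample dataset
print(schedule_summary(10_000, per_device=4, grad_accum=4, epochs=3, warmup_ratio=0.03))
# {'effective_batch_size': 16, 'total_steps': 1875, 'warmup_steps': 57}
```

On a single GPU, the script's settings (batch size 4 with 4 accumulation steps) give an effective batch size of 16, below the 32-128 range in the table. Raising gradient_accumulation_steps to 8 would reach 32 without using more memory.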

Training Monitoring

# Training monitoring with Weights and Biases
import wandb

wandb.init(
    project="llm-finetuning",
    config={
        "model": "Llama-2-7b",
        "method": "QLoRA",
        "r": 16,
        "alpha": 32,
        "lr": 2e-4,
        "epochs": 3,
    },
)

# Key metrics to monitor:
# 1. Training Loss: Should decrease steadily. A collapse toward zero suggests memorization; confirm against validation loss
# 2. Validation Loss: Growing gap with training loss indicates overfitting
# 3. Learning Rate: Verify the scheduler is behaving as intended
# 4. Gradient Norm: Sudden spikes indicate training instability
# 5. GPU Memory: Track usage to prevent OOM errors
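
To check that the logged learning rate follows the intended schedule, it helps to know what values to expect. The sketch below reimplements a linear warmup followed by cosine decay in plain Python, which is the usual shape of a cosine schedule with warmup. The function name and the example step counts (1,875 total steps, 57 warmup steps) are assumptions for illustration.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 57, 966, 1875):
    print(step, f"{lr_at_step(step, 1875, 57, 2e-4):.2e}")
# 0 0.00e+00
# 57 2.00e-04
# 966 1.00e-04
# 1875 0.00e+00
```

If the logged curve does not peak at the end of warmup or does not reach half the peak midway through the decay phase, the scheduler or step count is probably misconfigured.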

Troubleshooting

Catastrophic Forgetting

The model loses basic general knowledge after fine-tuning.

Cause: Over-adapting to small domain data corrupts pre-trained representations
Solution 1: Lower the LoRA rank to restrict update scope (r=8 or below)
Solution 2: Lower the learning rate to around 1e-5 and train for fewer epochs
Solution 3: Mix 10-20% general knowledge data into the training set
Solution 4: Increase L2 regularization (weight_decay)
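
Mixing in general-purpose data (Solution 3) comes down to picking the right number of general samples for a target share of the final mix. The following sketch assumes both datasets are plain Python lists; the function name and the tuple-based sample format are hypothetical.

```python
import random


def mix_with_general(domain, general, general_fraction=0.2, seed=0):
    """Return a shuffled mix in which about `general_fraction` of samples are general."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    # Solve n_general / (n_domain + n_general) = general_fraction for n_general
    n_general = round(len(domain) * general_fraction / (1 - general_fraction))
    n_general = min(n_general, len(general))
    rng = random.Random(seed)
    mixed = list(domain) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)
    return mixed


domain = [("domain", i) for i in range(1000)]
general = [("general", i) for i in range(5000)]
mixed = mix_with_general(domain, general, general_fraction=0.2)

n_general = sum(1 for source, _ in mixed if source == "general")
print(len(mixed), n_general)  # 1250 250
```

Note that a 20% share requires 250 general samples for 1,000 domain samples, not 200, because the fraction is measured against the combined dataset.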

Overfitting on Small Datasets

Overfitting frequently occurs with fewer than 1,000 samples.

Symptom: Training loss converges to 0 while validation loss increases
Solution 1: Data augmentation -- use LLM paraphrasing to expand data 2-3x
Solution 2: Increase LoRA dropout to 0.1-0.2 and set weight decay above 0.05
Solution 3: Reduce epochs to 1-2 and apply early stopping
Solution 4: Use a smaller base model (7B instead of 70B)

Quantization Quality Degradation

When using QLoRA, information loss from quantization can impact performance.

Symptom: Performance drops 2-5% or more compared to LoRA (FP16) with identical settings
Solution 1: Set compute_dtype to bfloat16 (more stable than float16)
Solution 2: Increase LoRA rank to compensate for expressiveness (r=32-64)
Solution 3: For serving, reload the base model in FP16/BF16, attach the trained adapter, and call merge_and_unload so that inference runs on unquantized weights
Solution 4: Consider improved quantized fine-tuning methods like IR-QLoRA or Q-BLoRA

Production Checklist

An end-to-end checklist for production-grade LLM fine-tuning:

Before Training

Base model selection: Review task characteristics, language, license, and model size
Data pipeline: Collection, cleaning, formatting, quality validation, train/val/test split
Environment setup: Verify GPU specs, library version compatibility, CUDA version
Baseline measurement: Record task performance of the base model before fine-tuning

During Training

Monitoring: Real-time tracking of loss curves, gradient norms, GPU memory
Checkpointing: Save model at regular intervals, manage best model by validation loss
Early stopping: Halt if validation loss shows no improvement for 3-5 consecutive evaluations
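
The early-stopping rule can be expressed as a small patience counter over the sequence of validation losses. This is a minimal sketch with a hypothetical function name; it counts evaluation rounds rather than optimizer steps.

```python
def early_stop_index(val_losses, patience=3, min_delta=0.0):
    """Return the index of the evaluation at which training would stop, or None."""
    best = float("inf")
    rounds_without_improvement = 0
    for i, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                return i
    return None


losses = [1.20, 1.05, 0.98, 0.99, 1.01, 1.03, 1.10]
print(early_stop_index(losses, patience=3))  # 5
```

In this example the best loss (0.98) occurs at evaluation 2, and training stops at evaluation 5 after three rounds without improvement, so the checkpoint from evaluation 2 is the one to keep.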

After Training

Quantitative evaluation: Measure task-specific benchmark scores (BLEU, ROUGE, accuracy, etc.)
Qualitative evaluation: Manually review output quality across diverse inputs
General capability check: Verify no catastrophic forgetting has occurred
Adapter merging: Optimize for serving with merge_and_unload
A/B testing: Compare performance against existing models in real usage environments