Complete Guide to LLM Fine-tuning with Unsloth 2025: QLoRA, 4-bit Quantization, 2x Faster Training

Introduction: Why Unsloth?

The biggest barrier to LLM fine-tuning is GPU memory (VRAM). Full fine-tuning of Llama 3.1 8B requires about 60GB VRAM, which is tight even on a single A100 80GB. QLoRA solved this problem, but training speed remained slow.

Unsloth solves both problems simultaneously:

| Comparison | HuggingFace PEFT | Axolotl | Unsloth |
|---|---|---|---|
| Training Speed | 1x (baseline) | 1.1x | 2x |
| Memory Usage | 100% | 95% | 40% |
| Setup Difficulty | Medium | High | Low |
| Model Support | All | All | Major models |
| Flash Attention | Separate install | Built-in | Built-in |
| Custom Kernels | None | None | Triton kernels |

The secret behind Unsloth is custom Triton kernels. Core operations like Attention, MLP, and Cross-Entropy Loss are replaced with GPU-optimized custom kernels, achieving 2x faster training and 60% memory savings.

Supported Models (as of 2025):

  • Llama 3 / 3.1 / 3.2 (8B, 70B)
  • Mistral / Mixtral
  • Phi-3 / Phi-3.5
  • Qwen 2 / 2.5
  • Gemma 2
  • Yi
  • DeepSeek V2

1. LoRA/QLoRA Theory

1.1 Full Fine-tuning vs LoRA vs QLoRA

Full Fine-tuning (update all parameters)
+------------------------+
|   W (d x d)            |  <- Update entire weights
|   e.g.: 4096 x 4096    |     = 16.7M parameters
|   = 32MB (FP16)        |
+------------------------+

LoRA (Low-Rank Adaptation)
+------------------------+
|   W0 (frozen) + B * A  |
|   W0: 4096 x 4096      |  <- Frozen (no updates)
|   B: 4096 x 16         |  <- Trainable (65K params)
|   A: 16 x 4096         |  <- Trainable (65K params)
|   = 0.25MB (FP16)      |     Total 131K params
+------------------------+

QLoRA (Quantized LoRA)
+------------------------+
|   W0 (4bit) + B * A   |
|   W0: 4096 x 4096     |  <- 4-bit quantized (8MB)
|   B: 4096 x 16         |  <- FP16 trainable
|   A: 16 x 4096         |  <- FP16 trainable
|   = 8.25MB total        |
+------------------------+
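The figures in these diagrams follow from simple bytes-per-parameter arithmetic (2 bytes per FP16 value, half a byte per 4-bit value); a quick check:

```python
# Memory of one 4096x4096 weight matrix under the three schemes above
d, r = 4096, 16
fp16_bytes, fp4_bytes = 2, 0.5                  # bytes per parameter
full_mb = d * d * fp16_bytes / 2**20            # full FP16 weights
lora_mb = (d * r + r * d) * fp16_bytes / 2**20  # B and A adapters in FP16
qlora_mb = d * d * fp4_bytes / 2**20 + lora_mb  # 4-bit base + FP16 adapters
print(f"Full: {full_mb:.0f} MB, LoRA: {lora_mb:.2f} MB, QLoRA: {qlora_mb:.2f} MB")
# Full: 32 MB, LoRA: 0.25 MB, QLoRA: 8.25 MB
```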

1.2 Low-Rank Decomposition Principle

The core idea of LoRA is based on the observation that weight update matrices are actually low-rank.

The original weight update:

W_new = W_old + delta_W

LoRA decomposes delta_W into a product of two small matrices:

delta_W = B * A
where:
  B is a d x r matrix (d=model dimension, r=LoRA rank)
  A is a r x d matrix
  r << d (e.g., r=16, d=4096)

Parameter savings:

# Full Fine-tuning parameters
d = 4096
full_params = d * d  # = 16,777,216 (16.7M)

# LoRA parameters
r = 16
lora_params = d * r + r * d  # = 131,072 (131K)

# Savings ratio
savings = 1 - (lora_params / full_params)
print(f"Parameter savings: {savings:.2%}")  # 99.22%
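The forward pass applies the frozen weight and the low-rank update side by side. A minimal numpy sketch (with a smaller d than 4096 so it runs instantly; PEFT/Unsloth do this inside the layer):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 512, 16, 16                                  # reduced d for the demo
x = rng.normal(size=(1, d)).astype(np.float32)
W0 = rng.normal(0, 0.02, size=(d, d)).astype(np.float32)   # frozen base weight
A = rng.normal(0, 0.02, size=(r, d)).astype(np.float32)    # trainable
B = np.zeros((d, r), dtype=np.float32)                     # trainable, init to 0

# LoRA forward for row vectors: h = x W0^T + (alpha/r) * (x A^T) B^T
h = x @ W0.T + (alpha / r) * (x @ A.T) @ B.T

# Because B starts at zero, delta_W = B * A is zero at initialization,
# so training begins exactly at the pretrained model's behavior.
assert np.allclose(h, x @ W0.T)
```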

1.3 4-bit NormalFloat Quantization (NF4)

NF4 quantization used in QLoRA differs from standard 4-bit:

Standard 4-bit INT quantization:

  • Uniformly divides into 16 intervals
  • Does not consider value distribution

NF4 (NormalFloat4):

  • Leverages the fact that weights follow a normal distribution
  • Sets 16 values aligned with normal distribution quantiles
  • Near-optimal quantization from an information theory perspective

# NF4 quantization values (based on normal distribution quantiles)
nf4_values = [
    -1.0, -0.6962, -0.5251, -0.3949,
    -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379,
    0.4407, 0.5626, 0.7230, 1.0,
]
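A minimal sketch of blockwise absmax quantization using these levels (illustration of the idea only; bitsandbytes implements this in fused CUDA code):

```python
import numpy as np

# The 16 NF4 levels from above (normal-distribution quantiles)
NF4 = np.array([-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848,
                -0.0911, 0.0, 0.0796, 0.1609, 0.2461, 0.3379,
                0.4407, 0.5626, 0.7230, 1.0], dtype=np.float32)

def nf4_quantize(w, block_size=64):
    """Blockwise absmax quantization: 4-bit level indices + one scale per block."""
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)  # absmax scale per block
    normed = blocks / scales                            # map into [-1, 1]
    idx = np.abs(normed[:, :, None] - NF4).argmin(axis=2)
    return idx.astype(np.uint8), scales

def nf4_dequantize(idx, scales):
    return NF4[idx] * scales

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)  # weights ~ N(0, 0.02)
idx, scales = nf4_quantize(w)
w_hat = nf4_dequantize(idx, scales).reshape(-1)
print(f"max reconstruction error: {np.abs(w - w_hat).max():.5f}")
```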

1.4 Double Quantization

Another innovation of QLoRA is Double Quantization:

  1. Quantize weights to 4-bit (NF4)
  2. Quantize the quantization constants (scaling factors) to 8-bit
  3. Additional memory savings: from 32bit to 8bit per block
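The steps above in a toy numpy sketch. This uses a simplified int8 absmax scheme for step 2 (the paper quantizes the constants to 8-bit floats with a block size of 256):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-block absmax scales from step 1 (one FP32 value
# per 64-weight block); step 2 re-quantizes these in groups of 256.
scales = rng.uniform(0.01, 0.06, size=1024).astype(np.float32)

groups = scales.reshape(-1, 256)
meta = np.abs(groups).max(axis=1, keepdims=True) / 127.0  # one meta-scale per group
q8 = np.round(groups / meta).astype(np.int8)              # scales now stored in 8 bits
deq = (q8.astype(np.float32) * meta).reshape(-1)

# Storage per scale drops from 32 bits to 8 bits (plus one shared meta-scale)
print(f"max scale error: {np.abs(scales - deq).max():.6f}")
```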

1.5 Memory Comparison Table

| Model | Full FT (FP16) | LoRA (FP16) | QLoRA (4bit) |
|---|---|---|---|
| Llama 3 8B | ~60GB | ~18GB | ~6GB |
| Llama 3 70B | ~500GB | ~160GB | ~40GB |
| Mistral 7B | ~52GB | ~16GB | ~5GB |
| Phi-3 3.8B | ~28GB | ~9GB | ~3GB |
| Qwen 2 7B | ~52GB | ~16GB | ~5GB |

2. Environment Setup

2.1 GPU Requirements

| GPU | VRAM | Trainable Models (QLoRA) |
|---|---|---|
| T4 (Colab Free) | 16GB | 7B-8B (seq_len 1024) |
| A10G | 24GB | 7B-13B |
| RTX 4090 | 24GB | 7B-13B |
| A100 40GB | 40GB | 7B-70B |
| A100 80GB | 80GB | 70B+ |
| Apple M2 Ultra | 192GB | Not supported (Unsloth requires NVIDIA CUDA GPUs) |

2.2 Google Colab Setup

# Install Unsloth on Colab (T4 GPU)
# Runtime -> Change runtime type -> Select T4 GPU

# 1. Install Unsloth
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

# 2. Verify GPU
import torch
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

2.3 Local Environment Setup

# Create Conda environment
conda create -n unsloth python=3.11
conda activate unsloth

# Install PyTorch (CUDA 12.1)
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

# Install Unsloth
pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes

# Verify installation
python -c "from unsloth import FastLanguageModel; print('Unsloth OK')"

2.4 Docker Environment

FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

RUN apt-get update && apt-get install -y python3.11 python3-pip git

RUN pip install torch --index-url https://download.pytorch.org/whl/cu121
RUN pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
RUN pip install --no-deps trl peft accelerate bitsandbytes

WORKDIR /workspace
CMD ["python3"]

3. Unsloth Fine-tuning Step by Step

3.1 Model Loading

from unsloth import FastLanguageModel
import torch

# Load model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",  # Pre-quantized 4bit model
    max_seq_length=2048,    # Maximum sequence length
    dtype=None,             # Auto-detect (A100: bfloat16, others: float16)
    load_in_4bit=True,      # Load with 4bit quantization
)

# Check GPU memory
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

Recommended Pre-quantized Models:

| Use Case | Model | Size |
|---|---|---|
| General Korean | unsloth/Meta-Llama-3.1-8B-bnb-4bit | ~5GB |
| Korean-specific | beomi/Llama-3-Open-Ko-8B-bnb-4bit | ~5GB |
| Coding | unsloth/Mistral-7B-v0.3-bnb-4bit | ~4.5GB |
| Lightweight | unsloth/Phi-3.5-mini-instruct-bnb-4bit | ~2.5GB |
| Multilingual | unsloth/Qwen2.5-7B-bnb-4bit | ~4.5GB |

3.2 LoRA Adapter Configuration

# Add LoRA adapter
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                          # LoRA rank (8, 16, 32, 64)
    target_modules=[               # Modules to apply LoRA
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj",       # MLP
    ],
    lora_alpha=16,                 # LoRA alpha (usually same as r)
    lora_dropout=0,                # 0 is optimal for Unsloth
    bias="none",                   # No bias training
    use_gradient_checkpointing="unsloth",  # Unsloth-optimized checkpointing
    random_state=3407,
    use_rslora=False,              # Rank-Stabilized LoRA (experimental)
    loftq_config=None,             # LoftQ configuration
)

# Check trainable parameters
def print_trainable_parameters(model):
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / Total: {total:,} = {trainable/total:.2%}")

print_trainable_parameters(model)
# Trainable: 41,943,040 / Total: 8,030,261,248 = 0.52%
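As a sanity check, the 41.9M figure can be reproduced from the published Llama 3.1 8B layer shapes (32 layers; GQA gives 1024-dim K/V projections; 14336-dim MLP): LoRA adds r * (d_in + d_out) parameters per wrapped module.

```python
# Cross-check the trainable-parameter count for r=16 on Llama 3.1 8B
r, layers = 16, 32
shapes = {  # (d_in, d_out) of each module LoRA is applied to
    "q_proj": (4096, 4096), "k_proj": (4096, 1024),
    "v_proj": (4096, 1024), "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336), "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
print(per_layer * layers)  # 41943040
```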

LoRA Rank Selection Guide:

| LoRA r | Parameters | VRAM Overhead | Recommended Use |
|---|---|---|---|
| 8 | ~21M | ~80MB | Simple tasks, VRAM-limited |
| 16 | ~42M | ~160MB | Generally recommended |
| 32 | ~84M | ~320MB | Complex tasks |
| 64 | ~168M | ~640MB | Large datasets, high expressiveness |
| 128 | ~336M | ~1.3GB | Experimental, close to Full FT |

4. Data Preparation

4.1 Chat Template Formatting

# Alpaca prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Dataset formatting function
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

4.2 Dataset Loading and Conversion

from datasets import load_dataset

# Load KoAlpaca dataset
dataset = load_dataset("beomi/KoAlpaca-v1.1a", split="train")

# Format conversion
def format_koalpaca(examples):
    texts = []
    for instruction, output in zip(examples["instruction"], examples["output"]):
        text = alpaca_prompt.format(instruction, "", output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

dataset = dataset.map(format_koalpaca, batched=True)

# OpenAI Messages format (using Llama 3 chat template)
def format_openai_messages(examples):
    texts = []
    for messages in examples["messages"]:
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False,
        )
        texts.append(text)
    return {"text": texts}

4.3 Max Sequence Length Considerations

# Analyze sequence length distribution
def analyze_sequence_lengths(dataset, tokenizer):
    lengths = []
    for item in dataset:
        tokens = tokenizer.encode(item["text"])
        lengths.append(len(tokens))

    import numpy as np
    print(f"Mean length: {np.mean(lengths):.0f}")
    print(f"Median: {np.median(lengths):.0f}")
    print(f"95th percentile: {np.percentile(lengths, 95):.0f}")
    print(f"99th percentile: {np.percentile(lengths, 99):.0f}")
    print(f"Max length: {max(lengths)}")

    recommended = int(np.percentile(lengths, 95))
    print(f"\nRecommended max_seq_length: {recommended}")
    return lengths

analyze_sequence_lengths(dataset, tokenizer)

5. Training Configuration

5.1 SFTTrainer Setup

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        # === Basic ===
        output_dir="./outputs",
        num_train_epochs=3,

        # === Batch & Memory ===
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # Effective batch = 2 * 4 = 8

        # === Learning Rate ===
        learning_rate=2e-4,             # QLoRA recommended LR
        lr_scheduler_type="cosine",
        warmup_steps=5,

        # === Precision ===
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),

        # === Logging ===
        logging_steps=1,
        logging_dir="./logs",
        report_to="wandb",

        # === Saving ===
        save_strategy="steps",
        save_steps=100,
        save_total_limit=3,

        # === Optimization ===
        optim="adamw_8bit",
        weight_decay=0.01,
        max_grad_norm=0.3,
        seed=3407,
    ),
)

5.2 Learning Rate Guide

| Scenario | Recommended LR | Reason |
|---|---|---|
| QLoRA default | 2e-4 | QLoRA paper recommendation |
| Large dataset (100K+) | 1e-4 | Prevent overfitting |
| Small dataset (under 1K) | 5e-5 to 1e-4 | Fine-grained learning |
| Domain adaptation | 2e-5 to 5e-5 | Preserve existing knowledge |
| Continued Pre-training | 1e-5 to 5e-5 | Stable training |

Batch Size vs Gradient Accumulation:

# Two ways to achieve effective batch size of 8

# Method 1: Large batch (needs more VRAM)
per_device_train_batch_size = 8
gradient_accumulation_steps = 1
# Effective batch = 8 * 1 = 8, VRAM: ~12GB

# Method 2: Small batch + Gradient Accumulation (less VRAM)
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
# Effective batch = 2 * 4 = 8, VRAM: ~6GB
# Note: Training slightly slower

5.3 Training Execution

# Start training
trainer_stats = trainer.train()

# Print results
print(f"Training time: {trainer_stats.metrics['train_runtime']:.2f}s")
print(f"Final Loss: {trainer_stats.metrics['train_loss']:.4f}")

# Check GPU memory usage
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
print(f"Peak VRAM usage: {used_memory} GB")

6. VRAM Optimization Techniques

6.1 Gradient Checkpointing

# Unsloth-optimized Gradient Checkpointing
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # Key! 30% VRAM savings
)

# Gradient checkpointing options:
# "unsloth": Unsloth-optimized version (faster and more memory efficient)
# True: Standard PyTorch gradient checkpointing
# False: Disabled (fastest but uses most memory)

6.2 Sequence Packing

# Pack short sequences together for better GPU utilization
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    packing=True,           # Enable sequence packing
    max_seq_length=2048,    # Total packed length
)

# Packing effect:
# Packing OFF: [token token PAD PAD PAD PAD] [token PAD PAD PAD PAD PAD]
# Packing ON:  [token token token SEP token token token] -> Better GPU utilization
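A greedy packing pass can be sketched in a few lines. This is an illustration of the idea only, not Unsloth's actual implementation; `sep_id` stands in for the EOS/separator token id:

```python
def pack_sequences(token_lists, max_len=2048, sep_id=2):
    """Greedily concatenate tokenized examples into max_len-sized packs."""
    packs, current = [], []
    for toks in token_lists:
        if current and len(current) + len(toks) + 1 > max_len:
            packs.append(current)       # pack is full: start a new one
            current = []
        current = current + toks + [sep_id]
    if current:
        packs.append(current)
    return packs

# Three short "documents" fit into one pack instead of three padded rows
packs = pack_sequences([[1, 5, 7], [9, 9], [4, 4, 4, 4]], max_len=16)
print(len(packs), [len(p) for p in packs])  # 1 [12]
```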

6.3 VRAM Usage Table (Unsloth QLoRA)

| Model | Batch=1 | Batch=2 | Batch=4 | Batch=8 |
|---|---|---|---|---|
| Llama 3 8B | 4.2GB | 5.8GB | 8.5GB | 14.2GB |
| Mistral 7B | 3.8GB | 5.2GB | 7.8GB | 13.0GB |
| Phi-3 3.8B | 2.4GB | 3.2GB | 4.8GB | 7.6GB |
| Qwen 2 7B | 3.8GB | 5.2GB | 7.8GB | 13.0GB |
| Llama 3 70B | 36GB | 42GB | 56GB | OOM |

* Based on max_seq_length=2048, gradient_checkpointing="unsloth"


7. Model Export and Conversion

7.1 Save LoRA Adapter

# Save LoRA adapter only (small size)
model.save_pretrained("lora_adapter")
tokenizer.save_pretrained("lora_adapter")

# Check saved files
import os
for f in os.listdir("lora_adapter"):
    size = os.path.getsize(f"lora_adapter/{f}") / 1024 / 1024
    print(f"  {f}: {size:.1f} MB")

# adapter_config.json: 0.0 MB
# adapter_model.safetensors: 160.0 MB  <- LoRA weights
# tokenizer.json: 17.1 MB

7.2 Merge Adapter

# Merge LoRA adapter with base model
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("merged_model")
tokenizer.save_pretrained("merged_model")

7.3 GGUF Conversion (for llama.cpp)

# Unsloth's built-in GGUF conversion
# Supports various quantization levels

# Q4_K_M: Most common (quality/size balance)
model.save_pretrained_gguf(
    "model_gguf",
    tokenizer,
    quantization_method="q4_k_m",
)

# Q5_K_M: Higher quality
model.save_pretrained_gguf(
    "model_q5",
    tokenizer,
    quantization_method="q5_k_m",
)

# Q8_0: Highest quality (larger size)
model.save_pretrained_gguf(
    "model_q8",
    tokenizer,
    quantization_method="q8_0",
)

# F16: No quantization (largest)
model.save_pretrained_gguf(
    "model_f16",
    tokenizer,
    quantization_method="f16",
)

GGUF Quantization Comparison:

| Quantization | File Size (8B) | Quality | Inference Speed | Recommended |
|---|---|---|---|---|
| Q4_K_M | ~4.5GB | Good | Fast | General use |
| Q5_K_M | ~5.5GB | Very Good | Medium | Quality-focused |
| Q8_0 | ~8.0GB | Excellent | Slow | Highest quality |
| F16 | ~16GB | Original | Slowest | Reference |

7.4 GPTQ Conversion

# GPTQ quantization (for GPU inference)
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
)

# Prepare calibration data
# (calibration_texts: a list of representative text samples you must supply)
calibration_data = [
    tokenizer(text, return_tensors="pt")
    for text in calibration_texts[:128]
]

# Run GPTQ quantization
gptq_model = AutoGPTQForCausalLM.from_pretrained(
    "merged_model",
    quantize_config=quantize_config,
)
gptq_model.quantize(calibration_data)
gptq_model.save_quantized("model_gptq")

7.5 Upload to Hugging Face Hub

# Upload model to Hugging Face Hub

# Upload LoRA adapter only
model.push_to_hub(
    "my-org/llama3-8b-korean-lora",
    token="hf_xxxxx",
    private=True,
)
tokenizer.push_to_hub(
    "my-org/llama3-8b-korean-lora",
    token="hf_xxxxx",
    private=True,
)

# Upload GGUF file
model.push_to_hub_gguf(
    "my-org/llama3-8b-korean-gguf",
    tokenizer,
    quantization_method="q4_k_m",
    token="hf_xxxxx",
)

8. Evaluation and Testing

8.1 Inference with Fine-tuned Model

# Switch to inference mode
FastLanguageModel.for_inference(model)

# Single prompt inference
def generate_response(instruction, input_text=""):
    prompt = alpaca_prompt.format(instruction, input_text, "")
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.15,
        do_sample=True,
    )

    response = tokenizer.batch_decode(outputs)[0]
    response = response.split("### Response:\n")[-1]
    response = response.replace(tokenizer.eos_token, "").strip()
    return response

# Test
test_questions = [
    "Explain the traditional holidays of Korea.",
    "Explain how decorators work in Python.",
    "Give me tips for healthy eating habits.",
]

for q in test_questions:
    print(f"Q: {q}")
    print(f"A: {generate_response(q)}")
    print("-" * 80)

8.2 Generation Parameter Tuning

# Generation parameter effects
generation_configs = {
    "Factual answers": {
        "temperature": 0.1,
        "top_p": 0.9,
        "repetition_penalty": 1.0,
    },
    "Creative answers": {
        "temperature": 0.8,
        "top_p": 0.95,
        "repetition_penalty": 1.15,
    },
    "Balanced answers": {
        "temperature": 0.5,
        "top_p": 0.9,
        "repetition_penalty": 1.1,
    },
}

8.3 lm-eval-harness Benchmark

# Benchmark evaluation with lm-eval-harness
pip install lm-eval

# Korean benchmark evaluation
lm_eval --model hf \
    --model_args pretrained=./merged_model \
    --tasks kobest_boolq,kobest_copa,kobest_hellaswag,kobest_sentineg,kobest_wic \
    --batch_size 4 \
    --output_path ./eval_results

# Run from Python
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=./merged_model",
    tasks=["kobest_boolq", "kobest_copa", "kobest_hellaswag"],
    batch_size=4,
)

for task, metrics in results["results"].items():
    print(f"{task}: acc={metrics.get('acc', 'N/A')}")

9. Advanced Techniques

9.1 Multi-GPU Training (DeepSpeed ZeRO)

# deepspeed_config.json
"""
{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "reduce_scatter": true
    },
    "bf16": {
        "enabled": true
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
"""

# Execute
# deepspeed --num_gpus 4 train.py --deepspeed deepspeed_config.json

9.2 DPO Training

from datasets import load_dataset
from trl import DPOTrainer, DPOConfig
from unsloth import FastLanguageModel, PatchDPOTrainer, is_bfloat16_supported

# Apply DPO patch
PatchDPOTrainer()

# Prepare DPO dataset
dpo_dataset = load_dataset("argilla/ultrafeedback-binarized-preferences", split="train")

# Configure DPO Trainer
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,           # None in Unsloth (auto-handled)
    args=DPOConfig(
        output_dir="./dpo_output",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=5e-7,       # DPO uses lower LR
        num_train_epochs=1,
        beta=0.1,                 # DPO beta (KL divergence weight)
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
    ),
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)

dpo_trainer.train()

9.3 Continued Pre-training (Domain Adaptation)

# Continued Pre-training with domain-specific text
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Domain text data (medical, legal, financial, etc.)
domain_dataset = load_dataset("my-org/medical-korean-corpus", split="train")

# Use lower learning rate for Continued Pre-training
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=domain_dataset,
    dataset_text_field="text",
    max_seq_length=4096,     # Long documents
    packing=True,            # Use packing for efficiency
    args=TrainingArguments(
        output_dir="./cpt_output",
        learning_rate=2e-5,  # Very low learning rate
        num_train_epochs=1,  # 1 epoch is sufficient
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        optim="adamw_8bit",
        warmup_ratio=0.1,
    ),
)

trainer.train()

10. Common Issues and Solutions

10.1 OOM (Out of Memory) Errors

# Symptom: CUDA out of memory
# RuntimeError: CUDA out of memory.

# Solutions in order:
# 1. Reduce batch_size
per_device_train_batch_size = 1  # Minimum

# 2. Increase gradient_accumulation_steps
gradient_accumulation_steps = 8

# 3. Reduce max_seq_length
max_seq_length = 1024  # 2048 -> 1024

# 4. Reduce LoRA rank
r = 8  # 16 -> 8

# 5. Verify gradient checkpointing
use_gradient_checkpointing = "unsloth"

# 6. Clear cache
torch.cuda.empty_cache()
import gc
gc.collect()

10.2 NaN Loss

# Symptom: loss diverges to NaN
# Cause: Learning rate too high or data issues

# Solutions:
# 1. Lower learning rate
learning_rate = 1e-5  # 2e-4 -> 1e-5

# 2. Set max_grad_norm
max_grad_norm = 0.3  # Gradient clipping

# 3. Validate data
def check_data_issues(dataset, tokenizer):
    """Check for data problems"""
    issues = []
    for i, item in enumerate(dataset):
        text = item["text"]
        if not text.strip():
            issues.append(f"[{i}] Empty text")
        tokens = tokenizer.encode(text)
        if len(tokens) > 4096:
            issues.append(f"[{i}] Text too long: {len(tokens)} tokens")
        if not any(c.isalnum() for c in text):
            issues.append(f"[{i}] No valid text content")
    return issues

10.3 Catastrophic Forgetting

# Symptom: Existing knowledge disappears after fine-tuning
# Solutions:

# 1. Use lower learning rate
learning_rate = 5e-5

# 2. Fewer epochs (1-3)
num_train_epochs = 1

# 3. Mix general knowledge into training data
# Original data 80% + general knowledge data 20%

# 4. Lower LoRA rank (limits change magnitude)
r = 8

# 5. Stronger regularization
weight_decay = 0.1

10.4 Overfitting Detection

# Overfitting indicators:
# 1. Train loss decreases but eval loss increases
# 2. Model output nearly memorizes training data
# 3. Performance degrades on new prompts

# Solutions:
# 1. Increase data volume
# 2. Regularization (dropout, weight_decay)
# 3. Early stopping
from transformers import EarlyStoppingCallback

trainer = SFTTrainer(
    # ... model / tokenizer / train_dataset as in Section 5.1 ...
    eval_dataset=eval_dataset,  # early stopping requires an eval split
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    args=TrainingArguments(
        evaluation_strategy="steps",
        eval_steps=50,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
    ),
)

11. Quiz

Q1. With LoRA r=16, what percentage of parameters are trained compared to the original weights?

Answer: About 0.5% (99.5% savings)

For d=4096:

  • Full: 4096 x 4096 = 16,777,216
  • LoRA r=16: (4096 x 16) + (16 x 4096) = 131,072
  • Ratio: 131,072 / 16,777,216 = 0.78%

In practice, applying to multiple modules (q, k, v, o, gate, up, down) results in about 0.5% of total parameters.

Q2. Why is QLoRA's NF4 quantization better than standard INT4?

Answer: Optimal quantization leveraging the normal distribution characteristics of weights

NF4 exploits the fact that neural network weights generally follow a normal distribution. By placing 16 quantization values at the quantiles of the normal distribution, it achieves less information loss than uniformly-spaced INT4. Theoretically, it achieves near-optimal quantization for normally distributed data.

Q3. What is the core reason Unsloth is 2x faster than HuggingFace PEFT?

Answer: Custom Triton kernels

Unsloth replaces core operations like Attention, MLP, and Cross-Entropy Loss with custom GPU kernels written in Triton. These kernels optimize memory access patterns and reduce unnecessary intermediate tensor creation, achieving 2x faster training and 60% memory savings.

Q4. What is the principle and tradeoff of Gradient Checkpointing?

Answer:

Principle: Instead of storing intermediate activations in memory during the forward pass, they are recomputed on-demand during the backward pass.

Tradeoff:

  • Benefit: VRAM usage reduced by approximately 30-50%
  • Cost: Training time increases by approximately 20-30% due to recomputation

Unsloth's custom gradient checkpointing is more efficient than the standard PyTorch implementation, resulting in less time overhead.
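The recompute idea can be shown with a toy forward pass that keeps only segment-boundary activations (a pure-Python illustration; `torch.utils.checkpoint` and Unsloth's version do this for real autograd graphs):

```python
def forward_checkpointed(layers, x, segment_size=2):
    """Forward pass that stores only one activation per segment boundary.
    During 'backward', a segment's inner activations are recomputed from
    its stored input instead of being kept in memory."""
    stored = [x]                        # segment-boundary activations only
    for i in range(0, len(layers), segment_size):
        for f in layers[i:i + segment_size]:
            x = f(x)
        stored.append(x)
    return x, stored

def recompute_segment(layers, start, segment_size, stored):
    """Rebuild the inner activations of one segment on demand."""
    x = stored[start // segment_size]
    acts = []
    for f in layers[start:start + segment_size]:
        x = f(x)
        acts.append(x)
    return acts

layers = [lambda v, k=k: v * 2 + k for k in range(4)]  # 4 toy "layers"
out, stored = forward_checkpointed(layers, 1.0)
# Without checkpointing we would hold 4 activations; here only 2 boundaries
print(out, len(stored) - 1)  # 27.0 2
```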

Q5. What are the differences between GGUF Q4_K_M and Q8_0, and when should each be used?

Answer:

Q4_K_M (4-bit Mixed):

  • File size: About 28% of original (approximately 4.5GB for 8B models)
  • Quality: Slight performance degradation from original
  • Speed: Fast
  • Recommended for: Daily use, mobile/edge deployment, limited VRAM/RAM environments

Q8_0 (8-bit):

  • File size: About 50% of original (approximately 8GB for 8B models)
  • Quality: Very close to original
  • Speed: Slower than Q4
  • Recommended for: Quality-first use cases, environments with sufficient memory, services requiring accurate inference

12. References

  1. LoRA: Low-Rank Adaptation of Large Language Models - Hu et al., 2021
  2. QLoRA: Efficient Finetuning of Quantized LLMs - Dettmers et al., 2023
  3. Unsloth Documentation - github.com/unslothai/unsloth
  4. PEFT: Parameter-Efficient Fine-Tuning - HuggingFace
  5. TRL: Transformer Reinforcement Learning - HuggingFace
  6. Flash Attention 2 - Dao et al., 2023
  7. LLM.int8(): 8-bit Matrix Multiplication - Dettmers et al., 2022
  8. llama.cpp - github.com/ggerganov/llama.cpp
  9. GPTQ: Accurate Post-Training Quantization - Frantar et al., 2022
  10. DeepSpeed ZeRO - Rajbhandari et al., 2020
  11. Direct Preference Optimization - Rafailov et al., 2023
  12. Scaling Data-Constrained Language Models - Muennighoff et al., 2023
  13. Training Compute-Optimal Large Language Models (Chinchilla) - Hoffmann et al., 2022
  14. The Llama 3 Herd of Models - Meta AI, 2024