Complete Guide to LLM Fine-tuning with Unsloth 2025: QLoRA, 4-bit Quantization, 2x Faster Training

Introduction: Why Unsloth?

The biggest barrier to LLM fine-tuning is GPU memory (VRAM). Full fine-tuning of Llama 3.1 8B requires about 60GB VRAM, which is tight even on a single A100 80GB. QLoRA solved this problem, but training speed remained slow.

Unsloth solves both problems simultaneously:

| Comparison | HuggingFace PEFT | Axolotl | Unsloth |
|---|---|---|---|
| Training Speed | 1x (baseline) | 1.1x | 2x |
| Memory Usage | 100% | 95% | 40% |
| Setup Difficulty | Medium | High | Low |
| Model Support | All | All | Major models |
| Flash Attention | Separate install | Built-in | Built-in |
| Custom Kernels | None | None | Triton kernels |

The secret behind Unsloth is custom Triton kernels. Core operations like Attention, MLP, and Cross-Entropy Loss are replaced with GPU-optimized custom kernels, achieving 2x faster training and 60% memory savings.

Supported Models (as of 2025):

  • Llama 3 / 3.1 / 3.2 (8B, 70B)
  • Mistral / Mixtral
  • Phi-3 / Phi-3.5
  • Qwen 2 / 2.5
  • Gemma 2
  • Yi
  • DeepSeek V2

1. LoRA/QLoRA Theory

1.1 Full Fine-tuning vs LoRA vs QLoRA

Full Fine-tuning (update all parameters)
+------------------------+
|   W (d x d)            |  <- Update entire weights
|   e.g.: 4096 x 4096    |     = 16.7M parameters
|   = 32MB (FP16)        |
+------------------------+

LoRA (Low-Rank Adaptation)
+------------------------+
|   W0 (frozen) + B * A  |
|   W0: 4096 x 4096      |  <- Frozen (no updates)
|   B: 4096 x 16         |  <- Trainable (65K params)
|   A: 16 x 4096         |  <- Trainable (65K params)
|   = 0.25MB (FP16)      |     Total 131K params
+------------------------+

QLoRA (Quantized LoRA)
+------------------------+
|   W0 (4bit) + B * A   |
|   W0: 4096 x 4096     |  <- 4-bit quantized (8MB)
|   B: 4096 x 16         |  <- FP16 trainable
|   A: 16 x 4096         |  <- FP16 trainable
|   = 8.25MB total        |
+------------------------+
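The figures in these diagrams follow from simple bytes-per-parameter arithmetic (2 bytes per FP16 value, half a byte per 4-bit value); a quick check:

```python
# Memory of one 4096x4096 weight matrix under the three schemes above
d, r = 4096, 16
fp16_bytes, fp4_bytes = 2, 0.5                  # bytes per parameter
full_mb = d * d * fp16_bytes / 2**20            # full FP16 weights
lora_mb = (d * r + r * d) * fp16_bytes / 2**20  # B and A adapters in FP16
qlora_mb = d * d * fp4_bytes / 2**20 + lora_mb  # 4-bit base + FP16 adapters
print(f"Full: {full_mb:.0f} MB, LoRA: {lora_mb:.2f} MB, QLoRA: {qlora_mb:.2f} MB")
# Full: 32 MB, LoRA: 0.25 MB, QLoRA: 8.25 MB
```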

1.2 Low-Rank Decomposition Principle

The core idea of LoRA is based on the observation that weight update matrices are actually low-rank.

The original weight update:

W_new = W_old + delta_W

LoRA decomposes delta_W into a product of two small matrices:

delta_W = B * A
where:
  B is a d x r matrix (d=model dimension, r=LoRA rank)
  A is a r x d matrix
  r << d (e.g., r=16, d=4096)

Parameter savings:

# Full Fine-tuning parameters
d = 4096
full_params = d * d  # = 16,777,216 (16.7M)

# LoRA parameters
r = 16
lora_params = d * r + r * d  # = 131,072 (131K)

# Savings ratio
savings = 1 - (lora_params / full_params)
print(f"Parameter savings: {savings:.2%}")  # 99.22%
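The forward pass applies the frozen weight and the low-rank update side by side. A minimal numpy sketch (with a smaller d than 4096 so it runs instantly; PEFT/Unsloth do this inside the layer):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 512, 16, 16                                  # reduced d for the demo
x = rng.normal(size=(1, d)).astype(np.float32)
W0 = rng.normal(0, 0.02, size=(d, d)).astype(np.float32)   # frozen base weight
A = rng.normal(0, 0.02, size=(r, d)).astype(np.float32)    # trainable
B = np.zeros((d, r), dtype=np.float32)                     # trainable, init to 0

# LoRA forward for row vectors: h = x W0^T + (alpha/r) * (x A^T) B^T
h = x @ W0.T + (alpha / r) * (x @ A.T) @ B.T

# Because B starts at zero, delta_W = B * A is zero at initialization,
# so training begins exactly at the pretrained model's behavior.
assert np.allclose(h, x @ W0.T)
```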

1.3 4-bit NormalFloat Quantization (NF4)

NF4 quantization used in QLoRA differs from standard 4-bit:

Standard 4-bit INT quantization:

  • Uniformly divides into 16 intervals
  • Does not consider value distribution

NF4 (NormalFloat4):

  • Leverages the fact that weights follow a normal distribution
  • Sets 16 values aligned with normal distribution quantiles
  • Near-optimal quantization from an information theory perspective

# NF4 quantization values (based on normal distribution quantiles)
nf4_values = [
    -1.0, -0.6962, -0.5251, -0.3949,
    -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379,
    0.4407, 0.5626, 0.7230, 1.0,
]
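A minimal sketch of blockwise absmax quantization using these levels (illustration of the idea only; bitsandbytes implements this in fused CUDA code):

```python
import numpy as np

# The 16 NF4 levels from above (normal-distribution quantiles)
NF4 = np.array([-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848,
                -0.0911, 0.0, 0.0796, 0.1609, 0.2461, 0.3379,
                0.4407, 0.5626, 0.7230, 1.0], dtype=np.float32)

def nf4_quantize(w, block_size=64):
    """Blockwise absmax quantization: 4-bit level indices + one scale per block."""
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)  # absmax scale per block
    normed = blocks / scales                            # map into [-1, 1]
    idx = np.abs(normed[:, :, None] - NF4).argmin(axis=2)
    return idx.astype(np.uint8), scales

def nf4_dequantize(idx, scales):
    return NF4[idx] * scales

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)  # weights ~ N(0, 0.02)
idx, scales = nf4_quantize(w)
w_hat = nf4_dequantize(idx, scales).reshape(-1)
print(f"max reconstruction error: {np.abs(w - w_hat).max():.5f}")
```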

1.4 Double Quantization

Another innovation of QLoRA is Double Quantization:

  1. Quantize weights to 4-bit (NF4)
  2. Quantize the quantization constants (scaling factors) to 8-bit
  3. Additional memory savings: from 32bit to 8bit per block
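The steps above in a toy numpy sketch. This uses a simplified int8 absmax scheme for step 2 (the paper quantizes the constants to 8-bit floats with a block size of 256):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-block absmax scales from step 1 (one FP32 value
# per 64-weight block); step 2 re-quantizes these in groups of 256.
scales = rng.uniform(0.01, 0.06, size=1024).astype(np.float32)

groups = scales.reshape(-1, 256)
meta = np.abs(groups).max(axis=1, keepdims=True) / 127.0  # one meta-scale per group
q8 = np.round(groups / meta).astype(np.int8)              # scales now stored in 8 bits
deq = (q8.astype(np.float32) * meta).reshape(-1)

# Storage per scale drops from 32 bits to 8 bits (plus one shared meta-scale)
print(f"max scale error: {np.abs(scales - deq).max():.6f}")
```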

1.5 Memory Comparison Table

| Model | Full FT (FP16) | LoRA (FP16) | QLoRA (4bit) |
|---|---|---|---|
| Llama 3 8B | ~60GB | ~18GB | ~6GB |
| Llama 3 70B | ~500GB | ~160GB | ~40GB |
| Mistral 7B | ~52GB | ~16GB | ~5GB |
| Phi-3 3.8B | ~28GB | ~9GB | ~3GB |
| Qwen 2 7B | ~52GB | ~16GB | ~5GB |

2. Environment Setup

2.1 GPU Requirements

| GPU | VRAM | Trainable Models (QLoRA) |
|---|---|---|
| T4 (Colab Free) | 16GB | 7B-8B (seq_len 1024) |
| A10G | 24GB | 7B-13B |
| RTX 4090 | 24GB | 7B-13B |
| A100 40GB | 40GB | 7B-70B |
| A100 80GB | 80GB | 70B+ |
| Apple M2 Ultra | 192GB | Not supported (Unsloth requires NVIDIA CUDA GPUs) |

2.2 Google Colab Setup

# Install Unsloth on Colab (T4 GPU)
# Runtime -> Change runtime type -> Select T4 GPU

# 1. Install Unsloth
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

# 2. Verify GPU
import torch
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

2.3 Local Environment Setup

# Create Conda environment
conda create -n unsloth python=3.11
conda activate unsloth

# Install PyTorch (CUDA 12.1)
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

# Install Unsloth
pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes

# Verify installation
python -c "from unsloth import FastLanguageModel; print('Unsloth OK')"

2.4 Docker Environment

FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

RUN apt-get update && apt-get install -y python3.11 python3-pip git

RUN pip install torch --index-url https://download.pytorch.org/whl/cu121
RUN pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
RUN pip install --no-deps trl peft accelerate bitsandbytes

WORKDIR /workspace
CMD ["python3"]

3. Unsloth Fine-tuning Step by Step

3.1 Model Loading

from unsloth import FastLanguageModel
import torch

# Load model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",  # Pre-quantized 4bit model
    max_seq_length=2048,    # Maximum sequence length
    dtype=None,             # Auto-detect (A100: bfloat16, others: float16)
    load_in_4bit=True,      # Load with 4bit quantization
)

# Check GPU memory
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

Recommended Pre-quantized Models:

| Use Case | Model | Size |
|---|---|---|
| General Korean | unsloth/Meta-Llama-3.1-8B-bnb-4bit | ~5GB |
| Korean-specific | beomi/Llama-3-Open-Ko-8B-bnb-4bit | ~5GB |
| Coding | unsloth/Mistral-7B-v0.3-bnb-4bit | ~4.5GB |
| Lightweight | unsloth/Phi-3.5-mini-instruct-bnb-4bit | ~2.5GB |
| Multilingual | unsloth/Qwen2.5-7B-bnb-4bit | ~4.5GB |

3.2 LoRA Adapter Configuration

# Add LoRA adapter
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                          # LoRA rank (8, 16, 32, 64)
    target_modules=[               # Modules to apply LoRA
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj",       # MLP
    ],
    lora_alpha=16,                 # LoRA alpha (usually same as r)
    lora_dropout=0,                # 0 is optimal for Unsloth
    bias="none",                   # No bias training
    use_gradient_checkpointing="unsloth",  # Unsloth-optimized checkpointing
    random_state=3407,
    use_rslora=False,              # Rank-Stabilized LoRA (experimental)
    loftq_config=None,             # LoftQ configuration
)

# Check trainable parameters
def print_trainable_parameters(model):
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / Total: {total:,} = {trainable/total:.2%}")

print_trainable_parameters(model)
# Trainable: 41,943,040 / Total: 8,030,261,248 = 0.52%
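As a sanity check, the 41.9M figure can be reproduced from the published Llama 3.1 8B layer shapes (32 layers; GQA gives 1024-dim K/V projections; 14336-dim MLP): LoRA adds r * (d_in + d_out) parameters per wrapped module.

```python
# Cross-check the trainable-parameter count for r=16 on Llama 3.1 8B
r, layers = 16, 32
shapes = {  # (d_in, d_out) of each module LoRA is applied to
    "q_proj": (4096, 4096), "k_proj": (4096, 1024),
    "v_proj": (4096, 1024), "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336), "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
print(per_layer * layers)  # 41943040
```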

LoRA Rank Selection Guide:

| LoRA r | Parameters | VRAM Overhead | Recommended Use |
|---|---|---|---|
| 8 | ~21M | ~80MB | Simple tasks, VRAM-limited |
| 16 | ~42M | ~160MB | Generally recommended |
| 32 | ~84M | ~320MB | Complex tasks |
| 64 | ~168M | ~640MB | Large datasets, high expressiveness |
| 128 | ~336M | ~1.3GB | Experimental, close to Full FT |

4. Data Preparation

4.1 Chat Template Formatting

# Alpaca prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Dataset formatting function
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

4.2 Dataset Loading and Conversion

from datasets import load_dataset

# Load KoAlpaca dataset
dataset = load_dataset("beomi/KoAlpaca-v1.1a", split="train")

# Format conversion
def format_koalpaca(examples):
    texts = []
    for instruction, output in zip(examples["instruction"], examples["output"]):
        text = alpaca_prompt.format(instruction, "", output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

dataset = dataset.map(format_koalpaca, batched=True)

# OpenAI Messages format (using Llama 3 chat template)
def format_openai_messages(examples):
    texts = []
    for messages in examples["messages"]:
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False,
        )
        texts.append(text)
    return {"text": texts}

4.3 Max Sequence Length Considerations

# Analyze sequence length distribution
def analyze_sequence_lengths(dataset, tokenizer):
    lengths = []
    for item in dataset:
        tokens = tokenizer.encode(item["text"])
        lengths.append(len(tokens))

    import numpy as np
    print(f"Mean length: {np.mean(lengths):.0f}")
    print(f"Median: {np.median(lengths):.0f}")
    print(f"95th percentile: {np.percentile(lengths, 95):.0f}")
    print(f"99th percentile: {np.percentile(lengths, 99):.0f}")
    print(f"Max length: {max(lengths)}")

    recommended = int(np.percentile(lengths, 95))
    print(f"\nRecommended max_seq_length: {recommended}")
    return lengths

analyze_sequence_lengths(dataset, tokenizer)

5. Training Configuration

5.1 SFTTrainer Setup

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        # === Basic ===
        output_dir="./outputs",
        num_train_epochs=3,

        # === Batch & Memory ===
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # Effective batch = 2 * 4 = 8

        # === Learning Rate ===
        learning_rate=2e-4,             # QLoRA recommended LR
        lr_scheduler_type="cosine",
        warmup_steps=5,

        # === Precision ===
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),

        # === Logging ===
        logging_steps=1,
        logging_dir="./logs",
        report_to="wandb",

        # === Saving ===
        save_strategy="steps",
        save_steps=100,
        save_total_limit=3,

        # === Optimization ===
        optim="adamw_8bit",
        weight_decay=0.01,
        max_grad_norm=0.3,
        seed=3407,
    ),
)

5.2 Learning Rate Guide

| Scenario | Recommended LR | Reason |
|---|---|---|
| QLoRA default | 2e-4 | QLoRA paper recommendation |
| Large dataset (100K+) | 1e-4 | Prevent overfitting |
| Small dataset (under 1K) | 5e-5 to 1e-4 | Fine-grained learning |
| Domain adaptation | 2e-5 to 5e-5 | Preserve existing knowledge |
| Continued Pre-training | 1e-5 to 5e-5 | Stable training |

Batch Size vs Gradient Accumulation:

# Two ways to achieve effective batch size of 8

# Method 1: Large batch (needs more VRAM)
per_device_train_batch_size = 8
gradient_accumulation_steps = 1
# Effective batch = 8 * 1 = 8, VRAM: ~12GB

# Method 2: Small batch + Gradient Accumulation (less VRAM)
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
# Effective batch = 2 * 4 = 8, VRAM: ~6GB
# Note: Training slightly slower

5.3 Training Execution

# Start training
trainer_stats = trainer.train()

# Print results
print(f"Training time: {trainer_stats.metrics['train_runtime']:.2f}s")
print(f"Final Loss: {trainer_stats.metrics['train_loss']:.4f}")

# Check GPU memory usage
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
print(f"Peak VRAM usage: {used_memory} GB")

6. VRAM Optimization Techniques

6.1 Gradient Checkpointing

# Unsloth-optimized Gradient Checkpointing
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # Key! 30% VRAM savings
)

# Gradient checkpointing options:
# "unsloth": Unsloth-optimized version (faster and more memory efficient)
# True: Standard PyTorch gradient checkpointing
# False: Disabled (fastest but uses most memory)

6.2 Sequence Packing

# Pack short sequences together for better GPU utilization
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    packing=True,           # Enable sequence packing
    max_seq_length=2048,    # Total packed length
)

# Packing effect:
# Packing OFF: [token token PAD PAD PAD PAD] [token PAD PAD PAD PAD PAD]
# Packing ON:  [token token token SEP token token token] -> Better GPU utilization
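A greedy packing pass can be sketched in a few lines. This is an illustration of the idea only, not Unsloth's actual implementation; `sep_id` stands in for the EOS/separator token id:

```python
def pack_sequences(token_lists, max_len=2048, sep_id=2):
    """Greedily concatenate tokenized examples into max_len-sized packs."""
    packs, current = [], []
    for toks in token_lists:
        if current and len(current) + len(toks) + 1 > max_len:
            packs.append(current)       # pack is full: start a new one
            current = []
        current = current + toks + [sep_id]
    if current:
        packs.append(current)
    return packs

# Three short "documents" fit into one pack instead of three padded rows
packs = pack_sequences([[1, 5, 7], [9, 9], [4, 4, 4, 4]], max_len=16)
print(len(packs), [len(p) for p in packs])  # 1 [12]
```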

6.3 VRAM Usage Table (Unsloth QLoRA)

| Model | Batch=1 | Batch=2 | Batch=4 | Batch=8 |
|---|---|---|---|---|
| Llama 3 8B | 4.2GB | 5.8GB | 8.5GB | 14.2GB |
| Mistral 7B | 3.8GB | 5.2GB | 7.8GB | 13.0GB |
| Phi-3 3.8B | 2.4GB | 3.2GB | 4.8GB | 7.6GB |
| Qwen 2 7B | 3.8GB | 5.2GB | 7.8GB | 13.0GB |
| Llama 3 70B | 36GB | 42GB | 56GB | OOM |

* Based on max_seq_length=2048, gradient_checkpointing="unsloth"


7. Model Export and Conversion

7.1 Save LoRA Adapter

# Save LoRA adapter only (small size)
model.save_pretrained("lora_adapter")
tokenizer.save_pretrained("lora_adapter")

# Check saved files
import os
for f in os.listdir("lora_adapter"):
    size = os.path.getsize(f"lora_adapter/{f}") / 1024 / 1024
    print(f"  {f}: {size:.1f} MB")

# adapter_config.json: 0.0 MB
# adapter_model.safetensors: 160.0 MB  <- LoRA weights
# tokenizer.json: 17.1 MB

7.2 Merge Adapter

# Merge LoRA adapter with base model
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("merged_model")
tokenizer.save_pretrained("merged_model")

7.3 GGUF Conversion (for llama.cpp)

# Unsloth's built-in GGUF conversion
# Supports various quantization levels

# Q4_K_M: Most common (quality/size balance)
model.save_pretrained_gguf(
    "model_gguf",
    tokenizer,
    quantization_method="q4_k_m",
)

# Q5_K_M: Higher quality
model.save_pretrained_gguf(
    "model_q5",
    tokenizer,
    quantization_method="q5_k_m",
)

# Q8_0: Highest quality (larger size)
model.save_pretrained_gguf(
    "model_q8",
    tokenizer,
    quantization_method="q8_0",
)

# F16: No quantization (largest)
model.save_pretrained_gguf(
    "model_f16",
    tokenizer,
    quantization_method="f16",
)

GGUF Quantization Comparison:

| Quantization | File Size (8B) | Quality | Inference Speed | Recommended |
|---|---|---|---|---|
| Q4_K_M | ~4.5GB | Good | Fast | General use |
| Q5_K_M | ~5.5GB | Very Good | Medium | Quality-focused |
| Q8_0 | ~8.0GB | Excellent | Slow | Highest quality |
| F16 | ~16GB | Original | Slowest | Reference |

7.4 GPTQ Conversion

# GPTQ quantization (for GPU inference)
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
)

# Prepare calibration data
# (calibration_texts: a list of representative text samples you must supply)
calibration_data = [
    tokenizer(text, return_tensors="pt")
    for text in calibration_texts[:128]
]

# Run GPTQ quantization
gptq_model = AutoGPTQForCausalLM.from_pretrained(
    "merged_model",
    quantize_config=quantize_config,
)
gptq_model.quantize(calibration_data)
gptq_model.save_quantized("model_gptq")

7.5 Upload to Hugging Face Hub

# Upload model to Hugging Face Hub

# Upload LoRA adapter only
model.push_to_hub(
    "my-org/llama3-8b-korean-lora",
    token="hf_xxxxx",
    private=True,
)
tokenizer.push_to_hub(
    "my-org/llama3-8b-korean-lora",
    token="hf_xxxxx",
    private=True,
)

# Upload GGUF file
model.push_to_hub_gguf(
    "my-org/llama3-8b-korean-gguf",
    tokenizer,
    quantization_method="q4_k_m",
    token="hf_xxxxx",
)

8. Evaluation and Testing

8.1 Inference with Fine-tuned Model

# Switch to inference mode
FastLanguageModel.for_inference(model)

# Single prompt inference
def generate_response(instruction, input_text=""):
    prompt = alpaca_prompt.format(instruction, input_text, "")
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.15,
        do_sample=True,
    )

    response = tokenizer.batch_decode(outputs)[0]
    response = response.split("### Response:\n")[-1]
    response = response.replace(tokenizer.eos_token, "").strip()
    return response

# Test
test_questions = [
    "Explain the traditional holidays of Korea.",
    "Explain how decorators work in Python.",
    "Give me tips for healthy eating habits.",
]

for q in test_questions:
    print(f"Q: {q}")
    print(f"A: {generate_response(q)}")
    print("-" * 80)

8.2 Generation Parameter Tuning

# Generation parameter effects
generation_configs = {
    "Factual answers": {
        "temperature": 0.1,
        "top_p": 0.9,
        "repetition_penalty": 1.0,
    },
    "Creative answers": {
        "temperature": 0.8,
        "top_p": 0.95,
        "repetition_penalty": 1.15,
    },
    "Balanced answers": {
        "temperature": 0.5,
        "top_p": 0.9,
        "repetition_penalty": 1.1,
    },
}

8.3 lm-eval-harness Benchmark

# Benchmark evaluation with lm-eval-harness
pip install lm-eval

# Korean benchmark evaluation
lm_eval --model hf \
    --model_args pretrained=./merged_model \
    --tasks kobest_boolq,kobest_copa,kobest_hellaswag,kobest_sentineg,kobest_wic \
    --batch_size 4 \
    --output_path ./eval_results

# Run from Python
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=./merged_model",
    tasks=["kobest_boolq", "kobest_copa", "kobest_hellaswag"],
    batch_size=4,
)

for task, metrics in results["results"].items():
    print(f"{task}: acc={metrics.get('acc', 'N/A')}")

9. Advanced Techniques

9.1 Multi-GPU Training (DeepSpeed ZeRO)

# deepspeed_config.json
"""
{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "reduce_scatter": true
    },
    "bf16": {
        "enabled": true
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
"""

# Execute
# deepspeed --num_gpus 4 train.py --deepspeed deepspeed_config.json

9.2 DPO Training

from datasets import load_dataset
from trl import DPOTrainer, DPOConfig
from unsloth import FastLanguageModel, PatchDPOTrainer, is_bfloat16_supported

# Apply DPO patch
PatchDPOTrainer()

# Prepare DPO dataset
dpo_dataset = load_dataset("argilla/ultrafeedback-binarized-preferences", split="train")

# Configure DPO Trainer
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,           # None in Unsloth (auto-handled)
    args=DPOConfig(
        output_dir="./dpo_output",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=5e-7,       # DPO uses lower LR
        num_train_epochs=1,
        beta=0.1,                 # DPO beta (KL divergence weight)
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
    ),
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)

dpo_trainer.train()

9.3 Continued Pre-training (Domain Adaptation)

# Continued Pre-training with domain-specific text
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Domain text data (medical, legal, financial, etc.)
domain_dataset = load_dataset("my-org/medical-korean-corpus", split="train")

# Use lower learning rate for Continued Pre-training
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=domain_dataset,
    dataset_text_field="text",
    max_seq_length=4096,     # Long documents
    packing=True,            # Use packing for efficiency
    args=TrainingArguments(
        output_dir="./cpt_output",
        learning_rate=2e-5,  # Very low learning rate
        num_train_epochs=1,  # 1 epoch is sufficient
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        optim="adamw_8bit",
        warmup_ratio=0.1,
    ),
)

trainer.train()

10. Common Issues and Solutions

10.1 OOM (Out of Memory) Errors

# Symptom: CUDA out of memory
# RuntimeError: CUDA out of memory.

# Solutions in order:
# 1. Reduce batch_size
per_device_train_batch_size = 1  # Minimum

# 2. Increase gradient_accumulation_steps
gradient_accumulation_steps = 8

# 3. Reduce max_seq_length
max_seq_length = 1024  # 2048 -> 1024

# 4. Reduce LoRA rank
r = 8  # 16 -> 8

# 5. Verify gradient checkpointing
use_gradient_checkpointing = "unsloth"

# 6. Clear cache
torch.cuda.empty_cache()
import gc
gc.collect()

10.2 NaN Loss

# Symptom: loss diverges to NaN
# Cause: Learning rate too high or data issues

# Solutions:
# 1. Lower learning rate
learning_rate = 1e-5  # 2e-4 -> 1e-5

# 2. Set max_grad_norm
max_grad_norm = 0.3  # Gradient clipping

# 3. Validate data
def check_data_issues(dataset, tokenizer):
    """Check for data problems"""
    issues = []
    for i, item in enumerate(dataset):
        text = item["text"]
        if not text.strip():
            issues.append(f"[{i}] Empty text")
        tokens = tokenizer.encode(text)
        if len(tokens) > 4096:
            issues.append(f"[{i}] Text too long: {len(tokens)} tokens")
        if not any(c.isalnum() for c in text):
            issues.append(f"[{i}] No valid text content")
    return issues

10.3 Catastrophic Forgetting

# Symptom: Existing knowledge disappears after fine-tuning
# Solutions:

# 1. Use lower learning rate
learning_rate = 5e-5

# 2. Fewer epochs (1-3)
num_train_epochs = 1

# 3. Mix general knowledge into training data
# Original data 80% + general knowledge data 20%

# 4. Lower LoRA rank (limits change magnitude)
r = 8

# 5. Stronger regularization
weight_decay = 0.1

10.4 Overfitting Detection

# Overfitting indicators:
# 1. Train loss decreases but eval loss increases
# 2. Model output nearly memorizes training data
# 3. Performance degrades on new prompts

# Solutions:
# 1. Increase data volume
# 2. Regularization (dropout, weight_decay)
# 3. Early stopping
from transformers import EarlyStoppingCallback

trainer = SFTTrainer(
    # ... model / tokenizer / train_dataset as in Section 5.1 ...
    eval_dataset=eval_dataset,  # early stopping requires an eval split
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    args=TrainingArguments(
        evaluation_strategy="steps",
        eval_steps=50,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
    ),
)

11. Quiz

Q1. With LoRA r=16, what percentage of parameters are trained compared to the original weights?

Answer: About 0.5% (99.5% savings)

For d=4096:

  • Full: 4096 x 4096 = 16,777,216
  • LoRA r=16: (4096 x 16) + (16 x 4096) = 131,072
  • Ratio: 131,072 / 16,777,216 = 0.78%

In practice, applying to multiple modules (q, k, v, o, gate, up, down) results in about 0.5% of total parameters.

Q2. Why is QLoRA's NF4 quantization better than standard INT4?

Answer: Optimal quantization leveraging the normal distribution characteristics of weights

NF4 exploits the fact that neural network weights generally follow a normal distribution. By placing 16 quantization values at the quantiles of the normal distribution, it achieves less information loss than uniformly-spaced INT4. Theoretically, it achieves near-optimal quantization for normally distributed data.

Q3. What is the core reason Unsloth is 2x faster than HuggingFace PEFT?

Answer: Custom Triton kernels

Unsloth replaces core operations like Attention, MLP, and Cross-Entropy Loss with custom GPU kernels written in Triton. These kernels optimize memory access patterns and reduce unnecessary intermediate tensor creation, achieving 2x faster training and 60% memory savings.

Q4. What is the principle and tradeoff of Gradient Checkpointing?

Answer:

Principle: Instead of storing intermediate activations in memory during the forward pass, they are recomputed on-demand during the backward pass.

Tradeoff:

  • Benefit: VRAM usage reduced by approximately 30-50%
  • Cost: Training time increases by approximately 20-30% due to recomputation

Unsloth's custom gradient checkpointing is more efficient than the standard PyTorch implementation, resulting in less time overhead.
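The recompute idea can be shown with a toy forward pass that keeps only segment-boundary activations (a pure-Python illustration; `torch.utils.checkpoint` and Unsloth's version do this for real autograd graphs):

```python
def forward_checkpointed(layers, x, segment_size=2):
    """Forward pass that stores only one activation per segment boundary.
    During 'backward', a segment's inner activations are recomputed from
    its stored input instead of being kept in memory."""
    stored = [x]                        # segment-boundary activations only
    for i in range(0, len(layers), segment_size):
        for f in layers[i:i + segment_size]:
            x = f(x)
        stored.append(x)
    return x, stored

def recompute_segment(layers, start, segment_size, stored):
    """Rebuild the inner activations of one segment on demand."""
    x = stored[start // segment_size]
    acts = []
    for f in layers[start:start + segment_size]:
        x = f(x)
        acts.append(x)
    return acts

layers = [lambda v, k=k: v * 2 + k for k in range(4)]  # 4 toy "layers"
out, stored = forward_checkpointed(layers, 1.0)
# Without checkpointing we would hold 4 activations; here only 2 boundaries
print(out, len(stored) - 1)  # 27.0 2
```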

Q5. What are the differences between GGUF Q4_K_M and Q8_0, and when should each be used?

Answer:

Q4_K_M (4-bit Mixed):

  • File size: About 28% of original (approximately 4.5GB for 8B models)
  • Quality: Slight performance degradation from original
  • Speed: Fast
  • Recommended for: Daily use, mobile/edge deployment, limited VRAM/RAM environments

Q8_0 (8-bit):

  • File size: About 50% of original (approximately 8GB for 8B models)
  • Quality: Very close to original
  • Speed: Slower than Q4
  • Recommended for: Quality-first use cases, environments with sufficient memory, services requiring accurate inference

12. References

  1. LoRA: Low-Rank Adaptation of Large Language Models - Hu et al., 2021
  2. QLoRA: Efficient Finetuning of Quantized LLMs - Dettmers et al., 2023
  3. Unsloth Documentation - github.com/unslothai/unsloth
  4. PEFT: Parameter-Efficient Fine-Tuning - HuggingFace
  5. TRL: Transformer Reinforcement Learning - HuggingFace
  6. Flash Attention 2 - Dao et al., 2023
  7. LLM.int8(): 8-bit Matrix Multiplication - Dettmers et al., 2022
  8. llama.cpp - github.com/ggerganov/llama.cpp
  9. GPTQ: Accurate Post-Training Quantization - Frantar et al., 2022
  10. DeepSpeed ZeRO - Rajbhandari et al., 2020
  11. Direct Preference Optimization - Rafailov et al., 2023
  12. Scaling Data-Constrained Language Models - Muennighoff et al., 2023
  13. Training Compute-Optimal Large Language Models (Chinchilla) - Hoffmann et al., 2022
  14. The Llama 3 Herd of Models - Meta AI, 2024