Complete Guide to LLM Fine-tuning with Unsloth 2025: QLoRA, 4-bit Quantization, 2x Faster Training
- Author: Youngju Kim (@fjvbn20031)
- Introduction: Why Unsloth?
- 1. LoRA/QLoRA Theory
- 2. Environment Setup
- 3. Unsloth Fine-tuning Step by Step
- 4. Data Preparation
- 5. Training Configuration
- 6. VRAM Optimization Techniques
- 7. Model Export and Conversion
- 8. Evaluation and Testing
- 9. Advanced Techniques
- 10. Common Issues and Solutions
- 11. Quiz
- 12. References
Introduction: Why Unsloth?
The biggest barrier to LLM fine-tuning is GPU memory (VRAM). Full fine-tuning of Llama 3.1 8B requires about 60GB VRAM, which is tight even on a single A100 80GB. QLoRA solved this problem, but training speed remained slow.
Unsloth solves both problems simultaneously:
| Comparison | HuggingFace PEFT | Axolotl | Unsloth |
|---|---|---|---|
| Training Speed | 1x (baseline) | 1.1x | 2x |
| Memory Usage | 100% | 95% | 40% |
| Setup Difficulty | Medium | High | Low |
| Model Support | All | All | Major models |
| Flash Attention | Separate install | Built-in | Built-in |
| Custom Kernels | None | None | Triton kernels |
The secret behind Unsloth is custom Triton kernels. Core operations like Attention, MLP, and Cross-Entropy Loss are replaced with GPU-optimized custom kernels, achieving 2x faster training and 60% memory savings.
Supported Models (as of 2025):
- Llama 3 / 3.1 / 3.2 (8B, 70B)
- Mistral / Mixtral
- Phi-3 / Phi-3.5
- Qwen 2 / 2.5
- Gemma 2
- Yi
- DeepSeek V2
1. LoRA/QLoRA Theory
1.1 Full Fine-tuning vs LoRA vs QLoRA
Full Fine-tuning (update all parameters)
+------------------------+
| W (d x d) | <- Update entire weights
| e.g.: 4096 x 4096 | = 16M parameters
| = 32MB (FP16) |
+------------------------+
LoRA (Low-Rank Adaptation)
+------------------------+
| W0 (frozen) + B * A |
| W0: 4096 x 4096 | <- Frozen (no updates)
| B: 4096 x 16 | <- Trainable (65K params)
| A: 16 x 4096 | <- Trainable (65K params)
| = 0.25MB (FP16) | Total 130K params
+------------------------+
QLoRA (Quantized LoRA)
+------------------------+
| W0 (4bit) + B * A |
| W0: 4096 x 4096 | <- 4-bit quantized (8MB)
| B: 4096 x 16 | <- FP16 trainable
| A: 16 x 4096 | <- FP16 trainable
| = 8.25MB total |
+------------------------+
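A detail the diagrams do not show: in LoRA, B is zero-initialized (A is random), so B * A starts as the zero matrix and training begins exactly from the frozen base model. A minimal pure-Python sketch with toy dimensions (illustrative only, no framework):

```python
import random

def matmul(X, Y):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

random.seed(0)
d, r = 4, 2                                   # toy dims (real: d=4096, r=16)
W0 = [[random.gauss(0, 0.02) for _ in range(d)] for _ in range(d)]  # frozen
A  = [[random.gauss(0, 0.02) for _ in range(d)] for _ in range(r)]  # r x d, random init
B  = [[0.0] * r for _ in range(d)]            # d x r, ZERO init

x = [[1.0, 2.0, 3.0, 4.0]]                    # one input row
base_out = matmul(x, W0)
delta_W  = matmul(B, A)                       # all zeros at init
lora_out = add(base_out, matmul(x, delta_W))

print(lora_out == base_out)  # True: the adapter is a no-op before training
```

Because the adapter contributes nothing at step 0, the first gradient updates move the model smoothly away from the base weights rather than from a random perturbation.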
1.2 Low-Rank Decomposition Principle
The core idea of LoRA rests on the empirical observation that the weight updates learned during fine-tuning have low intrinsic rank.
The original weight update:
W_new = W_old + delta_W
LoRA decomposes delta_W into a product of two small matrices:
delta_W = B * A
where:
B is a d x r matrix (d=model dimension, r=LoRA rank)
A is a r x d matrix
r << d (e.g., r=16, d=4096)
Parameter savings:
# Full Fine-tuning parameters
d = 4096
full_params = d * d # = 16,777,216 (16.7M)
# LoRA parameters
r = 16
lora_params = d * r + r * d # = 131,072 (131K)
# Savings ratio
savings = 1 - (lora_params / full_params)
print(f"Parameter savings: {savings:.2%}") # 99.22%
1.3 4-bit NormalFloat Quantization (NF4)
NF4 quantization used in QLoRA differs from standard 4-bit:
Standard 4-bit INT quantization:
- Uniformly divides into 16 intervals
- Does not consider value distribution
NF4 (NormalFloat4):
- Leverages the fact that weights follow a normal distribution
- Sets 16 values aligned with normal distribution quantiles
- Near-optimal quantization from an information theory perspective
# NF4 quantization values (based on normal distribution quantiles)
nf4_values = [
-1.0, -0.6962, -0.5251, -0.3949,
-0.2844, -0.1848, -0.0911, 0.0,
0.0796, 0.1609, 0.2461, 0.3379,
0.4407, 0.5626, 0.7230, 1.0,
]
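To make the effect concrete, here is a small illustrative round-trip quantizer (a sketch of the idea, not the bitsandbytes implementation): each weight in a block is scaled by the block's absolute maximum, snapped to the nearest NF4 level, and dequantized with the same scale.

```python
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]

def nf4_quantize_block(weights):
    """Quantize one block: absmax scaling + nearest-NF4-level rounding."""
    absmax = max(abs(w) for w in weights)     # per-block scaling factor
    codes = [min(range(16), key=lambda i: abs(w / absmax - NF4_LEVELS[i]))
             for w in weights]                # one 4-bit code per weight
    return codes, absmax

def nf4_dequantize_block(codes, absmax):
    return [NF4_LEVELS[c] * absmax for c in codes]

block = [0.05, -0.9, 0.33, 0.0, 0.61, -0.18]
codes, scale = nf4_quantize_block(block)
restored = nf4_dequantize_block(codes, scale)
for w, q in zip(block, restored):
    print(f"{w:+.3f} -> {q:+.4f}")
```

Note that the block's largest-magnitude weight (here -0.9) always round-trips exactly, since the +/-1.0 levels land on the absmax after scaling.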
1.4 Double Quantization
Another innovation of QLoRA is Double Quantization:
- Quantize weights to 4-bit (NF4), storing one scaling factor per block of weights
- Quantize those scaling factors themselves from 32-bit to 8-bit
- Additional savings: roughly 0.37 bits per parameter (about 3GB on a 65B model)
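The savings can be estimated directly. Assuming the block sizes from the QLoRA paper (one FP32 scale per 64 weights; after double quantization, 8-bit scales with one shared FP32 constant per 256 scales), the per-parameter overhead drops from 0.5 to about 0.127 bits:

```python
BLOCK = 64     # weights per first-level quantization block
BLOCK2 = 256   # first-level scales per second-level block

# Without double quantization: one FP32 scale per 64 weights
overhead_plain = 32 / BLOCK                       # bits per parameter

# With double quantization: 8-bit scales, plus one FP32
# second-level constant shared by 256 scales
overhead_dq = 8 / BLOCK + 32 / (BLOCK * BLOCK2)   # bits per parameter

saved_gb = (overhead_plain - overhead_dq) * 65e9 / 8 / 1e9
print(f"plain: {overhead_plain:.3f} bits/param")          # 0.500
print(f"double-quantized: {overhead_dq:.3f} bits/param")  # 0.127
print(f"saved on a 65B model: ~{saved_gb:.1f} GB")        # ~3.0 GB
```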
1.5 Memory Comparison Table
| Model | Full FT (FP16) | LoRA (FP16) | QLoRA (4bit) |
|---|---|---|---|
| Llama 3 8B | ~60GB | ~18GB | ~6GB |
| Llama 3 70B | ~500GB | ~160GB | ~40GB |
| Mistral 7B | ~52GB | ~16GB | ~5GB |
| Phi-3 3.8B | ~28GB | ~9GB | ~3GB |
| Qwen 2 7B | ~52GB | ~16GB | ~5GB |
2. Environment Setup
2.1 GPU Requirements
| GPU | VRAM | Trainable Models (QLoRA) |
|---|---|---|
| T4 (Colab Free) | 16GB | 7B-8B (seq_len 1024) |
| A10G | 24GB | 7B-13B |
| RTX 4090 | 24GB | 7B-13B |
| A100 40GB | 40GB | 7B-70B |
| A100 80GB | 80GB | 70B+ |
| Apple Silicon | - | Not supported (Unsloth requires NVIDIA CUDA GPUs) |
2.2 Google Colab Setup
# Install Unsloth on Colab (T4 GPU)
# Runtime -> Change runtime type -> Select T4 GPU
# 1. Install Unsloth
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
# 2. Verify GPU
import torch
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
2.3 Local Environment Setup
# Create Conda environment
conda create -n unsloth python=3.11
conda activate unsloth
# Install PyTorch (CUDA 12.1)
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
# Install Unsloth
pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
# Verify installation
python -c "from unsloth import FastLanguageModel; print('Unsloth OK')"
2.4 Docker Environment
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04
RUN apt-get update && apt-get install -y python3.11 python3-pip git
RUN pip install torch --index-url https://download.pytorch.org/whl/cu121
RUN pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
RUN pip install --no-deps trl peft accelerate bitsandbytes
WORKDIR /workspace
CMD ["python3"]
3. Unsloth Fine-tuning Step by Step
3.1 Model Loading
from unsloth import FastLanguageModel
import torch
# Load model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit", # Pre-quantized 4bit model
max_seq_length=2048, # Maximum sequence length
dtype=None, # Auto-detect (A100: bfloat16, others: float16)
load_in_4bit=True, # Load with 4bit quantization
)
# Check GPU memory
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
Recommended Pre-quantized Models:
| Use Case | Model | Size |
|---|---|---|
| General Korean | unsloth/Meta-Llama-3.1-8B-bnb-4bit | ~5GB |
| Korean-specific | beomi/Llama-3-Open-Ko-8B-bnb-4bit | ~5GB |
| Coding | unsloth/Mistral-7B-v0.3-bnb-4bit | ~4.5GB |
| Lightweight | unsloth/Phi-3.5-mini-instruct-bnb-4bit | ~2.5GB |
| Multilingual | unsloth/Qwen2.5-7B-bnb-4bit | ~4.5GB |
3.2 LoRA Adapter Configuration
# Add LoRA adapter
model = FastLanguageModel.get_peft_model(
model,
r=16, # LoRA rank (8, 16, 32, 64)
target_modules=[ # Modules to apply LoRA
"q_proj", "k_proj", "v_proj", "o_proj", # Attention
"gate_proj", "up_proj", "down_proj", # MLP
],
lora_alpha=16, # LoRA alpha (usually same as r)
lora_dropout=0, # 0 is optimal for Unsloth
bias="none", # No bias training
use_gradient_checkpointing="unsloth", # Unsloth-optimized checkpointing
random_state=3407,
use_rslora=False, # Rank-Stabilized LoRA (experimental)
loftq_config=None, # LoftQ configuration
)
# Check trainable parameters
def print_trainable_parameters(model):
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / Total: {total:,} = {trainable/total:.2%}")
print_trainable_parameters(model)
# Trainable: 41,943,040 / Total: 8,030,261,248 = 0.52%
LoRA Rank Selection Guide:
| LoRA r | Parameters | VRAM Overhead | Recommended Use |
|---|---|---|---|
| 8 | ~21M | ~80MB | Simple tasks, VRAM-limited |
| 16 | ~42M | ~160MB | Generally recommended |
| 32 | ~84M | ~320MB | Complex tasks |
| 64 | ~168M | ~640MB | Large datasets, high expressiveness |
| 128 | ~336M | ~1.3GB | Experimental, close to Full FT |
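The parameter counts in the table can be reproduced from Llama 3 8B's module shapes (hidden 4096, intermediate 14336, 32 layers, GQA giving 1024-dim k/v projections). Each targeted module adds r x (in_features + out_features) parameters (the A and B matrices):

```python
# Llama 3 8B linear-module shapes: (in_features, out_features)
MODULES = {
    "q_proj": (4096, 4096), "k_proj": (4096, 1024),
    "v_proj": (4096, 1024), "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336), "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
N_LAYERS = 32

def lora_param_count(r):
    """LoRA adds an (r x in) matrix A and an (out x r) matrix B per module."""
    per_layer = sum(r * (fin + fout) for fin, fout in MODULES.values())
    return per_layer * N_LAYERS

for r in (8, 16, 32, 64):
    print(f"r={r:3d}: {lora_param_count(r):>12,} trainable params")
# r=16 yields 41,943,040 -- matching the 0.52% figure printed earlier
```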
4. Data Preparation
4.1 Chat Template Formatting
# Alpaca prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Input:
{}
### Response:
{}"""
# Dataset formatting function
EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func(examples):
instructions = examples["instruction"]
inputs = examples["input"]
outputs = examples["output"]
texts = []
for instruction, input_text, output in zip(instructions, inputs, outputs):
text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
texts.append(text)
return {"text": texts}
4.2 Dataset Loading and Conversion
from datasets import load_dataset
# Load KoAlpaca dataset
dataset = load_dataset("beomi/KoAlpaca-v1.1a", split="train")
# Format conversion
def format_koalpaca(examples):
texts = []
for instruction, output in zip(examples["instruction"], examples["output"]):
text = alpaca_prompt.format(instruction, "", output) + EOS_TOKEN
texts.append(text)
return {"text": texts}
dataset = dataset.map(format_koalpaca, batched=True)
# OpenAI Messages format (using Llama 3 chat template)
def format_openai_messages(examples):
texts = []
for messages in examples["messages"]:
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=False,
)
texts.append(text)
return {"text": texts}
4.3 Max Sequence Length Considerations
# Analyze sequence length distribution
def analyze_sequence_lengths(dataset, tokenizer):
lengths = []
for item in dataset:
tokens = tokenizer.encode(item["text"])
lengths.append(len(tokens))
import numpy as np
print(f"Mean length: {np.mean(lengths):.0f}")
print(f"Median: {np.median(lengths):.0f}")
print(f"95th percentile: {np.percentile(lengths, 95):.0f}")
print(f"99th percentile: {np.percentile(lengths, 99):.0f}")
print(f"Max length: {max(lengths)}")
recommended = int(np.percentile(lengths, 95))
print(f"\nRecommended max_seq_length: {recommended}")
return lengths
analyze_sequence_lengths(dataset, tokenizer)
5. Training Configuration
5.1 SFTTrainer Setup
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=2048,
dataset_num_proc=2,
packing=False,
args=TrainingArguments(
# === Basic ===
output_dir="./outputs",
num_train_epochs=3,
# === Batch & Memory ===
per_device_train_batch_size=2,
gradient_accumulation_steps=4, # Effective batch = 2 * 4 = 8
# === Learning Rate ===
learning_rate=2e-4, # QLoRA recommended LR
lr_scheduler_type="cosine",
warmup_steps=5,
# === Precision ===
fp16=not is_bfloat16_supported(),
bf16=is_bfloat16_supported(),
# === Logging ===
logging_steps=1,
logging_dir="./logs",
report_to="wandb",
# === Saving ===
save_strategy="steps",
save_steps=100,
save_total_limit=3,
# === Optimization ===
optim="adamw_8bit",
weight_decay=0.01,
max_grad_norm=0.3,
seed=3407,
),
)
5.2 Learning Rate Guide
| Scenario | Recommended LR | Reason |
|---|---|---|
| QLoRA default | 2e-4 | QLoRA paper recommendation |
| Large dataset (100K+) | 1e-4 | Prevent overfitting |
| Small dataset (under 1K) | 5e-5 to 1e-4 | Fine-grained learning |
| Domain adaptation | 2e-5 to 5e-5 | Preserve existing knowledge |
| Continued Pre-training | 1e-5 to 5e-5 | Stable training |
Batch Size vs Gradient Accumulation:
# Two ways to achieve effective batch size of 8
# Method 1: Large batch (needs more VRAM)
per_device_train_batch_size = 8
gradient_accumulation_steps = 1
# Effective batch = 8 * 1 = 8, VRAM: ~12GB
# Method 2: Small batch + Gradient Accumulation (less VRAM)
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
# Effective batch = 2 * 4 = 8, VRAM: ~6GB
# Note: Training slightly slower
5.3 Training Execution
# Start training
trainer_stats = trainer.train()
# Print results
print(f"Training time: {trainer_stats.metrics['train_runtime']:.2f}s")
print(f"Final Loss: {trainer_stats.metrics['train_loss']:.4f}")
# Check GPU memory usage
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
print(f"Peak VRAM usage: {used_memory} GB")
6. VRAM Optimization Techniques
6.1 Gradient Checkpointing
# Unsloth-optimized Gradient Checkpointing
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
use_gradient_checkpointing="unsloth", # Key! 30% VRAM savings
)
# Gradient checkpointing options:
# "unsloth": Unsloth-optimized version (faster and more memory efficient)
# True: Standard PyTorch gradient checkpointing
# False: Disabled (fastest but uses most memory)
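A back-of-envelope model of the tradeoff (illustrative only: the per-layer activation constant below is an assumption, and real footprints depend on the implementation). Without checkpointing, every block keeps all its intermediate activations alive for backward; with per-block checkpointing, only each block's input survives the forward pass, and one block's activations are rebuilt at a time during backward:

```python
def activation_memory_gb(n_layers, seq_len, hidden, batch,
                         bytes_per_el=2, acts_per_layer=34,
                         checkpointing=False):
    """Rough activation-memory model for a transformer block stack.

    acts_per_layer ~ how many hidden-sized tensors each block keeps
    alive for backward (an assumed constant, not a measured one).
    """
    per_layer_full = acts_per_layer * seq_len * hidden * batch * bytes_per_el
    per_layer_input = seq_len * hidden * batch * bytes_per_el
    kept = per_layer_input if checkpointing else per_layer_full
    # during backward, one block's full activations are live while recomputing
    peak = n_layers * kept + (per_layer_full if checkpointing else 0)
    return peak / 1024**3

full = activation_memory_gb(32, 2048, 4096, 2)
ckpt = activation_memory_gb(32, 2048, 4096, 2, checkpointing=True)
print(f"no checkpointing: ~{full:.1f} GB, with: ~{ckpt:.1f} GB")
# ~34.0 GB vs ~2.1 GB under these toy assumptions
```

The absolute numbers are only indicative, but the shape of the result explains why checkpointing is what makes batch sizes above 1 feasible on 16-24GB cards.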
6.2 Sequence Packing
# Pack short sequences together for better GPU utilization
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
packing=True, # Enable sequence packing
max_seq_length=2048, # Total packed length
)
# Packing effect:
# Packing OFF: [tok tok PAD PAD PAD PAD] [tok PAD PAD PAD PAD PAD]
# Packing ON:  [tok tok tok <SEP> tok tok tok] -> far fewer wasted PAD slots
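The packing idea itself is simple greedy concatenation. TRL handles this internally when packing=True; the sketch below (assuming each example already fits within max_len, with an arbitrary separator id) just shows the mechanics:

```python
def pack_sequences(token_lists, max_len, sep_id=0):
    """Greedily concatenate tokenized examples (each assumed <= max_len)
    into blocks of at most max_len tokens, separated by sep_id."""
    blocks, current = [], []
    for toks in token_lists:
        needed = len(toks) + (1 if current else 0)  # +1 for the separator
        if len(current) + needed > max_len:
            blocks.append(current)                  # close the full block
            current = list(toks)
        else:
            if current:
                current.append(sep_id)
            current.extend(toks)
    if current:
        blocks.append(current)
    return blocks

docs = [[1] * 300, [2] * 600, [3] * 200, [4] * 800, [5] * 100]
blocks = pack_sequences(docs, max_len=1024)
print([len(b) for b in blocks])  # [901, 1001, 100]
```

Five padded sequences would occupy 5 x 1024 token slots; packing fits the same data into 3 blocks, which is where the GPU-utilization gain comes from.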
6.3 VRAM Usage Table (Unsloth QLoRA)
| Model | Batch=1 | Batch=2 | Batch=4 | Batch=8 |
|---|---|---|---|---|
| Llama 3 8B | 4.2GB | 5.8GB | 8.5GB | 14.2GB |
| Mistral 7B | 3.8GB | 5.2GB | 7.8GB | 13.0GB |
| Phi-3 3.8B | 2.4GB | 3.2GB | 4.8GB | 7.6GB |
| Qwen 2 7B | 3.8GB | 5.2GB | 7.8GB | 13.0GB |
| Llama 3 70B | 36GB | 42GB | 56GB | OOM |
* Based on max_seq_length=2048, gradient_checkpointing="unsloth"
7. Model Export and Conversion
7.1 Save LoRA Adapter
# Save LoRA adapter only (small size)
model.save_pretrained("lora_adapter")
tokenizer.save_pretrained("lora_adapter")
# Check saved files
import os
for f in os.listdir("lora_adapter"):
size = os.path.getsize(f"lora_adapter/{f}") / 1024 / 1024
print(f" {f}: {size:.1f} MB")
# adapter_config.json: 0.0 MB
# adapter_model.safetensors: 160.0 MB <- LoRA weights
# tokenizer.json: 17.1 MB
7.2 Merge Adapter
# Merge the LoRA adapter into the base model and save as 16-bit weights
# (Unsloth dequantizes the 4-bit base before merging, avoiding a lossy
# merge into quantized weights)
model.save_pretrained_merged("merged_model", tokenizer, save_method="merged_16bit")
7.3 GGUF Conversion (for llama.cpp)
# Unsloth's built-in GGUF conversion
# Supports various quantization levels
# Q4_K_M: Most common (quality/size balance)
model.save_pretrained_gguf(
"model_gguf",
tokenizer,
quantization_method="q4_k_m",
)
# Q5_K_M: Higher quality
model.save_pretrained_gguf(
"model_q5",
tokenizer,
quantization_method="q5_k_m",
)
# Q8_0: Highest quality (larger size)
model.save_pretrained_gguf(
"model_q8",
tokenizer,
quantization_method="q8_0",
)
# F16: No quantization (largest)
model.save_pretrained_gguf(
"model_f16",
tokenizer,
quantization_method="f16",
)
GGUF Quantization Comparison:
| Quantization | File Size (8B) | Quality | Inference Speed | Recommended |
|---|---|---|---|---|
| Q4_K_M | ~4.5GB | Good | Fast | General use |
| Q5_K_M | ~5.5GB | Very Good | Medium | Quality-focused |
| Q8_0 | ~8.0GB | Excellent | Slow | Highest quality |
| F16 | ~16GB | Original | Slowest | Reference |
7.4 GPTQ Conversion
# GPTQ quantization (for GPU inference)
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
quantize_config = BaseQuantizeConfig(
bits=4,
group_size=128,
desc_act=False,
)
# Prepare calibration data
# (calibration_texts: a list of a few hundred representative text
#  samples from your target domain -- supplied by you)
calibration_data = [
tokenizer(text, return_tensors="pt")
for text in calibration_texts[:128]
]
# Run GPTQ quantization
gptq_model = AutoGPTQForCausalLM.from_pretrained(
"merged_model",
quantize_config=quantize_config,
)
gptq_model.quantize(calibration_data)
gptq_model.save_quantized("model_gptq")
7.5 Upload to Hugging Face Hub
# Upload model to Hugging Face Hub
# Upload LoRA adapter only
model.push_to_hub(
"my-org/llama3-8b-korean-lora",
token="hf_xxxxx",
private=True,
)
tokenizer.push_to_hub(
"my-org/llama3-8b-korean-lora",
token="hf_xxxxx",
private=True,
)
# Upload GGUF file
model.push_to_hub_gguf(
"my-org/llama3-8b-korean-gguf",
tokenizer,
quantization_method="q4_k_m",
token="hf_xxxxx",
)
8. Evaluation and Testing
8.1 Inference with Fine-tuned Model
# Switch to inference mode
FastLanguageModel.for_inference(model)
# Single prompt inference
def generate_response(instruction, input_text=""):
prompt = alpaca_prompt.format(instruction, input_text, "")
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(
**inputs,
max_new_tokens=512,
temperature=0.7,
top_p=0.9,
repetition_penalty=1.15,
do_sample=True,
)
response = tokenizer.batch_decode(outputs)[0]
response = response.split("### Response:\n")[-1]
response = response.replace(tokenizer.eos_token, "").strip()
return response
# Test
test_questions = [
"Explain the traditional holidays of Korea.",
"Explain how decorators work in Python.",
"Give me tips for healthy eating habits.",
]
for q in test_questions:
print(f"Q: {q}")
print(f"A: {generate_response(q)}")
print("-" * 80)
8.2 Generation Parameter Tuning
# Generation parameter effects
generation_configs = {
"Factual answers": {
"temperature": 0.1,
"top_p": 0.9,
"repetition_penalty": 1.0,
},
"Creative answers": {
"temperature": 0.8,
"top_p": 0.95,
"repetition_penalty": 1.15,
},
"Balanced answers": {
"temperature": 0.5,
"top_p": 0.9,
"repetition_penalty": 1.1,
},
}
8.3 lm-eval-harness Benchmark
# Benchmark evaluation with lm-eval-harness
pip install lm-eval
# Korean benchmark evaluation
lm_eval --model hf \
--model_args pretrained=./merged_model \
--tasks kobest_boolq,kobest_copa,kobest_hellaswag,kobest_sentineg,kobest_wic \
--batch_size 4 \
--output_path ./eval_results
# Run from Python
from lm_eval import evaluator
results = evaluator.simple_evaluate(
model="hf",
model_args="pretrained=./merged_model",
tasks=["kobest_boolq", "kobest_copa", "kobest_hellaswag"],
batch_size=4,
)
for task, metrics in results["results"].items():
print(f"{task}: acc={metrics.get('acc', 'N/A')}")
9. Advanced Techniques
9.1 Multi-GPU Training (DeepSpeed ZeRO)
Note: the open-source Unsloth build targets single-GPU training; for multi-GPU runs, the DeepSpeed ZeRO recipe below applies to a standard HuggingFace/TRL training script.
# deepspeed_config.json
"""
{
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"allgather_partitions": true,
"reduce_scatter": true
},
"bf16": {
"enabled": true
},
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto"
}
"""
# Execute
# deepspeed --num_gpus 4 train.py --deepspeed deepspeed_config.json
9.2 DPO Training
from trl import DPOTrainer, DPOConfig
from unsloth import FastLanguageModel, PatchDPOTrainer
# Apply DPO patch
PatchDPOTrainer()
# Prepare DPO dataset
dpo_dataset = load_dataset("argilla/ultrafeedback-binarized-preferences", split="train")
# Configure DPO Trainer
dpo_trainer = DPOTrainer(
model=model,
ref_model=None, # None in Unsloth (auto-handled)
args=DPOConfig(
output_dir="./dpo_output",
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
learning_rate=5e-7, # DPO uses lower LR
num_train_epochs=1,
beta=0.1, # DPO beta (KL divergence weight)
fp16=not is_bfloat16_supported(),
bf16=is_bfloat16_supported(),
logging_steps=1,
),
train_dataset=dpo_dataset,
tokenizer=tokenizer,
)
dpo_trainer.train()
9.3 Continued Pre-training (Domain Adaptation)
# Continued Pre-training with domain-specific text
from trl import SFTTrainer
# Domain text data (medical, legal, financial, etc.)
domain_dataset = load_dataset("my-org/medical-korean-corpus", split="train")
# Use lower learning rate for Continued Pre-training
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=domain_dataset,
dataset_text_field="text",
max_seq_length=4096, # Long documents
packing=True, # Use packing for efficiency
args=TrainingArguments(
output_dir="./cpt_output",
learning_rate=2e-5, # Very low learning rate
num_train_epochs=1, # 1 epoch is sufficient
per_device_train_batch_size=1,
gradient_accumulation_steps=8,
optim="adamw_8bit",
warmup_ratio=0.1,
),
)
trainer.train()
10. Common Issues and Solutions
10.1 OOM (Out of Memory) Errors
# Symptom: CUDA out of memory
# RuntimeError: CUDA out of memory.
# Solutions in order:
# 1. Reduce batch_size
per_device_train_batch_size = 1 # Minimum
# 2. Increase gradient_accumulation_steps
gradient_accumulation_steps = 8
# 3. Reduce max_seq_length
max_seq_length = 1024 # 2048 -> 1024
# 4. Reduce LoRA rank
r = 8 # 16 -> 8
# 5. Verify gradient checkpointing
use_gradient_checkpointing = "unsloth"
# 6. Clear cache
torch.cuda.empty_cache()
import gc
gc.collect()
10.2 NaN Loss
# Symptom: loss diverges to NaN
# Cause: Learning rate too high or data issues
# Solutions:
# 1. Lower learning rate
learning_rate = 1e-5 # 2e-4 -> 1e-5
# 2. Set max_grad_norm
max_grad_norm = 0.3 # Gradient clipping
# 3. Validate data
def check_data_issues(dataset, tokenizer):
"""Check for data problems"""
issues = []
for i, item in enumerate(dataset):
text = item["text"]
if not text.strip():
issues.append(f"[{i}] Empty text")
tokens = tokenizer.encode(text)
if len(tokens) > 4096:
issues.append(f"[{i}] Text too long: {len(tokens)} tokens")
if not any(c.isalnum() for c in text):
issues.append(f"[{i}] No valid text content")
return issues
10.3 Catastrophic Forgetting
# Symptom: Existing knowledge disappears after fine-tuning
# Solutions:
# 1. Use lower learning rate
learning_rate = 5e-5
# 2. Fewer epochs (1-3)
num_train_epochs = 1
# 3. Mix general knowledge into training data
# Original data 80% + general knowledge data 20%
# 4. Lower LoRA rank (limits change magnitude)
r = 8
# 5. Stronger regularization
weight_decay = 0.1
10.4 Overfitting Detection
# Overfitting indicators:
# 1. Train loss decreases but eval loss increases
# 2. Model output nearly memorizes training data
# 3. Performance degrades on new prompts
# Solutions:
# 1. Increase data volume
# 2. Regularization (dropout, weight_decay)
# 3. Early stopping
from transformers import EarlyStoppingCallback
trainer = SFTTrainer(
# ... model, tokenizer, train_dataset as in section 5.1 ...
eval_dataset=eval_dataset, # early stopping needs a held-out eval set
callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
args=TrainingArguments(
evaluation_strategy="steps",
eval_steps=50,
save_strategy="steps", # must match evaluation_strategy
load_best_model_at_end=True,
metric_for_best_model="eval_loss",
),
)
11. Quiz
Q1. With LoRA r=16, what percentage of parameters are trained compared to the original weights?
Answer: About 0.5% (99.5% savings)
For d=4096:
- Full: 4096 x 4096 = 16,777,216
- LoRA r=16: (4096 x 16) + (16 x 4096) = 131,072
- Ratio: 131,072 / 16,777,216 = 0.78%
In practice, applying to multiple modules (q, k, v, o, gate, up, down) results in about 0.5% of total parameters.
Q2. Why is QLoRA's NF4 quantization better than standard INT4?
Answer: Optimal quantization leveraging the normal distribution characteristics of weights
NF4 exploits the fact that neural network weights generally follow a normal distribution. By placing 16 quantization values at the quantiles of the normal distribution, it achieves less information loss than uniformly-spaced INT4. Theoretically, it achieves near-optimal quantization for normally distributed data.
Q3. What is the core reason Unsloth is 2x faster than HuggingFace PEFT?
Answer: Custom Triton kernels
Unsloth replaces core operations like Attention, MLP, and Cross-Entropy Loss with custom GPU kernels written in Triton. These kernels optimize memory access patterns and reduce unnecessary intermediate tensor creation, achieving 2x faster training and 60% memory savings.
Q4. What is the principle and tradeoff of Gradient Checkpointing?
Answer:
Principle: Instead of storing intermediate activations in memory during the forward pass, they are recomputed on-demand during the backward pass.
Tradeoff:
- Benefit: VRAM usage reduced by approximately 30-50%
- Cost: Training time increases by approximately 20-30% due to recomputation
Unsloth's custom gradient checkpointing is more efficient than the standard PyTorch implementation, resulting in less time overhead.
Q5. What are the differences between GGUF Q4_K_M and Q8_0, and when should each be used?
Answer:
Q4_K_M (4-bit K-quant, medium variant):
- File size: About 28% of original (approximately 4.5GB for 8B models)
- Quality: Slight performance degradation from original
- Speed: Fast
- Recommended for: Daily use, mobile/edge deployment, limited VRAM/RAM environments
Q8_0 (8-bit):
- File size: About 50% of original (approximately 8GB for 8B models)
- Quality: Very close to original
- Speed: Slower than Q4
- Recommended for: Quality-first use cases, environments with sufficient memory, services requiring accurate inference
12. References
- LoRA: Low-Rank Adaptation of Large Language Models - Hu et al., 2021
- QLoRA: Efficient Finetuning of Quantized LLMs - Dettmers et al., 2023
- Unsloth Documentation - github.com/unslothai/unsloth
- PEFT: Parameter-Efficient Fine-Tuning - HuggingFace
- TRL: Transformer Reinforcement Learning - HuggingFace
- Flash Attention 2 - Dao et al., 2023
- LLM.int8(): 8-bit Matrix Multiplication - Dettmers et al., 2022
- llama.cpp - github.com/ggerganov/llama.cpp
- GPTQ: Accurate Post-Training Quantization - Frantar et al., 2022
- DeepSpeed ZeRO - Rajbhandari et al., 2020
- Direct Preference Optimization - Rafailov et al., 2023
- Scaling Data-Constrained Language Models - Muennighoff et al., 2023
- Training Compute-Optimal Large Language Models (Chinchilla) - Hoffmann et al., 2022
- The Llama 3 Herd of Models - Meta AI, 2024