Fine-tuning in Practice: Building Your Own Model with LoRA and QLoRA


Why LoRA Instead of Full Fine-tuning?

The first wall you hit when exploring fine-tuning is VRAM requirements.

Full Fine-tuning Llama 3.1 70B:
- VRAM required: ~560GB+ in FP32 (weights and gradients alone; Adam optimizer state adds even more)
  -> Needs a multi-GPU node of H100 80GB cards
- Training time: days
- Cloud cost: thousands of dollars

LoRA Fine-tuning Llama 3.1 70B:
- VRAM required: ~140GB (FP16 base); with QLoRA: ~40-48GB
  -> A single 48GB GPU (A100/A6000) handles QLoRA; an RTX 3090 handles 8B-class models
- Training time: hours on 1 GPU
- Cloud cost: $20-100 (A100 rental)

How is this gap possible? Understanding LoRA's core idea makes it clear.


How LoRA Works: Intuitive Explanation

Full fine-tuning updates every weight in the model. For Llama 3.1 70B, that's 70 billion parameters all changing. Storing and optimizing that requires enormous memory.

LoRA (Low-Rank Adaptation) takes a different approach:

Full fine-tuning:
W_new = W_original + delta_W
(delta_W is the same size as the original matrix = 70B parameter updates)

LoRA: decompose delta_W into two small matrices
delta_W = A x B
  A: (d x r) matrix, B: (r x d) matrix
  r = rank (typically 4-64; smaller = more memory-efficient)

Example: d=4096, r=16
- Full delta_W: 4096 x 4096 = 16.7M parameters
- LoRA delta_W: A(4096x16) + B(16x4096) = 131K parameters
- 128x fewer parameters to train for equivalent effect!

During training: freeze W_original, only train A and B
During inference: W_new = W_original + A x B (merge or keep separate)

Empirically, most weight updates have low-rank structure — meaning you don't need to update every parameter to achieve the desired behavior change.
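The decomposition above can be sketched as a minimal PyTorch module. This is illustrative only (the production implementation lives in the `peft` library); the zero initialization of B is the standard trick that makes delta_W start at zero, so training begins from the unmodified base model:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update A @ B."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)              # freeze W_original
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)  # A: (d x r)
        self.B = nn.Parameter(torch.zeros(r, d_out))        # B: (r x d); zero init => delta_W starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # W_new(x) = W_original(x) + (x A B) * scale
        return self.base(x) + (x @ self.A @ self.B) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 131072 = 4096*16 + 16*4096, matching the arithmetic above
```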


QLoRA: Even Less Memory

QLoRA = LoRA + 4-bit quantized base model

Standard LoRA:
- Base model: FP16
- LoRA adapters: FP16
- VRAM for 70B model: ~140GB

QLoRA:
- Base model: 4-bit (NF4 quantization)
- LoRA adapters: BF16/FP16 (full precision)
- VRAM for 70B model: ~40-48GB (the original QLoRA paper fine-tuned a 65B model on a single 48GB GPU)

The quality loss from 4-bit quantization is surprisingly small, especially after fine-tuning compensates for it. QLoRA was introduced in a 2023 paper by Tim Dettmers et al. and effectively democratized LLM fine-tuning.
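A back-of-the-envelope check of where the savings come from. This counts weights only; a real training run also needs activations, gradients and optimizer state for the adapters, and quantization constants:

```python
# Weight memory for a 70B-parameter model at different precisions (weights only)
params = 70e9

fp32_gb = params * 4 / 1e9    # 4 bytes/param
fp16_gb = params * 2 / 1e9    # 2 bytes/param -- the "standard LoRA" base above
nf4_gb  = params * 0.5 / 1e9  # 4 bits/param  -- the QLoRA base

print(fp32_gb, fp16_gb, nf4_gb)  # 280.0 140.0 35.0
```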


Production Code: Fine-tuning Llama 3.1 8B End-to-End

Working code using the Hugging Face ecosystem.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset, Dataset

# ============================================================
# Step 1: Load model with 4-bit quantization (QLoRA)
# ============================================================
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 beats FP4 in quality
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16
    bnb_4bit_use_double_quant=True          # extra memory savings
)

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # Important: left-padding causes instability

# ============================================================
# Step 2: Configure LoRA
# ============================================================
lora_config = LoraConfig(
    r=16,                   # rank: higher = more expressive, more VRAM
    lora_alpha=32,          # scaling factor (typically 2x rank)
    target_modules=[        # which layers to apply LoRA to
        "q_proj", "v_proj", "k_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"  # include MLP layers too
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output (approximate, for r=16 on all seven modules):
# trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.52
# Well under 1% of parameters are trained -- the rest stay frozen

# ============================================================
# Step 3: Prepare dataset
# ============================================================
raw_data = [
    {
        "instruction": "Classify the sentiment of the following customer review.",
        "input": "The shipping took 3 extra days with no communication. Very disappointing.",
        "output": "Negative (frustrated, disappointed)"
    },
    # ... more examples
]

def format_instruction(example):
    """Convert to Llama 3.1 instruction format"""
    if example.get("input"):
        text = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
{example['instruction']}

{example['input']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{example['output']}<|eot_id|>"""
    else:
        text = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
{example['instruction']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{example['output']}<|eot_id|>"""
    return {"text": text}

dataset = Dataset.from_list(raw_data)
dataset = dataset.map(format_instruction)
train_test = dataset.train_test_split(test_size=0.1)

# ============================================================
# Step 4: Train
# ============================================================
training_args = SFTConfig(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,     # effective batch = 4 x 4 = 16
    gradient_checkpointing=True,       # save VRAM (~20% speed tradeoff)
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,                         # match the BF16 compute dtype set above
    logging_steps=10,
    eval_steps=50,
    save_steps=100,
    eval_strategy="steps",
    load_best_model_at_end=True,
    max_seq_length=2048,
    dataset_text_field="text",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_test["train"],
    eval_dataset=train_test["test"],
    tokenizer=tokenizer,
)

trainer.train()

# ============================================================
# Step 5: Save and optionally merge
# ============================================================
# Save only the LoRA adapter (small file)
model.save_pretrained("./lora-adapter")
tokenizer.save_pretrained("./lora-adapter")

# Optional: merge into the base model for deployment
# Note: with a 4-bit base, reload the base model in FP16/BF16 and re-attach
# the adapter before merging, to avoid baking quantization error into the weights
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")

Building Your Dataset: The Most Important Part

Data matters more than code. Roughly 80% of fine-tuning failures trace back to data problems.

Quality vs Quantity

The field has validated this repeatedly: 1,000 high-quality examples beat 100,000 noisy ones.

# Criteria for good training data
good_data_checklist = {
    "consistency": "Same question always answered in the same style",
    "diversity": "Covers all use cases you care about",
    "accuracy": "No incorrect information",
    "format": "Matches target model response style",
    "length": "Not too short, not unnecessarily long",
}

# Simple data quality check
def check_data_quality(examples):
    issues = []
    for i, ex in enumerate(examples):
        if len(ex["output"]) < 10:
            issues.append(f"Example {i}: response too short")
        if len(ex["output"]) > 2000:
            issues.append(f"Example {i}: response too long")
        if i > 0 and ex["output"] == examples[i-1]["output"]:
            issues.append(f"Example {i}: possible duplicate response")
    return issues

Data sources:

  • Manual authoring: most expensive, highest quality
  • GPT-4 generation + human review: balanced approach (check license terms!)
  • Production logs: reflects real usage, but needs cleaning
  • Public datasets + custom combination: efficient

Unsloth: 2x Faster LoRA Training

Unsloth optimizes LoRA training at the kernel level. On the same hardware it reports roughly 2x faster training than standard PEFT, with up to ~70% less VRAM.

from unsloth import FastLanguageModel
import torch

# Much faster than standard transformers + PEFT
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# LoRA config with Unsloth optimizations
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,     # Unsloth recommends 0
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's optimized checkpointing
    random_state=3407,
)

The speedup is especially noticeable on constrained GPUs. Check the official Unsloth GitHub for per-model optimal configurations.


Hyperparameter Tuning Guide

When results aren't what you expected, check these first:

Learning Rate:
- Too high: catastrophic forgetting (model loses existing capabilities)
- Too low: doesn't learn target behavior
- Recommended range: 1e-4 to 3e-4 (LoRA can use higher LR than full FT)

LoRA Rank (r):
- r=4: fast and lightweight, simple tasks
- r=8: good starting point for most use cases
- r=16: complex tasks, style learning
- r=64: approaching full fine-tuning quality, VRAM increases

Number of epochs:
- Small dataset (< 1,000 examples): 3-5 epochs
- Medium dataset (1,000-10,000 examples): 1-3 epochs
- Large dataset (> 10,000 examples): 1 epoch often sufficient

Effective batch size = per_device_batch x gradient_accumulation:
- Too small: unstable training
- Too large: risk of overfitting
- Typically 16-32 recommended
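The rules of thumb above can be collected into a small helper. The thresholds and defaults are just this guide's heuristics restated as code, not anything principled; `suggest_hyperparams` is a hypothetical name:

```python
def suggest_hyperparams(num_examples: int) -> dict:
    """Starting-point LoRA hyperparameters from the rules of thumb above."""
    if num_examples < 1_000:
        epochs = 4   # small dataset: 3-5 epochs
    elif num_examples <= 10_000:
        epochs = 2   # medium dataset: 1-3 epochs
    else:
        epochs = 1   # large dataset: 1 epoch is often enough
    return {
        "num_train_epochs": epochs,
        "learning_rate": 2e-4,             # within the 1e-4 to 3e-4 range
        "lora_r": 8,                       # good default starting rank
        "per_device_train_batch_size": 4,
        "gradient_accumulation_steps": 4,  # effective batch = 4 x 4 = 16
    }

print(suggest_hyperparams(500)["num_train_epochs"])     # 4
print(suggest_hyperparams(50_000)["num_train_epochs"])  # 1
```

Treat the output as a starting point for the first run, then adjust based on eval loss.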

Common Mistakes and Fixes

Mistake 1: Catastrophic Forgetting

After fine-tuning, the model loses general capabilities.

Fix: Mix 10-20% general-purpose examples into your training data (rehearsal mixing).
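A minimal sketch of that rehearsal mixing, using plain lists of examples like the `raw_data` format above (the 15% ratio and the function name are illustrative):

```python
import random

def rehearsal_mix(task_examples, general_examples, ratio=0.15, seed=42):
    """Blend ~10-20% general-purpose examples into task data to curb forgetting."""
    # Choose n_general so general examples make up ~`ratio` of the final mix
    n_general = min(int(len(task_examples) * ratio / (1 - ratio)),
                    len(general_examples))
    rng = random.Random(seed)
    mixed = list(task_examples) + rng.sample(list(general_examples), n_general)
    rng.shuffle(mixed)
    return mixed

task = [{"output": f"task {i}"} for i in range(850)]
general = [{"output": f"general {i}"} for i in range(5000)]
mixed = rehearsal_mix(task, general)
print(len(mixed))  # 1000 (850 task + 150 general, i.e. 15% general)
```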

Mistake 2: Overfitting

Training loss keeps decreasing but real-world performance degrades.

# Use early stopping
training_args = SFTConfig(
    # ...other args as in Step 4,
    eval_strategy="steps",
    eval_steps=50,
    load_best_model_at_end=True,   # auto-select best checkpoint
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

Mistake 3: Chat Template Mismatch

Each model expects a different chat format. Llama 3.1, Mistral, and Qwen all differ.

# Use tokenizer.apply_chat_template instead of manual formatting
messages = [
    {"role": "system", "content": "You are a professional translator."},
    {"role": "user", "content": "Translate this to French: Hello"},
    {"role": "assistant", "content": "Bonjour"}
]
formatted = tokenizer.apply_chat_template(messages, tokenize=False)

Fine-tuning vs Prompt Engineering: When Do You Need Fine-tuning?

Fine-tuning isn't always the answer. Consider it when:

Fine-tuning is justified:

  • Injecting proprietary domain knowledge (medical, legal, internal docs)
  • Consistent response format/style (always JSON, specific tone)
  • Behavior changes that prompting can't achieve reliably
  • Cost reduction (a small fine-tuned model can beat a large GPT-4 at lower cost)

Prompt engineering is sufficient:

  • Standard tasks (summarization, translation, Q&A)
  • Rapid prototyping
  • When you don't have enough training data

General advice: push prompt engineering as far as it'll go first. Only reach for fine-tuning when you've hit its limits.


Conclusion

LoRA and QLoRA have transformed LLM fine-tuning from a large-company privilege into a tool any developer can use.

Key takeaways:

  • QLoRA: 70B-class models fine-tunable on a single ~48GB GPU
  • Data quality: matters more than code. 1,000 good examples is enough to start
  • Unsloth: 2x faster training on the same hardware
  • Start small: validate with an 8B model, then scale to 70B

There's something satisfying about watching a model you fine-tuned outperform GPT-4 on your specific task. Worth experiencing firsthand.