LLM Fine-tuning Practical Guide: Efficient Model Adaptation with LoRA, QLoRA, and PEFT


Introduction

Fine-tuning pre-trained Large Language Models (LLMs) to specific domains and tasks is a core technique in LLM deployment. However, fully fine-tuning models with billions of parameters requires enormous GPU memory and compute resources. For GPT-3 175B, full fine-tuning with Adam optimizer requires approximately 1.2TB of GPU memory, making it impractical for most organizations.

Parameter-Efficient Fine-Tuning (PEFT) techniques emerged to solve this problem. In particular, LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) reduce the number of trainable parameters to less than 0.1% of the original model while achieving performance on par with full fine-tuning. This guide systematically covers the theoretical foundations through production-level implementation of these efficient fine-tuning methods.

The Shifting Fine-tuning Paradigm

Limitations of Full Fine-tuning

Traditional fine-tuning updates all parameters of a pre-trained model. This approach carries fundamental challenges:

  • Memory cost: Model weights + gradients + optimizer states must all reside in GPU memory
  • Storage cost: A complete model copy must be saved per task. Using a 70B model across 10 tasks requires roughly 1.4TB of storage
  • Catastrophic forgetting: Overfitting on small datasets causes the model to lose general knowledge acquired during pre-training

Classification of PEFT Methods

Parameter-Efficient Fine-Tuning methods fall into three main categories:

| Method | Representative | Principle | Trainable Param % | GPU Memory (7B) | Perf vs Full FT |
|---|---|---|---|---|---|
| Full Fine-tuning | - | Update all params | 100% | ~120GB | Baseline |
| Additive (Adapter) | Adapter, Prefix Tuning | Insert small modules | 0.5-3% | ~30GB | 95-98% |
| Reparameterization | LoRA, QLoRA | Low-rank matrix decomposition | 0.01-0.5% | ~16-28GB | 97-100% |
| Selective | BitFit, Diff Pruning | Train only selected params | 0.05-1% | ~25GB | 90-95% |

LoRA: Mathematical Principles and Implementation

Core Idea of Low-Rank Decomposition

LoRA (Low-Rank Adaptation), proposed by Hu et al. (2021), is based on the key insight that weight updates during fine-tuning can be approximated as the product of low-rank matrices.

For a pre-trained weight matrix W0 of dimension d x k, the update delta_W is decomposed into two low-rank matrices B (d x r) and A (r x k), where r is the rank, much smaller than either d or k.

During the forward pass, the output is computed as: h = W0 * x + (B * A) * x. During training, W0 is frozen and only B and A are learned. The number of trainable parameters drops from d * k to r * (d + k).
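
The decomposition can be sketched without any framework. Below is a minimal, dependency-free illustration of the math only; the `peft`-based code in the next section is what you would use in practice:

```python
import random

def matmul(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

class LoRALinear:
    """Toy LoRA layer: h = W0 v + (alpha/r) * B(A v), with W0 frozen."""

    def __init__(self, W0, r, alpha):
        d, k = len(W0), len(W0[0])
        self.W0 = W0                                  # frozen d x k weight
        self.B = [[0.0] * r for _ in range(d)]        # B starts at zero, so the
        self.A = [[random.gauss(0.0, 0.01)            # initial update B*A is zero
                   for _ in range(k)] for _ in range(r)]
        self.scale = alpha / r                        # effective scaling factor

    def forward(self, v):
        base = matmul(self.W0, v)                     # W0 v
        delta = matmul(self.B, matmul(self.A, v))     # B (A v): two rank-r matmuls
        return [b + self.scale * d_ for b, d_ in zip(base, delta)]

    def trainable_params(self):
        d, k, r = len(self.W0), len(self.W0[0]), len(self.A)
        return r * (d + k)                            # vs d * k for full fine-tuning
```

Because B is initialized to zero, the adapted model is exactly the base model at step 0: LoRA training starts from the pre-trained behavior instead of perturbing it.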

LoRA Implementation

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load base model
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                          # Rank: typically between 8-64
    lora_alpha=32,                 # Scaling factor: usually 2x rank
    lora_dropout=0.05,             # Dropout: prevents overfitting
    target_modules=[               # Modules to apply LoRA to
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",                   # Whether to train bias
)

# Create PEFT model
peft_model = get_peft_model(model, lora_config)

# Check trainable parameters
peft_model.print_trainable_parameters()
# Example output: trainable params: 33,554,432 || all params: 6,771,970,048
# || trainable%: 0.4956

Rank (r) Selection Guide

The rank r is the most critical LoRA hyperparameter:

  • r=4-8: Suitable for simple classification tasks, sentiment analysis. Use when minimizing memory is the priority
  • r=16-32: Recommended range for general instruction tuning and conversational models
  • r=64-128: For complex domain adaptation (medical, legal) or large-scale datasets

The alpha value is typically set to 2x the rank. Since the effective scaling factor is alpha/r, alpha=32 with r=16 yields a scaling of 2.
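
The memory cost of raising the rank can be estimated directly from the r * (d + k) formula. The sketch below uses Llama-2-7B's 4096-dimensional attention projections as an example:

```python
# Back-of-the-envelope trainable-parameter count for LoRA on one weight matrix.
d = k = 4096              # q_proj in Llama-2-7B is 4096 x 4096
full = d * k              # full fine-tuning updates every entry
for r in (8, 16, 64):
    lora = r * (d + k)    # B (d x r) + A (r x k)
    # r=16 -> 131,072 params, 0.78% of full
    print(f"r={r}: {lora:,} params, {100 * lora / full:.2f}% of full")
```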

QLoRA: 4-bit Quantization

The QLoRA Innovation

QLoRA (Dettmers et al., 2023) combines LoRA with 4-bit quantization to dramatically reduce memory usage. It enables fine-tuning a 65B parameter model on a single 48GB GPU, introducing three key techniques:

  1. 4-bit NormalFloat (NF4): An information-theoretically optimal data type for normally distributed weights
  2. Double Quantization: Re-quantizes the quantization constants, saving an additional 0.37 bits per parameter on average
  3. Paged Optimizers: Automatically pages optimizer states to CPU RAM during GPU memory spikes
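
To build intuition for the first technique, here is a deliberately simplified blockwise absmax quantizer. It uses uniform 4-bit levels rather than the actual NF4 codebook, but shows the per-block scaling structure (block size 64, as in QLoRA):

```python
def quantize_block(block):
    """Symmetric 4-bit absmax quantization of one block (simplified; real NF4
    uses a non-uniform codebook optimized for normally distributed weights)."""
    absmax = max(abs(w) for w in block) or 1.0
    q = [round(w / absmax * 7) for w in block]   # map to signed 4-bit range [-7, 7]
    return q, absmax                             # absmax is the quantization constant

def dequantize_block(q, absmax):
    """Recover approximate weights from 4-bit codes and the block's absmax."""
    return [qi / 7 * absmax for qi in q]

def quantize(weights, block_size=64):
    """Split a flat weight list into blocks and quantize each independently."""
    blocks = [weights[i:i + block_size] for i in range(0, len(weights), block_size)]
    return [quantize_block(b) for b in blocks]
```

The per-block `absmax` constants are what Double Quantization compresses further: stored in FP32 they cost 32/64 = 0.5 bits per parameter, which QLoRA reduces by quantizing the constants themselves.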

QLoRA Training Script

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16, # Use bfloat16 for computation
    bnb_4bit_use_double_quant=True,       # Enable Double Quantization
)

# Load 4-bit quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Prepare model for k-bit training (gradient checkpointing, etc.)
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)

# Training arguments
training_args = TrainingArguments(
    output_dir="./qlora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    fp16=False,
    bf16=True,
    optim="paged_adamw_8bit",            # Use Paged Optimizer
    gradient_checkpointing=True,
    max_grad_norm=0.3,
    report_to="wandb",
)

# Train with SFTTrainer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,   # Dataset with a formatted "text" column
    tokenizer=tokenizer,
    dataset_text_field="text",     # which column to train on
    max_seq_length=2048,
)

trainer.train()

Memory Usage Comparison

The memory savings of QLoRA are dramatic:

| Model Size | Full FT (FP16) | LoRA (FP16) | QLoRA (NF4) |
|---|---|---|---|
| 7B | ~120GB | ~28GB | ~6GB |
| 13B | ~220GB | ~52GB | ~10GB |
| 70B | ~1.2TB | ~280GB | ~48GB |
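
These figures can be roughly reproduced from first principles. The estimator below is a coarse sketch: it counts weights, gradients, and Adam optimizer states only (activations and framework overhead are ignored, so measured usage runs higher), and the ~1% trainable-parameter fraction for LoRA is an assumption:

```python
def estimate_gb(n_params_b, method, trainable_frac=0.01):
    """Very rough GPU memory estimate in GB for a model of n_params_b billion params."""
    n = n_params_b * 1e9
    if method == "full_ft":
        # fp16 weights + fp16 grads + fp32 Adam m/v + fp32 master copy = 16 bytes/param
        return n * (2 + 2 + 4 + 4 + 4) / 1e9
    if method == "lora":
        # fp16 frozen weights; grads + optimizer states only for the LoRA params
        return (n * 2 + trainable_frac * n * (2 + 4 + 4 + 4)) / 1e9
    if method == "qlora":
        # 4-bit frozen weights (~0.5 byte/param) + the same small LoRA states
        return (n * 0.5 + trainable_frac * n * (2 + 4 + 4 + 4)) / 1e9
    raise ValueError(method)

for size in (7, 13, 70):
    print(size, round(estimate_gb(size, "full_ft")), round(estimate_gb(size, "qlora"), 1))
```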

Working with the PEFT Library

Hugging Face PEFT Library Overview

The Hugging Face PEFT library provides a unified interface for various parameter-efficient fine-tuning methods. It integrates tightly with Transformers, Accelerate, and TRL, allowing minimal code changes to existing workflows.

# Install PEFT
# pip install peft transformers accelerate bitsandbytes trl

# Use different PEFT methods through the same interface
from peft import (
    LoraConfig,
    PrefixTuningConfig,
    PromptTuningConfig,
    IA3Config,
    get_peft_model,
)

# LoRA
lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

# Prefix Tuning
prefix_config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,
)

# Prompt Tuning
prompt_config = PromptTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,
    prompt_tuning_init="TEXT",
    prompt_tuning_init_text="Classify the following text:",
    tokenizer_name_or_path="meta-llama/Llama-2-7b-hf",
)

# IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations)
ia3_config = IA3Config(
    task_type="CAUSAL_LM",
    target_modules=["k_proj", "v_proj", "down_proj"],
    feedforward_modules=["down_proj"],
)

Saving and Loading Adapters

A major advantage of PEFT is saving and loading adapters separately. A LoRA adapter for a 7B model is only about 30-100MB.

from peft import PeftModel, PeftConfig

# Save adapter (~30-100MB)
peft_model.save_pretrained("./my-lora-adapter")

# Load adapter: combine base model + adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")

# Inference optimization: merge LoRA weights into base model
model = model.merge_and_unload()

# Save merged model (no overhead during inference)
model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")

Dataset Preparation Strategies

Instruction Tuning Data Format

In instruction tuning, data quality is the single most important factor determining model performance. The following format is commonly used:

from datasets import Dataset

# Alpaca-format dataset construction
def format_instruction(sample):
    """Alpaca-style prompt template"""
    if sample.get("input"):
        return f"""### Instruction:
{sample['instruction']}

### Input:
{sample['input']}

### Response:
{sample['output']}"""
    else:
        return f"""### Instruction:
{sample['instruction']}

### Response:
{sample['output']}"""

# Dataset example
raw_data = [
    {
        "instruction": "Analyze the sentiment of the following text.",
        "input": "This product is amazing! Fast shipping and top quality!",
        "output": "Positive sentiment. The text expresses satisfaction with product quality and shipping speed.",
    },
    {
        "instruction": "Optimize the given SQL query.",
        "input": "SELECT * FROM users WHERE created_at > '2024-01-01' ORDER BY name",
        "output": "SELECT id, name, email FROM users WHERE created_at > '2024-01-01' ORDER BY name LIMIT 100;\n\nOptimization points:\n1. Changed SELECT * to select only needed columns\n2. Added LIMIT to restrict result set\n3. Recommend creating a composite index on created_at and name",
    },
]

dataset = Dataset.from_list(raw_data)
formatted = dataset.map(lambda x: {"text": format_instruction(x)})

Data Quality Checklist

Key principles for building high-quality fine-tuning datasets:

  • Ensure diversity: Balance task types, difficulty levels, and domains to avoid pattern bias
  • Quality verification: At least 2 reviewers cross-validate. Supplement with LLM-based automated quality assessment
  • Appropriate scale: 1,000-10,000 high-quality samples are more effective than 100,000 low-quality ones
  • Format consistency: Maintain consistent instruction, input, output structure across the entire dataset
  • Remove harmful content: Pre-filter samples containing bias, toxic language, or personal information
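
A few of these checks can be automated before training. The sketch below is a hypothetical helper whose field names match the Alpaca-style format used earlier; it drops records with missing fields, trivially short outputs, and exact duplicates:

```python
def filter_samples(samples, min_output_chars=20):
    """Pre-filter instruction-tuning records; returns the samples that pass."""
    seen, kept = set(), []
    for s in samples:
        if not s.get("instruction") or not s.get("output"):
            continue                                   # enforce format consistency
        if len(s["output"]) < min_output_chars:
            continue                                   # drop trivially short answers
        key = (s["instruction"].strip(), s.get("input", "").strip())
        if key in seen:
            continue                                   # remove exact duplicates
        seen.add(key)
        kept.append(s)
    return kept
```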

Hyperparameter Tuning

Key Hyperparameter Guide

Fine-tuning performance is sensitive to hyperparameter settings. Below are field-tested recommended values:

| Parameter | Recommended Range | Description |
|---|---|---|
| Learning Rate | 1e-4 to 3e-4 | 2e-4 is the typical starting point for QLoRA |
| Batch Size (effective) | 32-128 | Adjust via gradient accumulation |
| Epochs | 1-5 | Scale with data size: 3-5 for small, 1-2 for large |
| Warmup Ratio | 0.03-0.1 | 3-10% of total steps |
| Weight Decay | 0.01-0.1 | L2 regularization to prevent overfitting |
| Max Grad Norm | 0.3-1.0 | Gradient clipping threshold |
| LR Scheduler | cosine | Cosine annealing is most stable |
| LoRA r | 8-64 | Increase proportionally to task complexity |
| LoRA alpha | 2 * r | Scaling factor |
| LoRA dropout | 0.05-0.1 | Prevents overfitting |
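
Two of these settings are derived rather than chosen directly: the effective batch size comes from the per-device batch and gradient accumulation, and warmup is a fraction of total steps. A quick calculation (dataset size and GPU count are assumptions for illustration):

```python
# Derive effective batch size and warmup steps from the recommended settings.
n_samples = 10_000        # assumed dataset size
per_device_batch = 4
grad_accum = 4
n_gpus = 1                # assumed hardware
epochs = 3
warmup_ratio = 0.03

effective_batch = per_device_batch * grad_accum * n_gpus   # lands in the 32-128 range? 16 here
steps_per_epoch = n_samples // effective_batch
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * warmup_ratio)             # 3% of total steps
```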

Training Monitoring

# Training monitoring with Weights and Biases
import wandb

wandb.init(
    project="llm-finetuning",
    config={
        "model": "Llama-2-7b",
        "method": "QLoRA",
        "r": 16,
        "alpha": 32,
        "lr": 2e-4,
        "epochs": 3,
    },
)

# Key metrics to monitor:
# 1. Training Loss: Should steadily decrease. Plateaus after sharp drops signal overfitting
# 2. Validation Loss: Growing gap with training loss indicates overfitting
# 3. Learning Rate: Verify the scheduler is behaving as intended
# 4. Gradient Norm: Sudden spikes indicate training instability
# 5. GPU Memory: Track usage to prevent OOM errors

Troubleshooting

Catastrophic Forgetting

The model loses basic general knowledge after fine-tuning.

  • Cause: Over-adapting to small domain data corrupts pre-trained representations
  • Solution 1: Lower the LoRA rank to restrict update scope (r=8 or below)
  • Solution 2: Reduce learning rate to 1e-5 and decrease epochs
  • Solution 3: Mix 10-20% general knowledge data into the training set
  • Solution 4: Increase L2 regularization (weight_decay)
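
Solution 3 can be sketched as follows. The helper is hypothetical and works on plain lists; with Hugging Face `datasets`, `interleave_datasets` with sampling probabilities achieves the same effect:

```python
import random

def mix_datasets(domain, general, general_ratio=0.15, seed=42):
    """Blend general-purpose samples into the domain set so that they make up
    `general_ratio` of the final mix (15% here, inside the 10-20% range above)."""
    # solve g / (len(domain) + g) = general_ratio for g
    n_general = round(len(domain) * general_ratio / (1 - general_ratio))
    rng = random.Random(seed)
    mixed = domain + rng.sample(general, min(n_general, len(general)))
    rng.shuffle(mixed)
    return mixed
```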

Overfitting on Small Datasets

Overfitting frequently occurs with fewer than 1,000 samples.

  • Symptom: Training loss converges to 0 while validation loss increases
  • Solution 1: Data augmentation -- use LLM paraphrasing to expand data 2-3x
  • Solution 2: Increase LoRA dropout to 0.1-0.2 and set weight decay above 0.05
  • Solution 3: Reduce epochs to 1-2 and apply early stopping
  • Solution 4: Use a smaller base model (7B instead of 70B)

Quantization Quality Degradation

When using QLoRA, information loss from quantization can impact performance.

  • Symptom: Performance drops 2-5% or more compared to LoRA (FP16) with identical settings
  • Solution 1: Set compute_dtype to bfloat16 (more stable than float16)
  • Solution 2: Increase LoRA rank to compensate for expressiveness (r=32-64)
  • Solution 3: After training, restore to FP16 via merge_and_unload for serving
  • Solution 4: Consider improved quantized fine-tuning methods like IR-QLoRA or Q-BLoRA

Production Checklist

An end-to-end checklist for production-grade LLM fine-tuning:

Before Training

  • Base model selection: Review task characteristics, language, license, and model size
  • Data pipeline: Collection, cleaning, formatting, quality validation, train/val/test split
  • Environment setup: Verify GPU specs, library version compatibility, CUDA version
  • Baseline measurement: Record task performance of the base model before fine-tuning

During Training

  • Monitoring: Real-time tracking of loss curves, gradient norms, GPU memory
  • Checkpointing: Save model at regular intervals, manage best model by validation loss
  • Early stopping: Halt if validation loss shows no improvement for 3-5 consecutive evaluations
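
The early-stopping rule reduces to a small amount of bookkeeping. A framework-free sketch (with transformers, `EarlyStoppingCallback` together with `load_best_model_at_end=True` and `metric_for_best_model="eval_loss"` implements the same logic):

```python
def should_stop(val_losses, patience=3, min_delta=0.0):
    """Return True once validation loss has failed to improve on the best value
    by at least min_delta for `patience` consecutive evaluations."""
    best, since_best = float("inf"), 0
    for loss in val_losses:
        if loss < best - min_delta:
            best, since_best = loss, 0    # new best: reset the counter
        else:
            since_best += 1               # no improvement this evaluation
        if since_best >= patience:
            return True
    return False
```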

After Training

  • Quantitative evaluation: Measure task-specific benchmark scores (BLEU, ROUGE, accuracy, etc.)
  • Qualitative evaluation: Manually review output quality across diverse inputs
  • General capability check: Verify no catastrophic forgetting has occurred
  • Adapter merging: Optimize for serving with merge_and_unload
  • A/B testing: Compare performance against existing models in real usage environments
