Practical LLM Fine-Tuning — Building Your Own Model with LoRA, QLoRA, and PEFT


Introduction

Training a 7B, 13B, or 70B parameter LLM from scratch requires dozens to hundreds of GPUs and millions of dollars. However, with fine-tuning, you can build your own specialized model using just a single consumer-grade GPU.

This article covers practical fine-tuning methods using LoRA, QLoRA, and the PEFT library.

Full Fine-tuning vs Parameter-Efficient Fine-tuning

The Problem with Full Fine-tuning

To fully fine-tune a 7B model, you need:

  • Model parameters: 7B x 4 bytes (FP32) = 28GB
  • Optimizer states: Adam requires 2x the parameters = 56GB
  • Gradients: Same as parameters = 28GB
  • Total VRAM: Approximately 112GB or more

Even a single A100 80GB is not enough.
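The estimate above is simple arithmetic, and can be sketched as a back-of-the-envelope calculator (a rough rule of thumb only: weights + gradients + Adam's two moment buffers, ignoring activations and framework overhead; the function name is our own):

```python
def full_finetune_vram_gb(params_billion: float, bytes_per_param: int = 4) -> float:
    """Rough VRAM floor for full fine-tuning with Adam, all tensors in FP32."""
    weights = params_billion * bytes_per_param  # 7B params x 4 bytes = 28 GB
    gradients = weights                         # one gradient per parameter
    optimizer = 2 * weights                     # Adam keeps two moments per parameter
    return weights + gradients + optimizer

print(full_finetune_vram_gb(7))   # 112.0 -- matches the ~112GB figure above
```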

The Emergence of PEFT

Parameter-Efficient Fine-tuning (PEFT) trains only 0.1-1% of the total parameters:

Method           | Trainable Parameter Ratio | VRAM (7B model)
Full Fine-tuning | 100%                      | ~112GB
LoRA             | ~0.1-1%                   | ~16GB
QLoRA            | ~0.1-1%                   | ~6GB

LoRA: Low-Rank Adaptation

Mathematical Principles

The core idea of LoRA is that the weight update matrix ΔW learned during fine-tuning is low-rank, so it can be approximated by the product of two much smaller matrices.

Original linear transformation:

y = Wx

With LoRA applied:

y = Wx + \frac{\alpha}{r} \cdot BAx

Where:

  • W \in \mathbb{R}^{d \times d}: Original weights (frozen)
  • B \in \mathbb{R}^{d \times r}: Low-rank matrix (trainable)
  • A \in \mathbb{R}^{r \times d}: Low-rank matrix (trainable)
  • r: Rank (typically 4-64, very small compared to the original dimension)
  • \alpha: Scaling factor

Original W (4096 x 4096) = ~16.8M parameters [frozen]

LoRA:
A (r x 4096) + B (4096 x r) = 8192r parameters
With r=8: 65,536 parameters (~0.4% of the original)
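The parameter count above works out as follows (a minimal sketch; `lora_params` is our own helper name, not part of the PEFT API):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA pair: A (r x d_in) + B (d_out x r)."""
    return r * d_in + d_out * r

full = 4096 * 4096                     # frozen W: 16,777,216 parameters
added = lora_params(4096, 4096, r=8)   # 65,536 trainable parameters
print(added, f"{added / full:.2%}")    # 65536 0.39%
```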

Code Implementation

import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Load the base model
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 2. LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                          # Rank
    lora_alpha=32,                 # Scaling (typically 2x the rank)
    lora_dropout=0.05,             # Dropout
    target_modules=[               # Modules to apply LoRA to
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    bias="none"
)

# 3. Create PEFT model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5194

Guide to Selecting target_modules

# Check all Linear layers in the model
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        print(name, module.in_features, module.out_features)

# Common choices:
# - Attention only: ["q_proj", "v_proj"] — Minimum VRAM
# - Full attention: ["q_proj", "k_proj", "v_proj", "o_proj"] — Recommended
# - Including MLP: above + ["gate_proj", "up_proj", "down_proj"] — Maximum performance

QLoRA: 4-bit Quantization + LoRA

What Makes QLoRA Special

QLoRA enables fine-tuning large models on consumer GPUs through three innovations:

  1. 4-bit NormalFloat (NF4): Quantization optimized for normally distributed weights
  2. Double Quantization: Further memory savings by quantizing the quantization constants themselves
  3. Paged Optimizers: Automatic paging to CPU when GPU memory runs out
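The weights-only footprint behind these savings is simple arithmetic (a sketch with our own function name; it ignores the LoRA adapters, activations, and the quantization constants that Double Quantization exists to shrink):

```python
def weight_storage_gb(params_billion: float, bits_per_param: float) -> float:
    """Storage for the frozen base weights alone, in GB."""
    return params_billion * bits_per_param / 8

fp16 = weight_storage_gb(7, 16)   # 14.0 GB
nf4 = weight_storage_gb(7, 4)     # 3.5 GB
print(fp16, nf4, fp16 / nf4)      # 4-bit NF4 is a 4x reduction over FP16
```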

Implementation

from transformers import BitsAndBytesConfig
import torch

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16, # Computation in bf16
    bnb_4bit_use_double_quant=True,        # Double Quantization
)

# Load the quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto"
)

# Apply LoRA (QLoRA = 4-bit quantized model + LoRA)
model = get_peft_model(model, lora_config)

VRAM Usage Comparison

Model | Full FP16 | LoRA FP16 | QLoRA 4-bit
7B    | ~28GB     | ~16GB     | ~6GB
13B   | ~52GB     | ~30GB     | ~10GB
70B   | ~280GB    | ~160GB    | ~48GB

With QLoRA, you can fine-tune models up to 13B on a single RTX 3090/4090 (24GB).

Data Preparation and Training

Dataset Format

Data format for instruction fine-tuning:

from datasets import load_dataset

# Alpaca-style dataset
dataset = load_dataset("json", data_files="train_data.json")

# Data example
# {
#   "instruction": "Please summarize the following text.",
#   "input": "Kubernetes is a platform for containerized workloads and services...",
#   "output": "Kubernetes is a container orchestration platform."
# }

# Apply prompt template
def format_instruction(sample):
    if sample["input"]:
        text = f"""### Instruction:
{sample["instruction"]}

### Input:
{sample["input"]}

### Response:
{sample["output"]}"""
    else:
        text = f"""### Instruction:
{sample["instruction"]}

### Response:
{sample["output"]}"""
    return {"text": text}

dataset = dataset.map(format_instruction)

Training with SFTTrainer

from trl import SFTTrainer, SFTConfig

training_args = SFTConfig(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,    # Effective batch = 4 x 4 = 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    max_seq_length=2048,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    optim="paged_adamw_8bit",         # Paged optimizer for QLoRA
    gradient_checkpointing=True,      # Additional VRAM savings
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    args=training_args,
)

trainer.train()

Saving and Merging the Model After Training

# Save only the LoRA adapter (tens of MB)
model.save_pretrained("./lora-adapter")

# Load the adapter later
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base_model, "./lora-adapter")

# Merge adapter with the base model (for deployment)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")

Hyperparameter Tuning Guide

Choosing the LoRA Rank (r)

# Characteristics by rank
# r=4:   Fewest parameters, suitable for simple domain adaptation
# r=8:   Typical starting point
# r=16:  Good balance (recommended)
# r=32:  Complex tasks, requires more VRAM
# r=64+: Approaches full fine-tuning quality, but with diminishing efficiency

# Empirically, r=16 + alpha=32 works well in most cases

Learning Rate

# LoRA/QLoRA learning rates are set higher than full FT
# Full FT: 1e-5 ~ 5e-5
# LoRA:    1e-4 ~ 3e-4
# QLoRA:   2e-4 (typical)

Advanced Techniques

DoRA: Weight-Decomposed Low-Rank Adaptation

An evolution of LoRA that decomposes weights into magnitude and direction:

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    use_dora=True,   # Enable DoRA
)

Combining Multiple LoRA Adapters

from peft import PeftModel

# Load multiple adapters on the base model
model = PeftModel.from_pretrained(base_model, "./adapter-korean")
model.load_adapter("./adapter-code", adapter_name="code")

# Switch adapters
model.set_adapter("code")

# Or combine adapter weights
model.add_weighted_adapter(
    adapters=["default", "code"],
    weights=[0.7, 0.3],
    adapter_name="merged"
)

Conclusion

LoRA and QLoRA have revolutionized the accessibility of LLM fine-tuning. The ability to customize multi-billion parameter models on a single consumer GPU is at the heart of AI democratization.

Key takeaways:

  • LoRA: Reduces trainable parameters to 0.1-1% through low-rank decomposition
  • QLoRA: Saves an additional ~4x VRAM through 4-bit quantization
  • PEFT: Hugging Face library that enables implementation with just a few lines of code

Go ahead and prepare your data, train, and deploy. It is much easier than you might think.

Quiz

Q1: What does the rank (r) mean in LoRA? It is the low-dimensional size used when decomposing the weight update matrix ΔW. A smaller r means fewer trainable parameters and less VRAM consumption, but it also limits the model's expressiveness.

Q2: How much VRAM is approximately needed to fully fine-tune a 7B model? Approximately 112GB or more. This includes model parameters (28GB) + optimizer states (56GB) + gradients (28GB).

Q3: What are the three key innovations of QLoRA? 1) 4-bit NormalFloat (NF4) quantization, 2) Double Quantization (quantizing the quantization constants themselves), 3) Paged Optimizers (automatic GPU-to-CPU paging).

Q4: What is the role of the lora_alpha parameter in LoRA? It is the scaling factor for LoRA updates. The actual scale is calculated as alpha/r, and it is typically set to 2x the rank (e.g., alpha=32 when r=16).

Q5: Why is NF4 quantization used in QLoRA better than standard INT4? Since neural network weights approximately follow a normal distribution, NF4 quantization, which is optimized for normal distributions, incurs less information loss than INT4, which assumes a uniform distribution.

Q6: Why would you merge a LoRA adapter with the base model? To eliminate additional computational overhead during inference. Once merged, the model has the same structure as the original, resulting in faster inference compared to keeping the adapter separate.

Q7: What is the effect of gradient_checkpointing=True? It avoids storing intermediate activations from the forward pass in memory, recomputing them during the backward pass instead. This saves VRAM but increases training time by approximately 20-30%.

Q8: How does the choice of target_modules affect performance in LoRA? Applying LoRA to more modules improves performance but increases VRAM usage and training time. Applying it only to q_proj and v_proj in the attention layer is the minimal configuration, while including the MLP layers yields maximum performance.