- Introduction
- Full Fine-tuning vs Parameter-Efficient Fine-tuning
- LoRA: Low-Rank Adaptation
- QLoRA: 4-bit Quantization + LoRA
- Data Preparation and Training
- Hyperparameter Tuning Guide
- Advanced Techniques
- Conclusion
- Quiz

Introduction
Training a 7B, 13B, or 70B parameter LLM from scratch requires dozens to hundreds of GPUs and millions of dollars. However, with fine-tuning, you can build your own specialized model using just a single consumer-grade GPU.
This article covers practical fine-tuning methods using LoRA, QLoRA, and the PEFT library.
Full Fine-tuning vs Parameter-Efficient Fine-tuning
The Problem with Full Fine-tuning
To fully fine-tune a 7B model, you need:
- Model parameters: 7B x 4 bytes (FP32) = 28GB
- Gradients: same size as the parameters = 28GB
- Optimizer states: Adam keeps two FP32 states (momentum and variance) per parameter = 56GB
- Total VRAM: approximately 112GB or more, before activations
Even a single A100 80GB is not enough.
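The arithmetic above can be sketched as a quick back-of-envelope calculation (a hypothetical helper; it ignores activations and framework overhead):

```python
def full_finetune_vram_gb(n_params_billion, bytes_per_param=4):
    """Rough VRAM estimate for full fine-tuning with Adam in FP32."""
    params = n_params_billion * bytes_per_param   # model weights
    grads = params                                # one gradient per weight
    optimizer = 2 * params                        # Adam: momentum + variance
    return {"params": params, "grads": grads,
            "optimizer": optimizer,
            "total": params + grads + optimizer}

print(full_finetune_vram_gb(7))
```

For a 7B model this reproduces the 28 + 28 + 56 = 112GB figure above.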
The Emergence of PEFT
Parameter-Efficient Fine-tuning (PEFT) trains only 0.1-1% of the total parameters:
| Method | Trainable Parameter Ratio | VRAM (7B model) |
|---|---|---|
| Full Fine-tuning | 100% | ~112GB |
| LoRA | ~0.1-1% | ~16GB |
| QLoRA | ~0.1-1% | ~6GB |
LoRA: Low-Rank Adaptation
Mathematical Principles
The core idea of LoRA: the weight update matrix Delta-W is low-rank, so it can be decomposed as Delta-W = B·A.
Original linear transformation:
h = W·x
With LoRA applied:
h = W·x + (alpha/r)·B·A·x
Where:
- W: Original weights (frozen)
- B (d x r): Low-rank matrix (trainable)
- A (r x d): Low-rank matrix (trainable)
- r: Rank (typically 4-64, very small compared to the original dimension)
- alpha: Scaling factor
Original W (4096 x 4096) = ~16.8M parameters [frozen]
LoRA:
A (r x 4096) + B (4096 x r) = r x 8192 parameters
With r=8: 65,536 parameters (~0.4% of the original)
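The parameter counts above can be verified with a few lines of Python (a hypothetical helper, matching the 4096 x 4096 example):

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters for one LoRA pair: A (r x d_in) + B (d_out x r)."""
    return r * d_in + d_out * r

full = 4096 * 4096                      # original frozen weight matrix
lora = lora_param_count(4096, 4096, r=8)

print(full)                   # 16777216
print(lora)                   # 65536
print(f"{lora / full:.2%}")   # 0.39%
```

Doubling the rank doubles the trainable parameters, which is why r stays small relative to the hidden dimension.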
Code Implementation
```python
import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Load the base model
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 2. LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                 # Rank
    lora_alpha=32,        # Scaling (typically 2x the rank)
    lora_dropout=0.05,    # Dropout
    target_modules=[      # Modules to apply LoRA to
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    bias="none"
)

# 3. Create PEFT model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5194
```
Guide to Selecting target_modules
```python
# Check all Linear layers in the model
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        print(name, module.in_features, module.out_features)

# Common choices:
# - Attention only: ["q_proj", "v_proj"] (minimum VRAM)
# - Full attention: ["q_proj", "k_proj", "v_proj", "o_proj"] (recommended)
# - Including MLP: above + ["gate_proj", "up_proj", "down_proj"] (maximum performance)
```
QLoRA: 4-bit Quantization + LoRA
What Makes QLoRA Special
QLoRA enables fine-tuning large models on consumer GPUs through three innovations:
- 4-bit NormalFloat (NF4): Quantization optimized for normally distributed weights
- Double Quantization: Further memory savings by quantizing the quantization constants themselves
- Paged Optimizers: Automatic paging to CPU when GPU memory runs out
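To see why Double Quantization helps, consider the per-parameter overhead of the quantization constants. The arithmetic below is a back-of-envelope sketch using the block sizes described in the QLoRA paper (64 weights per first-level block, 256 constants per second-level block):

```python
# Without Double Quantization: each block of 64 weights shares
# one FP32 quantization constant.
bits_without_dq = 32 / 64                 # 0.5 extra bits per parameter

# With Double Quantization: constants are stored in 8 bits, plus one
# FP32 constant per block of 256 of those 8-bit constants.
bits_with_dq = 8 / 64 + 32 / (64 * 256)   # ~0.127 extra bits per parameter

print(bits_without_dq)
print(round(bits_with_dq, 3))
```

Roughly a 0.37 bits-per-parameter saving, which adds up to several hundred megabytes on a multi-billion-parameter model.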
Implementation
```python
from transformers import BitsAndBytesConfig
import torch

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",                 # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,     # Computation in bf16
    bnb_4bit_use_double_quant=True,            # Double Quantization
)

# Load the quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto"
)

# Apply LoRA (QLoRA = 4-bit quantized model + LoRA)
model = get_peft_model(model, lora_config)
```
VRAM Usage Comparison
| Model | Full FP16 | LoRA FP16 | QLoRA 4-bit |
|---|---|---|---|
| 7B | ~28GB | ~16GB | ~6GB |
| 13B | ~52GB | ~30GB | ~10GB |
| 70B | ~280GB | ~160GB | ~48GB |
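The table's rough scaling can be reproduced for the weights alone (a hedged estimate; the table's figures also include gradients, LoRA optimizer states, activations, and framework overhead):

```python
def weight_memory_gb(n_params_billion, bits_per_param):
    """Memory for model weights only, in GB (decimal)."""
    return n_params_billion * bits_per_param / 8

print(weight_memory_gb(7, 16))   # FP16 weights for a 7B model
print(weight_memory_gb(7, 4))    # 4-bit weights for a 7B model
print(weight_memory_gb(70, 4))   # 4-bit weights for a 70B model
```

The gap between these weight-only numbers and the table (e.g. 3.5GB of 4-bit weights vs. ~6GB total for QLoRA on 7B) is the training overhead that quantization does not remove.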
With QLoRA, you can fine-tune models up to 13B on a single RTX 3090/4090 (24GB).
Data Preparation and Training
Dataset Format
Data format for instruction fine-tuning:
```python
from datasets import load_dataset

# Alpaca-style dataset
dataset = load_dataset("json", data_files="train_data.json")

# Data example
# {
#     "instruction": "Please summarize the following text.",
#     "input": "Kubernetes is a platform for containerized workloads and services...",
#     "output": "Kubernetes is a container orchestration platform."
# }

# Apply prompt template
def format_instruction(sample):
    if sample["input"]:
        text = f"""### Instruction:
{sample["instruction"]}
### Input:
{sample["input"]}
### Response:
{sample["output"]}"""
    else:
        text = f"""### Instruction:
{sample["instruction"]}
### Response:
{sample["output"]}"""
    return {"text": text}

dataset = dataset.map(format_instruction)
```
Training with SFTTrainer
```python
from trl import SFTTrainer, SFTConfig

training_args = SFTConfig(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,    # Effective batch = 4 x 4 = 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    max_seq_length=2048,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    optim="paged_adamw_8bit",         # Paged optimizer for QLoRA
    gradient_checkpointing=True,      # Additional VRAM savings
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    args=training_args,
)

trainer.train()
```
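With these settings, the number of optimizer steps can be estimated up front (a rough sketch, assuming a hypothetical dataset of 10,000 examples on a single GPU):

```python
dataset_size = 10_000          # hypothetical dataset size
per_device_batch = 4
grad_accum = 4
epochs = 3

effective_batch = per_device_batch * grad_accum      # samples per optimizer step
steps_per_epoch = dataset_size // effective_batch
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * 0.03)               # from warmup_ratio=0.03

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```

This kind of estimate helps sanity-check logging_steps and the warmup schedule before launching a long run.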
Saving and Merging the Model After Training
```python
# Save only the LoRA adapter (tens of MB)
model.save_pretrained("./lora-adapter")

# Load the adapter later
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base_model, "./lora-adapter")

# Merge adapter with the base model (for deployment)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
```
Hyperparameter Tuning Guide
Choosing the LoRA Rank (r)
Characteristics by rank:
- r=4: Fewest parameters, suitable for simple domain adaptation
- r=8: Typical starting point
- r=16: Good balance (recommended)
- r=32: Complex tasks, requires more VRAM
- r=64+: Performance close to full fine-tuning, but proportionally less efficient
Empirically, r=16 with alpha=32 works well in most cases.
Learning Rate
LoRA/QLoRA learning rates are typically set higher than for full fine-tuning:
- Full fine-tuning: 1e-5 to 5e-5
- LoRA: 1e-4 to 3e-4
- QLoRA: 2e-4 (typical)
Advanced Techniques
DoRA: Weight-Decomposed Low-Rank Adaptation
An evolution of LoRA that decomposes weights into magnitude and direction:
```python
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    use_dora=True,    # Enable DoRA
)
```
Combining Multiple LoRA Adapters
```python
from peft import PeftModel

# Load multiple adapters on the base model
model = PeftModel.from_pretrained(base_model, "./adapter-korean")
model.load_adapter("./adapter-code", adapter_name="code")

# Switch adapters
model.set_adapter("code")

# Or combine adapter weights
model.add_weighted_adapter(
    adapters=["default", "code"],
    weights=[0.7, 0.3],
    adapter_name="merged"
)
```
Conclusion
LoRA and QLoRA have revolutionized the accessibility of LLM fine-tuning. The ability to customize multi-billion parameter models on a single consumer GPU is at the heart of AI democratization.
Key takeaways:
- LoRA: Reduces trainable parameters to 0.1-1% through low-rank decomposition
- QLoRA: Saves an additional ~4x VRAM through 4-bit quantization
- PEFT: Hugging Face library that enables implementation with just a few lines of code
Go ahead and prepare your data, train, and deploy. It is much easier than you might think.
Quiz
Q1: What does the rank (r) mean in LoRA?
It is the low-dimensional size used when decomposing the weight update matrix Delta-W. A smaller r means fewer trainable parameters and less VRAM consumption, but it also limits the model's expressiveness.
Q2: How much VRAM is approximately needed to fully fine-tune a 7B model?
Approximately 112GB or more. This includes model parameters (28GB) + optimizer states (56GB) + gradients (28GB).
Q3: What are the three key innovations of QLoRA?
1) 4-bit NormalFloat (NF4) quantization, 2) Double Quantization (quantizing the quantization constants themselves), 3) Paged Optimizers (automatic GPU-to-CPU paging)
Q4: What is the role of the lora_alpha parameter in LoRA?
It is the scaling factor for LoRA updates. The actual scale is calculated as alpha/r, and it is typically set to 2x the rank (e.g., alpha=32 when r=16).
Q5: Why is NF4 quantization used in QLoRA better than standard INT4?
Since neural network weights approximately follow a normal distribution, NF4 quantization, which is optimized for normal distributions, incurs less information loss than INT4, which assumes a uniform distribution.
Q6: Why would you merge a LoRA adapter with the base model?
To eliminate additional computational overhead during inference. Once merged, the model has the same structure as the original, resulting in faster inference compared to keeping the adapter separate.
Q7: What is the effect of gradient_checkpointing=True?
It avoids storing intermediate activations from the forward pass in memory, recomputing them during the backward pass instead. This saves VRAM but increases training time by approximately 20-30%.
Q8: How does the choice of target_modules affect performance in LoRA?
Applying LoRA to more modules improves performance but increases VRAM usage and training time. Applying it only to q_proj and v_proj in the attention layer is the minimal configuration, while including the MLP layers yields maximum performance.