- 1. Introduction
- 2. The Problems with Full Fine-tuning
- 3. Core Idea of LoRA: Low-Rank Decomposition of Weight Updates
- 4. Mathematical Background: Intrinsic Dimensionality
- 5. Rank Selection Strategy and Target Layers
- 6. QLoRA: 4-bit Quantization + LoRA
- 7. Hands-On with the HuggingFace PEFT Library
- 8. LoRA vs Full Fine-tuning Performance Comparison
- 9. Practical Tips: Learning Rate, Rank, and Alpha Tuning
- 10. The LoRA Ecosystem and Variant Techniques
- 11. Conclusion
- References
1. Introduction
Large Language Models (LLMs) such as GPT-3 (175B), LLaMA (65B), and Falcon (180B) have demonstrated remarkable performance across a wide range of NLP tasks. However, fine-tuning these models for specific domains or tasks still demands enormous computing resources. The LoRA (Low-Rank Adaptation of Large Language Models) paper, published by Microsoft Research in 2021, proposed an elegant solution to this problem and has since become the de facto standard for LLM fine-tuning.
This article provides a mathematical analysis of the core principles of the LoRA paper (Hu et al., 2021), covers the follow-up work QLoRA, and offers a systematic guide to practical implementation using the HuggingFace PEFT library.
2. The Problems with Full Fine-tuning
2.1 Parameter Count and GPU Memory
Full fine-tuning updates all parameters of a pre-trained model. Looking at GPT-3 175B as a reference, the cost becomes abundantly clear.
| Item | Full Fine-tuning | LoRA (r=4) |
|---|---|---|
| Trainable Parameters | 175B (100%) | 4.7M (~0.003%) |
| GPU VRAM Required | ~1.2TB | ~350GB |
| Checkpoint Size | ~350GB | ~35MB |
When using the Adam optimizer for full fine-tuning, two additional state values — momentum and variance — must be stored for each parameter. Therefore, even at 16-bit precision, model weights (2 bytes) + gradients (2 bytes) + optimizer states (8 bytes) = approximately 12 bytes per parameter are needed. For a 175B model, this translates to roughly 2.1TB of memory.
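The arithmetic above can be sanity-checked in a few lines of Python:

```python
# Per-parameter memory for Adam full fine-tuning with 16-bit weights:
#   2 B weights + 2 B gradients + 8 B optimizer state (fp32 momentum + variance)
params = 175e9                  # GPT-3 parameter count
bytes_per_param = 2 + 2 + 8     # = 12 bytes

total_tb = params * bytes_per_param / 1e12
print(f"{total_tb:.1f} TB")     # 2.1 TB
```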
2.2 Storage and Deployment Costs
Full fine-tuning requires storing a complete copy of the model for each task. Fine-tuning GPT-3 for 10 different tasks would require 350GB x 10 = 3.5TB of storage. In contrast, LoRA only needs to store approximately 35MB of adapter weights per task, so the same scenario requires just 350MB. This represents roughly a 10,000x reduction in storage.
2.3 Inference Latency
Previous adapter-based methods (Houlsby et al., 2019) inserted additional layers into the model architecture, which increased latency during inference. This overhead becomes non-negligible in online serving environments with small batch sizes. LoRA eliminates it entirely, because its update can be merged back into the existing weights, leaving the architecture unchanged at inference time.
3. Core Idea of LoRA: Low-Rank Decomposition of Weight Updates
3.1 The Key Hypothesis
The core hypothesis of LoRA is as follows:
When adapting pre-trained model weights to a specific task, the weight update $\Delta W$ has a low "intrinsic rank."
This hypothesis is inspired by research from Aghajanyan et al. (2020), and rests on the intuition that since pre-trained models have already learned sufficiently good representations, the changes needed for task adaptation are concentrated in only a small subspace of the full parameter space.
3.2 Mathematical Formulation
Let the pre-trained weight matrix be $W_0 \in \mathbb{R}^{d \times k}$. In full fine-tuning, this is updated to $W_0 + \Delta W$, where $\Delta W \in \mathbb{R}^{d \times k}$ has the same dimensions as the full parameter space.
LoRA decomposes this update into a product of two low-rank matrices:
$$\Delta W = BA$$
Where:
- $A \in \mathbb{R}^{r \times k}$ (down-projection)
- $B \in \mathbb{R}^{d \times r}$ (up-projection)
- $r \ll \min(d, k)$ (rank, typically 1–64)
The forward pass is thus modified as follows:
$$h = W_0 x + \Delta W x = W_0 x + BAx$$
The number of trainable parameters is drastically reduced from $d \times k$ to $r \times (d + k)$. For example, if $d = k = 12{,}288$ (GPT-3's hidden dimension) and $r = 8$:
- Full Fine-tuning: $12{,}288 \times 12{,}288 \approx 151\text{M}$ parameters
- LoRA: $8 \times (12{,}288 + 12{,}288) = 196{,}608$ parameters (approximately 768x reduction)
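The parameter counts above can be verified directly:

```python
d = k = 12_288                     # GPT-3 hidden dimension
r = 8                              # LoRA rank

full_params = d * k                # full fine-tuning: the whole d x k matrix
lora_params = r * (d + k)          # LoRA: B (d x r) plus A (r x k)

print(full_params)                 # 150994944  (~151M)
print(lora_params)                 # 196608
print(full_params // lora_params)  # 768
```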
3.3 Initialization Strategy
LoRA's initialization is designed so that at the start of training:
- Matrix A: Random Gaussian initialization
- Matrix B: Zero initialization
This ensures $\Delta W = BA = 0$ at the beginning of training, so the model produces the same output as the original pre-trained model even with LoRA adapters attached. This is a critical design decision that ensures training stability.
3.4 Scaling Factor: alpha/r
In the actual forward pass, LoRA's output is multiplied by a scaling factor $\frac{\alpha}{r}$:
$$h = W_0 x + \frac{\alpha}{r} BAx$$
Here, $\alpha$ is a constant (hyperparameter). The original paper mentions setting $\alpha$ to the first value of $r$ tried and not tuning it afterward. The purpose of this scaling factor is to reduce the need to re-tune the learning rate when changing the rank $r$.
More recently, Rank-Stabilized LoRA (rsLoRA) has been proposed, which sets the scaling factor to $\frac{\alpha}{\sqrt{r}}$ to improve training stability at higher ranks. In HuggingFace PEFT, this can be enabled with the use_rslora=True option.
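The initialization and scaling rules can be condensed into a minimal NumPy sketch (toy dimensions; names are illustrative, not PEFT's internals):

```python
import numpy as np

rng = np.random.default_rng(0)
d = k = 64          # weight matrix dimensions
r, alpha = 8, 16    # rank and scaling hyperparameter

W0 = rng.normal(size=(d, k))             # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, k))  # down-projection: Gaussian init
B = np.zeros((d, r))                     # up-projection: zero init
x = rng.normal(size=(k,))

def forward(x):
    # h = W0 x + (alpha / r) * B A x
    return W0 @ x + (alpha / r) * (B @ (A @ x))

# Because B = 0, Delta W = BA = 0 at initialization, so the adapted
# model reproduces the base model exactly:
assert np.allclose(forward(x), W0 @ x)
```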
3.5 Merging at Inference Time
One of LoRA's greatest advantages is that there is no additional latency at inference time. Once training is complete, the adapter weights can be merged into the original model:
$$W = W_0 + \frac{\alpha}{r} BA$$
After merging, the model has the same structure as the original, so inference speed is not degraded at all. Conversely, if you want to switch to an adapter for a different task, you simply subtract $BA$ and add a different adapter $B'A'$. This flexibility enables efficient management of multiple task-specific adapters on a single base model.
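The merge and task-switch operations reduce to simple matrix arithmetic, sketched here with toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
d = k = 32
r, alpha = 4, 8
scaling = alpha / r

W0 = rng.normal(size=(d, k))   # base weight
A = rng.normal(size=(r, k))    # trained adapter factors
B = rng.normal(size=(d, r))
x = rng.normal(size=(k,))

# Merge: W = W0 + (alpha / r) * B A  -- one plain matrix, no extra layers
W = W0 + scaling * (B @ A)

# The merged weight gives the same output as base + adapter branch:
assert np.allclose(W @ x, W0 @ x + scaling * (B @ (A @ x)))

# Task switch: subtract this adapter's BA, then add another adapter's B'A'
W_base_again = W - scaling * (B @ A)
assert np.allclose(W_base_again, W0)
```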
4. Mathematical Background: Intrinsic Dimensionality
4.1 What Is Intrinsic Dimensionality?
Aghajanyan et al. (2020) experimentally demonstrated in their paper "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning" that the fine-tuning process of pre-trained language models has a surprisingly low intrinsic dimension.
The core idea is that the optimization problem in the full parameter space $\mathbb{R}^D$ (where $D$ is the total number of parameters) can be reparameterized into a much smaller subspace of dimension $d \ll D$ while still achieving performance comparable to full fine-tuning.
4.2 Connection to LoRA
LoRA directly exploits this observation. If the weight update $\Delta W$ has a low intrinsic rank, then approximating it as a low-rank matrix product $BA$ is well-justified.
In fact, the experimental results in the LoRA paper strongly support this hypothesis. When applying LoRA to $W_q$ and $W_v$ in GPT-3 175B, even rank $r=1$ achieved highly competitive performance:
| Rank (r) | WikiSQL Acc. | MultiNLI Acc. |
|---|---|---|
| 1 | 73.4% | 91.3% |
| 2 | 73.3% | 91.4% |
| 4 | 73.7% | 91.3% |
| 8 | 73.8% | 91.6% |
| 64 | 73.5% | 91.4% |
The negligible performance difference between $r=1$ and $r=64$ empirically validates the hypothesis that the intrinsic rank of weight updates is very low. This provides the practical insight that "a larger rank is not always better."
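A small NumPy experiment (synthetic data, purely illustrative) shows why a low rank suffices when the true update is nearly low-rank: the rank-r SVD truncation of such a matrix captures almost all of it, and pushing the rank higher buys essentially nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Synthetic "weight update": exactly rank 4, plus a little noise.
delta_W = rng.normal(size=(d, 4)) @ rng.normal(size=(4, d)) \
          + 0.01 * rng.normal(size=(d, d))

U, S, Vt = np.linalg.svd(delta_W)

def relative_error(r):
    # Best rank-r approximation (Eckart-Young): keep the top r singular triplets.
    approx = (U[:, :r] * S[:r]) @ Vt[:r]
    return np.linalg.norm(delta_W - approx) / np.linalg.norm(delta_W)

# Rank 4 already reconstructs the update almost perfectly, and rank 32
# improves on it only marginally -- mirroring the r=1 vs r=64 table above.
assert relative_error(4) < 0.05
assert relative_error(4) - relative_error(32) < 0.05
```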
5. Rank Selection Strategy and Target Layers
5.1 Which Layers Should LoRA Be Applied To?
The Self-Attention block in a Transformer contains four weight matrices: $W_q$ (Query), $W_k$ (Key), $W_v$ (Value), and $W_o$ (Output projection). The LoRA paper experimented with various combinations on GPT-3 175B under the same parameter budget (18M).
The results showed that applying LoRA to both $W_q$ and $W_v$ simultaneously yielded the best performance. Applying it to only a single matrix (e.g., only $W_q$ or only $W_k$) degraded performance, and applying it to all four matrices reduced the rank allocated to each matrix, which also hurt performance.
However, subsequent research and practical experience have shown that applying LoRA to all Linear layers often produces better results. According to experiments by Sebastian Raschka, when applying LoRA to all layers of LLaMA-2 7B — including Attention Q, K, V, O and MLP gate, up, and down projections:
- Trainable parameters: 4,194,304 -> 20,277,248 (approximately 5x increase)
- GPU memory: 14.18 GB -> 16.62 GB (approximately 17% increase)
- Performance: notable improvement
While the parameter count and memory increase somewhat, the performance-to-efficiency ratio is excellent, making the target_modules="all-linear" setting the recommended approach in practice.
5.2 Rank Selection Guidelines
There are no absolute rules for rank selection, but the following guidelines can serve as a reference:
| Use Case | Recommended Rank | Rationale |
|---|---|---|
| Simple task adaptation (sentiment analysis) | r=4–8 | Low intrinsic dimension is sufficient |
| General instruction tuning | r=16–32 | Adaptation to diverse instructions is needed |
| Complex domain adaptation (medical, legal) | r=64–256 | More capacity needed for domain knowledge |
| Maximum performance | r=256+ | Recommended with rsLoRA |
In Sebastian Raschka's experiments, r=256 outperformed r=8, 32, 64, and 128, but performance declined at r=512. This suggests that excessively high ranks can lead to overfitting.
6. QLoRA: 4-bit Quantization + LoRA
6.1 Overview of QLoRA
QLoRA (Quantized LoRA), published by Tim Dettmers et al. in 2023, takes LoRA's memory efficiency one step further. The core idea is to store the base model weights in 4-bit quantized format while training only the LoRA adapters in 16-bit (or BFloat16).
QLoRA rests on three technical innovations, covered in turn below: 4-bit NormalFloat (NF4), Double Quantization, and Paged Optimizers.
6.2 4-bit NormalFloat (NF4)
NF4 is a new data type proposed in QLoRA that is information-theoretically optimal for normally distributed weights.
Pre-trained neural network weights generally follow a zero-centered normal distribution. NF4 uses quantile quantization to ensure that each quantization bin represents an equal expected number of values from the target normal distribution. This preserves the original weight distribution far more accurately than conventional INT4 or FP4.
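The quantile idea can be sketched with the Python standard library. This builds equal-probability-mass bins for N(0, 1) and is only an illustration of the principle, not the actual NF4 codebook (which additionally guarantees an exact zero level and handles the tails asymmetrically):

```python
from statistics import NormalDist

norm = NormalDist(0, 1)
n_levels = 16  # 4 bits -> 16 quantization levels

# Midpoints of 16 equal-probability bins of the standard normal,
# rescaled so the codebook spans [-1, 1]:
raw = [norm.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
m = max(abs(v) for v in raw)
levels = [v / m for v in raw]

# Levels cluster near zero (where normally distributed weights concentrate)
# and spread out toward the tails:
assert (levels[8] - levels[7]) < (levels[1] - levels[0])
print([round(v, 3) for v in levels])
```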
6.3 Double Quantization
Double Quantization is a technique that quantizes the quantization constants themselves. In block-wise quantization, each block requires a scaling factor, and by quantizing these scaling factors again, an additional average savings of approximately 0.37 bits per parameter is achieved. For a 65B model, this translates to roughly 3GB of additional memory savings.
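The savings figure follows from simple bookkeeping with the paper's block sizes (64 weights per first-level block, one second-level fp32 scale per 256 blocks, 8-bit quantized constants):

```python
# Overhead of block-wise quantization constants, per parameter.
# Before Double Quantization: one fp32 scale per 64-weight block.
bits_before = 32 / 64                  # 0.5 bits/parameter

# After: scales stored in 8 bits, plus one fp32 scale per 256 blocks.
bits_after = 8 / 64 + 32 / (64 * 256)  # ~0.127 bits/parameter

saved_bits = bits_before - bits_after
print(round(saved_bits, 3))            # 0.373 bits/parameter

# For a 65B-parameter model:
saved_gb = 65e9 * saved_bits / 8 / 1e9
print(round(saved_gb, 2))              # ~3.03 GB
```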
6.4 Paged Optimizers
Paged Optimizers leverage NVIDIA Unified Memory to automatically page out optimizer states to CPU memory when GPU memory becomes saturated. This effectively handles gradient checkpointing memory spikes that occur when processing long sequences. Based on the Unified Memory model introduced in CUDA Toolkit 6.0, it utilizes a single memory space directly accessible by both CPU and GPU.
6.5 QLoRA Performance
QLoRA enables fine-tuning a 65B parameter model on a single 48GB GPU while maintaining task performance equivalent to 16-bit full fine-tuning. The Guanaco model family trained with QLoRA surpassed all previous open models on the Vicuna benchmark and reached 99.3% of ChatGPT's performance. This was achieved with just 24 hours of training on a single GPU.
6.6 LoRA vs QLoRA Comparison
| Item | LoRA (16-bit) | QLoRA (4-bit) |
|---|---|---|
| Training Time (7B, same data) | ~1.85 hours | ~2.79 hours |
| GPU Memory | ~21.33 GB | ~14.18 GB |
| Performance | Baseline | Nearly identical (differences are marginal) |
| Advantage | Faster training | Lower memory requirements |
QLoRA saves approximately 33% of memory at the cost of roughly 50% longer training time (1.85 h vs 2.79 h above) due to quantization/dequantization overhead. QLoRA is better suited for memory-constrained environments, while LoRA is preferable when training speed is the priority.
7. Hands-On with the HuggingFace PEFT Library
7.1 Installation
pip install peft transformers trl datasets bitsandbytes accelerate
7.2 Basic LoRA Configuration and Training
Below is a basic example of applying LoRA to a Causal Language Model using HuggingFace PEFT.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch
# 1. Load model and tokenizer
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
)
# 2. LoRA configuration
lora_config = LoraConfig(
r=16, # rank
lora_alpha=32, # scaling factor (alpha)
target_modules="all-linear", # Apply to all Linear layers
lora_dropout=0.05, # dropout rate
bias="none", # disable bias training
task_type=TaskType.CAUSAL_LM, # task type
)
# 3. Create PEFT model
model = get_peft_model(model, lora_config)
# 4. Check trainable parameters
model.print_trainable_parameters()
# Output example: trainable params: 20,277,248 || all params: 6,758,404,096 || trainable%: 0.30
7.3 Instruction Tuning with SFTTrainer
Combining TRL's SFTTrainer with PEFT allows for more concise fine-tuning.
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig
from datasets import load_dataset
# Load dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")
# LoRA configuration
peft_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
)
# SFTTrainer setup and training
trainer = SFTTrainer(
model=model_id,
args=SFTConfig(
output_dir="./lora-output",
num_train_epochs=1,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
bf16=True,
logging_steps=10,
save_strategy="epoch",
max_seq_length=2048,
),
train_dataset=dataset,
peft_config=peft_config,
)
trainer.train()
# Save the trained adapter
trainer.save_model("./lora-adapter")
7.4 QLoRA Example
To apply QLoRA, add 4-bit quantization configuration using bitsandbytes.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
load_in_4bit=True, # Enable 4-bit quantization
bnb_4bit_quant_type="nf4", # Use NormalFloat4 type
bnb_4bit_compute_dtype=torch.bfloat16, # Use BFloat16 for computation
bnb_4bit_use_double_quant=True, # Enable Double Quantization
)
# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=bnb_config,
device_map="auto",
)
# Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)
# LoRA configuration (same settings as the standard LoRA example above)
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules="all-linear",
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
# Create PEFT model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
7.5 Loading a Trained Adapter and Inference
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
torch_dtype=torch.bfloat16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "./lora-adapter")
# (Optional) Merge adapter into base model for optimized inference speed
model = model.merge_and_unload()
# Inference
inputs = tokenizer("The future of AI is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
8. LoRA vs Full Fine-tuning Performance Comparison
8.1 Original Paper Results (GPT-3 175B)
The results reported in the original LoRA paper for GPT-3 175B are as follows:
| Method | Params | WikiSQL Acc. | MNLI-m Acc. | SAMSum (R1/R2/RL) |
|---|---|---|---|---|
| Full Fine-tuning | 175B | 73.8% | 89.5% | 52.0/28.0/44.5 |
| Prefix Tuning | 0.77M | - | - | - |
| LoRA | 4.7M | 73.4% | 91.7% | 53.8/29.8/45.9 |
| LoRA | 37.7M | 74.0% | 91.6% | 53.4/29.2/45.1 |
The notable finding is that LoRA outperformed full fine-tuning on MNLI and SAMSum. Despite training only 4.7M parameters (0.003% of the total), it achieved better results than training all 175B parameters. This is interpreted as an implicit regularization effect of LoRA — the low-rank decomposition acts as a form of regularization that prevents overfitting.
8.2 Inference Performance and Training Efficiency
| Metric | Full Fine-tuning | LoRA |
|---|---|---|
| Training Speed | Baseline | ~25% faster |
| VRAM Usage | Baseline | ~67% reduction |
| Inference Latency | Baseline | Same (after merge) |
| Multi-task Switching | Full model swap needed | Swap adapter only |
8.3 Limitations of LoRA
LoRA is not always equivalent to full fine-tuning. Recent research (Biderman et al., 2024, "LoRA vs Full Fine-tuning: An Illusion of Equivalence") highlights the following:
- Differences in SVD Structure: The Singular Value Decomposition structure of weight matrices trained with LoRA versus full fine-tuning differs.
- Out-of-Distribution (OOD) Generalization: LoRA and full fine-tuning models exhibit different generalization patterns for inputs outside the training data distribution.
- Dataset Size Dependency: When dataset size significantly exceeds LoRA's trainable parameter count, full fine-tuning may have an advantage.
Therefore, when applying LoRA, the rank and target modules should be appropriately configured based on the task complexity and dataset scale.
9. Practical Tips: Learning Rate, Rank, and Alpha Tuning
9.1 Learning Rate
- LoRA requires a learning rate approximately 10x higher than full fine-tuning. If you use 1e-5 to 3e-5 for full fine-tuning, 1e-4 to 3e-4 is appropriate for LoRA.
- Learning rate should be optimized before other hyperparameters, as the effects of rank and alpha depend on the learning rate.
- Cosine annealing schedulers are effective with SGD but have minimal impact with Adam/AdamW.
9.2 Rank (r) Settings
- Starting point: Begin with r=16, evaluate performance, and adjust as needed.
- Risk of setting too low: Insufficient representational capacity for the task.
- Risk of setting too high: Can cause overfitting and increases memory and compute costs.
- Using rsLoRA: When using high ranks of $r=32$ or above, setting use_rslora=True adjusts the scaling factor to $\frac{\alpha}{\sqrt{r}}$, which improves training stability at higher ranks.
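To see why this matters, compare the two scaling rules numerically for a fixed alpha (illustrative values):

```python
import math

alpha = 32
for r in (8, 32, 256):
    print(r, alpha / r, round(alpha / math.sqrt(r), 2))
# alpha/r shrinks linearly (4.0 -> 1.0 -> 0.125), suppressing the adapter's
# contribution at high rank; rsLoRA's alpha/sqrt(r) decays much more slowly
# (11.31 -> 5.66 -> 2.0), keeping higher ranks trainable.
```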
9.3 Alpha Settings
- Common heuristic: Set alpha = 2 * r. That is, if the rank is 16, set alpha to 32.
- This ratio works well in most cases, but the optimal ratio may vary depending on the model and dataset.
- In Sebastian Raschka's experiments, r=256 with alpha=128 (0.5x ratio) showed better results in some cases.
9.4 Dropout
- LoRA dropout is typically set in the range of 0.05 to 0.1.
- Use 0.1 for smaller datasets or when overfitting is a concern; consider 0.05 or disabled (0.0) for large-scale datasets.
9.5 Optimizer Selection
- AdamW: The most stable and widely used choice.
- SGD: At low ranks, the memory difference from AdamW is minimal, but at a high rank such as r=256, meaningful memory savings are possible (AdamW 17.86 GB vs SGD 14.46 GB).
- Use AdamW as the default in most cases, but consider SGD if memory is extremely limited and a high rank is required.
9.6 Dataset Tips
- Beware of multi-epoch training: Repeating the same dataset multiple times can cause overfitting. In Sebastian Raschka's experiments, training for 2 epochs on the Alpaca dataset actually degraded performance.
- Data quality over quantity: The 1K-example LIMA dataset achieved similar or better results than the 50K-example Alpaca dataset. Building a high-quality dataset may be more important than hyperparameter tuning.
9.7 Practical Configuration Templates
Below are LoRA configuration templates ready for practical use.
from peft import LoraConfig
# Standard Instruction Tuning configuration
config_standard = LoraConfig(
r=16,
lora_alpha=32,
target_modules="all-linear",
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
# Memory-efficient configuration (use with QLoRA)
config_memory_efficient = LoraConfig(
r=8,
lora_alpha=16,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.1,
bias="none",
task_type="CAUSAL_LM",
)
# Maximum performance configuration
config_high_performance = LoraConfig(
r=64,
lora_alpha=128,
target_modules="all-linear",
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
use_rslora=True, # Use Rank-Stabilized LoRA for high ranks
)
10. The LoRA Ecosystem and Variant Techniques
Following LoRA's success, a variety of variant techniques have emerged. A brief summary:
| Technique | Core Idea | PEFT Support |
|---|---|---|
| LoRA | Low-rank matrix decomposition (BA) | Yes |
| QLoRA | 4-bit quantization + LoRA | Yes |
| DoRA | Weight-Decomposed LoRA (direction/magnitude) | Yes |
| AdaLoRA | Importance-based dynamic rank allocation | Yes |
| LoHa | Hadamard Product-based decomposition | Yes |
| LoKr | Kronecker Product-based decomposition | Yes |
| PiSSA | Principal Singular Values-based init | Yes |
| rsLoRA | Rank-Stabilized scaling factor | Yes |
The HuggingFace PEFT library supports all of these techniques through LoraConfig options or separate Config classes, allowing experimentation with various methods with minimal code changes.
11. Conclusion
LoRA is a technique that dramatically reduces the cost of fine-tuning large language models, starting from the simple hypothesis that "the weight updates of pre-trained models have a low intrinsic rank." Three advantages — mathematical elegance, simplicity of implementation, and zero additional inference cost — have made LoRA the current standard for LLM fine-tuning.
With the advent of QLoRA, fine-tuning large models has become feasible on consumer-grade GPUs, and the HuggingFace PEFT library has made it possible to apply these techniques with just a few lines of code. If you are working on a project involving LLMs, LoRA is an essential technique to master.
References
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv. https://arxiv.org/abs/2106.09685
Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv. https://arxiv.org/abs/2305.14314
Aghajanyan, A., Gupta, S., & Zettlemoyer, L. (2020). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. arXiv. https://arxiv.org/abs/2012.13255
HuggingFace PEFT - LoRA Documentation. https://huggingface.co/docs/peft/en/package_reference/lora
HuggingFace PEFT - LoRA Methods Guide. https://huggingface.co/docs/peft/en/task_guides/lora_based_methods
HuggingFace PEFT GitHub Repository. https://github.com/huggingface/peft
Microsoft LoRA GitHub Repository. https://github.com/microsoft/LoRA
Raschka, S. (2023). Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation). Sebastian Raschka's Magazine. https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms
Biderman, S., et al. (2024). LoRA vs Full Fine-tuning: An Illusion of Equivalence. arXiv. https://arxiv.org/abs/2410.21228
Kalajdzievski, D. (2023). Rank-Stabilized LoRA (rsLoRA). arXiv. https://arxiv.org/abs/2312.03732
Liu, S., et al. (2024). DoRA: Weight-Decomposed Low-Rank Adaptation. arXiv. https://arxiv.org/abs/2402.09353
HuggingFace Blog - Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA. https://huggingface.co/blog/4bit-transformers-bitsandbytes