Author: Youngju Kim (@fjvbn20031)
LLM Fine-tuning Complete Guide: Master LoRA, QLoRA, RLHF, and DPO
With powerful open-source LLMs like LLaMA 3, Mistral, and Gemma now publicly available, fine-tuning them for specific domains and tasks has become a core skill for AI engineers. This guide covers every major LLM fine-tuning technique from Full Fine-tuning through LoRA, QLoRA, RLHF, and DPO — with complete, production-ready code at every step.
1. Why Fine-tune?
1.1 Limitations of Pretrained Models
Large-scale LLMs are pretrained on vast internet text and acquire impressive general capabilities. Used directly, however, they have several practical limitations:
Domain knowledge gaps: Even GPT-4 does not know your company's internal documentation or the latest medical protocols published after its training cutoff.
No instruction-following by default: Base models are trained to predict the next token — not to follow instructions. A base model asked to "find the bug in this code" may just continue generating plausible text rather than helping.
Output format control: Making a model reliably produce a specific JSON schema or markdown structure is extremely difficult with prompting alone.
Safety and alignment issues: Pretrained models can generate harmful content or behave inconsistently when faced with edge-case inputs.
1.2 Benefits of Fine-tuning
Fine-tuning is the process of further training a pretrained model's weights on new data for a specific purpose:
- Domain adaptation: Acquire specialized terminology, knowledge, and style
- Task specialization: Maximize performance on classification, extraction, or generation
- Behavior control: Learn desired response style, format, and safety boundaries
- Cost efficiency: A fine-tuned small model can replace expensive large model API calls
1.3 Fine-tuning vs Prompt Engineering
Prompt engineering is fast and free — but limited:
| Criterion | Prompt Engineering | Fine-tuning |
|---|---|---|
| Implementation effort | Low | Medium-High |
| Cost | Runtime token cost | One-time training cost |
| Performance ceiling | Limited by base model | Can exceed base |
| Output consistency | Low | High |
| Privacy | Data sent to API | Local execution possible |
| Latency | Long prompts = slow | Short inputs possible |
1.4 Types of Fine-tuning
Fine-tuning approaches fall into three major categories:
- Full Fine-tuning: Update all parameters (most powerful, most expensive)
- PEFT (Parameter-Efficient Fine-Tuning): Update only a small fraction of parameters
- LoRA, QLoRA, Prefix Tuning, Adapter, etc.
- RLHF/DPO: Learn from human preference data (alignment/safety)
2. Full Fine-tuning
2.1 Overview
Full fine-tuning updates every model parameter on new data. Theoretically the most powerful approach, but practical limitations make it rarely the first choice.
2.2 Memory Requirements
Full fine-tuning a 7B parameter model requires approximately:
- Model weights (BF16): 7B × 2 bytes = 14 GB
- Gradients (BF16): 14 GB (same as weights)
- Optimizer states (AdamW): two FP32 moments, 7B × 8 bytes = 56 GB
- Activations: several GB, depending on batch size and sequence length
- Total: roughly 85 GB or more
Even a single A100 80GB cannot hold a naive 7B full fine-tune; it takes memory savings such as gradient checkpointing, an 8-bit optimizer, or ZeRO offloading. A 70B model requires a multi-GPU cluster.
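These numbers are easy to recompute for other model sizes. A quick estimator (a sketch; the byte counts are parameters because they depend on precision choices such as BF16 vs FP32 optimizer moments):

```python
def full_ft_memory_gb(n_params_billions, bytes_weights=2, bytes_grads=2, bytes_optim=8):
    """Static memory estimate in GB for full fine-tuning, excluding activations.

    Defaults assume BF16 weights and gradients plus two FP32 AdamW moments
    (8 bytes per parameter); pass bytes_optim=4 for BF16 moments.
    """
    return n_params_billions * (bytes_weights + bytes_grads + bytes_optim)

print(full_ft_memory_gb(7))                 # 84 (FP32 AdamW moments)
print(full_ft_memory_gb(7, bytes_optim=4))  # 56 (BF16 AdamW moments)
```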
2.3 When to Use Full Fine-tuning
- Domain is very different from pretraining distribution (highly specialized medical/legal text)
- Sufficient GPU resources are available
- Maximum performance is absolutely required
- Continual pretraining on new text corpora
2.4 Catastrophic Forgetting
The biggest risk with full fine-tuning is catastrophic forgetting: training on new data degrades previously acquired knowledge.
Mitigation strategies:
- Low learning rate: Use 1e-5 or below to preserve existing knowledge
- Data mixing: Mix original pretraining data with new data
- EWC (Elastic Weight Consolidation): Regularize changes to important weights
- Use PEFT instead: Freezing the base weights and training only a small set of new parameters largely avoids catastrophic forgetting
```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset
import torch

# Full fine-tuning example
model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

dataset = load_dataset("your_dataset")

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=2048,
        padding="max_length",
    )

tokenized_dataset = dataset.map(tokenize_function, batched=True)

training_args = TrainingArguments(
    output_dir="./full_ft_output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,  # Low LR to preserve knowledge
    weight_decay=0.01,
    bf16=True,
    logging_steps=100,
    save_steps=500,
    evaluation_strategy="steps",
    eval_steps=500,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    # Causal-LM collator copies input_ids into labels so the Trainer has a loss target
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
3. LoRA (Low-Rank Adaptation)
3.1 The Core Idea
LoRA (Hu et al., 2021) is a game-changer for LLM fine-tuning. The key observation: weight changes during fine-tuning have intrinsically low rank.
Instead of updating the full weight matrix W (d×k), represent the change as the product of two small matrices B (d×r) and A (r×k):
W' = W + delta_W = W + B * A
Here, r is the rank and is chosen to be much smaller than both d and k.
Parameter count comparison:
- Original delta_W: d × k parameters
- LoRA B + A: r × (d + k) parameters
- For d=k=4096, r=16: 16,777,216 vs 131,072 — a 128x reduction!
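The arithmetic above is easy to sanity-check:

```python
d = k = 4096
r = 16
full_delta = d * k      # parameters in a full-rank update
lora = r * (d + k)      # parameters in the B and A factors
print(full_delta, lora, full_delta // lora)  # 16777216 131072 128
```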
3.2 Formula Details
Initialization:
- A: Gaussian random initialization
- B: Zero initialization (ensures delta_W = 0 at training start)
Forward pass:
h = x * W^T + x * (B * A)^T * (alpha / r)
The alpha/r ratio acts like a learning rate multiplier for the LoRA updates.
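A minimal pure-Python sketch of this forward pass (illustrative only; real implementations operate on framework tensors) makes the initialization choice concrete: because B starts at zero, the adapted layer initially reproduces the base output exactly.

```python
import random

def matmul(X, Y):
    # (m x n) @ (n x p) -> (m x p)
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(M):
    return [list(row) for row in zip(*M)]

d, k, r, alpha = 8, 6, 2, 4
random.seed(0)
W = [[random.gauss(0, 0.02) for _ in range(k)] for _ in range(d)]   # frozen base weight
A = [[random.gauss(0, 0.02) for _ in range(k)] for _ in range(r)]   # Gaussian init
B = [[0.0] * r for _ in range(d)]                                   # zero init

x = [[random.gauss(0, 1) for _ in range(k)]]  # one input row of size k

base = matmul(x, transpose(W))                       # x W^T
delta = matmul(x, transpose(matmul(B, A)))           # x (B A)^T
scale = alpha / r
h = [[b + scale * dl for b, dl in zip(rb, rd)] for rb, rd in zip(base, delta)]

# delta_W = B A = 0 at the start of training, so h equals the base output
assert h == base
```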
3.3 Choosing Rank r
Rank r is LoRA's most important hyperparameter:
- r=4 or r=8: Lightweight experiments, simple tasks
- r=16: Good balance for most tasks — recommended starting point
- r=32 or r=64: Complex tasks requiring more capacity
- r=128+: When performance close to full fine-tuning is needed
Start with r=16 and increase if performance is insufficient.
3.4 The alpha Hyperparameter
Alpha scales the LoRA update magnitude. The actual scale factor applied is alpha/r:
- alpha = r: scale factor = 1 (common choice)
- alpha = 2r: scale factor = 2 (stronger LoRA updates)
- Can be tuned independently of the learning rate
3.5 Which Layers to Apply LoRA To?
The original paper applied LoRA only to Q and V projection matrices. Experiments show:
- Q, K, V, O (all attention projections): Generally good
- + MLP layers: Often better for complex tasks
- All linear layers: Near full fine-tuning performance
HuggingFace PEFT defaults to Q and V. For complex tasks, applying to all linear layers is recommended.
3.6 LoRA with HuggingFace PEFT
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,  # alpha = 2*r
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model_name = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Create LoRA model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output: trainable params: 41,943,040 || all params: 3,254,779,904 || trainable%: 1.29%

# Inspect trainable parameters
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"Trainable: {name}, shape: {param.shape}")
```
4. QLoRA (Quantized LoRA)
4.1 What Is QLoRA?
QLoRA (Dettmers et al., 2023) makes LoRA even more memory-efficient by training LoRA adapters on top of a 4-bit quantized base model.
Three core techniques:
- 4-bit NF4 quantization: Compress base model weights to 4 bits
- Double quantization: Quantize the quantization constants themselves
- Paged optimizers: Page optimizer states between CPU RAM and GPU
4.2 4-bit NF4 Quantization
NF4 (NormalFloat4) is a 4-bit data type optimized for normally distributed weights — which LLM weights tend to follow.
NF4 is information-theoretically optimal for normally distributed data: each quantization bin covers equal probability mass.
Memory savings:
- FP16 → INT4: 4x compression
- 70B model: 140GB (FP16) → 35GB (4-bit) — trainable on 2×24GB consumer GPUs
4.3 Double Quantization
Quantization constants themselves occupy memory (roughly 32 bits per group of 64 weights). Double Quantization quantizes these constants to 8 bits, saving approximately 0.37 bits per parameter.
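The saving is straightforward arithmetic, using the paper's block sizes (one FP32 constant per 64 weights before DQ; after DQ, 8-bit constants plus a second-level FP32 constant per block of 256):

```python
bits_before = 32 / 64                      # 0.5 bits of overhead per parameter
bits_after = 8 / 64 + 32 / (64 * 256)      # ~0.127 bits per parameter
print(round(bits_before - bits_after, 3))  # 0.373
```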
4.4 Paged Optimizers
During training of long sequences, peak memory spikes can cause OOM errors. Paged Optimizers use NVIDIA unified memory to automatically offload optimizer states to CPU RAM when GPU memory is full, then page them back when needed.
4.5 Complete QLoRA Training Code
```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
from datasets import load_dataset
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load model in 4-bit
model_name = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Prepare model for k-bit training
# (casts LayerNorm to FP32, handles embedding layers, etc.)
model = prepare_model_for_kbit_training(model)

# LoRA config
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Dataset preparation - Alpaca format
dataset = load_dataset("tatsu-lab/alpaca", split="train")

def format_instruction(sample):
    instruction = sample["instruction"]
    input_text = sample.get("input", "")
    output = sample["output"]
    if input_text:
        text = f"""### Instruction:
{instruction}
### Input:
{input_text}
### Response:
{output}"""
    else:
        text = f"""### Instruction:
{instruction}
### Response:
{output}"""
    return {"text": text}

formatted_dataset = dataset.map(format_instruction)

# Training config
training_args = TrainingArguments(
    output_dir="./qlora_output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,  # Save memory
    optim="paged_adamw_32bit",  # Paged Optimizer!
    logging_steps=25,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    report_to="wandb",
    run_name="llama-3-qlora",
)

# Only compute loss on response tokens
response_template = "### Response:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=formatted_dataset,
    data_collator=collator,
    tokenizer=tokenizer,
    max_seq_length=2048,
    dataset_text_field="text",
    packing=False,
)

trainer.train()
trainer.save_model("./qlora_adapter")
tokenizer.save_pretrained("./qlora_adapter")
```
5. Other PEFT Methods
5.1 Prefix Tuning
Prefix Tuning prepends learnable "virtual token" embeddings to K and V at every Transformer layer. Base model weights are frozen entirely.
```python
from peft import PrefixTuningConfig, get_peft_model

prefix_config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,
    prefix_projection=True,  # Project prefix through MLP
)
model = get_peft_model(base_model, prefix_config)
```
Prefix Tuning performs well on seq2seq tasks but generally underperforms LoRA.
5.2 Prompt Tuning
The simplest PEFT method. Adds only learnable soft prompt embeddings before the input — no changes to the model itself.
```python
from peft import PromptTuningConfig, PromptTuningInit

prompt_config = PromptTuningConfig(
    task_type="CAUSAL_LM",
    prompt_tuning_init=PromptTuningInit.TEXT,
    num_virtual_tokens=8,
    prompt_tuning_init_text="Classify the sentiment:",
    tokenizer_name_or_path=model_name,
)
```
Performs better as model size increases. Extremely parameter-efficient but has a lower performance ceiling.
5.3 IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations)
IA3 achieves LoRA-like performance with roughly 1/10 the parameters. It multiplies learned vectors into K, V, and the FFN activations.
```python
from peft import IA3Config

ia3_config = IA3Config(
    task_type="CAUSAL_LM",
    target_modules=["k_proj", "v_proj", "down_proj"],
    feedforward_modules=["down_proj"],
)
```
5.4 Adapter
Adapter layers insert small bottleneck modules inside each Transformer layer (down-projection → nonlinearity → up-projection). They predate LoRA; unlike LoRA, whose weights can be merged into the base model for deployment, adapters add a small amount of inference latency.
6. Instruction Tuning
6.1 What Is Instruction Tuning?
A base LLM is trained to continue text — not to follow instructions. Instruction tuning fine-tunes the model on instruction-response pairs so it learns to be helpful. Stanford Alpaca (2023) popularized this approach.
6.2 Dataset Formats
Alpaca format:
```json
{
  "instruction": "Find the greatest common divisor of two numbers.",
  "input": "24, 36",
  "output": "The GCD of 24 and 36 is 12.\n\nCalculation:\n- 24 = 2^3 * 3\n- 36 = 2^2 * 3^2\n- GCD = 2^2 * 3 = 12"
}
```
ChatML format (OpenAI standard):
```
<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
Implement quicksort in Python.<|im_end|>
<|im_start|>assistant
Here's a quicksort implementation in Python...
```
ShareGPT format (conversational):
```json
{
  "conversations": [
    { "from": "human", "value": "question text" },
    { "from": "gpt", "value": "answer text" },
    { "from": "human", "value": "follow-up question" },
    { "from": "gpt", "value": "follow-up answer" }
  ]
}
```
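These formats are straightforward to convert between. A sketch that maps a ShareGPT-style record to ChatML (the role mapping and default system prompt here are illustrative assumptions):

```python
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def sharegpt_to_chatml(sample: dict, system_prompt: str = "You are a helpful AI assistant.") -> str:
    """Convert a ShareGPT-style {"conversations": [...]} record to a ChatML string."""
    parts = [f"<|im_start|>system\n{system_prompt}<|im_end|>"]
    for turn in sample["conversations"]:
        role = ROLE_MAP[turn["from"]]
        parts.append(f"<|im_start|>{role}\n{turn['value']}<|im_end|>")
    return "\n".join(parts)

example = {"conversations": [
    {"from": "human", "value": "question text"},
    {"from": "gpt", "value": "answer text"},
]}
print(sharegpt_to_chatml(example))
```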
6.3 Data Quality Over Quantity
The LIMA paper (Less is More for Alignment, 2023) showed that just 1,000 high-quality examples are sufficient to produce strong instruction-following behavior. Quality matters far more than quantity.
High-quality instruction data criteria:
- Diversity: Covers many task types and domains
- Clarity: Instructions are unambiguous
- Accuracy: Responses are factually correct
- Consistency: Same response style across similar instructions
- Appropriate length: Only as long as necessary
6.4 Formatting and Tokenization
```python
def format_alpaca_prompt(sample: dict) -> str:
    instruction = sample["instruction"]
    input_text = sample.get("input", "")
    output = sample["output"]
    if input_text:
        return f"""### Instruction:
{instruction}
### Input:
{input_text}
### Response:
{output}"""
    else:
        return f"""### Instruction:
{instruction}
### Response:
{output}"""

def tokenize_with_label_masking(sample, tokenizer, max_length=2048):
    """Mask everything before the response — only compute loss on the answer"""
    full_text = sample["text"]
    tokenized = tokenizer(
        full_text, truncation=True, max_length=max_length, return_tensors="pt"
    )
    input_ids = tokenized["input_ids"][0]
    labels = input_ids.clone()

    response_start_str = "### Response:"
    # Caveat: this assumes the marker tokenizes to the same ids in context
    # as it does in isolation; some tokenizers merge it with the preceding newline
    response_token_ids = tokenizer.encode(response_start_str, add_special_tokens=False)
    for i in range(len(input_ids) - len(response_token_ids) + 1):
        if input_ids[i:i + len(response_token_ids)].tolist() == response_token_ids:
            labels[:i + len(response_token_ids)] = -100
            break
    return {"input_ids": input_ids, "labels": labels}
```
7. RLHF (Reinforcement Learning from Human Feedback)
7.1 RLHF Overview
RLHF is the alignment technique behind ChatGPT, Claude, and Gemini. It trains models to be more helpful, harmless, and honest by learning from human preference judgments.
Three-stage pipeline:
- SFT (Supervised Fine-Tuning): Fine-tune on high-quality demonstration data
- Reward Model training: Train a model to score response quality
- RL optimization (PPO): Optimize the policy (LLM) using reward signals
7.2 Stage 1: Supervised Fine-Tuning
Fine-tune the base model on high-quality conversation data. This stage teaches basic instruction-following.
```python
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig

sft_config = SFTConfig(
    output_dir="./sft_model",
    max_seq_length=2048,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    bf16=True,
    optim="adamw_torch_fused",
    logging_steps=10,
    save_steps=100,
    warmup_ratio=0.05,
    dataset_text_field="text",
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=sft_dataset,
    peft_config=lora_config,
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("./sft_model")
```
7.3 Stage 2: Reward Model Training
The Reward Model (RM) learns from human comparisons of two responses. It uses the same LLM architecture with an added linear head that outputs a scalar reward.
Preference data format:
```json
{
  "prompt": "Explain climate change",
  "chosen": "Climate change refers to long-term shifts in global temperatures... (detailed, accurate)",
  "rejected": "It's just the Earth getting warmer. (vague, incomplete)"
}
```
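Under the hood, the reward model is trained with a pairwise (Bradley-Terry) loss on the two scalar scores; a minimal sketch of that loss:

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected)."""
    diff = r_chosen - r_rejected
    return -math.log(1 / (1 + math.exp(-diff)))

# A larger margin between chosen and rejected scores gives a lower loss
assert reward_model_loss(2.0, -1.0) < reward_model_loss(0.5, 0.0)
```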
```python
from trl import RewardTrainer, RewardConfig

reward_config = RewardConfig(
    output_dir="./reward_model",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    bf16=True,
    max_length=2048,
    logging_steps=10,
    remove_unused_columns=False,
)

reward_trainer = RewardTrainer(
    model=reward_model,  # Based on SFT model
    args=reward_config,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
    peft_config=lora_config,
)
reward_trainer.train()
```
7.4 Stage 3: PPO Policy Optimization
PPO (Proximal Policy Optimization) optimizes the SFT model using the reward signal.
Core objective:
L = E[r(x, y)] - beta * KL(pi_theta || pi_ref)
- r(x, y): reward from the Reward Model
- KL divergence penalty: prevents the policy from drifting too far from the SFT reference
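Per sequence, the objective amounts to subtracting a KL-style penalty from the reward model's score. A schematic version (scalar per-sequence log-probabilities; actual PPO implementations apply the penalty per token):

```python
def penalized_reward(rm_score, logp_policy, logp_ref, beta=0.2):
    """Reward used for the PPO update: RM score minus beta times the
    log-ratio against the frozen SFT reference."""
    return rm_score - beta * (logp_policy - logp_ref)

# Drifting far from the reference (much higher log-prob under the policy)
# eats into the reward even when the RM score is high
assert penalized_reward(1.0, -10.0, -20.0) < penalized_reward(1.0, -20.0, -20.0)
```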
```python
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

ppo_config = PPOConfig(
    model_name=sft_model_path,
    learning_rate=1.41e-5,
    batch_size=128,
    mini_batch_size=4,
    gradient_accumulation_steps=1,
    optimize_cuda_cache=True,
    early_stopping=True,
    target_kl=0.1,
    ppo_epochs=4,
    seed=42,
    init_kl_coef=0.2,
    adap_kl_ctrl=True,
)

policy_model = AutoModelForCausalLMWithValueHead.from_pretrained(sft_model_path)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(sft_model_path)

ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=policy_model,
    ref_model=ref_model,
    tokenizer=tokenizer,
    dataset=rl_dataset,
)

# PPO training loop
for batch in ppo_trainer.dataloader:
    query_tensors = batch["input_ids"]

    # 1. Generate responses from policy
    response_tensors = ppo_trainer.generate(
        query_tensors,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
    )

    # 2. Score with Reward Model
    rewards = [
        reward_model.compute_reward(q, r)
        for q, r in zip(query_tensors, response_tensors)
    ]

    # 3. PPO update
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)
```
8. DPO (Direct Preference Optimization)
8.1 RLHF Complexity Problem
RLHF is powerful but complex:
- Requires training a separate Reward Model
- PPO has many sensitive hyperparameters
- Training instability is common
- Requires 4 models in memory simultaneously (policy, reference, reward, critic)
8.2 The DPO Insight
Rafailov et al. (2023) proved that you can optimize directly on preference data without a reward model. The key insight: the optimal RLHF policy can be expressed analytically in terms of the policy's own log-likelihood ratios. Substituting this back gives a loss function that operates directly on preference pairs:
L_DPO = -E[log sigma(
beta * (log pi(y_w|x) - log pi_ref(y_w|x)) -
beta * (log pi(y_l|x) - log pi_ref(y_l|x))
)]
Where:
- y_w: preferred response (chosen)
- y_l: dispreferred response (rejected)
- pi: model being trained
- pi_ref: reference policy (frozen SFT model)
- beta: KL penalty coefficient
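The loss is simple enough to compute by hand. A minimal sketch for a single preference pair, with each argument a summed log-probability of the full response:

```python
import math

def dpo_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    """DPO loss for one preference pair, following the formula above."""
    margin = beta * ((logp_w - logp_ref_w) - (logp_l - logp_ref_l))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)

# If the policy already favors the chosen response more than the reference
# does, the margin is positive and the loss falls below log(2)
loss = dpo_loss(logp_w=-40.0, logp_ref_w=-45.0, logp_l=-50.0, logp_ref_l=-48.0)
assert loss < math.log(2)
```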
8.3 Preference Data Format
```python
preference_data = {
    "prompt": "How do I sort a list in Python?",
    "chosen": "Python provides two main ways to sort lists:\n\n1. The sort() method — sorts in place:\n```python\nmy_list = [3, 1, 4, 1, 5]\nmy_list.sort()  # modifies my_list directly\n```\n\n2. The sorted() function — returns a new sorted list:\n```python\noriginal = [3, 1, 4, 1, 5]\nsorted_list = sorted(original)  # original unchanged\n```\n\nBoth support reverse=True and a key function for custom sorting.",
    "rejected": "Use list.sort() or sorted().",
}
```
8.4 DPO Training with trl
```python
from transformers import AutoModelForCausalLM
from peft import get_peft_model
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset
import torch

dpo_config = DPOConfig(
    output_dir="./dpo_model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,  # Very low LR for DPO
    bf16=True,
    beta=0.1,  # KL penalty coefficient
    max_length=2048,
    max_prompt_length=512,
    remove_unused_columns=False,
    logging_steps=10,
    save_steps=100,
    warmup_steps=100,
    report_to="wandb",
)

# Load preference dataset
# Format: {"prompt": ..., "chosen": ..., "rejected": ...}
preference_dataset = load_dataset("Anthropic/hh-rlhf", split="train")

def format_hh_rlhf(sample):
    """Parse HH-RLHF dataset format"""
    return {
        "prompt": sample["chosen"].rsplit("\nAssistant:", 1)[0] + "\nAssistant:",
        "chosen": sample["chosen"].rsplit("\nAssistant:", 1)[1].strip(),
        "rejected": sample["rejected"].rsplit("\nAssistant:", 1)[1].strip(),
    }

formatted_dataset = preference_dataset.map(format_hh_rlhf)
# We loaded a single split, so carve out an eval set ourselves
formatted_dataset = formatted_dataset.train_test_split(test_size=0.05, seed=42)

# Start DPO from SFT model
sft_model = AutoModelForCausalLM.from_pretrained(
    sft_model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
sft_model = get_peft_model(sft_model, lora_config)

dpo_trainer = DPOTrainer(
    model=sft_model,
    ref_model=None,  # None creates reference automatically
    args=dpo_config,
    train_dataset=formatted_dataset["train"],
    eval_dataset=formatted_dataset["test"],
    tokenizer=tokenizer,
    peft_config=lora_config,
)
dpo_trainer.train()
dpo_trainer.save_model("./dpo_final")
```
8.5 RLHF vs DPO Comparison
| Property | RLHF (PPO) | DPO |
|---|---|---|
| Implementation complexity | High | Low |
| Models in memory | 4 | 2 |
| Training stability | Low | High |
| Memory usage | High | Moderate |
| Online learning | Yes | Difficult |
| Performance | Generally higher | Comparable or slightly lower |
| Real-time feedback | Yes | No |
In practice, most teams start with DPO for its simplicity, then consider RLHF only if DPO is insufficient.
9. Data Preparation
9.1 Data Quality Standards
Data quality determines 80% of fine-tuning performance. High-quality data criteria:
- Accuracy: No factual errors
- Completeness: Fully answers the question
- Clarity: Unambiguous and easy to understand
- Format consistency: All examples follow the same format
- Non-toxic: No harmful or biased content
- Deduplicated: Near-duplicates removed
9.2 ChatML Format Processing
```python
def create_chatml_prompt(conversation: list) -> str:
    """Convert multi-turn conversation to ChatML format"""
    messages = []
    for turn in conversation:
        role = turn["role"]  # system, user, assistant
        content = turn["content"]
        messages.append(f"<|im_start|>{role}\n{content}<|im_end|>")
    return "\n".join(messages) + "\n<|im_start|>assistant\n"

def tokenize_with_response_masking(sample, tokenizer, max_length=2048):
    """Only compute loss on assistant turns"""
    full_text = sample["text"]
    tokenized = tokenizer(
        full_text, truncation=True, max_length=max_length, return_tensors="pt"
    )
    input_ids = tokenized["input_ids"][0]
    labels = input_ids.clone()

    # Caveat: matching raw token-id sequences assumes these markers tokenize
    # to the same ids in context as they do in isolation
    assistant_token_ids = tokenizer.encode(
        "<|im_start|>assistant\n", add_special_tokens=False
    )
    end_token_ids = tokenizer.encode("<|im_end|>", add_special_tokens=False)

    in_assistant = False
    i = 0
    while i < len(input_ids):
        if input_ids[i:i + len(assistant_token_ids)].tolist() == assistant_token_ids:
            in_assistant = True
            labels[i:i + len(assistant_token_ids)] = -100  # mask the marker itself
            i += len(assistant_token_ids)
            continue
        if input_ids[i:i + len(end_token_ids)].tolist() == end_token_ids and in_assistant:
            in_assistant = False
        if not in_assistant:
            labels[i] = -100
        i += 1
    return {"input_ids": input_ids, "labels": labels}
```
9.3 Data Cleaning Pipeline
```python
from datasets import Dataset, load_dataset
import hashlib
import re

def deduplicate_dataset(dataset: Dataset, text_field: str = "text") -> Dataset:
    """Remove exact duplicates via MD5 hash of the text"""
    seen = set()
    keep = []
    for i, sample in enumerate(dataset):
        h = hashlib.md5(sample[text_field].encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            keep.append(i)
    return dataset.select(keep)

def quality_filter(sample: dict) -> bool:
    """Basic quality filtering"""
    text = sample.get("output", sample.get("text", ""))
    words = text.split()
    # Too short
    if len(words) < 10:
        return False
    # Suspiciously long
    if len(words) > 2000:
        return False
    # Mostly URLs
    url_count = sum(1 for w in words if w.startswith("http"))
    if url_count / len(words) > 0.3:
        return False
    # Mostly numbers (unlikely to be useful instruction data)
    num_count = sum(1 for w in words if re.match(r"^\d+$", w))
    if num_count / len(words) > 0.5:
        return False
    return True

# Run pipeline
raw_dataset = load_dataset("your_dataset")["train"]
filtered = raw_dataset.filter(quality_filter)
deduped = deduplicate_dataset(filtered)
print(f"After cleaning: {len(deduped)} examples (was {len(raw_dataset)})")
```
10. Production Fine-tuning Pipeline
10.1 Complete Llama 3 QLoRA Fine-tuning
```python
#!/usr/bin/env python3
"""
Production Llama 3 QLoRA Fine-tuning Pipeline
"""
import torch
import wandb
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    EarlyStoppingCallback,
)
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer, SFTConfig, DataCollatorForCompletionOnlyLM

# ==============================
# Configuration
# ==============================
MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"
OUTPUT_DIR = "./llama3-qlora-output"
DATASET_NAME = "iamtarun/python_code_instructions_18k_alpaca"
MAX_SEQ_LENGTH = 2048
LORA_R = 64
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
BATCH_SIZE = 4
GRAD_ACCUM = 4
LEARNING_RATE = 2e-4
NUM_EPOCHS = 3

# ==============================
# Initialize W&B
# ==============================
wandb.init(
    project="llm-finetuning",
    name="llama3-qlora-code",
    config={
        "model": MODEL_NAME,
        "lora_r": LORA_R,
        "lora_alpha": LORA_ALPHA,
        "lr": LEARNING_RATE,
        "epochs": NUM_EPOCHS,
    },
)

# ==============================
# 4-bit quantization
# ==============================
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# ==============================
# Load model and tokenizer
# ==============================
print(f"Loading {MODEL_NAME}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model.config.use_cache = False
model.config.pretraining_tp = 1
model = prepare_model_for_kbit_training(model)

# ==============================
# LoRA setup
# ==============================
lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# ==============================
# Dataset
# ==============================
dataset = load_dataset(DATASET_NAME, split="train")
dataset = dataset.train_test_split(test_size=0.05, seed=42)

def format_prompt(sample):
    return {
        "text": f"""### Instruction:
{sample['instruction']}
### Input:
{sample.get('input', '')}
### Response:
{sample['output']}"""
    }

train_dataset = dataset["train"].map(format_prompt)
eval_dataset = dataset["test"].map(format_prompt)

# ==============================
# Training
# ==============================
training_config = SFTConfig(
    output_dir=OUTPUT_DIR,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACCUM,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    learning_rate=LEARNING_RATE,
    bf16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=25,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    report_to="wandb",
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_text_field="text",
    packing=False,
    group_by_length=True,
)

response_template = "### Response:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    args=training_config,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=collator,
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

print("Starting training...")
trainer.train()

print("Saving...")
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
wandb.finish()
print(f"Done! Saved to {OUTPUT_DIR}")
```
10.2 Merging LoRA Adapters
Merge LoRA weights into the base model for deployment:
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load base model (on CPU to save VRAM)
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)

peft_model = PeftModel.from_pretrained(base_model, OUTPUT_DIR)
print("Merging LoRA weights...")
merged_model = peft_model.merge_and_unload()

MERGED_DIR = "./llama3-merged"
merged_model.save_pretrained(MERGED_DIR, safe_serialization=True)
tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(MERGED_DIR)
print(f"Merged model saved to {MERGED_DIR}")
```
10.3 Deploy with Ollama
```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./llama3-merged
TEMPLATE """### Instruction:
{{ .Prompt }}
### Response:
"""
PARAMETER stop "### Instruction:"
PARAMETER temperature 0.7
PARAMETER top_p 0.9
EOF

# Build and test
ollama create my-llama3 -f Modelfile
ollama run my-llama3 "Implement quicksort in Python"
```
10.4 Deploy with vLLM
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="./llama3-merged",
    dtype="bfloat16",
    max_model_len=4096,
    tensor_parallel_size=1,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

prompts = [
    "### Instruction:\nImplement a binary search tree in Python\n\n### Response:\n",
    "### Instruction:\nExplain the CAP theorem\n\n### Response:\n",
]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
    print("---")
```
11. Evaluation
11.1 Perplexity
The fundamental language model metric. Lower is better.
```python
import torch

def compute_perplexity(
    model,
    tokenizer,
    texts: list,
    max_length: int = 1024,
    stride: int = 512,
) -> float:
    """Sliding window perplexity computation"""
    combined_text = "\n\n".join(texts[:100])
    encodings = tokenizer(combined_text, return_tensors="pt")
    seq_len = encodings.input_ids.size(1)

    nlls = []
    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - begin_loc - (stride if begin_loc > 0 else 0)
        input_ids = encodings.input_ids[:, begin_loc:end_loc].to(model.device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100

        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)
            neg_log_likelihood = outputs.loss * trg_len

        nlls.append(neg_log_likelihood)
        if end_loc == seq_len:
            break

    ppl = torch.exp(torch.stack(nlls).sum() / end_loc)
    return ppl.item()
```
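The metric itself is just the exponential of the mean per-token negative log-likelihood:

```python
import math

token_nlls = [2.1, 1.8, 2.5, 2.0]  # per-token negative log-likelihoods (nats)
ppl = math.exp(sum(token_nlls) / len(token_nlls))
print(round(ppl, 2))  # 8.17
```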
11.2 ROUGE (Summarization)
```python
from rouge_score import rouge_scorer

def evaluate_rouge(predictions: list, references: list) -> dict:
    scorer = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeL"], use_stemmer=True
    )
    scores = {"rouge1": [], "rouge2": [], "rougeL": []}
    for pred, ref in zip(predictions, references):
        score = scorer.score(ref, pred)
        scores["rouge1"].append(score["rouge1"].fmeasure)
        scores["rouge2"].append(score["rouge2"].fmeasure)
        scores["rougeL"].append(score["rougeL"].fmeasure)
    return {k: sum(v) / len(v) for k, v in scores.items()}
```
11.3 LM-Eval Harness
EleutherAI's open-source evaluation framework for automated benchmark evaluation:
```bash
pip install lm-eval

lm_eval --model hf \
  --model_args pretrained=./llama3-merged,dtype=bfloat16 \
  --tasks hellaswag,arc_easy,arc_challenge,winogrande,mmlu \
  --num_fewshot 0 \
  --batch_size 8 \
  --output_path ./eval_results
```
11.4 MT-Bench
Uses GPT-4 as a judge to score multi-turn conversation quality from 1-10:
```bash
pip install fschat

# Generate model answers
python -m fastchat.llm_judge.gen_model_answer \
  --model-path ./llama3-merged \
  --model-id llama3-finetuned \
  --bench-name mt_bench

# GPT-4 judgment
python -m fastchat.llm_judge.gen_judgment \
  --model-list llama3-finetuned \
  --judge-model gpt-4

# Show results
python -m fastchat.llm_judge.show_result \
  --model-list llama3-finetuned
```
Summary
The landscape of LLM fine-tuning has transformed dramatically. Training that required large GPU clusters just a few years ago can now be done on a single consumer GPU.
Core techniques covered:
- Full Fine-tuning: Maximum performance, maximum resources
- LoRA: 99%+ parameter reduction via low-rank matrix decomposition
- QLoRA: 4-bit quantization + LoRA — enables 7B–70B training on one GPU
- Instruction Tuning: Teaching instruction-following behavior
- RLHF: Human preference alignment via three-stage pipeline
- DPO: Direct preference optimization without a reward model
Practical recommendations:
- Start with QLoRA + DPO — the most practical combination for most teams
- Invest most of your time in data quality, not hyperparameter tuning
- 1,000 high-quality examples outperform 10,000 low-quality ones
- Track every experiment with Weights and Biases
- Perplexity improvements do not always mean better user experience — validate with MT-Bench
References
- Hu et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." — https://arxiv.org/abs/2106.09685
- Dettmers et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." — https://arxiv.org/abs/2305.14314
- Rafailov et al. (2023). "Direct Preference Optimization." — https://arxiv.org/abs/2305.18290
- Ouyang et al. (2022). "Training language models to follow instructions with human feedback." — https://arxiv.org/abs/2203.02155
- Zhou et al. (2023). "LIMA: Less Is More for Alignment." — https://arxiv.org/abs/2305.11206
- HuggingFace PEFT — https://huggingface.co/docs/peft/
- HuggingFace TRL — https://huggingface.co/docs/trl/
- PEFT GitHub — https://github.com/huggingface/peft
- LM Evaluation Harness — https://github.com/EleutherAI/lm-evaluation-harness