
LLM Fine-tuning Complete Guide: Master LoRA, QLoRA, RLHF, and DPO


With powerful open-source LLMs like LLaMA 3, Mistral, and Gemma now publicly available, fine-tuning them for specific domains and tasks has become a core skill for AI engineers. This guide covers every major LLM fine-tuning technique from Full Fine-tuning through LoRA, QLoRA, RLHF, and DPO — with complete, production-ready code at every step.


1. Why Fine-tune?

1.1 Limitations of Pretrained Models

Large-scale LLMs are pretrained on vast internet text and acquire impressive general capabilities. Used directly, however, they have several practical limitations:

Domain knowledge gaps: Even GPT-4 does not know your company's internal documentation or the latest medical protocols published after its training cutoff.

No instruction-following by default: Base models are trained to predict the next token — not to follow instructions. A base model asked to "find the bug in this code" may just continue generating plausible text rather than helping.

Output format control: Making a model reliably produce a specific JSON schema or markdown structure is extremely difficult with prompting alone.

Safety and alignment issues: Pretrained models can generate harmful content or behave inconsistently when faced with edge-case inputs.

1.2 Benefits of Fine-tuning

Fine-tuning is the process of further training a pretrained model's weights on new data for a specific purpose:

  • Domain adaptation: Acquire specialized terminology, knowledge, and style
  • Task specialization: Maximize performance on classification, extraction, or generation
  • Behavior control: Learn desired response style, format, and safety boundaries
  • Cost efficiency: A fine-tuned small model can replace expensive large model API calls

1.3 Fine-tuning vs Prompt Engineering

Prompt engineering is fast and free — but limited:

Criterion                Prompt Engineering       Fine-tuning
Implementation effort    Low                      Medium-High
Cost                     Runtime token cost       One-time training cost
Performance ceiling      Limited by base model    Can exceed base
Output consistency       Low                      High
Privacy                  Data sent to API         Local execution possible
Latency                  Long prompts = slow      Short inputs possible

1.4 Types of Fine-tuning

Fine-tuning approaches fall into three major categories:

  1. Full Fine-tuning: Update all parameters (most powerful, most expensive)
  2. PEFT (Parameter-Efficient Fine-Tuning): Update only a small fraction of parameters
    • LoRA, QLoRA, Prefix Tuning, Adapter, etc.
  3. RLHF/DPO: Learn from human preference data (alignment/safety)

2. Full Fine-tuning

2.1 Overview

Full fine-tuning updates every model parameter on new data. Theoretically the most powerful approach, but practical limitations make it rarely the first choice.

2.2 Memory Requirements

Full fine-tuning a 7B-parameter model with AdamW requires approximately:

  • Model weights (BF16): 7B × 2 bytes = 14 GB
  • Gradients (BF16): 14 GB (same as weights)
  • Optimizer states (AdamW): two FP32 moments = 7B × 8 bytes = 56 GB
  • Activations: several GB, depending on batch size and sequence length
  • Total: roughly 85+ GB

Even a single A100 80GB cannot hold a 7B full fine-tune without memory-saving tricks such as 8-bit optimizers, gradient checkpointing, or ZeRO-style sharding. A 70B model requires a multi-GPU cluster.
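This arithmetic is easy to wrap in a rough estimator. A sketch (excludes activations; the defaults assume BF16 weights/gradients and two FP32 AdamW moments, so lower `optim_bytes` for BF16 moments or an 8-bit optimizer):

```python
def full_ft_memory_gb(
    n_params_b: float,      # parameter count in billions
    weight_bytes: int = 2,  # BF16 weights
    grad_bytes: int = 2,    # BF16 gradients
    optim_bytes: int = 8,   # AdamW: two FP32 moments (4 bytes each)
) -> float:
    """Rough lower bound on GPU memory (GB, ~1e9 bytes) for full fine-tuning.

    Activation memory is excluded; it depends on batch size and sequence length.
    """
    return n_params_b * (weight_bytes + grad_bytes + optim_bytes)

print(full_ft_memory_gb(7))                  # 84  -> 7B with FP32 Adam moments
print(full_ft_memory_gb(7, optim_bytes=2))   # 42  -> 7B with an 8-bit optimizer
```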

2.3 When to Use Full Fine-tuning

  • Domain is very different from pretraining distribution (highly specialized medical/legal text)
  • Sufficient GPU resources are available
  • Maximum performance is absolutely required
  • Continual pretraining on new text corpora

2.4 Catastrophic Forgetting

The biggest risk with full fine-tuning is catastrophic forgetting: training on new data degrades previously acquired knowledge.

Mitigation strategies:

  • Low learning rate: Use 1e-5 or below to preserve existing knowledge
  • Data mixing: Mix original pretraining data with new data
  • EWC (Elastic Weight Consolidation): Regularize changes to important weights
  • Use PEFT instead: training only new adapter parameters greatly reduces (though does not eliminate) forgetting

A minimal full fine-tuning setup with the HuggingFace Trainer:
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from datasets import load_dataset
import torch

# Full fine-tuning example
model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

dataset = load_dataset("your_dataset")

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=2048,
        padding="max_length",
    )

tokenized_dataset = dataset.map(tokenize_function, batched=True)

training_args = TrainingArguments(
    output_dir="./full_ft_output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,          # Low LR to preserve knowledge
    weight_decay=0.01,
    bf16=True,
    logging_steps=100,
    save_steps=500,
    evaluation_strategy="steps",
    eval_steps=500,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
)

# Causal LM training needs labels; this collator copies input_ids into labels
from transformers import DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()

3. LoRA (Low-Rank Adaptation)

3.1 The Core Idea

LoRA (Hu et al., 2021) is a game-changer for LLM fine-tuning. The key observation: weight changes during fine-tuning have intrinsically low rank.

Instead of updating the full weight matrix W (d×k), represent the change as the product of two small matrices B (d×r) and A (r×k):

W' = W + delta_W = W + B * A

Here, r is the rank and is chosen to be much smaller than both d and k.

Parameter count comparison:

  • Original delta_W: d × k parameters
  • LoRA B + A: r × (d + k) parameters
  • For d=k=4096, r=16: 16,777,216 vs 131,072 — a 128x reduction!

3.2 Formula Details

Initialization:

  • A: Gaussian random initialization
  • B: Zero initialization (ensures delta_W = 0 at training start)

Forward pass:

h = x * W^T + x * (B * A)^T * (alpha / r)

The alpha/r ratio acts like a learning rate multiplier for the LoRA updates.
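In code, the idea is compact. A minimal PyTorch sketch (illustrative only; the PEFT library additionally handles dropout, dtype casting, and weight merging for inference):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze W
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # Gaussian init
        self.B = nn.Parameter(torch.zeros(d, r))          # zero init: delta_W = 0 at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=16, alpha=32)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 131072 = 16 * (4096 + 4096)
```

Because B starts at zero, the wrapped layer is exactly the original layer at step 0, so training begins from the pretrained behavior.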

3.3 Choosing Rank r

Rank r is LoRA's most important hyperparameter:

  • r=4 or r=8: Lightweight experiments, simple tasks
  • r=16: Good balance for most tasks — recommended starting point
  • r=32 or r=64: Complex tasks requiring more capacity
  • r=128+: When performance close to full fine-tuning is needed

Start with r=16 and increase if performance is insufficient.

3.4 The alpha Hyperparameter

Alpha scales the LoRA update magnitude. The actual scale factor applied is alpha/r:

  • alpha = r: scale factor = 1 (common choice)
  • alpha = 2r: scale factor = 2 (stronger LoRA updates)
  • Can be tuned independently of the learning rate

3.5 Which Layers to Apply LoRA To?

The original paper applied LoRA only to Q and V projection matrices. Experiments show:

  • Q, K, V, O (all attention projections): Generally good
  • + MLP layers: Often better for complex tasks
  • All linear layers: Near full fine-tuning performance

HuggingFace PEFT defaults to Q and V. For complex tasks, applying to all linear layers is recommended.

3.6 LoRA with HuggingFace PEFT

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,                 # alpha = 2*r
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model_name = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Create LoRA model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output: trainable params: 41,943,040 || all params: 3,254,779,904 || trainable%: 1.29%

# Inspect trainable parameters
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"Trainable: {name}, shape: {param.shape}")

4. QLoRA (Quantized LoRA)

4.1 What Is QLoRA?

QLoRA (Dettmers et al., 2023) makes LoRA even more memory-efficient by training LoRA adapters on top of a 4-bit quantized base model.

Three core techniques:

  1. 4-bit NF4 quantization: Compress base model weights to 4 bits
  2. Double quantization: Quantize the quantization constants themselves
  3. Paged optimizers: Page optimizer states between CPU RAM and GPU

4.2 4-bit NF4 Quantization

NF4 (NormalFloat4) is a 4-bit data type optimized for normally distributed weights — which LLM weights tend to follow.

NF4 is information-theoretically optimal for normally distributed data: each quantization bin covers equal probability mass.

Memory savings:

  • FP16 → INT4: 4x compression
  • 70B model: 140GB (FP16) → 35GB (4-bit) — trainable on 2×24GB consumer GPUs
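The exact NF4 codebook lives inside bitsandbytes, but the blockwise idea behind it can be illustrated with a plain symmetric absmax quantizer (an illustration only, not the real NF4 quantile levels):

```python
import torch

def quantize_4bit_blockwise(w: torch.Tensor, block_size: int = 64):
    """Symmetric absmax 4-bit quantization per block (stand-in for NF4)."""
    blocks = w.reshape(-1, block_size)
    absmax = blocks.abs().amax(dim=1, keepdim=True)        # one FP scale per block
    q = torch.clamp((blocks / absmax * 7).round(), -8, 7)  # 16 integer levels
    return q.to(torch.int8), absmax

def dequantize_4bit_blockwise(q, absmax, shape):
    return (q.float() / 7 * absmax).reshape(shape)

w = torch.randn(4096, 64)
q, scales = quantize_4bit_blockwise(w)
w_hat = dequantize_4bit_blockwise(q, scales, w.shape)
print((w - w_hat).abs().mean())  # small per-weight reconstruction error
```

The per-block `absmax` values are the "quantization constants" that Double Quantization (next section) compresses further.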

4.3 Double Quantization

Quantization constants themselves occupy memory (roughly 32 bits per group of 64 weights). Double Quantization quantizes these constants to 8 bits, saving approximately 0.37 bits per parameter.

4.4 Paged Optimizers

During training of long sequences, peak memory spikes can cause OOM errors. Paged Optimizers use NVIDIA unified memory to automatically offload optimizer states to CPU RAM when GPU memory is full, then page them back when needed.

4.5 Complete QLoRA Training Code

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
from datasets import load_dataset
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load model in 4-bit
model_name = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Prepare model for k-bit training
# (casts LayerNorm to FP32, handles embedding layers, etc.)
model = prepare_model_for_kbit_training(model)

# LoRA config
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Dataset preparation - Alpaca format
dataset = load_dataset("tatsu-lab/alpaca", split="train")

def format_instruction(sample):
    instruction = sample["instruction"]
    input_text = sample.get("input", "")
    output = sample["output"]

    if input_text:
        text = f"""### Instruction:
{instruction}

### Input:
{input_text}

### Response:
{output}"""
    else:
        text = f"""### Instruction:
{instruction}

### Response:
{output}"""

    return {"text": text}

formatted_dataset = dataset.map(format_instruction)

# Training config
training_args = TrainingArguments(
    output_dir="./qlora_output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,       # Save memory
    optim="paged_adamw_32bit",         # Paged Optimizer!
    logging_steps=25,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    report_to="wandb",
    run_name="llama-3-qlora",
)

# Only compute loss on response tokens
response_template = "### Response:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=formatted_dataset,
    data_collator=collator,
    tokenizer=tokenizer,
    max_seq_length=2048,
    dataset_text_field="text",
    packing=False,
)

trainer.train()
trainer.save_model("./qlora_adapter")
tokenizer.save_pretrained("./qlora_adapter")

5. Other PEFT Methods

5.1 Prefix Tuning

Prefix Tuning prepends learnable "virtual token" embeddings to K and V at every Transformer layer. Base model weights are frozen entirely.

from peft import PrefixTuningConfig, get_peft_model

prefix_config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,
    prefix_projection=True,   # Project prefix through MLP
)

model = get_peft_model(base_model, prefix_config)

Prefix Tuning performs well on seq2seq tasks but generally underperforms LoRA.

5.2 Prompt Tuning

The simplest PEFT method. Adds only learnable soft prompt embeddings before the input — no changes to the model itself.

from peft import PromptTuningConfig, PromptTuningInit

prompt_config = PromptTuningConfig(
    task_type="CAUSAL_LM",
    prompt_tuning_init=PromptTuningInit.TEXT,
    num_virtual_tokens=8,
    prompt_tuning_init_text="Classify the sentiment:",
    tokenizer_name_or_path=model_name,
)

Performs better as model size increases. Extremely parameter-efficient but has a lower performance ceiling.

5.3 IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations)

IA3 achieves LoRA-like performance with roughly 1/10 the parameters. It multiplies learned vectors into K, V, and the FFN activations.

from peft import IA3Config

ia3_config = IA3Config(
    task_type="CAUSAL_LM",
    target_modules=["k_proj", "v_proj", "down_proj"],
    feedforward_modules=["down_proj"],
)

5.4 Adapter

Adapter layers insert small bottleneck modules inside each Transformer layer (down-projection → nonlinearity → up-projection, plus a residual connection). Proposed before LoRA, they add slight inference latency because the extra modules stay in the forward path, whereas LoRA updates can be merged into the base model weights.
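A bottleneck adapter is a few lines of PyTorch (a minimal sketch; real adapter libraries add layer norms and configurable placement):

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, with a residual connection."""
    def __init__(self, d_model: int = 4096, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

adapter = BottleneckAdapter()
x = torch.randn(2, 10, 4096)
print(adapter(x).shape)  # torch.Size([2, 10, 4096])
```

Zero-initializing the up-projection makes the adapter a no-op at the start of training, the same trick LoRA uses with its B matrix.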


6. Instruction Tuning

6.1 What Is Instruction Tuning?

A base LLM is trained to continue text, not to follow instructions. Instruction tuning fine-tunes the model on instruction-response pairs so it learns to be helpful. InstructGPT and FLAN pioneered the technique; Stanford Alpaca (2023) popularized it for open models by showing it could be replicated cheaply.

6.2 Dataset Formats

Alpaca format:

{
  "instruction": "Find the greatest common divisor of two numbers.",
  "input": "24, 36",
  "output": "The GCD of 24 and 36 is 12.\n\nCalculation:\n- 24 = 2^3 * 3\n- 36 = 2^2 * 3^2\n- GCD = 2^2 * 3 = 12"
}

ChatML format (OpenAI standard):

<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
Implement quicksort in Python.<|im_end|>
<|im_start|>assistant
Here's a quicksort implementation in Python...

ShareGPT format (conversational):

{
  "conversations": [
    { "from": "human", "value": "question text" },
    { "from": "gpt", "value": "answer text" },
    { "from": "human", "value": "follow-up question" },
    { "from": "gpt", "value": "follow-up answer" }
  ]
}

6.3 Data Quality Over Quantity

The LIMA paper (Less is More for Alignment, 2023) showed that just 1,000 high-quality examples are sufficient to produce strong instruction-following behavior. Quality matters far more than quantity.

High-quality instruction data criteria:

  • Diversity: Covers many task types and domains
  • Clarity: Instructions are unambiguous
  • Accuracy: Responses are factually correct
  • Consistency: Same response style across similar instructions
  • Appropriate length: Only as long as necessary

6.4 Formatting and Tokenization

def format_alpaca_prompt(sample: dict) -> str:
    instruction = sample["instruction"]
    input_text = sample.get("input", "")
    output = sample["output"]

    if input_text:
        return f"""### Instruction:
{instruction}

### Input:
{input_text}

### Response:
{output}"""
    else:
        return f"""### Instruction:
{instruction}

### Response:
{output}"""


def tokenize_with_label_masking(sample, tokenizer, max_length=2048):
    """Mask everything before the response — only compute loss on the answer"""
    full_text = sample["text"]
    tokenized = tokenizer(
        full_text, truncation=True, max_length=max_length, return_tensors="pt"
    )

    input_ids = tokenized["input_ids"][0]
    labels = input_ids.clone()

    response_start_str = "### Response:"
    response_token_ids = tokenizer.encode(response_start_str, add_special_tokens=False)

    for i in range(len(input_ids) - len(response_token_ids) + 1):
        if input_ids[i:i+len(response_token_ids)].tolist() == response_token_ids:
            labels[:i + len(response_token_ids)] = -100
            break

    return {"input_ids": input_ids, "labels": labels}

7. RLHF (Reinforcement Learning from Human Feedback)

7.1 RLHF Overview

RLHF is the alignment technique behind ChatGPT, Claude, and Gemini. It trains models to be more helpful, harmless, and honest by learning from human preference judgments.

Three-stage pipeline:

  1. SFT (Supervised Fine-Tuning): Fine-tune on high-quality demonstration data
  2. Reward Model training: Train a model to score response quality
  3. RL optimization (PPO): Optimize the policy (LLM) using reward signals

7.2 Stage 1: Supervised Fine-Tuning

Fine-tune the base model on high-quality conversation data. This stage teaches basic instruction-following.

from trl import SFTTrainer, SFTConfig
from peft import LoraConfig

sft_config = SFTConfig(
    output_dir="./sft_model",
    max_seq_length=2048,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    bf16=True,
    optim="adamw_torch_fused",
    logging_steps=10,
    save_steps=100,
    warmup_ratio=0.05,
    dataset_text_field="text",
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=sft_dataset,
    peft_config=lora_config,
    tokenizer=tokenizer,
)

trainer.train()
trainer.save_model("./sft_model")

7.3 Stage 2: Reward Model Training

The Reward Model (RM) learns from human comparisons of two responses. It uses the same LLM architecture with an added linear head that outputs a scalar reward.
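Concretely, the RM is trained with a pairwise Bradley-Terry objective: maximize the probability that the chosen response outscores the rejected one. A minimal sketch of the loss:

```python
import torch
import torch.nn.functional as F

def reward_pair_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected): pushes chosen scores above rejected ones."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Scalar rewards for a batch of (chosen, rejected) pairs
r_c = torch.tensor([1.2, 0.8, 2.0])
r_r = torch.tensor([0.3, 1.0, -0.5])
print(reward_pair_loss(r_c, r_r))
```

Only the score margin matters, so the RM's absolute reward scale is arbitrary; this is one reason RLHF adds a KL penalty rather than trusting raw reward magnitudes.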

Preference data format:

{
  "prompt": "Explain climate change",
  "chosen": "Climate change refers to long-term shifts in global temperatures... (detailed, accurate)",
  "rejected": "It's just the Earth getting warmer. (vague, incomplete)"
}

from trl import RewardTrainer, RewardConfig

reward_config = RewardConfig(
    output_dir="./reward_model",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    bf16=True,
    max_length=2048,
    logging_steps=10,
    remove_unused_columns=False,
)

reward_trainer = RewardTrainer(
    model=reward_model,    # Based on SFT model
    args=reward_config,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
    peft_config=lora_config,
)

reward_trainer.train()

7.4 Stage 3: PPO Policy Optimization

PPO (Proximal Policy Optimization) optimizes the SFT model using the reward signal.

Core objective:

L = E[r(x, y)] - beta * KL(pi_theta || pi_ref)

  • r(x, y): reward from the Reward Model
  • KL divergence penalty: prevents the policy from drifting too far from the SFT reference

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

ppo_config = PPOConfig(
    model_name=sft_model_path,
    learning_rate=1.41e-5,
    batch_size=128,
    mini_batch_size=4,
    gradient_accumulation_steps=1,
    optimize_cuda_cache=True,
    early_stopping=True,
    target_kl=0.1,
    ppo_epochs=4,
    seed=42,
    init_kl_coef=0.2,
    adap_kl_ctrl=True,
)

policy_model = AutoModelForCausalLMWithValueHead.from_pretrained(sft_model_path)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(sft_model_path)

ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=policy_model,
    ref_model=ref_model,
    tokenizer=tokenizer,
    dataset=rl_dataset,
)

# PPO training loop
for batch in ppo_trainer.dataloader:
    query_tensors = batch["input_ids"]

    # 1. Generate responses from policy
    response_tensors = ppo_trainer.generate(
        query_tensors,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
    )

    # 2. Score with Reward Model
    rewards = [
        reward_model.compute_reward(q, r)
        for q, r in zip(query_tensors, response_tensors)
    ]

    # 3. PPO update
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)

8. DPO (Direct Preference Optimization)

8.1 RLHF Complexity Problem

RLHF is powerful but complex:

  • Requires training a separate Reward Model
  • PPO has many sensitive hyperparameters
  • Training instability is common
  • Requires 4 models in memory simultaneously (policy, reference, reward, critic)

8.2 The DPO Insight

Rafailov et al. (2023) proved that you can optimize directly on preference data without a reward model. The key insight: the optimal RLHF policy can be expressed analytically in terms of the policy's own log-likelihood ratios. Substituting this back gives a loss function that operates directly on preference pairs:

L_DPO = -E[log sigma(
    beta * (log pi(y_w|x) - log pi_ref(y_w|x)) -
    beta * (log pi(y_l|x) - log pi_ref(y_l|x))
)]

Where:

  • y_w: preferred response (chosen)
  • y_l: dispreferred response (rejected)
  • pi: model being trained
  • pi_ref: reference policy (frozen SFT model)
  • beta: KL penalty coefficient
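Given per-response log-probabilities (summed over response tokens), the loss is a few lines. A sketch assuming those sums are precomputed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = pi_logp_w - ref_logp_w      # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_ratio = pi_logp_l - ref_logp_l    # log pi(y_l|x) - log pi_ref(y_l|x)
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Sequence log-probs (summed over response tokens) for a batch of 2 pairs
pi_w  = torch.tensor([-12.0, -20.0])
pi_l  = torch.tensor([-15.0, -18.0])
ref_w = torch.tensor([-13.0, -21.0])
ref_l = torch.tensor([-14.0, -19.0])
print(dpo_loss(pi_w, pi_l, ref_w, ref_l))
```

The loss shrinks as the policy raises the chosen response's likelihood relative to the reference while lowering the rejected one's, with no reward model or sampling loop involved.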

8.3 Preference Data Format

preference_data = {
    "prompt": "How do I sort a list in Python?",
    "chosen": "Python provides two main ways to sort lists:\n\n1. The sort() method — sorts in place:\n```python\nmy_list = [3, 1, 4, 1, 5]\nmy_list.sort()  # modifies my_list directly\n```\n\n2. The sorted() function — returns a new sorted list:\n```python\noriginal = [3, 1, 4, 1, 5]\nsorted_list = sorted(original)  # original unchanged\n```\n\nBoth support reverse=True and a key function for custom sorting.",
    "rejected": "Use list.sort() or sorted()."
}

8.4 DPO Training with trl

from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM
from peft import get_peft_model
from datasets import load_dataset
import torch

dpo_config = DPOConfig(
    output_dir="./dpo_model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,            # Very low LR for DPO
    bf16=True,
    beta=0.1,                      # KL penalty coefficient
    max_length=2048,
    max_prompt_length=512,
    remove_unused_columns=False,
    logging_steps=10,
    save_steps=100,
    warmup_steps=100,
    report_to="wandb",
)

# Load preference dataset
# Format: {"prompt": ..., "chosen": ..., "rejected": ...}
preference_dataset = load_dataset("Anthropic/hh-rlhf", split="train")

def format_hh_rlhf(sample):
    """Parse HH-RLHF dataset format"""
    return {
        "prompt": sample["chosen"].rsplit("\nAssistant:", 1)[0] + "\nAssistant:",
        "chosen": sample["chosen"].rsplit("\nAssistant:", 1)[1].strip(),
        "rejected": sample["rejected"].rsplit("\nAssistant:", 1)[1].strip(),
    }

formatted_dataset = preference_dataset.map(format_hh_rlhf)
formatted_dataset = formatted_dataset.train_test_split(test_size=0.05, seed=42)

# Start DPO from the SFT model; DPOTrainer applies LoRA via peft_config below
sft_model = AutoModelForCausalLM.from_pretrained(
    sft_model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

dpo_trainer = DPOTrainer(
    model=sft_model,
    ref_model=None,           # None creates reference automatically
    args=dpo_config,
    train_dataset=formatted_dataset["train"],
    eval_dataset=formatted_dataset["test"],
    tokenizer=tokenizer,
    peft_config=lora_config,
)

dpo_trainer.train()
dpo_trainer.save_model("./dpo_final")

8.5 RLHF vs DPO Comparison

Property                    RLHF (PPO)          DPO
Implementation complexity   High                Low
Models in memory            4                   2
Training stability          Low                 High
Memory usage                High                Moderate
Online learning             Yes                 Difficult
Performance                 Generally higher    Comparable or slightly lower
Real-time feedback          Yes                 No

In practice, most teams start with DPO for its simplicity, then consider RLHF only if DPO is insufficient.


9. Data Preparation

9.1 Data Quality Standards

Data quality determines 80% of fine-tuning performance. High-quality data criteria:

  1. Accuracy: No factual errors
  2. Completeness: Fully answers the question
  3. Clarity: Unambiguous and easy to understand
  4. Format consistency: All examples follow the same format
  5. Non-toxic: No harmful or biased content
  6. Deduplicated: Near-duplicates removed

9.2 ChatML Format Processing

def create_chatml_prompt(conversation: list) -> str:
    """Convert multi-turn conversation to ChatML format"""
    messages = []
    for turn in conversation:
        role = turn["role"]    # system, user, assistant
        content = turn["content"]
        messages.append(f"<|im_start|>{role}\n{content}<|im_end|>")
    return "\n".join(messages) + "\n<|im_start|>assistant\n"


def tokenize_with_response_masking(sample, tokenizer, max_length=2048):
    """Only compute loss on assistant turns"""
    full_text = sample["text"]
    tokenized = tokenizer(
        full_text, truncation=True, max_length=max_length, return_tensors="pt"
    )

    input_ids = tokenized["input_ids"][0]
    labels = input_ids.clone()

    assistant_token_ids = tokenizer.encode(
        "<|im_start|>assistant\n", add_special_tokens=False
    )
    end_token_ids = tokenizer.encode("<|im_end|>", add_special_tokens=False)

    in_assistant = False
    i = 0
    while i < len(input_ids):
        if input_ids[i:i+len(assistant_token_ids)].tolist() == assistant_token_ids:
            in_assistant = True
            labels[i:i+len(assistant_token_ids)] = -100
            i += len(assistant_token_ids)
            continue
        # Keep <|im_end|> in the labels so the model learns when to stop
        if in_assistant and input_ids[i:i+len(end_token_ids)].tolist() == end_token_ids:
            in_assistant = False
            i += len(end_token_ids)
            continue
        if not in_assistant:
            labels[i] = -100
        i += 1

    return {"input_ids": input_ids, "labels": labels}

9.3 Data Cleaning Pipeline

from datasets import Dataset, load_dataset
import hashlib
import re


def deduplicate_dataset(dataset: Dataset, text_field: str = "text") -> Dataset:
    """Remove exact duplicates by MD5 hash (near-duplicates need MinHash or embeddings)"""
    seen = set()
    keep = []
    for i, sample in enumerate(dataset):
        h = hashlib.md5(sample[text_field].encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            keep.append(i)
    return dataset.select(keep)


def quality_filter(sample: dict) -> bool:
    """Basic quality filtering"""
    text = sample.get("output", sample.get("text", ""))
    words = text.split()

    # Too short
    if len(words) < 10:
        return False

    # Suspiciously long
    if len(words) > 2000:
        return False

    # Mostly URLs
    url_count = sum(1 for w in words if w.startswith("http"))
    if len(words) > 0 and url_count / len(words) > 0.3:
        return False

    # Mostly numbers (unlikely to be useful instruction data)
    num_count = sum(1 for w in words if re.match(r'^\d+$', w))
    if len(words) > 0 and num_count / len(words) > 0.5:
        return False

    return True


# Run pipeline
raw_dataset = load_dataset("your_dataset")["train"]
filtered = raw_dataset.filter(quality_filter)
deduped = deduplicate_dataset(filtered)
print(f"After cleaning: {len(deduped)} examples (was {len(raw_dataset)})")

10. Production Fine-tuning Pipeline

10.1 Complete Llama 3 QLoRA Fine-tuning

#!/usr/bin/env python3
"""
Production Llama 3 QLoRA Fine-tuning Pipeline
"""

import os
import torch
import wandb
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    EarlyStoppingCallback,
)
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer, SFTConfig, DataCollatorForCompletionOnlyLM


# ==============================
# Configuration
# ==============================
MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"
OUTPUT_DIR = "./llama3-qlora-output"
DATASET_NAME = "iamtarun/python_code_instructions_18k_alpaca"
MAX_SEQ_LENGTH = 2048

LORA_R = 64
LORA_ALPHA = 16
LORA_DROPOUT = 0.05

BATCH_SIZE = 4
GRAD_ACCUM = 4
LEARNING_RATE = 2e-4
NUM_EPOCHS = 3

# ==============================
# Initialize W&B
# ==============================
wandb.init(
    project="llm-finetuning",
    name="llama3-qlora-code",
    config={
        "model": MODEL_NAME,
        "lora_r": LORA_R,
        "lora_alpha": LORA_ALPHA,
        "lr": LEARNING_RATE,
        "epochs": NUM_EPOCHS,
    },
)

# ==============================
# 4-bit quantization
# ==============================
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# ==============================
# Load model and tokenizer
# ==============================
print(f"Loading {MODEL_NAME}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model.config.use_cache = False
model.config.pretraining_tp = 1
model = prepare_model_for_kbit_training(model)

# ==============================
# LoRA setup
# ==============================
lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# ==============================
# Dataset
# ==============================
dataset = load_dataset(DATASET_NAME, split="train")
dataset = dataset.train_test_split(test_size=0.05, seed=42)

def format_prompt(sample):
    return {
        "text": f"""### Instruction:
{sample['instruction']}

### Input:
{sample.get('input', '')}

### Response:
{sample['output']}"""
    }

train_dataset = dataset["train"].map(format_prompt)
eval_dataset = dataset["test"].map(format_prompt)

# ==============================
# Training
# ==============================
training_config = SFTConfig(
    output_dir=OUTPUT_DIR,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACCUM,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    learning_rate=LEARNING_RATE,
    bf16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=25,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    report_to="wandb",
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_text_field="text",
    packing=False,
    group_by_length=True,
)

response_template = "### Response:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    args=training_config,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=collator,
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

print("Starting training...")
trainer.train()

print("Saving...")
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
wandb.finish()
print(f"Done! Saved to {OUTPUT_DIR}")

10.2 Merging LoRA Adapters

Merge LoRA weights into the base model for deployment:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load base model (on CPU to save VRAM)
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)

peft_model = PeftModel.from_pretrained(base_model, OUTPUT_DIR)

print("Merging LoRA weights...")
merged_model = peft_model.merge_and_unload()

MERGED_DIR = "./llama3-merged"
merged_model.save_pretrained(MERGED_DIR, safe_serialization=True)
tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(MERGED_DIR)

print(f"Merged model saved to {MERGED_DIR}")
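Merging works because the LoRA update is just an additive low-rank term: W_merged = W + (α/r)·BA, so folding it in changes nothing numerically while removing the extra adapter matmuls at inference. A tiny hand-rolled check with hypothetical 2×2 numbers (pure Python; `matmul` and `add_scaled` are throwaway helpers, not PEFT APIs):

```python
# Verify that W + (alpha/r) * B @ A reproduces the base-plus-LoRA output.
def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add_scaled(X, Y, s):
    return [[a + s * b for a, b in zip(ra, rb)] for ra, rb in zip(X, Y)]

W = [[1, 0], [0, 1]]   # frozen base weight (d=2)
A = [[1, 2]]           # LoRA down-projection (r=1)
B = [[3], [4]]         # LoRA up-projection
alpha, r = 2, 1
scaling = alpha / r

W_merged = add_scaled(W, matmul(B, A), scaling)

x = [[1, 1]]           # a row-vector input
y_adapter = add_scaled(matmul(x, W), matmul(x, matmul(B, A)), scaling)
y_merged = matmul(x, W_merged)
print(y_adapter == y_merged)  # True: merging is numerically a no-op
```

This is also why a merged model loses the ability to hot-swap adapters: the update is baked into the dense weights.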

10.3 Deploy with Ollama

Recent Ollama versions can import a safetensors model directory directly via FROM; older versions require converting the model to GGUF with llama.cpp first.

# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./llama3-merged

TEMPLATE """### Instruction:
{{ .Prompt }}

### Response:
"""

PARAMETER stop "### Instruction:"
PARAMETER temperature 0.7
PARAMETER top_p 0.9
EOF

# Build and test
ollama create my-llama3 -f Modelfile
ollama run my-llama3 "Implement quicksort in Python"
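Beyond the CLI, Ollama serves the model over a local REST API (default port 11434), which is convenient for scripting. A minimal stdlib-only client sketch against the documented /api/generate endpoint — `build_payload` and `generate` are our own helper names, and the model name matches the `ollama create` call above:

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "my-llama3") -> dict:
    """Request body for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, host: str = "http://localhost:11434") -> str:
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires a running Ollama server (`ollama serve`):
# print(generate("Implement quicksort in Python"))
```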

10.4 Deploy with vLLM

from vllm import LLM, SamplingParams

llm = LLM(
    model="./llama3-merged",
    dtype="bfloat16",
    max_model_len=4096,
    tensor_parallel_size=1,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

prompts = [
    "### Instruction:\nImplement a binary search tree in Python\n\n### Response:\n",
    "### Instruction:\nExplain the CAP theorem\n\n### Response:\n",
]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
    print("---")

11. Evaluation

11.1 Perplexity

Perplexity, the exponentiated average negative log-likelihood per token, is the fundamental language model metric. Lower is better.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import math


def compute_perplexity(
    model,
    tokenizer,
    texts: list,
    max_length: int = 1024,
    stride: int = 512,
) -> float:
    """Sliding window perplexity computation"""
    combined_text = "\n\n".join(texts[:100])
    encodings = tokenizer(combined_text, return_tensors="pt")
    seq_len = encodings.input_ids.size(1)
    nlls = []

    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - begin_loc - (stride if begin_loc > 0 else 0)

        input_ids = encodings.input_ids[:, begin_loc:end_loc].to(model.device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100

        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)
            neg_log_likelihood = outputs.loss * trg_len

        nlls.append(neg_log_likelihood)
        if end_loc == seq_len:
            break

    ppl = torch.exp(torch.stack(nlls).sum() / end_loc)
    return ppl.item()
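To build intuition for the numbers this function reports: a model that guesses uniformly over a V-token vocabulary scores a perplexity of exactly V, which is why perplexity is read as an "effective branching factor". A pure-Python sketch with hypothetical per-token log-probabilities (`perplexity` here is a simplified stand-in, not the sliding-window function above):

```python
import math

def perplexity(token_logprobs):
    """exp of the average negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Uniform guessing over a 50-token vocabulary: log p = -log(50) per token
uniform = [-math.log(50)] * 100
print(round(perplexity(uniform), 4))  # 50.0

# A model that is confident (p=0.9) on 95% of tokens scores far lower
confident = [math.log(0.9)] * 95 + [math.log(0.01)] * 5
print(round(perplexity(confident), 2))
```

Note how a handful of badly mispredicted tokens (p=0.01) dominates the average: perplexity punishes rare large surprises heavily.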

11.2 ROUGE (Summarization)

from rouge_score import rouge_scorer


def evaluate_rouge(predictions: list, references: list) -> dict:
    scorer = rouge_scorer.RougeScorer(
        ['rouge1', 'rouge2', 'rougeL'], use_stemmer=True
    )
    scores = {"rouge1": [], "rouge2": [], "rougeL": []}

    for pred, ref in zip(predictions, references):
        score = scorer.score(ref, pred)
        scores["rouge1"].append(score["rouge1"].fmeasure)
        scores["rouge2"].append(score["rouge2"].fmeasure)
        scores["rougeL"].append(score["rougeL"].fmeasure)

    return {k: sum(v) / len(v) for k, v in scores.items()}
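Under the hood, ROUGE-1 is just a unigram-overlap F-measure. A from-scratch sketch of the core computation (simplified: whitespace tokenization and no stemming, so scores will differ slightly from rouge_score's):

```python
from collections import Counter

def rouge1_f(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 (simplified ROUGE-1)."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# 3 of 3 predicted unigrams match (precision 1.0), 3 of 6 reference
# unigrams are covered (recall 0.5) -> F1 = 2/3
print(rouge1_f("the cat sat", "the cat sat on the mat"))
```

ROUGE-2 replaces unigrams with bigrams, and ROUGE-L uses the longest common subsequence instead of n-gram counts.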

11.3 LM-Eval Harness

EleutherAI's open-source evaluation framework for automated benchmark evaluation:

pip install lm-eval

lm_eval --model hf \
    --model_args pretrained=./llama3-merged,dtype=bfloat16 \
    --tasks hellaswag,arc_easy,arc_challenge,winogrande,mmlu \
    --num_fewshot 0 \
    --batch_size 8 \
    --output_path ./eval_results
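lm_eval writes a JSON results file under --output_path. A small helper to print a metric table from it — the assumed schema (a top-level "results" dict keyed by task) matches recent lm-eval versions, but the exact file name and metric keys vary by version, so adjust as needed:

```python
import json

def summarize_results(path: str) -> None:
    """Print task / metric / value rows from an lm-eval results JSON file."""
    with open(path) as f:
        results = json.load(f).get("results", {})
    for task, metrics in sorted(results.items()):
        for name, value in metrics.items():
            if isinstance(value, (int, float)):  # skip aliases and strings
                print(f"{task:<15} {name:<20} {value:.4f}")

# Example (file name depends on your lm-eval version):
# summarize_results("./eval_results/results.json")
```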

11.4 MT-Bench

MT-Bench uses GPT-4 as a judge to score multi-turn conversation quality on a 1–10 scale:

pip install fschat

# Generate model answers
python -m fastchat.llm_judge.gen_model_answer \
    --model-path ./llama3-merged \
    --model-id llama3-finetuned \
    --bench-name mt_bench

# GPT-4 judgment
python -m fastchat.llm_judge.gen_judgment \
    --model-list llama3-finetuned \
    --judge-model gpt-4

# Show results
python -m fastchat.llm_judge.show_result \
    --model-list llama3-finetuned

Summary

The landscape of LLM fine-tuning has transformed dramatically. Training that required large GPU clusters just a few years ago can now be done on a single consumer GPU.

Core techniques covered:

  1. Full Fine-tuning: Maximum performance, maximum resources
  2. LoRA: cuts trainable parameters by 99%+ via low-rank matrix decomposition
  3. QLoRA: 4-bit quantization + LoRA — enables 7B–70B training on one GPU
  4. Instruction Tuning: Teaching instruction-following behavior
  5. RLHF: Human preference alignment via three-stage pipeline
  6. DPO: Direct preference optimization without a reward model

Practical recommendations:

  • Start with QLoRA + DPO — the most practical combination for most teams
  • Invest most of your time in data quality, not hyperparameter tuning
  • 1,000 high-quality examples outperform 10,000 low-quality ones
  • Track every experiment with Weights & Biases
  • Perplexity improvements do not always mean better user experience — validate with MT-Bench

References