- Published on
LLM Fine-tuning Frameworks 2026 — A Deep Dive into Axolotl, Unsloth, LLaMA-Factory, TRL, PEFT, and TorchTune
- Authors

- Name
- Youngju Kim
- @fjvbn20031
Prologue — Why Fine-tuning, Again?
In early 2025 the line went around that "fine-tuning is dead." GPT-4o, Claude 3.5, and Gemini 1.5 had pushed context windows past 1M tokens, and RAG plus few-shot prompting seemed to solve almost everything. Then the mood shifted in the second half of 2025. DeepSeek R1's GRPO paper, Meta's release of LLaMA 3.3 and 4, and a steady drumbeat of case studies showing that a small 7B model tuned for a domain could be cheaper and faster than calling GPT-4 — all of it pushed fine-tuning back to the top of the toolbox.
Here is the landscape as of May 2026.
- Open-source frameworks have crystallized into a six-way race: Axolotl, Unsloth, LLaMA-Factory, TRL, PEFT, and TorchTune.
- Cloud fine-tuning APIs have settled into five standards: OpenAI, Anthropic, Cohere, Together, and Modal.
- Algorithms layer on top of SFT in this order: DPO, GRPO, KTO, IPO.
- Distributed training defaults to QLoRA plus FSDP plus DeepSpeed Zero.
This post walks through that map in 12–14 chapters. Who does what well, and what you should pick for your situation.
1. The 2026 LLM Fine-tuning Map — Three Camps
Lining up every tool side by side makes nothing easier to compare. Split them into three camps first.
| Camp | Representative tools | Primary users |
|---|---|---|
| Open-source frameworks | Axolotl, Unsloth, LLaMA-Factory, TRL, PEFT, TorchTune, LLM Foundry | Academics, startups, solo developers |
| Cloud fine-tuning APIs | OpenAI, Anthropic, Cohere, Together, Modal, Fireworks | Enterprise, product teams |
| Vertically integrated foundation labs | Upstage, Sakana, Mistral, Cohere Labs, OpenAI Custom Models | Research labs, manufacturers, R AND D-led startups |
The split is not perfect. Together is a cloud API but exposes LoRA fine-tuning almost identically to the open-source stack. Modal is more an infrastructure layer renting GPUs. Even so, holding these three camps in mind makes the axes of choice visible.
The open-source camp says: "Give us a GPU and we will handle the rest." Axolotl, Unsloth, and LLaMA-Factory let you define a full pipeline with a single YAML or config file, and PEFT, TRL, and Accelerate sit underneath as libraries.
The cloud API camp says: "Throw us your data and we will handle the GPU, tuning, and serving." OpenAI, Anthropic, and Cohere tune their own models only; Together, Modal, and Fireworks let you tune open models like Llama, Qwen, DeepSeek, and Mistral.
The foundation lab camp is the place where companies that make their own models also sell domain fine-tuning on top. Upstage in Korea; Sakana, ELYZA, and PFN in Japan; Mistral and Cohere in the US occupy this seat.
The 2026 trend is that all three camps are encroaching on each other. OpenAI shipped Reinforcement Fine-Tuning (RFT), Anthropic exposed Constitutional Finetuning, Together is scaling its own clusters, and Upstage is going global as a SaaS. The lines blur.
2. Why Fine-tune? — A RAG vs Fine-tuning vs Few-shot Decision Table
Before picking a fine-tuning tool, you have to ask whether you should fine-tune at all. The decision table looks like this.
| Situation | Recommendation | Reason |
|---|---|---|
| Facts or documents change frequently | RAG | Just refresh the index; no model retraining |
| Output format or style must be consistent | Fine-tuning (SFT) | System prompts cannot give 100% consistency |
| Domain-specific vocabulary or jargon | Fine-tuning (SFT) + RAG | Model handles style; RAG handles facts |
| Reflect human preferences (politeness, safety) | DPO / GRPO / KTO | Train on paired or scalar preference signals |
| Cut per-request cost and latency | Fine-tuning (small) | A tuned 7B is cheaper than calling GPT-4 |
| Need a PoC in a week | Few-shot prompting | No training; instant validation |
| Long-chain reasoning like code or math | GRPO + RL | RL deepens reasoning ability |
| Fewer than 100 data points | Few-shot or PEFT with small r | Small data fits LoRA r=4–8 |
| More than 100k data points | Full SFT or LoRA r=64+ | Large data justifies full tune or big adapter |
| Need a model you own as IP | Self-host with Axolotl/LLM Foundry | You own the weights |
Three core principles.
- Facts to RAG, behavior to fine-tuning. What the model needs to know goes into RAG; how the model behaves goes into fine-tuning.
- SFT first, RL later. No RL algorithm works well without a seeded SFT base. Get to a stable region with SFT first, then layer DPO, GRPO, or KTO on top.
- Validate fast with a small model. A 7B LoRA tune takes 1–2 hours. Run it small and fast first, then scale up to 70B.
With those principles in mind, into the tools.
3. PEFT — The Basics of LoRA, QLoRA, AdaLoRA, and IA3
PEFT (Parameter-Efficient Fine-Tuning) is the umbrella term for "don't train all the weights; train a small adapter." It started with Microsoft's 2021 LoRA paper and is the de facto standard in 2026.
LoRA (Low-Rank Adaptation). Add two small matrices A and B to the large weight matrix W and train only A and B. Even small ranks like r=8/16/32 produce near-full-tune quality. Weight updates shrink by 100x to 1000x, and so do memory and disk.
QLoRA. Tim Dettmers's 2023 paper. Quantize the base model to 4-bit and train only the LoRA in 16-bit. This is the technique that made it possible to tune 70B models on a single A100 80GB. NF4 (NormalFloat 4) quantization, double quantization, and paged optimizer are the key ingredients.
AdaLoRA. Adjusts rank dynamically during training, giving more rank to important layers and less to unimportant ones. From Microsoft Research.
IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations). Trains even fewer parameters than LoRA — just three vectors per layer. Said to be more stable than LoRA on very small datasets under 100 examples.
DoRA (Weight-Decomposed Low-Rank Adaptation). Proposed by NVIDIA in 2024. Decomposes weights into magnitude and direction, training only the direction via LoRA. Closer to full-tune quality.
The 2026 default cheat sheet looks like this.
- Tight GPU memory or 70B model → QLoRA + r=16, alpha=32.
- Plenty of GPU, 7–13B model → LoRA + r=64, alpha=128.
- Full tune affordable → DoRA or full tune.
- Fewer than 100 examples → IA3 or LoRA r=4.
The PEFT library itself is built by Hugging Face, and almost every framework (Axolotl, Unsloth, LLaMA-Factory, TRL, TorchTune) calls into PEFT internally. PEFT is the foundation; the frameworks sit on top.
4. Axolotl — The Most Popular Open Source
Axolotl started at the OpenAccess AI Collective and was incorporated as Axolotl AI in 2024. It raised a seed round in 2025 and crossed 9k GitHub stars. In one line: "a wrapper that lets you train Llama, Mistral, Qwen, DeepSeek, and other open models with full tune, LoRA, QLoRA, or DPO from a single YAML config."
Why did Axolotl become number one? Three calls were decisive.
- YAML-config-centric. Dataset, model, hyperparameters, and distribution strategy all bundled into a single YAML. One command launches the run.
- All algorithms supported. SFT, LoRA, QLoRA, DPO, ORPO, KTO, GRPO, reinforcement learning, continual pretraining — all in one tool. Calls into PEFT, TRL, Accelerate, and DeepSpeed underneath.
- Automatic format conversion. Detects ShareGPT, Alpaca, ChatML, and OpenAI dataset formats and converts them automatically.
Basic usage example.
# axolotl-llama3-lora.yml
base_model: meta-llama/Llama-3.1-8B-Instruct
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
- q_proj
- v_proj
- k_proj
- o_proj
datasets:
- path: tatsu-lab/alpaca
type: alpaca
sequence_len: 4096
val_set_size: 0.05
num_epochs: 3
optimizer: adamw_torch
learning_rate: 0.0002
gradient_accumulation_steps: 4
micro_batch_size: 2
flash_attention: true
deepspeed: deepspeed_configs/zero2.json
This one file QLoRA-tunes Llama 3.1 8B on the Alpaca dataset. Launch with one command: axolotl train axolotl-llama3-lora.yml.
What Axolotl does well.
- Flexibility. Llama, Mistral, Qwen, DeepSeek, Phi, Gemma, Mixtral — almost every open model supported.
- Algorithm coverage. SFT, DPO, GRPO, KTO, ORPO, and CPT (continual pretraining). New paper algorithms tend to land within one to two weeks.
- Community. Thousands on Discord, weekly PR merges. Big teams like NousResearch and DeepSeek share their setups.
Weak spots.
- YAML debugging hell. So many options that one wrong setting buries errors deep in a stack trace.
- Memory optimization trails Unsloth. Some configurations that OOM in Axolotl still run in Unsloth on the same GPU.
When to choose Axolotl.
- You want to compare multiple algorithms (SFT then DPO then GRPO) inside one tool.
- You need multi-node distributed training → Axolotl plus DeepSpeed Zero-3.
- You are tuning vision-multimodal models (Llava, Qwen-VL) — Axolotl adds them quickly.
5. Unsloth — Two Times Faster QLoRA
Unsloth is a fine-tuning library built by Daniel and Michael Han, two brothers from Australia. It raised a seed round in 2024 and crossed 15k GitHub stars. The slogan: "2x faster, 50% less memory." It really is 1.5x to 2x faster than Axolotl on the same GPU and dataset, and uses 30–50% less memory.
How is it that fast? Unsloth does not lean on PyTorch's default autograd. The hot-path operations in training (LoRA forward and backward, RoPE, RMSNorm, SwiGLU, and others) are written as direct Triton kernels. Memory allocation is more aggressively reused than the PyTorch default. Unsloth Gradient Checkpointing, an in-house implementation, saves another 30% of memory compared to PyTorch's default.
Basic usage example.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/llama-3.1-8b-instruct-bnb-4bit",
max_seq_length=4096,
dtype=None,
load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_alpha=32,
use_gradient_checkpointing="unsloth",
)
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
tokenizer=tokenizer,
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
warmup_steps=10,
num_train_epochs=3,
learning_rate=2e-4,
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
output_dir="outputs",
),
)
trainer.train()
Hugging Face TRL's SFTTrainer is reused as is; only the model loading goes through Unsloth. This keeps it compatible with existing TRL code.
What Unsloth does well.
- Single-GPU champion. Tuning 7–13B models is fast and stable on one A100, one H100, or even an RTX 4090.
- Memory efficiency. A 24GB GPU can run 70B QLoRA. Other tools OOM in that territory.
- Notebook-friendly. Official Colab and Kaggle notebooks that run out of the box.
Weak spots.
- Weak multi-GPU. Multi-GPU landed in 2025 but is not as stable as Axolotl yet. FSDP and DeepSpeed integration is partial.
- Narrower model coverage. Llama, Mistral, Qwen, Phi, Gemma, and DeepSeek mainline models all work, but lesser-known models sometimes need direct patches.
When to choose Unsloth.
- Solo developer, one GPU, notebook or Colab → Unsloth is basically the answer.
- Small team on a short cycle like a weekend hackathon → Unsloth gets results fastest.
- Need multi-node training → reach for Axolotl or TorchTune instead.
6. LLaMA-Factory — The Easy-to-Use Chinese Framework
LLaMA-Factory started with a Beihang University team in China. Released in 2023, it crossed 45k GitHub stars by May 2026. Less known in the English-speaking world than Axolotl, it has an enormous Chinese and East Asian user base.
Why LLaMA-Factory? Three differentiators.
- Web UI included. A single
llamafactory-cli webuiopens a browser interface to pick model, dataset, and hyperparameters and kick off training. Friendliest tool for anyone uncomfortable with CLI or YAML. - Massive model support. Llama, Mistral, Qwen, DeepSeek, ChatGLM, Yi, InternLM, Baichuan — coverage of Chinese models is unmatched.
- Algorithm completeness. SFT, reward model, PPO, DPO, ORPO, KTO, SimPO, BAdam — all under one tool. Running a full RLHF pipeline is simpler here than anywhere else.
Basic usage example (CLI).
llamafactory-cli train \
--stage sft \
--do_train True \
--model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
--dataset alpaca_en \
--template llama3 \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--output_dir saves/llama3-8b/lora/sft \
--overwrite_output_dir True \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--warmup_ratio 0.1
This one command LoRA-SFT-tunes Llama 3.1 8B on alpaca. The Web UI fills in the same options as a form.
What LLaMA-Factory does well.
- Lowest barrier to entry. Non-developers can attempt model tuning thanks to the Web UI.
- Number one for Chinese models. Qwen, DeepSeek, ChatGLM, Yi, and InternLM all land in LLaMA-Factory the fastest.
- Full-pipeline RLHF. SFT → reward model → PPO flows cleanly inside one tool.
Weak spots.
- Thin English documentation. The code is in English, but issues and discussions often default to Chinese. English-language community support is not as deep as Axolotl's.
- Conservative on new algorithms. Where Axolotl might add GRPO in a week, LLaMA-Factory usually takes two to four.
When to choose LLaMA-Factory.
- Non-developer or researcher wants to start in a Web UI → LLaMA-Factory.
- Tuning Qwen, DeepSeek, or ChatGLM → LLaMA-Factory.
- Want to compare SFT + RM + PPO in one place like a homework assignment → LLaMA-Factory.
7. Hugging Face TRL — RL plus DPO/GRPO/KTO
TRL (Transformer Reinforcement Learning) is the RL and preference-optimization library maintained by Hugging Face. It began as lvwerra's prototype in 2022 and joined the official Hugging Face lineup in 2024. By May 2026 it sits at 12k GitHub stars.
TRL is a library more than a framework. Axolotl, Unsloth, and LLaMA-Factory call into TRL internally. Using TRL directly is usually the choice of an algorithm researcher or anyone who needs a custom training loop.
TRL's supported trainers.
| Trainer | Algorithm | Use |
|---|---|---|
| SFTTrainer | Supervised fine-tuning | Supervised learning (chat or instruction tuning) |
| RewardTrainer | Pairwise reward model | RM stage of RLHF |
| PPOTrainer | Proximal Policy Optimization | Classic RLHF (InstructGPT style) |
| DPOTrainer | Direct Preference Optimization | Direct learning from paired preference data |
| GRPOTrainer | Group Relative Policy Optimization | DeepSeek R1 style RL |
| KTOTrainer | Kahneman-Tversky Optimization | Learn from good/bad binary signals |
| ORPOTrainer | Odds Ratio Preference Optimization | SFT + DPO combined in one pass |
| CPOTrainer | Contrastive Preference Optimization | DPO variant with improved stability |
| IPOTrainer | Identity Preference Optimization | Fixes DPO overfitting |
TRL plus vLLM for RL acceleration. The big shift in 2025 was vLLM integration into TRL. GRPO and PPO require the model to generate responses during training (rollouts), and default transformers generation is too slow. vLLM filled that gap, making RL training 5–10x faster. Axolotl and LLaMA-Factory simply ride that integration.
Basic usage example (DPO).
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
trainer = DPOTrainer(
model=model,
args=DPOConfig(
output_dir="dpo-llama",
beta=0.1,
learning_rate=5e-7,
num_train_epochs=1,
per_device_train_batch_size=2,
),
train_dataset=dataset,
tokenizer=tokenizer,
)
trainer.train()
When to use TRL directly.
- Researching new algorithms for your own paper → TRL directly.
- Custom reward functions, custom rollouts → TRL directly.
- Goal is plain LoRA + DPO training → going through Axolotl, Unsloth, or LLaMA-Factory is enough.
8. PEFT (HF) — The Adapter Standard
The PEFT library is the adapter standard maintained by Hugging Face. It bundles LoRA, QLoRA, AdaLoRA, IA3, LoHa, LoKr, OFT, VeRA, DoRA, and X-LoRA behind one interface. By May 2026 it has 17k GitHub stars.
Almost every fine-tuning framework calls into PEFT. Axolotl, Unsloth, LLaMA-Factory, TRL, and TorchTune all rely on PEFT instead of implementing LoRA themselves. Because the adapter format (adapter_config.json plus adapter_model.safetensors) is compatible across them, a LoRA trained with Axolotl loads directly in vLLM, Unsloth, TGI, and Together.
PEFT's core abstraction.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()
# trainable params: 6,815,744 || all params: 8,037,261,312 || trainable%: 0.0848
Drop this model straight into TRL's SFTTrainer or DPOTrainer and only the LoRA trains.
Recent additions. Two changes landed in PEFT 0.10.x in 2025.
- VeRA (Vector-based Random Matrix Adaptation). Similar performance to LoRA with 10x fewer parameters. Good for multi-task training.
- X-LoRA (Mixture-of-Experts LoRA). Train multiple LoRAs and let a router pick the right one per input. Useful for multi-domain models.
Disk and memory efficiency. Full-tune weights for an 8B model are 16GB (bf16); a LoRA r=16 adapter is around 30MB. Loading 50 adapters onto a server and hot-swapping per request is now mainstream as of 2025. Mistral, Together, and Anthropic's Custom Models all use this idea.
9. TorchTune — PyTorch Official
TorchTune is the official fine-tuning library built directly by the PyTorch team. Released as 1.0 in 2024, it sits at 5k GitHub stars by May 2026. Later than the others, but its philosophy is clear: "PyTorch native, minimal external dependencies."
TorchTune's design philosophy.
- Recipes. Training loops are defined in Python files (recipes) rather than YAML configs. Easier to customize by editing the code directly.
- No magic. Barely depends on transformers, PEFT, or TRL. Its own model implementations, its own LoRA, its own training loop. Easy to read down to the metal.
- First-class adoption of new PyTorch features. FSDP2, torch.compile, Liger Kernels, Triton kernels — TorchTune lands them first.
Basic usage example.
# Use a prebuilt recipe
tune run lora_finetune_single_device \
--config llama3_2/8B_lora_single_device
# Or author your own recipe
tune ls # list available recipes
tune cp llama3_2/8B_lora_single_device my_recipe.yaml
# Edit my_recipe.yaml, then
tune run lora_finetune_single_device --config my_recipe.yaml
What TorchTune does well.
- Pure PyTorch. No transformers abstraction in the way; model code is direct, so training-loop debugging is straightforward.
- Distributed training. FSDP2 integration is the smoothest. Multi-node training is stable.
- Memory efficiency. Liger Kernels integration runs CrossEntropy, RMSNorm, and SwiGLU through fused kernels.
Weak spots.
- Narrower algorithm coverage. SFT, LoRA, QLoRA, DPO, PPO, and GRPO are there, but variants like KTO, ORPO, and SimPO are missing or arrive late.
- Narrower model coverage. Llama, Mistral, Gemma, Phi, Qwen, and DeepSeek are covered. Lesser-known models need to be added manually.
When to choose TorchTune.
- You want to crack open the training loop and debug it directly → TorchTune.
- You want new PyTorch features (FSDP2, torch.compile) early → TorchTune.
- You want a stable standard for a university course or lab → TorchTune (official PyTorch backing brings stability).
10. LLM Foundry (MosaicML → Databricks)
LLM Foundry was built by MosaicML, which Databricks acquired for 1.3 billion dollars in 2023. As of May 2026 it is the core training stack of the Databricks Mosaic AI platform. It is published on GitHub under Apache 2.0 with all code open.
LLM Foundry's strength is scale. Where Axolotl and Unsloth target a single machine or a small cluster, LLM Foundry assumes hundreds to thousands of GPUs from the start.
- StreamingDataset. Stream petabyte-scale data from S3, GCS, or Azure Blob during training. No pre-download needed.
- FSDP/HSDP optimization. On top of Composer (MosaicML's own library), distributed training efficiency is very high. MFU (Model FLOPs Utilization) can reach 50–60%.
- MPT model series. The training code for MosaicML's MPT-7B/30B/Foundation models is published as-is. Proof that "this code actually trained real models."
Basic usage example (Databricks Mosaic AI Training API).
from databricks.mosaic_ai import TrainingClient
client = TrainingClient()
run = client.create_training_run(
model="meta-llama/Llama-3.1-70B",
training_data="s3://my-bucket/sft-data/",
config={
"task": "INSTRUCTION_FINETUNE",
"training_duration": "3ep",
"learning_rate": 5e-7,
},
)
print(run.status) # PENDING -> RUNNING -> COMPLETED
Inside a Databricks workspace this is all you need. The trained model registers automatically in Unity Catalog and serves directly via Mosaic AI Model Serving.
When to choose LLM Foundry.
- Already a Databricks customer → natural choice.
- Full-tune 100B+ models, multi-node training → strongest on MFU efficiency and stability.
- Security and governance requirements (Unity Catalog) → other tools cannot match it.
Direct open-source use is slowly fading. On a generic GPU cluster, Axolotl or TorchTune is easier, and running LLM Foundry outside Databricks is a heavy setup. So LLM Foundry is settling into "the thing that gets used automatically inside Databricks."
11. Cloud — Modal, Together, OpenAI, Anthropic, Cohere
There is a path for teams that do not want to buy or rent raw GPUs: cloud fine-tuning APIs. As of May 2026 there are five distinct camps.
Modal. Serverless GPU infrastructure. Not fine-tuning specific, but its by-the-minute GPU rental made it a popular backend for fine-tuning workloads. You define a cloud GPU function with Python decorators.
import modal
app = modal.App("finetune-llama")
image = modal.Image.debian_slim().pip_install("axolotl", "unsloth")
@app.function(image=image, gpu="A100-80GB", timeout=3600)
def train(config_path: str):
import subprocess
subprocess.run(["axolotl", "train", config_path])
@app.local_entrypoint()
def main():
train.remote("config.yml")
Launch this with one command: modal run finetune.py. The GPU starts automatically and shuts down when training ends. Billed at 1.50–4 dollars per hour (A100/H100).
Together AI. An integrated platform for fine-tuning and serving open models like Llama, Qwen, Mistral, and DeepSeek. Fine-tuning supports LoRA or full tune.
together fine-tuning create \
--training-file file-xxx \
--model meta-llama/Llama-3.1-70B-Instruct-Reference \
--lora \
--lora-r 16
When training finishes the model registers automatically into Together inference. Billed per token or per dedicated endpoint hour.
OpenAI. Tunes their own models like GPT-4.1, GPT-4o-mini, and o4-mini. Late 2024 they shipped Reinforcement Fine-Tuning (RFT). RFT lets you define a grader function whose signal drives RL training.
from openai import OpenAI
client = OpenAI()
job = client.fine_tuning.jobs.create(
training_file="file-xxx",
model="gpt-4o-mini-2024-07-18",
method={"type": "supervised"}, # or "dpo", "reinforcement"
)
Anthropic. Started exposing fine-tuning with Claude 3.5 Haiku in late 2024. By May 2026, SFT and Constitutional Finetuning (alignment to constitutional values) are available for Claude 4 Sonnet and Haiku. Access often runs through a sales line; not every customer gets self-serve.
Cohere. SFT fine-tuning for the Command R/R+ models. Cohere's differentiator is explicit support for retrieval-aware fine-tuning, where the model is tuned with RAG context in mind.
Which cloud to pick.
| Situation | Recommendation |
|---|---|
| Open models and you want to take the weights with you | Modal or Together |
| OpenAI models only (ecosystem lock-in) | OpenAI fine-tuning |
| Claude models, enterprise sales relationship exists | Anthropic Claude finetuning |
| RAG-based chatbots, Cohere ecosystem | Cohere finetuning |
| Want to rent only the cloud while running your own code | Modal |
12. DPO / GRPO / KTO — Which Algorithm to Pick
Preference optimization algorithms exploded between 2024 and 2026. This chapter compares the four most important.
PPO (Proximal Policy Optimization). The classic RLHF algorithm used in the 2022 InstructGPT paper. You train a separate reward model and use its reward signal to run PPO and optimize the policy. Stable, but requires RM training and needs four models simultaneously (policy, ref, value, reward), which makes memory pressure heavy.
DPO (Direct Preference Optimization). Rafailov et al., Stanford 2023. The core idea: "Train a policy directly from paired preference data, no RM required." Only the policy and ref models are needed, halving the memory pressure. It became the de facto standard in 2024. Weaknesses: overfits easily and is very sensitive to the quality of the paired data.
KTO (Kahneman-Tversky Optimization). ContextualAI's Ethayarajh paper from 2024. Trains on binary "this response is good/bad" signals rather than pairs. Useful when you can't gather pairs (think thumbs up/down from customers). Models behavioral economics's prospect theory (loss aversion).
GRPO (Group Relative Policy Optimization). Introduced in DeepSeek's DeepSeekMath/R1 paper in 2024. A PPO variant that drops the value model. Generate K responses per prompt (usually 4–16) and use the mean reward across the K as the baseline for the advantage. No value-head training, so memory and code stay simple. Very strong on verifiable rewards like math and code.
| Algorithm | Data | Model count | Strength | Weakness |
|---|---|---|---|---|
| PPO | Pairs → train RM → reward | 4 (policy, ref, value, reward) | Stable, deep RL | Heavy, RM training required |
| DPO | Pairs (chosen / rejected) | 2 (policy, ref) | Light and fast | Easy to overfit |
| KTO | Binary (good/bad) | 2 (policy, ref) | Pairs not required | Weaker signal |
| GRPO | Verifiable rewards (answer match etc.) | 2 (policy, ref) + K rollouts | Strong on math/code | Rollout cost |
| ORPO | Pairs (combined with SFT) | 1 (policy only) | One-pass | Newer, less validated |
| SimPO | Pairs (reference-free) | 1 (policy only) | No ref model needed | Less stable |
| IPO | Pairs (DPO regularized) | 2 (policy, ref) | Prevents DPO overfit | Slower than DPO |
Recommended sequence in 2026.
- SFT. Always step one. 10k–100k high-quality instruction examples.
- DPO or KTO. DPO if you have preference pairs; KTO if you only have binary signals.
- GRPO. For domains with verifiable rewards like math, code, and logic.
Newer techniques like Mixture-of-Agents RLHF (multiple agents evaluating each other during training) have been tried since 2025, but they remain experimental.
13. FSDP / DeepSpeed Zero / QLoRA — Distributed Training
Training a large model with a single GPU does not work. You have to split weights, optimizer state, and gradients across multiple GPUs. Three core techniques solve this.
FSDP (Fully Sharded Data Parallel). PyTorch's official distributed strategy. Data parallel, but shards the weights, gradients, and optimizer state across GPUs. During forward and backward passes it gathers the weights it needs (all-gather), then redistributes after. Memory shrinks to 1/N across N GPUs, at the cost of higher communication.
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision
import torch
model = FSDP(
model,
mixed_precision=MixedPrecision(
param_dtype=torch.bfloat16,
reduce_dtype=torch.bfloat16,
buffer_dtype=torch.bfloat16,
),
)
DeepSpeed Zero. Microsoft's distributed training library. ZeRO has stages 1, 2, and 3.
- ZeRO-1: shards only optimizer state.
- ZeRO-2: shards optimizer plus gradient.
- ZeRO-3: shards optimizer, gradient, and weights — behaves almost identically to FSDP.
DeepSpeed shines with ZeRO-Infinity (CPU/NVMe offload) and ZeRO-Offload (push optimizer state to CPU) when GPU memory is truly tight. The downsides are shallower PyTorch integration than FSDP and slower adoption of new features like torch.compile and FSDP2.
QLoRA + FSDP. Tim Dettmers's centerpiece. 4-bit quantized base model plus LoRA adapter training plus FSDP distribution. You can fit a 70B model on 2x A100 80GB and LoRA-train it. Axolotl, Unsloth, TorchTune, and Hugging Face TRL all support this combination.
The 2026 default matrix.
| Model size | GPU | Recommended distribution strategy |
|---|---|---|
| 7B | 1xA100 40GB | LoRA, no distribution |
| 7B full tune | 1xH100 80GB | FSDP-1 (optimizer only) |
| 13B full tune | 4xA100 80GB | FSDP-2 (optimizer + grad) |
| 70B QLoRA | 2xA100 80GB | QLoRA + FSDP-3 |
| 70B full tune | 8xH100 80GB | FSDP-3 + activation checkpointing |
| 405B QLoRA | 4xH100 80GB | QLoRA + FSDP-3 + CPU offload |
| 405B full tune | 64+ x H100 | DeepSpeed ZeRO-3 or FSDP-3 + Megatron |
Liger Kernels. A Triton-kernel collection LinkedIn released in 2024. CrossEntropy, RMSNorm, SwiGLU, and RoPE run as fused kernels, cutting memory by 20–30% and adding 10–20% speed. Axolotl, Unsloth, and TorchTune all integrated it. By 2026 it is almost a default-on option.
14. Korea / Japan — Upstage, KT, LG AI, Sakana, Stockmark, ELYZA, PFN
Korea.
- Upstage. The Korean LLM startup behind the Solar model series. In 2024 Solar 10.7B topped the Hugging Face leaderboard. Has its own fine-tuning platform (Upstage AI Stack) and is known for "DUS (Depth-Up-Scaling)," an in-house model expansion technique. Selected as a core partner in the Korean government's domestic LLM project (K-LLM) in 2025.
- KT. Developed the Mi:dm 2.0 model in-house. Strong on Korean language and culture. With the 2025 launch of KT Cloud GPU infrastructure, the company also started offering external fine-tuning services.
- LG AI Research. The EXAONE 3.5/4.0 series. Many fine-tuning cases in specialized domains like chemistry, materials, and law. When EXAONE 3.5 went to open weights in 2024, external researchers could tune it directly.
- Domestic fine-tuning infrastructure. NHN Cloud, NAVER Cloud, and KT Cloud offer H100 and H200 GPUs in the range of 4,000–8,000 KRW per hour. About 30% cheaper than AWS or GCP's Korean regions.
Japan.
- Sakana AI. Tokyo-headquartered, founded by ex-Google Brain David Ha and Llion Jones (Transformer co-author). Took a Series A at 4.5 billion dollar valuation in 2024. Known for "evolutionary model merging," a distinctive approach where evolutionary algorithms blend multiple models into a new one. EvoLLM-JP series specialized in Japanese.
- Stockmark. Japanese LLM company specialized in financial and legal domains. Trains its own Japanese LLMs like Stockmark-13B.
- ELYZA. Startup spun out of the University of Tokyo. Released the "ELYZA-japanese-Llama" series by continual-pretraining and fine-tuning Llama 2/3 in Japanese. Acquired by KDDI in 2024.
- PFN (Preferred Networks). A giant in Japan. Trains its own LLMs in the PLaMo series and even builds its own MN-Core accelerator. Strong on industrial domain (manufacturing, healthcare) fine-tuning.
- Sakana's evolutionary model merging. A 2024 paper made waves. By evolving and blending two Japanese models (Shisa and ELYZA), Sakana produced a new model better than either parent. It showed that "you can improve models without training."
Common Korea / Japan patterns.
- The standard recipe is English-base models (Llama, Mistral) continual-pretrained in Korean or Japanese, then instruction-tuned on top.
- More companies are building their own GPU clusters. Dependence on US clouds is squeezed by cost and sovereignty concerns.
- Axolotl and LLaMA-Factory dominate. Unsloth is fast but its multi-GPU weakness keeps larger companies away.
15. Who Should Pick What — A Decision Guide
A persona-by-persona summary of the tools, algorithms, and infrastructure covered so far.
Persona A: Solo developer, one GPU (RTX 4090 or Colab Pro)
- Framework: Unsloth. Top single-GPU efficiency; runs the Colab notebook as-is.
- Algorithm: LoRA SFT then DPO. KTO and GRPO are hard to gather data for.
- Model size: 7–13B. 4-bit QLoRA fits on a 24GB GPU.
- Dataset: 1k–10k examples. Quality over quantity.
Persona B: Academic researcher, cluster of 4–8 GPUs
- Framework: TorchTune or Axolotl. Needs to dig into the training loop.
- Algorithm: For paper-writing, the best is direct implementation, with TRL's GRPOTrainer as a base.
- Model size: Scale from 7B to 70B incrementally.
- Distribution: FSDP-2/3 or DeepSpeed Zero-3.
Persona C: Seed–Series A startup
- Framework: Axolotl (fastest algorithm coverage).
- Infrastructure: Modal (serverless GPU) or Together (integrated training + serving).
- Algorithm: SFT then DPO. Consistency and style alignment come first.
- Model size: 8–13B. Balance of cost and quality.
- Data: 5k–50k domain examples plus general instruction data mixed in.
Persona D: Enterprise / 200+ people
- Framework: LLM Foundry (if already on Databricks) or own cluster plus Axolotl.
- Cloud: OpenAI fine-tuning (ecosystem lock-in) or Anthropic sales line.
- Algorithm: SFT plus DPO and ideally GRPO.
- Model size: 70B+. An in-house 70B tune becomes cheaper than GPT-4 calls.
- Governance: Self-hosted weights, audit logs, lineage tracking.
Persona E: Foundation model lab
- Framework: Build your own. Fork Axolotl, TorchTune, or LLM Foundry and patch in-house.
- Infrastructure: Hundreds to thousands of GPUs, RDMA/InfiniBand.
- Algorithm: Develop new algorithms, write papers.
- Data: In-house crawling and labeling pipelines.
Persona F: Korea / Japan domain specialization
- Base model: Llama 3.x, Qwen 2.5/3, or Upstage Solar / EXAONE / Stockmark / ELYZA.
- Recipe: Continual pretraining (1B+ tokens of domain text) → SFT (5k–50k instructions) → DPO.
- Infrastructure: NHN, NAVER, or KT Cloud (Korea); Sakura Internet, GMO, or PFN clusters (Japan).
- Framework: LLaMA-Factory (best Chinese model compatibility) or Axolotl (best Western compatibility).
One last line
LLM fine-tuning in 2026 is the territory where you can start with one GPU and never see the end with thousands. Start small, measure what works, then scale. The tools merely help; data quality and evaluation decide everything in the end.
References
- Axolotl — https://axolotl.ai/
- Axolotl GitHub — https://github.com/axolotl-ai-cloud/axolotl
- Unsloth — https://unsloth.ai/
- Unsloth GitHub — https://github.com/unslothai/unsloth
- LLaMA-Factory GitHub — https://github.com/hiyouga/LLaMA-Factory
- Hugging Face TRL — https://huggingface.co/docs/trl
- TRL GitHub — https://github.com/huggingface/trl
- PEFT — https://huggingface.co/docs/peft
- PEFT GitHub — https://github.com/huggingface/peft
- TorchTune — https://pytorch.org/torchtune/
- TorchTune GitHub — https://github.com/pytorch/torchtune
- LLM Foundry — https://github.com/mosaicml/llm-foundry
- Databricks Mosaic AI — https://www.databricks.com/product/machine-learning/mosaic-ai
- Modal — https://modal.com/
- Together AI Fine-tuning — https://docs.together.ai/docs/fine-tuning-overview
- OpenAI Fine-tuning — https://platform.openai.com/docs/guides/fine-tuning
- OpenAI Reinforcement Fine-Tuning — https://platform.openai.com/docs/guides/reinforcement-fine-tuning
- Anthropic Fine-tuning — https://docs.anthropic.com/en/docs/build-with-claude/fine-tuning
- Cohere Fine-tuning — https://docs.cohere.com/docs/fine-tuning
- LoRA paper (Hu et al., 2021) — https://arxiv.org/abs/2106.09685
- QLoRA paper (Dettmers et al., 2023) — https://arxiv.org/abs/2305.14314
- DoRA paper (Liu et al., 2024) — https://arxiv.org/abs/2402.09353
- DPO paper (Rafailov et al., 2023) — https://arxiv.org/abs/2305.18290
- KTO paper (Ethayarajh et al., 2024) — https://arxiv.org/abs/2402.01306
- GRPO / DeepSeekMath (Shao et al., 2024) — https://arxiv.org/abs/2402.03300
- DeepSeek R1 — https://arxiv.org/abs/2501.12948
- ORPO paper (Hong et al., 2024) — https://arxiv.org/abs/2403.07691
- SimPO paper (Meng et al., 2024) — https://arxiv.org/abs/2405.14734
- IPO paper (Azar et al., 2023) — https://arxiv.org/abs/2310.12036
- IA3 paper (Liu et al., 2022) — https://arxiv.org/abs/2205.05638
- AdaLoRA paper (Zhang et al., 2023) — https://arxiv.org/abs/2303.10512
- VeRA paper (Kopiczko et al., 2023) — https://arxiv.org/abs/2310.11454
- PyTorch FSDP — https://pytorch.org/docs/stable/fsdp.html
- DeepSpeed ZeRO — https://www.deepspeed.ai/tutorials/zero/
- Liger Kernels — https://github.com/linkedin/Liger-Kernel
- vLLM — https://github.com/vllm-project/vllm
- Mixture-of-Agents — https://arxiv.org/abs/2406.04692
- Upstage Solar — https://www.upstage.ai/feed/product/solarmini-introduction
- KT Mi:dm — https://www.kt.com/biz/mi-dm.html
- LG AI EXAONE — https://www.lgresearch.ai/exaone
- Sakana AI — https://sakana.ai/
- Sakana evolutionary model merging — https://sakana.ai/evolutionary-model-merge/
- Stockmark — https://stockmark.co.jp/
- ELYZA — https://elyza.ai/
- Preferred Networks PLaMo — https://www.preferred.jp/en/projects/plamo/
- MosaicML acquisition by Databricks (2023) — https://www.databricks.com/company/newsroom/press-releases/databricks-completes-acquisition-mosaicml