Skip to content
Published on

LLM Fine-Tuning 2026 Deep Dive — LoRA · QLoRA · DoRA · GaLore · Unsloth · Axolotl · TRL · PEFT · MLX-LM Complete Guide

Authors

Prologue — 2026, the Year Fine-Tuning Became a "User Tool"

In June 2021, Edward Hu (Microsoft) published the LoRA paper, showing that "training just a million parameters can give you GPT-3-level adaptation." Back then, fine-tuning was a supercomputer's job. In May 2023, Tim Dettmers released QLoRA and squeezed a 65B model onto a single 48GB GPU. From then on, fine-tuning shifted from "researcher's work" to "engineer's work."

And now, in May 2026, fine-tuning is a "user tool." A single M3 Ultra Mac Studio can LoRA-tune a 70B model. Unsloth handles 4-bit quantization automatically, and the TRL SFTTrainer wraps an SFT run in five lines. Axolotl and LLaMA-Factory have externalized every hyperparameter into a single YAML file. Apple's MLX-LM trains Mistral 7B LoRA on an M4 Max MacBook Pro. All of this happened in four years.

Fine-tuning no longer means "train the whole model from scratch." In over 90% of cases, the answer is a LoRA-family PEFT (Parameter-Efficient Fine-Tuning). And within PEFT, the variant you pick determines efficiency, quality, and hardware requirements.

What this article covers:

  1. Full fine-tuning vs PEFT — when to use which
  2. LoRA's math and intuition (Hu et al., 2021)
  3. QLoRA — 4-bit NF4 and double quantization (Dettmers, 2023)
  4. DoRA — Weight decomposition (NVIDIA, 2024)
  5. GaLore — Gradient projection (Zhao, 2024)
  6. PiSSA, LoRA+, rsLoRA, VeRA, LoftQ, OFT, BOFT
  7. PEFT 0.14 — Hugging Face's unified API
  8. TRL 0.13 — SFT, DPO, ORPO, KTO, IPO, GRPO, RLOO
  9. Unsloth — 2x faster training
  10. Axolotl 0.6 — YAML-based multi-GPU
  11. LLaMA-Factory — 100+ models, web UI
  12. MLX-LM and MLX-Tuner — Apple Silicon on-device
  13. Torchtune — Meta PyTorch recipes
  14. Datasets — ShareGPT, OpenHermes, Magpie, Tülu, Nectar
  15. DPO vs PPO — the RLHF alternative
  16. Synthetic data — Augmentoolkit, Distilabel, Self-Instruct
  17. Quantization for serving — GGUF, AWQ, GPTQ, EXL2
  18. Hardware — H100/H200, A100, 4090/5090, M3/M4, MI300X
  19. Cloud — Together, Modal, RunPod, vast.ai, Lambda Labs
  20. Fine-tuning Korean and Japanese models
  21. Which tool should you pick
  22. References

1. Full Fine-Tuning vs PEFT — The GPU Memory Math

The first decision in fine-tuning is "train all weights, or only some?" The answer is almost always only some. The GPU memory math makes it obvious.

Full Fine-Tuning Math

Suppose you full-fine-tune a 7B model. fp16 weights alone are 14GB. The Adam optimizer keeps two states (momentum and variance) per weight in fp32, so 4x weights = 56GB. Gradients are the same size as weights, 14GB. Activations scale with sequence length, adding tens of GB. Total: over 100GB for a 7B full fine-tune. A single H100 80GB cannot do it.

For 70B, multiply by 10. 1TB of memory. Even a single 8x H100 node is tight.

PEFT Math

LoRA on 7B? Base weights stay 14GB (or 3.5GB in 4-bit). LoRA adapters are usually 0.11% of the base — 7M70M parameters or 14140MB in fp16. Optimizer state covers only the adapter, so it stays small. Activations are similar, but if the base is 4-bit, activation memory shrinks too. **Total: 820GB to train 7B.** One RTX 4090 24GB is enough.

When Full Fine-Tuning Still Makes Sense

There are cases where full fine-tuning is the right call:

  • Large distribution shift from the base — medical, legal, a different language, domain-specific vocabulary. LoRA is strong for small adaptations of the base distribution, but big shifts are safer with full fine-tuning.
  • Squeezing out the last 1% of quality — when chasing top spots on evaluation benchmarks, LoRA can fall slightly short of full fine-tuning.
  • You own the base model — if you're an in-house base-model builder, skip LoRA and go straight to full.

For the remaining 95% of cases, PEFT is the answer.


2. LoRA — Low-Rank Adaptation (Hu et al., 2021)

LoRA's idea in one sentence: "Weight updates are low-rank." When you full-fine-tune, W becomes W + dW. The observation is that dW typically has very low rank — not full rank.

The Math

For a base weight W (size d×k), LoRA approximates dW with the product of two small matrices A (d×r) and B (r×k), where r << min(d, k). During training, W is frozen and only A·B are updated. At inference, two options:

  1. Merge — precompute W + A·B as new weights. No inference overhead.
  2. Keep the adapter — keep W as-is, apply A·B as a separate adapter. Multiple adapters can be hot-swapped.

Three key hyperparameters:

  • rank r — adapter size. Usually 8~64. Larger means more expressive but more memory.
  • alpha — scaling factor. Typically 2x of r (r=16 means alpha=32). The actual update is (alpha/r) * A*B.
  • target_modules — which layers get a LoRA. The q/k/v/o of attention is the baseline, but adding the MLP improves quality.

Practical Guide (2026)

  • r=16, alpha=32 is a safe default. r=8 when memory is tight, r=64 when chasing quality.
  • target_modules has settled on attaching LoRA to all seven linear projections — q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj — known as "all-linear". The original LoRA paper only used q and v, but attaching to every linear layer is now standard.
  • learning rate should be 10~100x larger than full fine-tuning (1e-4 ~ 5e-4). LoRA's small parameter count tolerates more aggressive learning rates.
  • dropout is 0.05~0.1 for overfitting prevention.

3. QLoRA — 4-bit NF4 and Double Quantization (Dettmers, 2023)

Tim Dettmers's QLoRA (May 2023), in one sentence: "Train LoRA on top of a 4-bit-quantized base, and memory drops 4x."

Three Key Contributions

  1. NF4 (NormalFloat 4-bit) — a 4-bit quantization optimized for normal distributions. Pretrained weights are roughly mean-0 normal, so QLoRA assigns 4-bit codes to the quantiles of a normal distribution. More accurate than INT4.
  2. Double Quantization — quantize the quantization constants (scales) themselves. The first quantization gives an fp32 scale per block; double-quantizing those scales saves additional memory. About 0.5 bits per parameter saved on average.
  3. Paged Optimizers — use NVIDIA Unified Memory to page optimizer state between CPU and GPU. Avoids OOM and improves training stability.

Memory Savings in Practice

7B fp16 = 14GB to 7B NF4 = 3.5GB. 4x reduction. Optimizer state stays small because it only covers the LoRA adapter. Outcome: train 7B on a single RTX 3090 24GB.

QLoRA has become the de facto default for LoRA training in 2026. Unsloth, Axolotl, and LLaMA-Factory all offer QLoRA as a first-class option.

QLoRA's Minor Tradeoffs

A 4-bit base can be slightly less stable to train on than fp16. So very large distribution shifts (language transfer, deep domain specialization) are safer with an fp16 base. Also, after training an adapter on a 4-bit base, merging that adapter back into the fp16 base can introduce small quality loss. LoftQ (chapter 7) tries to solve this.


4. DoRA — Weight-Decomposed LoRA (NVIDIA, 2024)

NVIDIA's Shih-Yang Liu introduced DoRA in 2024. One sentence: "Decompose the weight change into direction and magnitude, and LoRA does better."

Intuition

Decompose the weight matrix W into two parts:

  • magnitude m — the norm of each column of W.
  • direction — the unit vector after dividing by the norm.

DoRA applies LoRA to the direction and treats the magnitude m as a separately trainable parameter. The DoRA paper observes that full fine-tuning changes both magnitude and direction significantly, but LoRA tangles the two awkwardly.

Experimental Results

  • 1~3 points better on benchmarks at the same r.
  • The DoRA gain is larger at small r (r=4, r=8) — closer to full-fine-tuning expressiveness.
  • Almost no additional memory (just the magnitude vector).

PEFT Support

Hugging Face PEFT 0.10 added use_dora=True as a single-flag activation. As of 2026, an informal estimate says about 30% of new LoRA training jobs use DoRA.


5. GaLore — Gradient Low-Rank Projection (Zhao, 2024)

Jiawei Zhao (Meta) published GaLore in March 2024. One sentence: "Weights are full rank, but gradients are low rank — project gradients into a low-rank subspace to save memory."

How It Differs

LoRA trains adapters (low-rank) and freezes the base weights. The result is few trainable parameters — an expressiveness ceiling. GaLore trains all the weights, but keeps the optimizer state (Adam's momentum and variance) in a low-rank subspace. Every N steps, GaLore runs SVD to find the principal direction of the gradient and keeps only that subspace as optimizer state.

The result: full fine-tuning's expressiveness with memory close to LoRA.

Memory Savings

For 7B full fine-tuning, the optimizer state (fp32 Adam) is 56GB. GaLore drops it to 7~14GB. The weights and activations are still full-training-sized, so total memory exceeds LoRA but is less than half of full fine-tuning.

Tradeoffs

  • SVD computation cost every N steps.
  • Slower training than LoRA (full weight updates).
  • Less polished code integration than LoRA.

GaLore targets the niche of "LoRA isn't enough, but full fine-tuning is too memory-hungry." A niche slice as of 2026, but growing.


6. PEFT Variants — PiSSA, LoRA+, rsLoRA, VeRA, LoftQ, OFT, BOFT

A swarm of LoRA variants have appeared. Short summary of the major ones supported by PEFT 0.14.

  • PiSSA (Principal Singular values and Singular vectors Adaptation, 2024) — initialize A·B not randomly but from the top SVD components of the base W. Improves early-training stability and convergence speed.
  • LoRA+ (2024) — use a different learning rate for A and B. Setting B's lr about 16x higher than A's accelerates convergence.
  • rsLoRA (Rank-Stabilized LoRA, 2024) — change the scaling to alpha/sqrt(r) so training stays stable at large r. Vanilla LoRA gets harder to tune as r grows.
  • VeRA (Vector-based Random Matrix Adaptation, 2024) — freeze A·B (random-init then freeze) and train only two small vectors d, b. Cuts parameters to 1/10 of LoRA. Good for multi-task setups.
  • LoftQ (LoRA-Fine-Tuning-Aware Quantization, 2023) — initialize the LoRA adapter to compensate for QLoRA's quantization error. Improves QLoRA starting quality.
  • OFT/BOFT (Orthogonal Fine-Tuning, 2023) — train only orthogonal transformations of the weight. Useful for multimodal and image models.
  • AdaLoRA (2023) — dynamically allocate r across layers based on importance.

These variants give small marginal gains over vanilla LoRA. For 90% of cases, plain LoRA is enough. Try variants when chasing the top of an evaluation benchmark.


7. Hugging Face PEFT 0.14 — The Unified API

Hugging Face's PEFT library is the unified API for every variant above. As of May 2026, 0.14 is stable and 0.15 is in beta.

Basic Usage

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    use_dora=False,  # Set True for DoRA
)

model = get_peft_model(model, config)
model.print_trainable_parameters()
# trainable params: 24,313,856 || all params: 3,236,000,000 || trainable%: 0.75

Save and Load Adapters

model.save_pretrained("./my-lora-adapter")

# Later, reload
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
model = PeftModel.from_pretrained(base, "./my-lora-adapter")

Merge the Adapter

merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")

Multi-Adapter

PEFT can hold multiple adapters on the same base and hot-swap them — switch the same model instance between English, Japanese, and Korean adapters instantly. Very useful for inference servers.


8. TRL 0.13 — SFT, DPO, ORPO, KTO, IPO, GRPO, RLOO

TRL (Transformer Reinforcement Learning) is Hugging Face's RLHF and alignment library. As of May 2026, 0.13 is stable, and the newer algorithms like GRPO and RLOO are first-class citizens.

SFTTrainer (Supervised Fine-Tuning)

The most-used tool. Train SFT on a chat dataset.

from trl import SFTTrainer, SFTConfig

config = SFTConfig(
    output_dir="./sft-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,
    packing=True,
    max_seq_length=4096,
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    peft_config=lora_config,  # Integrates with PEFT
)
trainer.train()

DPOTrainer (Direct Preference Optimization)

Align with pair data (prompt, chosen, rejected). Much simpler and more stable than PPO.

ORPOTrainer (Odd Ratio Preference Optimization)

A newer algorithm (2024) that does SFT and DPO together. No reference model needed, saving memory.

KTO (Kahneman-Tversky Optimization)

Inspired by prospect theory. Allows alignment with single labels (good/bad) instead of pairs.

GRPO (Group Relative Policy Optimization)

The algorithm DeepSeek-R1 used. Strong for reasoning training. As of 2026, the standard for reasoning models.

RLOO (REINFORCE Leave-One-Out)

Uses leave-one-out for baseline estimation. Simpler than PPO yet effective.


9. Unsloth — 2x Faster Training, 50% Less Memory

Unsloth (the Han brothers — Daniel and Michael, 2024) is the single-GPU fine-tuning game changer. One sentence: "Hand-written Triton kernels make LoRA/QLoRA training 2x faster."

How It's Fast

  • Hand-written Triton kernels — replace generic PyTorch kernels with Triton. Attention, RMSNorm, rotary embeddings are all custom.
  • Memory-efficient backward — optimized checkpointing and recomputation cuts activation memory by 50%.
  • Automatic 4-bit quantization — NF4 plus double quantization, applied automatically.
  • Model-specific patches — optimized paths for Llama, Mistral, Gemma, Phi, and Qwen.

Example Usage

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.2-3b-instruct-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    use_rslora=False,
    use_dora=False,
)

Then train via SFTTrainer. Result: train Llama 3.2 8B in 4 hours on an RTX 4090 24GB (vs 8~10 hours with vanilla Transformers).

Constraints

  • Weaker multi-GPU support than vanilla (as of May 2026, Unsloth Pro's multi-GPU is a paid add-on).
  • Limited to supported architectures. New architectures need patches first.

10. Axolotl 0.6 — Everything in YAML

Axolotl (OpenAccess AI Collective, 2023~) in one sentence: "Externalize every hyperparameter into a single YAML file."

Why YAML

Python training scripts are hard to reproduce. Storing learning rate, batch size, and adapter config in git means the code itself is a variable. YAML makes a single source of truth.

Example Config

base_model: meta-llama/Llama-3.2-3B
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj

datasets:
  - path: tatsu-lab/alpaca
    type: alpaca

sequence_len: 4096
sample_packing: true
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
learning_rate: 2e-4
bf16: auto
optimizer: paged_adamw_32bit

Strengths

  • Multi-GPU, DeepSpeed, FSDP as first-class citizens. Natural 70B training across 4x H100.
  • Every PEFT variant supported (LoRA, QLoRA, DoRA, GaLore).
  • Auto dataset format conversion — alpaca, sharegpt, jsonl all recognized.
  • WandB and MLflow integration.

Weaknesses

  • The YAML grows huge. Over 100 options.
  • Debugging requires diving into source — the cost of abstraction.

11. LLaMA-Factory — 100+ Models, Web UI

HiYouga's LLaMA-Factory (Zheng et al., 2023) in one sentence: "The GUI era of fine-tuning."

Features

  • 100+ models — Llama, Mistral, Gemma, Qwen, Phi, ChatGLM, Yi, Baichuan, InternLM, etc.
  • LLaMA Board web UI — training, evaluation, checkpoint management in your browser.
  • Every training mode — Full, LoRA, QLoRA, DoRA, GaLore, BAdam, freeze, PT, SFT, RLHF (PPO, DPO, KTO, ORPO, SimPO).
  • Chinese and English datasets built in — Alpaca-zh, Belle, Firefly, Magpie.

Usage

pip install llamafactory
llamafactory-cli webui

Pick model, dataset, training mode in the browser, hit start. Non-coders can use it. Spread fast in Korean, Japanese, and Chinese in-house LLM teams.

Who Uses It

  • Model builders for fast experimentation (before writing YAML).
  • Non-coder R&D — data scientists fine-tuning without code.
  • In-house model demos — sales and marketing showing results live.

12. MLX-LM and MLX-Tuner — Apple Silicon On-Device

Apple released MLX (Machine Learning eXperience) in December 2023. PyTorch-like API but optimized for the Unified Memory Architecture (UMA) — CPU and GPU share the same memory pool on M1~M4 chips.

What MLX-LM Means

  • An M3 Ultra Mac Studio (192GB unified memory) can LoRA-fine-tune a 70B model.
  • An M4 Max MacBook Pro (128GB) can full-fine-tune Mistral 7B / Llama 3.2 8B.
  • Fine-tuning on a laptop, data never leaves the machine — a real edge for medical, legal, and confidential data.

Example Usage

pip install mlx-lm
python -m mlx_lm.lora \
    --model mistralai/Mistral-7B-v0.3 \
    --train \
    --data ./my_data \
    --iters 1000 \
    --lora-layers 16 \
    --batch-size 4

Constraints

  • Slower than H100 — Apple GPU FLOPS lag data-center GPUs.
  • Weak multi-machine distribution support.
  • New architectures need MLX conversion first.

Still, "fine-tune on my laptop" has weight. Small teams in Korea and Japan can experiment without cloud bills.


13. Torchtune — Meta PyTorch's Official Recipes

Torchtune is the library the Meta PyTorch team released in April 2024. One sentence: "PyTorch native, memory efficient, recipe-based."

Philosophy

  • Recipe = training script. Each use case (LoRA, full, QLoRA, DPO) gets its own standalone script.
  • Copy instead of inherit — a new use case starts by copying an existing recipe and editing. Avoids abstraction hell.
  • PyTorch native — no dependency on Hugging Face Transformers. Models, optimizers, datasets all implemented directly.

Usage

pip install torchtune
tune download meta-llama/Llama-3.2-3B-Instruct \
    --output-dir /tmp/Llama-3.2-3B \
    --hf-token YOUR_TOKEN

tune run lora_finetune_single_device \
    --config llama3_2/3B_lora_single_device

Strengths

  • Memory efficient — activation checkpointing, fp8 optimizer (experimental).
  • Native FSDP2 — multi-GPU feels natural.
  • PyTorch team maintained — picks up new PyTorch features first.

Weaknesses

  • Smaller ecosystem than Hugging Face.
  • Narrower model support than Axolotl or LLaMA-Factory.

14. Datasets — ShareGPT, OpenHermes, Magpie, Tülu, Nectar

70% of good fine-tuning is the data. The popular open SFT datasets in 2026.

  • ShareGPT — ChatGPT conversations users shared (GPT-3.5~4 era). 90K+ dialogues. Good diversity. License gray zone (OpenAI TOS vs user sharing).
  • OpenHermes 2.5 (Teknium) — 1M+ curated instructions. Training data for the Hermes series.
  • Magpie (Xu et al., 2024) — created by self-prompting Llama-3-Instruct. "Make the model ask itself questions."
  • Tülu 3 SFT Mix (Allen AI, 2024) — the SFT mix for the Tülu series. Diverse instructions and reasoning data.
  • Nectar (Berkeley, 2023) — 7-wise GPT-4 ranking pair data. Popular for DPO.
  • UltraFeedback — 64K pair data, AI-feedback driven.
  • HelpSteer / HelpSteer2 (NVIDIA) — scored on five axes: Helpfulness, Correctness, Coherence, Complexity, Verbosity.
  • Code Alpaca — 20K code instructions.
  • Dolphin (Eric Hartford) — uncensored dataset series.

Quality Trumps Quantity

The LIMA paper (Meta, 2023) showed 1,000 well-curated examples beat 50,000 mediocre ones for SFT. By 2026, "more" has yielded to "better" in training data curation.


15. DPO vs PPO — Evolution of RLHF

The era of InstructGPT (2022) and ChatGPT (2022) used PPO (Proximal Policy Optimization). PPO requires: base model, reward model, reference model, actor, critic. Four models on GPU at once — a memory explosion.

The Arrival of DPO (2023)

Rafailov et al. (Stanford) released DPO (Direct Preference Optimization) in May 2023. One sentence: "Skip the reward model and optimize the policy directly from pair data."

Mathematically, DPO transforms PPO's optimization into a simple classification loss on pair data. The result:

  • No reward or critic model needed — half the GPU memory.
  • Stable training — avoids PPO's reward hacking.
  • Simple code — dozens of lines.

The 2026 Competition

  • DPO — still the default for RLHF.
  • ORPO — SFT and DPO in one shot. No reference model. Even less memory.
  • KTO — single-label data enables alignment without pairs. Easier data collection.
  • GRPO — strong for reasoning training (used by DeepSeek-R1).
  • RLOO — a simplified, efficient PPO.
  • SimPO — simpler pair loss. Rising in 2024.

When do we still use PPO? Almost never. As of 2026, PPO survives only in large labs with mature training infrastructure (e.g., OpenAI internal). For a new project, DPO or its successors are the answer.


16. Synthetic Data — Augmentoolkit, Distilabel, Self-Instruct

No good SFT data? Make it. Synthetic data generation has become a first-class citizen of fine-tuning in 2026.

Self-Instruct (Wang et al., 2022)

The original. Give 175 seed instructions to an LLM, have it generate new instructions, then have the LLM generate answers. The bootstrap technique behind Alpaca, Wizard, and Magpie.

Augmentoolkit (e-p-armstrong, 2024)

A pipeline that produces QA pairs from source text. Converts PDFs, books, internal docs into SFT-ready format.

Distilabel (Argilla, now Hugging Face, 2024)

A synthetic data framework from the Argilla team (now at Hugging Face). Define a step-by-step pipeline — generate instruction, generate answer, AI evaluation, filtering. UltraFeedback was built with Distilabel.

NeMo Curator (NVIDIA)

Large-scale data curation — deduplication, quality filtering, PII removal. Used for both pretraining and SFT.

Magpie Technique

Give Llama-3-Instruct an empty prompt and the model generates "the user's turn" by itself — use that as the instruction and have the model answer. Zero cost (self-generation), 1M+ samples. Downside: the model's biases come along.

Synthetic Data Pitfalls

  • Model collapse — training only on synthetic data destroys diversity.
  • Quality filtering is mandatory — without AI judging or human spot-checks, only noise accumulates.
  • Licensing — what synthetic data, made with which model, can be used where (OpenAI TOS, etc.).

17. Quantization for Serving — GGUF, AWQ, GPTQ, EXL2

To serve a fine-tuned model, you quantize. Separate from QLoRA's 4-bit during training, inference has its own quantization formats.

  • GGUF (llama.cpp) — the CPU, Apple Silicon, and embedded standard. Many quantization options (Q4_K_M, Q5_K_M, Q8_0, etc.). Ollama, LM Studio, and Jan all use GGUF.
  • AWQ (Activation-aware Weight Quantization) — looks at activation distributions and preserves important weights. Common for server GPU inference.
  • GPTQ — minimizes per-weight reconstruction error. The pre-AWQ standard, still widely used.
  • EXL2 (ExLlama) — per-layer different bitrates — "important layers at 6-bit, less important layers at 3-bit." Highest quality at a given bitrate.
  • bitsandbytes 8/4-bit — quantization for training. Can be used for serving too, but slower at inference than the above formats.

Workflow

  1. Train LoRA in fp16/bf16, then merge the adapter into the base.
  2. Convert merged model to GGUF (CPU/Mac), AWQ (server GPU), or EXL2 (RTX inference).
  3. Serve via vLLM, TGI, llama.cpp, or Ollama.

The most common 2026 flow: train LoRA in PyTorch, merge, quantize to GGUF, serve with Ollama.


18. Hardware — H100/H200, A100, 4090/5090, M3/M4, MI300X

Fine-tuning hardware in 2026:

NVIDIA Data Center

  • H100 (Hopper, 80GB HBM3) — released 2022, still the standard. fp8 supported. Around $30,000 per card.
  • H200 (Hopper refresh, 141GB HBM3e) — 2024 release. Larger VRAM and faster bandwidth. Better for 70B model inference and training.
  • A100 (Ampere, 40/80GB) — from 2020. Still a cloud standard. Cheaper than H100 with better availability.
  • B200/B100 (Blackwell, GA 2024, ramping 2025) — Hopper successor. fp4 inference, larger die.

NVIDIA Consumer

  • RTX 4090 (24GB GDDR6X) — the price/performance king for single-GPU fine-tuning. Can train 7~8B QLoRA.
  • RTX 5090 (32GB GDDR7, 2025) — next generation. Edge of 70B QLoRA viability.
  • RTX 6000 Ada (48GB) — workstation. Train 13~30B.

Apple Silicon

  • M3 Ultra (192GB UMA, Mac Studio) — 70B model fits in unified memory.
  • M4 Max (128GB UMA, MacBook Pro) — train 13~30B on a laptop.
  • M4 Pro (64GB) — 7~13B training.

AMD

  • MI300X (192GB HBM3) — announced 2023, broader cloud adoption in 2024. ROCm and PyTorch support. An H100 alternative.
  • MI325X / MI350 (2025~2026) — next gen.

How to Pick

  • Infrequent training plus 7B/13B — rent in the cloud (hourly 0.5 0.5~3).
  • Daily training plus 7B/13B — RTX 4090/5090 single card.
  • Daily training plus 30B/70B — cloud H100 8-GPU node.
  • Data must not leave premises — M3 Ultra Mac Studio (192GB).

19. Cloud — Together, Modal, RunPod, vast.ai, Lambda Labs, Replicate

Most people rent GPUs instead of buying. The main 2026 options.

  • Together.ai — fine-tuning API. Upload file, done with SFT/DPO/LoRA. Hosts Llama, Mistral, Qwen, etc. Inference is also integrated.
  • Modal — serverless GPU. Define GPU functions with Python decorators. Deploying a training script to the cloud feels natural.
  • RunPod — hourly GPU rentals. 4090, A100, H100 all available. Among the cheapest. Pods and Serverless.
  • vast.ai — peer-to-peer GPU marketplace. Cheapest — you rent someone else's GPU. Stability varies by host.
  • Lambda Labs — strong with H100 / H200 clusters. Friendly to large-scale distributed training.
  • Replicate — focused on model hosting but offers training APIs.
  • Coreweave — large H100 and H200 clusters.
  • Fireworks / Anyscale — serving-focused, some training.
  • Hugging Face Endpoints / Inference / AutoTrain — most natural in the HF ecosystem.

Price Sense (rough as of May 2026)

  • A100 80GB: 1 1~2/hour (RunPod, vast.ai)
  • H100 80GB: 2 2~4/hour
  • 8x H100 node: 20 20~30/hour
  • One 7B QLoRA training (3 epochs, 100K samples): 4~8 hours on A100, 4 4~16

20. Fine-Tuning Korean and Japanese Models

Korea and Japan have active base-model builders and fine-tuning communities.

Korea

  • HyperCLOVA X / HyperCLOVA SEED (Naver) — Naver's own Korean base. The 2025 SEED release opened some weights externally. More Korean syntax, honorifics, and cultural context coverage than global models.
  • KoLLM series (Kakao) — Kakao's Korean base. Reinforced with KakaoTalk data.
  • EXAONE (LG AI Research) — strong Korean-English 32B/8B base. Open weights.
  • Saltlux Luxia — Saltlux's enterprise Korean model. Specialized for call centers and document pipelines.
  • Bllossom — Korean SFT datasets and Llama-based Korean fine-tuned model series. Community-driven.
  • KULLM (Korea University) — academic Korean LLM with open SFT data.

Japan

  • ELYZA-japanese-Llama — Llama base with additional Japanese pretraining and SFT. Natural Japanese outputs.
  • LINE japanese-large-lm (LY Corporation) — LINE's Japanese LM. 1B / 3.6B / 36B series.
  • Rinna series — Japanese GPT/BERT, a long-running Japanese LM builder.
  • Karakuri-LM — strengthened for Japanese business domains.
  • Stockmark — Japanese business news and finance.
  • CyberAgent OpenCALM — Japanese open model series.
  • PLaMo (Preferred Networks) — Japanese and English base.

KO/JA Fine-Tuning Patterns

  • Base + language adapter — the most common pattern: take a global base (Llama, Qwen, Mistral) and add a Korean/Japanese LoRA.
  • Dual training — mix native-language data with English instruction data for SFT, then native DPO for tone alignment.
  • Honorific/keigo alignment — both Korean and Japanese have rich speech acts. Tone is a major differentiator. DPO and KTO are used heavily.

21. Which Tool Should You Pick — Decision Tree

Final summary. Recommended combos per scenario.

First Time Trying LoRA

  • Hardware: RTX 4090 or one RunPod A100.
  • Data: starter SFT sets like Alpaca or OpenHermes.
  • Tools: Unsloth plus TRL SFTTrainer. About 30 lines of code.
  • Base: Llama 3.2 3B or Mistral 7B.

In-House Model From Company Data

  • Hardware: cloud H100 1~8 GPUs or M3 Ultra (if data cannot leave premises).
  • Data: convert internal wikis, tickets, customer queries plus synthetic QA (Augmentoolkit).
  • Tools: Axolotl (YAML reproducibility) or LLaMA-Factory (web UI).
  • Base: a license-appropriate Llama 3.x, Qwen 2.5, or Mistral 7B/22B.

Korean/Japanese-Specialized Model

  • Base: HyperCLOVA SEED, EXAONE, Qwen 2.5 (strong multilingual), ELYZA, LINE LM.
  • Data: native instruction plus English instruction mix (7:3).
  • Tools: Axolotl plus DPO. Korean/Japanese tone alignment is the differentiator.

Reasoning Model

  • Data: math and code problems with stepwise solutions.
  • Tools: TRL plus GRPO. The DeepSeek-R1 recipe.
  • Base: Qwen 2.5-Math, Llama 3.1 8B/70B.

Small On a Laptop

  • Hardware: M3/M4 Mac.
  • Tools: MLX-LM.
  • Base: Mistral 7B, Llama 3.2 3B/8B.

Fast Prototype

  • Cloud API: Together.ai fine-tuning API — upload a file and you are done.
  • Time: first SFT result within an hour.

22. References

Core Papers

Library Docs

Korean and Japanese LLM Ecosystem


Epilogue — Fine-Tuning Is No Longer a Secret

One-sentence summary of this article: In 2026, LLM fine-tuning is a "user tool." While LoRA evolved from a simple adapter to DoRA and GaLore, Unsloth doubled single-GPU training speed and Axolotl and LLaMA-Factory dissolved the entry barrier. We live in an age where one M3 Ultra Mac Studio can LoRA-tune a 70B model.

The remaining question is not "can I do it" but "what should I train it on." Good data, good evaluation, good use cases — that is the real work of a 2026 fine-tuning engineer.

"Tools are no longer the bottleneck. Data is."

— LLM Fine-Tuning 2026, end.