The Rise of Diffusion LMs — Can They Become an Alternative to Autoregression
- Author: Youngju Kim (@fjvbn20031)
- Introduction — The Moment Diffusion Crossed Over to Text
- The Structural Limits of Autoregressive Generation
- How Text Diffusion Works — Masking and Denoising
- A Short Lineage — How We Got Here
- The Training Objective — Simpler Than You Think
- A Toy Implementation — A Minimal Working Example
- How It Differs from Image Diffusion
- Analyzing DiffusionGemma — What Was Released
- Quality vs Speed — The Real Trade-Off
- The Inference Cost Lens — Where the Gains Come From
- Suitable Workloads — Where to Use It
- A Practical Adoption Guide — Getting It into Your Eval Pipeline
- The Hybrid Outlook — Combining Autoregression and Diffusion
- A Critical View — Do Not Take the Benchmarks at Face Value
- Closing
- References
Introduction — The Moment Diffusion Crossed Over to Text
In June 2026, Google released DiffusionGemma: a 26B-parameter MoE architecture, an Apache 2.0 license, and a claim of up to 4x faster text generation compared to autoregressive models of the same class. The GeekNews post racked up comments quickly, and Hacker News erupted into a debate over whether an alternative to autoregression had finally arrived.
In image generation, diffusion models are already the standard. Text, however, was a different story. Tokens are discrete, unlike pixels, and in language, order creates meaning. Yet text diffusion models are getting attention now for a clear reason: the structural limits of autoregressive generation — above all the one-token-at-a-time sequentiality — are being singled out as the root cause of inference cost and latency.
This post walks through the limits of autoregression, how text diffusion actually works, an analysis of DiffusionGemma, and how far the benchmark claims should be trusted.
The Structural Limits of Autoregressive Generation
Nearly every LLM today is autoregressive. Predict a probability distribution over the next token, sample one, append it to the input, predict again, repeat.
Autoregressive generation (sequential):
Input: "The weather today is"
step 1: [The weather today is] -> "really"
step 2: [The weather today is really] -> "quite"
step 3: [The weather today is really quite] -> "nice"
...
N tokens = N model forwards (serial, not parallelizable)
Three limits are baked into this structure.
- Sequentiality: producing N tokens requires N forward passes in series. No matter how big the GPU, single-sequence generation latency is hard to reduce.
- Memory bandwidth bottleneck: at small batch sizes, decoding is bound not by compute but by the bandwidth needed to stream weights and the KV cache. The compute units of the GPU mostly sit idle.
- No takebacks: once a token is emitted, it cannot be revised. If the model takes a wrong turn early in a sentence, there is no structural mechanism to correct it later.
Remedies like speculative decoding and multi-token prediction exist, but they are all optimizations within the autoregressive frame. Diffusion models propose changing the frame itself.
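As a rough illustration of the serial dependency described in this section, the sketch below counts forward passes in a greedy autoregressive loop. `toy_next_token` is a hypothetical stand-in for a real model forward (it just replays the continuation from the diagram above), not an actual API.

```python
from typing import Callable, List, Tuple


def autoregressive_generate(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    n_new: int,
) -> Tuple[List[str], int]:
    """Greedy loop: every new token needs its own forward pass."""
    tokens = list(prompt)
    n_forwards = 0
    for _ in range(n_new):
        # Each call depends on all tokens produced so far, so the calls
        # cannot be issued in parallel for a single sequence.
        tokens.append(next_token(tokens))
        n_forwards += 1
    return tokens, n_forwards


def toy_next_token(tokens: List[str]) -> str:
    # Hypothetical stand-in for a model: replays a canned continuation
    # for a 4-token prompt.
    canned = ["really", "quite", "nice"]
    n_prompt = 4
    return canned[(len(tokens) - n_prompt) % len(canned)]


out, forwards = autoregressive_generate(
    toy_next_token, ["The", "weather", "today", "is"], n_new=3
)
print(out[-3:], forwards)  # ['really', 'quite', 'nice'] 3
```

The forward count always equals the number of new tokens, which is the cost that the rest of this post tries to reduce.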
How Text Diffusion Works — Masking and Denoising
Image diffusion models add Gaussian noise to pixels and learn to reverse the process. Text is discrete, so noise cannot be added directly; most text diffusion models use masking as their noise. This family is called masked diffusion, a form of discrete diffusion.
Training (forward process): progressive masking
"the cat sleeps on the sofa"
-> "the cat [M] on the [M]"
-> "[M] [M] [M] on [M] [M]"
-> "[M] [M] [M] [M] [M] [M]"
Generation (reverse process): progressive restoration
"[M] [M] [M] [M] [M] [M]"
-> "[M] cat [M] on [M] sofa" (highest-confidence tokens first)
-> "the cat [M] on the sofa"
-> "the cat sleeps on the sofa"
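The key point: during generation, multiple tokens can be restored simultaneously. Where autoregression needs N forwards for N tokens, a diffusion model needs T denoising steps. If T is much smaller than N, the number of forward passes drops by a factor of about N/T, although each pass covers the whole sequence, so the wall-clock speedup can be smaller than that ratio.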
The skeleton of the sampling loop in pseudocode:
def diffusion_generate(model, prompt_ids, gen_len, num_steps):
    # Initialize the generation span entirely with mask tokens
    x = concat(prompt_ids, [MASK] * gen_len)
    for step in range(num_steps):
        logits = model(x)  # one forward over the full sequence
        probs = softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)  # confidence and prediction per position
        # Commit only the top-k most confident still-masked positions
        k = unmask_schedule(step, num_steps, gen_len)
        top_positions = topk_masked_positions(conf, x, k)
        x[top_positions] = pred[top_positions]
    return x
At each step the model commits the tokens it is most confident about and defers ambiguous positions to later steps. Some variants even allow remasking of already-committed tokens, giving diffusion a self-correction ability that autoregression lacks.
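The "commit the most confident positions" step can be made concrete with a small pure-Python sketch. This is one possible reading of the `topk_masked_positions` helper that the pseudocode leaves undefined; as an assumption for readability, it takes a list of booleans marking masked positions instead of the token tensor itself.

```python
from typing import List


def topk_masked_positions(conf: List[float], is_masked: List[bool], k: int) -> List[int]:
    """Return indices of the k most confident positions that are still masked."""
    candidates = [i for i, masked in enumerate(is_masked) if masked]
    # Highest confidence first; already-committed positions never compete.
    candidates.sort(key=lambda i: conf[i], reverse=True)
    return sorted(candidates[:k])


conf = [0.20, 0.90, 0.40, 0.85, 0.30, 0.95]
is_masked = [True, True, True, True, True, False]  # last position already committed
print(topk_masked_positions(conf, is_masked, k=2))  # [1, 3]
```

Position 5 has the highest confidence but is skipped because it is no longer masked, so positions 1 and 3 are committed in this step.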
Block-Wise Parallel Generation
Denoising an entire long text in one shot hurts quality, so practical systems use block-wise generation: split the sequence into blocks, proceed autoregressively across blocks, and denoise in parallel within each block.
Block-wise semi-autoregressive generation:
[prompt] -> [block 1: 32 tokens denoised in parallel]
-> [block 2: 32 tokens denoised in parallel]
-> [block 3: 32 tokens denoised in parallel]
Across blocks: sequential (causality preserved)
Within blocks: parallel (diffusion denoising, e.g. 8 steps)
-> tokens per forward: 32 tokens / 8 steps = 4 tokens per forward (0.25 forwards per token)
This structure enables arbitrary-length generation and KV cache reuse while preserving a solid speed advantage over pure autoregression.
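The arithmetic behind the block-wise diagram can be checked with a few lines of Python. This is only a forward-pass count under the diagram's example numbers (32-token blocks, 8 steps per block); it says nothing about wall-clock time, since a diffusion forward is more expensive than a single autoregressive decode step.

```python
import math


def blockwise_forward_count(gen_len: int, block_size: int, steps_per_block: int) -> int:
    """Number of model forwards needed to generate gen_len tokens block by block."""
    n_blocks = math.ceil(gen_len / block_size)
    return n_blocks * steps_per_block


gen_len, block_size, steps = 96, 32, 8
forwards = blockwise_forward_count(gen_len, block_size, steps)
print(forwards, gen_len / forwards)  # 24 4.0
```

Three blocks at 8 steps each give 24 forwards for 96 tokens, against 96 forwards for plain autoregressive decoding.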
A Short Lineage — How We Got Here
Text diffusion did not appear out of nowhere. Following the milestones makes the trajectory clear.
| Year | Milestone | Significance |
|---|---|---|
| 2021 | D3PM | Theoretical foundation for diffusion on discrete data |
| 2022 | Diffusion-LM | Continuous diffusion in embedding space |
| 2023 | SEDD | Score-based discrete diffusion narrows the perplexity gap |
| 2024 | MDLM family | Simplified masked-diffusion objectives, stabilized training |
| 2025 | LLaDA 8B | A from-scratch diffusion LM approaches same-class autoregressive models |
| 2025 | Mercury, Gemini Diffusion | Commercial-grade speed demos (claims of around a thousand tokens/sec) |
| 2026 | DiffusionGemma | Large MoE plus open weights opens the ecosystem |
Two things stand out in this lineage. First, it took roughly five years from theory (D3PM) to practice (DiffusionGemma). Second, the decisive turning points were the discovery that using masking as noise makes the objective simple, and the discovery that adapting from existing autoregressive weights slashes training cost.
The Training Objective — Simpler Than You Think
Masked diffusion training resembles masked language modeling in BERT, with a crucial difference: instead of a fixed masking rate (BERT masks about 15% of tokens), the masking ratio t is sampled uniformly between 0 and 1 for each sequence, and the loss at masked positions is weighted by 1/t.
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, x0, mask_token_id):
    """Simplified masked diffusion loss (MDLM style).
    x0: original token sequence, LongTensor of shape (B, T)
    """
    B, T = x0.shape
    # 1) Sample masking ratio t uniformly, one value per sequence
    #    (clamped to [1e-3, 1) so that 1/t stays finite)
    t = torch.rand(B, 1, device=x0.device).clamp(min=1e-3)
    # 2) Replace each token with the mask with probability t
    mask = torch.rand(B, T, device=x0.device) < t
    xt = torch.where(mask, torch.full_like(x0, mask_token_id), x0)
    # 3) Predict the original token at masked positions
    logits = model(xt)  # (B, T, vocab)
    loss_tok = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        x0.reshape(-1),
        reduction="none",
    ).view(B, T)
    # 4) Aggregate only masked positions, weighted by 1/t
    #    (at low masking ratios each token carries more information)
    weighted = (loss_tok * mask) / t
    # Normalize by the total token count: the 1/t weight already compensates
    # for how many positions were masked, so dividing by mask.sum() as well
    # would account for the masking ratio twice.
    return weighted.sum() / (B * T)
Compare it with the autoregressive loss and the difference is clear.
Autoregressive: predict position i from positions 0..i-1 only (causal mask)
    loss at every position; one sequence = T training signals
Masked diffusion: predict masked positions from everything else (bidirectional)
    loss only at masked positions; masking ratio varies per sample
Autoregression has denser training signal. To reach the same loss on the same data, diffusion tends to spend more compute — one of the hidden costs of diffusion LMs, and a key reason the adaptation strategy is so popular: pay the heavy pretraining bill autoregressively, then layer diffusion on top.
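Written as formulas, the two objectives compared in this section look roughly as follows. This is a sketch assuming a linear masking schedule (mask probability equal to t) and omitting constant normalization factors; x_0 is the clean sequence of length T, x_t its partially masked version, and [M] the mask token.

```latex
\mathcal{L}_{\mathrm{AR}}(\theta)
  = -\,\mathbb{E}_{x_0}\left[\sum_{i=1}^{T} \log p_\theta\!\left(x_0^{i} \mid x_0^{<i}\right)\right]

\mathcal{L}_{\mathrm{MD}}(\theta)
  = -\,\mathbb{E}_{x_0}\,\mathbb{E}_{t \sim \mathcal{U}(0,1]}\,\mathbb{E}_{x_t \mid x_0}
    \left[\frac{1}{t}\sum_{i=1}^{T} \mathbf{1}\!\left[x_t^{i} = \texttt{[M]}\right]
    \log p_\theta\!\left(x_0^{i} \mid x_t\right)\right]
```

Since t is uniform, only about half of the positions contribute a loss term on average, which is one way to see why the autoregressive signal is denser.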
A Toy Implementation — A Minimal Working Example
To cement the principle in code, here is the skeleton of a character-level mini diffusion LM. A transformer without a causal mask is all the backbone you need.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Bidirectional-attention transformer = the diffusion LM backbone."""
    def __init__(self, vocab_size, d_model=256, n_head=4, n_layer=4, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(vocab_size + 1, d_model)  # +1: mask token
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_head, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )
        # Key point: TransformerEncoder with no causal mask (bidirectional)
        self.blocks = nn.TransformerEncoder(layer, n_layer)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        T = x.size(1)
        pos = torch.arange(T, device=x.device)
        h = self.tok(x) + self.pos(pos)
        h = self.blocks(h)
        return self.head(h)
The training loop reuses the loss function above; generation uses confidence-based unmasking.
@torch.no_grad()
def generate(model, prompt, gen_len, num_steps, mask_id):
    device = prompt.device
    x = torch.cat([
        prompt,
        torch.full((1, gen_len), mask_id, device=device),
    ], dim=1)
    prompt_len = prompt.size(1)
    for step in range(num_steps):
        logits = model(x)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)
        still_masked = (x == mask_id)
        still_masked[:, :prompt_len] = False
        n_masked = int(still_masked.sum())
        if n_masked == 0:
            break
        # Spread remaining positions evenly over remaining steps
        k = max(1, n_masked // (num_steps - step))
        conf = conf.masked_fill(~still_masked, -1.0)
        top = conf.view(-1).topk(k).indices
        flat = x.view(-1)
        flat[top] = pred.view(-1)[top]
    return x
With this skeleton of barely a hundred lines you can experiment on a Shakespeare-scale dataset. Set num_steps equal to gen_len and it behaves almost like sequential generation; cut it to one eighth and you can watch the quality degrade with your own eyes. Turning the quality-speed dial yourself is half of understanding this topic.
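To see how the `k = max(1, n_masked // (num_steps - step))` rule distributes the work, the sketch below replays just that bookkeeping without any model. `unmask_counts` is a helper introduced here for illustration; it mirrors the rule used in `generate` above.

```python
from typing import List


def unmask_counts(gen_len: int, num_steps: int) -> List[int]:
    """Tokens committed per step under k = max(1, n_masked // remaining_steps)."""
    n_masked = gen_len
    counts = []
    for step in range(num_steps):
        if n_masked == 0:
            break  # everything is committed, remaining steps are skipped
        k = max(1, n_masked // (num_steps - step))
        counts.append(k)
        n_masked -= k
    return counts


print(unmask_counts(32, 8))  # [4, 4, 4, 4, 4, 4, 4, 4]
print(unmask_counts(10, 4))  # [2, 2, 3, 3]
print(unmask_counts(5, 8))   # [1, 1, 1, 1, 1]
```

When the step budget exceeds the number of tokens, the loop degenerates to one token per step and finishes early, which is the near-sequential regime mentioned above.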
How It Differs from Image Diffusion
The two share the name diffusion, but the differences are large.
| Aspect | Image diffusion | Text diffusion (masked) |
|---|---|---|
| Data space | Continuous (pixels, latents) | Discrete (tokens) |
| Noise | Add Gaussian noise | Replace tokens with a mask |
| Restoration target | Predict noise or original | Predict tokens at masked positions |
| Step count | Dozens of steps typical | A few to dozens of steps |
| Backbone | U-Net or DiT | Bidirectional-attention transformer |
| Evaluation | Plausible to the eye is enough | Grammar, factuality, coherence all required |
The last row matters most. An image tolerates a few off pixels; in text, one wrong token can collapse a sentence. That is the fundamental reason text diffusion lagged image diffusion. One more difference: the backbone of a text diffusion model uses bidirectional attention with no causal mask. The BERT-like fill-in-the-blank structure dovetails naturally with diffusion generation.
Analyzing DiffusionGemma — What Was Released
Summarizing the official June 2026 announcement of DiffusionGemma:
- Scale: 26B-parameter MoE (Mixture of Experts). Active parameters per token are far fewer, securing inference efficiency
- License: Apache 2.0 — open to research and commercial use alike
- Speed claim: up to 4x faster generation than same-class autoregressive models, with the largest gaps claimed on code generation and editing workloads
- Method: masked diffusion plus block-wise generation, reportedly adapted from the existing Gemma family
Three points deserve attention.
First, combining MoE with diffusion. A diffusion model forwards the whole sequence every step, so per-step compute exceeds autoregression. Using MoE to shrink active compute per token reads as a deliberate design to offset that cost.
Second, the adaptation strategy. Instead of pretraining a diffusion model from scratch, starting from autoregressive weights and continuing training with a diffusion objective slashes training cost. That the knowledge accumulated by autoregressive models transfers over is a consistent finding of recent work.
Third, the Apache 2.0 release itself. The serving ecosystem for diffusion LMs (optimized engines a la vLLM) is still immature; fully opening the weights looks like a strategy to get the community to build that infrastructure.
Quality vs Speed — The Real Trade-Off
Diffusion LM speed is not free. The core trade-off lives in the number of denoising steps.
| Denoising steps | Idealized speed (by forward count) | Quality tendency |
|---|---|---|
| Equal to token count | Similar to autoregressive | Autoregressive-level quality possible |
| One quarter of token count | About 4x faster | Close on many benchmarks |
| One sixteenth of token count | About 16x faster | Visible coherence loss, more repetition and contradiction |
In other words, the 4x-faster claim should be read as 4x at a step-count setting where quality loss was pushed down to a tolerable level. The dial that trades quality for speed via step count is the very essence of diffusion LMs. That is a weakness, but also an operational flexibility: you can tune the dial per workload.
Another cost is restricted KV cache usage. With bidirectional attention, every position can attend to every other, so committing a token changes the context seen by all positions and the simple append-only cache reuse of autoregression does not apply. Block-wise generation and approximate caching mitigate this, but not being able to reuse years of autoregressive serving optimizations as-is is a real handicap.
The Inference Cost Lens — Where the Gains Come From
From the GPU's point of view, the story becomes clearer.
Autoregressive decoding (batch 1):
    1 forward = 1 token
    bottleneck: memory bandwidth (weight streaming)
    GPU compute utilization: low
Diffusion denoising (batch 1):
    1 forward = updates a whole block (e.g. 32 tokens)
    bottleneck: compute (attention over the full sequence)
    GPU compute utilization: high
Autoregressive decoding is dominated by the time spent streaming weights from memory, leaving GPU compute capacity idle. Diffusion pours that idle compute into updating many tokens at once. Put differently, the speed advantage of diffusion LMs comes from converting wasted compute into latency reduction.
The gains are therefore largest when batches are small and latency matters — interactive chat, code autocomplete. Conversely, for serving that already saturates the GPU with large batches (offline batch inference), the relative advantage shrinks: autoregression plus continuous batching is already extremely efficient on a throughput basis.
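A back-of-the-envelope roofline model makes this argument tangible. The sketch below is a deliberate simplification: every hardware and model number is a hypothetical round figure, and it ignores attention cost, the KV cache, and the fact that an MoE touches more experts (and so streams more weights) when a pass contains more tokens.

```python
def forward_time_s(active_params, tokens_per_pass, bytes_per_param, mem_bw, peak_flops):
    """Roofline-style estimate: a pass takes as long as its slower resource."""
    weight_stream_s = active_params * bytes_per_param / mem_bw
    compute_s = 2.0 * active_params * tokens_per_pass / peak_flops
    return max(weight_stream_s, compute_s)


# All values below are hypothetical, chosen only to make the arithmetic easy.
ACTIVE_PARAMS = 4e9      # active parameters per token
BYTES_PER_PARAM = 2      # 16-bit weights
MEM_BW = 1e12            # bytes per second
PEAK_FLOPS = 2e14        # floating-point operations per second
BLOCK, STEPS = 32, 8     # one 32-token block denoised in 8 steps


def idealized_speedup(batch_size):
    """Time for 32 autoregressive steps divided by time for 8 denoising steps."""
    ar_pass = forward_time_s(ACTIVE_PARAMS, batch_size, BYTES_PER_PARAM, MEM_BW, PEAK_FLOPS)
    diff_pass = forward_time_s(
        ACTIVE_PARAMS, batch_size * BLOCK, BYTES_PER_PARAM, MEM_BW, PEAK_FLOPS
    )
    return (BLOCK * ar_pass) / (STEPS * diff_pass)


for b in (1, 8):
    print(b, round(idealized_speedup(b), 3))
# 1 4.0
# 8 3.125
```

At batch 1 both kinds of pass are bound by weight streaming, so the diffusion pass is effectively free extra work and the full 4x appears. At batch 8 the diffusion pass becomes compute-bound in this toy model and the advantage starts to shrink.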
Suitable Workloads — Where to Use It
Structurally, certain tasks fit diffusion LMs especially well.
| Workload | Fit | Reason |
|---|---|---|
| First-draft generation | High | Fast bulk generation followed by human polish |
| Text editing / rewriting | High | Bidirectional context edits the middle naturally |
| Code infilling (FIM) | Very high | Filling a gap while seeing code on both sides is a native fit |
| Structured output (JSON etc.) | High | Pin the schema skeleton, denoise only the values |
| Long-form reasoning (chain of thought) | Low | Stepwise causal development favors sequential generation |
| Precise long documents | Medium | Possible with more steps, but the speed advantage evaporates |
The low-fit areas are equally clear. Tasks where earlier steps become premises for later ones — mathematical proofs, multi-step reasoning — are in fundamental tension with committing tokens in parallel. There is a reason reasoning-focused models still cling to autoregression.
Code infilling in particular is a task autoregressive models only awkwardly imitated with special tokens. For a bidirectional diffusion model it is inherently natural — which is why many expect the first killer use case for diffusion LMs to emerge in code tooling.
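To illustrate why infilling is a native fit, here is a minimal sketch of how an infilling input might be laid out for a masked diffusion model: known tokens on both sides are pinned, and only the gap is filled with mask tokens. The helper name and the `[M]` placeholder are illustrative assumptions, and the sketch assumes the gap length is chosen up front.

```python
from typing import List, Tuple

MASK = "[M]"  # placeholder mask token, as in the diagrams earlier in this post


def build_infill_input(
    prefix: List[str], suffix: List[str], n_fill: int
) -> Tuple[List[str], List[int]]:
    """Return the token sequence and the indices the denoiser may overwrite."""
    tokens = prefix + [MASK] * n_fill + suffix
    editable = list(range(len(prefix), len(prefix) + n_fill))
    return tokens, editable


prefix = ["def", "add", "(", "a", ",", "b", ")", ":"]
suffix = ["return", "result"]
tokens, editable = build_infill_input(prefix, suffix, n_fill=4)
print(tokens)
print(editable)  # [8, 9, 10, 11]
```

The same layout covers the structured-output row of the table: pin the JSON keys and punctuation as known tokens and leave only the value positions masked.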
A Practical Adoption Guide — Getting It into Your Eval Pipeline
If you are evaluating diffusion LMs, this order is recommended.
- Profile your own workload: quantify average output length, latency requirements, and batch-size distribution first. Short outputs, small batches, strict latency — that is diffusion-friendly terrain.
- Build a proper baseline: the fair comparison target is an autoregressive model with speculative decoding and quantization already applied, not a vanilla implementation.
- Quality gate first, speed second: find the minimum step count that passes your domain quality bar, then measure latency at that setting.
- Grid-search block size and step count: these two hyperparameters define the quality-speed curve. You need the whole curve to pick an operating point.
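The "quality gate first, speed second" rule from the list above can be sketched as a small selection function over grid-search results. The `Trial` record and every number in it are made up for illustration; in practice the quality score would come from your own eval set and the latency from your own measurements.

```python
from typing import List, NamedTuple, Optional


class Trial(NamedTuple):
    block_size: int
    num_steps: int
    quality: float     # score on your own eval set (higher is better)
    latency_ms: float  # median latency measured at this setting


def pick_operating_point(trials: List[Trial], quality_bar: float) -> Optional[Trial]:
    """Among settings that pass the quality gate, return the fastest one."""
    passing = [t for t in trials if t.quality >= quality_bar]
    return min(passing, key=lambda t: t.latency_ms) if passing else None


# Made-up numbers purely for illustration, not measurements.
trials = [
    Trial(32, 32, 0.82, 410.0),
    Trial(32, 16, 0.81, 230.0),
    Trial(32, 8, 0.79, 130.0),
    Trial(32, 4, 0.70, 75.0),
    Trial(64, 8, 0.74, 90.0),
]
best = pick_operating_point(trials, quality_bar=0.78)
print(best)
```

Filtering before minimizing keeps a fast but low-quality setting from ever being selected, and a `None` result signals that no setting passes the gate.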
The skeleton of a latency benchmark:
import time
import torch

def _sync():
    # Wait for queued GPU work so the timer measures real execution time;
    # skipped on CPU-only machines.
    if torch.cuda.is_available():
        torch.cuda.synchronize()

def benchmark_generation(generate_fn, prompts, n_warmup=3, n_runs=10):
    # Warmup (remove kernel compilation and cache effects)
    for p in prompts[:n_warmup]:
        generate_fn(p)
    records = []
    for p in prompts[:n_runs]:
        _sync()
        t0 = time.perf_counter()
        out = generate_fn(p)
        _sync()
        dt = time.perf_counter() - t0
        # Assumes prompts and outputs are (1, seq_len) token tensors
        n_new = out.size(1) - p.size(1)
        records.append((dt, n_new / dt))
    lat = sorted(r[0] for r in records)
    tps = sorted(r[1] for r in records)
    mid = len(records) // 2
    print(f"median latency: {lat[mid]*1000:.1f} ms")
    print(f"median tokens/sec: {tps[mid]:.1f}")
Avoid two common measurement mistakes. First, including the first call without warmup mixes compilation overhead into the numbers. Second, when comparing tokens per second across models with different tokenizers, the same text yields different token counts — comparing characters per second or task completion time is safer.
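As a small companion to the advice above, the sketch below computes a tokenizer-independent throughput figure (characters per second) and a tail-latency percentile. The helper names and sample latencies are illustrative assumptions; the percentile uses the simple nearest-rank definition.

```python
import math
from typing import List


def chars_per_second(text: str, elapsed_s: float) -> float:
    """Throughput that does not depend on how a tokenizer splits the text."""
    return len(text) / elapsed_s


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up latencies in milliseconds, including one slow outlier.
latencies_ms = [120.0, 95.0, 110.0, 300.0, 105.0, 98.0, 102.0, 101.0, 99.0, 97.0]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95))  # 101.0 300.0
print(chars_per_second("hello world", 0.5))  # 22.0
```

The gap between the median and p95 in this toy data shows why a single median can hide the outliers that interactive users actually notice.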
Adoption Checklist
- Are outputs short or medium length (within a few hundred tokens)?
- Is the share of editing, infilling, and structured output high?
- Does p95 latency tie directly to business metrics?
- Do you already have a quality eval set and gate?
- Can your team modify the serving stack itself (to compensate for the immature ecosystem)?
If you can answer yes to three or more of the five, a pilot is worth it.
The Hybrid Outlook — Combining Autoregression and Diffusion
The emerging view is not that diffusion replaces autoregression, but that the two will merge. Several directions are already being explored.
- Block-level autoregression + within-block diffusion: the semi-autoregressive structure above. As a compromise between causality and parallelism, it is becoming the de facto standard.
- Diffusion drafter + autoregressive verifier: swap the draft model in speculative decoding for a diffusion model. Diffusion lays down a fast draft; the autoregressive model verifies and corrects.
- Autoregressive reasoning, diffused prose: do the reasoning sequentially for accuracy, then generate the long final answer quickly with diffusion — a division of labor.
- Bidirectional adaptation: one set of weights supporting both an autoregressive mode and a diffusion mode, with the generation mode selected at runtime per workload.
That DiffusionGemma itself is closer to a block-wise hybrid than pure diffusion supports this trajectory.
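The "diffusion drafter + autoregressive verifier" idea from the list above can be sketched with greedy verification: keep the longest prefix of the draft that the verifier agrees with, then take the verifier's own token at the first disagreement. Both the verifier function and the draft here are toy stand-ins. A real system would score all draft positions in a single verifier forward instead of calling it once per position as this loop does for clarity.

```python
from typing import Callable, List


def verify_draft(
    context: List[str],
    draft: List[str],
    ar_next_token: Callable[[List[str]], str],
) -> List[str]:
    """Accept the longest agreeing draft prefix, plus one verifier correction."""
    accepted: List[str] = []
    for proposed in draft:
        expected = ar_next_token(context + accepted)
        if proposed != expected:
            accepted.append(expected)  # verifier's correction ends this round
            break
        accepted.append(proposed)
    return accepted


# Toy verifier: always continues toward a fixed target sentence.
TARGET = ["the", "cat", "sleeps", "on", "the", "sofa"]


def toy_verifier(tokens: List[str]) -> str:
    return TARGET[len(tokens)]


draft = ["the", "cat", "sits", "on", "the", "sofa"]  # pretend diffusion draft
print(verify_draft([], draft, toy_verifier))  # ['the', 'cat', 'sleeps']
```

The output is always what the verifier alone would have produced, so the drafter only affects speed: the more of the draft is accepted per round, the fewer verifier rounds are needed.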
A Critical View — Do Not Take the Benchmarks at Face Value
To be fair, a few cautions are in order.
- Baseline problems in speed comparisons: always check what the 4x is measured against. Comparing against an autoregressive model with speculative decoding, quantization, and optimized kernels yields a completely different conclusion than comparing against a vanilla implementation.
- Selection bias in quality benchmarks: diffusion LM announcements tend to front-load benchmarks that favor parallel generation (short responses, code infilling). Numbers for weak areas — long-form coherence, multi-turn dialogue, precise instruction following — are often in the appendix.
- Likelihood evaluation is hard: diffusion LMs do not yield exact likelihoods tractably, so perplexity is reported through a variational bound (the ELBO), which gives an upper bound on perplexity and not the exact value. Direct comparison with autoregressive numbers always warrants caution.
- The ecosystem gap: autoregression has a massive optimization ecosystem — vLLM, speculative decoding, prefix caching. For the theoretical speed advantage of diffusion LMs to become a real production gap, comparable serving infrastructure must come first.
- Insufficient validation at scale: public evidence that diffusion LMs ride the same scaling curves as autoregression at frontier scale is still limited. A 26B MoE is meaningful, but too early for verdicts.
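For the likelihood point in the list above, the relationship between the bound and perplexity can be written out briefly. This is the standard variational argument, stated per sequence of length T:

```latex
\log p_\theta(x_0) \;\ge\; \mathrm{ELBO}(x_0)
\quad\Longrightarrow\quad
\mathrm{PPL}(x_0) = \exp\!\left(-\tfrac{1}{T}\log p_\theta(x_0)\right)
\;\le\; \exp\!\left(-\tfrac{1}{T}\,\mathrm{ELBO}(x_0)\right)
```

A perplexity reported for a diffusion LM is therefore an upper bound, while an autoregressive perplexity is exact, so the two numbers are not measured on quite the same footing.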
Closing
Text diffusion models are best seen not as a replacement for autoregression but as a new dial that changes the cost structure of inference. They trade quality for speed via step count, excel at editing and infilling thanks to bidirectional context, and convert wasted GPU compute into latency reduction. The Apache 2.0 release of DiffusionGemma is likely to accelerate ecosystem formation in this direction.
In the history of technology, when a slow-but-proven approach competes with a fast-but-novel one, the winner has usually been a compromise that combines the strengths of both. Text generation will very likely repeat the same pattern.
Still, read every flashy benchmark number together with its conditions. The autoregressive camp is not standing still, and the most likely future is not victory for either side but per-workload division of labor and hybrids. If you are building code autocomplete or document editing tools, now is the right time to put a diffusion LM into your experiment pipeline.
References
- DiffusionGemma official announcement: https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/
- DiffusionGemma on GeekNews: https://news.hada.io/topic?id=30386
- Gemini Diffusion page: https://deepmind.google/models/gemini-diffusion/
- LLaDA (Large Language Diffusion Models) paper: https://arxiv.org/abs/2502.09992
- D3PM (foundations of discrete diffusion): https://arxiv.org/abs/2107.03006
- Simplified masked diffusion (MDLM): https://arxiv.org/abs/2406.07524
- Block Diffusion paper: https://arxiv.org/abs/2503.09573
- Dream 7B (diffusion LM adaptation): https://arxiv.org/abs/2508.15487
- Diffusion LM discussions on Hacker News: https://news.ycombinator.com/
- vLLM documentation (the autoregressive serving baseline): https://docs.vllm.ai/