The Rise of Diffusion LMs — Can They Become an Alternative to Autoregression

Introduction — The Moment Diffusion Crossed Over to Text
The Structural Limits of Autoregressive Generation
How Text Diffusion Works — Masking and Denoising
- Block-Wise Parallel Generation
A Short Lineage — How We Got Here
The Training Objective — Simpler Than You Think
A Toy Implementation — A Minimal Working Example
How It Differs from Image Diffusion
Analyzing DiffusionGemma — What Was Released
Quality vs Speed — The Real Trade-Off
The Inference Cost Lens — Where the Gains Come From
Suitable Workloads — Where to Use It
A Practical Adoption Guide — Getting It into Your Eval Pipeline
- Adoption Checklist
The Hybrid Outlook — Combining Autoregression and Diffusion
A Critical View — Do Not Take the Benchmarks at Face Value
Closing
References

Introduction — The Moment Diffusion Crossed Over to Text

In June 2026, Google released DiffusionGemma: a 26B-parameter MoE architecture, an Apache 2.0 license, and a claim of up to 4x faster text generation compared to autoregressive models of the same class. The GeekNews post racked up comments quickly, and Hacker News erupted into a debate over whether an alternative to autoregression had finally arrived.

In image generation, diffusion models are already the standard. Text, however, was a different story. Tokens are discrete, unlike pixels, and in language, order creates meaning. Yet text diffusion models are getting attention now for a clear reason: the structural limits of autoregressive generation — above all the one-token-at-a-time sequentiality — are being singled out as the root cause of inference cost and latency.

This post walks through the limits of autoregression, how text diffusion actually works, an analysis of DiffusionGemma, and how far the benchmark claims should be trusted.

The Structural Limits of Autoregressive Generation

Nearly every LLM today is autoregressive. Predict a probability distribution over the next token, sample one, append it to the input, predict again, repeat.

Autoregressive generation (sequential):

Input: "The weather today is"
  step 1: [The weather today is] -> "really"
  step 2: [The weather today is really] -> "quite"
  step 3: [The weather today is really quite] -> "nice"
  ...
  N tokens = N model forwards (serial, not parallelizable)
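
A minimal sketch of that loop makes the forward-pass count explicit. The stub toy_next_token is a hypothetical stand-in for a real model forward, not any library's API:

```python
def toy_next_token(tokens):
    """Stand-in for one model forward: returns a deterministic 'next token'."""
    return (sum(tokens) + len(tokens)) % 100


def autoregressive_generate(next_token_fn, prompt, n_new):
    """Generate n_new tokens one at a time and count forward passes."""
    tokens = list(prompt)
    forwards = 0
    for _ in range(n_new):
        # Each new token depends on all previous ones, so calls cannot overlap
        tokens.append(next_token_fn(tokens))
        forwards += 1
    return tokens, forwards


tokens, forwards = autoregressive_generate(toy_next_token, [1, 2, 3], n_new=5)
print(len(tokens), forwards)
```

The loop body cannot start its next iteration until the previous token exists, which is the serial dependency the diagram describes.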

Three limits are baked into this structure.

Sequentiality: producing N tokens requires N forward passes in series. No matter how big the GPU, single-sequence generation latency is hard to reduce.
Memory bandwidth bottleneck: at small batch sizes, decoding is bound not by compute but by the bandwidth needed to stream weights and the KV cache. The compute units of the GPU mostly sit idle.
No takebacks: once a token is emitted, it cannot be revised. If the model takes a wrong turn early in a sentence, there is no structural mechanism to correct it later.

Remedies like speculative decoding and multi-token prediction exist, but they are all optimizations within the autoregressive frame. Diffusion models propose changing the frame itself.

How Text Diffusion Works — Masking and Denoising

Image diffusion models add Gaussian noise to pixels and learn to reverse the process. Text is discrete, so Gaussian noise cannot be added to tokens directly; most text diffusion models use masking as their noise. This family is called masked diffusion, a form of discrete diffusion.

Training (forward process): progressive masking
  "the cat sleeps on the sofa"
   -> "the cat [M] on the [M]"
   -> "[M] [M] on [M] [M]"
   -> "[M] [M] [M] [M] [M] [M]"

Generation (reverse process): progressive restoration
  "[M] [M] [M] [M] [M] [M]"
   -> "[M] cat [M] on [M] sofa"     (highest-confidence tokens first)
   -> "the cat [M] on the sofa"
   -> "the cat sleeps on the sofa"
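
The forward (masking) process in the diagram can be sketched in a few lines of plain Python. The helper name mask_tokens and the word-level tokens are assumptions made for illustration; real models mask subword token ids:

```python
import random

MASK = "[M]"


def mask_tokens(tokens, ratio, rng):
    """Independently replace each token with MASK with probability `ratio`."""
    return [MASK if rng.random() < ratio else tok for tok in tokens]


rng = random.Random(0)
sentence = "the cat sleeps on the sofa".split()
for ratio in (0.3, 0.7, 1.0):
    print(ratio, " ".join(mask_tokens(sentence, ratio, rng)))
```

At ratio 1.0 every token is masked, which is exactly the starting state of the generation process.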

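The key point: during generation, multiple tokens can be restored simultaneously. Where autoregression needs N forwards for N tokens, a diffusion model needs T denoising steps, each a single forward pass. If T is much smaller than N, the forward-pass count drops by a factor of N/T, although each pass over the full sequence costs more than one autoregressive decoding step.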

The skeleton of the sampling loop in pseudocode:

def diffusion_generate(model, prompt_ids, gen_len, num_steps):
    # Initialize the generation span entirely with mask tokens
    x = concat(prompt_ids, [MASK] * gen_len)

    for step in range(num_steps):
        logits = model(x)              # one forward over the full sequence
        probs = softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1) # confidence and prediction per position

        # Commit only the top-k most confident still-masked positions
        k = unmask_schedule(step, num_steps, gen_len)
        top_positions = topk_masked_positions(conf, x, k)
        x[top_positions] = pred[top_positions]

    return x

At each step the model commits the tokens it is most confident about and defers ambiguous positions to later steps. Some variants even allow remasking of already-committed tokens, giving diffusion a self-correction ability that autoregression lacks.
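
One way such a remasking step could look, as a sketch only. The threshold rule and the names remask_low_confidence and threshold are assumptions for illustration, not the mechanism of any specific model:

```python
MASK = -1  # hypothetical mask id for this sketch


def remask_low_confidence(tokens, confidences, prompt_len, threshold):
    """Return a copy of `tokens` where committed tokens in the generated span
    whose current confidence falls below `threshold` are masked again."""
    out = list(tokens)
    for i in range(prompt_len, len(tokens)):
        if tokens[i] != MASK and confidences[i] < threshold:
            out[i] = MASK  # give the model another chance at this position
    return out


tokens = [5, 6, 11, MASK, 13]
confidences = [1.0, 1.0, 0.35, 0.0, 0.92]
print(remask_low_confidence(tokens, confidences, prompt_len=2, threshold=0.5))
```

Prompt positions are never touched; only the generated span can be reopened.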

Block-Wise Parallel Generation

Denoising an entire long text in one shot hurts quality, so practical systems use block-wise generation: split the sequence into blocks, proceed autoregressively across blocks, and denoise in parallel within each block.

Block-wise semi-autoregressive generation:

[prompt] -> [block 1: 32 tokens denoised in parallel]
         -> [block 2: 32 tokens denoised in parallel]
         -> [block 3: 32 tokens denoised in parallel]

Across blocks: sequential (causality preserved)
Within blocks: parallel (diffusion denoising, e.g. 8 steps)
  -> tokens per forward: 32 tokens / 8 steps = 4 tokens per forward

This structure enables arbitrary-length generation and KV cache reuse while preserving a solid speed advantage over pure autoregression.
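
The control flow of block-wise generation can be sketched as follows. The stub toy_denoise_block stands in for one model forward and simply commits an even share of masked slots; both function names and the dummy token values are assumptions of this sketch:

```python
MASK = -1  # hypothetical mask id


def toy_denoise_block(context, block, step, num_steps):
    """Stand-in for one forward: commit an even share of the still-masked slots."""
    masked = [i for i, t in enumerate(block) if t == MASK]
    k = max(1, len(masked) // (num_steps - step)) if masked else 0
    out = list(block)
    for i in masked[:k]:
        out[i] = len(context) + i  # dummy token value, never equal to MASK
    return out


def blockwise_generate(denoise_block, prompt, gen_len, block_size, steps_per_block):
    """Sequential across blocks, iterative denoising within each block."""
    tokens = list(prompt)
    forwards = 0
    remaining = gen_len
    while remaining > 0:
        size = min(block_size, remaining)
        block = [MASK] * size
        for step in range(steps_per_block):
            block = denoise_block(tokens, block, step, steps_per_block)
            forwards += 1
        tokens.extend(block)  # a finished block becomes fixed context
        remaining -= size
    return tokens, forwards


tokens, forwards = blockwise_generate(toy_denoise_block, [0, 1, 2], 96, 32, 8)
print(forwards, 96 / forwards)  # forward passes, tokens per forward
```

Because finished blocks never change, their keys and values can in principle be cached the same way an autoregressive prefix is.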

A Short Lineage — How We Got Here

Text diffusion did not appear out of nowhere. Following the milestones makes the trajectory clear.

Year	Milestone	Significance
2021	D3PM	Theoretical foundation for diffusion on discrete data
2022	Diffusion-LM	Continuous diffusion in embedding space
2023	SEDD	Score-based discrete diffusion narrows the perplexity gap
2024	MDLM family	Simplified masked-diffusion objectives, stabilized training
2025	LLaDA 8B	A from-scratch diffusion LM approaches same-class autoregressive models
2025	Mercury, Gemini Diffusion	Commercial-grade speed demos (claims of around a thousand tokens/sec)
2026	DiffusionGemma	Large MoE plus open weights opens the ecosystem

Two things stand out in this lineage. First, it took roughly five years from theory (D3PM) to practice (DiffusionGemma). Second, the decisive turning points were the discovery that using masking as noise makes the objective simple, and the discovery that adapting from existing autoregressive weights slashes training cost.

The Training Objective — Simpler Than You Think

Masked diffusion training resembles masked language modeling in BERT, with a crucial difference: the masking ratio is sampled randomly between 0 and 1, and the loss is weighted according to that ratio.

import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, x0, mask_token_id):
    """Simplified masked diffusion loss (MDLM style).
    x0: original token sequence (B, T)
    """
    B, T = x0.shape
    # 1) Sample masking ratio t uniformly from (0, 1]
    t = torch.rand(B, 1, device=x0.device).clamp(min=1e-3)

    # 2) Replace each token with the mask with probability t
    mask = torch.rand(B, T, device=x0.device) < t
    xt = torch.where(mask, mask_token_id, x0)

    # 3) Predict the original token at masked positions
    logits = model(xt)  # (B, T, vocab)
    loss_tok = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        x0.view(-1),
        reduction="none",
    ).view(B, T)

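    # 4) Keep only masked positions, weighted by 1/t
    #    (the 1/t weight compensates for fewer masked tokens at low ratios)
    weighted = (loss_tok * mask) / t
    # Normalize by the total token count rather than the masked count:
    # the 1/t weight already accounts for how many positions were masked
    return weighted.sum() / (B * T)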
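
Written as an equation, the objective that a loss of this shape approximates (up to the choice of normalization) is, for the linear masking schedule used in the MDLM and LLaDA papers, an upper bound on the negative log-likelihood. The notation is chosen for this sketch: L is the sequence length, M the mask token, and x_t the sequence after masking at ratio t.

```latex
-\log p_\theta(x_0) \;\le\;
\mathbb{E}_{t \sim \mathcal{U}(0,1)}\; \mathbb{E}_{x_t \sim q(x_t \mid x_0)}
\left[ \frac{1}{t} \sum_{i=1}^{L} \mathbf{1}\!\left[x_t^{i} = \mathrm{M}\right]
\left( -\log p_\theta\!\left(x_0^{i} \mid x_t\right) \right) \right]
```

The indicator corresponds to the mask tensor in the code, and the 1/t factor to the division by t.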

Compare it with the autoregressive loss and the difference is clear.

Autoregressive: predict position i from positions 0..i-1 only (causal mask)
                loss at every position; one sequence = T training signals

Masked diffusion: predict masked positions from everything else (bidirectional)
                loss only at masked positions; masking ratio varies per sample

Autoregression has denser training signal. To reach the same loss on the same data, diffusion tends to spend more compute — one of the hidden costs of diffusion LMs, and a key reason the adaptation strategy is so popular: pay the heavy pretraining bill autoregressively, then layer diffusion on top.
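
A quick simulation gives a feel for the density gap. It only counts how many positions receive a loss per sequence, which is a rough proxy for training signal rather than a measure of it, and the function name supervised_positions is made up for this sketch:

```python
import random


def supervised_positions(seq_len, n_samples, rng):
    """Average number of positions that receive a loss per sequence when the
    masking ratio t ~ U(0, 1) and each token is masked independently."""
    total = 0
    for _ in range(n_samples):
        t = rng.random()
        total += sum(rng.random() < t for _ in range(seq_len))
    return total / n_samples


seq_len = 128
avg_masked = supervised_positions(seq_len, n_samples=2000, rng=random.Random(0))
print(f"autoregressive: {seq_len} positions per sequence")
print(f"masked diffusion: about {avg_masked:.1f} positions per sequence")
```

With a uniform masking ratio, the expected share of supervised positions is one half, so on average only about half of each sequence contributes a loss term.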

A Toy Implementation — A Minimal Working Example

To cement the principle in code, here is the skeleton of a character-level mini diffusion LM. A transformer without a causal mask is all the backbone you need.

import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Bidirectional-attention transformer = the diffusion LM backbone."""
    def __init__(self, vocab_size, d_model=256, n_head=4, n_layer=4, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(vocab_size + 1, d_model)  # +1: mask token
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_head, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )
        # Key point: TransformerEncoder with no causal mask (bidirectional)
        self.blocks = nn.TransformerEncoder(layer, n_layer)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        T = x.size(1)
        pos = torch.arange(T, device=x.device)
        h = self.tok(x) + self.pos(pos)
        h = self.blocks(h)
        return self.head(h)

The training loop reuses the loss function above; generation uses confidence-based unmasking.

@torch.no_grad()
def generate(model, prompt, gen_len, num_steps, mask_id):
    device = prompt.device
    x = torch.cat([
        prompt,
        torch.full((1, gen_len), mask_id, device=device),
    ], dim=1)
    prompt_len = prompt.size(1)

    for step in range(num_steps):
        logits = model(x)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)

        still_masked = (x == mask_id)
        still_masked[:, :prompt_len] = False
        n_masked = int(still_masked.sum())
        if n_masked == 0:
            break

        # Spread remaining positions evenly over remaining steps
        k = max(1, n_masked // (num_steps - step))
        conf = conf.masked_fill(~still_masked, -1.0)
        top = conf.view(-1).topk(k).indices
        flat = x.view(-1)
        flat[top] = pred.view(-1)[top]

    return x

With this skeleton of barely a hundred lines you can experiment on a Shakespeare-scale dataset. Set num_steps equal to gen_len and it commits one token per step, much like sequential generation except that the order follows confidence rather than position; cut it to one eighth and you can watch the quality degrade with your own eyes. Turning the quality-speed dial yourself is half of understanding this topic.
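
To see what the dial does before training anything, the following sketch replays only the "spread evenly" rule from the generate function above and lists how many tokens get committed at each step. The helper name commits_per_step is an assumption of this sketch:

```python
def commits_per_step(gen_len, num_steps):
    """Tokens committed at each step under k = max(1, n_masked // steps_left)."""
    remaining, counts = gen_len, []
    for step in range(num_steps):
        if remaining == 0:
            break  # mirrors the early exit when nothing is masked
        k = max(1, remaining // (num_steps - step))
        counts.append(k)
        remaining -= k
    return counts


print(commits_per_step(32, 32))  # one token per step
print(commits_per_step(32, 4))   # eight tokens committed in parallel per step
print(commits_per_step(10, 4))   # uneven split when gen_len is not divisible
```

The fewer the steps, the more tokens are committed from a single forward pass without seeing each other, which is where the quality loss comes from.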

How It Differs from Image Diffusion

The two share the name diffusion, but the differences are large.

Aspect	Image diffusion	Text diffusion (masked)
Data space	Continuous (pixels, latents)	Discrete (tokens)
Noise	Add Gaussian noise	Replace tokens with a mask
Restoration target	Predict noise or original	Predict tokens at masked positions
Step count	Dozens of steps typical	A few to dozens of steps
Backbone	U-Net or DiT	Bidirectional-attention transformer
Evaluation	Plausible to the eye is enough	Grammar, factuality, coherence all required

The last row matters most. An image tolerates a few off pixels; in text, one wrong token can collapse a sentence. That is the fundamental reason text diffusion lagged image diffusion. One more difference: the backbone of a text diffusion model uses bidirectional attention with no causal mask. The BERT-like fill-in-the-blank structure dovetails naturally with diffusion generation.

Analyzing DiffusionGemma — What Was Released

Summarizing the official June 2026 announcement of DiffusionGemma:

Scale: 26B-parameter MoE (Mixture of Experts). Only a fraction of those parameters is active per token, which keeps inference cost down
License: Apache 2.0 — open to research and commercial use alike
Speed claim: up to 4x faster generation than same-class autoregressive models, with the largest gaps claimed on code generation and editing workloads
Method: masked diffusion plus block-wise generation, reportedly adapted from the existing Gemma family

Three points deserve attention.

First, combining MoE with diffusion. A diffusion model runs a forward pass over the whole sequence at every step, so each step costs more compute than an autoregressive decoding step. Using MoE to shrink active compute per token reads as a deliberate design to offset that cost.

Second, the adaptation strategy. Instead of pretraining a diffusion model from scratch, starting from autoregressive weights and continuing training with a diffusion objective slashes training cost. That the knowledge accumulated by autoregressive models transfers over is a consistent finding of recent work.

Third, the Apache 2.0 release itself. The serving ecosystem for diffusion LMs (optimized engines a la vLLM) is still immature; fully opening the weights looks like a strategy to get the community to build that infrastructure.

Quality vs Speed — The Real Trade-Off

Diffusion LM speed is not free. The core trade-off lives in the number of denoising steps.

Denoising steps	Speed	Quality tendency
Equal to token count	Similar to autoregressive	Autoregressive-level quality possible
One quarter of token count	About 4x faster	Close on many benchmarks
One sixteenth of token count	About 16x faster	Visible coherence loss, more repetition and contradiction

In other words, the 4x-faster claim should be read as 4x at a step-count setting where quality loss was pushed down to a tolerable level. The dial that trades quality for speed via step count is the very essence of diffusion LMs. That is a weakness, but also an operational flexibility: you can tune the dial per workload.

Another cost is restricted KV cache usage. With bidirectional attention, the context changes every time a token is committed, so the simple cache reuse of autoregression does not apply. Block-wise generation and approximate caching mitigate this, but being unable to reuse years of autoregressive serving optimizations as-is is a real handicap.

The Inference Cost Lens — Where the Gains Come From

From the GPU's point of view, the story becomes clearer.

Autoregressive decoding (batch 1):
  1 forward = 1 token
  bottleneck: memory bandwidth (weight streaming)
  GPU compute utilization: low

Diffusion denoising (batch 1):
  1 forward = updates a whole block (e.g. 32 tokens)
  bottleneck: compute (attention over the full sequence)
  GPU compute utilization: high

Autoregressive decoding is dominated by the time spent streaming weights from memory, leaving GPU compute capacity idle. Diffusion pours that idle compute into updating many tokens at once. Put differently, the speed advantage of diffusion LMs comes from converting wasted compute into latency reduction.
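
A roofline-style back-of-the-envelope estimate illustrates the argument. All hardware and model numbers below are hypothetical, and the sketch ignores attention and KV-cache traffic, so treat it as the shape of the trade-off rather than a prediction:

```python
def forward_time_s(weight_bytes, flops_per_token, tokens_per_forward,
                   bandwidth_bytes_s, peak_flops_s):
    """A forward takes as long as the slower of streaming the weights once
    and doing the arithmetic for all tokens processed in that forward."""
    memory_time = weight_bytes / bandwidth_bytes_s
    compute_time = flops_per_token * tokens_per_forward / peak_flops_s
    return max(memory_time, compute_time)


# Hypothetical numbers, chosen only for illustration.
weight_bytes = 8e9        # 8B active parameters at 1 byte each
flops_per_token = 16e9    # roughly 2 FLOPs per active parameter
bandwidth = 2e12          # bytes per second
peak = 4e14               # FLOPs per second

for k in (1, 4, 32, 256):
    t = forward_time_s(weight_bytes, flops_per_token, k, bandwidth, peak)
    print(f"{k:>3} tokens/forward: {t * 1e3:.2f} ms, {k / t:,.0f} tokens/s")
```

Under these assumptions a forward costs the same wall-clock time whether it updates 1 token or 32, because weight streaming dominates; only at much larger token counts does compute become the limit.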

The gains are therefore largest when batches are small and latency matters — interactive chat, code autocomplete. Conversely, for serving that already saturates the GPU with large batches (offline batch inference), the relative advantage shrinks: autoregression plus continuous batching is already extremely efficient on a throughput basis.

Suitable Workloads — Where to Use It

Structurally, certain tasks fit diffusion LMs especially well.

Workload	Fit	Reason
First-draft generation	High	Fast bulk generation followed by human polish
Text editing / rewriting	High	Bidirectional context edits the middle naturally
Code infilling (FIM)	Very high	Filling a gap while seeing code on both sides is a native fit
Structured output (JSON etc.)	High	Pin the schema skeleton, denoise only the values
Long-form reasoning (chain of thought)	Low	Stepwise causal development favors sequential generation
Precise long documents	Medium	Possible with more steps, but the speed advantage evaporates

The low-fit areas are equally clear. Tasks where earlier steps become premises for later ones — mathematical proofs, multi-step reasoning — are in fundamental tension with committing tokens in parallel. There is a reason reasoning-focused models still cling to autoregression.

Code infilling in particular is a task autoregressive models only awkwardly imitated with special tokens. For a bidirectional diffusion model it is inherently natural — which is why many expect the first killer use case for diffusion LMs to emerge in code tooling.
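
How an infilling request might be laid out for a masked diffusion model, as a sketch. The helper build_infill_input, the word-level tokens, and the fixed gap length are assumptions for illustration:

```python
MASK = "[M]"


def build_infill_input(prefix, suffix, gap_len):
    """Lay out prefix + masked gap + suffix and mark which positions may change."""
    tokens = list(prefix) + [MASK] * gap_len + list(suffix)
    editable = [False] * len(prefix) + [True] * gap_len + [False] * len(suffix)
    return tokens, editable


prefix = ["def", "add", "(", "a", ",", "b", ")", ":", "return"]
suffix = ["print", "(", "add", "(", "1", ",", "2", ")", ")"]
tokens, editable = build_infill_input(prefix, suffix, gap_len=3)
print(" ".join(tokens))
```

The denoising loop then only commits tokens where editable is true, while attention sees the code on both sides of the gap. Note that in this sketch the gap length has to be chosen up front.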

A Practical Adoption Guide — Getting It into Your Eval Pipeline

If you are evaluating diffusion LMs, this order is recommended.

Profile your own workload: quantify average output length, latency requirements, and batch-size distribution first. Short outputs, small batches, strict latency — that is diffusion-friendly terrain.
Build a proper baseline: the fair comparison target is an autoregressive model with speculative decoding and quantization already applied, not a vanilla implementation.
Quality gate first, speed second: find the minimum step count that passes your domain quality bar, then measure latency at that setting.
Grid-search block size and step count: these two hyperparameters define the quality-speed curve. You need the whole curve to pick an operating point.
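
The last two steps can be combined into a small selection routine, sketched below. The measurements are made up, and the names pick_operating_point and quality_bar are hypothetical; in practice the dictionary would be filled by your own eval and latency runs:

```python
def pick_operating_point(results, quality_bar):
    """results maps (block_size, num_steps) -> (quality, latency_ms).
    Return the lowest-latency config whose quality meets the bar, or None."""
    passing = {cfg: ql for cfg, ql in results.items() if ql[0] >= quality_bar}
    if not passing:
        return None
    return min(passing, key=lambda cfg: passing[cfg][1])


# Made-up measurements for illustration only.
results = {
    (32, 32): (0.82, 410.0),
    (32, 16): (0.81, 220.0),
    (32, 8): (0.78, 120.0),
    (64, 8): (0.71, 95.0),
}
print(pick_operating_point(results, quality_bar=0.80))
```

Filtering on quality first and minimizing latency second keeps a fast but failing configuration from ever being selected.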

The skeleton of a latency benchmark:

import statistics
import time

import torch


def benchmark_generation(generate_fn, prompts, n_warmup=3, n_runs=10):
    # Only synchronize when CUDA is present so the script also runs on CPU
    sync = torch.cuda.synchronize if torch.cuda.is_available() else (lambda: None)

    # Warmup (remove kernel compilation and cache effects)
    for p in prompts[:n_warmup]:
        generate_fn(p)

    records = []
    for p in prompts[:n_runs]:
        sync()  # finish pending GPU work before starting the clock
        t0 = time.perf_counter()
        out = generate_fn(p)
        sync()  # wait until generation has actually finished
        dt = time.perf_counter() - t0
        n_new = out.size(1) - p.size(1)  # newly generated tokens only
        records.append((dt, n_new / dt))

    lat = statistics.median(r[0] for r in records)
    tps = statistics.median(r[1] for r in records)
    print(f"median latency: {lat * 1000:.1f} ms")
    print(f"median tokens/sec: {tps:.1f}")

Avoid two common measurement mistakes. First, including the first call without warmup mixes compilation overhead into the numbers. Second, when comparing tokens per second across models with different tokenizers, the same text yields different token counts — comparing characters per second or task completion time is safer.
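
A tokenizer-independent metric takes only a few lines. This sketch assumes you have already collected the decoded text and wall-clock seconds per run; the function name chars_per_second is made up here:

```python
import statistics


def chars_per_second(samples):
    """samples: list of (generated_text, seconds). Median characters per second."""
    return statistics.median(len(text) / secs for text, secs in samples)


samples = [("abcd", 2.0), ("abcdefgh", 2.0), ("ab", 2.0)]
print(chars_per_second(samples))
```

Because it counts characters of decoded output, the number stays comparable across models whose tokenizers split the same text differently.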

Adoption Checklist

Are outputs short or medium length (within a few hundred tokens)?
Is the share of editing, infilling, and structured output high?
Does p95 latency tie directly to business metrics?
Do you already have a quality eval set and gate?
Can your team modify the serving stack itself (to compensate for the immature ecosystem)?

If you can answer yes to three or more of the five, a pilot is worth it.

The Hybrid Outlook — Combining Autoregression and Diffusion

The prevailing expectation is not that diffusion replaces autoregression, but that the two will merge. Several directions are already being explored.

Block-level autoregression + within-block diffusion: the semi-autoregressive structure above. As a compromise between causality and parallelism, it is becoming the de facto standard.
Diffusion drafter + autoregressive verifier: swap the draft model in speculative decoding for a diffusion model. Diffusion lays down a fast draft; the autoregressive model verifies and corrects.
Autoregressive reasoning, diffused prose: do the reasoning sequentially for accuracy, then generate the long final answer quickly with diffusion — a division of labor.
Bidirectional adaptation: one set of weights supporting both an autoregressive mode and a diffusion mode, with the generation mode selected at runtime per workload.
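
The drafter-plus-verifier idea can be sketched with a greedy acceptance rule. This is a simplification: production speculative decoding typically scores all draft positions in one verifier forward and may use a probabilistic acceptance test, and toy_verifier is a hypothetical stand-in for an autoregressive model:

```python
def verify_draft(verifier_next_token, prefix, draft):
    """Greedy verification: keep draft tokens while the verifier would have
    produced the same token, then substitute the verifier's own token."""
    accepted = []
    for tok in draft:
        expected = verifier_next_token(list(prefix) + accepted)
        if tok != expected:
            accepted.append(expected)  # correct the first disagreement and stop
            break
        accepted.append(tok)
    return accepted


def toy_verifier(tokens):
    """Hypothetical autoregressive model: always continues counting upward."""
    return tokens[-1] + 1


print(verify_draft(toy_verifier, [1, 2, 3], draft=[4, 5, 9, 10]))
```

The output always matches what the verifier alone would have generated greedily; the drafter only determines how many tokens are gained per verification round.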

That DiffusionGemma itself is closer to a block-wise hybrid than pure diffusion supports this trajectory.

A Critical View — Do Not Take the Benchmarks at Face Value

To be fair, a few words of caution are in order.

Baseline problems in speed comparisons: always check what the 4x is measured against. Comparing against an autoregressive model with speculative decoding, quantization, and optimized kernels yields a completely different conclusion than comparing against a vanilla implementation.
Selection bias in quality benchmarks: diffusion LM announcements tend to front-load benchmarks that favor parallel generation (short responses, code infilling). Numbers for weak areas — long-form coherence, multi-turn dialogue, precise instruction following — are often in the appendix.
Likelihood evaluation is hard: masked diffusion LMs do not give exact likelihoods, so perplexity comparisons rely on a variational bound (the ELBO), which only upper-bounds the true perplexity. Direct comparison with autoregressive numbers always warrants caution.
The ecosystem gap: autoregression has a massive optimization ecosystem — vLLM, speculative decoding, prefix caching. For the theoretical speed advantage of diffusion LMs to become a real production gap, comparable serving infrastructure must come first.
Insufficient validation at scale: public evidence that diffusion LMs ride the same scaling curves as autoregression at frontier scale is still limited. A 26B MoE is meaningful, but too early for verdicts.

Closing

Text diffusion models are best seen not as a replacement for autoregression but as a new dial that changes the cost structure of inference. They trade quality for speed via step count, excel at editing and infilling thanks to bidirectional context, and convert wasted GPU compute into latency reduction. The Apache 2.0 release of DiffusionGemma will significantly accelerate ecosystem formation in this direction.

In the history of technology, when a slow-but-proven approach competes with a fast-but-novel one, the winner has usually been a compromise that combines the strengths of both. Text generation will very likely repeat the same pattern.

Still, read every flashy benchmark number together with its conditions. The autoregressive camp is not standing still, and the most likely future is not victory for either side but per-workload division of labor and hybrids. If you are building code autocomplete or document editing tools, now is the right time to put a diffusion LM into your experiment pipeline.

References

DiffusionGemma official announcement: https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/
DiffusionGemma on GeekNews: https://news.hada.io/topic?id=30386
Gemini Diffusion page: https://deepmind.google/models/gemini-diffusion/
LLaDA (Large Language Diffusion Models) paper: https://arxiv.org/abs/2502.09992
D3PM (foundations of discrete diffusion): https://arxiv.org/abs/2107.03006
Simplified masked diffusion (MDLM): https://arxiv.org/abs/2406.07524
Block Diffusion paper: https://arxiv.org/abs/2503.09573
Dream 7B (diffusion LM adaptation): https://arxiv.org/abs/2508.15487
Diffusion LM discussions on Hacker News: https://news.ycombinator.com/
vLLM documentation (the autoregressive serving baseline): https://docs.vllm.ai/