- 1. The Fundamental Bottleneck of LLM Inference
- 2. Speculative Decoding Principles
- 3. Advanced Techniques
- 4. Using Speculative Decoding in vLLM
- 5. Choosing the Optimal K Value
- 6. Practical Considerations
- 7. Quiz
1. The Fundamental Bottleneck of LLM Inference
Autoregressive decoding in LLMs is inherently serial:
Token 1 generation → Token 2 generation → Token 3 generation → ...
↓ ↓ ↓
Full model Full model Full model
Forward Pass Forward Pass Forward Pass
Each token generation requires a full forward pass through a 70B model, and this process is Memory-Bandwidth Bound. GPU compute capacity is underutilized while memory bandwidth becomes the bottleneck.
1.1 Arithmetic Intensity Analysis
70B model, FP16:
- Model size: ~140GB
- 1 token generation: 140GB memory read
- A100 80GB memory bandwidth: 2TB/s
- Theoretical maximum: 2000/140 ≈ 14 tokens/s
In practice ~10 tokens/s due to KV Cache access, etc.
→ GPU compute utilization: 1-2%
Key insight: Whether generating 1 token or K tokens, the cost of reading model weights is the same. Processing multiple tokens per read improves efficiency.
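The bandwidth arithmetic above can be sanity-checked in a few lines (the ~140 GB and ~2 TB/s figures are the approximate values quoted above, not measured numbers):

```python
# Back-of-the-envelope decode throughput for a memory-bandwidth-bound model.
def max_tokens_per_sec(model_bytes: float, bandwidth_bytes_per_sec: float) -> float:
    """Every generated token streams all weights from HBM once,
    so throughput is bounded by bandwidth / model size."""
    return bandwidth_bytes_per_sec / model_bytes

GB = 1e9
model_size = 70e9 * 2        # 70B params x 2 bytes (FP16) ~ 140 GB
a100_bandwidth = 2000 * GB   # ~2 TB/s HBM bandwidth

print(f"{max_tokens_per_sec(model_size, a100_bandwidth):.1f} tokens/s")  # ~14
```

Real systems land below this bound because of KV Cache reads, kernel launch overhead, and imperfect bandwidth utilization.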
2. Speculative Decoding Principles
2.1 Core Idea
A small Draft model quickly proposes K tokens, and a large Target model verifies all K simultaneously in a single forward pass:
Draft Model (1B): t1 → t2 → t3 → t4 → t5 (quickly propose 5)
↓ ↓ ↓ ↓ ↓
Target Model (70B): ✅ ✅ ✅ ❌ - (verify at once)
↓
t4' resampled (replacement after rejection)
Result: [t1, t2, t3, t4'] → 4 tokens from 1 forward pass of 70B!
2.2 Mathematical Guarantee: Rejection Sampling
The core of Speculative Decoding is the mathematical guarantee that the output distribution is exactly identical to the Target model's distribution.
Given the Draft model distribution $q(x)$ and the Target model distribution $p(x)$:

Acceptance probability: $\min\left(1, \dfrac{p(x)}{q(x)}\right)$

Correction distribution on rejection: $p'(x) = \mathrm{norm}\left(\max(0,\ p(x) - q(x))\right)$

After this process, the final output distribution is exactly $p(x)$.
```python
import torch

def speculative_decode(draft_model, target_model, input_ids, K=5):
    """Core Speculative Decoding algorithm"""
    # 1) Generate K tokens with the Draft model
    draft_tokens = []
    draft_probs = []
    current = input_ids.clone()
    for _ in range(K):
        logits = draft_model(current).logits[:, -1]
        probs = torch.softmax(logits, dim=-1)
        token = torch.multinomial(probs, 1)
        draft_tokens.append(token)
        draft_probs.append(probs.gather(-1, token))
        current = torch.cat([current, token], dim=-1)

    # 2) Verify all K tokens at once with the Target model
    all_tokens = torch.cat([input_ids] + draft_tokens, dim=-1)
    target_logits = target_model(all_tokens).logits

    # 3) Rejection Sampling
    accepted = []
    n = input_ids.shape[-1]
    for i in range(K):
        target_prob = torch.softmax(target_logits[:, n + i - 1], dim=-1)
        p_target = target_prob.gather(-1, draft_tokens[i])
        q_draft = draft_probs[i]
        # Accept with probability min(1, p/q)
        accept_prob = torch.min(torch.ones_like(p_target), p_target / q_draft)
        if torch.rand(1) < accept_prob:
            accepted.append(draft_tokens[i])
        else:
            # Rejection: sample a replacement token from the
            # correction distribution norm(max(0, p - q))
            q_full = torch.softmax(
                draft_model(all_tokens[:, :n + i]).logits[:, -1], dim=-1
            )
            residual = torch.clamp(target_prob - q_full, min=0)
            residual = residual / residual.sum(dim=-1, keepdim=True)
            accepted.append(torch.multinomial(residual, 1))
            break
    else:
        # Bonus token when all K are accepted
        bonus = torch.multinomial(
            torch.softmax(target_logits[:, n + K - 1], dim=-1), 1
        )
        accepted.append(bonus)
    return torch.cat(accepted, dim=-1)
```
2.3 Acceptance Rate and Speed Improvement
With acceptance rate $\alpha$, the expected number of tokens generated per Target forward pass is:

$E[\text{tokens}] = \dfrac{1 - \alpha^{K+1}}{1 - \alpha}$
| Draft-Target Pair | Acceptance Rate α | K=5 Avg Tokens | Speedup |
|---|---|---|---|
| GPT-2 → GPT-4 | 0.4 | 1.6 | 1.3x |
| Llama-68M → Llama-70B | 0.7 | 2.8 | 2.3x |
| Llama-1B → Llama-70B | 0.8 | 3.6 | 2.8x |
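The "Avg Tokens" column follows the closed form $E[\text{tokens}] = (1 - \alpha^{K+1})/(1 - \alpha)$; a quick numerical check (the table entries are rounded approximations, so they sit slightly below the exact values):

```python
# Expected tokens generated per Target forward pass, given acceptance rate alpha.
def expected_tokens(alpha: float, k: int) -> float:
    """Closed form (1 - alpha^(K+1)) / (1 - alpha) for K speculative tokens."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.4, 0.7, 0.8):
    print(f"alpha={alpha}: {expected_tokens(alpha, 5):.2f} tokens per verify step")
```

The takeaway matches the table: a higher acceptance rate α compounds across the window, so a stronger Draft model yields disproportionately more accepted tokens per Target pass.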
3. Advanced Techniques
3.1 Self-Speculative Decoding (Without a Draft Model)
Leverages the Target model's own Early Exit or Layer Skipping without a separate Draft model:
```python
import torch.nn as nn

# Layer Skip approach
class SelfSpeculativeModel(nn.Module):
    def draft_forward(self, x):
        """Fast draft using only the first 8 layers"""
        for layer in self.layers[:8]:
            x = layer(x)
        return self.lm_head(self.norm(x))

    def verify_forward(self, x):
        """Verification with all layers"""
        for layer in self.layers:
            x = layer(x)
        return self.lm_head(self.norm(x))
```
Advantages: No need to load a separate Draft model, saves memory
3.2 Medusa: Multi-Head Speculative Decoding
Instead of a Draft model, adds multiple LM Heads to simultaneously predict tokens at multiple positions:
Input → Hidden States → Target LM Head → t[n+1]
                      → Medusa Head 1  → t[n+2] (prediction)
                      → Medusa Head 2  → t[n+3] (prediction)
                      → Medusa Head 3  → t[n+4] (prediction)
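The extra heads can be sketched as small projection stacks on top of the shared hidden state. This is an illustrative simplification of the Medusa design (layer sizes, activation, and the `MedusaHeads` name are assumptions, not the reference implementation):

```python
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    """Illustrative Medusa-style heads: each head predicts the token one
    step further ahead from the same last-position hidden state, so a
    single Target forward pass yields several candidate tokens."""
    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 3):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, hidden_size),  # per-head projection
                nn.SiLU(),
                nn.Linear(hidden_size, vocab_size),   # per-head LM head
            )
            for _ in range(num_heads)
        ])

    def forward(self, hidden: torch.Tensor) -> list:
        # hidden: [batch, hidden_size] from the last position
        return [head(hidden) for head in self.heads]  # one logits tensor per future position

heads = MedusaHeads(hidden_size=64, vocab_size=100)
logits = heads(torch.randn(2, 64))
print(len(logits), logits[0].shape)
```

The candidate tokens from these heads are then verified against the Target LM head's own distribution, just like Draft-model proposals in standard Speculative Decoding.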
3.3 Apple Mirror Speculative Decoding (2026)
Apple's latest research (2026.01). Addresses the serial verification bottleneck of existing Speculative Decoding:
- Mirror Model: A lightweight version of the Target model that performs Draft and Verify simultaneously
- Existing: Draft → Verify → Draft → Verify (serial)
- Mirror: Draft₁ + Verify₀ → Draft₂ + Verify₁ → ... (pipelined)
4. Using Speculative Decoding in vLLM
4.1 Configuration
```python
from vllm import LLM, SamplingParams

# Specify Draft model
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",
    num_speculative_tokens=5,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.9,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain quantum computing:"], params)
```
4.2 Benchmark Script
```bash
# Standard decoding
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4

# Speculative Decoding
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --speculative-model meta-llama/Llama-3.2-1B-Instruct \
    --num-speculative-tokens 5 \
    --tensor-parallel-size 4
```
4.3 Using with TensorRT-LLM
```python
import tensorrt_llm
from tensorrt_llm import BuildConfig

# Build Draft and Target models simultaneously
build_config = BuildConfig(
    max_batch_size=8,
    max_input_len=2048,
    max_seq_len=4096,
    speculative_decoding_mode="draft_tokens_external",
    max_draft_len=5,
)
```
5. Choosing the Optimal K Value
```python
import time

def find_optimal_k(draft_model, target_model, test_prompts, k_range=range(1, 11)):
    """Search for the optimal number of speculative tokens"""
    results = {}
    for k in k_range:
        start = time.time()
        total_tokens = 0
        for prompt in test_prompts:
            output = speculative_generate(  # any K-step speculative generation loop
                draft_model, target_model, prompt,
                num_speculative_tokens=k, max_tokens=256
            )
            total_tokens += len(output)
        elapsed = time.time() - start
        throughput = total_tokens / elapsed
        results[k] = throughput
        print(f"K={k}: {throughput:.1f} tokens/s")
    optimal_k = max(results, key=results.get)
    print(f"\nOptimal K = {optimal_k} ({results[optimal_k]:.1f} tokens/s)")
    return optimal_k
```
General guidelines:
- Stronger Draft (higher acceptance rate): Use larger K (7-10)
- Weaker Draft: Use smaller K (3-5)
- Code generation: K=5-7 (high acceptance rate due to repetitive patterns)
- Creative text: K=3-4 (lower acceptance rate due to high diversity)
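If you have estimates of the acceptance rate α and the Draft/Target cost ratio, K can also be chosen analytically before running an empirical sweep. This is a simplified cost model (the `cost_ratio` parameter and the helper names are assumptions for illustration; real deployments should still be benchmarked as above):

```python
# Simplified analytic model for choosing K from acceptance rate and draft cost.
def expected_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Expected tokens per verify step, divided by the relative cost of
    K draft passes plus one target pass (cost_ratio = draft/target cost)."""
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    return expected_tokens / (k * cost_ratio + 1)

def best_k(alpha: float, cost_ratio: float, k_max: int = 10) -> int:
    """Pick the K in [1, k_max] that maximizes the modeled speedup."""
    return max(range(1, k_max + 1),
               key=lambda k: expected_speedup(alpha, k, cost_ratio))

print(best_k(alpha=0.8, cost_ratio=0.02))  # strong draft favors larger K
print(best_k(alpha=0.4, cost_ratio=0.02))  # weak draft favors smaller K
```

The model reproduces the guideline above: as α falls, the marginal draft token is more likely to be rejected, so the optimum shifts toward smaller K.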
6. Practical Considerations
6.1 Draft Model Selection Criteria
- Same tokenizer: Different tokenizers cause token alignment issues
- Same family: Llama-1B → Llama-70B (same training data, high acceptance rate)
- Appropriate size ratio: 1/50 to 1/10 of Target (too large increases Draft overhead)
- Fast inference: Low latency is critical for the Draft model
6.2 Caveats in Batch Environments
Speculative Decoding becomes less effective as batch size increases:
- Synchronization issues due to varying acceptance rates across requests in a batch
- Additional computation is burdensome in already compute-bound batches
- Better suited for latency optimization than throughput optimization
7. Quiz
Q1. Why doesn't Speculative Decoding change the output distribution?
Thanks to Rejection Sampling. By accepting each draft token with probability $\min(1, p(x)/q(x))$ and resampling from the correction distribution $\mathrm{norm}(\max(0, p(x) - q(x)))$ on rejection, the final output distribution exactly equals the Target distribution $p(x)$.
Q2. Why is LLM inference Memory-Bandwidth Bound?
Generating one token requires reading all model weights from memory, but the actual computation (FLOPs) is small. Memory bandwidth is the bottleneck relative to GPU compute capacity. 70B FP16 = 140GB must be read for every token.
Q3. What are the pros and cons of Self-Speculative Decoding?
Pros: No separate Draft model needed, saves memory. Cons: Since only a subset of the Target model's layers is used, acceptance rate may be lower than with a dedicated Draft model.
Q4. Why is a K value that's too large inefficient?
The probability that the i-th draft token is even reached and accepted falls off geometrically (roughly $\alpha^i$), so later tokens in the speculation window are increasingly likely to be discarded. Since the cost of K forward passes through the Draft model is always incurred, Draft compute is wasted on rejected tokens.
Q5. Why does Speculative Decoding become less effective with large batch sizes?
(1) Synchronization overhead due to varying acceptance rates within a batch (2) Large batches are already compute-bound so GPU utilization is already high (3) Increased memory overhead from managing draft tokens.
Q6. How does the Medusa approach differ from standard Speculative Decoding?
Instead of a separate Draft model, multiple LM Heads are added to the Target model to simultaneously predict tokens at multiple positions in a single forward pass. No additional model loading required.